This is minimal documentation on the Clusters software package. This package performs various analyses generally aimed at detecting clusters of sequences or patterns of overrepresented tuples (Nmers). Some of the software described here is highly experimental and may contain serious bugs. Those programs whose description below reads "description" have not yet been ported from OpenVMS to Solaris. Caveat emptor!
Most of these programs are invoked by typing:
Most of these programs will not write over an existing file. Attempts to do so will result in a message similar to this one:
program_name.subroutine: [1018] can't find 'old' file logical unit 10, named 'file_that_already_exists.txt' Abort (core dumped)
The file types which are used by these programs are:
| nmer lists | A list of Nmers of a fixed size and their positions in a sequence |
| sorted nmer lists | An Nmer list which has been sorted |
| range lists | A list of start/end positions for named ranges |
The clusters programs are as listed below.
| a_bin_dist | Prompt | Poisson analysis of a sorted Nmer list |
| a_spacing | Prompt | Histogram distances between successive NMERs in a sorted nmer list |
| a_vs_a | Prompt | Calculates the clustering values within a sorted nmer list using an heuristic methoc |
| a_vs_a2 | Prompt | Calculates the clustering values within a sorted nmer list in terms of "closeness" and significance |
| a_vs_a3 | Prompt | Dotplots through GCG graphics Nmers from a list which fall within ranges. Can do genomic scale dotplots. |
| a_vs_a4 | Prompt | Histogram distances between all Nmers in a sorted nmer list |
| a_vs_a5 | Prompt | Search for evidence of positional phasing in a sorted nmer list |
| a_vs_b | Prompt | Description |
| a_vs_b2 | Prompt | Description |
| a_vs_b3 | Prompt | Description |
| a_vs_range | Prompt | Counts Nmers falling in bins in a sorted range list |
| a_vs_range2 | Prompt | Probability of entries in a sorted nmer list falling in bins in a sorted range list |
| col_to_gcg | Prompt | Convert a table formatted data to a GCG enzyme file |
| complement_nmer | Prompt | Complements nmers in a fixed position(ie, start at column 1, size 6) |
| cores | GCG | Searches for patterns by symmetry, such QQQQnnnnQQQQ, where QQQQ can be any 4mer |
| dna_run_stats | GCG | Analyze the copy number coverage of a sequence and how it would have shotgun assembled |
| filter_content | Prompt | Selects lines from files based on tests of two Nmers for rotamer, conjugate, simple, and 100% AT. |
| filter_nmerlist | Prompt | Filter from a Sorted Nmer List entries that are too close together or are in specific positions |
| filter_nmerlist2 | Prompt | Filter sorted Nmer lists leaving only entries whose relative positions satisfy specified criteria. |
| find_close | Prompt | Extract rows from a file where a column is numerically close |
| from_binary_fosn | Prompt | Description |
| histo_column | Prompt | Generate a histogram from a column of numbers |
| histo_column_binary | Prompt | Description |
| least_nmer | Prompt | Retain in a list an Nmers only if it is lexically <= its complement. |
| make_binary_fosn | Prompt | Description |
| make_nmer_list | GCG | Converts one or more GCG formatted file into an NMER list |
| make_nmer_list_binary | GCG | Description |
| make_nmer_profile | Prompt | Description |
| make_random_dna | GCG | Use Markov Chain methods to make DNA with statistical properties similar to the input |
| make_ranges | Prompt | Converts a file of (name,begin,end) triplets into a Range_List. |
| markovtest | GCG | Calculate probability of subsequences being found within a set of sequences |
| nmer | GCG | Count tuples (Nmers) within a DNA or protein sequence. Probability estimates. |
| nfindpatterns | GCG | Variant of GCG findpatterns that outputs hits in a format that can be read by the CLUSTERS programs. |
| no_holes | Prompt | Fill in missing records in a sorted NMER list |
| occupation | Prompt | Calculates the length of a sequence or sequences that is/are covered by a defined set of patterns |
| report_usage | Prompt | Analyzes copy number information from a sorted Nmer list derived from a single sequence |
| sum_nmer_list | Prompt | Process a sorted nmer list to yield the number of times each tuple was observed |
| test_against_nmer_profile | Prompt | Description |
| test_shotgun | Prompt | Simulates shotgun sequencing projects and generates cost estimates |
| unique_nmer | Prompt | Converts Nmers to a unique form, eliminating conjugates and rotamers |
| zero_palindrome_nmer | Prompt | Detects the "higher" complement form in an Nmer list and overwrites its data field |
On VMS you will also use: SORT SEARCH On Unix you will also use: sort grep
table of contentsPerforms a Poisson analysis starting from a sorted Nmer list. It finds first the bin (you specify size) that contains 6 sites, nulls those out, finds 5 sites, etc. down to 2. It then estimates the number of 1 site bins as
Histogram distances between SUCCESSIVE NMERs in a sorted nmer list. ( Compare to A_VS_A4, which histograms distances between ALL NMERs in such a list.) The histogram always has 11 bins. The first 10 are of a user specified width, the 11th is for everything larger than that. Output looks like:
Sequence Count Chi2: 2 4 6 8 10 12 14 16 18 20 > 20 AAGTT 6 13.7 2 0 1 0 0 0 0 0 0 0 2 Columns are: 1. Nmer sequence 2. Number of Nmers in calculation 3. Chi**2 value over histogram, first 10 bins ONLY!!! Where EXPECTED is the average of the first 10 bins. (This is intended to detect peaks, not compare to a Poisson distribution.) 4-13 10 histogram bins. First line shows highest value in the bin 14 11th histogram bin = all values higher than biggest bin.table of contents
Calculates the occupancy of up to 14 ranges from a sorted Nmer_list and a Ranges_LIST. First it reads in the ranges, then it processes each type of NMER from the NMER_LIST. If an Nmer falls within a range the count for that range is incremented. The KMIN parameter can be set by the user so that only output lines for NMERS with >= KMIN ranges occupied are printed. The format for a RANGES_LIST is more or less like that for an NMER_LIST except that each range has only two entries (begin,end) of range. Range names can be up to 8 characters long - longer ones are truncated automatically. Do not forget to put an extra space on each line after the name. Note that ranges do not interact with each other, so they can be overlapping or in arbitrary order - they are each scored independently.
a_vs_range queries are: Input the name of Sorted Nmer file to process Input the name of the Sorted range file to process Input the name of the output file Please specify the minimum number of ranges which must be occupied for an output line. The value must be between 0 and 3 Example: cat > dmwhite_3.ranges <<EOD 000001460 part1 000004460 part1 000004461 part2 000008460 part2 000008461 part3 000014245 part3 EOD a_vs_range <<EOD dmwhite.sorted dmwhite_3.ranges dmwhite_a_vs_range.txt 3 EOD head -3 dmwhite_a_vs_range.txt Nmer Tested part1 part2 part3 AAGTT 53 13 20 13 AATTT 80 27 15 31 The output consists of 12 spaces Nmer name 8 spaces The total number of Nmers tested 8 spaces The number of Nmers in range 1 8 spaces The number of Nmers in range 2, etc.table of contents
This program reads a SORTED NMER_LIST and a sorted range list both in the same format (except that the range list does not have .NMER. and .STEP. fields.) It sees how many of each type of NMER fall into a named range, then calculates the probability of that having happened by chance (Poisson, using expected given target size of range, vs. observed in target.)
Ranges may be made up of multiple pieces, ie a list of exons or whatever. The number of nmers that fall within any named range is calculated, and then from that the probability of having that many vs the expected. Ie, if the full sequence is length L, and the range is length R (even though it may be in many pieces), and N nmers are tested, where M fall in a range. Then the expected number is (R/L)*N and the observed is M. A cumulative Poisson is applied to calculate the probabilities of this.
a_vs_range2 queries are:
Input the name of Sorted Nmer file to process
Input the name of the Sorted range file to process
Input the name of the output file
You may specify a single probability test to exclude
some output lines. Only Nmer/Range combinations
that pass the test will be output
0 Obs. > Exp., P<= Pmin
1 Obs. < Exp., P<= Pmin
2 Obs. > Exp. OR Obs. < Exp., P<= Pmin
Pmin must be in this range:
0.0 <= Pmin <= 1.0
In order to calculate probabilities the program
needs to know the total length of the sequence(s)
from which the Nmers and ranges were drawn.
Example:
cat > dmwhite_3.ranges <<EOD
000001460 part1
000004460 part1
000004461 part2
000008460 part2
000008461 part3
000014245 part3
EOD
a_vs_range2 <<EOD
dmwhite.sorted
dmwhite_3.ranges
dmwhite_a_vs_range2.txt
0
.01
14245
EOD
cat dmwhite_a_vs_range2.txt
Nmer Range Test Hits Exp lP:O>E lP:O<E [lP=-log10(P)]
AATTT part1 80 27 16.854 2.100 0.003
CGCGT part1 11 6 2.317 2.012 0.004
GCCCG part3 24 18 9.747 2.255 0.002
GCCTT part1 18 9 3.792 2.243 0.002
GTACT part2 12 10 3.370 3.123 0.000
GTTTG part2 32 18 8.986 2.622 0.001
TCCGG part3 26 19 10.559 2.212 0.003
TGGCC part3 46 29 18.681 2.018 0.004
TTCGA part2 31 16 8.705 2.085 0.004
Limitations:
No more than 100 named ranges per run
No more than 10000 total subranges per run (except
where they fold down to one, as described next)
Note, if a series of subranges for a particular name
are specified so that
subrange(i-1).end = subrange(i).begin
then all of these subranges will be compressed down into
a single subrange. This can be used with a series of points
that can go into a_vs_b2, for instance, and still work in
this program.
table of contents
This program calculates statistics for the 1,2, and 3 closest neighbors from a Sorted Nmer List.
A_VS_A queries are: Input the name of the file to process Input the name of the output file Also process non Nmer marked locations Y/N [N] You may specify a minimum allowed distance between sites This distance will be NMER + value A value between -NMER+1 and 0 allows some overlap A value greater than 0 allows no overlap A value less than or equal to -NMER is nonsense Example: make_nmer_list -infile=GB_IN:DMWHITE -nmer=5 -out=killme.txt -default sort -t\n -k0.10 -k0.0,0.9 killme.txt > killme2.txt a_vs_a <<EOD killme2.txt dmwhite_a_vs_a.txt N 5 EOD head -7 dmwhite_a_vs_a.txt Sequence *N* Count Mean Std.Dev. Sum d Sum d*d RMS Mean AAGTT *1* 53 126 144. 6696 1943854. 26. AAGTT *2* 53 346 361. 18352 13238560. 69. AAGTT *3* 53 506 419. 26840 22863394. 90. AATTT *1* 69 127 112. 8776 1972120. 20. AATTT *2* 69 283 195. 19585 8157917. 41. AATTT *3* 69 436 252. 30107 17511569. 61.This calculates the clustering values for a sorted NMER_LIST. The numbers in the output are left to right:
Seq Up to 12 bases of sequence (or tag names) *n* distance to 1,2, or 3 closest Nmer of the same type N number of such distances measured Mean Mean of closest *n* distances for that Nmer SD Standard deviation of this value Sum of *n* distances (raw data from which Mean was calculated) Sum**2 ditto RMS Mean (Sqrt of Sum**2)/N
There are four kinds of output lines, marked with *N*,*1*,*2*,*3*. The first is a header line, the rest are the *n* distances. If you run HISTO_COLUMN on this file exclude all but the *n* that you are interested in.
table of contentsReads a sorted NMER_LIST and calculates from it a value and a significance for the "incremental" minimum distances observed. The method is essentially this:
1. Let there be N points in a length of L.
2. For i = 1->N
A. d = measure the minimum distance from i to another point
B. calculate the probability p(i) of a minimum distance <=d given
the N-1 points and random placement of the point i. This
is relatively straightforward. Just break up all the
intervals and stack them up like sticks. Since the
probability of being at <= distance is just = distance
on each stick, the integral is simple.
3. P mean = sum i = 1->N of p(i) / N.
Step 3B should result in a set of p(i) values that are
uniformly distributed in the range 0<->1. Now consider the case of
dropping a point X into the interval 0<->1 at random, all points
are equally likely, so the expected value is:
integral (0->1) of xdx = 1/2.
The variance of this is just:
integral (0>1) of (x-1/2)(x-1/2)dx = 1/12
Since each point "i" is independent, the std. dev. of the mean
for N points is just sqrt( 1 / (N*12) ). There is a program
testflat.for that should be in this directory which demonstrates this.
This provides a handy test for the significance of Pmean, which is
just (Pmean -1/2)/stdev.
One complication. Because the same distribution (more or less) is
used for each point's substitution each distance is essentially tested
twice against the same distribution. This has the effect of making
the observed Variance sqrt (1/6N) rather than sqrt(1/12N).
The nice thing about this one is that it
both calculates a "closeness" number (the average probability of
observing each Nmer as close as it is to the others) and a "significance"
number (observed - 1/2 / standard-deviation). The Standard deviation
is approximated (to about 10%) by sqrt(1/6N), where N is the number of
positions tested for each Nmer.
A_VS_A2 queries are:
Input the name of the file to process
Input the name of the output file
Also process non Nmer marked locations Y/N [N]
Please enter the length of the full sequence
You may specify a minimum allowed distance between sites
This distance will be NMER + value
A value between -NMER+1 and 0 allows some overlap
A value greater than 0 allows no overlap
A value less than or equal to -NMER is nonsense
Example:
make_nmer_list -infile=GB_IN:DMWHITE -nmer=5 -out=killme.txt -default
sort -t\n -k0.10 -k0.0,0.9 killme.txt > killme2.txt
a_vs_a2 <<EOD
killme2.txt
dmwhite_a_vs_a2.txt
N
14245
1
EOD
head -3 dmwhite_a_vs_a2.txt
Sequence N >Minover PMean #variances
AAGTT 53 53 .4152254 1.5117
AATTT 80 70 .5868977 1.7809
The output columns are:
Sequence Up to 12 bases of sequence (or tag names)
N number of sites present
>Minover number of sites tested (some are rejected if they are too close)
PMean mean probability of each site being as close as observed to the
others.
#variances |PMeanP -1/2|/standard-deviation
table of contents
This does dotplots based on sorted Nmer lists. It takes as input a sorted numer list, and a ranges file (for format see A_VS_RANGE). It outputs .FIGURE files which are dotplots of all nmers that are in one range (SELECT) vs. all other ranges (and itself). That is, for each NMER type, all instances are plotted. There can be up to 20 ranges, which may overlap. The primary advantage of this program over other DOTPLOT programs is that one can generate sorted lists using tools like NFINDPATTERNS and FILTERPATTERNS and then generate dotplots based on that. It can also make a dotplot of very large sequences, which cannot be done with COMPARE. For instance, one could search a sequence for a set of complicated patterns which correspond to some DNA motif, and then DotPlot those. This may be useful in picking out gene structures where significant sequence divergence has occurred throughout, but some fixed set of motifs has been conserved.
The program can be CPU intensive, run it from within a .COM file in a batch queue or with reduced priority. On VMS use:
To view the output file, use the GCG FIGURE program after first configuring your GCG grapics environment. (SETPLOT,SHOWPLOT,POSTSCRIPT,TEKTRONIX, etc.)
a_vs_a3 queries: Input the name of the file to process The output files have names: PREFIX_NN_MM.FIGURE,Input the PREFIX now Please enter the length of the full sequence Input the name of the Sorted range file to process There are N ranges One of these must be the reference range Please enter a number between 1 and N: NMER lists may contain entries which begin with a space. Options: 1 Include these in the processing 0 Do not include these Set the diagonal filter, which suppresses dots within N spaces of each other. Options: N Only include points greater than N spaces apart in the sequence 0 Include even self-self points on the diagonal The dotplot may be numbered either from absolute coordinates or from 1, where 1 corresponds to the first base of each range. Options: 1 Number each range from 1 0 (or any number but 1) number with absolute coordinates Example: (Note that the spaces after the names in the ranges file are mandatory! In this example the resulting files are sent to a postscript printer.) cat > dmwhite_3.ranges <<EOD 000001460 part1 000004460 part1 000004461 part2 000008460 part2 000008461 part3 000014245 part3 EOD a_vs_a3 <<EOD dmwhite.sorted dmwhite 14245 dmwhite_3.ranges 1 0 0 1 EOD postscript laserwriter "|lpr -Psomeprinter" figure dmwhite_part1_part2.figure figure dmwhite_part1_part3.figuretable of contents
Histogram distances between NMERs in a sorted nmer list. The histogram always has 11 bins. The first 10 are of a user specified width, the 11th is for everything larger than that. Output looks like:
Sequence Count Chi2: 2 4 6 8 10 12 14 16 18 20 > 20 AAGTT 12 27.0 0 0 3 0 6 0 0 3 0 3 0 Columns are: 1. Nmer sequence 2. Number of Nmers in calculation 3. Chi**2 value over histogram, first 10 bins ONLY!!! Where EXPECTED is the average of the first 10 bins. (This is intended to detect peaks, not compare to a Poisson distribution.) 4-13 10 histogram bins. First line shows maximum value for that bin 14 11th histogram bin = all values higher than biggest bin. a_vs_a4 queries: Input the name of the file to process Input the name of the output file Also process non Nmer marked locations Y/N [N] You may specify a minimum allowed distance between sites This distance will be NMER + value A value between -NMER+1 and 0 allows some overlap A value greater than 0 allows no overlap A value less than or equal to -NMER is nonsense There are always 10 histogram bins. Each bin is WIDTH wide, where WIDTH is greater than or equal to 1 Example: a_vs_a4 <<EOD dmwhite.sorted dmwhite_a_vs_a4.txt N 1 10 EOD cat dmwhite_a_vs_a4.txt head -2 dmwhite_a_vs_a4.txt</B> Sequence Count Chi2: 6- 15 25 35 45 55 65 75 85 95 105 > AAGTT 53 11.4 1 6 1 2 5 1 3 1 2 3 1353table of contents
UNTESTED, UNFINISHED, DO NOT USE YET Examine a sorted nmer list. For each position j=1->min(step size,jmax-window+1) a score is calculated over a window which is the sum of position(i)*sin(i*period + phase), for i=1->Window size. For each nmer the highest score is returned.
table of contents
This calculates clustering values for each (A,B) pair of Nmers from a
sorted NMER_LIST. It uses essentially the same method as A_vs_A2.
Briefly, for each Nmer in A, find the closest Nmer in B, that distance is
d. Find the fraction of space in the test sequence that is d or closer,
that is p. Average the resulting p values. It can be shown in closed
mathematical form that Pmean expected is 1/2, and the standard deviation of
the mean is SQRT(1/12*N), N = number in list A. (The number in list B
doesn't change this calculation!) If A is
The program has two modes. In the first it just compares all A vs all B. This can easily take several hours. In the second mode, you can add a set of known sites (Promoters, enhancers) etc. to the sorted Nmer list. Note that several of each might exist in a sequence, and each type should have the same name. These entries look like:
9000 Enhancer
10005 Enhancer
12345 Enhancer
^ ^ Mandatory spaces, DON'T FORGET THE TRAILING SPACE!!!
^ Position in the sequence
Be sure to sort the list again, so that these are up at the front or the program will fail mysteriously. Mode 2 asks, "is this Nmer associated with this site, and coversely, is this site associated with this Nmer". The second part will only tell you something if you have several instances of the feature.
When you run A_VS_B2 it will create a .COM file which you should then submit to a batch queue for the actual run. Mode 1 runs typically take only a few minutes, but Mode 2 runs can take many hours. ONLY ONE RUN AT A TIME - THE PROGRAM CREATEs A SCRATCH FILE AND MAY GET CONFUSED OTHERWISE.
The output looks like:
Sequence A Sequence B Number USED in A = N Number in B Mean P significance (|MeanP-1/2|/sqrt(1/12N)table of contents
This calculates clustering values for each (A,B) pair of Nmers from a sorted NMER_LIST. It uses a form of integration. For each Nmer of type A it constructs either a hat function or a triangle function whose width you specify. It then counts up (weighted) the number of Nmers of type B within that function. User specifies:
Type of function: hat or triangle Size of function: an integer, like 1000 Criteria for a score to be saved: Minimum score: result of hat/triangle function Maximum score: " Minium number of pairs scoredtable of contents
This is an offshoot of NMER. It allows arbitrary "core" sequences in fixed geometries to be searched for. It is VERY hard to use, be sure that you test your core pattern on a positive control before you run it. To use it, first construct a pattern file which should look like this:
*A,8,0 <-core "A" size 8, offset 0
*B,9,10 <-core "B" size 9, offset 10
-A,1,20 <- inverted A, mismatch allowed = 1, offset 20
+AB,1,-10 <- direct A or B, mismatch allowed = 1, offset -10
CG,0,11 <- C or G at offset 11, no allowed mismatches
AT,1,12 <- A or T at offset 12, one allowed mismatch
=,1,0 <- Cumulative mismatch over all pattern elements = 1
This must be the VERY LAST line, if present.
The above is a form of logical AND. NOT and OR can be accomplished with these commands:
!,x,y <- PRECEDES Y lines NOT containing *,=,!,|
and means "exactly X of the following Y conditions
are false", unless X = 0, where it means "ALL are
false."
|,x,y <- PRECEDES Y lines NOT containing *,=,!,|
and means "exactly X of these Y are true". The error
reported towards cumulative error is the smallest
one found in the group. If X is zero, then
it means "one or more of these Y are true."
Note that the meaning of "Cumulative mismatch" is DIFFERENT for a region of AND logic and a region of OR logic. For AND each statement adds to the cumulative mismatch. However, for each SERIES of OR statements (the block folling a | statement) the cumulative mismatch is only increased by the minimum mismatch found. That is, if there are 10 statements in an OR block and all need match, and all do, and the first one has 2 mismatches and the other 9 have 3 each, then cumulative mismatch only increases by 2.
Note that these are equivalent:
!,1,0
ACG,0,10
and
!,3,0
A,0,10
C,0,10
G,0,10
These are traversed top to bottom. There can be extra spaces in the file, but blank lines and comments (like above) may not be present. All cores must be declared at the beginning of the file in the order A,B,C...->Z. The offsets are all with respect from the first base of the first core.
The CORES program reads this file and then applies it to a sequence or sequences. In the example given it would first check for an inverted "A" core at +20. If >1 mismatch it would move on to the next position. If <=1 it would try the next pattern element, the direct "B" core at offset -10. If that condition passed it would check for a C or G at 11, and then A or T at 12 and lastly, check to see if the cumulative mismatch score was in bounds. If *ALL* conditions have been satisfied it stores the sequence of the two cores like: AAAAAAAABBBBBBBBB.
-Width This command line qualifier makes sure that there is enough room to store the cores and their counts. /Width must be a multiple of 4, and it must be greater than or equal to 4 + sum_cores_length + 1. For the example above this is 4 + 8 + 9 + 1 = 22, so /Width = 24. Width defaults to 12 (wide enough for a single 7mer core). -mismatch Sets the cumulative mismatch limit. The value in the pattern file, if present, takes precedence. Defaults to 0. -cutoff Sets a lower bound on the number of times a core must appear to be written out. Defaults to 1.table of contents
This can take the output of the A_VS_A or any other program which emits sequence data in column format and converts it to a GCG sites file. You can then use that sites file with Mapplot to graphically show where the clustered sites are. This program can conditionally include some patterns and exclude others, based on the values in other columns. Columns must be in fixed positions, not tab separated.
col_to_gcg queries are: Input the name of the file to process Input the name of the output file Enter as many lines of comments as you would like End each line with a <return> End the last line with <return><return> Please enter the locations of the columns and their widths in this format: Begin, Width (0,0 terminates entry) The first entry MUST be the SEQUENCE location Enter up to 100 strings, which if found anywhere on a line cause that line to be ignored. These will be CASE SENSITIVE A blank line terminates entry of these strings If you enter no conditions all lines are processed that are not "ignored". You must supply a condition, column number, and value. All tests are on INTEGERS!!!!! The conditions that you may apply are |Code| Logic 1 equal to 2 not equal to 3 less than 4 less than or equal to 5 greater than 6 greater than or equal to Specify which of the numbered columns to place in the description field (after the !) in the output file. The columns that you specify will be separated by spaces and concatenated in the order in which you name them here. Total width must be <86 characters. Enter one column number per line (0 terminates entry): Example: (Note: first run the a_vs_a2 example to produce dmwhite_a_vs_a2.txt) which looks like this: Sequence N >Minover PMean #variances AAGTT 53 53 .4152254 1.5117 AATTT 80 70 .5868977 1.7809 etc. 12345678901234567890123456789012345678901234567890 with columns: 1 1,5 2 14,5 3 20,5 4 27,7 (because only integers may be tested) 5 35,8 (this can't be tested, it's a float) ) col_to_gcg << EOD dmwhite_a_vs_a2.txt dmwhite_gcg_enzyme.txt This is comment one This is comment two This is the last comment 1,5 27,7 0,0 > 4,2,4000000 0,0,0 1 2 0 EOD head -6 dmwhite_gcg_enzyme.txt This is comment one This is comment two This is the last comment NAME 0 SEQUENCE 0 ! COMMENTS .. AGGCT 0 AGGCT 0 ! AGGCT 3162714 AGGGT 0 AGGGT 0 ! AGGGT 3142303
The amount of white space in dmwhite_gcg_enzyme.txt was reduced so that the important pieces could be more easily seen.
table of contentsThis program converts all nmers in a fixed position size (ie, start at column 1, size 6) to their complements.
Prompts: Input the name of the file to process Input the name of the output file Beginning column for the Nmer in each record Length of the Nmer in each record Example: cat >killme.txt <<EOD AAGC 12 AGCT 1 ATTT 2 TTCT 4 EOD complement_nmer <<EOD >/dev/null killme.txt killme2.txt 4 4 EOD cat killme2.txt AAGC 12 AGCT 1 AAAT 2 AGAA 4table of contents
This program attempts to determine the lengths and frequencies of regions that are repeated 1,2,3 or more times in a single sequence. It outputs a map of the repeat usage for each sequence, and also a summary of repeat frequency and size over all sequences. The algorithm paints sequences with the number of times each strand degenerate Nmer at that position is found anywhere in the sequence. Summarizes the resulting run lengths. This gives the distribution of repeat lengths and also the gaps between unique sequence segments. (WARNING! You must have enough disk quota to hold 2*Seq.Length*Nmer bytes.) This program can take a long time, since it needs to sort files containing one Nmer per base in potentially very large files. Useful for figuring out how many SPLITs and CONTIGs a bunch of sequences would have if they had been assembled solely by shotgun methods (and assuming perfect reads on each sequence.)
Syntax: $ DNA_RUN_STATS [-INfile=]Primate:* -NMER=-Default
Required Parameters:
-NMER={number} Length of the Nmers to analyze (typically 50)
Local Data Files: None
[-OUTfile=] output file name
Optional Parameters:
-BEGin Starting base
-END Ending base
-SPLit=400 Repeat size that splits a contig, default shown
-MAXNruns=400 Runs of N larger than this abort entry
-FULL Output all runs to file
-SUMMARY Summarize files and results to screen
In the output, a SPLIT entry represents a repeated (or maybe N run) that is >= the size of a SPLIT. Contigs listed are those around the splits. !N entries take the opposite extreme and treat N's as valid sequence, so they cannot contribute to a split. Some Genbank entries have long runs of N, and those are eliminated by the default /MAXNruns value, if you want to count them as splits, set /MAXNruns higher than the largest N run you expect to find.
The output file looks like this (see report_usage for an explanation of the column's contents):
Run length report, total bases: 4622, sequence: GB_PR:HSATP1AX2
Begin End Length Copy Type Gap
1 357 357 1 0 0
358 408 51 0 0 0
409 1191 783 1 0 51
1192 1195 4 2 538 0
1196 1273 78 1 0 4
1274 1275 2 2 2525 0
1276 1323 48 1 0 2
1324 1333 10 2 1595 0
1334 1366 33 1 0 10
1367 1370 4 2 538 0
1371 1429 59 1 0 4
1430 1431 2 2 2525 0
Splits are gaps greater than the split limit, here 400. The !N entries treat N as unique sequence, rather than as part of a gap. The >=2 entry is the sum of all gaps containing repeated sequence, whether that is repeated once or 8 times.
Limitations: Nmers repeated more than 58 times cannot be handled properly, so use large Nmers to avoid this. If it is a problem, modify the SHORTMAX variable in the program.
table of contentsThis program can be used to filter sorted Nmer lists. It restricts overlap or proximity each type of Nmer. For instance, if the Nmers are 6mers, a value of 0 allows them to be adjacent, but not overlapping, a value of -5 lets them overlap except for one base, and a value of 5 means that there must be at least 5 bases between nearest instances. The filter will remove instances from the list that fail this rule. For instance, for a 6mer, value 0, if the positions were originally:
filter_nmerlist can also remove ranges of positions - useful for taking at exons for instance. Specify positions like "lower,upper" and terminate the list with "0,0". The order of the range doesn't matter. This is handy when working from CDS in Genbank entries that can be on either strand.
filter_nmerlist queries are:
Input the name of the file to process
Input the name of the output file
You may specify a minimum allowed distance between sites
This distance will be NMER + value
A value between -NMER+1 and -1 allows some overlap
A value of zero allows the sites to be adjacent
A value greater than 0 allows no overlap
A value less than or equal to -NMER is nonsense
You may specify up to 64000 ranges within which Nmers
will not be allowed. The format is:
lower, upper
Order doesn't matter: 1,10 and 10,1 both filter out 1<->10
Terminate the list with "0,0"
Example:
cat >test_filter <<EOD
6 .NMER.
1048576 .STEP.
1AAAAAA
2AAAAAA
3AAAAAA
4AAAAAA
5AAAAAA
6AAAAAA
7AAAAAA
8AAAAAA
20AAAAAA
30AAAAAA
EOD
filter_nmerlist <<EOD
test_filter
test_filter_out.txt
0
9,20
0,0
EOD
cat test_filter_out.txt
6 .NMER.
1048576 .STEP.
1AAAAAA
7AAAAAA
30AAAAAA
table of contents
This program can be used to filter sorted Nmer lists for entries that satisfy conditions based on their relative positions. For instance, it can be used to restrict the list based on the spacing between N1,N2,N3 - which are the positions of three consecutive Nmers of the same type such that:
10 < N2 - N1 & lt;20 and 10 < N3 - N2 & lt;20 and 22 < N3 - N1 & lt;32
To specify the above enter the following at the prompt:
1,2,10,20
2,3,10,20
1,3,22,32
0,0,0,0
filter_nmerlist2 queries are:
Input the name of the file to process
Input the name of the output file
You may specify up to 10 distance criteria
For each you must specify four values as
A,B,MIN,MAX
A = criteria position of one Nmer
B = criteria position of the other Nmer (A < B)
MIN = minimum distance between them (>=0)
MAX = maximum distance between them (<=2000000000)
The MIN,MAX distances are measured between the
left edges, ie, min=1 would allow them to overlap
except at the leftmost base.
Let N1,N2,N3 be the positions of three consecutive
NMERs of the same type.
Example criteria which means 10<(N2-N1)<20, 30<(N3-N2)<35
1,2,10,20
2,3,30,35
Example criteria which means same plus 40<(N3-N1)<50
1,2,10,20
2,3,30,35
1,3,40,50
To terminate list of criteria enter:
0,0,0,0
Example:
cat >test_filter <<EOD
6 .NMER.
1048576 .STEP.
10AAAAAA
20AAAAAA
30AAAAAA
40AAAAAA
104AAAAAA
210AAAAAA
220AAAAAA
240AAAAAA
EOD
filter_nmerlist2 <<EOD
test_filter
test_filter_out.txt
1,2,10,20
2,3,10,20
1,3,15,25
0,0,0,0
EOD
cat test_filter_out.txt
6 .NMER.
1048576 .STEP.
10AAAAAA
20AAAAAA
30AAAAAA
Note that any Nmer that satisfies these criteria will be written out, including cases where for instance, two or three overlapping N2's exist. To reduce these to just one (if that is desired) run the output of filter_nmerlist2 through filter_nmerlist. > table of contents
Many pairwise tests primarily pick up rotamer, conjugate, simple, and AT rich pairs. This program can be used to filter these Nmers out of output files, if said files have records of the form:
1. Rotamers 2. Conjugates 3. 100% AT 4. Simple (repeats of 1,2, or 3 bases) 5. Apply conditions 1-4 above except at N bases
In each case, "1" is the default to enable that text, "0" disables it. Tests 3 and 4 can be applied to the first, second, both, or either Nmers, with the latter being the default.
filer_content queries are: Input the name of the file to process Input the name of the file to receive the records kept Input the name of the file to receive the records rejected 1 Reject rotamers0 Keep rotamers 1 Reject conjugate Nmers 0 Keep conjugate Nmers Action for "simple" Nmers 0 Keep all 1 Reject if at either position 2 Reject if at both positions 3 Reject if at first position 4 Reject if at second position Action for 100% A/T Nmers 0 Keep all 1 Reject if at either position 2 Reject if at both positions 3 Reject if at first position 4 Reject if at second position Enter N for the following: Keep if satisfy above except at N or fewer bases (N=0 All lines kept) (N=1 Lines with up to 1 mismatch kept) Example: cat >test_content <<EOD GCTGCT CTGCTG a rotamer pair GCTGCT CTGCTA a rotamer pair with 1 mismatch GCTGCT AGCAGC a complemantary pair GCTGCT AGCAGC a complemantary pair with 1 mismatch AAAAAA GCTGCT first sequence is 100% AT AGCAGC ATATAT second sequence is 100% AT GGGGGG ACGTAT first sequence is simple GTGTGT ACGTAT first sequence is simple GCTGCT ACGTAT first sequence is simple ACGAGG CCGTAT not rotamer, complement, or simple EOD filter_content <<EOD test_content test_content_keep.txt test_content_reject.txt 1 1 1 1 1 EOD cat test_content_keep.txt ACGAGG CCGTAT not rotamer, complement, or simple cat test_content_reject.txt GCTGCT CTGCTG a rotamer pair GCTGCT CTGCTA a rotamer pair with 1 mismatch GCTGCT AGCAGC a complemantary pair GCTGCT AGCAGC a complemantary pair with 1 mismatch AAAAAA GCTGCT first sequence is 100% AT AGCAGC ATATAT second sequence is 100% AT GGGGGG ACGTAT first sequence is simple GTGTGT ACGTAT first sequence is simple GCTGCT ACGTAT first sequence is simple
That is, N is applied after the other tests, it means "reject if the test holds with up to this many mismatches." Lines that don't fit the "nmer nmer stuff" pattern go into the kept file. One trick, if you only want the rejects or the kept, direct the other file to "NLA0:", which is the null device. This eliminates the need to create a file that you don't want and then have to delete it later.
There is essentially no error checking in this program - typos will cause it to crash.
table of contentsSearches through a sorted file for numeric values which are closer than "k" apart. When such a pair is found the second such line is emitted. The first line is not emitted (unless it satisfies this condition for another pair.)
Prompts: Input the name of the file to process Input the name of the output file Numeric fields begin in which column of the record, must be greater than or equal to 1. Length of the Numeric field in each record Maximum difference between consecutive lines which is considered a match (>= 0) Begin processing lines at which file record? Example: 1 = first record of file Example: cat >killme.txt <<EOD 10001 A 10004 B 10101 C 10104 D 10201 E 10304 F EOD find_close <<EOD killme.txt killme2.txt 1 7 5 1 EOD matches found: 4 lines processed: 8 cat killme2.txt 10002 A2 10003 A3 10004 B 10104 Dtable of contents
Converts specific entries from a binary FOSN to a regular list file (FOSN). This program is designed to be run in batch mode - it does no error checking. Typically it would be run following some processing on a binary NMER file, which would result in a series of sequences, specified by their positions in the original list file, to retrieve. See MAKE_BINARY_FOSN and MAKE_NMER_LIST_BINARY
Inputs are:
1. The name of the binary fosn
2. The name of the file containing a list of records to retrieve (
terminates with a 0 (zero) or an end of file).
3. The name of the output file.
Example:
$ create test.order
1
2
10
^Z
$ run clusters:from_binary_fosn
human.lis_bin
test.order
test.out
$ type test.out
..
gb_est:whatever_for_1
gb_est:whatever_for_2
gb_est:whatever_for_10
table of contents
Run this after SUM_NMER_LIST, A_VS_A, or any other program that emits columns of numbers.. It generates a histogram from a column of numbers, optionally skipping certain lines and binning buckets together. In the following example a histogram with bins of width 10 is generated based on the 4mer composition of a sequence.
Prompts:
Input the name of the file to process
Input the name of the output file
The column's starting position, in characters:
The column's width, in characters:
Bin size (>0!!):
Enter up to 5 strings, which if found anywhere
on a line cause that line to be ignored.
These will be CASE SENSITIVE
A blank line terminates entry of these strings
String =
Example:
nmer -infile=gb_in:dmwhite -nmer=4 -markov=2 -out=killme.txt -cutoff=1 -default
head killme.txt
Expected generated by: Markov 2/1
Direction analyzed: Forward
Required direct repeats of Nmer core: 1
Only observed >= 1 AND Prob <= 1.00999999 shown
Nmer Obs Exp P (tail)
AAAA 231 175.632 0.00003737
AAAC 108 100.697 0.24606484
AAAG 97 90.417 0.25781465
AAAT 143 132.281 0.18627548
AACA 86 88.200 0.43496028
histo_column <<EOD
killme.txt
killme2.txt
24
5
10
e
EOD
cat killme2.txt
0, 0
10, 0
20, 16
30, 51
40, 66
50, 45
60, 33
70, 11
80, 13
90, 5
100, 4
110, 4
120, 4
130, 0
140, 1
150, 0
160, 1
170, 0
180, 0
190, 0
200, 0
210, 0
220, 0
230, 1
240, 0
250, 1
table of contents
This should be run on sorted binary data. It histograms the number of times runs of given lengths are found in the data. Optionally, it will output those that occur more than CUTOFF times (after first converting them to DNA sequence.) For instance, if run on a sorted BINARY NMER list this can tell you that there are 120 Nmers that appear 10 times.
table of contentsThis program converts all nmers in a fixed position size (ie, start at column 1, size 6) to their complements, then compares the two. Only if the original NMER is lexically less than or equal to its conjugate it is sent to the output file. The purpose of the exercise it to strip out redundant lines for those analyses which give the same score for AAAAT and ATTTT. This is more or less the inverse of NO_HOLES.
Prompts: Input the name of the file to process Input the name of the output file Beginning column for the Nmer in each record Length of the Nmer in each record Example: cat >killme.txt <<EOD AAGC 12 AGCT 1 ATTT 2 TTCT 4 EOD least_nmer <<EOD >/dev/null killme.txt killme2.txt 4 4 EOD cat killme2.txt AAGC 12 AGCT 1 table of contents
make_binary_fosn
Converts a list file (FOSN) to a binary list file. The latter consists of 20 byte fixed length records. The binary FOSN can be rapidly random accessed to retrieve GCG database entry names given the order number of an NMER (see MAKE_NMER_LIST_BINARY). Example:
Converts one or more GCG formatted file into an NMER list. An NMER list just shows the location of each Nmer. Warning, the resulting file will be roughly 12X larger than the original sequence, so make sure that you have enough diskspace.
Example:
make_nmer_list -infile=genbank:dmwhite -nmer=6 -out=killme.txt -default
head -5 killme.txt
6 .NMER.
1048576 .STEP.
1GAATTC
2AGAATT
rm killme.txt
make_nmer_list -infile=genbank:dmwhite -nmer=6 -out=killme.txt -feat <<EOD
fred
10
jane
100
EOD
head -5 killme.txt
6 .NMER.
1048576 .STEP.
10 FRED
100 JANE
1GAATTC
table of contents
Makes a binary NMER list. Each 8 byte record consists of a 4 byte integer encoding up to a 16 mer, followed by a 4 byte integer listing the order number of the sequence that that NMER came from. No other positional information is present. Warning, the resulting file will be (8 X the number of bases in the original sequence) bytes long.
table of contentsThis calculates a binary Nmer profile. This file contains the number of times each Nmer is observed at a given position in a sequence. Then the NMER profile can be run against a sequence using TEST_AGAINST_NMER_PROFILE.
Prompts: INPUT filename. A sorted NMER list LENGTH length of the longest sequence which went into the profile TYPE 1=nucleotide, 2=peptide Restrictions on profiles are: which largest Nmer longest sequence nucleotide 6 200 peptide 3 100table of contents
This program converts triplets of (name,begin,end), one per line into range/subranges for use with a_vs_range2 and also a_vs_b2 and such.
Prompts:
INPUT filename. Contains a series of records like:
name, begin, end
(if begin > end) then the values are swapped)
_OFFSET,val,0
add val to all subsequent positions
_STEP,val,0
like offset
_FILL,val,0
Fill all ranges with enough subranges so that
the distance between the ends of each subrange
are <= VAL (ie, end - begin + 1 /N <= val)
Example:
cat <<EOD >range_input.txt
jane,100,200
fred,1,10
_OFFSET,1000,0
ethel,1,10
_FILL,20,0
windows,300,400
EOD
make_ranges <<EOD >/dev/null
range_input.txt
range_output.txt
EOD
cat range_output.txt
100 jane
200 jane
1 fred
10 fred
1001 ethel
1010 ethel
1300 windows
1320 windows
1320 windows
1340 windows
1340 windows
1361 windows
1361 windows
1381 windows
1381 windows
1400 windows
The output file is both a RANGES list and an NMER list. It does NOT contain the .step. and .nmer. fields since it is intended to be appended to an existing list (which already has that) and sorted.
Since some fields may be out of order, this should be sorted before being appended to an nmer list. This is because nmer lists are usually only sorted on the names because the positions are in order going in - that isn't the case here - see how "jane" is out of order with respect to "fred". Use a command like:
on Unix:
% sort range_output.txt > range_output_inorder.txt
on VMS:
$ sort/key=(pos:10,size:30)/key=(pos:1,size:9)/stable -
range_output.txt ange_output_inorder.txt
table of contents
Make a piece of random dna based on some sample. Output can be of any size up to the GCG limit. It can also accumulate statistics over a collection of DNA using the -DUMP=dna.stat switch, and then read that in again using -INStat=dna.stat. The statistic dump file is not ver large, and saves having to read through the collection again.
Examples:
# generate 50 bp with same statistics as dmwhite - directly
make_random_dna -infile=genbank:dmwhite -out=rand.seq -length=50 -default
cat rand.seq
Length: 50 September 17, 2001 15:23 Type: N Check: 1969 ..
1 AGGAGTATGA TGACAAATTA TCCGAAACAG CTAACGTTTC GTCAAGGTTC
# generate 50 bp with same statistics as dmwhite - through a statistics file
make_random_dna -infile=genbank:dmwhite -dump=dmwhite.stat -default
head -5 dmwhite.stat
Transition table generated by Make_Random_DNA
3Mer-> G A T C ..
GGG 0.29798 0.23737 0.20707 0.25758
GGA 0.27717 0.34783 0.18478 0.19022
GGT 0.39706 0.19118 0.25735 0.15441
make_random_dna -instat=dmwhite.stat -out=rand.seq -length=50 -default
cat rand.seq
Length: 50 September 17, 2001 15:25 Type: N Check: 9393 ..
1 ACACTAAAAC CGGTCAAATT GCAATCTGTA TGCCCAGGGC CCCCAAGGGG
table of contents
This calculates expected and prob. of observed/expected from a set of subsequences (<= 19 AA or BP long, 50000 lines limit) against a set of sequences. It uses Markov chain methods to predict the expected number of each subsequence in the set of sequences. It is similar to NMER except that NMER generates this calculation for all tuples of a given size, to check a single long tuple (for instance, a 16mer) use MARKOVTEST instead. The format for the subsequences is to put them into a file like this:
!comment, this is ignored ACGTCGCT,17 TCTCTCT,12 !comment, this is ignored
No leading or internal spaces in the subsequence,number part!!! The number shown is the number of "observed" for this subsequence. This must be known ahead of time. If it isn't known, then it defaults to zero, so expected may be calculated, but Prob will be for zero observed versus expected.
Examples i
cat <<EOD >dmwhite.patterns
TTTCATTGTC,5
CAAATTGTCT,23
AACTGCGACC,0
GCGGCG,1
GCGGCG,2
GCGGCG,3
GCGGCG,4
GCGGCG,5
GCGGCG,6
GCGGCG,7
GCGGCG,8
GCGGCG,9
GCGGCG,10
GCGGCG,11
GCGGCG,12
GCGGCG,13
GCGGCG,14
GCGGCG,15
TGTTGT,30
TGTTT,100
TGTT,100
TGT,100
EOD
markovtest -infile=gb_in:dmwhite -data=dmwhite.patterns -cutoff=0 \
-markov=2 -outfile=dmwhite.markovtest.txt -default
cat dmwhite.markovtest.txt
Expected generated by: Markov 2/1
Direction analyzed: Forward
Only observed >= 0 shown
Nmer Obs Exp P (tail)
TTTCATTGTC 5 0.032 0.00000000
CAAATTGTCT 23 0.031 0.00000000
AACTGCGACC 0 0.011 0.98928922
GCGGCG 1 2.064 0.38888121
GCGGCG 2 2.064 0.65927798
GCGGCG 3 2.064 0.34072202
GCGGCG 4 2.064 0.15466189
GCGGCG 5 2.064 0.05864090
GCGGCG 6 2.064 0.01899767
GCGGCG 7 2.064 0.00535840
GCGGCG 8 2.064 0.00133616
GCGGCG 9 2.064 0.00029832
GCGGCG 10 2.064 0.00006026
GCGGCG 11 2.064 0.00001109
GCGGCG 12 2.064 0.00000191
GCGGCG 13 2.064 0.00000030
GCGGCG 14 2.064 0.00000006
GCGGCG 15 2.064 0.00000000
TGTTGT 30 3.965 0.00000000
TGTTT 100 27.232 0.00000000
TGTT 100 75.823 0.00450516
TGT 100 211.116 0.00000000
#
# to pick up statistics over many files, use something like this
#
markovtest '-infile=gb_in:dm*' -data=dmwhite.patterns -cutoff=0 \
-markov=2 -outfile=dmwhite.markovtest.txt -default
Markovtest should be run in batch mode if infile is a big chunk of a database!!!
table of contentsThis is like the GCG program COMPOSITION except that it allows N in the range 1 to 10. It will work on multiple files using any of the GCG conventions (ie, GB:* or @mylist.fil). It tests the significance of observed vs. expected (from the composition or Markov chain method) by a single tailed summation of the Poisson distribution assuming that the calculated expected is the mean. So, if expected = 100 and observed is 150 it is the terms from 150->infinity, if observed = 50 it is the terms from 0->50. Warning: if obs/expected are >20000 or so the probability calculation may crash the program, so don't run Nmer <=3 on huge sequences! Nmer also outputs as the last line, the chi**2 value for all obs,expected pairs, with observerd >= cutoff. This can be used to test the goodness of fit of the Markov chain model used versus the actual data.
Use -check to see the various command line options. -DNAfast makes a huge difference if you can use it. NMER will also work on proteins.
Example: nmer -infile=gb_in:dmwhite -nmer=6 -markov=3 -out=dmwhite.nmer6.txt -cutoff=15 -default head -10 dmwhite.nmer6.txt Expected generated by: Markov 3/2 Direction analyzed: Forward Required direct repeats of Nmer core: 1 Only observed >= 15 AND Prob <= 1.00999999 shown Nmer Obs Exp P (tail) AAAAAA 39 39.414 0.51605129 AAAAAC 15 18.244 0.26790133 AAAAAT 26 22.736 0.27331400 AAAACT 18 12.145 0.06870717 AAAAGT 16 11.529 0.12352026 nmer -infile=gb_in:dmwhite -nmer=6 -markov=3 -out=dmwhite.nmer6.txt -repeat=2 -cutoff=2 -default cat dmwhite.nmer6.txt Expected generated by: Markov 3/2 Direction analyzed: Forward Required direct repeats of Nmer core: 2 Only observed >= 2 AND Prob <= 1.00999999 shown Nmer Obs Exp P (tail) AAAAAA 4 0.183 0.00004005 ACACAC 4 0.003 0.00000000 ATATAT 9 0.009 0.00000000 CACACA 5 0.004 0.00000000 CTCTCT 3 0.002 0.00000000 TATATA 8 0.007 0.00000000 TCTCTC 3 0.002 0.00000000 TTTTTT 8 0.232 0.00000000 >chi**2: 39417.425781 entries:8table of contents
This is a modified version of findpatterns that outputs a set of hits in a format that can be read by the CLUSTERS programs. This provides a simple way of generating lists of positions of degenerate patterns, for instance, one could find all locations of every entry in the TFD and then look for clustering. The output, like that from make_nmer_list, must be sorted before use (see SORT). It has a few special switches:
-Offset Add this to all positions
-Step Add this to the offset for each set of hits in a
multisequence analysis. (This allows some programs
to analyze one sequence at a time, and yet do statistics
over all.) Step must be bigger than the biggest sequence
that you analyze.
-NMER Some CLUSTERS programs can filter for overlaps, but they
need to know how big a pattern is, and all patterns must
be the same size. You can use this switch to tell the
machine what size to use for all patterns
-PREFIX Specify a string. If the patterns are entered on the command
line, or they are not given names in a data file, then they
will be named " Pat1"," Pat2", etc. by default. Prefix lets you
change " Pat" to something else. In particular, you
may or may not want to have the leading " ", which is used
by a_vs_b2 to distinguish between special sites and regular
sites.
Example:
nfindpatterns -infile=gb_in:dmwhite \
-pattern=ATATATATATAT,CTCTCTCTCTCTCT \
-out=dmwhite.nfindpatterns.txt -default
cat dmwhite.nfindpatterns.txt
6 .NMER.
262144 .STEP.
3184Pat1
5728Pat1
5730Pat1
5732Pat1
5734Pat1
5736Pat1
5738Pat1
5740Pat1
5742Pat1
14066Pat2
14068Pat2
Where Pat1 is ATATATATATAT, Pat2 is CTCTCTCTCTCTCT.
If there are more than just a couple of patterns to
check put them in a patterns file and load them with
-data=file.patterns.txt. The advantage of doing this
is that it also allows more meaningful names to be used
for the matches (compared to Pat1, Pat2, etc.) Example:
cat >killme.pattern <<EOD
blah blah
Name Offset Pattern Overhang Documentation ..
AT_REP 0 ATATATATATAT 0 !
CT_REP 0 CTCTCTCTCTCTCT 0 !
EOD
nfindpatterns -infile=gb_in:dmwhite \
-data=killme.pattern \
-out=dmwhite.nfindpatterns.txt -default
cat dmwhite.nfindpatterns.txt
6 .NMER.
262144 .STEP.
3184AT_REP
5728AT_REP
5730AT_REP
5732AT_REP
5734AT_REP
5736AT_REP
5738AT_REP
5740AT_REP
5742AT_REP
14066CT_REP
14068CT_REP
table of contents
no_holes
Fills in missing records in a sorted NMER list. Many programs, CORES, for
instance, only output NMERs that match certain criteria, not the full list
of all possible NMERs. While this is fine for many purposes, it does make
it quite difficult to compare the results of runs which happened to find
different sets of NMERS. NO_HOLES can be used to postprocess a partial
sorted NMER list into a complete sorted NMER list. The NMERs may begin in
any column abd can be any size. There may be leading and trailing non-NMER
records in the file. NMERs are converted to upper case before being
compared, and that can result in mixed cases (see below)
Example:
cat >testseq.dat <<EOD
1
2
3
4
5
ac 2
ag 3
AT 4
CC 2
CG 3
CT 4
GA 1
GC 2
GG 3
GT 4
TA 1
TC 2
trailing line 1
trailing line 2
trailing line 3
EOD
no_holes <<EOD >/dev/null
testseq.dat
newtestseq.dat
4
2
6
NEW
___
EOD
cat newtestseq.dat
1
2
3
4
5
___AA NEW
___ac 2
___ag 3
___AT 4
___CA NEW
___CC 2
___CG 3
___CT 4
___GA 1
___GC 2
___GG 3
___GT 4
___TA 1
___TC 2
___TG NEW
___TT NEW
trailing line 1
trailing line 2
trailing line 3
table of contents
occupation
This is a modified version of findpatterns that measures
the area covered by the patterns searched for. That is, it tells
what fraction of the DNA or Protein is comprised of the patterns entered.
Optionally, it can plot this density, or the locations of these sites.
To see all command line options run the program with -check. Some
switches of note:
-WINdow Specifies a window size for plot
-STEP Bases/amino acids to shift between windows
-DOCount If set, then the plot shows the density of sites instead
of their coverage.
Example:
# set up graphics
xwindows color
occupation -infile=gb_in:dmwhite \
-out=dmwhite.atoccup.txt\
-pattern=at -window=100 -default
# the plot comes up on the graphics device
cat dmwhite.atoccup.txt
! run on: September 17, 2001 16:04
!
! Patterns searched:
! 1 at
!Results follow:
! Name Cover Length Fraction
DMWHITE 2136 14245 0.14994735
table of contents
report_usage
This reads a sorted Nmer list (that *MUST* have been derived from a single
sequence). Each Nmer must be large enough (at least size 20 for small sequences,
30 or more for larger ones) so that it would normally occur only once by chance
within the sequence.It reports how often each Nmer appears in the sequence
(once, twice, etc.), how much sequence each covers (sort of), how big the
longest run is and where it is. This program indicates retrospectively
how difficult a region would have been to assemble via shotgun methods.
TEST_SHOTGUN is a related program.
prompts:
Input the name of Sorted Nmer file to process
Input the name of the output file
Input the minimum length of nonunique
sequence that would split a contig in two.
Example:
make_nmer_list -infile=GB_IN:DMU31961 -nmer=30 -out=killme.txt -default
sort -t\n -k0.10 -k0.0,0.9 killme.txt > killme2.txt
report_usage <<EOD >killme4.txt
killme2.txt
killme3.txt
300
EOD
One output file gives a blow by blow summary of the various types of DNA
that occur, their copy number and how often they are found. In the following
table the columns are:
- Begin. start of segment of DNA
- End. end of that segment of DNA
- Length. length of that segment of DNA
- Copy. copy number, 1 means single copy, 2 means 2 copies
- Type. A name assigned to that type of DNA, 0 for single copy
- Gap. Gap between single copy sgements (minus the Nmer size)
cat killme3.txt
Nmer Usage report
Begin End Length Copy Type Gap
1 26012 26012 1 0 0
26013 26014 2 2 164141 0
26015 26044 30 1 0 2
26045 26046 2 2 164141 0
26047 49027 22981 1 0 2
49028 49037 10 5 265728 0
49038 68561 19524 1 0 10
68562 68644 83 2 267201 0
68645 68672 28 1 0 83
68673 68755 83 2 267201 0
68756 338234 269479 1 0 83
The data sent to stdout, redirected above to killme4.txt, contains
a global view of the possible assembly problems with this sequence.
- Repeat. Summary for copy number. 1 = single copy, 2 = two copies
- Coverage. Amount of DNA covered by this copy number
- Segments. Segments of DNA covered by this copy number
- Avg. Average length = coverage/segments
- Longest. Longest segment
- Where. Location of Longest
cat killme4.txt
Summary of runs observed in sequence
Using Nmer size: 30
Using Split size: 300
Repeat Coverage Segments Avg. Longest Where
1 338054 6 56342 269479 68756
2 170 4 42 83 68562
5 10 1 10 10 49028
>=2 180 5 36 83 68562
splits 0 0 0 0 0
contigs 338234 1 338234 338234 1
In this example the DNA region has no serious obstructions to shotgun
cloning - there are no gaps larger than 300 bp ("splits") and amount of
multicopy sequence is very small.
table of contents
sort
Most programs require that NMER_LISTs be in sorted order.
Unix% sort -t\n -k0.10 -k0.0,0.9 nmer.list > nmer.sorted
OpenVMS$ SORT/STABLE/KEY=(POS:10,SIZE:64)/Proc=tag nmer.list nmer.sorted
(Ignore the warning message about invalid key.)
You will also want to sort the output of A_VS_A, putting the most clustered
Nmers at the top. The command for doing this on the mean of the closest
values is:
Unix% sort -t\n -k0.26,0.34 a_vs_a.out > a_vs_a.sorted
OpenVMS$ SORT/STABLE/KEY=(POS:26,SIZE:9) a_vs_a.out a_vs_a.sorted
To put a particularly messy list (both positions and nmers out of order)
use this:
Unix% sort -t\n -k0.10 -k0.0,0.9 nmer.list > nmer.sorted
OpenVMS$ SORT/STABLE/KEY=(POS:10,SIZE:64,NUMBER:1) -
/KEY=(POS:1,SIZE:9,NUMBER:2) -
/Proc=tag nmer.list nmer.sorted
Note that many files on OpenVMS have an invalid RMS attribute set which is why
/PROC=tag is specified above. If you happen to know the length of the
longest line in a file, and the file's format is stream-lf (use ANAL/RMS
to find out the file format), it is generally faster to do:
OpenVMS$ set file/attrib=mrs:20 filename
then sort
Where "20" would be replaced as appropriate for the file
table of contents
sum_nmer_list
Run this next on the SORTED Nmer_list. The output will tell you how many
of each Nmer are present.
Prompts:
Also process non Nmer marked locations Y/N [N]
Input the name of the file to process
Input the name of the output file
Example:
make_nmer_list -infile=genbank:dmwhite -nmer=6 -out=killme.txt -default
sort -t\n -k0.10 -k0.0,0.9 killme.txt > killme2.txt
sum_nmer_list <<EOD >/dev/null
N
killme2.txt
killme3.txt
EOD
head -5 killme3.txt
10,AAATTT
1,AACGTT
30,AACTTT
3,AAGCTT
3,AAGGTT
table of contents
test_against_nmer_profile
Scan a binary NMER profile down a plain text or FASTA sequence, and report
the score at each position. The score is the average over the window size
of the profile (set by the longest sequence) for the weights of each Nmer
observed. That is, if AGCT is at position 3, it looks up position 3 in the
profile, for Nmer AGCT, sees that it was present 21 times in the input
sequences at that position, and also sees that 1000 input sequences had
valid Nmers at that position, so it adds 21/1000 to the score. When the whole
window is done it divides by the window length. (That is, scores are
normalized for length and number of input sequences, so that scores are
always between 0 and 1.)
There are no prompts. The program expects to find the following:
bnprof the name of the file holding the binary Nmer profile
testseq the name of the sequence to test. Nucleotide or
peptide. Plain text or fasta, lines no longer than 256
characters, no spaces, numbers or other detritus
results the name of the output file
table of contents
test_shotgun
TEST_SHOTGUN simulates sequencing projects. It can be used to figure
out the degree of completion of a project, and/or to figure out what a
project would cost. Projects up to 1 Mbp are simulated directly, over
that they are simulated in a 1 Mbp region and then scaled to the final
size. The program can handle multiple "phases", each with a different
sequencing protocol. In any given phase, at least ONE clone must be
added - note that for very large projects this scales too, so that adding
1 clone to the human genome (3 Gb) shows up in the output as 3000 clones.
Prompts:
Input the name of the output file
Input one text line that describes this project (<=60 characters)
Input the amount of project DNA, in base pairs
Input the cost of obtaining an oligonucleotide primer
Input the amount of sequence recoverable from a
primer walking sequencing step (,cost)
Input the size of clones that will be used to close
gaps between Mesh/Contigs (,cost). (NOTE. if clones
are to be selected by hybridization with a single
probe then this size should be 1/2 the actual size
of the clones as each clone will be randomly positioned
with respect to that probe so, on average, only 1/2 of
each clone will be new sequence.)
Input the minimum overlap to merge a Mesh/Contig
AND for overlap in primer walking
DNA sequencing project phase: 1
Input the size of each clone (,cost per clone)
Input the amount of sequence from each end(,cost)
Input the number of such clones to process
0 exits, any other number to add more fragments
Example:
test_shotgun <<EOD
killme.txt
This is a made up project
100000
10.00
500,5.00
10000,2.00
50
2000,0.50
500,5.00
200
0
EOD
cat killme.txt
Project: This is a made up project
Project constants
Genome size (base pairs): 100000.
Scaling from internal subsequence: 1.00
Required overlap for Mesh merge: 50
Bases of sequence from a custom primer: 450
Bases in clones used to close mesh gaps: 10000
Cost to obtain a clone outside a Mesh/contig: 2.00
Cost to make a custom primer: 10.00
Cost to sequence from a custom primer: 5.00
Phase : 1
Number of clones added: 200
Size of each clone added: 2000
Bases of sequence from each end primer: 500
Number of genome units of coverage: 4.00
Number of genome units of sequence: 2.00
Unit cost: clone isolation and preparation 0.50
Unit cost: end primer sequencing: 5.00
Total phase cost for clone preparation: 100.00
Total phase cost for end sequencing: 2000.00
Project status after phase: 1
Summary of coverage fractions:
included in clones: 0.9757300
sequenced: 0.8643800
both directions: 0.3757600
multiply, one direction: 0.1975600
once: 0.2910600
gaps: 0.1356200
IntraMesh: 0.1113500
InterMesh: 0.0242700
Mesh information
number (Meshes): 4
max size (clones): 130
avg. size (clones): 50
max size (bp): 62224
avg. size (bp): 25463
Contig information
number (Contigs): 76
max size (sequences): 31
avg. size (sequences): 5
max size (bp): 4322
avg. size (bp): 1140
IntraMesh Gaps (inside clones)
fraction: 0.1113500
number (internal gaps): 60
max size (bp): 782
avg. size: (bp) 185
InterMesh Gaps (between clones)
fraction: 0.0242700
number (external gaps): 3
max size (bp): 1093
avg. size (bp): 809
minimum new clones to close: 3
Regions sequenced more than once, one direction only
fraction: 0.1975600
number (regions): 82
max size (bp): 1415
avg. size (bp): 240
Regions sequenced both directions
fraction: 0.3757600
number (regions): 98
max size (bp): 1203
avg. size (bp): 383
Regions sequenced once only
fraction: 0.2910600
number (regions): 175
max size (bp): 500
avg. size (bp): 166
Sequencing steps required to close
IntraMesh gaps (X): 64
InterMesh gaps (Y): 7
IntraMesh, one strand -> both strands(Z): 191
close all, both strands (Z+2*(X+Y)): 333
Total costs
Shotgun clone isolation and preparation (A): 100.00
Shotgun clone end sequencing (B): 2000.00
Custom primer synthesis (C+D+E): 3330.00
IntraMesh gaps (C): 1280.00
InterMesh gaps (D): 140.00
IntraMesh 1->2 strands (E): 1910.00
Custom primer sequencing (F+G+H): 1665.00
IntraMesh gaps (F): 640.00
InterMesh gaps (G): 70.00
IntraMesh 1->2 strands (H): 955.00
InterMesh clone isolation (I): 6.00
Grand Total cost (A+B+C+D+E+F+G+H+I): 7101.00
table of contents
unique_nmer
This program converts all nmers in a fixed position size (ie, start at
column 1, size 6) to a "unique" form. This merges all rotamers and
conjugate nmers into a single form. This assumes no base changes - it
cannot merge nmers that differ in unique forms by a single base into a
single form. For instance, all of the following forms become: TGCAGC
GCAGCT,CAGCTG,(and other rotamers)
GCTGCT,CTGCTG (and other rotamer/conjugates)
Prompts:
Input the name of the file to process
Input the name of the output file
Beginning column for the Nmer in each record
Length of the Nmer in each record
Example:
# the first 8 lines are all rotamer/conjugates for GGGC
cat >killme.txt <<EOD
GGGC
GGCG
GCGG
CGGG
CCCG
CCGC
CGCC
GCCC
ATAT
EOD
unique_nmer <<EOD >/dev/null
killme.txt
killme2.txt
1
4
EOD
cat killme2.txt
GGGC
GGGC
GGGC
GGGC
GGGC
GGGC
GGGC
GGGC
TATA
table of contents
search
Search is a VMS utility. For the equivalent function on unix, do "man grep".
If you find a bucket in a HISTO_COLUMN output that interests you (for
instance, on one of the tails), you can find the matching NMER by
searching for that value in the SUM_NMER file or the A_VS_A file. Also
use search against GENMOREDATA:TFSITES.DAT to see if a particular sequence
is a known transcription factor.
table of contents
zero_palindrome_nmer
Sometimes palindromic sequences show up twice in analyses, with the
second appearance being in a separate column. In this instance, use
this program to reset the value to 0.
Prompts:
Input the name of the output file
output file
starting column for nmers
Length of the Nmer in each record
Begin processing Nmer lines at which file record?
text which will overwrite the value part of the record
column where this overwrite should occur
Example:
nmer -infile=genbank:dmwhite -nmer=2 -out=killme.txt -markov=1 -both -default
zero_palindrome_nmer <<EOD >/dev/null
killme.txt
killme2.txt
1
2
6
0
23
EOD
cat killme2.txt
Expected generated by: Markov 1/0
Direction analyzed: BothStrands
Required direct repeats of Nmer core: 1
Only observed >= 0 AND Prob <= 1.00999999 shown
Nmer Obs Exp P (tail)
AA 2868 2283.463 0.00000000
AC 1506 1749.192 0.00000000
AG 1556 1749.321 0.00000133
AT 0 1141.729 0.01440653
CA 1894 1749.192 0.00032663
CC 1479 1341.900 0.00011998
CG 0 669.077 0.04123657
CT 1556 1749.321 0.00000133
GA 1550 1749.321 0.00000064
GC 0 669.077 0.00000000
GG 1479 1341.900 0.00011998
GT 1506 1749.192 0.00000000
TA 0 1141.729 0.00000000
TC 1550 1749.321 0.00000064
TG 1894 1749.192 0.00032663
TT 2868 2283.463 0.00000000
>chi**2: 611.164185 entries:16
table of contents
nmer lists
Most of these programs work with sorted Nmer lists which are derived from the regular
nmer list described here. An Nmer is a term we sometimes
use for a sequence of length N (the more technical mathematical name is a tuple).
An Nmer list is a text file (although some programs use a binary nmer list, which is similar
but can't be viewed as normal text) that begins with one or more header records and then
is followed by a long list of Nmers and their positions in the sequence. Here is an
example:
6 .NMER. The size of the nmers in this file
1048576 .STEP. If M=1->K sequences, the positions for sequence M are (M-1)*.STEP.+offset
100 EcoRI An annotated site
1GAATTC An Nmer, its position is a 9 digit integer - which is
2AGAATT big enough for any real sequence. This is followed by the sequence
3CAGAAT of the Nmer itself.
4TTCTGC
5TCTGCT Note that each line ends with a SPACE, which is hard to see
6GAGCAG but essential for some of the programs to work properly.
7TGCTCC
8TGGAGC Note that AAAAAT and ATTTTT would both be stored as AAAAAT
9CTGGAG
It's possible to put site annotation into this type of file as well. When doing so
the name of the site must begin with a space. This fixed format allows the files
to be easily sorted on the Nmer field.
table of contents
sorted nmer lists
To generate a sorted nmer list from a regular nmer list use the appropriate sort command
for the operating system you are using. For Unix that command would be:
sort -t\n -k0.10 -k0.0,0.9 nmer_list.txt > sorted_nmer_list.txt
When applied to the nmer list in the previous example the result would be:
6 .NMER.
1048576 .STEP.
100 EcoRI
2AGAATT
3CAGAAT
1GAATTC
6GAGCAG
5TCTGCT
7TGCTCC
8TGGAGC
4TTCTGC
To see better why the lists are sorted a bigger example is best employed
make_nmer_list -infile=GB_IN:DMWHITE -nmer=6 -out=killme.txt -default
cat <<EOD >>killme.txt
100 EcoRI
2000 EcoRI
1500 HindIII
2100 HindIII
EOD
sort -t\n -k0.10 -k0.0,0.9 killme.txt > killme2.txt
head -10 killme2.txt
6 .NMER.
1048576 .STEP.
100 EcoRI
2000 EcoRI
1500 HindIII
2100 HindIII
783AAATTT
3404AAATTT
3471AAATTT
4084AAATTT
The sort command says to use a newline as a field separator - and
since there won't be one in any line the whole line becomes a single
field. The first key says to sort lexically (by character order)
starting at the 10th character on the line (the Nmer) and the
second key says to sort within that on characters 1 through 9 (the position
data).
range lists
These are similar in format to Nmer lists but lack the .STEP. and .NMER. lines. Each range
is specified by a consecutive pair of positions with the same name.
example:
6960 0-500up
7460 0-500up
7460 transcribed
13428 transcribed
7686 CDS
7757 CDS
10853 CDS
11153 CDS
In this example the ranges are: 0-500up: 6960-7460, transcribed: 7460-13428, CDS: 7686-7757,
10853-11153. As with an Nmer list the position is set by 9 digit integer, followed by the name (which
usually begins with a space to distinguish it from an Nmer) and followed by a space. The
program Make_Ranges generates these files from a set of commands.
table of contents