Clusters programs documentation

This is minimal documentation on the Clusters software package. This package performs various analyses generally aimed at detecting clusters of sequences or patterns of overrepresented tuples (Nmers). Some of the software described here is highly experimental and may contain serious bugs. Those programs whose description below reads "description" have not yet been ported from OpenVMS to Solaris. Caveat emptor!

Most of these programs are invoked by typing:

and then answering the prompts. The exceptions are programs like NMER and MAKE_NMER_LIST which have GCG interfaces and will accept command line arguments. These are indicated in the following list by a GCG or Prompt. Like all other GCG programs the command must be given once (per session) before the programs will run.

Most of these programs will not write over an existing file. Attempts to do so will result in a message similar to this one:

  program_name.subroutine: [1018] can't find 'old' file
  logical unit 10, named 'file_that_already_exists.txt'
  Abort (core dumped)

The file types which are used by these programs are:

nmer lists A list of Nmers of a fixed size and their positions in a sequence
sorted nmer lists An Nmer list which has been sorted
range lists A list of start/end positions for named ranges

The clusters programs are as listed below.

a_bin_dist Prompt Poisson analysis of a sorted Nmer list
a_spacing Prompt Histogram distances between successive NMERs in a sorted nmer list
a_vs_a Prompt Calculates the clustering values within a sorted nmer list using an heuristic methoc
a_vs_a2 Prompt Calculates the clustering values within a sorted nmer list in terms of "closeness" and significance
a_vs_a3 Prompt Dotplots through GCG graphics Nmers from a list which fall within ranges. Can do genomic scale dotplots.
a_vs_a4 Prompt Histogram distances between all Nmers in a sorted nmer list
a_vs_a5 Prompt Search for evidence of positional phasing in a sorted nmer list
a_vs_b Prompt Description
a_vs_b2 Prompt Description
a_vs_b3 Prompt Description
a_vs_range Prompt Counts Nmers falling in bins in a sorted range list
a_vs_range2 Prompt Probability of entries in a sorted nmer list falling in bins in a sorted range list
col_to_gcg Prompt Convert a table formatted data to a GCG enzyme file
complement_nmer Prompt Complements nmers in a fixed position(ie, start at column 1, size 6)
cores GCG Searches for patterns by symmetry, such QQQQnnnnQQQQ, where QQQQ can be any 4mer
dna_run_stats GCG Analyze the copy number coverage of a sequence and how it would have shotgun assembled
filter_content Prompt Selects lines from files based on tests of two Nmers for rotamer, conjugate, simple, and 100% AT.
filter_nmerlist Prompt Filter from a Sorted Nmer List entries that are too close together or are in specific positions
filter_nmerlist2 Prompt Filter sorted Nmer lists leaving only entries whose relative positions satisfy specified criteria.
find_close Prompt Extract rows from a file where a column is numerically close
from_binary_fosn Prompt Description
histo_column Prompt Generate a histogram from a column of numbers
histo_column_binary Prompt Description
least_nmer Prompt Retain in a list an Nmers only if it is lexically <= its complement.
make_binary_fosn Prompt Description
make_nmer_list GCG Converts one or more GCG formatted file into an NMER list
make_nmer_list_binary GCG Description
make_nmer_profile Prompt Description
make_random_dna GCG Use Markov Chain methods to make DNA with statistical properties similar to the input
make_ranges Prompt Converts a file of (name,begin,end) triplets into a Range_List.
markovtest GCG Calculate probability of subsequences being found within a set of sequences
nmer GCG Count tuples (Nmers) within a DNA or protein sequence. Probability estimates.
nfindpatterns GCG Variant of GCG findpatterns that outputs hits in a format that can be read by the CLUSTERS programs.
no_holes Prompt Fill in missing records in a sorted NMER list
occupation Prompt Calculates the length of a sequence or sequences that is/are covered by a defined set of patterns
report_usage Prompt Analyzes copy number information from a sorted Nmer list derived from a single sequence
sum_nmer_list Prompt Process a sorted nmer list to yield the number of times each tuple was observed
test_against_nmer_profile Prompt Description
test_shotgun Prompt Simulates shotgun sequencing projects and generates cost estimates
unique_nmer Prompt Converts Nmers to a unique form, eliminating conjugates and rotamers
zero_palindrome_nmer Prompt Detects the "higher" complement form in an Nmer list and overwrites its data field

On VMS you will also use: SORT SEARCH On Unix you will also use: sort grep

table of contents

a_bin_dist

Performs a Poisson analysis starting from a sorted Nmer list. It finds first the bin (you specify size) that contains 6 sites, nulls those out, finds 5 sites, etc. down to 2. It then estimates the number of 1 site bins as

The number of 0 site bins is where This information is output along with the sequence in the first row. The mean value for bin occupancy is then calculated, and assuming Poisson statistics, the estimated number of bins of each type is output as the next row. Lastly, using the expected value from each cell in row 2 for bins with 2 or more occupants as the mean (expected) value for Poisson distribution (NOT the one in row 2!!!) the probability of obtaining by chance the OBSERVED number is calculated. This is output in the last row as -log10(P), ie, 1 = 0.1 prob, 2 = .02 prob. Example, expected = 0.5, observed = n, then prob. of observing 2 is 1.0 - each poisson term is (0.5**i)(e**-0.5)/i!, and the probability is = (sum of {poisson term(0) -> poisson term (n-1)}). For terms up to 15 the exact formula is used, above that, Stirling's approximation is employed. WARNING: For sites with very many members and large bins the "Unoccupied bins" category can become NEGATIVE - this is NONSENSICAL and the analysis should not be considered reliable for such sites!!! table of contents

a_spacing

Histogram distances between SUCCESSIVE NMERs in a sorted nmer list. ( Compare to A_VS_A4, which histograms distances between ALL NMERs in such a list.) The histogram always has 11 bins. The first 10 are of a user specified width, the 11th is for everything larger than that. Output looks like:

Sequence        Count     Chi2:        2        4        6        8       10       12       14       16       18       20 >     20
AAGTT               6     13.7         2        0        1        0        0        0        0        0        0        0        2

Columns are:
1. Nmer sequence
2. Number of Nmers in calculation
3. Chi**2 value over histogram, first 10 bins ONLY!!!  Where EXPECTED is 
   the average of the first 10 bins.  (This is intended to detect peaks, 
   not compare to a Poisson distribution.)
4-13 10 histogram bins.  First line shows highest value in the bin
14 11th histogram bin = all values higher than biggest bin.
table of contents

a_vs_range

Calculates the occupancy of up to 14 ranges from a sorted Nmer_list and a Ranges_LIST. First it reads in the ranges, then it processes each type of NMER from the NMER_LIST. If an Nmer falls within a range the count for that range is incremented. The KMIN parameter can be set by the user so that only output lines for NMERS with >= KMIN ranges occupied are printed. The format for a RANGES_LIST is more or less like that for an NMER_LIST except that each range has only two entries (begin,end) of range. Range names can be up to 8 characters long - longer ones are truncated automatically. Do not forget to put an extra space on each line after the name. Note that ranges do not interact with each other, so they can be overlapping or in arbitrary order - they are each scored independently.

   
a_vs_range queries are:   
 Input the name of Sorted Nmer file to process
 Input the name of the Sorted range file to process
 Input the name of the output file
 Please specify the minimum number of ranges which
   must be occupied for an output line.
   The value must be between 0 and   3
   
Example:


cat > dmwhite_3.ranges <<EOD
000001460 part1 
000004460 part1 
000004461 part2 
000008460 part2 
000008461 part3 
000014245 part3 
EOD
a_vs_range <<EOD
dmwhite.sorted
dmwhite_3.ranges
dmwhite_a_vs_range.txt
3
EOD
head -3 dmwhite_a_vs_range.txt
Nmer          Tested   part1   part2   part3
AAGTT             53      13      20      13
AATTT             80      27      15      31

The output consists of
  12 spaces    Nmer name
   8 spaces    The total number of Nmers tested
   8 spaces    The number of Nmers in range 1
   8 spaces    The number of Nmers in range 2, etc.

table of contents

a_vs_range2

This program reads a SORTED NMER_LIST and a sorted range list both in the same format (except that the range list does not have .NMER. and .STEP. fields.) It sees how many of each type of NMER fall into a named range, then calculates the probability of that having happened by chance (Poisson, using expected given target size of range, vs. observed in target.)

Ranges may be made up of multiple pieces, ie a list of exons or whatever. The number of nmers that fall within any named range is calculated, and then from that the probability of having that many vs the expected. Ie, if the full sequence is length L, and the range is length R (even though it may be in many pieces), and N nmers are tested, where M fall in a range. Then the expected number is (R/L)*N and the observed is M. A cumulative Poisson is applied to calculate the probabilities of this.


a_vs_range2 queries are:

Input the name of Sorted Nmer file to process
Input the name of the Sorted range file to process
Input the name of the output file
You may specify a single probability test to exclude
 some output lines.  Only Nmer/Range combinations
 that pass the test will be output
   0   Obs. > Exp., P<= Pmin
   1   Obs. < Exp., P<= Pmin
   2   Obs. > Exp. OR Obs. < Exp., P<= Pmin
Pmin must be in this range:
  0.0 <= Pmin <= 1.0
In order to calculate probabilities the program
  needs to know the total length of the sequence(s)
  from which the Nmers and ranges were drawn.


Example:

cat > dmwhite_3.ranges <<EOD
000001460 part1 
000004460 part1 
000004461 part2 
000008460 part2 
000008461 part3 
000014245 part3 
EOD
a_vs_range2 <<EOD
dmwhite.sorted
dmwhite_3.ranges
dmwhite_a_vs_range2.txt
0
.01
14245
EOD
cat dmwhite_a_vs_range2.txt
Nmer                Range    Test    Hits      Exp   lP:O>E   lP:O<E [lP=-log10(P)]
AATTT               part1      80      27   16.854    2.100    0.003
CGCGT               part1      11       6    2.317    2.012    0.004
GCCCG               part3      24      18    9.747    2.255    0.002
GCCTT               part1      18       9    3.792    2.243    0.002
GTACT               part2      12      10    3.370    3.123    0.000
GTTTG               part2      32      18    8.986    2.622    0.001
TCCGG               part3      26      19   10.559    2.212    0.003
TGGCC               part3      46      29   18.681    2.018    0.004
TTCGA               part2      31      16    8.705    2.085    0.004

      
      Limitations:
        No more than 100 named ranges per run
        No more than 10000 total subranges per run (except
          where they fold down to one, as described next)

      Note, if a series of subranges for a particular name
        are specified so that
            subrange(i-1).end = subrange(i).begin
        then all of these subranges will be compressed down into
        a single subrange.  This can be used with a series of points
        that can go into a_vs_b2, for instance, and still work in
        this program.
table of contents

a_vs_a

This program calculates statistics for the 1,2, and 3 closest neighbors from a Sorted Nmer List.


A_VS_A queries are:
Input the name of the file to process
Input the name of the output file
Also process non Nmer marked locations Y/N [N]
You may specify a minimum allowed distance between sites
 This distance will be NMER + value
   A value between -NMER+1  and 0 allows some overlap
   A value greater than 0 allows no overlap
   A value less than or equal to -NMER is nonsense
   
   
Example:


make_nmer_list -infile=GB_IN:DMWHITE -nmer=5 -out=killme.txt -default
sort -t\n -k0.10 -k0.0,0.9 killme.txt > killme2.txt
a_vs_a <<EOD
killme2.txt
dmwhite_a_vs_a.txt
N
5
EOD
head -7 dmwhite_a_vs_a.txt
Sequence     *N*    Count     Mean Std.Dev.    Sum d     Sum d*d  RMS Mean
AAGTT        *1*       53      126     144.     6696     1943854.      26.
AAGTT        *2*       53      346     361.    18352    13238560.      69.
AAGTT        *3*       53      506     419.    26840    22863394.      90.
AATTT        *1*       69      127     112.     8776     1972120.      20.
AATTT        *2*       69      283     195.    19585     8157917.      41.
AATTT        *3*       69      436     252.    30107    17511569.      61.

This calculates the clustering values for a sorted NMER_LIST. The numbers in the output are left to right:



  Seq      Up to 12 bases of sequence (or tag names)
  *n*      distance to 1,2, or 3 closest Nmer of the same type
   N       number of such distances measured
  Mean     Mean of closest *n* distances for that Nmer
  SD       Standard deviation of this value
  Sum      of *n* distances (raw data from which Mean was calculated)
  Sum**2   ditto
  RMS Mean (Sqrt of Sum**2)/N

There are four kinds of output lines, marked with *N*,*1*,*2*,*3*. The first is a header line, the rest are the *n* distances. If you run HISTO_COLUMN on this file exclude all but the *n* that you are interested in.

table of contents

a_vs_a2

Reads a sorted NMER_LIST and calculates from it a value and a significance for the "incremental" minimum distances observed. The method is essentially this:

   1.  Let there be N points in a length of L.
   2.  For i = 1->N
         A.  d = measure the minimum distance from i to another point
         B.  calculate the probability p(i) of a minimum distance <=d given
             the N-1 points and random placement of the point i.  This 
             is relatively straightforward.  Just break up all the 
             intervals and stack them up like sticks.  Since the 
             probability of being at <= distance is just = distance
             on each stick, the integral is simple.
   3.  P mean = sum i = 1->N of p(i) / N.

   Step 3B should result in a set of p(i) values that are 
   uniformly distributed in the range 0<->1.  Now consider the case of
   dropping a point X into the interval 0<->1 at random, all points
   are equally likely, so the expected value is:

      integral (0->1) of xdx = 1/2.

   The variance of this is just:

      integral (0>1) of (x-1/2)(x-1/2)dx = 1/12

   Since each point "i" is independent, the std. dev. of the mean
   for N points is just sqrt(  1  /  (N*12)  ).  There is a program
   testflat.for that should be in this directory which demonstrates this.
   This provides a handy test for the significance of Pmean, which is
   just (Pmean -1/2)/stdev.

   One complication.  Because the same distribution (more or less) is
   used for each point's substitution each distance is essentially tested
   twice against the same distribution.  This has the effect of making
   the observed Variance sqrt (1/6N) rather than sqrt(1/12N).

The nice thing about this one is that it both calculates a "closeness" number (the average probability of observing each Nmer as close as it is to the others) and a "significance" number (observed - 1/2 / standard-deviation). The Standard deviation is approximated (to about 10%) by sqrt(1/6N), where N is the number of positions tested for each Nmer.

A_VS_A2 queries are:
Input the name of the file to process
Input the name of the output file
Also process non Nmer marked locations Y/N [N]
Please enter the length of the full sequence 
You may specify a minimum allowed distance between sites
 This distance will be NMER + value
   A value between -NMER+1  and 0 allows some overlap
   A value greater than 0 allows no overlap
   A value less than or equal to -NMER is nonsense
   
   
Example:


make_nmer_list -infile=GB_IN:DMWHITE -nmer=5 -out=killme.txt -default
sort -t\n -k0.10 -k0.0,0.9 killme.txt > killme2.txt
a_vs_a2 <<EOD
killme2.txt
dmwhite_a_vs_a2.txt
N
14245
1
EOD
head -3 dmwhite_a_vs_a2.txt
Sequence     N  >Minover   PMean  #variances
AAGTT           53    53 .4152254   1.5117
AATTT           80    70 .5868977   1.7809

The output columns are:

 Sequence    Up to 12 bases of sequence (or tag names)
 N           number of sites present
 >Minover        number of sites tested (some are rejected if they are too close)
 PMean       mean probability of each site being as close as observed to the
              others.
 #variances  |PMeanP -1/2|/standard-deviation
table of contents

a_vs_a3

This does dotplots based on sorted Nmer lists. It takes as input a sorted numer list, and a ranges file (for format see A_VS_RANGE). It outputs .FIGURE files which are dotplots of all nmers that are in one range (SELECT) vs. all other ranges (and itself). That is, for each NMER type, all instances are plotted. There can be up to 20 ranges, which may overlap. The primary advantage of this program over other DOTPLOT programs is that one can generate sorted lists using tools like NFINDPATTERNS and FILTERPATTERNS and then generate dotplots based on that. It can also make a dotplot of very large sequences, which cannot be done with COMPARE. For instance, one could search a sequence for a set of complicated patterns which correspond to some DNA motif, and then DotPlot those. This may be useful in picking out gene structures where significant sequence divergence has occurred throughout, but some fixed set of motifs has been conserved.

The program can be CPU intensive, run it from within a .COM file in a batch queue or with reduced priority. On VMS use:

On Unix use:

To view the output file, use the GCG FIGURE program after first configuring your GCG grapics environment. (SETPLOT,SHOWPLOT,POSTSCRIPT,TEKTRONIX, etc.)

a_vs_a3 queries:
 Input the name of the file to process
 The output files have names: PREFIX_NN_MM.FIGURE,Input the PREFIX now
 Please enter the length of the full sequence 
 Input the name of the Sorted range file to process
 There are   N ranges
   One of these must be the reference range
   Please enter a number between 1 and   N:
 NMER lists may contain entries which begin with
  a space.  Options:
   1  Include these in the processing
   0  Do not include these
 Set the diagonal filter, which suppresses dots within N spaces of each other.
  Options:
   N  Only include points greater than N spaces apart in the sequence
   0  Include even self-self points on the diagonal
 The dotplot may be numbered either from absolute coordinates
  or from 1, where 1 corresponds to the first base of each range.
  Options:
   1  Number each range from 1
   0  (or any number but 1) number with absolute coordinates



Example: 
(Note that the spaces after the names in the ranges file are
mandatory!  In this example the resulting files are sent to
a postscript printer.)

cat > dmwhite_3.ranges <<EOD
000001460 part1 
000004460 part1 
000004461 part2 
000008460 part2 
000008461 part3 
000014245 part3 
EOD
a_vs_a3 <<EOD
dmwhite.sorted
dmwhite
14245  
dmwhite_3.ranges
1
0
0
1
EOD
postscript laserwriter "|lpr -Psomeprinter"
figure dmwhite_part1_part2.figure
figure dmwhite_part1_part3.figure

table of contents

a_vs_a4

Histogram distances between NMERs in a sorted nmer list. The histogram always has 11 bins. The first 10 are of a user specified width, the 11th is for everything larger than that. Output looks like:

Sequence        Count     Chi2:        2        4        6        8       10       12       14       16       18       20 >     20
AAGTT              12     27.0         0        0        3        0        6        0        0        3        0        3        0

Columns are:
1. Nmer sequence
2. Number of Nmers in calculation
3. Chi**2 value over histogram, first 10 bins ONLY!!!  Where EXPECTED is 
   the average of the first 10 bins.  (This is intended to detect peaks, 
   not compare to a Poisson distribution.)
4-13 10 histogram bins.  First line shows maximum value for that bin
14 11th histogram bin = all values higher than biggest bin.


a_vs_a4 queries:
 Input the name of the file to process
 Input the name of the output file
 Also process non Nmer marked locations Y/N [N]
 You may specify a minimum allowed distance between sites
  This distance will be NMER + value
   A value between -NMER+1  and 0 allows some overlap
   A value greater than 0 allows no overlap
   A value less than or equal to -NMER is nonsense
 There are always 10 histogram bins.
   Each bin is WIDTH wide, where WIDTH is greater than
   or equal to 1 
 
Example:

a_vs_a4 <<EOD
dmwhite.sorted
dmwhite_a_vs_a4.txt
N
1
10
EOD
cat dmwhite_a_vs_a4.txt
head -2 dmwhite_a_vs_a4.txt</B>
Sequence        Count     Chi2: 6-    15       25       35       45       55       65       75       85       95      105        >
AAGTT              53     11.4         1        6        1        2        5        1        3        1        2        3     1353



table of contents

a_vs_a5

UNTESTED, UNFINISHED, DO NOT USE YET Examine a sorted nmer list. For each position j=1->min(step size,jmax-window+1) a score is calculated over a window which is the sum of position(i)*sin(i*period + phase), for i=1->Window size. For each nmer the highest score is returned.

table of contents

a_vs_b2

This calculates clustering values for each (A,B) pair of Nmers from a sorted NMER_LIST. It uses essentially the same method as A_vs_A2. Briefly, for each Nmer in A, find the closest Nmer in B, that distance is d. Find the fraction of space in the test sequence that is d or closer, that is p. Average the resulting p values. It can be shown in closed mathematical form that Pmean expected is 1/2, and the standard deviation of the mean is SQRT(1/12*N), N = number in list A. (The number in list B doesn't change this calculation!) If A is 0.0 or you will run out of disk space for Nmer=6. Reason, the output file would be 2048*2048*80 =~ 320 Mbytes! 3.0 (3 standard deviations) is a good place to start. You can also specify a range of MeanP that will be output. Probably this should be 0.0 <-> 0.5, since one isn't generally interested in what is far from what.

The program has two modes. In the first it just compares all A vs all B. This can easily take several hours. In the second mode, you can add a set of known sites (Promoters, enhancers) etc. to the sorted Nmer list. Note that several of each might exist in a sequence, and each type should have the same name. These entries look like:


   9000 Enhancer  
  10005 Enhancer  
  12345 Enhancer  
       ^        ^ Mandatory spaces, DON'T FORGET THE TRAILING SPACE!!!
   ^              Position in the sequence

Be sure to sort the list again, so that these are up at the front or the program will fail mysteriously. Mode 2 asks, "is this Nmer associated with this site, and coversely, is this site associated with this Nmer". The second part will only tell you something if you have several instances of the feature.

When you run A_VS_B2 it will create a .COM file which you should then submit to a batch queue for the actual run. Mode 1 runs typically take only a few minutes, but Mode 2 runs can take many hours. ONLY ONE RUN AT A TIME - THE PROGRAM CREATEs A SCRATCH FILE AND MAY GET CONFUSED OTHERWISE.

The output looks like:


   Sequence A
   Sequence B
   Number USED in A  = N
   Number in B
   Mean P
   significance  (|MeanP-1/2|/sqrt(1/12N)
table of contents

a_vs_b3

This calculates clustering values for each (A,B) pair of Nmers from a sorted NMER_LIST. It uses a form of integration. For each Nmer of type A it constructs either a hat function or a triangle function whose width you specify. It then counts up (weighted) the number of Nmers of type B within that function. User specifies:

   Type of function:  hat or triangle
   Size of function:  an integer, like 1000
   Criteria for a score to be saved:
   Minimum score:     result of hat/triangle function
   Maximum score:     "
   Minium number of pairs scored
table of contents

cores

This is an offshoot of NMER. It allows arbitrary "core" sequences in fixed geometries to be searched for. It is VERY hard to use, be sure that you test your core pattern on a positive control before you run it. To use it, first construct a pattern file which should look like this:

  *A,8,0          <-core "A" size 8, offset 0
  *B,9,10         <-core "B" size 9, offset 10
  -A,1,20         <- inverted A, mismatch allowed = 1, offset 20
  +AB,1,-10       <- direct A or B, mismatch allowed = 1, offset -10
  CG,0,11         <- C or G at offset 11, no allowed mismatches
  AT,1,12         <- A or T at offset 12, one allowed mismatch
  =,1,0           <- Cumulative mismatch over all pattern elements = 1
                     This must be the VERY LAST line, if present.

The above is a form of logical AND. NOT and OR can be accomplished with these commands:

  !,x,y          <- PRECEDES Y lines NOT containing *,=,!,|
                     and means "exactly X of the following Y conditions
                     are false", unless X = 0, where it means "ALL are 
                     false."

  |,x,y           <- PRECEDES Y lines NOT containing *,=,!,|
                     and means "exactly X of these Y are true".  The error
                     reported towards cumulative error is the smallest
                     one found in the group.  If X is zero, then 
                     it means "one or more of these Y are true."

Note that the meaning of "Cumulative mismatch" is DIFFERENT for a region of AND logic and a region of OR logic. For AND each statement adds to the cumulative mismatch. However, for each SERIES of OR statements (the block folling a | statement) the cumulative mismatch is only increased by the minimum mismatch found. That is, if there are 10 statements in an OR block and all need match, and all do, and the first one has 2 mismatches and the other 9 have 3 each, then cumulative mismatch only increases by 2.

Note that these are equivalent:

  !,1,0
  ACG,0,10

    and

  !,3,0
  A,0,10
  C,0,10
  G,0,10

These are traversed top to bottom. There can be extra spaces in the file, but blank lines and comments (like above) may not be present. All cores must be declared at the beginning of the file in the order A,B,C...->Z. The offsets are all with respect from the first base of the first core.

The CORES program reads this file and then applies it to a sequence or sequences. In the example given it would first check for an inverted "A" core at +20. If >1 mismatch it would move on to the next position. If <=1 it would try the next pattern element, the direct "B" core at offset -10. If that condition passed it would check for a C or G at 11, and then A or T at 12 and lastly, check to see if the cumulative mismatch score was in bounds. If *ALL* conditions have been satisfied it stores the sequence of the two cores like: AAAAAAAABBBBBBBBB.

 -Width   This command line qualifier makes sure that there is enough room
   to store the cores and their counts.  /Width must be a multiple of 4,
   and it must be greater than or equal to 4 + sum_cores_length + 1.  For the
   example above this is 4 + 8 + 9 + 1 = 22, so /Width = 24.  Width defaults 
   to 12 (wide enough for a single 7mer core).

 -mismatch   Sets the cumulative mismatch limit.  The value in the pattern
   file, if present, takes precedence.  Defaults to 0.

 -cutoff     Sets a lower bound on the number of times a core must appear
   to be written out.  Defaults to 1.
table of contents

col_to_gcg

This can take the output of the A_VS_A or any other program which emits sequence data in column format and converts it to a GCG sites file. You can then use that sites file with Mapplot to graphically show where the clustered sites are. This program can conditionally include some patterns and exclude others, based on the values in other columns. Columns must be in fixed positions, not tab separated.

col_to_gcg queries are:

Input the name of the file to process
Input the name of the output file
Enter as many lines of comments as you would like
   End each line with a <return>
   End the last line with <return><return>
Please enter the locations of the columns and
 their widths in this format:
 
   Begin, Width      (0,0 terminates entry)
 
  The first entry MUST be the SEQUENCE location
Enter up to  100 strings, which if found anywhere
 on a line cause that line to be ignored.
 These will be CASE SENSITIVE
 A blank line terminates entry of these strings
 If you enter no conditions all lines are processed that
  are not "ignored".  You must supply a condition, column
  number, and value.  All tests are on INTEGERS!!!!!
  
  The conditions that you may apply are
  
  |Code|    Logic
  1       equal to
  2       not equal to
  3       less than
  4       less than or equal to
  5       greater than
  6       greater than or equal to
Specify which of the numbered columns to place in the
  description field (after the !) in the output file.
  The columns that you specify will be separated by
  spaces and concatenated in the order in which you
  name them here. Total width must be <86 characters.
  Enter one column number per line (0 terminates entry):
  
  
Example:

(Note: first run the a_vs_a2 example to produce dmwhite_a_vs_a2.txt)
which looks like this:

Sequence     N  >Minover   PMean  #variances
AAGTT           53    53 .4152254   1.5117
AATTT           80    70 .5868977   1.7809
etc.
12345678901234567890123456789012345678901234567890
with columns:
1   1,5
2  14,5
3  20,5
4  27,7    (because only integers may be tested)
5  35,8    (this can't be tested, it's a float)
)


col_to_gcg << EOD
dmwhite_a_vs_a2.txt
dmwhite_gcg_enzyme.txt
This is comment one
This is comment two
This is the last comment

1,5
27,7
0,0
>

4,2,4000000
0,0,0
1
2
0
EOD
head -6 dmwhite_gcg_enzyme.txt
  This is comment one
  This is comment two
  This is the last comment
 NAME          0 SEQUENCE   0 ! COMMENTS ..
 AGGCT         0 AGGCT      0 !  AGGCT 3162714                                                  
 AGGGT         0 AGGGT      0 !  AGGGT 3142303

The amount of white space in dmwhite_gcg_enzyme.txt was reduced so that the important pieces could be more easily seen.

table of contents

complement_nmer

This program converts all nmers in a fixed position size (ie, start at column 1, size 6) to their complements.

Prompts:
 Input the name of the file to process
 Input the name of the output file
 Beginning column for the Nmer in each record
 Length of the Nmer in each record
 
Example:
cat >killme.txt <<EOD
   AAGC  12
   AGCT  1
   ATTT  2
   TTCT  4
EOD
complement_nmer <<EOD >/dev/null
killme.txt
killme2.txt
4
4
EOD
cat killme2.txt
   AAGC  12
   AGCT  1
   AAAT  2
   AGAA  4
table of contents

dna_run_stats

This program attempts to determine the lengths and frequencies of regions that are repeated 1,2,3 or more times in a single sequence. It outputs a map of the repeat usage for each sequence, and also a summary of repeat frequency and size over all sequences. The algorithm paints sequences with the number of times each strand degenerate Nmer at that position is found anywhere in the sequence. Summarizes the resulting run lengths. This gives the distribution of repeat lengths and also the gaps between unique sequence segments. (WARNING! You must have enough disk quota to hold 2*Seq.Length*Nmer bytes.) This program can take a long time, since it needs to sort files containing one Nmer per base in potentially very large files. Useful for figuring out how many SPLITs and CONTIGs a bunch of sequences would have if they had been assembled solely by shotgun methods (and assuming perfect reads on each sequence.)

  Syntax: $ DNA_RUN_STATS [-INfile=]Primate:* -NMER=-Default

  Required Parameters:

  -NMER={number}                  Length of the Nmers to analyze (typically 50)

  Local Data Files: None
  [-OUTfile=]                     output file name

  Optional Parameters:

  -BEGin                          Starting base
  -END                            Ending base
  -SPLit=400                      Repeat size that splits a contig, default shown
  -MAXNruns=400                   Runs of N larger than this abort entry 
  -FULL                           Output all runs to file
  -SUMMARY                        Summarize files and results to screen

In the output, a SPLIT entry represents a repeated (or maybe N run) that is >= the size of a SPLIT. Contigs listed are those around the splits. !N entries take the opposite extreme and treat N's as valid sequence, so they cannot contribute to a split. Some Genbank entries have long runs of N, and those are eliminated by the default /MAXNruns value, if you want to count them as splits, set /MAXNruns higher than the largest N run you expect to find.

The output file looks like this (see report_usage for an explanation of the column's contents):

  Run length report, total bases:      4622, sequence: GB_PR:HSATP1AX2
    Begin      End   Length     Copy     Type      Gap
        1      357      357        1        0        0
      358      408       51        0        0        0
      409     1191      783        1        0       51
     1192     1195        4        2      538        0
     1196     1273       78        1        0        4
     1274     1275        2        2     2525        0
     1276     1323       48        1        0        2
     1324     1333       10        2     1595        0
     1334     1366       33        1        0       10
     1367     1370        4        2      538        0
     1371     1429       59        1        0        4
     1430     1431        2        2     2525        0
     

Where:

   copy = number of times that Nmer was observed (0 = it consists
       of N's, 1 is unique, 2 is twice, etc.)
   type = assigned type for the Nmer.  This is set to 0 for unique
       sequences, but not for repeats.  So that in the above example
       you can see that at 1274 and 1430 a repeat of length 2 with
       type 2525 occurs.  (This means that the two 50 mers at 1274 and
       1430 are the same, and those at 1275 and 1431 are the same.)
   Gap = space between unique sequences, composed of any combination of
       repeated sequences (or Ns)

At the bottom of the file there is a summary which looks like (this data
is for example only!!!):

    Using Split size:               400
   Repeat Coverage Segments     Avg.  Longest    Where
        0    61043      953       64      595     8648 in GB_PAT:HSFACTIXG
        1 11533321     4327     2665    56953    37365 in GB_PR:HUMTCRADCV
        2    75203     1622       46     1167     7013 in GB_PR:HSU07978
        3     6577      408       16      140     2150 in GB_PR:HUMGHCSA

      >=2   149719     2466       60     1543   123912 in GB_PR:HUMRETBLAS
   splits    28220       38      742     1543   123912 in GB_PR:HUMRETBLAS
!N splits    10000        8      541     1543   123912 in GB_PR:HUMRETBLAS
  contigs 11655662     1951     5974   198285        1 in GB_PR:HSEVMHC
!Ncontigs 11673882     1981     6002   198285        1 in GB_PR:HSEVMHC

Splits are gaps greater than the split limit, here 400. The !N entries treat N as unique sequence, rather than as part of a gap. The >=2 entry is the sum of all gaps containing repeated sequence, whether that is repeated once or 8 times.

Limitations: Nmers repeated more than 58 times cannot be handled properly, so use large Nmers to avoid this. If it is a problem, modify the SHORTMAX variable in the program.

table of contents

filter_nmerlist

This program can be used to filter sorted Nmer lists. It restricts overlap or proximity each type of Nmer. For instance, if the Nmers are 6mers, a value of 0 allows them to be adjacent, but not overlapping, a value of -5 lets them overlap except for one base, and a value of 5 means that there must be at least 5 bases between nearest instances. The filter will remove instances from the list that fail this rule. For instance, for a 6mer, value 0, if the positions were originally:

the resulting list will be paired down to:

filter_nmerlist can also remove ranges of positions - useful for taking at exons for instance. Specify positions like "lower,upper" and terminate the list with "0,0". The order of the range doesn't matter. This is handy when working from CDS in Genbank entries that can be on either strand.


filter_nmerlist queries are:

Input the name of the file to process
Input the name of the output file
You may specify a minimum allowed distance between sites
 This distance will be NMER + value
 A value between -NMER+1  and -1 allows some overlap
 A value of zero allows the sites to be adjacent
 A value greater than 0 allows no overlap
 A value less than or equal to -NMER is nonsense
You may specify up to 64000 ranges within which Nmers
 will not be allowed.  The format is:
   lower, upper 
 Order doesn't matter: 1,10 and 10,1 both filter out 1<->10
 Terminate the list with "0,0"


Example:

cat >test_filter <<EOD
        6  .NMER. 
  1048576  .STEP. 
        1AAAAAA 
        2AAAAAA 
        3AAAAAA 
        4AAAAAA 
        5AAAAAA 
        6AAAAAA 
        7AAAAAA 
        8AAAAAA 
       20AAAAAA 
       30AAAAAA 
EOD
filter_nmerlist <<EOD
test_filter
test_filter_out.txt
0
9,20
0,0
EOD
cat test_filter_out.txt
        6  .NMER. 
  1048576  .STEP. 
        1AAAAAA 
        7AAAAAA 
       30AAAAAA 


table of contents

filter_nmerlist2

This program can be used to filter sorted Nmer lists for entries that satisfy conditions based on their relative positions. For instance, it can be used to restrict the list based on the spacing between N1,N2,N3 - which are the positions of three consecutive Nmers of the same type such that:

  10 < N2 - N1 & lt;20    and
  10 < N3 - N2 & lt;20    and
  22 < N3 - N1 & lt;32

To specify the above enter the following at the prompt:

  1,2,10,20
  2,3,10,20
  1,3,22,32
  0,0,0,0
  
  
filter_nmerlist2 queries are:

Input the name of the file to process
Input the name of the output file
You may specify up to   10 distance criteria
 For each you must specify four values as
   A,B,MIN,MAX 
 A    =   criteria position of one Nmer
 B    =   criteria position of the other Nmer (A < B)
 MIN  =   minimum distance between them (>=0)
 MAX  =   maximum distance between them (<=2000000000)
 The MIN,MAX distances are measured between the
 left edges, ie, min=1 would allow them to overlap
 except at the leftmost base.
 Let N1,N2,N3 be the positions of three consecutive
 NMERs of the same type.
 Example criteria which means 10<(N2-N1)<20, 30<(N3-N2)<35
  1,2,10,20
  2,3,30,35
 Example criteria which means same plus 40<(N3-N1)<50
  1,2,10,20
  2,3,30,35
  1,3,40,50
 To terminate list of criteria enter:
  0,0,0,0

Example:
  
cat >test_filter <<EOD
        6  .NMER. 
  1048576  .STEP. 
       10AAAAAA 
       20AAAAAA 
       30AAAAAA 
       40AAAAAA 
      104AAAAAA 
      210AAAAAA 
      220AAAAAA 
      240AAAAAA 
EOD
filter_nmerlist2 <<EOD
test_filter
test_filter_out.txt
1,2,10,20
2,3,10,20
1,3,15,25
0,0,0,0
EOD
cat test_filter_out.txt
        6  .NMER. 
  1048576  .STEP. 
       10AAAAAA 
       20AAAAAA 
       30AAAAAA 


  

Note that any Nmer that satisfies these criteria will be written out, including cases where for instance, two or three overlapping N2's exist. To reduce these to just one (if that is desired) run the output of filter_nmerlist2 through filter_nmerlist. table of contents


filter_content

Many pairwise tests primarily pick up rotamer, conjugate, simple, and AT rich pairs. This program can be used to filter these Nmers out of output files, if said files have records of the form:

It sorts the records into two files (KEEP/REJECT) based on whether the two Nmers are:

  1.  Rotamers
  2.  Conjugates
  3.  100% AT
  4.  Simple (repeats of 1,2, or 3 bases)
  5.  Apply conditions 1-4 above except at N bases

In each case, "1" is the default to enable that text, "0" disables it. Tests 3 and 4 can be applied to the first, second, both, or either Nmers, with the latter being the default.


filer_content queries are:

Input the name of the file to process
Input the name of the file to receive the records kept
Input the name of the file to receive the records rejected
1  Reject rotamers 
  0  Keep rotamers
1  Reject conjugate Nmers 
 0  Keep conjugate Nmers
Action for "simple" Nmers
   0  Keep all
   1  Reject if at either position
   2  Reject if at both positions
   3  Reject if at first position
   4  Reject if at second position
Action for 100% A/T Nmers
   0  Keep all
   1  Reject if at either position
   2  Reject if at both positions
   3  Reject if at first position
   4  Reject if at second position
 Enter N for the following:
   Keep if satisfy above except at N or fewer bases
    (N=0   All lines kept)
    (N=1   Lines with up to 1 mismatch kept)



Example:
 
cat >test_content <<EOD
GCTGCT  CTGCTG   a rotamer pair
GCTGCT  CTGCTA   a rotamer pair with 1 mismatch
GCTGCT  AGCAGC   a complemantary pair
GCTGCT  AGCAGC   a complemantary pair with 1 mismatch
AAAAAA  GCTGCT   first sequence is 100% AT 
AGCAGC  ATATAT   second sequence is 100% AT
GGGGGG  ACGTAT   first sequence is simple
GTGTGT  ACGTAT   first sequence is simple
GCTGCT  ACGTAT   first sequence is simple
ACGAGG  CCGTAT   not rotamer, complement, or simple
EOD
filter_content <<EOD
test_content
test_content_keep.txt
test_content_reject.txt
1
1
1
1
1
EOD
cat test_content_keep.txt
ACGAGG  CCGTAT   not rotamer, complement, or simple
cat  test_content_reject.txt
GCTGCT  CTGCTG   a rotamer pair
GCTGCT  CTGCTA   a rotamer pair with 1 mismatch
GCTGCT  AGCAGC   a complemantary pair
GCTGCT  AGCAGC   a complemantary pair with 1 mismatch
AAAAAA  GCTGCT   first sequence is 100% AT 
AGCAGC  ATATAT   second sequence is 100% AT
GGGGGG  ACGTAT   first sequence is simple
GTGTGT  ACGTAT   first sequence is simple
GCTGCT  ACGTAT   first sequence is simple

 

That is, N is applied after the other tests, it means "reject if the test holds with up to this many mismatches." Lines that don't fit the "nmer nmer stuff" pattern go into the kept file. One trick, if you only want the rejects or the kept, direct the other file to "NLA0:", which is the null device. This eliminates the need to create a file that you don't want and then have to delete it later.

There is essentially no error checking in this program - typos will cause it to crash.

table of contents

find_close

Searches through a sorted file for numeric values which are closer than "k" apart. When such a pair is found the second such line is emitted. The first line is not emitted (unless it satisfies this condition for another pair.)

Prompts:

Input the name of the file to process
Input the name of the output file  
Numeric fields begin in which column of the record,
 must be greater than or equal to 1.
Length of the Numeric field in each record
Maximum difference between consecutive lines
 which is considered a match (>= 0)
Begin processing lines at which file record?
  Example: 1 = first record of file


Example:


cat >killme.txt <<EOD
  10001 A
  10004 B
  10101 C
  10104 D
  10201 E
  10304 F
EOD

find_close <<EOD
killme.txt
killme2.txt
1
7
5
1
EOD
 matches found:         4
 lines processed:       8

cat killme2.txt
  10002 A2
  10003 A3
  10004 B
  10104 D
table of contents

from_binary_fosn

Converts specific entries from a binary FOSN to a regular list file (FOSN). This program is designed to be run in batch mode - it does no error checking. Typically it would be run following some processing on a binary NMER file, which would result in a series of sequences, specified by their positions in the original list file, to retrieve. See MAKE_BINARY_FOSN and MAKE_NMER_LIST_BINARY

Inputs are:

  1.  The name of the binary fosn
  2.  The name of the file containing a list of records to retrieve (
      terminates with a 0 (zero) or an end of file).
  3.  The name of the output file.

Example:
  $ create test.order
  1
  2
  10
  ^Z
  $ run clusters:from_binary_fosn
  human.lis_bin
  test.order
  test.out
  $ type test.out
           ..
  gb_est:whatever_for_1
  gb_est:whatever_for_2
  gb_est:whatever_for_10
table of contents

histo_column

Run this after SUM_NMER_LIST, A_VS_A, or any other program that emits columns of numbers.. It generates a histogram from a column of numbers, optionally skipping certain lines and binning buckets together. In the following example a histogram with bins of width 10 is generated based on the 4mer composition of a sequence.

Prompts:
 Input the name of the file to process
 Input the name of the output file
 The column's starting position, in characters:
 The column's width, in characters:
 Bin size (>0!!):
 Enter up to 5 strings, which if found anywhere
  on a line cause that line to be ignored.
  These will be CASE SENSITIVE
  A blank line terminates entry of these strings
 String = 

Example:
nmer -infile=gb_in:dmwhite -nmer=4 -markov=2 -out=killme.txt -cutoff=1 -default

head killme.txt
 Expected generated by:  Markov 2/1            
 Direction analyzed: Forward
 Required direct repeats of Nmer core: 1
 Only observed >= 1 AND Prob <= 1.00999999 shown
 Nmer                    Obs          Exp  P (tail)
AAAA                     231      175.632  0.00003737
AAAC                     108      100.697  0.24606484
AAAG                      97       90.417  0.25781465
AAAT                     143      132.281  0.18627548
AACA                      86       88.200  0.43496028

histo_column <<EOD
killme.txt
killme2.txt
24
5
10
e

EOD


cat killme2.txt
        0,        0
       10,        0
       20,       16
       30,       51
       40,       66
       50,       45
       60,       33
       70,       11
       80,       13
       90,        5
      100,        4
      110,        4
      120,        4
      130,        0
      140,        1
      150,        0
      160,        1
      170,        0
      180,        0
      190,        0
      200,        0
      210,        0
      220,        0
      230,        1
      240,        0
      250,        1


table of contents

histo_column_binary

This should be run on sorted binary data. It histograms the number of times runs of given lengths are found in the data. Optionally, it will output those that occur more than CUTOFF times (after first converting them to DNA sequence.) For instance, if run on a sorted BINARY NMER list this can tell you that there are 120 Nmers that appear 10 times.

table of contents

least_nmer

This program converts all nmers in a fixed position size (ie, start at column 1, size 6) to their complements, then compares the two. Only if the original NMER is lexically less than or equal to its conjugate it is sent to the output file. The purpose of the exercise it to strip out redundant lines for those analyses which give the same score for AAAAT and ATTTT. This is more or less the inverse of NO_HOLES.

Prompts:
Input the name of the file to process
Input the name of the output file
Beginning column for the Nmer in each record
Length of the Nmer in each record

Example:
cat >killme.txt <<EOD
   AAGC  12
   AGCT  1
   ATTT  2
   TTCT  4
EOD
least_nmer <<EOD >/dev/null
killme.txt
killme2.txt
4
4
EOD
cat killme2.txt
   AAGC  12
   AGCT  1

table of contents

make_binary_fosn

Converts a list file (FOSN) to a binary list file. The latter consists of 20 byte fixed length records. The binary FOSN can be rapidly random accessed to retrieve GCG database entry names given the order number of an NMER (see MAKE_NMER_LIST_BINARY). Example:

table of contents

make_nmer_list

Converts one or more GCG formatted file into an NMER list. An NMER list just shows the location of each Nmer. Warning, the resulting file will be roughly 12X larger than the original sequence, so make sure that you have enough diskspace.

Example:
make_nmer_list -infile=genbank:dmwhite -nmer=6 -out=killme.txt -default
head -5 killme.txt
        6  .NMER. 
  1048576  .STEP. 
        1GAATTC 
        2AGAATT
rm killme.txt
make_nmer_list -infile=genbank:dmwhite -nmer=6 -out=killme.txt -feat <<EOD


fred
10
jane
100

EOD
head -5 killme.txt
        6  .NMER. 
  1048576  .STEP. 
       10 FRED 
      100 JANE 
        1GAATTC 
table of contents

make_nmer_list_binary

Makes a binary NMER list. Each 8 byte record consists of a 4 byte integer encoding up to a 16 mer, followed by a 4 byte integer listing the order number of the sequence that that NMER came from. No other positional information is present. Warning, the resulting file will be (8 X the number of bases in the original sequence) bytes long.

table of contents

make_nmer_profile

This calculates a binary Nmer profile. This file contains the number of times each Nmer is observed at a given position in a sequence. Then the NMER profile can be run against a sequence using TEST_AGAINST_NMER_PROFILE.

Prompts:
INPUT filename.  A sorted NMER list
LENGTH length of the longest sequence which went into the profile
TYPE   1=nucleotide, 2=peptide

Restrictions on profiles are:
  which        largest Nmer   longest sequence
  nucleotide   6              200
  peptide      3              100
table of contents

make_ranges

This program converts triplets of (name,begin,end), one per line into range/subranges for use with a_vs_range2 and also a_vs_b2 and such.

Prompts:
INPUT filename.  Contains a series of records like:
     name, begin, end
            (if begin > end) then the values are swapped)
      _OFFSET,val,0
            add val to all subsequent positions
     _STEP,val,0
            like offset
     _FILL,val,0
           Fill all ranges with enough subranges so that
           the distance between the ends of each subrange
           are <= VAL (ie, end - begin + 1 /N <= val)
Example:

cat <<EOD >range_input.txt
jane,100,200
fred,1,10
_OFFSET,1000,0
ethel,1,10
_FILL,20,0
windows,300,400
EOD

make_ranges <<EOD  >/dev/null
range_input.txt
range_output.txt
EOD

cat range_output.txt
      100 jane 
      200 jane 
        1 fred 
       10 fred 
     1001 ethel 
     1010 ethel 
     1300 windows 
     1320 windows 
     1320 windows 
     1340 windows 
     1340 windows 
     1361 windows 
     1361 windows 
     1381 windows 
     1381 windows 
     1400 windows

The output file is both a RANGES list and an NMER list. It does NOT contain the .step. and .nmer. fields since it is intended to be appended to an existing list (which already has that) and sorted.

Since some fields may be out of order, this should be sorted before being appended to an nmer list. This is because nmer lists are usually only sorted on the names because the positions are in order going in - that isn't the case here - see how "jane" is out of order with respect to "fred". Use a command like:

  on Unix:
   % sort range_output.txt > range_output_inorder.txt
  on VMS:
   $ sort/key=(pos:10,size:30)/key=(pos:1,size:9)/stable -
     range_output.txt ange_output_inorder.txt
table of contents

make_random_dna

Make a piece of random dna based on some sample. Output can be of any size up to the GCG limit. It can also accumulate statistics over a collection of DNA using the -DUMP=dna.stat switch, and then read that in again using -INStat=dna.stat. The statistic dump file is not ver large, and saves having to read through the collection again.

Examples:

# generate 50 bp with same statistics as dmwhite - directly
make_random_dna -infile=genbank:dmwhite -out=rand.seq -length=50 -default
cat rand.seq
       Length: 50  September 17, 2001 15:23  Type: N  Check: 1969  ..

       1  AGGAGTATGA TGACAAATTA TCCGAAACAG CTAACGTTTC GTCAAGGTTC 

# generate 50 bp with same statistics as dmwhite - through a statistics file

make_random_dna -infile=genbank:dmwhite -dump=dmwhite.stat -default

head -5 dmwhite.stat
Transition table generated by Make_Random_DNA
3Mer->      G         A         T         C  ..
GGG   0.29798   0.23737   0.20707   0.25758
GGA   0.27717   0.34783   0.18478   0.19022
GGT   0.39706   0.19118   0.25735   0.15441

make_random_dna -instat=dmwhite.stat -out=rand.seq -length=50 -default

cat rand.seq
      Length: 50  September 17, 2001 15:25  Type: N  Check: 9393  ..

       1  ACACTAAAAC CGGTCAAATT GCAATCTGTA TGCCCAGGGC CCCCAAGGGG 

table of contents

markovtest

This calculates expected and prob. of observed/expected from a set of subsequences (<= 19 AA or BP long, 50000 lines limit) against a set of sequences. It uses Markov chain methods to predict the expected number of each subsequence in the set of sequences. It is similar to NMER except that NMER generates this calculation for all tuples of a given size, to check a single long tuple (for instance, a 16mer) use MARKOVTEST instead. The format for the subsequences is to put them into a file like this:

!comment, this is ignored
ACGTCGCT,17
TCTCTCT,12   !comment, this is ignored

No leading or internal spaces in the subsequence,number part!!! The number shown is the number of "observed" for this subsequence. This must be known ahead of time. If it isn't known, then it defaults to zero, so expected may be calculated, but Prob will be for zero observed versus expected.

Examples i

cat <<EOD >dmwhite.patterns
TTTCATTGTC,5
CAAATTGTCT,23
AACTGCGACC,0
GCGGCG,1
GCGGCG,2
GCGGCG,3
GCGGCG,4
GCGGCG,5
GCGGCG,6
GCGGCG,7
GCGGCG,8
GCGGCG,9
GCGGCG,10
GCGGCG,11
GCGGCG,12
GCGGCG,13
GCGGCG,14
GCGGCG,15
TGTTGT,30
TGTTT,100
TGTT,100
TGT,100
EOD

markovtest -infile=gb_in:dmwhite -data=dmwhite.patterns -cutoff=0 \
    -markov=2 -outfile=dmwhite.markovtest.txt -default
    
cat dmwhite.markovtest.txt
 Expected generated by:  Markov 2/1            
 Direction analyzed: Forward
 Only observed >= 0 shown
 Nmer                    Obs          Exp  P (tail)
TTTCATTGTC                 5        0.032  0.00000000
CAAATTGTCT                23        0.031  0.00000000
AACTGCGACC                 0        0.011  0.98928922
GCGGCG                     1        2.064  0.38888121
GCGGCG                     2        2.064  0.65927798
GCGGCG                     3        2.064  0.34072202
GCGGCG                     4        2.064  0.15466189
GCGGCG                     5        2.064  0.05864090
GCGGCG                     6        2.064  0.01899767
GCGGCG                     7        2.064  0.00535840
GCGGCG                     8        2.064  0.00133616
GCGGCG                     9        2.064  0.00029832
GCGGCG                    10        2.064  0.00006026
GCGGCG                    11        2.064  0.00001109
GCGGCG                    12        2.064  0.00000191
GCGGCG                    13        2.064  0.00000030
GCGGCG                    14        2.064  0.00000006
GCGGCG                    15        2.064  0.00000000
TGTTGT                    30        3.965  0.00000000
TGTTT                    100       27.232  0.00000000
TGTT                     100       75.823  0.00450516
TGT                      100      211.116  0.00000000

#
# to pick up statistics over many files, use something like this
#

markovtest '-infile=gb_in:dm*' -data=dmwhite.patterns -cutoff=0 \
    -markov=2 -outfile=dmwhite.markovtest.txt -default

Markovtest should be run in batch mode if infile is a big chunk of a database!!!

table of contents

nmer

This is like the GCG program COMPOSITION except that it allows N in the range 1 to 10. It will work on multiple files using any of the GCG conventions (ie, GB:* or @mylist.fil). It tests the significance of observed vs. expected (from the composition or Markov chain method) by a single tailed summation of the Poisson distribution assuming that the calculated expected is the mean. So, if expected = 100 and observed is 150 it is the terms from 150->infinity, if observed = 50 it is the terms from 0->50. Warning: if obs/expected are >20000 or so the probability calculation may crash the program, so don't run Nmer <=3 on huge sequences! Nmer also outputs as the last line, the chi**2 value for all obs,expected pairs, with observerd >= cutoff. This can be used to test the goodness of fit of the Markov chain model used versus the actual data.

Use -check to see the various command line options. -DNAfast makes a huge difference if you can use it. NMER will also work on proteins.

Example:

nmer -infile=gb_in:dmwhite -nmer=6 -markov=3 -out=dmwhite.nmer6.txt -cutoff=15 -default

head -10 dmwhite.nmer6.txt
 Expected generated by:  Markov 3/2            
 Direction analyzed: Forward
 Required direct repeats of Nmer core: 1
 Only observed >= 15 AND Prob <= 1.00999999 shown
 Nmer                    Obs          Exp  P (tail)
AAAAAA                    39       39.414  0.51605129
AAAAAC                    15       18.244  0.26790133
AAAAAT                    26       22.736  0.27331400
AAAACT                    18       12.145  0.06870717
AAAAGT                    16       11.529  0.12352026

nmer -infile=gb_in:dmwhite -nmer=6 -markov=3 -out=dmwhite.nmer6.txt -repeat=2 -cutoff=2 -default

cat dmwhite.nmer6.txt
 Expected generated by:  Markov 3/2            
 Direction analyzed: Forward
 Required direct repeats of Nmer core: 2
 Only observed >= 2 AND Prob <= 1.00999999 shown
 Nmer                    Obs          Exp  P (tail)
AAAAAA                     4        0.183  0.00004005
ACACAC                     4        0.003  0.00000000
ATATAT                     9        0.009  0.00000000
CACACA                     5        0.004  0.00000000
CTCTCT                     3        0.002  0.00000000
TATATA                     8        0.007  0.00000000
TCTCTC                     3        0.002  0.00000000
TTTTTT                     8        0.232  0.00000000
>chi**2: 39417.425781  entries:8
table of contents

nfindpatterns

This is a modified version of findpatterns that outputs a set of hits in a format that can be read by the CLUSTERS programs. This provides a simple way of generating lists of positions of degenerate patterns, for instance, one could find all locations of every entry in the TFD and then look for clustering. The output, like that from make_nmer_list, must be sorted before use (see SORT). It has a few special switches:

   -Offset    Add this to all positions
   -Step      Add this to the offset for each set of hits in a 
              multisequence analysis.  (This allows some programs
              to analyze one sequence at a time, and yet do statistics
              over all.)  Step must be bigger than the biggest sequence
              that you analyze.
   -NMER      Some CLUSTERS programs can filter for overlaps, but they
              need to know how big a pattern is, and all patterns must
              be the same size.  You can use this switch to tell the
              machine what size to use for all patterns
   -PREFIX    Specify a string.  If the patterns are entered on the command
              line, or they are not given names in a data file, then they
              will be named " Pat1"," Pat2", etc. by default.  Prefix lets you
              change " Pat" to something else.  In particular, you 
              may or may not want to have the leading " ", which is used
              by a_vs_b2 to distinguish between special sites and regular
              sites.

Example:

nfindpatterns -infile=gb_in:dmwhite \
 -pattern=ATATATATATAT,CTCTCTCTCTCTCT \
 -out=dmwhite.nfindpatterns.txt -default
 
cat dmwhite.nfindpatterns.txt
        6 .NMER. 
   262144 .STEP. 
     3184Pat1 
     5728Pat1 
     5730Pat1 
     5732Pat1 
     5734Pat1 
     5736Pat1 
     5738Pat1 
     5740Pat1 
     5742Pat1 
    14066Pat2 
    14068Pat2 


Where Pat1 is ATATATATATAT, Pat2 is CTCTCTCTCTCTCT.

If there are more than just a couple of patterns to
check put them in a patterns file and load them with
-data=file.patterns.txt.  The advantage of doing this
is that it also allows more meaningful names to be used
for the matches (compared to Pat1, Pat2, etc.) Example:

cat >killme.pattern <<EOD
blah blah
Name    Offset  Pattern             Overhang  Documentation  ..
AT_REP  0       ATATATATATAT        0 !
CT_REP  0       CTCTCTCTCTCTCT      0 !
EOD

nfindpatterns -infile=gb_in:dmwhite \
 -data=killme.pattern \
 -out=dmwhite.nfindpatterns.txt -default
 cat dmwhite.nfindpatterns.txt
        6 .NMER. 
   262144 .STEP. 
     3184AT_REP 
     5728AT_REP 
     5730AT_REP 
     5732AT_REP 
     5734AT_REP 
     5736AT_REP 
     5738AT_REP 
     5740AT_REP 
     5742AT_REP 
    14066CT_REP 
    14068CT_REP 

table of contents

no_holes

Fills in missing records in a sorted NMER list. Many programs, CORES, for instance, only output NMERs that match certain criteria, not the full list of all possible NMERs. While this is fine for many purposes, it does make it quite difficult to compare the results of runs which happened to find different sets of NMERS. NO_HOLES can be used to postprocess a partial sorted NMER list into a complete sorted NMER list. The NMERs may begin in any column abd can be any size. There may be leading and trailing non-NMER records in the file. NMERs are converted to upper case before being compared, and that can result in mixed cases (see below)

Example:

cat >testseq.dat <<EOD
1
2
3
4
5
   ac   2
   ag   3
   AT   4
   CC   2
   CG   3
   CT   4
   GA   1
   GC   2
   GG   3
   GT   4
   TA   1
   TC   2
trailing line 1
trailing line 2
trailing line 3
EOD

no_holes <<EOD >/dev/null
testseq.dat
newtestseq.dat
4
2
6
     NEW
___
EOD

cat newtestseq.dat
1
2
3
4
5
___AA     NEW
___ac   2
___ag   3
___AT   4
___CA     NEW
___CC   2
___CG   3
___CT   4
___GA   1
___GC   2
___GG   3
___GT   4
___TA   1
___TC   2
___TG     NEW
___TT     NEW
trailing line 1
trailing line 2
trailing line 3
table of contents

occupation

This is a modified version of findpatterns that measures the area covered by the patterns searched for. That is, it tells what fraction of the DNA or Protein is comprised of the patterns entered. Optionally, it can plot this density, or the locations of these sites. To see all command line options run the program with -check. Some switches of note:

   -WINdow    Specifies a window size for plot
   -STEP      Bases/amino acids to shift between windows
   -DOCount   If set, then the plot shows the density of sites instead
              of their coverage.
              
Example:

# set up graphics
xwindows color

occupation -infile=gb_in:dmwhite \
 -out=dmwhite.atoccup.txt\
 -pattern=at -window=100 -default
 
 
# the plot comes up on the graphics device

cat dmwhite.atoccup.txt
! run on: September 17, 2001 16:04
!
! Patterns searched:
!        1 at
!Results follow:  
!               Name    Cover     Length   Fraction
             DMWHITE      2136     14245 0.14994735

 

table of contents

report_usage

This reads a sorted Nmer list (that *MUST* have been derived from a single sequence). Each Nmer must be large enough (at least size 20 for small sequences, 30 or more for larger ones) so that it would normally occur only once by chance within the sequence.It reports how often each Nmer appears in the sequence (once, twice, etc.), how much sequence each covers (sort of), how big the longest run is and where it is. This program indicates retrospectively how difficult a region would have been to assemble via shotgun methods. TEST_SHOTGUN is a related program.

prompts:

 Input the name of Sorted Nmer file to process
 Input the name of the output file
 Input the minimum length of nonunique
   sequence that would split a contig in two.
   
Example:

make_nmer_list -infile=GB_IN:DMU31961 -nmer=30 -out=killme.txt -default

sort -t\n -k0.10 -k0.0,0.9 killme.txt > killme2.txt

report_usage <<EOD >killme4.txt
killme2.txt
killme3.txt
300
EOD

One output file gives a blow by blow summary of the various types of DNA that occur, their copy number and how often they are found. In the following table the columns are:

  1. Begin. start of segment of DNA
  2. End. end of that segment of DNA
  3. Length. length of that segment of DNA
  4. Copy. copy number, 1 means single copy, 2 means 2 copies
  5. Type. A name assigned to that type of DNA, 0 for single copy
  6. Gap. Gap between single copy sgements (minus the Nmer size)

cat killme3.txt
 Nmer Usage report
    Begin      End   Length     Copy     Type      Gap
        1    26012    26012        1        0        0
    26013    26014        2        2   164141        0
    26015    26044       30        1        0        2
    26045    26046        2        2   164141        0
    26047    49027    22981        1        0        2
    49028    49037       10        5   265728        0
    49038    68561    19524        1        0       10
    68562    68644       83        2   267201        0
    68645    68672       28        1        0       83
    68673    68755       83        2   267201        0
    68756   338234   269479        1        0       83
    

The data sent to stdout, redirected above to killme4.txt, contains a global view of the possible assembly problems with this sequence.

  1. Repeat. Summary for copy number. 1 = single copy, 2 = two copies
  2. Coverage. Amount of DNA covered by this copy number
  3. Segments. Segments of DNA covered by this copy number
  4. Avg. Average length = coverage/segments
  5. Longest. Longest segment
  6. Where. Location of Longest
cat killme4.txt
   Summary of runs observed in sequence
      Using Nmer size:                 30

      Using Split size:               300

    Repeat Coverage Segments     Avg.  Longest    Where
         1   338054        6    56342   269479    68756
         2      170        4       42       83    68562
         5       10        1       10       10    49028
       >=2      180        5       36       83    68562
    splits        0        0        0        0        0
   contigs   338234        1   338234   338234        1

In this example the DNA region has no serious obstructions to shotgun cloning - there are no gaps larger than 300 bp ("splits") and amount of multicopy sequence is very small.

table of contents

sort

Most programs require that NMER_LISTs be in sorted order.

 Unix%    sort -t\n -k0.10 -k0.0,0.9 nmer.list > nmer.sorted
 OpenVMS$ SORT/STABLE/KEY=(POS:10,SIZE:64)/Proc=tag nmer.list nmer.sorted

 (Ignore the warning message about invalid key.)

You will also want to sort the output of A_VS_A, putting the most clustered
Nmers at the top.  The command for doing this on the mean of the closest 
values is:

 Unix%    sort -t\n -k0.26,0.34          a_vs_a.out > a_vs_a.sorted
 OpenVMS$ SORT/STABLE/KEY=(POS:26,SIZE:9) a_vs_a.out   a_vs_a.sorted

To put a particularly messy list (both positions and nmers out of order)
use this:

 Unix%    sort -t\n -k0.10 -k0.0,0.9 nmer.list > nmer.sorted
 OpenVMS$ SORT/STABLE/KEY=(POS:10,SIZE:64,NUMBER:1) -
             /KEY=(POS:1,SIZE:9,NUMBER:2) -
             /Proc=tag nmer.list nmer.sorted

Note that many files on OpenVMS have an invalid RMS attribute set which is why 
/PROC=tag is specified above.  If you happen to know the length of the 
longest line in a file, and the file's format is stream-lf (use ANAL/RMS
to find out the file format), it is generally faster to do:

 OpenVMS$ set file/attrib=mrs:20 filename
 
then sort

Where "20" would be replaced as appropriate for the file
table of contents

sum_nmer_list

Run this next on the SORTED Nmer_list. The output will tell you how many of each Nmer are present.

Prompts:
 Also process non Nmer marked locations Y/N [N]
 Input the name of the file to process
 Input the name of the output file

Example:

make_nmer_list -infile=genbank:dmwhite -nmer=6 -out=killme.txt -default

sort -t\n -k0.10 -k0.0,0.9  killme.txt > killme2.txt

sum_nmer_list <<EOD >/dev/null
N
killme2.txt
killme3.txt
EOD

head -5 killme3.txt
      10,AAATTT
       1,AACGTT
      30,AACTTT
       3,AAGCTT
       3,AAGGTT
table of contents

test_against_nmer_profile

Scan a binary NMER profile down a plain text or FASTA sequence, and report the score at each position. The score is the average over the window size of the profile (set by the longest sequence) for the weights of each Nmer observed. That is, if AGCT is at position 3, it looks up position 3 in the profile, for Nmer AGCT, sees that it was present 21 times in the input sequences at that position, and also sees that 1000 input sequences had valid Nmers at that position, so it adds 21/1000 to the score. When the whole window is done it divides by the window length. (That is, scores are normalized for length and number of input sequences, so that scores are always between 0 and 1.)

There are no prompts.  The program expects to find the following:

  bnprof              the name of the file holding the binary Nmer profile
  testseq             the name of the sequence to test.  Nucleotide or
                      peptide.  Plain text or fasta, lines no longer than 256
                      characters, no spaces, numbers or other detritus
  results             the name of the output file
table of contents

test_shotgun

TEST_SHOTGUN simulates sequencing projects. It can be used to figure out the degree of completion of a project, and/or to figure out what a project would cost. Projects up to 1 Mbp are simulated directly, over that they are simulated in a 1 Mbp region and then scaled to the final size. The program can handle multiple "phases", each with a different sequencing protocol. In any given phase, at least ONE clone must be added - note that for very large projects this scales too, so that adding 1 clone to the human genome (3 Gb) shows up in the output as 3000 clones.

Prompts:
 Input the name of the output file
 Input one text line that describes this project (<=60 characters)
 Input the amount of project DNA, in base pairs
 Input the cost of obtaining an oligonucleotide primer
 Input the amount of sequence recoverable from a
   primer walking sequencing step (,cost)
 Input the size of clones that will be used to close
   gaps between Mesh/Contigs (,cost).  (NOTE.  if clones
   are to be selected by hybridization with a single
   probe then this size should be 1/2 the actual size
   of the clones as each clone will be randomly positioned
   with respect to that probe so, on average, only 1/2 of
   each clone will be new sequence.)
 Input the minimum overlap to merge a Mesh/Contig
   AND for overlap in primer walking
 DNA sequencing project phase:   1
  
 Input the size of each clone (,cost per clone)
 Input the amount of sequence from each end(,cost)
 Input the number of such clones to process
 0 exits, any other number to add more fragments

Example:

test_shotgun <<EOD
killme.txt
This is a made up project
100000
10.00
500,5.00
10000,2.00
50
2000,0.50
500,5.00
200
0
EOD

cat killme.txt
 
 Project: This is a made up project                                   
 
 Project constants
 
   Genome size (base pairs):                            100000.
   Scaling from internal subsequence:                     1.00
   Required overlap for Mesh merge:                         50
   Bases of sequence from a custom primer:                 450
   Bases in clones used to close mesh gaps:              10000
   Cost to obtain a clone outside a Mesh/contig:          2.00
   Cost to make a custom primer:                         10.00
   Cost to sequence from a custom primer:                 5.00
 
 
 Phase :                                                     1
   Number of clones added:                                 200
   Size of each clone added:                              2000
   Bases of sequence from each end primer:                 500
   Number of genome units of coverage:                    4.00
   Number of genome units of sequence:                    2.00
 
   Unit cost: clone isolation and preparation             0.50
   Unit cost: end primer sequencing:                      5.00
 
   Total phase cost for clone preparation:              100.00
   Total phase cost for end sequencing:                2000.00
 
 Project status after phase:                                 1
 
 Summary of coverage fractions:
   included in clones:                               0.9757300
   sequenced:                                        0.8643800
     both directions:                                0.3757600
     multiply, one direction:                        0.1975600
     once:                                           0.2910600
   gaps:                                             0.1356200
     IntraMesh:                                      0.1113500
     InterMesh:                                      0.0242700
 
 Mesh information
   number (Meshes):                                          4
   max size (clones):                                      130
   avg. size (clones):                                      50
   max size (bp):                                        62224
   avg. size (bp):                                       25463
 
 Contig information
   number (Contigs):                                        76
   max size (sequences):                                    31
   avg. size (sequences):                                    5
   max size (bp):                                         4322
   avg. size (bp):                                        1140
 
 IntraMesh Gaps (inside clones)
   fraction:                                         0.1113500
   number (internal gaps):                                  60
   max size (bp):                                          782
   avg. size: (bp)                                         185
 
 InterMesh Gaps (between clones)
   fraction:                                         0.0242700
   number (external gaps):                                   3
   max size (bp):                                         1093
   avg. size (bp):                                         809
   minimum new clones to close:                              3
 
 Regions sequenced more than once, one direction only
   fraction:                                         0.1975600
   number (regions):                                        82
   max size (bp):                                         1415
   avg. size (bp):                                         240
 
 Regions sequenced both directions
   fraction:                                         0.3757600
   number (regions):                                        98
   max size (bp):                                         1203
   avg. size (bp):                                         383
 
 Regions sequenced once only
   fraction:                                         0.2910600
   number (regions):                                       175
   max size (bp):                                          500
   avg. size (bp):                                         166
 
 Sequencing steps required to close
   IntraMesh gaps (X):                                      64
   InterMesh gaps (Y):                                       7
   IntraMesh, one strand -> both strands(Z):               191
   close all, both strands (Z+2*(X+Y)):                    333
 
 Total costs
   Shotgun clone isolation and preparation (A):         100.00
   Shotgun clone end sequencing (B):                   2000.00
   Custom primer synthesis (C+D+E):                    3330.00
      IntraMesh gaps (C):                              1280.00
      InterMesh gaps (D):                               140.00
      IntraMesh 1->2 strands (E):                      1910.00
   Custom primer sequencing (F+G+H):                   1665.00
      IntraMesh gaps (F):                               640.00
      InterMesh gaps (G):                                70.00
      IntraMesh 1->2 strands (H):                       955.00
   InterMesh clone isolation (I):                         6.00
 
   Grand Total cost (A+B+C+D+E+F+G+H+I):               7101.00

table of contents

unique_nmer

This program converts all nmers in a fixed position size (ie, start at column 1, size 6) to a "unique" form. This merges all rotamers and conjugate nmers into a single form. This assumes no base changes - it cannot merge nmers that differ in unique forms by a single base into a single form. For instance, all of the following forms become: TGCAGC
GCAGCT,CAGCTG,(and other rotamers)
GCTGCT,CTGCTG (and other rotamer/conjugates)

Prompts:
 Input the name of the file to process
 Input the name of the output file
 Beginning column for the Nmer in each record
 Length of the Nmer in each record

Example:

# the first 8 lines are all rotamer/conjugates for GGGC

cat >killme.txt <<EOD
GGGC
GGCG
GCGG
CGGG
CCCG
CCGC
CGCC
GCCC
ATAT
EOD

unique_nmer <<EOD >/dev/null
killme.txt
killme2.txt
1
4
EOD

cat killme2.txt
GGGC
GGGC
GGGC
GGGC
GGGC
GGGC
GGGC
GGGC
TATA

table of contents

search

Search is a VMS utility. For the equivalent function on unix, do "man grep". If you find a bucket in a HISTO_COLUMN output that interests you (for instance, on one of the tails), you can find the matching NMER by searching for that value in the SUM_NMER file or the A_VS_A file. Also use search against GENMOREDATA:TFSITES.DAT to see if a particular sequence is a known transcription factor.

table of contents

zero_palindrome_nmer

Sometimes palindromic sequences show up twice in analyses, with the second appearance being in a separate column. In this instance, use this program to reset the value to 0.

Prompts:

Input the name of the output file
output file
starting column for nmers
Length of the Nmer in each record
Begin processing Nmer lines at which file record?
text which will overwrite the value part of the record
column where this overwrite should occur

Example:
nmer -infile=genbank:dmwhite -nmer=2 -out=killme.txt -markov=1 -both -default

zero_palindrome_nmer <<EOD >/dev/null
killme.txt
killme2.txt
1
2
6
     0 
23
EOD

cat killme2.txt
 Expected generated by:  Markov 1/0            
 Direction analyzed: BothStrands
 Required direct repeats of Nmer core: 1
 Only observed >= 0 AND Prob <= 1.00999999 shown
 Nmer                    Obs          Exp  P (tail)
AA                      2868     2283.463  0.00000000
AC                      1506     1749.192  0.00000000
AG                      1556     1749.321  0.00000133
AT                         0     1141.729  0.01440653
CA                      1894     1749.192  0.00032663
CC                      1479     1341.900  0.00011998
CG                         0      669.077  0.04123657
CT                      1556     1749.321  0.00000133
GA                      1550     1749.321  0.00000064
GC                         0      669.077  0.00000000
GG                      1479     1341.900  0.00011998
GT                      1506     1749.192  0.00000000
TA                         0     1141.729  0.00000000
TC                      1550     1749.321  0.00000064
TG                      1894     1749.192  0.00032663
TT                      2868     2283.463  0.00000000
>chi**2: 611.164185  entries:16
table of contents

nmer lists

Most of these programs work with sorted Nmer lists which are derived from the regular nmer list described here. An Nmer is a term we sometimes use for a sequence of length N (the more technical mathematical name is a tuple). An Nmer list is a text file (although some programs use a binary nmer list, which is similar but can't be viewed as normal text) that begins with one or more header records and then is followed by a long list of Nmers and their positions in the sequence. Here is an example:

        6  .NMER. The size of the nmers in this file
  1048576  .STEP. If M=1->K sequences, the positions for sequence M are (M-1)*.STEP.+offset
      100 EcoRI An annotated site
        1GAATTC An Nmer, its position is a 9 digit integer - which is 
        2AGAATT   big enough for any real sequence.  This is followed by the sequence
        3CAGAAT   of the Nmer itself.
        4TTCTGC 
        5TCTGCT Note that each line ends with a SPACE, which is hard to see
        6GAGCAG   but essential for some of the programs to work properly.
        7TGCTCC 
        8TGGAGC Note that AAAAAT and ATTTTT would both be stored as AAAAAT
        9CTGGAG 

It's possible to put site annotation into this type of file as well. When doing so the name of the site must begin with a space. This fixed format allows the files to be easily sorted on the Nmer field.

table of contents

sorted nmer lists

To generate a sorted nmer list from a regular nmer list use the appropriate sort command for the operating system you are using. For Unix that command would be:


sort -t\n -k0.10 -k0.0,0.9  nmer_list.txt > sorted_nmer_list.txt

When applied to the nmer list in the previous example the result would be:
        6  .NMER. 
  1048576  .STEP. 
      100 EcoRI 
        2AGAATT 
        3CAGAAT 
        1GAATTC 
        6GAGCAG 
        5TCTGCT 
        7TGCTCC 
        8TGGAGC 
        4TTCTGC 

To see better why the lists are sorted a bigger example is best employed

make_nmer_list -infile=GB_IN:DMWHITE -nmer=6 -out=killme.txt -default
cat <<EOD >>killme.txt
      100 EcoRI 
     2000 EcoRI 
     1500 HindIII 
     2100 HindIII 
EOD
sort -t\n -k0.10 -k0.0,0.9 killme.txt > killme2.txt
head -10 killme2.txt
        6  .NMER. 
  1048576  .STEP. 
      100 EcoRI 
     2000 EcoRI 
     1500 HindIII 
     2100 HindIII 
      783AAATTT 
     3404AAATTT 
     3471AAATTT 
     4084AAATTT 

The sort command says to use a newline as a field separator - and since there won't be one in any line the whole line becomes a single field. The first key says to sort lexically (by character order) starting at the 10th character on the line (the Nmer) and the second key says to sort within that on characters 1 through 9 (the position data).

table of contents


range lists

These are similar in format to Nmer lists but lack the .STEP. and .NMER. lines. Each range is specified by a consecutive pair of positions with the same name. example:

     6960 0-500up
     7460 0-500up
     7460 transcribed 
    13428 transcribed 
     7686 CDS 
     7757 CDS 
    10853 CDS 
    11153 CDS 

In this example the ranges are: 0-500up: 6960-7460, transcribed: 7460-13428, CDS: 7686-7757, 10853-11153. As with an Nmer list the position is set by 9 digit integer, followed by the name (which usually begins with a space to distinguish it from an Nmer) and followed by a space. The program Make_Ranges generates these files from a set of commands.

table of contents