Fundamentals of Sequence Analysis, 1998-1999

Fundamentals of Sequence Analysis, 1998-1999

Lecture 5. Tools for Molecular Biology II.

Today we're going to continue examining tools that you can use for planning and analyzing your molecular Biology experiments. The focus of this lecture will be on DNA sequencing.


Picking up your sequence data

To begin with, I'm going to assume that at this point you are all using the DNA sequencing facility. For each sequence that you send them they will place two files on Seqaxp in a subdirectory of:

If you want to work with this data on Seqaxp you may copy the files directly from that location to your directory. If you want to work with it on another system you may retrieve them with a web or FTP client. For further instructions on how to do that see http://seqaxp.bio.caltech.edu/www/saf_manuals/fetch/fetch_instructions.html.

There are two kinds of files that result from a DNA sequencing run: trace files and sequence files. The trace files are quite large, averaging about 230 kilobytes, whereas the sequence files are much smaller, taking up only 2 kilobytes. If you keep many trace files around you will run out of disk quota rapidly.

You may find that files transferred onto Seqaxp will occasionally have strange names. This comes about because the name space on a Macintosh (where the files are created) is much larger than that on OpenVMS, so a sort of encoding takes place, which replaces spaces, case changes, and illegal characters with a set of escape codes. This is the origin of the $ symbols and extra numbers which may appear in your filenames. After you copy files into your own directory, you may repair the filenames with the following command:


$ fix_abi_here

The sequence files are not in GCG format, to convert them use the command:

$ abitogcg test_abi.seq test_gcg.seq


Viewing or printing traces

You may view or print the traces on Seqaxp through a locally developed program which uses GCG graphics. This works best with GCG graphics devices that do color, such as vt340 Regis, TEK4105, or Xwindow. The Xwindow device, and some of the Tektronix emulators also support zoom.


$ tektronix versaterm term
$ abiprintout/infile=class:trace.abi -
  /begin=1/end=100

or expand the scale

$ abiprintout/infile=class:trace.abi -
  /begin=1/end=100/pnt=400

or zoom in on a region:

$ abiprintout/infile=class:trace.abi -
  /begin=50/end=70/fit

There is also a program for extracting sequence from a trace file.


$ ABI2FTR/infile=class:trace.abi/out=myseq.seq/noftr/nolnk -
    /begin=100/end=300

If you have an X11 server on your Mac or Windows machine, you may use Ted to view traces and edit the called sequence. It won't save the changes in the trace file, but it will write out a text file with the modified sequence. Ted was distributed as part of the Staden package (http://www.mrc-lmb.cam.ac.uk/pubseq/)


$ @teddir:setup_ted
$ ted -"ABI" class:trace.abi


Other sequence assembly tools

TED and the GCG assembly tools which we'll discuss below all work pretty well, but are also a bit old. So what else is out there? The heavy duty assembly and editing software Phred, Phrap, Consed, etc. (http://bozeman.genome.washington.edu/) used by the genome centers and other sites that crunch a lot of sequence all run in Unix environments. These are not programs for casual use, but at least they are free for academic users. Most of the commercial products for Macs and Windows have some assembly tools in them, but I cannot tell you how well they work. The Sequencher program from GeneCodes (http://www.genecodes.com/) is arguably the best assembly package which you'll find on either of these platforms. However, it is not cheap: the academic price for the Windows version is $2570 and for the Macintosh version it is $2000. As far as I know there are no free assembly packages for either Windows or Macintosh, although a Windows NT version of the Staden package is expected by mid 1999. So as long as you're doing fairly light assembly, for instance, working on a sequence up to 20 or 30 Kb, then the GCG tools should be adequate. If you need to do more than that, the best route would probably be to get a Linux workstation, load it with the Unix based tools, and learn to use them.


Sequence assembly terminology

The figure above is a representation of a partially completed shotgun sequencing project. It should help illustrate what some of the terms used in this talk mean. Note especially that in this talk we use two distinct terms for sequence clusters:

The GCG assembly programs, like most other assembly packages, ignore the physical linkage between end sequences derived from single subclones - that is, they provide no Mesh information. Therefore, they cannot tell you when you already have in hand a clone which should close a gap between two contigs, and when you do not. The distinction is important, because to close an intramesh gap, you need only supply a custom primer for a subcloned region you already have on hand. However, to close an intermesh gap, you need to identify another subclone, most likely through selection by hybridization with a probe, which will span this region. That is why the cost of obtaining an intermesh clone in the TEST_SHOTGUN example below is so much higher than it is for a regular clone - extra steps are needed to ensure that the clone represents the right piece of DNA.


Overview of the sequence assembly programs

Here is a brief overview of the sequence assembly, and related, programs which we will discuss today. All of the Gel programs are from either the GCG or EGCG packages, Test_Shotgun and ShowMesh are programs which I wrote.

Test_Shotgun Plan sequencing experiments, minimize costs
GelStart Create or select a sequence assembly project
GelEnter Put sequences into a project
GelMerge Merge contigs, remove vector contamination
GelView Show contigs in a project graphically
GelStatus Show information about contigs in a project
GelPicture Show detailed view of one contig
GelAnalyze Estimate gap sizes between contigs
ShowMesh Group contigs into meshes
GelAssemble Final editing and assembly

Planning a shotgun assembly experiment and minimizing costs

There are a few parameters which you can vary in a shotgun sequencing project, mainly, the size of the clones, and the number of randomly selected clones which will be end sequenced before going over to directed sequencing. One of the factors you would like to consider is the total cost of the project. GCG doesn't have any cost estimation tools, so I wrote the program TEST_SHOTGUN specifically for doing cost estimation for average shotgun sequencing projects. It can be used for any size sequencing project, even the human genome project (http://seqaxp.bio.caltech.edu/www/science_shotgun.html). Here is an example run (with all of the prompts removed) which roughly corresponds to the example sequencing project we will consider today. An additional set of results are shown in parallel for an experiment which starts with 15 clones (30 sequences, which is 10 more than our example), so that the two may be compared.

$ TEST_SHOTGUN
myshotgun.txt                  output file name
This is a one line description of the project
3000                           sequence this many bp
20.0                           cost of a custom primer
300,10.0                       bp per directed sequence, cost
1000,20.00         intermesh closing: clone size, cost to obtain
50                             minimum overlap to merge
1000,1.00                      shotgun clone size, cost to obtain 1
300,10.0                       bp per end sequence, cost
10                             number of shotgun clones
0

The next file has been modified to compare the results for a project which started with 10 clones versus one which started with 15.

$ type myshotgun.txt
 Project: This is a one line description of the project               
 
 Project constants
 
   Genome size (base pairs):                        3000.      3000.
   Scaling from internal subsequence:               1.00       1.00
   Required overlap for Mesh merge:                   50         50
   Bases of sequence from a custom primer:           250        250
   Bases in clones used to close mesh gaps:         1000       1000
   Cost to obtain a clone outside a Mesh/contig:   20.00      20.00
   Cost to make a custom primer:                   20.00      20.00
   Cost to sequence from a custom primer:          10.00      10.00
 
 
 Phase :                                               1          1
   Number of clones added:                            10         15
   Size of each clone added:                        1000       1000
   Bases of sequence from each end primer:           300        300
   Number of genome units of coverage:              3.33       5.00
   Number of genome units of sequence:              2.00       3.00
 
   Unit cost: clone isolation and preparation       1.00       1.00
   Unit cost: end primer sequencing:               10.00      10.00
 
   Total phase cost for clone preparation:         10.00      15.00
   Total phase cost for end sequencing:           200.00     300.00
 
 Project status after phase:                           1          1
 
 Summary of coverage fractions:
   included in clones:                         1.0000000  1.0000000
   sequenced:                                  0.8646666  0.9360000
     both directions:                          0.3186667  0.4273333
     multiply, one direction:                  0.0010000  0.1266667
     once:                                     0.5450000  0.3820000
   gaps:                                       0.1353333  0.0640000
     IntraMesh:                                0.1353333  0.0640000
     InterMesh:                                0.0000000  0.0000000
 
 Mesh information
   number (Meshes):                                    1          1
   max size (clones):                                 10         15
   avg. size (clones):                                10         15
   max size (bp):                                   3000.      3000.
   avg. size (bp):                                  3000.      3000.
 
 Contig information
   number (Contigs):                                   7          3
   max size (sequences):                               3          9
   avg. size (sequences):                              1          6
   max size (bp):                                    514       1170
   avg. size (bp):                                   386        936
 
 IntraMesh Gaps (inside clones)
   fraction:                                   0.1353333  0.0640000
   number (internal gaps):                             3          3
   max size (bp):                                    143         80
   avg. size: (bp)                                   135         64
 
 InterMesh Gaps (between clones)
   fraction:                                   0.0000000  0.0000000
   number (external gaps):                             0          0
   max size (bp):                                      0          0
   avg. size (bp):                                     0          0
   minimum new clones to close:                        0          0
 
 Regions sequenced more than once, one direction only
   fraction:                                   0.0010000  0.1266667
   number (regions):                                   1          5
   max size (bp):                                      3         98
   avg. size (bp):                                     3         76
 
 Regions sequenced both directions
   fraction:                                   0.3186667  0.4273333
   number (regions):                                   7          7
   max size (bp):                                    297        300
   avg. size (bp):                                   136        183
 
 Regions sequenced once only
   fraction:                                   0.5450000  0.3820000
   number (regions):                                  10         11
   max size (bp):                                    257        244
   avg. size (bp):                                   163        104
 
 Sequencing steps required to close
   IntraMesh gaps (X):                                 3          3
   InterMesh gaps (Y):                                 0          0
   IntraMesh, one strand -> both strands(Z):          12         11
   close all, both strands (Z+2*(X+Y)):               18         17
 
 Total costs
   Shotgun clone isolation and preparation (A):    10.00      15.00
   Shotgun clone end sequencing (B):              200.00     300.00
   Custom primer synthesis (C+D+E):               360.00     340.00
      IntraMesh gaps (C):                         120.00     120.00
      InterMesh gaps (D):                           0.00       0.00
      IntraMesh 1->2 strands (E):                 240.00     220.00
   Custom primer sequencing (F+G+H):              180.00     170.00
      IntraMesh gaps (F):                          60.00      60.00
      InterMesh gaps (G):                           0.00       0.00
      IntraMesh 1->2 strands (H):                 120.00     110.00
   InterMesh clone isolation (I):                   0.00       0.00
 
   Grand Total cost (A+B+C+D+E+F+G+H+I):          750.00     825.00
 

From this, we can see that for the process costs used here it would be less expensive to close via directed sequencing than by the addition of more random clones. However, the quality of the final sequence in the latter case is expected to be slightly higher, since more regions would have been sequenced more times.


Limitations of the GCG assembly tools

Before we go on to discuss assembling your sequences, now is a good time to stop and consider how, exactly, you want to go about sequencing. There is one experimental detail which is paramount - you want to be very careful to size your inserts before you clone them, and then to size the plasmids after they are cloned, so that you do not introduce sequence from any hybrid clones from the experiment. When random sequence segments are fused together, they can make mincemeat of the assembly process, since they introduce false linkage between one part of the sequence and another.

The second point that I want to make is that, in general, the DNA assembly packages have limits on the number of fragments that they can handle, as well as the total length of DNA in the final assembled sequence. The GCG package will only handle 750 fragments of maximum length 2500 each, for a maximum contig length of 75000, and the sum of all sequence lengths in the project cannot exceed 380750 bases. While this may seem like rather a lot if you are used to working on plasmids, it is just at the limit for working with larger cosmids, and is too small for a YAC or BAC.

The other thing to consider is that shotgun assembly of N sequence fragments will, in the worst case, take a bit less than N squared comparisons to put the contigs together. So if you have 100 fragments to shotgun, that is only 10000 comparisons, which is not unreasonable with current hardware. However, if 10000 sequences might require 100 million comparisons, which is liable to be very time consuming. Divide and conquer is a good strategy, if you can do it. That is, if you need a total of N fragments, and have some convenient way to subdivide them into M regions, then the number of comparisons that need to be made is roughly

Subdividing not only will make the assembly run faster, but it may be required in any case to keep each subassembly within the range that the GCG package can handle.


Using the GCG assembly tools

Creating a project

To start a new project, or to do some more work on an existing one, use the GelStart command. The progam expects the name of the project as the single parameter on the command line:


$ gelstart existing_project_name

or

$ gelstart/new  new_project_name

If you are creating a new project Gelstart will create a subdirectory in your current default directory for it. If it is an existing project it will look downwards from your login directory for a subdirectory with the project's name. Only one person at a time should work on any given project, or they might end up corrupting the project data by doing incompatible operations. If you do a lot of separate projects it is a good idea to create them from a subdirectory, so that your main directory doesn't become too cluttered. Once you have run GelStart you can work from any directory - the GCG package will remember where the project is.

The subdirectory heirarchy for a sequencing project looks like this:


   [.project_name]
      miscellaneous configuration files, copies of vectors
      [.ARCHIVE]           original sequences
      [.CONSENSUS]         contig consensus sequences
      [.RELATION]          what goes where
      [.WORKING]           modified/edited sequences

Don't ever modify any of these files except through the fragment assembly programs or the project may become corrupted!

Some of the other qualifiers you might use with GelStart are:


/vectors=name.seq,gb:pb3222,gb_sy:cvkslic
   Vectors which were used for sequencing.  Regions homologous
   to these will be detected and highlighted if GelEnter
   is run interactively.
/sites=gaattc
   Sites which should be highlighted.  Typically these
   are restriction sites which were used for cloning.
/delete
   Delete the whole project.  Use with care!

Let's start an example project for use during the rest of this lecture. It will have GAATTC sites marked, and the only vector used will be bluescript.

$ gelstart/new/vector=gb_sy:cvkslic/site=GAATTC  example

Entering sequence data into a project

Once the project exists sequence data may be entered into it through the program GelEnter. All sequences must first have been converted to GCG format. While GelEnter will allow you to type in your sequences interactively, you probably won't do much of that now that we have the DNA sequencing facility. Each fragment in a project must have a unique name, and that name will be derived from the name of the file which was input into the project. Give some thought to your naming convention before you start! In particular, fragments often have very similar names, and you want the part that varies to be as close to the front of the name as possible. Additionally, try to have the variable parts line up in columns. For instance, instead of

  pclone_13_14_f.seq    not all columns line up 
  pclone_3_14_r.seq

use

  p13_14f.seq    all columns line up 
  p03_14r.seq

Generally you want to put all of your new sequences into one directory, and then enter them all into the project at once with:


$ gelenter/enter=[-.mysequences]*.seq

For use in this lecture I have prepared an artificial test case which consists of 20 randomly selected 300 bp fragments between 2000 and 5000 in DMWHITE (http://seqaxp.bio.caltech.edu/www/seqanalysis/class.) Each was slightly corrupted with the addition of 3 mismatches and 1 indel. The program SHOTGUN was used to prepare this data. It is a modified version of the GCG CORRUPT program. Additionally, two bad sequences will be entered, one consists of two randomly selected pieces fused together (like a cloning mistake) and the other is a fragment contaminated with the vector (bluescript). Let's put those into our example now


$ gelenter/enter=class:test*.seq
$ gelenter/enter=class:bad*.seq

Remember, GelStart must be run first for each session to tell the GCG software which project you are working on. Once the sequences have been entered into the project you don't really need to keep the originals. The raw sequences will always be in the [.archive] subdirectory of the project and will never be modified.


Removing vector contamination

The next step is to find and remove any vector contamination. To do that use the program GelMerge. Remember, like all other GCG assembly programs, you must have run GelStart before you run GelMerge so that GelMerge will know which project to act on. One complication concerning the excision of vector sequence is that it only happens on single fragment contigs, ie, fragments that have not yet been merged, and it only happens for vector sequence that is on the *end* of a fragment. So if you happen to have two chunks of vector stuck side by side onto an insert, then you will only remove the distal one on the first pass. Start with:


$ gelmerge/report=file.txt/excise/nomerge
GelMerge aligns the sequences in a fragment assembly project into
assemblies called contigs.  You can view and edit these
assemblies in GelAssemble.
 What word size (* 7 *) ?
 What fraction of the words in an overlap must match (* 0.80 *) ?
 What is the minimum overlap length (* 14 *) ?
   Reading ......................
 Searching for Cvkslic
           .*....................
  Excising last 101 bases from Bad0002
   Writing ......................
          Input Contigs:         22
         Output Contigs:         22
               CPU time:      02.81

$ search file.txt excis
Bad0002  Length: 301    Excised last 101 bases
$ gelmerge/report=file.txt/excise/nomerge
Output not shown
$ search file.txt excis
%SEARCH-I-NOMATCHES, no strings matched

Repeat these two commands until the report says that no vectors were found, in the above example, that happened on the second run. Keep in mind that this method is designed to remove big chunks of vector, if you have a couple of bases from the vector on the end of the sequence it won't be removed, but it probably won't interfere with anything either. Once all of the vector contamination is gone, merge the sequences into overlapping groups of contigs:


$ gelmerge
GelMerge aligns the sequences in a fragment assembly project into
assemblies called contigs.  You can view and edit these
assemblies in GelAssemble.
 What word size (* 7 *) ?
 What fraction of the words in an overlap must match (* 0.80 *) ?
 What is the minimum overlap length (* 14 *) ?
   Reading ......................
 Comparing ......................
  Aligning ..................
   Writing ....
          Input Contigs:         22
         Output Contigs:          4
               CPU time:      02.65

There are several other command line qualifiers that you can use with GelMerge, but probably the defaults will be adequate.


Status reports

After running GelMerge you'll want to get an overview of the status of the assembly project that's a bit more complete than just the number of Contigs, which is what GelMerge output above. Obtain this information with the command:


$ gelview
GelView displays the structure of the contigs in a fragment
assembly project.
 What should I call the output file (** Example.View **) ?
 EXAMPLE has 22 Fragments in 4 Contigs
$ type example.view
GELVIEW Fragment Assembly contig display of Project: EXAMPLE
    February 11, 1999 13:51

Contig: Bad0001
 2  Bad0001     +---------------------------------->
 C  CONSENSUS   +---------------------------------->
                |----------|----------|----------|---------|---------|
                0         100        200        300        400
Contig: Bad0002
 6  Test0009                   <--------------------------------+
 5  Test0019          <--------------------------------+
 4  Test0005          <--------------------------------+
 3  Test0002      +--------------------------------->
 2  Bad0002     +--------------------->
 C  CONSENSUS   +----------------------------------------------->
                |----------|----------|----------|---------|---------|
                0         100        200        300        400
Contig: Test0001
10  Test0016                                       <----------------+
 9  Test0007                                     +---------------->
 8  Test0004                               +---------------->
 7  Test0011                              <----------------+
 6  Test0008                           +---------------->
 5  Test0014                     +--------------->
 4  Test0003                    <----------------+
 3  Test0012            +---------------->
 2  Test0001    +--------------->
 C  CONSENSUS   +--------------------------------------------------->
                |----------|----------|----------|---------|---------|
                0         200        400        600        800
Contig: Test0017
 8  Test0020                         <-------+
 7  Test0015                       +------->
 6  Test0018                 +-------->
 5  Test0010           <-------+
 4  Test0013       <-------+
 3  Test0006    +------->
 2  Test0017    +------->
 C  CONSENSUS   +---------------------------->
                |----------|----------|----------|---------|---------|
                0         400        800       1200       1600
22 Fragments in 4 Contigs

From this we learn that GelMerge put the 21 good fragments into 3 contigs and the fusion into a separate one. Of course, if this was real shotgun sequencing data, we wouldn't know yet why that one fragment had not been merged into a contig. One point which should be obvious from this is that since we didn't put in any NONshotgunned fragments we have no simple way to orient this. It is possible that you might have a restriction map, or STS, or something else in there to help you sort it out. In general, it's a good idea to also throw in the sequences from the end of your clone/region to help orient things.

EGCG supplies a similar command to GelView, but which displays a bit more information, and which you might like better:


$ gelstatus
GELSTATUS reads a GCG Fragment Assembly database, and produces a 
summary report of the quality of each contig.
 What should I call the output file (* example.dat *) ?
$ type example.dat
Gelstatus of project EXAMPLE,    February 11, 1999 13:53
                               No. Pct                      No.  Pct
Contig      Len Both Many Once Dif Dif Fragment    Dir Len  Dif  Dif
----------  --- ---- ---- ---- --- --- ----------- --- ---  ---  ---
                                       Bad0001      +   322   0  0.0
                                       --------------- ---- ---  ---
ad0001.Con  322    0    0  322   0 0.0 (  1 fragments)  322   0  0.0
                                       Test0009     -   305   3  1.0
                                       Test0019     -   304   5  1.6
                                       Test0005     -   304   3  1.0
                                       Test0002     +   305   3  1.0
                                       Bad0002      +   204   1  0.5
                                       --------------- ---- ---  ---
ad0002.Con  441  273   66  102  15 3.4 (  5 fragments) 1422  15  1.1
                                       Test0016     -   302   3  1.0
                                       Test0007     +   302   5  1.7
                                       Test0004     +   302   4  1.3
                                       Test0011     -   302   3  1.0
                                       Test0008     +   300   5  1.7
                                       Test0014     +   305   5  1.6
                                       Test0003     -   305   3  1.0
                                       Test0012     +   305   6  2.0
                                       Test0001     +   300   4  1.3
                                       --------------- ---- ---  ---
st0001.Con  951  610  140  201  35 3.7 (  9 fragments) 2723  38  1.4
                                       Test0020     -   302   5  1.7
                                       Test0015     +   302   2  0.7
                                       Test0018     +   301   2  0.7
                                       Test0010     -   303   8  2.6
                                       Test0013     -   304   4  1.3
                                       Test0006     +   302   7  2.3
                                       Test0017     +   302   6  2.0
                                       --------------- ---- ---  ---
st0017.Con 1071  509  291  271  25 2.3 (  7 fragments) 2116  34  1.6
           ---- ---- ---- ---- --- --- --------------  ---- ---  ---
(Total)    2785 1392  497  896  75 2.7 (Total)         6583  87  1.3

 EXAMPLE has 22 fragments in 4 Contigs
 Longest contig is Test0017.Con, length 1071

There is a lot of information in here, most of it quite useful. The program outputs:


Contig    Contig name
Length    length of the contig
both      length sequences on both strands   }
many      length sequenced more than both    }  these add up to Length
once      length sequenced once              }
No. Dif.  Number of sites that differ from the consensus
Pct Dif.  Percent of sites that differ from the consensus

The sum of the differences for the individual fragments can be more here than versus the consensus because multiple incorrect assignments at a single site count as just 1 difference for the latter. High difference rates indicate a problem in the assembly, here there is no such problem.

Notice that the program doesn't leave much room for the contig names and truncates them from the left, so TEST0001.CON becomes ST0001.CON. That can get confusing, but the fragment names are shown and can usually be used to figure out the consensus name (which is always that of one of the fragments, the last in the list).

To look at a single contig use the EGCG GelPicture program (or GelAssemble).


$ gelpicture test0001
$ type test0001.pic
GelPicture of project EXAMPLE contig TEST0001, 2/11/1999 13:57
10 Test0016 -                                 <-+----+----+---
 9 Test0007 +                               ----+----+----+>
 8 Test0004 +                         +----+----+----+>
 7 Test0011 -                         <----+----+----+
 6 Test0008 +                      ---+----+----+->
 5 Test0014 +                ----+----+----+>
 4 Test0003 -                <---+----+----+-
 3 Test0012 +         -+----+----+--->
 2 Test0001 + ----+----+----+>
 1 Consensus  ----+----+----+----+----+----+----+----+----+-->
              ....+....+....+....+....+....+....+....+....+....+
              1                                             1000

Pretty of project EXAMPLE contig TEST0001,     February 11, 1999 13:57

 2 Test0001 +   1 CGCAAAGTGCTCTAATTGAATTACATTTCGCATTAAAGCGCCCCAGTGGG  50
 1 Consensus    1 ---------+---------+---------+---------+---------+  50

 2 Test0001 +  51 CGTGGCCCAGGGGGGGGGGTGCGGTGTGTGGGTTAACCAATCTGGGGCAT 100
 1 Consensus   51 ---------+---------+---------+---------+---------+ 100

 2 Test0001 + 101 TATTAAATTATCCCTTAGACACCGCAATCGCACTACCCCCTGGATTTCCC 150
 1 Consensus  101 ---------+---------+---------+---------+---------+ 150

 3 Test0012 +   1           ACACCCTCCCCCCCGCCACCAGCGCACACGGTCTTGGCTT  40
 2 Test0001 + 151 TTTTTCCTCAACACCCTCCCCCCCGCCACCAGCGCACACGGTCTTGGCTT 200
 1 Consensus  151 ---------+---------+---------+---------+---------+ 200

 3 Test0012 +  41 TTGTTATTATAAGCTTTGCTTTGGCTTTTTGCCACCTTTCAGTTTGGTGC  90
 2 Test0001 + 201 TTGTTATTATAAGCTTTGCTTTGGCTTTTTGCCACTTTTCAGTTTGGTGC 250
 1 Consensus  201 ---------+---------+---------+-----y---+---------+ 250

 3 Test0012 +  91 CAGGCACTTTTTAGCCATTAATCTTCTTTGACGCGCTTAAATCACATGAA 140
 2 Test0001 + 251 CAGGCACTTTTTAGCCATTAATCTTCTTTGACGC...TAAATCACATGAA 300
 1 Consensus  251 ---------+---------+---------+----gct--+---------+ 300

 5 Test0014 +   1              GGCATATTAGTACGTGGCTTTACCGCCTTTGTACTAA  37
 4 Test0003 - 305         TAGATGGCATATTAGTACGTGGCTTTACCGCCTTTGTAATAA 264
 3 Test0012 + 141 TAACAAATTAGATGACATATTAGTACGTGGCTTTACCGCCTTGGTAATAA 190
 1 Consensus  301 --------=+====g====+=========+=========+==t===a==+ 350

 5 Test0014 +  38 TTTTTT...TTTTTTTTTAAGCCCAACATAATGATTCAAATGTTGTGGCA  87
 4 Test0003 - 263 TTTTTT...TTTTTTTTTAAGCC.AACATAATG..TCAAATGTTGTGGCA 214
 3 Test0012 + 191 TTTTTTGGGTTTTTTTTTAAGCCCAACATAATG..TCAAATGTTGTGGCA 240
 1 Consensus  351 ======...+=========+===c=====+===..====+=========+ 400

 6 Test0008 +   1                                 AAAAGACAGGGAAATGAG  18
 5 Test0014 +  88 CCAAAAAAGAGAGCGAAAAAGGAAAGAATGCAAAAAGAAAGGGAAATGAG 137
 4 Test0003 - 213 CCAAAAAAGAGAGCGAAAAAGGAAAGAATGCAAAAAGAAAGGGAAATGAG 164
 3 Test0012 + 241 CCAAAAAAGAGAGCGAAAAAGGAAAGAATGCAAAAAGAAAGGGAAATGAG 290
 1 Consensus  401 =========+=========+=========+========a+=========+ 450

 8 Test0004 +   1                                                 TT   2
 7 Test0011 - 302                                   TTGGCTCTATATATTT 287
 6 Test0008 +  19 GATGAAAATTAAATAACAAAAGGATCTCGCTCCCTTGGCTCTATATATTT  68
 5 Test0014 + 138 GATGAAAATTAAATAACAAAAGGATCTCGCTCCCTTGGTTCTATATATTT 187
 4 Test0003 - 163 GATGAAAATTAAATAACAAAAGGATCTCGCTCCCTTGGCTCTATATATTT 114
 3 Test0012 + 291 GATGAAAATTAAATA                                    305
 1 Consensus  451 =========+=========+=========+========c+=========+ 500

 8 Test0004 +   3 TTGCATAATTCACTGGCCTGTCTAACGGTGCGTATACGTTATCCCCACTT  52
 7 Test0011 - 286 TTGCATAATTCATTGGCCTGTCTAACGGTGCATATACGTTATCCCCACTT 237
 6 Test0008 +  69 TTTCATAATTCATTGGCCTGTCTAACGGTGCGTATACGTTATCCCGACTT 118
 5 Test0014 + 188 TTGCATAATTCATTGGCCTGTCTAACGGTGCGTATACGTTATCCACACTT 237
 4 Test0003 - 113 TTGCATAATTCATTGGCCTGTCTAACGGTGCGTATACGTTATCCCCACTT  64
 1 Consensus  501 ==g======+==t======+=========+=g=======+====cc===+ 550

 8 Test0004 +  53 TCCCTTGAGCTTTTCCATTTCACCCCTCCCCCCCTCCACACACACACACA 102
 7 Test0011 - 236 TCCCTTGAGCTTTTCCATTTCACCCCTCCCCCCCTCCACACACACACACA 187
 6 Test0008 + 119 TCCCTTGAGCTTTTCCATTTCACCCCTCCCCCCCTCCACACACACACACA 168
 5 Test0014 + 238 TCCCTTGAGCTTTTCCATTTCACCCCTCCCCCCCTCCACACACACACACA 287
 4 Test0003 -  63 TCCCTTGAGCTTTTCCATTTCACCCTTCCCCCCCTGCACACACACACACA  14
 1 Consensus  551 =========+=========+=====c===+=====c===+=========+ 600

10 Test0016 - 302                                                  C 302
 9 Test0007 +   1                 GCATGCACATGTATTTATAAATATTTTTCGCCCC  34
 8 Test0004 + 103 CACACTCACACAGACGGCATGCACATGTACTTATAAATATTTTTCGCCCC 152
 7 Test0011 - 186 CACACTCACACAGACGGCATGCACATGTATTTAT..ATATTTTTCGCCCC 137
 6 Test0008 + 169 CACACTCACACAGACGGCATGCACA..TATTTATAAATATTTTTCGCCCC 218
 5 Test0014 + 288 CACACTCACACAGACGGC                                 305
 4 Test0003 -  13 CACACTCACACAG                                        1
 1 Consensus  601 =========+=========+=====tg==t====aa===+=========+ 650

10 Test0016 - 301 TTTGTGCTGTCAGTTTGCTAATTAGCTACGCATTTTCATTGTCGTTAAAA 252
 9 Test0007 +  35 TTTGTGCTGTCAGTTTGCTAATTAGCTACGCATTTTCATTGTCGTTAAAA  84
 8 Test0004 + 153 TTTGTGCTGTCAGTTTGCTAATTAGCTACGCATTTTCATTGTCGTTAAAA 202
 7 Test0011 - 136 TTTGTGCTGTCAGTTTGCTAATTAGCTACGCATTTTCATTGTCGTTAAAA  87
 6 Test0008 + 219 TTTGTGCTGTCAGTTTGCTAATTAGCTACGCATTTTCATTGTCGTTAAAA 268
 1 Consensus  651 =========+=========+=========+=========+=========+ 700

10 Test0016 - 251 ACTCCATTCACATGCATATTTACTCACTCCGCCGAGTA..TATTATTCAT 202
 9 Test0007 +  85 ACTCCATTCACATGCATATTTACTCACTCCGCCGAGTA..TATTATTCAT 134
 8 Test0004 + 203 ACTCCATTCACATGCATATTTACTCACTCCGCCGAGTATTTATTATTCAT 252
 7 Test0011 -  86 ACTCCATTCACATGCATATTTACTCACTCCGCCGAGTA..TATTATTCAT  37
 6 Test0008 + 269 ACTCCATTCACATGCATATTTACTCACTCCGC                   300
 1 Consensus  701 =========+=========+=========+========..=========+ 750

10 Test0016 - 201 TTTGTTCGCCTTGTGTCCCCCTCATTCTTTTAATGCAATTTAATTATTTT 152
 9 Test0007 + 135 TTTGGTCGCCTTGTGTCCCCCTCATTCTTTTAATGCAATTTAATTATTTT 184
 8 Test0004 + 253 TTTGTTCGCCTTGTGTCCCCCTCATTCTTTTAATGCAATTTAATTATTTT 302
 7 Test0011 -  36 TTTGTTCGCCTTGTGTCCCCCTCATTCTTTTAATGC                 1
 1 Consensus  751 ====t====+=========+=========+=========+=========+ 800

10 Test0016 - 151 ACTATATGTTTGCGTCACTGACAGAAAAGAGAAAAGCCCATGTCTTCT.A 102
 9 Test0007 + 185 ACTATATGTTTGCGTCACAGACAGAAAAGAGAAAAGCCCATGTCTTCTGA 234
 1 Consensus  801 =========+========w+=========+=========+========g+ 850

10 Test0016 - 101 CATGTGCCATATACCCCCGACCTCTGACCTCTTCCTGCCTGCTTTACTTC  52
 9 Test0007 + 235 CATGTGCCATATACCCCCTACCTCTGACCTCTTCCTGCCT..TTTACTTC 284
 1 Consensus  851 =========+========k+=========+=========+gc=======+ 900

10 Test0016 -  51 ACATTTTGATAAATTATATGCATTTTTAATCTAATTATATTGCAAAAAGC   2
 9 Test0007 + 285 ACATTTTGATAAATTATA                                 302
 1 Consensus  901 =========+========-+---------+---------+---------+ 950

10 Test0016 -   1 G                                                    1
 1 Consensus  951 -                                                  951

If you will be making any custom primers to close the gaps in this sequence, you can verify their locations and directions. The primers file, if supplied, must be in the same format as a Findpatterns input file. Example:


$ create primers.dat
Some primers  - where do they fall in the contigs?
------    ----  -------------------     --- --- ------------------   ..
prim1      0    atgatgcatcgggcacgat      .  !  first attempt
prim2      0    ggcgtgactatattaatcactt   .  !  assume GA is correct
prim3      0    aattcattggcctgtctaac     .  !
^Z
$ gelpicture/contig=test0001/primer=primers.dat
$ search/wind=(10,0) test0001.pic prim

 8 Test0004 +   3 TTGCATAATTCACTGGCCTGTCTAACGGTGCGTATACGTTATCCCCACTT   52
 7 Test0011 - 286 TTGCATAATTCATTGGCCTGTCTAACGGTGCATATACGTTATCCCCACTT  237
 6 Test0008 +  69 TTTCATAATTCATTGGCCTGTCTAACGGTGCGTATACGTTATCCCGACTT  118
 5 Test0014 + 188 TTGCATAATTCATTGGCCTGTCTAACGGTGCGTATACGTTATCCACACTT  237
 4 Test0003 - 113 TTGCATAATTCATTGGCCTGTCTAACGGTGCGTATACGTTATCCCCACTT   64
 1 Consensus  501 ==g======+==t======+=========+=g=======+====cc===+  550
                        >******************>
                        prim3

There is a graphical variant of Gelpicture called GelFigure, but it's a bit buggy and doesn't really do much that Gelpicture doesn't.


What about that single sequence contig?

Unfortunately GelMerge does not write out any explanation for why it did not fuse two contigs. In this example, and in general, a contig consisting of a single sequence isn't a good thing at this stage of assembly - it is likely there is something wrong with that fragment so that it should be removed from the project. One quick way to find out what is wrong is to compare it to the contig consensus sequences with FASTA and see what turns up. In this case, we see that it has high scores against two other contigs - that is, it appears to be a hybrid sequence. (Think about what repetitive sequences will do to a sequencing project!) In other cases, it might have no similarity to the other consensus sequences - which either means that more fragments are needed to close the gap, or the fragment is a contaminant.

$ fasta/infile=[.example.consensus]BAD0001.CON -
 /infile2=[.example.consensus]*.con -
 /nohisto/noalign/default
$ search bad0001.fasta cons
disk:[name.Example.Consensus]Bad0001.Con  Bad0001...1288  1288  1288
disk:[name.Example.Consensus]Test0017.Con  Test00... 569   569   573
disk:[name.Example.Consensus]Test0001.Con  Test00... 504   504   690
disk:[name.Example.Consensus]Bad0001.Con /rev  Ba...  56    56    56
disk:[name.Example.Consensus]Test0017.Con /rev  T...  56    56    57
disk:[name.Example.Consensus]Bad0002.Con  Bad0002...  54    54    68
disk:[name.Example.Consensus]Bad0002.Con /rev  Ba...  44    44    45
disk:[name.Example.Consensus]Test0001.Con /rev  T...  40    40    56


Is it time to go to directed sequencing?

When doing shotgun sequencing there is a point at which it becomes more efficient to go to directed sequencing, rather than to keep adding randomly selected subclones. The EGCG program GelAnalyze processes the output of GelStatus to prepare a report which will help you decide if the project has reached that state. The reference is Lander and Waterman (1988) Genomics 2:231-239. GelAnalyze needs to be told how big the final sequence will be with the /size qualifier. You can have it predict what will happen if you add N more fragments with the /NEWFRAGS=N qualifier. You decide when to go over to directed sequencing by what it tells you about the number of contigs that would result and the size and distibution of gaps.


$ gelanalyze/size=3000/default
$ type example.ana
GELANALYZE of Gelstatus report example.dat, February 11, 1999 14:18
 Qualifiers: /OVERlap=30. /SIZe=3000.
          Number of fragments:     22
      Average fragment length:  299.2
    Total length of fragments:   6583
    Sigma mean: 0.899,   Sigma variance: 0.0001
 (i) Number of apparent contigs
                           Actual:   4
                         Expected:   3.07
 Allowing fragment length to vary:   3.07

 (ii) Number of apparent contigs of j fragments

       j     Actual     Expected
      ---    ------     --------
        1         1         0.43 lines with 0 observed deleted
        5         1         0.23
        7         1         0.17
        9         1         0.13
    >  50         0         0.00

 (ii') Number of apparent contigs of at least 2 fragments
                           Actual:   3
                         Expected:   2.64
 Allowing fragment length to vary:   2.64

 (iii) Number of clones in an apparent contig
                           Actual:   5.50
                         Expected:   7.17
 Allowing fragment length to vary:   7.16

 (iv) Length of an apparent contig
                           Actual: 696.25
                         Expected: 871.82

 (v) Number of actual contigs if overlapping is perfect
                         Expected:   2.46

 (vi) Probability that a gap of given length occurs

      Length   Given Gap   Any Gap
      ------   ---------   -------
          50       0.56      0.92
         100       0.39      0.78
         150       0.27      0.62
         200       0.19      0.47
         250       0.13      0.34
         300       0.09      0.25
         350       0.06      0.18
         400       0.04      0.13
         450       0.03      0.09
         500       0.02      0.06

 (vi') Probability that a gap is real:  0.80

 Maximum number of contigs:   4.1
         occurs at redundancy (c) = 1.11
         when total fragment length sequenced is 3334 bp

For instance, from GelStatus we know that there are 3000 - 2439 = 561 unsequenced bases and three contigs. That means we have either 4, 3 or 2 gaps, depending on whether or not the contigs extend to the ends of the sequence. These gaps must be of average size 140, 187, or 280 bp. The statistics are a bit odd though, and the mean isn't really what we want to know. Looking at section (vi) we find the probabilities for either a given gap, or any of the gaps, being a certain size. If we run GelAnalyze again, with more fragments, then compare the two results, it will give us an idea of what our odds are of closing the gaps with single directed sequencing steps. For instance, adding 10 more fragments:


$ gelanalyze/size=3000/newfrags=10

Let's have a look at the gap sizes we have in our test case (which we can do because we know the full sequence), and compare that to the expected gap sizes from the two example.ana files.


Example case gaps
Where what    Length
 2000 gap      23 
 2023 contig  943
 2966 gap      60
 3026 contig  410
 3436 gap     196
 3632 contig 1063
 4695 gap     305
 5000

         Current      +10 frags
Length | Given Any  | Given Any   | Observed
       | Gap   Gap  | Gap   Gap   | Gap Sizes
------ | ----- ---- | ----  ----  | --------
    50 | 0.63  0.95 | 0.43  0.64  |  1
   100 | 0.45  0.84 | 0.25  0.41  |  1
   150 | 0.32  0.69 | 0.15  0.25  |  0
   200 | 0.23  0.55 | 0.09  0.15  |  1
   250 | 0.17  0.42 | 0.05  0.09  |  0
   300 | 0.12  0.32 | 0.03  0.05  |  0
   350 | 0.09  0.24 | 0.02  0.03  |  1
   400 | 0.06  0.17 | 0.01  0.02  |  0
   450 | 0.04  0.13 | 0.01  0.01  |  0
   500 | 0.03  0.09 | 0.00  0.01  |  0

We learn from this that there is a 24% chance that the current data has a gap of 350 bases or larger, which is large enough that it might not be spanned in a single directed sequencing run. On the other hand, after adding 10 more shotgun fragments, there would be only a 3% chance of such directed sequencing failing to bridge a gap. In addition, there is rather a lot of the 3Kb piece which has only been sequenced in one direction.


Can a particular project reduce to a single contig?

Often you want to know if your project may be reduced to a single contig by directed sequencing within clones. The locally written program ShowMesh can tell you this. In order to use it, you must have prepared an ends file, which contains the names of the end sequences from each clone (case is ignored). Here is an example showing how the program is run, and the contents of a small ends file, which only covers the ends of a few of the clones in the example project:

$ create ends.txt
Test0011,Test0004  !intracontig
Test0007,Test0017  !intramesh
Test0013,noright   !right missing
noleft,Test0010    !left missing
^Z
$ showmesh [.example.relation] ends.txt showmesh.out
 about to open ends.txt
Read in ends for 4 clones
opened file BAD0002.FIL;5
opened file TEST0001.FIL;3
opened file TEST0017.FIL;3
$ type showmesh.out
  Test0011  TEST0001.FIL;3    Test0004  TEST0001.FIL;3 IntraContig
  Test0007  TEST0001.FIL;3    Test0017  TEST0017.FIL;3 IntraMesh
  Test0013  TEST0017.FIL;3     noright         MISSING right_missing
    noleft         MISSING    Test0010  TEST0017.FIL;3 left_missing

From this example, we can see that contigs TEST0001 and TEST0017 may be merged by sequencing within the clone which has ends Test0007,Test0017.


Final assembly - reconciling the differences

At this point you have an overview of your sequencing project. Since the sequences won't be perfect you will need to reconcile them manually using the program GelAssemble. It is a good idea to take a look at your contigs as you go along in the project, just to make sure that you don't find some problem like nonneighboring fragments that have been fused during the cloning. However, it isn't necessarily a good idea to start out right away editing the working sequences. You want to wait until you are pretty sure that you have all the sequences you're going to have in a given region - otherwise you may have to reedit the same region, which would be duplicated work.

To start GelAssemble issue the command:


$ GelAssemble

There is only one possible qualifier for this program, which is /contig=contigname, which will load that contig automatically for you when you start. GelAssemble is a multiple sequence editor that is a derivative of the GCG program Lineup. There is no choice other than to learn the slightly arcane set of command that it uses. If you get into the program in screen mode (after you've loaded a contig) and hit a ? it will put up a list of the commands. It is a good idea to print that screen and tape it up next to the terminal for handy reference. Some terminal emulators may need to be configured carefully so that they will work properly with GelAssemble (and Seqed, and Lineup...)


                                   SCREEN MODE

                      [n] is an optional numeric parameter
         Where no VT220 version is given, you may use the VT100 version.

 VT52 or VT100    VT220 or VT240          Action
 [n]<Right-arrow>                      move ahead [n bases]
 [n]<Left-arrow>                       move back [n bases]
 [n]<Up-arrow>                         move up [to row n]
 [n]<Down-arrow>                       move down [to row n]
  >               <Next-screen>        scroll one screen to the right
  <               <Prev-screen>        scroll one screen to the left
 <Ctrl>H          <F12>                move to start of the sequence
 <Ctrl>E          <F13>                move to end of the sequence
 /GATTC<Return>   <Find>GATTC<Return>  find next occurrence of GATTC
 165<Return>                           move to base 165 in sequence
 <Ctrl>A                               move to next ambiguity in alignment
 <Ctrl>R                               move to next ambiguity in sequence
 <Ctrl>V                               move to next gap in consensus
 <Ctrl>Z          <Do>                 enter Command Mode
 <Ctrl>L                               toggle alignment display enlargement
 <Ctrl>W                               redraw the screen
 <Ctrl>O          <F11>                toggle INSERT/OVERSTRIKE mode
  !                                    summary of current sequence
  ?               <Help>               display these help screens
 <Ctrl>G                               recalculate the consensus
 G A T C ....                          add base at the cursor
 <Delete>                              delete a base, or move sequence left
 <Space bar>                           move the sequence to the right
 <Ctrl>X                               delete alignment column
 <Ctrl>I                               restore alignment column
 <Ctrl>B          <Select>             begin selecting a range for removal
 <Ctrl>N          <Remove>             remove the selected range
 <Ctrl>P          <Insert Here>        insert the removed range
   -                                   reject current fragment

                                 COMMAND MODE

                     [a,b] specifies a range of fragments.
                       [x,y] specifies a range of bases.
                     [n] is an optional numeric parameter.

      EDit [ContigName]     replace current contig with a new contig
      CONTIGs               select another contig for editing
      WRite                 write a contig to the database
      ERASE                 delete current contig from the database
      EXit                  write the contig and quit
      QUIT                  quit without writing
      238                   move to position 238 in the current fragment
[x,y] PRETTYout [FileName]  write the sequence alignment [position x - y]
[a,b] SEQOUT                write fragments [a - b] to sequence files
      BIGPICture [FileName] write bar schematic to an output file
      OVERstrike            select OVERSTRIKE sequence edit mode
      NOOVERstrike          select INSERT sequence edit mode
[x,y] CONSensus             recalculate the consensus sequence
[a,b] LOCk                  lock strands [a through b]
[a,b] Unlock                unlock strands [a through b]
[x,y] SELect                select bases [x through y]
      REMove                remove the selected bases
[n]   INSert                insert the removed bases [at position n]
      CAncel                cancel the selection
[x,y] DElete                delete bases [x through y]
      GOTo [FragmentName]   move to strand by name
      FInd GAATC            find the next occurrence of GAATC
      DIfferences           show differences from the consensus
      MAtches               show matches with the consensus
      Neither               show neither matches nor differences
      REDraw                redraw the screen
      Help                  display these help screens
      SORt [DEScending]     sorts strands by their offsets in alignment
[a,b] MOve                  moves a strand [from line a to line b]
      OPen                  opens a blank line at the cursor position
[a,b] ANChor                anchors strands [a through b]
[a,b] NOANchor              unanchors strands [a through b]
      LOad [ContigName]     loads another contig into the Edit Screen
      REVerse               reverse-complement the (anchored) strand(s)
[n]   Offset                shifts the current fragment [to begin at n]
      REJect                removes the current fragment from the screen
      NODUPlicate           removes a duplicated fragment from the screen
      SPAWN                 renames a duplicated fragment
      SEParate              makes two contigs from anchored and
                                 unanchored strands

A quick aside. It is tempting to use GelAssemble by searching for each mismatch, fixing it, possibly after consulting the traces, and going on. However, if the sequence happens to be bad in the same way on the 2 or 3 sequences that cover a region you won't ever check it. The tools used by the genome centers work around this problem by assigning a quality to each region, so that regions of low quality are pointed out to you. In the GCG assembly system this isn't an option. As a bare minimum, you should review the output from GelPicture and look carefully through the traces for those regions which have not been sequenced at least once in each direction. Ideally, there would be no regions like this - everything would have been sequenced in both directions.

Ok, now let's use GelAssemble a little just to give you an overview of how it works. First, let's remove the BAD0001 sequence from the database, this completely removes it, which is only something you want to do if you suspect the sequence is really useless.


$ gelassemble/contig=bad0001
^Z
:erase
Y
^Z
:quit

Now let's rebuild the database and work on some of the contigs:


$ set term/page=40/width=132
$ gelmerge
$ gelassemble

Gelassemble has three modes:

  1. contig mode
  2. screen mode
  3. command mode

The program starts in contig mode, if the project is nearing completion there will be only one contig.Scroll through the contigs and pick one to load with ^K. To get from screen mode to command mode use ^Z, the program returns to screen mode after each command mode command. To get back to the contig screen from screen mode:


^Z
:contigs

but only do this after using WRITE to save any changes!

The commands are listed above and are available from within the program, by typing a ?. The ones you'll use the most for routine editing are:

^A to find next mismatch
^O to toggle insert/overstrike modes
^X to delete a column
^I to undelete a column (only if the cursor has not moved)
^G to recalculate the consensus

If in command mode you anchor a group of sequences together, then any change made to one of these sequences will also be made to all of the others. This is particularly useful if you want to manually assemble two contigs into one. First load the first contig from contig mode, then load the second contig via:

^Z
:load name_of_second_contig

You will then be able to move the entire second contig, which is loaded with all sequences anchored to each other, with respect to the first contig.

To prevent the modification of one or more sequences use the Lock command.

To save any changes to a contig, and then continue editing, use Write from command mode. The Exit command will also save the changes to the last contig which was modified. Quit will not save any changes which have been made!

One other note, you may want to define AGCT to be elsewhere on the keyboard, for instance under QWER (you may not remap the number keys). Use the program Setkeys to do this. The resulting file "set.keys" must be in the directory where you are running GelAssemble.

When you are done with the project you can pull the final consensus sequence out. It will be in [.project...]*.con (you cannot predict the name). The .CON files are just GCG sequence files, so copy one out and rename it.

Questions? Next week we will cover tools for protein analysis.