Today we're going to continue examining tools that you can use for planning and analyzing your molecular Biology experiments. The focus of this lecture will be on DNA sequencing.
To begin with, I'm going to assume that at this point you are all using the DNA sequencing facility. For each sequence that you send them they will place two files on Seqaxp in a subdirectory of:
There are two kinds of files that result from a DNA sequencing run: trace files and sequence files. The trace files are quite large, averaging about 230 kilobytes, whereas the sequence files are much smaller, taking up only 2 kilobytes. If you keep many trace files around you will run out of disk quota rapidly.
You may find that files transferred onto Seqaxp will occasionally have strange names. This comes about because the name space on a Macintosh (where the files are created) is much larger than that on OpenVMS, so a sort of encoding takes place, which replaces spaces, case changes, and illegal characters with a set of escape codes. This is the origin of the $ symbols and extra numbers which may appear in your filenames. After you copy files into your own directory, you may repair the filenames with the following command:
$ fix_abi_here
The sequence files are not in GCG format, to convert them use the command:
$ abitogcg test_abi.seq test_gcg.seq
You may view or print the traces on Seqaxp through a locally developed program which uses GCG graphics. This works best with GCG graphics devices that do color, such as vt340 Regis, TEK4105, or Xwindow. The Xwindow device, and some of the Tektronix emulators also support zoom.
$ tektronix versaterm term $ abiprintout/infile=class:trace.abi - /begin=1/end=100
![]()
or expand the scale
$ abiprintout/infile=class:trace.abi - /begin=1/end=100/pnt=400
![]()
or zoom in on a region:
$ abiprintout/infile=class:trace.abi - /begin=50/end=70/fit
![]()
There is also a program for extracting sequence from a trace file.
$ ABI2FTR/infile=class:trace.abi/out=myseq.seq/noftr/nolnk - /begin=100/end=300If you have an X11 server on your Mac or Windows machine, you may use Ted to view traces and edit the called sequence. It won't save the changes in the trace file, but it will write out a text file with the modified sequence. Ted was distributed as part of the Staden package (http://www.mrc-lmb.cam.ac.uk/pubseq/)
$ @teddir:setup_ted $ ted -"ABI" class:trace.abi
![]()
Other sequence assembly tools
TED and the GCG assembly tools which we'll discuss below all work pretty well, but are also a bit old. So what else is out there? The heavy duty assembly and editing software Phred, Phrap, Consed, etc. (http://bozeman.genome.washington.edu/) used by the genome centers and other sites that crunch a lot of sequence all run in Unix environments. These are not programs for casual use, but at least they are free for academic users. Most of the commercial products for Macs and Windows have some assembly tools in them, but I cannot tell you how well they work. The Sequencher program from GeneCodes (http://www.genecodes.com/) is arguably the best assembly package which you'll find on either of these platforms. However, it is not cheap: the academic price for the Windows version is $2570 and for the Macintosh version it is $2000. As far as I know there are no free assembly packages for either Windows or Macintosh, although a Windows NT version of the Staden package is expected by mid 1999. So as long as you're doing fairly light assembly, for instance, working on a sequence up to 20 or 30 Kb, then the GCG tools should be adequate. If you need to do more than that, the best route would probably be to get a Linux workstation, load it with the Unix based tools, and learn to use them.
Sequence assembly terminology
![]()
The figure above is a representation of a partially completed shotgun sequencing project. It should help illustrate what some of the terms used in this talk mean. Note especially that in this talk we use two distinct terms for sequence clusters:
Here is a brief overview of the sequence assembly, and related, programs which we will discuss today. All of the Gel programs are from either the GCG or EGCG packages, Test_Shotgun and ShowMesh are programs which I wrote.
| Test_Shotgun | Plan sequencing experiments, minimize costs |
| GelStart | Create or select a sequence assembly project |
| GelEnter | Put sequences into a project |
| GelMerge | Merge contigs, remove vector contamination |
| GelView | Show contigs in a project graphically |
| GelStatus | Show information about contigs in a project |
| GelPicture | Show detailed view of one contig |
| GelAnalyze | Estimate gap sizes between contigs |
| ShowMesh | Group contigs into meshes |
| GelAssemble | Final editing and assembly |
$ TEST_SHOTGUN myshotgun.txt output file name This is a one line description of the project 3000 sequence this many bp 20.0 cost of a custom primer 300,10.0 bp per directed sequence, cost 1000,20.00 intermesh closing: clone size, cost to obtain 50 minimum overlap to merge 1000,1.00 shotgun clone size, cost to obtain 1 300,10.0 bp per end sequence, cost 10 number of shotgun clones 0
The next file has been modified to compare the results for a project which started with 10 clones versus one which started with 15.
$ type myshotgun.txt
Project: This is a one line description of the project
Project constants
Genome size (base pairs): 3000. 3000.
Scaling from internal subsequence: 1.00 1.00
Required overlap for Mesh merge: 50 50
Bases of sequence from a custom primer: 250 250
Bases in clones used to close mesh gaps: 1000 1000
Cost to obtain a clone outside a Mesh/contig: 20.00 20.00
Cost to make a custom primer: 20.00 20.00
Cost to sequence from a custom primer: 10.00 10.00
Phase : 1 1
Number of clones added: 10 15
Size of each clone added: 1000 1000
Bases of sequence from each end primer: 300 300
Number of genome units of coverage: 3.33 5.00
Number of genome units of sequence: 2.00 3.00
Unit cost: clone isolation and preparation 1.00 1.00
Unit cost: end primer sequencing: 10.00 10.00
Total phase cost for clone preparation: 10.00 15.00
Total phase cost for end sequencing: 200.00 300.00
Project status after phase: 1 1
Summary of coverage fractions:
included in clones: 1.0000000 1.0000000
sequenced: 0.8646666 0.9360000
both directions: 0.3186667 0.4273333
multiply, one direction: 0.0010000 0.1266667
once: 0.5450000 0.3820000
gaps: 0.1353333 0.0640000
IntraMesh: 0.1353333 0.0640000
InterMesh: 0.0000000 0.0000000
Mesh information
number (Meshes): 1 1
max size (clones): 10 15
avg. size (clones): 10 15
max size (bp): 3000. 3000.
avg. size (bp): 3000. 3000.
Contig information
number (Contigs): 7 3
max size (sequences): 3 9
avg. size (sequences): 1 6
max size (bp): 514 1170
avg. size (bp): 386 936
IntraMesh Gaps (inside clones)
fraction: 0.1353333 0.0640000
number (internal gaps): 3 3
max size (bp): 143 80
avg. size: (bp) 135 64
InterMesh Gaps (between clones)
fraction: 0.0000000 0.0000000
number (external gaps): 0 0
max size (bp): 0 0
avg. size (bp): 0 0
minimum new clones to close: 0 0
Regions sequenced more than once, one direction only
fraction: 0.0010000 0.1266667
number (regions): 1 5
max size (bp): 3 98
avg. size (bp): 3 76
Regions sequenced both directions
fraction: 0.3186667 0.4273333
number (regions): 7 7
max size (bp): 297 300
avg. size (bp): 136 183
Regions sequenced once only
fraction: 0.5450000 0.3820000
number (regions): 10 11
max size (bp): 257 244
avg. size (bp): 163 104
Sequencing steps required to close
IntraMesh gaps (X): 3 3
InterMesh gaps (Y): 0 0
IntraMesh, one strand -> both strands(Z): 12 11
close all, both strands (Z+2*(X+Y)): 18 17
Total costs
Shotgun clone isolation and preparation (A): 10.00 15.00
Shotgun clone end sequencing (B): 200.00 300.00
Custom primer synthesis (C+D+E): 360.00 340.00
IntraMesh gaps (C): 120.00 120.00
InterMesh gaps (D): 0.00 0.00
IntraMesh 1->2 strands (E): 240.00 220.00
Custom primer sequencing (F+G+H): 180.00 170.00
IntraMesh gaps (F): 60.00 60.00
InterMesh gaps (G): 0.00 0.00
IntraMesh 1->2 strands (H): 120.00 110.00
InterMesh clone isolation (I): 0.00 0.00
Grand Total cost (A+B+C+D+E+F+G+H+I): 750.00 825.00
From this, we can see that for the process costs used here it would be less expensive to close via directed sequencing than by the addition of more random clones. However, the quality of the final sequence in the latter case is expected to be slightly higher, since more regions would have been sequenced more times.
Before we go on to discuss assembling your sequences, now is a good time to stop and consider how, exactly, you want to go about sequencing. There is one experimental detail which is paramount - you want to be very careful to size your inserts before you clone them, and then to size the plasmids after they are cloned, so that you do not introduce sequence from any hybrid clones from the experiment. When random sequence segments are fused together, they can make mincemeat of the assembly process, since they introduce false linkage between one part of the sequence and another.
The second point that I want to make is that, in general, the DNA assembly packages have limits on the number of fragments that they can handle, as well as the total length of DNA in the final assembled sequence. The GCG package will only handle 750 fragments of maximum length 2500 each, for a maximum contig length of 75000, and the sum of all sequence lengths in the project cannot exceed 380750 bases. While this may seem like rather a lot if you are used to working on plasmids, it is just at the limit for working with larger cosmids, and is too small for a YAC or BAC.
The other thing to consider is that shotgun assembly of N sequence fragments will, in the worst case, take a bit less than N squared comparisons to put the contigs together. So if you have 100 fragments to shotgun, that is only 10000 comparisons, which is not unreasonable with current hardware. However, if 10000 sequences might require 100 million comparisons, which is liable to be very time consuming. Divide and conquer is a good strategy, if you can do it. That is, if you need a total of N fragments, and have some convenient way to subdivide them into M regions, then the number of comparisons that need to be made is roughly
To start a new project, or to do some more work on an existing one, use the GelStart command. The progam expects the name of the project as the single parameter on the command line:
$ gelstart existing_project_nameor
$ gelstart/new new_project_name
If you are creating a new project Gelstart will create a subdirectory in your current default directory for it. If it is an existing project it will look downwards from your login directory for a subdirectory with the project's name. Only one person at a time should work on any given project, or they might end up corrupting the project data by doing incompatible operations. If you do a lot of separate projects it is a good idea to create them from a subdirectory, so that your main directory doesn't become too cluttered. Once you have run GelStart you can work from any directory - the GCG package will remember where the project is.
The subdirectory heirarchy for a sequencing project looks like this:
[.project_name]
miscellaneous configuration files, copies of vectors
[.ARCHIVE] original sequences
[.CONSENSUS] contig consensus sequences
[.RELATION] what goes where
[.WORKING] modified/edited sequences
Don't ever modify any of these files except through the fragment assembly programs or the project may become corrupted!
Some of the other qualifiers you might use with GelStart are:
/vectors=name.seq,gb:pb3222,gb_sy:cvkslic Vectors which were used for sequencing. Regions homologous to these will be detected and highlighted if GelEnter is run interactively. /sites=gaattc Sites which should be highlighted. Typically these are restriction sites which were used for cloning. /delete Delete the whole project. Use with care!
Let's start an example project for use during the rest of this lecture. It will have GAATTC sites marked, and the only vector used will be bluescript.
$ gelstart/new/vector=gb_sy:cvkslic/site=GAATTC example
Once the project exists sequence data may be entered into it through the program GelEnter. All sequences must first have been converted to GCG format. While GelEnter will allow you to type in your sequences interactively, you probably won't do much of that now that we have the DNA sequencing facility. Each fragment in a project must have a unique name, and that name will be derived from the name of the file which was input into the project. Give some thought to your naming convention before you start! In particular, fragments often have very similar names, and you want the part that varies to be as close to the front of the name as possible. Additionally, try to have the variable parts line up in columns. For instance, instead of
pclone_13_14_f.seq not all columns line up pclone_3_14_r.seq use p13_14f.seq all columns line up p03_14r.seq
Generally you want to put all of your new sequences into one directory, and then enter them all into the project at once with:
$ gelenter/enter=[-.mysequences]*.seq
For use in this lecture I have prepared an artificial test case which consists of 20 randomly selected 300 bp fragments between 2000 and 5000 in DMWHITE (http://seqaxp.bio.caltech.edu/www/seqanalysis/class.) Each was slightly corrupted with the addition of 3 mismatches and 1 indel. The program SHOTGUN was used to prepare this data. It is a modified version of the GCG CORRUPT program. Additionally, two bad sequences will be entered, one consists of two randomly selected pieces fused together (like a cloning mistake) and the other is a fragment contaminated with the vector (bluescript). Let's put those into our example now
$ gelenter/enter=class:test*.seq $ gelenter/enter=class:bad*.seq
Remember, GelStart must be run first for each session to tell the GCG software which project you are working on. Once the sequences have been entered into the project you don't really need to keep the originals. The raw sequences will always be in the [.archive] subdirectory of the project and will never be modified.
The next step is to find and remove any vector contamination. To do that use the program GelMerge. Remember, like all other GCG assembly programs, you must have run GelStart before you run GelMerge so that GelMerge will know which project to act on. One complication concerning the excision of vector sequence is that it only happens on single fragment contigs, ie, fragments that have not yet been merged, and it only happens for vector sequence that is on the *end* of a fragment. So if you happen to have two chunks of vector stuck side by side onto an insert, then you will only remove the distal one on the first pass. Start with:
$ gelmerge/report=file.txt/excise/nomerge
GelMerge aligns the sequences in a fragment assembly project into
assemblies called contigs. You can view and edit these
assemblies in GelAssemble.
What word size (* 7 *) ?
What fraction of the words in an overlap must match (* 0.80 *) ?
What is the minimum overlap length (* 14 *) ?
Reading ......................
Searching for Cvkslic
.*....................
Excising last 101 bases from Bad0002
Writing ......................
Input Contigs: 22
Output Contigs: 22
CPU time: 02.81
$ search file.txt excis
Bad0002 Length: 301 Excised last 101 bases
$ gelmerge/report=file.txt/excise/nomerge
Output not shown
$ search file.txt excis
%SEARCH-I-NOMATCHES, no strings matched
Repeat these two commands until the report says that no vectors were found, in the above example, that happened on the second run. Keep in mind that this method is designed to remove big chunks of vector, if you have a couple of bases from the vector on the end of the sequence it won't be removed, but it probably won't interfere with anything either. Once all of the vector contamination is gone, merge the sequences into overlapping groups of contigs:
$ gelmerge
GelMerge aligns the sequences in a fragment assembly project into
assemblies called contigs. You can view and edit these
assemblies in GelAssemble.
What word size (* 7 *) ?
What fraction of the words in an overlap must match (* 0.80 *) ?
What is the minimum overlap length (* 14 *) ?
Reading ......................
Comparing ......................
Aligning ..................
Writing ....
Input Contigs: 22
Output Contigs: 4
CPU time: 02.65
There are several other command line qualifiers that you can use with GelMerge, but probably the defaults will be adequate.
After running GelMerge you'll want to get an overview of the status of the assembly project that's a bit more complete than just the number of Contigs, which is what GelMerge output above. Obtain this information with the command:
$ gelview
GelView displays the structure of the contigs in a fragment
assembly project.
What should I call the output file (** Example.View **) ?
EXAMPLE has 22 Fragments in 4 Contigs
$ type example.view
GELVIEW Fragment Assembly contig display of Project: EXAMPLE
February 11, 1999 13:51
Contig: Bad0001
2 Bad0001 +---------------------------------->
C CONSENSUS +---------------------------------->
|----------|----------|----------|---------|---------|
0 100 200 300 400
Contig: Bad0002
6 Test0009 <--------------------------------+
5 Test0019 <--------------------------------+
4 Test0005 <--------------------------------+
3 Test0002 +--------------------------------->
2 Bad0002 +--------------------->
C CONSENSUS +----------------------------------------------->
|----------|----------|----------|---------|---------|
0 100 200 300 400
Contig: Test0001
10 Test0016 <----------------+
9 Test0007 +---------------->
8 Test0004 +---------------->
7 Test0011 <----------------+
6 Test0008 +---------------->
5 Test0014 +--------------->
4 Test0003 <----------------+
3 Test0012 +---------------->
2 Test0001 +--------------->
C CONSENSUS +--------------------------------------------------->
|----------|----------|----------|---------|---------|
0 200 400 600 800
Contig: Test0017
8 Test0020 <-------+
7 Test0015 +------->
6 Test0018 +-------->
5 Test0010 <-------+
4 Test0013 <-------+
3 Test0006 +------->
2 Test0017 +------->
C CONSENSUS +---------------------------->
|----------|----------|----------|---------|---------|
0 400 800 1200 1600
22 Fragments in 4 Contigs
From this we learn that GelMerge put the 21 good fragments into 3 contigs and the fusion into a separate one. Of course, if this was real shotgun sequencing data, we wouldn't know yet why that one fragment had not been merged into a contig. One point which should be obvious from this is that since we didn't put in any NONshotgunned fragments we have no simple way to orient this. It is possible that you might have a restriction map, or STS, or something else in there to help you sort it out. In general, it's a good idea to also throw in the sequences from the end of your clone/region to help orient things.
EGCG supplies a similar command to GelView, but which displays a bit more information, and which you might like better:
$ gelstatus
GELSTATUS reads a GCG Fragment Assembly database, and produces a
summary report of the quality of each contig.
What should I call the output file (* example.dat *) ?
$ type example.dat
Gelstatus of project EXAMPLE, February 11, 1999 13:53
No. Pct No. Pct
Contig Len Both Many Once Dif Dif Fragment Dir Len Dif Dif
---------- --- ---- ---- ---- --- --- ----------- --- --- --- ---
Bad0001 + 322 0 0.0
--------------- ---- --- ---
ad0001.Con 322 0 0 322 0 0.0 ( 1 fragments) 322 0 0.0
Test0009 - 305 3 1.0
Test0019 - 304 5 1.6
Test0005 - 304 3 1.0
Test0002 + 305 3 1.0
Bad0002 + 204 1 0.5
--------------- ---- --- ---
ad0002.Con 441 273 66 102 15 3.4 ( 5 fragments) 1422 15 1.1
Test0016 - 302 3 1.0
Test0007 + 302 5 1.7
Test0004 + 302 4 1.3
Test0011 - 302 3 1.0
Test0008 + 300 5 1.7
Test0014 + 305 5 1.6
Test0003 - 305 3 1.0
Test0012 + 305 6 2.0
Test0001 + 300 4 1.3
--------------- ---- --- ---
st0001.Con 951 610 140 201 35 3.7 ( 9 fragments) 2723 38 1.4
Test0020 - 302 5 1.7
Test0015 + 302 2 0.7
Test0018 + 301 2 0.7
Test0010 - 303 8 2.6
Test0013 - 304 4 1.3
Test0006 + 302 7 2.3
Test0017 + 302 6 2.0
--------------- ---- --- ---
st0017.Con 1071 509 291 271 25 2.3 ( 7 fragments) 2116 34 1.6
---- ---- ---- ---- --- --- -------------- ---- --- ---
(Total) 2785 1392 497 896 75 2.7 (Total) 6583 87 1.3
EXAMPLE has 22 fragments in 4 Contigs
Longest contig is Test0017.Con, length 1071
There is a lot of information in here, most of it quite useful. The program outputs:
Contig Contig name Length length of the contig both length sequences on both strands } many length sequenced more than both } these add up to Length once length sequenced once } No. Dif. Number of sites that differ from the consensus Pct Dif. Percent of sites that differ from the consensus
The sum of the differences for the individual fragments can be more here than versus the consensus because multiple incorrect assignments at a single site count as just 1 difference for the latter. High difference rates indicate a problem in the assembly, here there is no such problem.
Notice that the program doesn't leave much room for the contig names and truncates them from the left, so TEST0001.CON becomes ST0001.CON. That can get confusing, but the fragment names are shown and can usually be used to figure out the consensus name (which is always that of one of the fragments, the last in the list).
To look at a single contig use the EGCG GelPicture program (or GelAssemble).
$ gelpicture test0001
$ type test0001.pic
GelPicture of project EXAMPLE contig TEST0001, 2/11/1999 13:57
10 Test0016 - <-+----+----+---
9 Test0007 + ----+----+----+>
8 Test0004 + +----+----+----+>
7 Test0011 - <----+----+----+
6 Test0008 + ---+----+----+->
5 Test0014 + ----+----+----+>
4 Test0003 - <---+----+----+-
3 Test0012 + -+----+----+--->
2 Test0001 + ----+----+----+>
1 Consensus ----+----+----+----+----+----+----+----+----+-->
....+....+....+....+....+....+....+....+....+....+
1 1000
Pretty of project EXAMPLE contig TEST0001, February 11, 1999 13:57
2 Test0001 + 1 CGCAAAGTGCTCTAATTGAATTACATTTCGCATTAAAGCGCCCCAGTGGG 50
1 Consensus 1 ---------+---------+---------+---------+---------+ 50
2 Test0001 + 51 CGTGGCCCAGGGGGGGGGGTGCGGTGTGTGGGTTAACCAATCTGGGGCAT 100
1 Consensus 51 ---------+---------+---------+---------+---------+ 100
2 Test0001 + 101 TATTAAATTATCCCTTAGACACCGCAATCGCACTACCCCCTGGATTTCCC 150
1 Consensus 101 ---------+---------+---------+---------+---------+ 150
3 Test0012 + 1 ACACCCTCCCCCCCGCCACCAGCGCACACGGTCTTGGCTT 40
2 Test0001 + 151 TTTTTCCTCAACACCCTCCCCCCCGCCACCAGCGCACACGGTCTTGGCTT 200
1 Consensus 151 ---------+---------+---------+---------+---------+ 200
3 Test0012 + 41 TTGTTATTATAAGCTTTGCTTTGGCTTTTTGCCACCTTTCAGTTTGGTGC 90
2 Test0001 + 201 TTGTTATTATAAGCTTTGCTTTGGCTTTTTGCCACTTTTCAGTTTGGTGC 250
1 Consensus 201 ---------+---------+---------+-----y---+---------+ 250
3 Test0012 + 91 CAGGCACTTTTTAGCCATTAATCTTCTTTGACGCGCTTAAATCACATGAA 140
2 Test0001 + 251 CAGGCACTTTTTAGCCATTAATCTTCTTTGACGC...TAAATCACATGAA 300
1 Consensus 251 ---------+---------+---------+----gct--+---------+ 300
5 Test0014 + 1 GGCATATTAGTACGTGGCTTTACCGCCTTTGTACTAA 37
4 Test0003 - 305 TAGATGGCATATTAGTACGTGGCTTTACCGCCTTTGTAATAA 264
3 Test0012 + 141 TAACAAATTAGATGACATATTAGTACGTGGCTTTACCGCCTTGGTAATAA 190
1 Consensus 301 --------=+====g====+=========+=========+==t===a==+ 350
5 Test0014 + 38 TTTTTT...TTTTTTTTTAAGCCCAACATAATGATTCAAATGTTGTGGCA 87
4 Test0003 - 263 TTTTTT...TTTTTTTTTAAGCC.AACATAATG..TCAAATGTTGTGGCA 214
3 Test0012 + 191 TTTTTTGGGTTTTTTTTTAAGCCCAACATAATG..TCAAATGTTGTGGCA 240
1 Consensus 351 ======...+=========+===c=====+===..====+=========+ 400
6 Test0008 + 1 AAAAGACAGGGAAATGAG 18
5 Test0014 + 88 CCAAAAAAGAGAGCGAAAAAGGAAAGAATGCAAAAAGAAAGGGAAATGAG 137
4 Test0003 - 213 CCAAAAAAGAGAGCGAAAAAGGAAAGAATGCAAAAAGAAAGGGAAATGAG 164
3 Test0012 + 241 CCAAAAAAGAGAGCGAAAAAGGAAAGAATGCAAAAAGAAAGGGAAATGAG 290
1 Consensus 401 =========+=========+=========+========a+=========+ 450
8 Test0004 + 1 TT 2
7 Test0011 - 302 TTGGCTCTATATATTT 287
6 Test0008 + 19 GATGAAAATTAAATAACAAAAGGATCTCGCTCCCTTGGCTCTATATATTT 68
5 Test0014 + 138 GATGAAAATTAAATAACAAAAGGATCTCGCTCCCTTGGTTCTATATATTT 187
4 Test0003 - 163 GATGAAAATTAAATAACAAAAGGATCTCGCTCCCTTGGCTCTATATATTT 114
3 Test0012 + 291 GATGAAAATTAAATA 305
1 Consensus 451 =========+=========+=========+========c+=========+ 500
8 Test0004 + 3 TTGCATAATTCACTGGCCTGTCTAACGGTGCGTATACGTTATCCCCACTT 52
7 Test0011 - 286 TTGCATAATTCATTGGCCTGTCTAACGGTGCATATACGTTATCCCCACTT 237
6 Test0008 + 69 TTTCATAATTCATTGGCCTGTCTAACGGTGCGTATACGTTATCCCGACTT 118
5 Test0014 + 188 TTGCATAATTCATTGGCCTGTCTAACGGTGCGTATACGTTATCCACACTT 237
4 Test0003 - 113 TTGCATAATTCATTGGCCTGTCTAACGGTGCGTATACGTTATCCCCACTT 64
1 Consensus 501 ==g======+==t======+=========+=g=======+====cc===+ 550
8 Test0004 + 53 TCCCTTGAGCTTTTCCATTTCACCCCTCCCCCCCTCCACACACACACACA 102
7 Test0011 - 236 TCCCTTGAGCTTTTCCATTTCACCCCTCCCCCCCTCCACACACACACACA 187
6 Test0008 + 119 TCCCTTGAGCTTTTCCATTTCACCCCTCCCCCCCTCCACACACACACACA 168
5 Test0014 + 238 TCCCTTGAGCTTTTCCATTTCACCCCTCCCCCCCTCCACACACACACACA 287
4 Test0003 - 63 TCCCTTGAGCTTTTCCATTTCACCCTTCCCCCCCTGCACACACACACACA 14
1 Consensus 551 =========+=========+=====c===+=====c===+=========+ 600
10 Test0016 - 302 C 302
9 Test0007 + 1 GCATGCACATGTATTTATAAATATTTTTCGCCCC 34
8 Test0004 + 103 CACACTCACACAGACGGCATGCACATGTACTTATAAATATTTTTCGCCCC 152
7 Test0011 - 186 CACACTCACACAGACGGCATGCACATGTATTTAT..ATATTTTTCGCCCC 137
6 Test0008 + 169 CACACTCACACAGACGGCATGCACA..TATTTATAAATATTTTTCGCCCC 218
5 Test0014 + 288 CACACTCACACAGACGGC 305
4 Test0003 - 13 CACACTCACACAG 1
1 Consensus 601 =========+=========+=====tg==t====aa===+=========+ 650
10 Test0016 - 301 TTTGTGCTGTCAGTTTGCTAATTAGCTACGCATTTTCATTGTCGTTAAAA 252
9 Test0007 + 35 TTTGTGCTGTCAGTTTGCTAATTAGCTACGCATTTTCATTGTCGTTAAAA 84
8 Test0004 + 153 TTTGTGCTGTCAGTTTGCTAATTAGCTACGCATTTTCATTGTCGTTAAAA 202
7 Test0011 - 136 TTTGTGCTGTCAGTTTGCTAATTAGCTACGCATTTTCATTGTCGTTAAAA 87
6 Test0008 + 219 TTTGTGCTGTCAGTTTGCTAATTAGCTACGCATTTTCATTGTCGTTAAAA 268
1 Consensus 651 =========+=========+=========+=========+=========+ 700
10 Test0016 - 251 ACTCCATTCACATGCATATTTACTCACTCCGCCGAGTA..TATTATTCAT 202
9 Test0007 + 85 ACTCCATTCACATGCATATTTACTCACTCCGCCGAGTA..TATTATTCAT 134
8 Test0004 + 203 ACTCCATTCACATGCATATTTACTCACTCCGCCGAGTATTTATTATTCAT 252
7 Test0011 - 86 ACTCCATTCACATGCATATTTACTCACTCCGCCGAGTA..TATTATTCAT 37
6 Test0008 + 269 ACTCCATTCACATGCATATTTACTCACTCCGC 300
1 Consensus 701 =========+=========+=========+========..=========+ 750
10 Test0016 - 201 TTTGTTCGCCTTGTGTCCCCCTCATTCTTTTAATGCAATTTAATTATTTT 152
9 Test0007 + 135 TTTGGTCGCCTTGTGTCCCCCTCATTCTTTTAATGCAATTTAATTATTTT 184
8 Test0004 + 253 TTTGTTCGCCTTGTGTCCCCCTCATTCTTTTAATGCAATTTAATTATTTT 302
7 Test0011 - 36 TTTGTTCGCCTTGTGTCCCCCTCATTCTTTTAATGC 1
1 Consensus 751 ====t====+=========+=========+=========+=========+ 800
10 Test0016 - 151 ACTATATGTTTGCGTCACTGACAGAAAAGAGAAAAGCCCATGTCTTCT.A 102
9 Test0007 + 185 ACTATATGTTTGCGTCACAGACAGAAAAGAGAAAAGCCCATGTCTTCTGA 234
1 Consensus 801 =========+========w+=========+=========+========g+ 850
10 Test0016 - 101 CATGTGCCATATACCCCCGACCTCTGACCTCTTCCTGCCTGCTTTACTTC 52
9 Test0007 + 235 CATGTGCCATATACCCCCTACCTCTGACCTCTTCCTGCCT..TTTACTTC 284
1 Consensus 851 =========+========k+=========+=========+gc=======+ 900
10 Test0016 - 51 ACATTTTGATAAATTATATGCATTTTTAATCTAATTATATTGCAAAAAGC 2
9 Test0007 + 285 ACATTTTGATAAATTATA 302
1 Consensus 901 =========+========-+---------+---------+---------+ 950
10 Test0016 - 1 G 1
1 Consensus 951 - 951
If you will be making any custom primers to close the gaps in this sequence, you can verify their locations and directions. The primers file, if supplied, must be in the same format as a Findpatterns input file. Example:
$ create primers.dat
Some primers - where do they fall in the contigs?
------ ---- ------------------- --- --- ------------------ ..
prim1 0 atgatgcatcgggcacgat . ! first attempt
prim2 0 ggcgtgactatattaatcactt . ! assume GA is correct
prim3 0 aattcattggcctgtctaac . !
^Z
$ gelpicture/contig=test0001/primer=primers.dat
$ search/wind=(10,0) test0001.pic prim
8 Test0004 + 3 TTGCATAATTCACTGGCCTGTCTAACGGTGCGTATACGTTATCCCCACTT 52
7 Test0011 - 286 TTGCATAATTCATTGGCCTGTCTAACGGTGCATATACGTTATCCCCACTT 237
6 Test0008 + 69 TTTCATAATTCATTGGCCTGTCTAACGGTGCGTATACGTTATCCCGACTT 118
5 Test0014 + 188 TTGCATAATTCATTGGCCTGTCTAACGGTGCGTATACGTTATCCACACTT 237
4 Test0003 - 113 TTGCATAATTCATTGGCCTGTCTAACGGTGCGTATACGTTATCCCCACTT 64
1 Consensus 501 ==g======+==t======+=========+=g=======+====cc===+ 550
>******************>
prim3
There is a graphical variant of Gelpicture called GelFigure, but it's a bit buggy and doesn't really do much that Gelpicture doesn't.
Unfortunately GelMerge does not write out any explanation for why it did not fuse two contigs. In this example, and in general, a contig consisting of a single sequence isn't a good thing at this stage of assembly - it is likely there is something wrong with that fragment so that it should be removed from the project. One quick way to find out what is wrong is to compare it to the contig consensus sequences with FASTA and see what turns up. In this case, we see that it has high scores against two other contigs - that is, it appears to be a hybrid sequence. (Think about what repetitive sequences will do to a sequencing project!) In other cases, it might have no similarity to the other consensus sequences - which either means that more fragments are needed to close the gap, or the fragment is a contaminant.
$ fasta/infile=[.example.consensus]BAD0001.CON - /infile2=[.example.consensus]*.con - /nohisto/noalign/default $ search bad0001.fasta cons disk:[name.Example.Consensus]Bad0001.Con Bad0001...1288 1288 1288 disk:[name.Example.Consensus]Test0017.Con Test00... 569 569 573 disk:[name.Example.Consensus]Test0001.Con Test00... 504 504 690 disk:[name.Example.Consensus]Bad0001.Con /rev Ba... 56 56 56 disk:[name.Example.Consensus]Test0017.Con /rev T... 56 56 57 disk:[name.Example.Consensus]Bad0002.Con Bad0002... 54 54 68 disk:[name.Example.Consensus]Bad0002.Con /rev Ba... 44 44 45 disk:[name.Example.Consensus]Test0001.Con /rev T... 40 40 56
When doing shotgun sequencing there is a point at which it becomes more efficient to go to directed sequencing, rather than to keep adding randomly selected subclones. The EGCG program GelAnalyze processes the output of GelStatus to prepare a report which will help you decide if the project has reached that state. The reference is Lander and Waterman (1988) Genomics 2:231-239. GelAnalyze needs to be told how big the final sequence will be with the /size qualifier. You can have it predict what will happen if you add N more fragments with the /NEWFRAGS=N qualifier. You decide when to go over to directed sequencing by what it tells you about the number of contigs that would result and the size and distibution of gaps.
$ gelanalyze/size=3000/default
$ type example.ana
GELANALYZE of Gelstatus report example.dat, February 11, 1999 14:18
Qualifiers: /OVERlap=30. /SIZe=3000.
Number of fragments: 22
Average fragment length: 299.2
Total length of fragments: 6583
Sigma mean: 0.899, Sigma variance: 0.0001
(i) Number of apparent contigs
Actual: 4
Expected: 3.07
Allowing fragment length to vary: 3.07
(ii) Number of apparent contigs of j fragments
j Actual Expected
--- ------ --------
1 1 0.43 lines with 0 observed deleted
5 1 0.23
7 1 0.17
9 1 0.13
> 50 0 0.00
(ii') Number of apparent contigs of at least 2 fragments
Actual: 3
Expected: 2.64
Allowing fragment length to vary: 2.64
(iii) Number of clones in an apparent contig
Actual: 5.50
Expected: 7.17
Allowing fragment length to vary: 7.16
(iv) Length of an apparent contig
Actual: 696.25
Expected: 871.82
(v) Number of actual contigs if overlapping is perfect
Expected: 2.46
(vi) Probability that a gap of given length occurs
Length Given Gap Any Gap
------ --------- -------
50 0.56 0.92
100 0.39 0.78
150 0.27 0.62
200 0.19 0.47
250 0.13 0.34
300 0.09 0.25
350 0.06 0.18
400 0.04 0.13
450 0.03 0.09
500 0.02 0.06
(vi') Probability that a gap is real: 0.80
Maximum number of contigs: 4.1
occurs at redundancy (c) = 1.11
when total fragment length sequenced is 3334 bp
For instance, from GelStatus we know that there are 3000 - 2439 = 561 unsequenced bases and three contigs. That means we have either 4, 3 or 2 gaps, depending on whether or not the contigs extend to the ends of the sequence. These gaps must be of average size 140, 187, or 280 bp. The statistics are a bit odd though, and the mean isn't really what we want to know. Looking at section (vi) we find the probabilities for either a given gap, or any of the gaps, being a certain size. If we run GelAnalyze again, with more fragments, then compare the two results, it will give us an idea of what our odds are of closing the gaps with single directed sequencing steps. For instance, adding 10 more fragments:
$ gelanalyze/size=3000/newfrags=10
Let's have a look at the gap sizes we have in our test case (which we can do because we know the full sequence), and compare that to the expected gap sizes from the two example.ana files.
Example case gaps
Where what Length
2000 gap 23
2023 contig 943
2966 gap 60
3026 contig 410
3436 gap 196
3632 contig 1063
4695 gap 305
5000
Current +10 frags
Length | Given Any | Given Any | Observed
| Gap Gap | Gap Gap | Gap Sizes
------ | ----- ---- | ---- ---- | --------
50 | 0.63 0.95 | 0.43 0.64 | 1
100 | 0.45 0.84 | 0.25 0.41 | 1
150 | 0.32 0.69 | 0.15 0.25 | 0
200 | 0.23 0.55 | 0.09 0.15 | 1
250 | 0.17 0.42 | 0.05 0.09 | 0
300 | 0.12 0.32 | 0.03 0.05 | 0
350 | 0.09 0.24 | 0.02 0.03 | 1
400 | 0.06 0.17 | 0.01 0.02 | 0
450 | 0.04 0.13 | 0.01 0.01 | 0
500 | 0.03 0.09 | 0.00 0.01 | 0
We learn from this that there is a 24% chance that the current data has a gap of 350 bases or larger, which is large enough that it might not be spanned in a single directed sequencing run. On the other hand, after adding 10 more shotgun fragments, there would be only a 3% chance of such directed sequencing failing to bridge a gap. In addition, there is rather a lot of the 3Kb piece which has only been sequenced in one direction.
Often you want to know if your project may be reduced to a single contig by directed sequencing within clones. The locally written program ShowMesh can tell you this. In order to use it, you must have prepared an ends file, which contains the names of the end sequences from each clone (case is ignored). Here is an example showing how the program is run, and the contents of a small ends file, which only covers the ends of a few of the clones in the example project:
$ create ends.txt
Test0011,Test0004 !intracontig
Test0007,Test0017 !intramesh
Test0013,noright !right missing
noleft,Test0010 !left missing
^Z
$ showmesh [.example.relation] ends.txt showmesh.out
about to open ends.txt
Read in ends for 4 clones
opened file BAD0002.FIL;5
opened file TEST0001.FIL;3
opened file TEST0017.FIL;3
$ type showmesh.out
Test0011 TEST0001.FIL;3 Test0004 TEST0001.FIL;3 IntraContig
Test0007 TEST0001.FIL;3 Test0017 TEST0017.FIL;3 IntraMesh
Test0013 TEST0017.FIL;3 noright MISSING right_missing
noleft MISSING Test0010 TEST0017.FIL;3 left_missing
From this example, we can see that contigs TEST0001 and TEST0017 may be merged by sequencing within the clone which has ends Test0007,Test0017.
At this point you have an overview of your sequencing project. Since the sequences won't be perfect you will need to reconcile them manually using the program GelAssemble. It is a good idea to take a look at your contigs as you go along in the project, just to make sure that you don't find some problem like nonneighboring fragments that have been fused during the cloning. However, it isn't necessarily a good idea to start out right away editing the working sequences. You want to wait until you are pretty sure that you have all the sequences you're going to have in a given region - otherwise you may have to reedit the same region, which would be duplicated work.
To start GelAssemble issue the command:
$ GelAssemble
There is only one possible qualifier for this program, which is /contig=contigname, which will load that contig automatically for you when you start. GelAssemble is a multiple sequence editor that is a derivative of the GCG program Lineup. There is no choice other than to learn the slightly arcane set of command that it uses. If you get into the program in screen mode (after you've loaded a contig) and hit a ? it will put up a list of the commands. It is a good idea to print that screen and tape it up next to the terminal for handy reference. Some terminal emulators may need to be configured carefully so that they will work properly with GelAssemble (and Seqed, and Lineup...)
SCREEN MODE
[n] is an optional numeric parameter
Where no VT220 version is given, you may use the VT100 version.
VT52 or VT100 VT220 or VT240 Action
[n]<Right-arrow> move ahead [n bases]
[n]<Left-arrow> move back [n bases]
[n]<Up-arrow> move up [to row n]
[n]<Down-arrow> move down [to row n]
> <Next-screen> scroll one screen to the right
< <Prev-screen> scroll one screen to the left
<Ctrl>H <F12> move to start of the sequence
<Ctrl>E <F13> move to end of the sequence
/GATTC<Return> <Find>GATTC<Return> find next occurrence of GATTC
165<Return> move to base 165 in sequence
<Ctrl>A move to next ambiguity in alignment
<Ctrl>R move to next ambiguity in sequence
<Ctrl>V move to next gap in consensus
<Ctrl>Z <Do> enter Command Mode
<Ctrl>L toggle alignment display enlargement
<Ctrl>W redraw the screen
<Ctrl>O <F11> toggle INSERT/OVERSTRIKE mode
! summary of current sequence
? <Help> display these help screens
<Ctrl>G recalculate the consensus
G A T C .... add base at the cursor
<Delete> delete a base, or move sequence left
<Space bar> move the sequence to the right
<Ctrl>X delete alignment column
<Ctrl>I restore alignment column
<Ctrl>B <Select> begin selecting a range for removal
<Ctrl>N <Remove> remove the selected range
<Ctrl>P <Insert Here> insert the removed range
- reject current fragment
COMMAND MODE
[a,b] specifies a range of fragments.
[x,y] specifies a range of bases.
[n] is an optional numeric parameter.
EDit [ContigName] replace current contig with a new contig
CONTIGs select another contig for editing
WRite write a contig to the database
ERASE delete current contig from the database
EXit write the contig and quit
QUIT quit without writing
238 move to position 238 in the current fragment
[x,y] PRETTYout [FileName] write the sequence alignment [position x - y]
[a,b] SEQOUT write fragments [a - b] to sequence files
BIGPICture [FileName] write bar schematic to an output file
OVERstrike select OVERSTRIKE sequence edit mode
NOOVERstrike select INSERT sequence edit mode
[x,y] CONSensus recalculate the consensus sequence
[a,b] LOCk lock strands [a through b]
[a,b] Unlock unlock strands [a through b]
[x,y] SELect select bases [x through y]
REMove remove the selected bases
[n] INSert insert the removed bases [at position n]
CAncel cancel the selection
[x,y] DElete delete bases [x through y]
GOTo [FragmentName] move to strand by name
FInd GAATC find the next occurrence of GAATC
DIfferences show differences from the consensus
MAtches show matches with the consensus
Neither show neither matches nor differences
REDraw redraw the screen
Help display these help screens
SORt [DEScending] sorts strands by their offsets in alignment
[a,b] MOve moves a strand [from line a to line b]
OPen opens a blank line at the cursor position
[a,b] ANChor anchors strands [a through b]
[a,b] NOANchor unanchors strands [a through b]
LOad [ContigName] loads another contig into the Edit Screen
REVerse reverse-complement the (anchored) strand(s)
[n] Offset shifts the current fragment [to begin at n]
REJect removes the current fragment from the screen
NODUPlicate removes a duplicated fragment from the screen
SPAWN renames a duplicated fragment
SEParate makes two contigs from anchored and
unanchored strands
A quick aside. It is tempting to use GelAssemble by searching for each mismatch, fixing it, possibly after consulting the traces, and going on. However, if the sequence happens to be bad in the same way on the 2 or 3 sequences that cover a region you won't ever check it. The tools used by the genome centers work around this problem by assigning a quality to each region, so that regions of low quality are pointed out to you. In the GCG assembly system this isn't an option. As a bare minimum, you should review the output from GelPicture and look carefully through the traces for those regions which have not been sequenced at least once in each direction. Ideally, there would be no regions like this - everything would have been sequenced in both directions.
Ok, now let's use GelAssemble a little just to give you an overview of how it works. First, let's remove the BAD0001 sequence from the database, this completely removes it, which is only something you want to do if you suspect the sequence is really useless.
$ gelassemble/contig=bad0001 ^Z :erase Y ^Z :quit
Now let's rebuild the database and work on some of the contigs:
$ set term/page=40/width=132 $ gelmerge $ gelassemble
Gelassemble has three modes:
The program starts in contig mode, if the project is nearing completion there will be only one contig.Scroll through the contigs and pick one to load with ^K. To get from screen mode to command mode use ^Z, the program returns to screen mode after each command mode command. To get back to the contig screen from screen mode:
^Z :contigs
but only do this after using WRITE to save any changes!
The commands are listed above and are available from within the program, by typing a ?. The ones you'll use the most for routine editing are:
^A to find next mismatch ^O to toggle insert/overstrike modes ^X to delete a column ^I to undelete a column (only if the cursor has not moved) ^G to recalculate the consensus
If in command mode you anchor a group of sequences together, then any change made to one of these sequences will also be made to all of the others. This is particularly useful if you want to manually assemble two contigs into one. First load the first contig from contig mode, then load the second contig via:
^Z :load name_of_second_contig
You will then be able to move the entire second contig, which is loaded with all sequences anchored to each other, with respect to the first contig.
To prevent the modification of one or more sequences use the Lock command.
To save any changes to a contig, and then continue editing, use Write from command mode. The Exit command will also save the changes to the last contig which was modified. Quit will not save any changes which have been made!
One other note, you may want to define AGCT to be elsewhere on the keyboard, for instance under QWER (you may not remap the number keys). Use the program Setkeys to do this. The resulting file "set.keys" must be in the directory where you are running GelAssemble.
When you are done with the project you can pull the final consensus sequence out. It will be in [.project...]*.con (you cannot predict the name). The .CON files are just GCG sequence files, so copy one out and rename it.
Questions? Next week we will cover tools for protein analysis.