Today we'll start running through the tools that are available for planning and analyzing your molecular Biology experiments.
Restriction mapping is one of the most common sequence analysis tasks. Starting with a piece of known DNA sequence, and the recognition sites for the restriction enzymes, there are three actions which are typically performed - the GCG package has a program for each:
MAP is the tool to use to generate a detailed restriction digest of a known DNA sequence. The most commonly used qualifiers for this program are:
/begin=100 begin position on DNA /end=200 end position on DNA /six sixcutters only /open=20 translate only for ORF >20 aa. /once only cuts once /mincuts=N control # of cuts accepted /maxcuts=M
$ map/infile=gb_in:dmwhite/begin=1/end=100/six/default $ type dmwhite.map
This and subsequent output files have been edited for clarity or brevity. In this instance, - the list of sites which cut and those which didn't cut has been omitted. The restriction sites are marked at the 5' end of the cut site. Multiple sites which cut at the same location are grouped above a / symbol. The three reading frames a,b,c are frames 1,2,3 from the beginning of the sequence. * is a stop codon.
E
Ac B
po s
oR b
II I
/
GAATTCTGCTCCAGTTGCCAGTTGCCAGCGACTTGACTTATTACTCGCAGTAAGAGCAAC
1 ---------+---------+---------+---------+---------+---------+ 60
CTTAAGACGAGGTCAACGGTCAACGGTCGCTGAACTGAATAATGAGCGTCATTCTCGTTG
a E F C S S C Q L P A T * L I T R S K S N -
b N S A P V A S C Q R L D L L L A V R A T -
c I L L Q L P V A S D L T Y Y S Q * E Q H -
MAP may also be used with the locally developed qualifiers /FRAME, which causes it to output a list of open reading frames, and /VALine, which causes it to accept GTG in addition to ATG as the start of an open reading frame. In the example, the known coding regions in this vector are also shown. The ORF corresponding to the first CDS region (italics) extends quite far 5' to the actual start - this is because the promoter does not contain any STOP codons. promoter region which is used. Also notice that many of the ORFs are located inside the coding regions, but in different frames. That is a very common observation - that a real coding region tends to have more ORFs in other frames than does a noncoding region. s3 indicates that the ORF starts with a START (rather than the front end of the fragment), and ends with a STOP (rather than the end of the fragment)./menu=s tells the program that we want to see all six reading frames, t would have been the three forward.
$ map/infile=gb_sy:Synpbr322/circular/frame/open=60/menu=s/default $ type Synpbr322.map FRM00001 +1 259 642 384 s3 in FRM00004 FRM00002 +1 1180 1518 339 s3 FRM00003 +1 1882 2106 225 s3 CDS 1915. .2106 FRM00004 +2 86 1276 1191 s3 CDS 86. .1276 FRM00005 +2 1523 1723 201 s3 FRM00006 +2 2009 2218 210 s3 FRM00007 +3 1593 1811 219 s3 FRM00008 +3 2298 2480 183 s3 FRM00009 +3 3423 3689 267 s3 in FRM00011 FRM00010 -1 1515 1883 369 s3 FRM00011 -2 3293 4153 861 s3 CDS complement(3293. .4153) FRM00012 -2 623 1081 459 s3 in FRM00004 FRM00013 -2 323 565 243 s3 in FRM00004 FRM00014 -3 439 789 351 s3 in FRM00004
The standard set of restriction enzymes has hundreds of entries, and in most cases you'll only ever use the ones in your lab fridge, perhaps 10 or 20 enzymes. So it is convenient to make a file which contains only those enzymes, so that you don't have to select them one by one each time from the master list. There are two ways to do this: either copy the full enzyme file and edit it, or use MAPSELECT.
$ copy genrundata:enzyme.dat []my_enzymes.dat $ edit my_enzymes.dat
The qualifiers in the following command have these meanings:
$ MAPSELECT/infile=gb_in:dmwhite/begin=1/end=2 -
/select=1/menu=n/outfile=nla0:/enzfile=my_enzymes.dat
MAPSELECT selects restriction enzymes by name or by their ability to
cut a given sequence, and writes them to a new enzyme file for use in other
programs.
Select the enzymes: Type nothing or "*" to get all enzymes. Type "?"
for help on which enzymes are available and how to select them.
Enzyme(* * *): ecori
"ECORI" selected 1 enzyme, new total: 1. Enzyme: ecorv
EcoRV
"ECORV" selected 1 enzyme, new total: 2. Enzyme: bamhi
BamHI
"BAMHI" selected 1 enzyme, new total: 3. Enzyme: hindiii
HindIII
"HINDIII" selected 1 enzyme, new total: 4. Enzyme:
$
In either case, the file my_enzymes.dat now holds your set
of enzymes. Eliminate manual selection by adding the /enzyme
qualifier to any MAP command:
$ MAP/infile=gb_in:dmwhite/begin=1/end=1000 -
/data=my_enzymes.dat/enzyme=*/default
Map commands can be customized. For instance:
$ MYMAP :== $genutil:map/data=my_enzymes.dat $ mymap gb_in:dmwhite
In this example a graphical representation is made of the restriction digest of the DMWHITE entry with the enzymes contained in my_enzymes.dat. If a mark file named dmwhite.mrk exists it will also be displayed. Here we create one which contains the known exons for the gene in this piece of DNA.
$ tektronix versaterm term $ create dmwhite.mrk .. 7686 7757 10853 11153 11228 11882 11944 12259 12480 12611 12682 13296 ^Z $ mapplot/infile=gb_in:dmwhite/data=my_enzymes.dat/default
$ MAPSORT/infile=gb_in:dmwhite/begin=1/end=2000 -
/data=my_enzymes.dat/default
$ type dmwhite.mapsort
Using enzyme data from: MY_ENZYMES.DAT;1 FileCheck: 4368
With 4 enzymes: *
February 8, 1999 12:38 ..
EcoRI G'AATT_C Cuts at: 0 1 1860 2000
Size: 1 1859 140
Fragments arranged by size:
1859 140 1
EcoRV GAT'ATC Cuts at: 0 1962 2000
Size: 1962 38
HindIII A'AGCT_T Cuts at: 0 1155 2000
Size: 1155 845
Enzymes that do cut:
EcoRI EcoRV HindIII
Enzymes that do not cut:
BamHI
Since this was a linear fragment the program assigned cuts at postion 0 (one before the first base pair) and at 2000 (the final base pair). This would not have happened had we specified /CIRCular on the command line. Use the /DIGest qualifier if you want to do a multiple digest with more than one enzyme.
$ MAPSORT/infile=gb_in:dmwhite/begin=1/end=2000 -
/digest -
/data=my_enzymes.dat/default
$ type dmwhite.mapsort
Using enzyme data from: ]MY_ENZYMES.DAT;1 FileCheck: 4368
With 4 enzymes: *
February 8, 1999 12:34 ..
A: EcoRI G'AATT_C B: EcoRV GAT'ATC C: HindIII A'AGCT_T
Cuts at: 0 1A 1155C 1860A 1962B 2000
Size: 1 1154 705 102 38
Fragments arranged by size:
1154 705 102 38 1
Enzymes that do cut:
EcoRI EcoRV HindIII
Enzymes that do not cut:
BamHI
Adding the /SILent qualifier tells the MAP programs to look for a translationally silent restriction site. Here is a specific example, where a BamHI site is placed inside an exon. Since we know that this is an exon, we only need to see the known translation frame, which is set via the /menu=a qualifier.
$ map/infile=gb_in:dmwhite/begin=11230/end=11882/silent - /enzyme=bamhi/menu=a
Here is part of the resulting dmwhite.map file:
b
a
m
h
i
TGCGGCGTGGCCTATCCGGGCGAACTTTTGGCCGTGATGGGCAGTTCCGGTGCCGGAAAG
11230 +---------+---------+---------+---------+---------+--------- 11289
ACGCCGCACCGGATAGGCCCGCTTGAAAACCGGCACTACCCGTCAAGGCCACGGCCTTTC
a C G V A Y P G E L L A V M G S S G A G K -
The map shows a location where a BamHI site could be engineered in by site specific mutagenesis without changing the coding of the exon. Here's how it would work:
Current site: GGCAGT -> GGC, AGT -> Gly,Ser New BamHI site: GGATCC -> GGA, TCC -> Gly,Ser
The enzyme names for silent sites are shown in lower case, while the names of enzymes with existing restriction sites are written in upper case. The translated frame is determined by /begin - the map should show the first amino acid flush left on the first line.
Process restriction data. GCG doesn't supply any programs for calculating molecular weights for DNA fragments from their migration distances in electrophoresis experiments. To do this, GELFRAG, which is a locally written program. GELFRAG is basically the same as using semilog paper, but you just type in the data and it produces the plot and fits a line to the points.
$ tektronix versaterm term $ gelfrags 3.4,1000 2.4,2000 4.4,500 3.2,0 1.1 5 Read in 3.000000 pairs of standards values and Read in 3.000000 unknowns weight = a * log10(distance) + b a,b = -0.3010305 4.023504 Distance weight weight %error observed known predicted 3.400000 1000.000 1000.001 5.4931639E-05 2.400000 2000.000 2000.003 1.6479495E-04 4.400000 500.0001 499.9995 -1.0986326E-04 3.200000 - 1148.699 1.100000 - 4924.591 5.000000 - 329.8765 $
There are three programs available for this task: DIGED, MAPC, and MAPL. These are quite old. They were written by W. Pearson, and published in: Nucleic Acids Research (1982)10:217-227.
DIGED is used for data entry. For circular data it may ask for Xoff, which is the offset of the first known cut for each enzyme from the 0.0 point - if you don't know this value enter -1.0. There must be at least one fragment in each double digest that isn't present in either of the singles.
Like this:
$ diged
1 new data
1 linear
one short name for first enzyme
two short name for second enzyme
blank line to terminate the enzyme list
7.0 3.0 / single digest pieces for "one", in Kb.
3.0 3.0 4.0 / single digest pieces for "two"
1 2 / IBEG and IEND, see below
1 3 /
3.0 3.0 1.0 3.0 / double digest pieces
6 write it out
mydata.txt the file to put it in
one line comment a one line comment
7 quit diged
Consider a case where an enzyme cuts 4 times, and makes fragments of size 1.1, 2.0, 2.5, 3.1, and 4.1. If you don't know where any of those fragments are on the original linear fragment, you would specify IBEG=1 and IEND=5. However, if you knew that 2.0 was on the left side, and 2.5 on the right, you would instead enter the data like this:
Here's an example using real data, which is supplied with Pearson's programs. This is a 5.55 Kb fragment which was singly and doubly digested with BamI, BglII, and PstI. The reason you have to copy the data file to your local directory is that the program opens it with read/write access, which would be denied on the shared file.
$!
$ copy digest:ct2.dat []ct2.dat
$ mapc
ct2.dat
0.2 2.0
test.out
$!
$ type test.out
pGT55 glutathione-s-transferase 16-Jan-81
ERROR = 0.2000 EFACT = 2.000
1764 Digestions calculated in 0.02 sec
Bam1
A 3.7600 B 1.3800 C 0.42000
Bgl2
A 5.5500
Pst1
A 4.3200 B 0.43000 C 0.39000 D 0.34000
T ERROR=0.425E-02 D ERROR=0.437E-02
Bam1 ---1----------------A----------------1-C1------B--|--1-------
Bam1 0.375 <-A-> 4.135 <-C-> 4.555 <-B->
Bgl2 -------------A------------------------2-----------|----------
Bgl2 4.267 <-A->
Pst1 -------------A------------------3-B-3-C-3D-3------|----------
Pst1 3.612 <-B-> 4.042 <-C-> 4.432 <-D-> 4.772 <-A->
Bam1 A C B
Bgl2 A
Pst1 B C D A
For each solution (there are usually many) each enzyme is represented by three lines. On the first line the cuts are shown by a number, and the fragments by letters. In the second line, the numbers represent the position of the cuts, and the letter flanked by arrows by the fragments. The third line shows the order of the fragments as entered into the program. For instance, in the data shown above, the fragments for Bam1 were: A=3.76, B=1.38, and C=0.42. The map shows cuts at positions 0.375, 4.135, and 4.555. Fragment A is between the first and second cuts, C between the second and third, and B wraps around and lies between the third and fourth.
The T error represents an error measurement for the first two enzymes, the D error represents an error measurement covering all enzymes (if only two enzymes, the values are the same). There is some form of normalization applied to these though, so that either T or D might be larger in any given solution. When the programs ask for Error and Efact what it wants are the upper limits for the T and D errors. Here is another solution which has slightly higher error values. The only difference is that the Bgl2 site is moved 141 bases.
T ERROR=0.550E-02 D ERROR=0.437E-02 Bam1 ---1----------------A----------------1-C1------B--|--1------- Bam1 0.375 <-A-> 4.135 <-C-> 4.555 <-B-> Bgl2 --------------A------------------------2----------|---------- Bgl2 4.408 <-A-> Pst1 -------------A------------------3-B-3-C-3D-3------|---------- Pst1 3.612 <-B-> 4.042 <-C-> 4.432 <-D-> 4.772 <-A-> Bam1 A C B Bgl2 A Pst1 B C D A
It can take a relatively long time to figure out a complicated map (or it may fail altogether if the accuracy of the fragment sizes is poor). For instance, there is a lambda digest dataset supplied with the software, it uses 5 enzymes which make 23 unique single digest fragments, and around 80 double digest fragments, and it takes 72 seconds to run.
Normally you only know the molecular weights of your fragments to an accuracy of perhaps 2-5%. Unless the analysis returns a single clear result, with no reasonable alternative solutions, it is a worthwhile exercise to repeat the analysis using slightly different values for the molecular weights. That is, intead of 3.1,2.1, and 5.1, you might use 3.0, 2.2, and 5.0. By comparing the results you should get a feel for the "stability" of the solutions. It will also help you determine which other data you may need to obtain in order to resolve the map. For instance, you might find that you need to put one or more cuts inside a particular fragment in order to nail down its true position.
The MAPC and MAPL programs are coded such that they cannot handle a digest with one enzyme that creates more than 10 fragments.
There are a lot of tools and approaches for this.
The GCG program FRAMES graphically displays open reading frames. The default is really for bacteria - it shows from start to stop. When used for Eukaryotes (except on cDNA) use /ALL switch to look for regions lacking stops. In the graphic produced, START is above the line, STOP is below, so large ORFs show up as clear areas below the line. If a mark file is used, there is a sort of optical illusion where it can appear that an ORF extends over surrounding STOP sites. This is evident in the following result if you look at the second and final ORFs. The illusion is removed if you place a ruler on the output, and there are tic marks above and below the graphs to help you keep the ruler exactly vertical.
$ FRAMES/infile=gb_in:dmwhite/all/mark=dmwhite.mrk - /begin=10000/end=14000
Organisms tend to use certain codons more than others. An ORF can be checked against this table to determine if the codon usage in it is typical for that particular organism. The GCG CODONPREFERENCE program can do this for you. In the following example, we use the local extension /NOSTART, which shows all stops, but no starts. This is a clearer way to show the open reading frames in this program than the GCG supplied /ALLframes qualifier. Unlike in the FRAMES program, STOP in this program is a vertical line which is both above and below the axis. Rare codons are also indicated.
$ codonpreference/infile=gb_in:dmwhite - /freq=data:DRO.COD - /begin=10000/end=14000 - /nostart - /mark=dmwhite.mrk/default
The GCG TESTCODE program implements the algorithm of Dr. James Fickett, see Nucl. Acids Res. 10(17); 5303-5318 (1982). This algorithm looks for certain statistical behavior of the DNA within the sliding window tested (read the GCG manual for a complete description.) Theoretically, points in the top region are 95% likely to be coding, in the bottom region, 95% likely to be noncoding. The vertical bars above the plot are STARTs, and the diamonds are STOPs, shown for the three forward reading frames.
The method depends on statistical properties of the DNA, and so is most sensitive for open reading frames of at least 200 base pairs. Unfortunately, because of splicing, there are a lot of eukaryotic open reading frames that are much shorter than that. Notice that in the FRAMES output there was what appeared to be a good ORF at 10.3-10.5 Kb. In the TESTCODE output this region is a peak, but it scores lower than do the other ORFs.
$ testcode/infile=gb_in:dmwhite/mark=dmwhite.mrk - /begin=10000/end=14000/default
There are a slew of more sophisticated programs around which claim that they can find genes. Among them, Geneid, Netgene, Grail, and Genefinder. Some of these run remotely, some of them locally. In the test cases that I've run genefinder, a program from Phil Green's lab, seems to do the best. Possibly this is because I use Drosophila test cases and the others may have been trained/optimized for mammalian codon usage.
$ genefinder dros
Organisms supported are: human nem thal dros
Using DROS tables
Command What
gf genefinder
xgf X11 based genefinder
gf -h for list of command line options (enclose in "" as they are
case sensitive)
Convert any GCG sequences to fasta before feeding them into genefinder.
$ tofasta/infile=whatever.seq/outfile=whatever.fasta
$ gf -seqfile whatever.fasta
$ tofasta/infile=gb_in:dmwhite/out=dmwhite.fasta/default
$ define/user sys$output exons.txt
$ define/user sys$error messages.txt
$ gf -seqfile dmwhite.fasta
$ type gftranspred.genes
>DMWHITE.C.cand.1 94
MNHSEPFERTESRGTNDSRSKVVEQKVNVLRRAGGRQRKSNYETWCGGDS
AIEVEELAQDIRYPKDVDTLATRVTSARQLKKCSTPARSPIIG*
>DMWHITE.C.cand.2 137
MTVLPFCDSFFSTWTTLWAVNESSPEVGSSQIRSGGSVSASEANARRFLS
PPDRPFTLPGTPMIGADKEVILLDIGAPGLHLLGVHRLAIEQSHPGWRYL
DALRRSKGKGIQQGRLSGTGTAHHGQKFARIGHAAN*
>DMWHITE.cand.1 749
MVEEPCHSHHPSSHRPIVHRPTTCDPGPVNRPPPSAHFQGHPPSASPNLH
ILNRPRHLFFWETWKTRILGIFIKQMGSFFIKPRGGGGADASQSCINQGF
GQAKNYGTLLPPSPPEDSGSGSGQLAENLTYAWHNMDIFGAVNQPGSGWR
QLVNRTRGLFCNERHIPAPRKHLLKNVCGVAYPGELLAVMGSSGAGKTTL
LNALAFRSPQGIQVSPSGMRLLNGQPVDAKEMQARCAYVQQDDLFIGSLT
AREHLIFQAMVRMPRHLTYRQRVARVDQVIQELSLSKCQHTIIGVPGRVK
GLSGGERKRLAFASEALTDPPLLICDEPTSGLDSFTAHSVVQVLKKLSQK
GKTVILTIHQPSSELFELFDKILLMAEGRVAFLGTPSEAVDFFSYVGAQC
PTNYNPADFYVQVLAVVPGREIESRDRIAKICDNFAISKVARDMEQLLAT
KNLEKPLEQPENGYTYKATWFMQFRAVLWRSWLSVLKEPLLVKVRLIQTT
MVAILIGLIFLGQQLTQVGVMNINGAIFLFLTNMTFQNVFATINVFTSEL
PVFMREARSRLYRCDTYFLGKTIAELPLFLTVPLVFTAIAYPMIGLRAGV
LHFFNCLALVTLVANVSTSFGYLISCASSSTSMALSVGPPVIIPFLLFGG
FFLNSGSVPVYLKWLSYLSWFRYANEGLLINQWADVEPGEISCTSSNTTC
PSSGKVILETLNFSAADLPLDYVGLAILIVSFRVLAYLALRLRARRKE*
$ fromfasta/infile=gftranspred.genes
FromFastA reformats one or more sequences from FastA format into
individual files in GCG format.
dmwhiteccand1. 94 aa.
dmwhiteccand2. 137 aa.
dmwhitecand1. 749 aa.
Finished FROMFASTA with 3 files written.
980 sequence characters were reformatted.
$ rename dmwhitec*.; .pep
Then BLAST them all (not shown) and find that:
How well did Genefinder do? Here is a comparison between what it predicted, and the actual gene:
Exons: Predicted Known Result - 7686..7757 missed 10374..10638 - false exon 10890..11153 10853..11153 close, wrong splice (11228..11882) 11228..11882 correct 11944..12259 11944..12259 correct 12480..12611 12480..12611 correct (12682..13296) 12682..13296 correct
The gene finding tools often get the splices wrong. If you find a very similar gene when searching the databases with a predicted protein, and there are gaps, go back and check the exons.txt file, to see if there are other splice sites which would make the predicted protein more consistent with the other protein.
To use GRAIL you need to first obtain a GRAIL ID. If you use this service on Seqaxp it will be stored for you in a file in your directory, and automatically retrieved when you queue a job. To obtain an ID, or to do a search, use one of these commands:
$ GRAIL $ GRAIL "1A" $ GRAIL "2"
and answering the questions. There are three versions of GRAIL called 1 (the default), 1A, and 2. 1A and 2 seem to work a bit better, but still not so well as genefinder. Here is the result of a GRAIL 1 run on DMWHITE, which does find all of the exons, but at the cost of many false positive regions, and less accuracy in calling the splice donor/acceptors.
>GB_IN:DMWHITE, len = 14245 Final Exon Predication: start/ acceptor/ rf donor stop strand quality orf | Real -------------------------------------------------------------- 1732 - 1791 f 1 excellent 1675 - 1791 | 2391 - 2479 f 2 good 2390 - 2479 | 4353 - 4448 f 2 good 4331 - 4474 | 4680 - 4801 f 2 excellent 4646 - 4801 | 6200 - 6306 f 2 good 6110 - 6328 | 6361 - 6605 f 3 good 6312 - 6605 | 6722 - 6835 f 2 good 6698 - 6835 | 7686 - 7757 f 3 marginal 7674 - 7802 | 7686..7757 9843 - 9936 f 3 good 9813 - 9965 | 10882 - 11153 f 2 excellent 10853 - 11179 | 10853..11153 11266 - 11883 f 1 excellent 11221 - 11883 | 11228..11882 11944 - 12259 f 2 excellent 11915 - 12286 | 11944..12259 12480 - 12611 f 3 excellent 12465 - 12626 | 12480..12611 12709 - 13296 f 1 excellent 12682 - 13296 | 12682..13296
Dan Prestridge's PROSCAN software can check DNA sequence for polymerase III promoters. This program works best for primates, or at least mammals. It is not very useful for other organisms, especially Drosophila and yeast. The current version has a slight and intermittant bug when it uses GCG formatted sequence, so convert to plain sequence first. It can take quite a while to run, and it only runs interactively, so drop your priority first.
$ totext/infile=gb_pr:HUMHBB/outfile=test.seq/default $ set proc/prio=2 $ proscan
The file PREDICTIONS.; will hold the results. It has a lot of information, but to see just the predicted locations, and their scores use:
$ search predictions.; promoter
Here is a comparison of the above PROSCAN run with the documented
promoter locations. (In this context Documented is defined as the space between
genes that were found, or 1 kb upstream, whichever is smallest.)
Documented Predicted Hit?
10520 10732 -
18289 19289 epsilon globin
23290 23540 -
26570 26820 -
33504 33754 +???
33478 34478 34211 34461 + G-globin
38414 39414 39147 39397 + A-globin
44710 45710 pseudo-hbp
46268 46518 -
53740 54740 delta globin
59813 60063 -
61550 61800 +???
61137 62137 61875 62125 + beta-globin
*72732 72482 -
*43320 43070 -
*42857 42607 -
*11995 11745 -
There are 5 real genes, and the program found promoters for 3 -> 60% sensitivity. However, it also found 9 false positives in 73308 bp. That means for every 8.1 Kb it analyzes, it will call 1 false positive promoter.
The default mode for PRIME is to find primers for PCR, if you want sequencing primers use the qualifiers /FORwardprimers or /REVerseprimers. There are a lot of available command line qualifiers for the parameters that vary, like salt concentrations, desired melting temperatures, and so forth.
$ prime/infile=gb_in:dmwhite/begin=1/end=300/forward $ figure prime.figure
$ type dmwhite.prime
The contents of this file have been edited to decrease its size.
PRIME of: Gb_In:Dmwhite ck: 6789 from: 1 to: 300 February 5, 1999 13:56
INPUT SUMMARY
-------------
Input sequence: Gb_In:Dmwhite
*** PRIME is set to search for primers on forward strand only. ***
Primer constraints:
primer size: 18 - 22
primer 3' clamp: S
primer sequence ambiguity: NOT ALLOWED
primer GC content: 40.0 - 55.0%
primer Tm: 50.0 - 65.0 degrees Celsius
primer self-annealing. . .
3' end: < 8.0 (weight: 2.0)
total: < 14.0 (weight: 1.0)
unique primer binding sites: required
primer-template and primer-repeat annealing. . .
3' end: ignored
total: ignored
repeated sequences screened: none specified
PRIMER SUMMARY
--------------
forward reverse
Number of primers considered: 665 0
Number of primers rejected for . . .
primer 3' clamp: 167 0
primer sequence ambiguity: 0 0
primer GC content: 366 0
primer Tm: 17 0
non-unique binding sites: 0 0
primer self-annealing: 27 0
primer-template annealing: 0 0
primer-repeat annealing: 0 0
Number of primers accepted: 88 0
Number of primers saved: 25 0
--------------------------------------------------------------------------------
Primer: 1 forward strand primer (19-mer): 169 AGTTTTTGGGTGGTTGGTG 187
[DNA] = 50.000 nM [salt] = 50.000 mM
primer %GC: 47.4 Tm (degrees Celsius): 58.3 annealing score: 10.0
--------------------------------------------------------------------------------
Primer: 2 forward strand primer (18-mer): 170 GTTTTTGGGTGGTTGGTG 187
[DNA] = 50.000 nM [salt] = 50.000 mM
primer %GC: 50.0 Tm (degrees Celsius): 57.1 annealing score: 10.0
--------------------------------------------------------------------------------
Primer: 3
[DNA] = 50.000 nM [salt] = 50.000 mM
forward strand primer (21-mer): 24 GCCAGCGACTTGACTTATTAC 44
primer %GC: 47.6 Tm (degrees Celsius): 56.7 annealing score: 12.0
--------------------------------------------------------------------------------
etc.
$ melt/infile=gb_in:dmwhite/begin=1/end=300/default $ extract/head=10 dmwhite.melt Melt of: GB_IN:DMWHITE check: 9858 from: 1 to 300 1 GAATTCTGCTCCAGTTGCCA 20 Tm=61.3 GC%=50.0 2 AATTCTGCTCCAGTTGCCAG 21 Tm=60.4 GC%=50.0 3 ATTCTGCTCCAGTTGCCAGT 22 Tm=59.9 GC%=50.0 4 TTCTGCTCCAGTTGCCAGTT 23 Tm=61.0 GC%=50.0 5 TCTGCTCCAGTTGCCAGTTG 24 Tm=62.6 GC%=55.0 6 CTGCTCCAGTTGCCAGTTGC 25 Tm=64.4 GC%=60.0 7 TGCTCCAGTTGCCAGTTGCC 26 Tm=67.0 GC%=60.0 $ meltplot/infile=gb_in:dmwhite/begin=1/end=300/default
The black line in the plot is %GC, the blue line is the Tm (given at the last point in the sliding window, for that whole window), and the red line is "product". For the latter two, the /FORMamide qualifier is broken.
We also have some programs froom L. Lerman and W. Fripp for handling denaturing gradient gel electrophoresis (DGGE), specifically for when that gradient is a temperature gradient. DGGE is a physical method for detecting point mutations in a known sequence - it does not require that each potential mutant be sequenced. It depends upon the observation that when DNA is electrophoresed through a denaturing gradient, the mobility drops precipitously at the point in the gradient where the first large domain melts. Heteroduplexes of a point mutation with wild type DNA will melt at a slightly lower point than will the the homoduplex for either the wild type or point mutation. So a large number of potential mutations may be screened by PCRing the region of interest, hybridizing that DNA to a wild type fragment, and then running all of them at once on a denaturing gel. There are a fair number of complications that affect this scheme, and these programs are designed to help you determine if the region you are interested in is amenable to this type of analysis. The programs are pretty straightforward to use, but they are a bit specialized to cover in detail here. Briefly, these are the three programs:
$ help @local general openvms melt
Here is what some of the output from MELT looks like:
$ run melt:plotmelt BG340N.5MM example
Next week we'll cover sequencing and sequence assembly.