Fundamentals of Sequence Analysis, 1998-1999

Fundamentals of Sequence Analysis, 1998-1999

Lecture 4. Tools for Molecular Biology I.

Today we'll start running through the tools that are available for planning and analyzing your molecular Biology experiments.


DNA mapping problems

Restriction mapping is one of the most common sequence analysis tasks. Starting with a piece of known DNA sequence, and the recognition sites for the restriction enzymes, there are three actions which are typically performed - the GCG package has a program for each:

  1. MAP provides a detailed analysis of which enzymes cut and which don't, where they cut, and how those cuts lines up with the translated peptide in 3 or 6 reading frames.
  2. MAPSORT provides a detailed analysis of which enzymes cut and which don't, and displays the fragment sizes which would result from single or multiple digestions.
  3. MAPPLOT provides a detailed analysis of which enzymes cut and which don't, and displays the results graphically, at a fairly low resolution. That is, it lets you see that a site is at about 4.3 Kb, but doesn't provide the resolution to say that it is at 4331 bp.


1. Detailed restriction digestion of a known DNA sequence

MAP is the tool to use to generate a detailed restriction digest of a known DNA sequence. The most commonly used qualifiers for this program are:

  /begin=100                     begin position on DNA
  /end=200                       end position on DNA
  /six                           sixcutters only
  /open=20                       translate only for ORF >20 aa.
  /once                          only cuts once
  /mincuts=N                     control # of cuts accepted
  /maxcuts=M

$ map/infile=gb_in:dmwhite/begin=1/end=100/six/default
$ type dmwhite.map

This and subsequent output files have been edited for clarity or brevity. In this instance, - the list of sites which cut and those which didn't cut has been omitted. The restriction sites are marked at the 5' end of the cut site. Multiple sites which cut at the same location are grouped above a / symbol. The three reading frames a,b,c are frames 1,2,3 from the beginning of the sequence. * is a stop codon.



    E
   Ac                                                        B
   po                                                        s
   oR                                                        b
   II                                                        I
    /
   GAATTCTGCTCCAGTTGCCAGTTGCCAGCGACTTGACTTATTACTCGCAGTAAGAGCAAC
 1 ---------+---------+---------+---------+---------+---------+ 60
   CTTAAGACGAGGTCAACGGTCAACGGTCGCTGAACTGAATAATGAGCGTCATTCTCGTTG

a  E  F  C  S  S  C  Q  L  P  A  T  *  L  I  T  R  S  K  S  N   -
b   N  S  A  P  V  A  S  C  Q  R  L  D  L  L  L  A  V  R  A  T  -
c    I  L  L  Q  L  P  V  A  S  D  L  T  Y  Y  S  Q  *  E  Q  H -


MAP may also be used with the locally developed qualifiers /FRAME, which causes it to output a list of open reading frames, and /VALine, which causes it to accept GTG in addition to ATG as the start of an open reading frame. In the example, the known coding regions in this vector are also shown. The ORF corresponding to the first CDS region (italics) extends quite far 5' to the actual start - this is because the promoter does not contain any STOP codons. promoter region which is used. Also notice that many of the ORFs are located inside the coding regions, but in different frames. That is a very common observation - that a real coding region tends to have more ORFs in other frames than does a noncoding region. s3 indicates that the ORF starts with a START (rather than the front end of the fragment), and ends with a STOP (rather than the end of the fragment)./menu=s tells the program that we want to see all six reading frames, t would have been the three forward.

$ map/infile=gb_sy:Synpbr322/circular/frame/open=60/menu=s/default
$ type Synpbr322.map
FRM00001 +1    259    642    384 s3      in FRM00004
FRM00002 +1   1180   1518    339 s3
FRM00003 +1   1882   2106    225 s3      CDS 1915. .2106
FRM00004 +2     86   1276   1191 s3      CDS 86. .1276
FRM00005 +2   1523   1723    201 s3
FRM00006 +2   2009   2218    210 s3
FRM00007 +3   1593   1811    219 s3
FRM00008 +3   2298   2480    183 s3
FRM00009 +3   3423   3689    267 s3      in FRM00011
FRM00010 -1   1515   1883    369 s3
FRM00011 -2   3293   4153    861 s3      CDS complement(3293. .4153)
FRM00012 -2    623   1081    459 s3      in FRM00004
FRM00013 -2    323    565    243 s3      in FRM00004
FRM00014 -3    439    789    351 s3      in FRM00004


2. Make your own set of enzymes

The standard set of restriction enzymes has hundreds of entries, and in most cases you'll only ever use the ones in your lab fridge, perhaps 10 or 20 enzymes. So it is convenient to make a file which contains only those enzymes, so that you don't have to select them one by one each time from the master list. There are two ways to do this: either copy the full enzyme file and edit it, or use MAPSELECT.

$ copy genrundata:enzyme.dat []my_enzymes.dat
$ edit my_enzymes.dat

The qualifiers in the following command have these meanings:

$  MAPSELECT/infile=gb_in:dmwhite/begin=1/end=2 -
  /select=1/menu=n/outfile=nla0:/enzfile=my_enzymes.dat

MAPSELECT selects restriction enzymes by name or by their ability to
cut a given sequence, and writes them to a new enzyme file for use in other
programs.

 Select the enzymes:  Type nothing or "*" to get all enzymes. Type "?"
 for help on which enzymes are available and how to select them.

                                       Enzyme(* * *):  ecori
 "ECORI" selected 1 enzyme, new total: 1.     Enzyme:  ecorv
 EcoRV
 "ECORV" selected 1 enzyme, new total: 2.     Enzyme:  bamhi
 BamHI
 "BAMHI" selected 1 enzyme, new total: 3.     Enzyme:  hindiii
 HindIII
 "HINDIII" selected 1 enzyme, new total: 4.   Enzyme:
$

In either case, the file my_enzymes.dat now holds your set
of enzymes.  Eliminate manual selection by adding the /enzyme
qualifier to any MAP command:

$ MAP/infile=gb_in:dmwhite/begin=1/end=1000 -
 /data=my_enzymes.dat/enzyme=*/default

Map commands can be customized. For instance:

$ MYMAP :== $genutil:map/data=my_enzymes.dat
$ mymap gb_in:dmwhite

3. Graphical representation of restriction maps.

In this example a graphical representation is made of the restriction digest of the DMWHITE entry with the enzymes contained in my_enzymes.dat. If a mark file named dmwhite.mrk exists it will also be displayed. Here we create one which contains the known exons for the gene in this piece of DNA.


$ tektronix versaterm term
$ create dmwhite.mrk
..
 7686  7757
10853 11153
11228 11882
11944 12259
12480 12611
12682 13296
^Z
$ mapplot/infile=gb_in:dmwhite/data=my_enzymes.dat/default


4. Restriction digest to find fragment sizes.

$ MAPSORT/infile=gb_in:dmwhite/begin=1/end=2000 -
   /data=my_enzymes.dat/default
$ type dmwhite.mapsort
 Using enzyme data from: MY_ENZYMES.DAT;1  FileCheck: 4368
 With 4 enzymes: * 
                         February  8, 1999 12:38  ..
EcoRI G'AATT_C     Cuts at:      0       1    1860    2000 
   Size:          1    1859     140
  Fragments arranged by size: 
                1859     140       1
EcoRV GAT'ATC      Cuts at:      0    1962    2000 
   Size:       1962      38
HindIII A'AGCT_T   Cuts at:      0    1155    2000 
   Size:       1155     845

 Enzymes that do cut:
    EcoRI    EcoRV  HindIII
 Enzymes that do not cut: 
    BamHI

Since this was a linear fragment the program assigned cuts at postion 0 (one before the first base pair) and at 2000 (the final base pair). This would not have happened had we specified /CIRCular on the command line. Use the /DIGest qualifier if you want to do a multiple digest with more than one enzyme.

$ MAPSORT/infile=gb_in:dmwhite/begin=1/end=2000 -
  /digest -
  /data=my_enzymes.dat/default
$ type dmwhite.mapsort
 Using enzyme data from: ]MY_ENZYMES.DAT;1  FileCheck: 4368
 With 4 enzymes: * 
                         February  8, 1999 12:34  ..

A: EcoRI G'AATT_C  B: EcoRV GAT'ATC  C: HindIII A'AGCT_T  

Cuts at:      0       1A   1155C   1860A   1962B   2000 
   Size:          1    1154     705     102      38
  Fragments arranged by size: 
                1154     705     102      38       1

 Enzymes that do cut:
    EcoRI    EcoRV  HindIII
 Enzymes that do not cut: 
    BamHI


5. Inserting a translationally silent restriction site.

Adding the /SILent qualifier tells the MAP programs to look for a translationally silent restriction site. Here is a specific example, where a BamHI site is placed inside an exon. Since we know that this is an exon, we only need to see the known translation frame, which is set via the /menu=a qualifier.


$ map/infile=gb_in:dmwhite/begin=11230/end=11882/silent -
   /enzyme=bamhi/menu=a

Here is part of the resulting dmwhite.map file:

                                                b
                                                a
                                                m
                                                h
                                                i
         TGCGGCGTGGCCTATCCGGGCGAACTTTTGGCCGTGATGGGCAGTTCCGGTGCCGGAAAG
   11230 +---------+---------+---------+---------+---------+--------- 11289
         ACGCCGCACCGGATAGGCCCGCTTGAAAACCGGCACTACCCGTCAAGGCCACGGCCTTTC

a        C  G  V  A  Y  P  G  E  L  L  A  V  M  G  S  S  G  A  G  K   -

The map shows a location where a BamHI site could be engineered in by site specific mutagenesis without changing the coding of the exon. Here's how it would work:


Current site:   GGCAGT  -> GGC, AGT -> Gly,Ser 
New BamHI site: GGATCC  -> GGA, TCC -> Gly,Ser

The enzyme names for silent sites are shown in lower case, while the names of enzymes with existing restriction sites are written in upper case. The translated frame is determined by /begin - the map should show the first amino acid flush left on the first line.


5. Handling data from physical restriction digestions.

Process restriction data. GCG doesn't supply any programs for calculating molecular weights for DNA fragments from their migration distances in electrophoresis experiments. To do this, GELFRAG, which is a locally written program. GELFRAG is basically the same as using semilog paper, but you just type in the data and it produces the plot and fits a line to the points.


$ tektronix versaterm term
$ gelfrags
  3.4,1000
  2.4,2000
  4.4,500
  3.2,0
  1.1
  5


Read in    3.000000     pairs of standards values and
Read in    3.000000     unknowns

weight = a * log10(distance) + b
a,b =  -0.3010305       4.023504

  Distance         weight         weight          %error
  observed          known      predicted
  3.400000       1000.000       1000.001      5.4931639E-05
  2.400000       2000.000       2000.003      1.6479495E-04
  4.400000       500.0001       499.9995     -1.0986326E-04
  3.200000           -          1148.699
  1.100000           -          4924.591
  5.000000           -          329.8765

$


6. Calculate a restriction map from a double digest.

There are three programs available for this task: DIGED, MAPC, and MAPL. These are quite old. They were written by W. Pearson, and published in: Nucleic Acids Research (1982)10:217-227.

DIGED is used for data entry. For circular data it may ask for Xoff, which is the offset of the first known cut for each enzyme from the 0.0 point - if you don't know this value enter -1.0. There must be at least one fragment in each double digest that isn't present in either of the singles.

Like this:


$ diged
1                   new data
1                   linear
one                 short name for first enzyme
two                 short name for second enzyme
                    blank line to terminate the enzyme list
7.0 3.0 /           single digest pieces for "one", in Kb.
3.0 3.0 4.0 /       single digest pieces for "two"
1  2 /              IBEG and IEND, see below
1  3 /
3.0 3.0 1.0 3.0 /   double digest pieces
6                   write it out
mydata.txt          the file to put it in
one line comment    a one line comment
7                   quit diged

Consider a case where an enzyme cuts 4 times, and makes fragments of size 1.1, 2.0, 2.5, 3.1, and 4.1. If you don't know where any of those fragments are on the original linear fragment, you would specify IBEG=1 and IEND=5. However, if you knew that 2.0 was on the left side, and 2.5 on the right, you would instead enter the data like this:

This tells the program that the 2.0 fragment goes on the left, the 2.5 on the right, and that the positions of the 2nd through 4th fragments needs to be determined. Similarly, if there were many more fragments, and much more information, you could specify the positions of multiple fragments on each end.

Here's an example using real data, which is supplied with Pearson's programs. This is a 5.55 Kb fragment which was singly and doubly digested with BamI, BglII, and PstI. The reason you have to copy the data file to your local directory is that the program opens it with read/write access, which would be denied on the shared file.

$!
$ copy digest:ct2.dat []ct2.dat
$ mapc
ct2.dat
0.2 2.0
test.out
$!
$ type test.out
  pGT55 glutathione-s-transferase 16-Jan-81

ERROR =  0.2000  EFACT =  2.000
    1764  Digestions calculated in     0.02 sec

Bam1
A   3.7600     B   1.3800     C  0.42000
Bgl2
A   5.5500
Pst1
A   4.3200     B  0.43000     C  0.39000     D  0.34000

T ERROR=0.425E-02  D ERROR=0.437E-02
Bam1     ---1----------------A----------------1-C1------B--|--1-------
Bam1   0.375 <-A->   4.135 <-C->   4.555 <-B->
Bgl2     -------------A------------------------2-----------|----------
Bgl2   4.267 <-A->
Pst1     -------------A------------------3-B-3-C-3D-3------|----------
Pst1   3.612 <-B->   4.042 <-C->   4.432 <-D->   4.772 <-A->
Bam1   A C B
Bgl2   A
Pst1   B C D A

For each solution (there are usually many) each enzyme is represented by three lines. On the first line the cuts are shown by a number, and the fragments by letters. In the second line, the numbers represent the position of the cuts, and the letter flanked by arrows by the fragments. The third line shows the order of the fragments as entered into the program. For instance, in the data shown above, the fragments for Bam1 were: A=3.76, B=1.38, and C=0.42. The map shows cuts at positions 0.375, 4.135, and 4.555. Fragment A is between the first and second cuts, C between the second and third, and B wraps around and lies between the third and fourth.

The T error represents an error measurement for the first two enzymes, the D error represents an error measurement covering all enzymes (if only two enzymes, the values are the same). There is some form of normalization applied to these though, so that either T or D might be larger in any given solution. When the programs ask for Error and Efact what it wants are the upper limits for the T and D errors. Here is another solution which has slightly higher error values. The only difference is that the Bgl2 site is moved 141 bases.

T ERROR=0.550E-02  D ERROR=0.437E-02
Bam1     ---1----------------A----------------1-C1------B--|--1-------
Bam1   0.375 <-A->   4.135 <-C->   4.555 <-B->
Bgl2     --------------A------------------------2----------|----------
Bgl2   4.408 <-A->
Pst1     -------------A------------------3-B-3-C-3D-3------|----------
Pst1   3.612 <-B->   4.042 <-C->   4.432 <-D->   4.772 <-A->
Bam1   A C B
Bgl2   A
Pst1   B C D A

It can take a relatively long time to figure out a complicated map (or it may fail altogether if the accuracy of the fragment sizes is poor). For instance, there is a lambda digest dataset supplied with the software, it uses 5 enzymes which make 23 unique single digest fragments, and around 80 double digest fragments, and it takes 72 seconds to run.

Normally you only know the molecular weights of your fragments to an accuracy of perhaps 2-5%. Unless the analysis returns a single clear result, with no reasonable alternative solutions, it is a worthwhile exercise to repeat the analysis using slightly different values for the molecular weights. That is, intead of 3.1,2.1, and 5.1, you might use 3.0, 2.2, and 5.0. By comparing the results you should get a feel for the "stability" of the solutions. It will also help you determine which other data you may need to obtain in order to resolve the map. For instance, you might find that you need to put one or more cuts inside a particular fragment in order to nail down its true position.

The MAPC and MAPL programs are coded such that they cannot handle a digest with one enzyme that creates more than 10 fragments.


Finding ORFS and coding regions.

There are a lot of tools and approaches for this.


1. Finding open reading frames.

The GCG program FRAMES graphically displays open reading frames. The default is really for bacteria - it shows from start to stop. When used for Eukaryotes (except on cDNA) use /ALL switch to look for regions lacking stops. In the graphic produced, START is above the line, STOP is below, so large ORFs show up as clear areas below the line. If a mark file is used, there is a sort of optical illusion where it can appear that an ORF extends over surrounding STOP sites. This is evident in the following result if you look at the second and final ORFs. The illusion is removed if you place a ruler on the output, and there are tic marks above and below the graphs to help you keep the ruler exactly vertical.

$ FRAMES/infile=gb_in:dmwhite/all/mark=dmwhite.mrk -
   /begin=10000/end=14000


2. Checking Codon Preferences.

Organisms tend to use certain codons more than others. An ORF can be checked against this table to determine if the codon usage in it is typical for that particular organism. The GCG CODONPREFERENCE program can do this for you. In the following example, we use the local extension /NOSTART, which shows all stops, but no starts. This is a clearer way to show the open reading frames in this program than the GCG supplied /ALLframes qualifier. Unlike in the FRAMES program, STOP in this program is a vertical line which is both above and below the axis. Rare codons are also indicated.

$ codonpreference/infile=gb_in:dmwhite -
  /freq=data:DRO.COD -
  /begin=10000/end=14000 -
  /nostart -
  /mark=dmwhite.mrk/default


3. Other statistical behavior of coding regions.

The GCG TESTCODE program implements the algorithm of Dr. James Fickett, see Nucl. Acids Res. 10(17); 5303-5318 (1982). This algorithm looks for certain statistical behavior of the DNA within the sliding window tested (read the GCG manual for a complete description.) Theoretically, points in the top region are 95% likely to be coding, in the bottom region, 95% likely to be noncoding. The vertical bars above the plot are STARTs, and the diamonds are STOPs, shown for the three forward reading frames.

The method depends on statistical properties of the DNA, and so is most sensitive for open reading frames of at least 200 base pairs. Unfortunately, because of splicing, there are a lot of eukaryotic open reading frames that are much shorter than that. Notice that in the FRAMES output there was what appeared to be a good ORF at 10.3-10.5 Kb. In the TESTCODE output this region is a peak, but it scores lower than do the other ORFs.


$ testcode/infile=gb_in:dmwhite/mark=dmwhite.mrk -
  /begin=10000/end=14000/default


Gene finding programs.

1. Genefinder.

There are a slew of more sophisticated programs around which claim that they can find genes. Among them, Geneid, Netgene, Grail, and Genefinder. Some of these run remotely, some of them locally. In the test cases that I've run genefinder, a program from Phil Green's lab, seems to do the best. Possibly this is because I use Drosophila test cases and the others may have been trained/optimized for mammalian codon usage.


$ genefinder dros
Organisms supported are:  human nem thal dros
Using DROS tables

   Command   What
   gf        genefinder
   xgf       X11 based genefinder

   gf -h     for list of command line options (enclose in "" as they are
             case sensitive)

   Convert any GCG sequences to fasta before feeding them into genefinder.

   $ tofasta/infile=whatever.seq/outfile=whatever.fasta
   $ gf -seqfile whatever.fasta

$ tofasta/infile=gb_in:dmwhite/out=dmwhite.fasta/default
$ define/user sys$output exons.txt
$ define/user sys$error  messages.txt
$ gf -seqfile dmwhite.fasta
$ type gftranspred.genes
>DMWHITE.C.cand.1   94
MNHSEPFERTESRGTNDSRSKVVEQKVNVLRRAGGRQRKSNYETWCGGDS
AIEVEELAQDIRYPKDVDTLATRVTSARQLKKCSTPARSPIIG*
>DMWHITE.C.cand.2   137
MTVLPFCDSFFSTWTTLWAVNESSPEVGSSQIRSGGSVSASEANARRFLS
PPDRPFTLPGTPMIGADKEVILLDIGAPGLHLLGVHRLAIEQSHPGWRYL
DALRRSKGKGIQQGRLSGTGTAHHGQKFARIGHAAN*
>DMWHITE.cand.1   749
MVEEPCHSHHPSSHRPIVHRPTTCDPGPVNRPPPSAHFQGHPPSASPNLH
ILNRPRHLFFWETWKTRILGIFIKQMGSFFIKPRGGGGADASQSCINQGF
GQAKNYGTLLPPSPPEDSGSGSGQLAENLTYAWHNMDIFGAVNQPGSGWR
QLVNRTRGLFCNERHIPAPRKHLLKNVCGVAYPGELLAVMGSSGAGKTTL
LNALAFRSPQGIQVSPSGMRLLNGQPVDAKEMQARCAYVQQDDLFIGSLT
AREHLIFQAMVRMPRHLTYRQRVARVDQVIQELSLSKCQHTIIGVPGRVK
GLSGGERKRLAFASEALTDPPLLICDEPTSGLDSFTAHSVVQVLKKLSQK
GKTVILTIHQPSSELFELFDKILLMAEGRVAFLGTPSEAVDFFSYVGAQC
PTNYNPADFYVQVLAVVPGREIESRDRIAKICDNFAISKVARDMEQLLAT
KNLEKPLEQPENGYTYKATWFMQFRAVLWRSWLSVLKEPLLVKVRLIQTT
MVAILIGLIFLGQQLTQVGVMNINGAIFLFLTNMTFQNVFATINVFTSEL
PVFMREARSRLYRCDTYFLGKTIAELPLFLTVPLVFTAIAYPMIGLRAGV
LHFFNCLALVTLVANVSTSFGYLISCASSSTSMALSVGPPVIIPFLLFGG
FFLNSGSVPVYLKWLSYLSWFRYANEGLLINQWADVEPGEISCTSSNTTC
PSSGKVILETLNFSAADLPLDYVGLAILIVSFRVLAYLALRLRARRKE*
$ fromfasta/infile=gftranspred.genes

FromFastA reformats one or more sequences from FastA format into
individual files in GCG format.

 dmwhiteccand1.  94 aa.
 dmwhiteccand2.  137 aa.
 dmwhitecand1.  749 aa.

 Finished FROMFASTA with 3 files written.
 980 sequence characters were reformatted.
$ rename dmwhitec*.; .pep

Then BLAST them all (not shown) and find that:

How well did Genefinder do? Here is a comparison between what it predicted, and the actual gene:

 Exons:
 Predicted               Known               Result

 -                       7686..7757          missed
 10374..10638            -                   false exon
 10890..11153            10853..11153        close, wrong splice
 (11228..11882)          11228..11882        correct
 11944..12259            11944..12259        correct
 12480..12611            12480..12611        correct
 (12682..13296)          12682..13296        correct

The gene finding tools often get the splices wrong. If you find a very similar gene when searching the databases with a predicted protein, and there are gaps, go back and check the exons.txt file, to see if there are other splice sites which would make the predicted protein more consistent with the other protein.


2. GRAIL.

To use GRAIL you need to first obtain a GRAIL ID. If you use this service on Seqaxp it will be stored for you in a file in your directory, and automatically retrieved when you queue a job. To obtain an ID, or to do a search, use one of these commands:

$ GRAIL
$ GRAIL "1A"
$ GRAIL "2"

and answering the questions. There are three versions of GRAIL called 1 (the default), 1A, and 2. 1A and 2 seem to work a bit better, but still not so well as genefinder. Here is the result of a GRAIL 1 run on DMWHITE, which does find all of the exons, but at the cost of many false positive regions, and less accuracy in calling the splice donor/acceptors.

>GB_IN:DMWHITE, len = 14245

Final Exon Predication:

start/ acceptor/   rf   
donor   stop strand     quality       orf      |     Real
--------------------------------------------------------------
 1732 -  1791   f   1  excellent  1675 -  1791 |   
 2391 -  2479   f   2  good       2390 -  2479 |
 4353 -  4448   f   2  good       4331 -  4474 |
 4680 -  4801   f   2  excellent  4646 -  4801 |
 6200 -  6306   f   2  good       6110 -  6328 |
 6361 -  6605   f   3  good       6312 -  6605 |
 6722 -  6835   f   2  good       6698 -  6835 |
 7686 -  7757   f   3  marginal   7674 -  7802 |  7686..7757
 9843 -  9936   f   3  good       9813 -  9965 |
10882 - 11153   f   2  excellent 10853 - 11179 | 10853..11153
11266 - 11883   f   1  excellent 11221 - 11883 | 11228..11882
11944 - 12259   f   2  excellent 11915 - 12286 | 11944..12259
12480 - 12611   f   3  excellent 12465 - 12626 | 12480..12611 
12709 - 13296   f   1  excellent 12682 - 13296 | 12682..13296

3. Finding Polymerase III promoters.

Dan Prestridge's PROSCAN software can check DNA sequence for polymerase III promoters. This program works best for primates, or at least mammals. It is not very useful for other organisms, especially Drosophila and yeast. The current version has a slight and intermittant bug when it uses GCG formatted sequence, so convert to plain sequence first. It can take quite a while to run, and it only runs interactively, so drop your priority first.


$ totext/infile=gb_pr:HUMHBB/outfile=test.seq/default
$ set proc/prio=2
$ proscan

The file PREDICTIONS.; will hold the results. It has a lot of information, but to see just the predicted locations, and their scores use:

$ search predictions.; promoter

Here is a comparison of the above PROSCAN run with the documented
promoter locations.  (In this context Documented is defined as the space between
genes that were found, or 1 kb upstream, whichever is smallest.)

Documented      Predicted   Hit?
                10520 10732 -
 18289 19289                     epsilon globin
                23290 23540 -
                26570 26820 -
                33504 33754 +???
 33478 34478    34211 34461 +    G-globin
 38414 39414    39147 39397 +    A-globin
 44710 45710                     pseudo-hbp
                46268 46518 -
 53740 54740                     delta globin
                59813 60063 -
                61550 61800 +???
 61137 62137    61875 62125 +    beta-globin
               *72732 72482 -
               *43320 43070 -
               *42857 42607 -
               *11995 11745 -

There are 5 real genes, and the program found promoters for 3 -> 60% sensitivity. However, it also found 9 false positives in 73308 bp. That means for every 8.1 Kb it analyzes, it will call 1 false positive promoter.


Making Primers:

The GCG PRIME program

The default mode for PRIME is to find primers for PCR, if you want sequencing primers use the qualifiers /FORwardprimers or /REVerseprimers. There are a lot of available command line qualifiers for the parameters that vary, like salt concentrations, desired melting temperatures, and so forth.


$ prime/infile=gb_in:dmwhite/begin=1/end=300/forward
$ figure prime.figure

$ type dmwhite.prime

 The contents of this file have been edited to decrease its size.


PRIME of: Gb_In:Dmwhite  ck: 6789  from: 1 to: 300  February  5, 1999 13:56

                                 INPUT SUMMARY
                                 -------------

    Input sequence: Gb_In:Dmwhite
 *** PRIME is set to search for primers on forward strand only. ***
 Primer constraints:
    primer size: 18 - 22
    primer 3' clamp: S
    primer sequence ambiguity: NOT ALLOWED
    primer GC content: 40.0 - 55.0%
    primer Tm: 50.0 - 65.0 degrees Celsius
    primer self-annealing. . .
       3' end: <   8.0    (weight:  2.0)
        total: <  14.0    (weight:  1.0)
    unique primer binding sites: required
    primer-template and primer-repeat annealing. . .
       3' end: ignored
        total: ignored
    repeated sequences screened: none specified

                                           PRIMER SUMMARY
                                           --------------
                                     forward            reverse
 Number of primers considered:           665                  0
 Number of primers rejected for . . .
               primer 3' clamp:          167                  0
     primer sequence ambiguity:            0                  0
             primer GC content:          366                  0
                     primer Tm:           17                  0
      non-unique binding sites:            0                  0
         primer self-annealing:           27                  0
     primer-template annealing:            0                  0
       primer-repeat annealing:            0                  0
 Number of primers accepted:              88                  0
    Number of primers saved:              25                  0
--------------------------------------------------------------------------------
 Primer: 1   forward strand primer (19-mer):    169 AGTTTTTGGGTGGTTGGTG 187   
 [DNA] = 50.000 nM   [salt] = 50.000 mM
 primer %GC:  47.4 Tm (degrees Celsius):  58.3 annealing score:  10.0
--------------------------------------------------------------------------------
 Primer: 2   forward strand primer (18-mer):    170 GTTTTTGGGTGGTTGGTG 187   
 [DNA] = 50.000 nM   [salt] = 50.000 mM
 primer %GC:  50.0 Tm (degrees Celsius):  57.1 annealing score:  10.0
--------------------------------------------------------------------------------
 Primer: 3
 [DNA] = 50.000 nM   [salt] = 50.000 mM
 forward strand primer (21-mer):     24 GCCAGCGACTTGACTTATTAC 44    
 primer %GC:  47.6 Tm (degrees Celsius):  56.7 annealing score:  12.0
--------------------------------------------------------------------------------
etc.


2. Melting temperatures for DNA

Sometimes you also want to know melting temperatures for DNA along it's length (ok, not exactly primers, but related). Here are two EGCG programs:


$ melt/infile=gb_in:dmwhite/begin=1/end=300/default
$ extract/head=10 dmwhite.melt
Melt of: GB_IN:DMWHITE check: 9858  from: 1  to 300


  1 GAATTCTGCTCCAGTTGCCA  20 Tm=61.3 GC%=50.0
  2 AATTCTGCTCCAGTTGCCAG  21 Tm=60.4 GC%=50.0
  3 ATTCTGCTCCAGTTGCCAGT  22 Tm=59.9 GC%=50.0
  4 TTCTGCTCCAGTTGCCAGTT  23 Tm=61.0 GC%=50.0
  5 TCTGCTCCAGTTGCCAGTTG  24 Tm=62.6 GC%=55.0
  6 CTGCTCCAGTTGCCAGTTGC  25 Tm=64.4 GC%=60.0
  7 TGCTCCAGTTGCCAGTTGCC  26 Tm=67.0 GC%=60.0

$ meltplot/infile=gb_in:dmwhite/begin=1/end=300/default

The black line in the plot is %GC, the blue line is the Tm (given at the last point in the sliding window, for that whole window), and the red line is "product". For the latter two, the /FORMamide qualifier is broken.


3. DGGE analysis

We also have some programs froom L. Lerman and W. Fripp for handling denaturing gradient gel electrophoresis (DGGE), specifically for when that gradient is a temperature gradient. DGGE is a physical method for detecting point mutations in a known sequence - it does not require that each potential mutant be sequenced. It depends upon the observation that when DNA is electrophoresed through a denaturing gradient, the mobility drops precipitously at the point in the gradient where the first large domain melts. Heteroduplexes of a point mutation with wild type DNA will melt at a slightly lower point than will the the homoduplex for either the wild type or point mutation. So a large number of potential mutations may be screened by PCRing the region of interest, hybridizing that DNA to a wild type fragment, and then running all of them at once on a denaturing gel. There are a fair number of complications that affect this scheme, and these programs are designed to help you determine if the region you are interested in is amenable to this type of analysis. The programs are pretty straightforward to use, but they are a bit specialized to cover in detail here. Briefly, these are the three programs:

  • MELT calculates Tm along a DNA fragment. (Example output from MELT is shown below.)
  • MU calculates the mobility of a DNA fragment through a temperature gradient. There is a severe drop in mobility at the point where the first domain melts.
  • SQHTX calculates the expected difference in gradient level between a molecule with the wildtype sequence and a heteroduplex carrying a typical mismatch.
  • For more information, see:

    
    $ help @local general openvms melt
    
    

    Here is what some of the output from MELT looks like:

    
    $ run melt:plotmelt
    BG340N.5MM
    
    example
    
    
    

    Next week we'll cover sequencing and sequence assembly.