Fundamentals of Sequence Analysis, 1995-1996
Answer set 4:  Tools for Molecular Biology I.

If you get stuck, refer to the OpenVMS and GCG resources in the 
class home page.

References:
 
 See documentation in the programs.

Problem group 1.  Mapping

You have a cloned a new insert into your favorite vector.  Following single
and double digestions with enzymes Chmp and Bite you find the following 
mobilities (in cm. from the gel origin):

   Chmp cm 2.14    7.36    8.94
   Bite cm 3.62    5.36    6.90
   Chmp + Bite
        cm 5.36    6.90    7.36    8.30    8.94
   Controls
        MW 2200    4300    5700    8600
        cm 8.25    5.48    4.32    2.62

1A.  What are the Molecular weights of the various pieces?
1B.  What is the restriction map?



This is an artificial example and it is still fairly challenging.
The data is derived from DMWHITE, digested with EcoRI (Chmp) and
BamHI (Bite), throwing out the 1 bp fragment that EcoRI would leave
on one end.

Forward: MW -> distance:  a = -9.5 b =  40.0
Reverse: distance -> MW:  a =  -0.105 b = 4.21

Chmp    MW 9659    2726    1859  real
        cm 2.14    7.36    8.94
        MW 9660    2729    1861  calculated

Bite    MW 6759    4431    3055  real
        cm 3.62    5.36    6.90
        MW 6750    4429    3050  calculated

Double  MW 4431    3055    2726    2173    1859 real
        cm 5.36    6.90    7.36    8.30    8.94
        MW 4429    3050    2729    2173    1861 calculated

Control MW 2200    4300    5700    8600 real
        cm 8.25    5.48    4.32    2.62
        MW 2200    4302    5698    8600 calculated

Input for gelfrags:

*********************
8.25,2200
5.48,4300
4.32,5700
2.62,8600
2.14
7.36
8.94
3.62
5.36
6.90
5.36
6.90
7.36
8.30
8.94
8.25
5.48
4.32
2.62

*********************

Input for diged

*******************************
1
1
chmp
bite

9.660 2.729 1.861 /
6.750 4.429 3.050 /
1,3
1,3
4.429 3.050 2.729 2.173 1.861 /
6
test.dat
example
7
*******************************

Then run mapl like

$ mapl
test.dat
test.out

The real map is:

EcoRI         0    1860    4586   14245
BamHI         0    6759    9814   14245
(inverted
EcoRI         0    9659   12385   14245
BamHI         0    4431    7486   14245
)

There are three best maps (same error scores) - without further information
we couldn't choose between them.  The third of these is closest to the real
map: 

T ERROR=0.212E-03  D ERROR=0.212E-03
chmp     1---------------A----------------1---C--1----B----1
chmp   0.000 <-A->   9.660 <-C->  11.521 <-B->
bite     2----C----2-------B-------2-----------A-----------2
bite   0.000 <-C->   3.050 <-B->   7.479 <-A->
chmp   A C B
bite   C B A

T ERROR=0.212E-03  D ERROR=0.212E-03
chmp     1---------------A----------------1---C--1----B----1
chmp   0.000 <-A->   9.660 <-C->  11.521 <-B->
bite     2------B-------2----C-----2-----------A-----------2
bite   0.000 <-B->   4.429 <-C->   7.479 <-A->
chmp   A C B
bite   B C A

T ERROR=0.212E-03  D ERROR=0.212E-03
chmp     1---------------A----------------1----B----1--C---1
chmp   0.000 <-A->   9.660 <-B->  12.389 <-C->
bite     2------B-------2----C-----2-----------A-----------2
bite   0.000 <-B->   4.429 <-C->   7.479 <-A->
chmp   A B C
bite   B C A


Problem group 2.  Silent translation sites

You have cloned an mRNA for the NBLPRZ gene.  The sequence is in the file
"class:nblprz.seq".  You need to make a gene fusion construct taking as
much of the NBLPRZ gene as possible and attaching it to a stub protein.
The clone for that stub protein happens to have a HindIII site in convenient
place, the first base of which begins in the third base position of the 
codons in the open reading frame.  The NBLPRZ protein begins with the Met at
position 57.

2A.  If you engineer a translationally silent HindIII site into NBLPRZ, how
     much of it can you put into the fusion? 


Since we know where the protein starts, all that needs to be done is
to run MAP like this:

 $ map/infile=class:nblprz.seq/begin=57/enzyme=hindiii/silent/default

Then look at the nblprz.map file, where you will find:

                                            h
                                            i
                                            n
                                            d
                                            i
                                            i
                                            i
         ATGGCTTCCCCGACCTCCCCGAAAGTTTTCCCGCTGTCCCTGTGCTCCACCCAGCCGGAC
      57 ---+---------+---------+---------+---------+---------+------ 116
         TACCGAAGGGGCTGGAGGGGCTTTCAAAAGGGCGACAGGGACACGAGGTGGGTCGGCCTG

a         W  L  P  R  P  P  R  K  F  S  R  C  P  C  A  P  P  S  R  T  -
b          G  F  P  D  L  P  E  S  F  P  A  V  P  V  L  H  P  A  G  R -
c        M  A  S  P  T  S  P  K  V  F  P  L  S  L  C  S  T  Q  P  D   -

Which means that one can put an AAGCTT beginning at position 93 and not
change the reading frame.  Let's verify this:

ctgtccctg -> ctAAGCTTg
L  S  L      L  S  L  

So the fusion could contain all of NBLPRZ save the beginning sequence of
(M)ASPTSPKVFPL.  The problem is a bit artificial, since it was a given
that the HindIII site on the stub was in the right frame to match this
site - which would only be true 1/3 of the time by chance.



2B.   How much better could we do if the convenient site on the stub 
      protein was for an XbaI (T'CTAG_A) the first base of which began
      on the first base of a codon in the open reading frame?


This is basically a discussion question.  HindIII cuts like this: 
A'AGCT_T, it is the only enzyme that has a core 'AGCT_, which you can
verify by: 

 $ search gcgdatafiledirs:enzyme.dat AGCT

Xba, on the other hand, leaves a more common overlap, so we can look for 
other enzymes that share that core.  These will, if we are lucky, only 
change a single amino acid at the junction, which is often acceptable, 
we just don't want to engineer in a stop!

 $ search gcgdatafiledirs:enzyme.dat XbaI

Shows that XbaI looks like:  T'CTAG_A, so...

 $ search gcgdatafiledirs:enzyme.dat CTAG

Shows that there are several such enzymes, all six cutters, some common (
beginning with a ";").  So we can find the best one by doing: 

 $ mapselect/infile=class:nblprz.seq/begin=57/end=93/enzyme=**/six -
   /select=2/silent/default

This produces two output files, NBLPRZ.select (the enzymes that cut)
and the NBLPRZ.map.

 $ search nblprz.select CTAG_
AclNI       1       A'CTAG_T              4       ! Cuts in sequence: 0
NheI        1       G'CTAG_C              4       ! Cuts in sequence: 0
PstNHI      1       G'CTAG_C              4       ! Cuts in sequence: 0
SpeI        1       A'CTAG_T              4       ! Cuts in sequence: 0

There are four enzymes, but only two patterns, so we can use just NheI and
SpeI and skip the other two.  It will be a bit hard to find these in the
.MAP file because so many enzymes were used, so rerun MAP, but this time,
just digesting with these two. 

 $ map/infile=class:nblprz.seq/begin=57/end=93/data=nblprz.select/silent

Specify inside the program that only NheI and SpeI should be used.  (Or,
we could have edited the .select file down to just these two entries).

The result is:

            n        s
            h        p
            e        e
            i        i
         ATGGCTTCCCCGACCTCCCCGAAAGTTTTCCCGCTGT
      57 ---+---------+---------+---------+--- 93
         TACCGAAGGGGCTGGAGGGGCTTTCAAAAGGGCGACA

a         W  L  P  R  P  P  R  K  F  S  R  C   -
b          G  F  P  D  L  P  E  S  F  P  A     -
c        M  A  S  P  T  S  P  K  V  F  P  L    -

So, there is an XbaI compatible site (NheI) at 60, which would only
cut off the (M) in the sequence.  What would the new translation be?

 nnnTCTAGA           stub
 ?  S  R
 atgGCTAGCccg....    NBLPRZ, engineered
 M  A  S  P          NBLPRZ protein (wild type or engineered)
 nnnTCTAGCccg        fusion gene
 ?  S  S  P

Ie, the fusion protein would have the stub sequence up to ? S, and the
NBLPRZ sequence from SP on.


Problem group 3.  Primers

3A.  Design a set of primers to amplify at least bases 100 - 900 of
     CLASS:NBLPRZ.SEQ from genomic DNA (assume that it consists of a single
     exon.) 


If you were to try the naive approach

  $ prime/infile=class:nblprz.seq/begin2=100/end2=900 -
    /begin1=1/end1=1177/maxproduct=1177/default

it would fail to find any primers because the GC content of the product
is outside the normal bounds allowed.  You can loosen this up a bit
with:

  $ prime/infile=class:nblprz.seq/begin2=100/end2=900 -
    /begin1=1/end1=1177/gcmaxproduct=70.0/maxproduct=1177/default

Here is one set of primers - there are others:

 [DNA] = 50.000 nM   [salt] = 50.000 mM
                                        5'                   3'
        forward primer (18-mer):     78 AAAGTTTTCCCGCTGTCC     95
        reverse primer (22-mer):    987 TGTCACCTTTTTTCCAGTCTTC 966
                                     forward            reverse
                    primer %GC:         50.0               40.9
   primer Tm (degrees Celsius):         57.7               58.3
                                   PRODUCT
                                   -------
                product length:  910
                   product %GC: 61.5
                    product Tm: 85.1 degrees Celsius
       difference in primer Tm:  0.6 degrees Celsius
               annealing score: 46.0
 optimal annealing temperature: 62.0 degrees Celsius


Problem group 4.  Finding genes

4A.  What are the predicted exons for the Drosophila gene in
     CLASS:UNKNOWN.SEQ?  (Use genefinder, the file is  already in the
     correct format for that program).  Identify the gene.  Do the
     predicted exons agree with the documented ones?


$ genefinder dros
$ define/user sys$output exons.txt
$ gf -seqfile class:unknown.seq

Note first that genefinder generated one warning:

   ERROR or non-standard 5' site at 3959 in gene Unknown.cand.2 
     (CCAGG^AT^AAGCC) (GT?)

This happens if it finds a location where by all other criteria there might 
be a feature, yet the sequence at that location doesn't match.  It isn't a 
bad idea to check the sequence at these sites, although that isn't an 
option here.

$ fromfasta/infile=GFTRANSPRED.GENES

creates two files files called unknowncand1. and unknowncand2.

$ rename unknowncand2.. unknown_cand2.pep
$ blast/infile=unknown_cand2.pep/database=nr/dbproteinonly -
  /list=10/seg=10/default

Find that it matches these:

pir|S42622|S42622      three rows protein - fruit fly (Dr...  1504  0.0       5
gp|X75374|DMTHR_1      three rows protein [Drosophila mel...  1504  0.0       5
sp|P42286|THR_DROME    THREE ROWS PROTEIN. >pir|A49440|A4...  1487  0.0       5

Dig out the Genbank entry (it is from Drosophila, so it's in GB_IN, but GB
would have worked too) - we know the genbank name is DMTHR because genpept
names are "genbank_name"_#, where # is 1 or greater, referring to protein
products described in the Genbank entry.

$ fetch gb_in:dmthr
$ search dmthr.gbin join
     CDS             join(847. .1626,1685. .4423)

If we look in exons.txt we find:

Unknown.cand.2:
 48.09 A 847..1347\/1393..1626, 1685..2239\/2285..2864, 2908..3031,
        3079..3959\/4017..4423

The "\/" means that it put an intron into an ORF.  So the prediction
is slightly different than the published data.  The first publised exon
(847..1626) is broken up into two pieces.  The second is broken up into 5,
interestingly, the 5th predicted exon is out of frame with the published
exon!  Of course, the only definitive way to identify exons is to 
sequence all classes of cDNAs and map those back onto the genomic DNA.

************************************************************
This is the sequence from CLASS:NBLPRZ.SEQ

NBLPRZ
Nblprz.Seq  Length: 1177  January 31, 1996 10:54  Type: N  Check: 8643  ..

       1  ATTAGGCGCT ATACGTCGGT ATACGGATCA TCGGTATCCG TGTATCGATC 
      51  GATGCAATGG CTTCCCCGAC CTCCCCGAAA GTTTTCCCGC TGTCCCTGTG 
     101  CTCCACCCAG CCGGACGGTA ACGTTGTTAT CGCTTGCCTG GTTCAGGGTT 
     151  TCTTCCCGCA GGAACCGCTG TCCGTTACCT GGTCCGAATC CGGTCAGGGT 
     201  GTTACCGCTC GTAACTTCCC GCCGTCCCAG GACGCTTCCG GTGACCTGTA 
     251  CACCACCTCC TCCCAGCTGA CCCTGCCGGC TACCCAGTGC CTGGCTGGTA 
     301  AATCCGTTAC CTGCCACGTT AAACACTACA CCAACCCGTC CCAGGACGTT 
     351  ACCGTTCCGT GCCCGGTTCC GTCCACCCCG CCGACCCCGT CCCCGTCCAC 
     401  CCCGCCGACC CCGTCCCCGT CCTGCTGCCA CCCGCGTCTG TCCCTGCACC 
     451  GTCCGGCTCT GGAAGACCTG CTGCTGGGTT CCGAAGCTAA CCTGACCTGC 
     501  ACCCTGACCG GTCTGCGTGA CGCTTCCGGT GTTACCTTCA CCTGGACCCC 
     551  GTCCTCCGGT AAATCCGCTG TTCAGGGTCC GCCGGAACGT GACCTGTGCG 
     601  GTTGCTACTC CGTTTCCTCC GTTCTGCCGG GTTGCGCTGA ACCGTGGAAC 
     651  CACGGTAAAA CCTTCACCTG CACCGCTGCT TACCCGGAAT CCAAAACCCC 
     701  GCTGACCGCT ACCCTGTCCA AATCCGGTAA CACCTTCCGT CCGGAAGTTC 
     751  ACCTGCTGCC GCCGCCGTCC GAAGAACTGG CTCTGAACGA ACTGGTTACC 
     801  CTGACCTGCC TGGCTCGTGG TTTCTCCCCG AAAGACGTTC TGGTTCGTTG 
     851  GCTGCAGGGT TCCCAGGAAC TGCCGCGTGA AAAATACCTG ACCTGGGCTT 
     901  CCCGTCAGGA ACCGTCCCAG GGTACCACCA CCTTCGCTGT TACCTCCATC 
     951  CTGCGTGTTG CTGCTGAAGA CTGGAAAAAA GGTGACACCT TCTCCTGCAT 
    1001  GGTTGGTCAC GAAGCTCTGC CGCTGGCTTT CACCCAGAAA ACCATCGACC 
    1051  GTCTGGCTGG TAAACCGACC CACGTTAACG TTTCCGTTGT TATGGCTGAA 
    1101  GTTGACGGTA CCTGCTACTA AATCGTAGTC GTATATATGC TAGTCGATGC 
    1151  TATATGCTAG CTGATTTCGC GATTCTA