Fundamentals of Sequence Analysis, 1995-1996
Problem set 6:  Protein.

If you get stuck, refer to the OpenVMS and GCG resources on the 
class home page.

References:
 
 See the GCG and EGCG manuals.


Problem group 1.  Physical properties

You have isolated the cDNA for a protein that interests you.  The
predicted translation is present in CLASS:PROTEIN1.PEP.

1A.  You want to determine whether the protein is likely to be present in a 
subcellular fraction, which you have run on a 2D gel (SDS-PAGE/isoelectric
focusing).  There are spots at:

Spot   MW (kDa)   Isoelectric pH
 A      8.1        6.5
 B     10.7       10.3
 C     22.3        9.6
 D     47.2        5.4
 (plus an assortment of pH and MW markers)

If it is there, which one is it?


 $ peptidesort/infile=class:protein1.pep/enzyme=nocut
 $ type protein1.pepsort

The output reports:
   Molecular weight  =   10699.76
   Isoelectric point =      10.28

These values are quite close to those of spot B (MW 10.7, pH 10.3).
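
As a rough cross-check on the PeptideSort numbers, the molecular weight can
be recomputed by summing average residue masses and adding one water.  The
sketch below is not GCG code: the mass table holds standard average residue
masses (an assumption about what PeptideSort uses), the function names and
the distance measure are made up for illustration, and the isoelectric point
is simply taken from the output above rather than recalculated.

```python
# Average residue masses in daltons (standard values; assumed, not taken from GCG).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153  # one water per chain (free N- and C-termini)

# CLASS:PROTEIN1.PEP, copied from the listing at the end of this problem set.
PROTEIN1 = (
    "PKRKVSSAEG" "AAKEEPKRRS" "ARLSAKPAPA" "KVETKPKKAA" "GKDKSSDKKV"
    "QTKGKRGAKG" "KQAEVANQET" "KEDLPAENGE" "TKNEESPASD" "EAEEKEAKSD"
)

# Gel spots from problem 1A: name -> (MW in thousands, isoelectric pH).
SPOTS = {"A": (8.1, 6.5), "B": (10.7, 10.3), "C": (22.3, 9.6), "D": (47.2, 5.4)}


def molecular_weight(seq):
    """Sum of average residue masses plus one water, in daltons."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def closest_spot(mw_k, pi, spots):
    """Pick the spot with the smallest combined MW and pH mismatch.

    The distance is an ad hoc illustration: relative MW difference plus
    the pH difference scaled to a 0-14 range.
    """
    def distance(item):
        spot_mw, spot_pi = item[1]
        return abs(spot_mw - mw_k) / mw_k + abs(spot_pi - pi) / 14.0
    return min(spots.items(), key=distance)[0]


mw = molecular_weight(PROTEIN1)
best = closest_spot(mw / 1000.0, 10.28, SPOTS)  # pI taken from PeptideSort
print(f"MW = {mw:.1f} Da, closest spot = {best}")
```

With this mass table the recomputed weight lands within about a dalton of
the PeptideSort value, and spot B remains the closest match.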


1B. You want to further confirm that the spot you see is actually the
    protein that you have isolated.  You run a preparative isoelectric
    focusing gel and isolate the band at pH 10.3, then digest it with
    CNBr and run the fragments out on an SDS-PAGE gel.  THEN you calculate
    what the expected fragment sizes should have been.  What happens? 


Probably you kick yourself.

 $ peptidesort/infile=class:protein1.pep/enzyme=CNBr
 $ type protein1.pepsort

The output shows that CNBr doesn't cut this protein (it contains no
methionine).  If the protein in the gel had been cut, you would have
excluded the possibility that it is the protein of known sequence.
However, if it wasn't cut, you have not learned much.  It would have been
more productive to have run:

 $ peptidesort/infile=class:protein1.pep/enzyme=*
 $ type protein1.pepsort

*before* doing the digest, to see which of the enzymes would have been more
informative.  Trypsin, for instance, would cut the known protein into 
many pieces, most of which would have been too small to resolve on the gel.
In this instance, an NH2OH digestion might have been best, since it would
result in only two, fairly large, fragments.  Or you might have used
trypsin, but then put the mix through an HPLC column, where you could have
resolved many of the small fragments. 
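
The enzyme comparison can be approximated outside GCG with three simplified
cleavage rules.  This is a sketch only: the rules (CNBr after Met, NH2OH
between Asn and Gly, trypsin after Lys/Arg unless followed by Pro) are
textbook simplifications and may not match the definitions in GCG's enzyme
data file, and the helper names are hypothetical.

```python
# CLASS:PROTEIN1.PEP, copied from the listing at the end of this problem set.
PROTEIN1 = (
    "PKRKVSSAEG" "AAKEEPKRRS" "ARLSAKPAPA" "KVETKPKKAA" "GKDKSSDKKV"
    "QTKGKRGAKG" "KQAEVANQET" "KEDLPAENGE" "TKNEESPASD" "EAEEKEAKSD"
)

# Each rule answers: "is the bond after position i (0-based) cleaved?"
# Simplified textbook rules, assumed here; GCG's own definitions may differ.
RULES = {
    "CNBr": lambda s, i: s[i] == "M",
    "NH2OH": lambda s, i: s[i] == "N" and s[i + 1] == "G",
    "Trypsin": lambda s, i: s[i] in "KR" and s[i + 1] != "P",
}


def cleave(seq, cut_after):
    """Split seq after every position for which cut_after(seq, i) is true."""
    fragments, start = [], 0
    for i in range(len(seq) - 1):      # the last residue has no bond after it
        if cut_after(seq, i):
            fragments.append(seq[start:i + 1])
            start = i + 1
    fragments.append(seq[start:])
    return fragments


for enzyme, rule in RULES.items():
    lengths = [len(f) for f in cleave(PROTEIN1, rule)]
    print(f"{enzyme:8s} {len(lengths):3d} fragment(s), lengths: {lengths}")
```

Under these rules CNBr leaves the chain intact, NH2OH gives two fragments
(78 and 22 residues), and trypsin gives a large number of short peptides.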


Problem group 2.  What the heck is it?

After a bit more work you have isolated a second protein from this complex.
One round through the protein sequencer yielded this sequence:    YAACSTPQ


2A.  Design a primer that will be radioactively labeled and used 
     to screen a rat cDNA library for this sequence.


   $ create 2a.pep
   ..
   YAACSTPQ
   ^Z
   $ reformat/protein 2a.pep
   $ backtranslate/infile1=2a.pep/infile2=gcgdatafiledirs:rat.cod -
      /menu=a/default
   $ type 2a.seq

Which contains this:

  TAC 0.60   GCC 0.39   GCC 0.39   TGC 0.56   AGC 0.24   ACC 0.38   CCC 0.32  CAG 0.75
  TAT 0.40   GCT 0.31   GCT 0.31   TGT 0.44   TCC 0.23   ACA 0.27   CCA 0.29  CAA 0.25
             GCA 0.21   GCA 0.21              TCT 0.19   ACT 0.24   CCT 0.29
             GCG 0.09   GCG 0.09              TCA 0.14   ACG 0.10   CCG 0.10
                                              AGT 0.13
                                              TCG 0.06


If you want to try hybridizing with a family of degenerate primers
(alternative bases are stacked under each position):

  TACGCAGCATGCACAACACCACAA
    T  C  C  TTGC  C  C  G
       T  T     T  T  T
       G  G     G  G  G

Unfortunately, that primer is quite degenerate, and there isn't enough 
sequence to try shifting one way or the other to find a better region.

Otherwise, the most likely sequence, found in the bottom of that file, is:

  TACGCCGCCTGCAGCACCCCCCAG

This is unlikely to be the correct sequence, but might still stick to the 
right place if the stringency were low enough.
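
To put numbers on "quite degenerate" and "unlikely to be the correct
sequence", the codon table above can be multiplied out.  The sketch below
uses only the codons and frequencies shown in the BackTranslate output; the
function names are hypothetical, and treating codon choices as independent
from one position to the next is a simplifying assumption.

```python
from math import prod

# Codon choices and rat frequencies as listed in the BackTranslate output.
CODONS = {
    "Y": {"TAC": 0.60, "TAT": 0.40},
    "A": {"GCC": 0.39, "GCT": 0.31, "GCA": 0.21, "GCG": 0.09},
    "C": {"TGC": 0.56, "TGT": 0.44},
    "S": {"AGC": 0.24, "TCC": 0.23, "TCT": 0.19,
          "TCA": 0.14, "AGT": 0.13, "TCG": 0.06},
    "T": {"ACC": 0.38, "ACA": 0.27, "ACT": 0.24, "ACG": 0.10},
    "P": {"CCC": 0.32, "CCA": 0.29, "CCT": 0.29, "CCG": 0.10},
    "Q": {"CAG": 0.75, "CAA": 0.25},
}
PEPTIDE = "YAACSTPQ"


def coding_sequences(peptide):
    """Number of distinct DNA sequences that encode the peptide."""
    return prod(len(CODONS[aa]) for aa in peptide)


def mixed_base_degeneracy(peptide):
    """Pool size when bases are mixed position by position (stacked bases)."""
    total = 1
    for aa in peptide:
        for pos in range(3):
            total *= len({codon[pos] for codon in CODONS[aa]})
    return total


def most_likely(peptide):
    """Most frequent codon at each residue, and the product of frequencies."""
    best = [max(CODONS[aa].items(), key=lambda kv: kv[1]) for aa in peptide]
    seq = "".join(codon for codon, _ in best)
    return seq, prod(freq for _, freq in best)


seq, p = most_likely(PEPTIDE)
print("coding sequences:     ", coding_sequences(PEPTIDE))
print("stacked-base pool:    ", mixed_base_degeneracy(PEPTIDE))
print("most likely sequence: ", seq)
print(f"chance it is exact:    {p:.4f}")
```

The stacked-base pool (32768 oligonucleotides) is larger than the 12288 true
coding sequences because mixing bases position by position at serine also
admits non-serine codons.  Under the independence assumption the single most
likely sequence has roughly a 0.1% chance of being exactly right.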



You have sequenced the cDNA clone and found an open reading frame that translates
to CLASS:PROTEIN2.PEP.  Assume that a database search failed to turn up any
similar sequences (don't try it - this example sequence is made from bits
and pieces of several proteins and will actually return hits.) 

2B.  What can you say about this protein?



Does it have a signal sequence?

   $ sigcleave/infile=class:PROTEIN2.PEP/default
   $ type protein2.sig

   Apparently not.

Does it have a PEST sequence?

   $ pestfind class:protein2.pep
   $ type pestfind.out

   Yes, several, many with very high scores.  It is likely an unstable 
   protein.

Are there any antigenic regions?  (You are thinking about making antibodies
   against peptides.)

   $ antigenic/infile=class:protein2.pep/default
   $ type protein2.anti

   Here are the top 4 scoring regions:
      
      (1) Score 1.199 length 34 at residue 167-200
                                               *
       Sequence:  GHSVCSTSSLYLQDLSAAASECIDPSVVFPYPLN
                  |                                |
                167                                200
      
      (2) Score 1.191 length 25 at residue 289-313
                            *
       Sequence:  KPPHSPLVLKRCHVSTHQHNYAAPP
                  |                       |
                289                       313
      
      (3) Score 1.185 length 16 at residue 67-82
                       *
       Sequence:  SGLCSPSYVAVTPFSL
                  |              |
                 67              82
      
      (4) Score 1.184 length 12 at residue 15-26
                         *
       Sequence:  DYDSVQPYFYCD
                  |          |
                 15          26

Are there any coiled coil regions?

  $! set up your GCG graphics, i.e.  REGIS VT241 TT
  $ pepcoil/infile=class:protein2.pep/default

  Yes.  There are two coiled-coil locations which score at probabilities
  of 1.0:  403-444, 672-720

Are there any helix-turn-helix regions?

  $ helixturnhelix/infile=class:protein2.pep/default
  $ type protein2.hth

  Apparently not - no high scoring regions were found.

What other known signatures does it have?

  $ motifs/infile=class:protein2.pep/default
  $ type protein2.motifs

  This turns up two motifs:

      bZIP transcription factors basic domain signature
          at 670-684
      Myc-type, 'helix-loop-helix' dimerization domain signature
          at 391-406

Any statistical anomalies in the sequence?

  $ readseq -f4 class:protein2.pep -oprotein2.saps
  $ saps -s rat -o saps.results protein2.saps
  $ type saps.results

      Summary (positive findings only)
       There are a couple of high scoring negative charge segments.
       There are beta compatible repeats (period 2) for E 5X at 253- 262
         and for S 4X at 277- 284.
       There are alpha compatible repeats (period 7 = 3.6 x 2) for
         L...... 4X at 413- 440 and for L...V.. 4X at 692- 719
       There are alpha compatible charge repeats (period 7) for
         (KRED).(neutral).... 6X at 357-398 (with 2 discrepancies in the
         neutral position).
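
The leucine heptad repeats that SAPS reports can be checked by hand or with
a few lines of code.  The sketch below is not how SAPS works internally; it
only looks for a leucine recurring every 7 residues, 4 times in a row, in
two segments copied from CLASS:PROTEIN2.PEP.  The function name and its
parameters are hypothetical.

```python
# Segments copied from CLASS:PROTEIN2.PEP, keyed by the residue number
# of their first letter (residues 401-450 and 691-728).
SEGMENTS = {
    401: "AYILSVQAEE" "QKLISEEDLL" "RKRREQLKHK" "LEQLRNSCAM" "SEYQPSLFAL",
    691: "QLEDKVEELL" "SKNYHLENEV" "ARLKKLVGER" "YAACSTPQ",
}


def heptad_starts(segment, first_residue, residue="L", repeats=4, period=7):
    """Residue numbers where `residue` recurs `repeats` times at `period`."""
    starts = []
    for i in range(len(segment)):
        idx = [i + k * period for k in range(repeats)]
        if idx[-1] < len(segment) and all(segment[j] == residue for j in idx):
            starts.append(first_residue + i)
    return starts


for first, segment in SEGMENTS.items():
    print(first, heptad_starts(segment, first))
```

This finds runs starting at residues 413 and 692, the same starting
positions SAPS reports for its L...... and L...V.. repeats.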


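Taken together, these characteristics (a bZIP basic domain signature, a
Myc-type helix-loop-helix signature, leucine heptad repeats within the
predicted coiled coils, and several PEST sequences) suggest that the
protein is likely a DNA binding/regulating protein, and probably an
unstable one.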


Problem group 3.  Treading on thin ice - what is the secondary structure?

A third protein from this complex has been isolated and fully sequenced.
A series of mutations in this protein have been isolated, and all fall
between amino acids 100 and 200.
 
3A.  What is the secondary structure in that region?


This is actually NRL_3D:1GPAA, glycogen phosphorylase (and the
positions of the mutations are completely made up).  Here is its real
secondary structure as determined from the X-ray structure: 

   alpha helix :  95-106, 109-115, 125-141
   beta  sheet : 153-154, 158-162, 165-169, 182-184, 189-200
   turn        : 162-165, 172-181, 184-187

Let's try a couple of the secondary structure servers and see how they
do.  Feed them the entire sequence, but only look at the region
100-200 in the results. 

   $ em_phd
   class:protein3.pep
   $ nnpredict
   
   class:protein3.pep

Wait around for the results to come back in the mail.  (For nnpredict
one could also use a Web browser such as Netscape, Mosaic, or LYNX to
connect to "http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html" and then
paste the PLAIN text sequence in.)
       
HEL: H=helix, E=extended (sheet), T = turn,  -=other

       100      110       120       130       140       150       160       170       180       190       200 
       |         |         |         |         |         |         |         |         |         |         |
seq    DEATYQLGLDMEELEEIEEDAGLGNGGLGRLAACFLDSMATLGLAAYGYGIRYEFGIFNQKICGGWQMEEADDWLRYGNPWEKARPEFTLPVHFYGRVEHT
PHD    HHHHHHHH--HHHHHHHH-------HHHHHHHHHHHHHHHHH--------EEEE---EEEE----------HHHHH------E------EEEEEEEEEEE-
NN     HHHHHH----HHHHHHHHHH---------HHHHHHHHHHHHHHHHH----EEEEE-------------HHH-HHHH---------------------EE--

stru   HHHHHH---HHHHHHH---------HHHHHHHHHHHHHHHHH-----------EE---EEEEETTEEEEE--TTTTTTTTTTEEETTT-EEEEEEEEEEEE

So, in this instance, the two predictions are relatively close, giving
4 alpha helices and 2 common sheet regions, plus some disagreement on
two other sheets. 

Comparison with the known structure shows that 3 of the 4 predicted
helices are quite close to those found in the model, with the fourth
helix corresponding to a region of turns.  The sheet regions predicted
by PHD picked up 4 out of 5 real sheets; nnpredict did less well, with
only 2 out of 5. 
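
The sheet tallies above can be reproduced from the three state strings.
The sketch below counts a real sheet as "picked up" if any predicted E
residue overlaps it.  That overlap criterion is an assumption made for this
illustration, and the helper names are hypothetical.

```python
from itertools import groupby

# State strings for residues 100-200, copied from the alignment above.
PHD = "HHHHHHHH--HHHHHHHH-------HHHHHHHHHHHHHHHHH--------EEEE---EEEE----------HHHHH------E------EEEEEEEEEEE-"
NN = "HHHHHH----HHHHHHHHHH---------HHHHHHHHHHHHHHHHH----EEEEE-------------HHH-HHHH---------------------EE--"
STRU = "HHHHHH---HHHHHHH---------HHHHHHHHHHHHHHHHH-----------EE---EEEEETTEEEEE--TTTTTTTTTTEEETTT-EEEEEEEEEEEE"
FIRST_RESIDUE = 100


def segments(states, symbol):
    """(start, end) residue numbers of each run of `symbol` in `states`."""
    found, pos = [], FIRST_RESIDUE
    for key, run in groupby(states):
        length = len(list(run))
        if key == symbol:
            found.append((pos, pos + length - 1))
        pos += length
    return found


def overlaps(a, b):
    """True if the residue ranges a and b share at least one residue."""
    return a[0] <= b[1] and b[0] <= a[1]


def recovered(real, predicted):
    """How many real segments are overlapped by some predicted segment."""
    return sum(any(overlaps(r, p) for p in predicted) for r in real)


def agreement(pred, real):
    """Fraction of residues whose predicted state equals the real state.

    Neither predictor reports turns here, so a real 'T' always counts
    as a mismatch.
    """
    return sum(p == r for p, r in zip(pred, real)) / len(real)


real_sheets = segments(STRU, "E")
for name, pred in (("PHD", PHD), ("NN", NN)):
    hits = recovered(real_sheets, segments(pred, "E"))
    print(f"{name}: {hits} of {len(real_sheets)} sheets, "
          f"per-residue agreement {agreement(pred, STRU):.2f}")
```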



3B.  Suppose that replacing any of the amino acids between 171 and 175
     with a proline radically decreases the function of this enzyme.
     Propose a model based on the secondary structure predictions to
     explain this.


Both secondary structure prediction programs suggest that this region
is an alpha helix.  Naturally, if you believed these results, you
would propose that this helix is required for enzyme activity, and
that the proline disrupts it.  Unfortunately, in this case both
predictions are wrong, and so the model is incorrect.  (It is not
uncommon for multiple programs to give the same wrong prediction for a
region - they are "trained" or tested against similar sets of data,
and so tend to predict the same secondary structure for the same
sequence.  However, the protein may have evolved to force that region
into another secondary structure.) 
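
The disagreement at residues 171-175 can be read straight off the state
strings.  The sketch below just slices the three strings from problem 3A at
those positions; the helper name is hypothetical.

```python
# State strings for residues 100-200, copied from the alignment in 3A.
PHD = "HHHHHHHH--HHHHHHHH-------HHHHHHHHHHHHHHHHH--------EEEE---EEEE----------HHHHH------E------EEEEEEEEEEE-"
NN = "HHHHHH----HHHHHHHHHH---------HHHHHHHHHHHHHHHHH----EEEEE-------------HHH-HHHH---------------------EE--"
STRU = "HHHHHH---HHHHHHH---------HHHHHHHHHHHHHHHHH-----------EE---EEEEETTEEEEE--TTTTTTTTTTEEETTT-EEEEEEEEEEEE"
FIRST_RESIDUE = 100


def window(states, start, end):
    """States for residues start..end (inclusive, numbered from 100)."""
    return states[start - FIRST_RESIDUE:end - FIRST_RESIDUE + 1]


for name, states in (("PHD", PHD), ("NN", NN), ("stru", STRU)):
    print(f"{name:5s}{window(states, 171, 175)}")
```

Both predictions show helix across the window, while the X-ray structure
shows no helix there at all.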

When it comes to secondary structure predictions, caveat emptor!


---------------------------------------

Protein1.Pep  Length: 100  February 13, 1996 10:34  Type: P  Check: 3643  ..

       1  PKRKVSSAEG AAKEEPKRRS ARLSAKPAPA KVETKPKKAA GKDKSSDKKV 

      51  QTKGKRGAKG KQAEVANQET KEDLPAENGE TKNEESPASD EAEEKEAKSD 


---------------------------------------


Protein2.Pep  Length: 728  February 13, 1996 11:51  Type: P  Check: 5123  ..

       1  MPLNVSFTNR NYDLDYDSVQ PYFYCDEEEN FYQQQQQSEL QPPAPSEDIW 

      51  KKFELLPTPP LSPSRRSGLC SPSYVAVTPF SLRGDNDGGG GSFSTADQLE 

     101  MVTELLGGDM VNQSFICDPD DETFIKNIII QDCMWSGFSA AAKLVSEKLA 

     151  SYQAARKDSG SPNPARGHSV CSTSSLYLQD LSAAASECID PSVVFPYPLN 

     201  DSSSPKSCAS QDSSAFSPSS DSLLSSTESS PQGSPEPLVL HEETPPTTSS 

     251  DSEEEQEDEE EIDVVSVEKR QAPGKRSESG SPSAGGHSKP PHSPLVLKRC 

     301  HVSTHQHNYA APPSTRKDYP AAKRVKLDSV RVLRQISNNR KCTSPRSSDT 

     351  EENVKRRTHN VLERQRRNEL KRSFFALRDQ IPELENNEKA PKVVILKKAT 

     401  AYILSVQAEE QKLISEEDLL RKRREQLKHK LEQLRNSCAM SEYQPSLFAL 

     451  NPMGFSPLDG SKSTNENVSA STSTAKPMVG QLIFDKFIKT EEDPIIKQDT 

     501  PSNLDFDFAL PQTATAPDAK TVLPIPELDD AVVESFFSSS TDSTPMFEYE 

     551  NLEDNSKEWT SLFDNDIPVT TDDVSLADKA IESTEEVSLV PSNLEVSTTS 

     601  FLPTPVLEDA KLTQTRKVKK PNSVVKKSHH VGKDDESRLD HLGVVAYNRK 

     651  QRSIPLSPIV PESSDPAALK RARNTEAARR SRARKLQRMK QLEDKVEELL 

     701  SKNYHLENEV ARLKKLVGER YAACSTPQ

---------------------------------------

Protein3.Pep  Length: 828  February 13, 1996 13:55  Type: P  Check: 7271  ..

       1  RKQISVRGLA GVENVTELKK NFNRHLHFTL VKDRNVATPR DYYFALAHTV 

      51  RDHLVGRWIR TQQHYYEKDP KRIYYLSLEF YMGRTLQNTM VNLALENACD 

     101  EATYQLGLDM EELEEIEEDA GLGNGGLGRL AACFLDSMAT LGLAAYGYGI 

     151  RYEFGIFNQK ICGGWQMEEA DDWLRYGNPW EKARPEFTLP VHFYGRVEHT 

     201  SQGAKWVDTQ VVLAMPYDTP VPGYRNNVVN TMRLWSAKAP NDFNLKDFNV 

     251  GGYIQAVLDR NLAENISRVL YPNDNFFEGK ELRLKQEYFV VAATLQDIIR 

     301  RFKSSKFGCR DPVRTNFDAF PDKVAIQLND THPSLAIPEL MRVLVDLERL 

     351  DWDKAWEVTV KTCAYTNHTV IPEALERWPV HLLETLLPRH LQIIYEINQR 

     401  FLNRVAAAFP GDVDRLRRMS LVEEGAVKRI NMAHLCIAGS HAVNGVARIH 

     451  SEILKKTIFK DFYELEPHKF QNKTNGITPR RWLVLCNPGL AEIIAERIGE 

     501  EYISDLDQLR KLLSYVDDEA FIRDVAKVKQ ENKLKFAAYL EREYKVHINP 

     551  NSLFDVQVKR IHEYKRQLLN CLHVITLYNR IKKEPNKFVV PRTVMIGGKA 

     601  APGYHMAKMI IKLITAIGDV VNHDPVVGDR LRVIFLENYR VSLAEKVIPA 

     651  ADLSEQISTA GTEASGTGNM KFMLNGALTI GTMDGANVEM AEEAGEENFF 

     701  IFGMRVEDVD RLDQRGYNAQ EYYDRIPELR QIIEQLSSGF FSPKQPDLFK 

     751  DIVNMLMHHD RFKVFADYEE YVKCQERVSA LYKNPREWTR MVIRNIATSG 

     801  KFSSDRTIAQ YAREIWGVEP SRQRLPAP