Fundamentals of Sequence Analysis, 1998-1999

Problem set 6: Protein tools.

If you get stuck, refer to the OpenVMS and GCG resources on the class home page.


Sequence data used in this problem set, for offsite readers:

Problem group 1. Physical properties

You have isolated the cDNA for a protein that interests you. The predicted translation is present in CLASS:PROTEIN1.PEP.

1A. You want to determine whether the protein is likely to be present in a subcellular fraction, which you have run on a 2D gel (SDS-PAGE/isoelectric focusing). There are spots at:

Spot   MW (kd)    Isoelectric pH
 A      8.1        6.5
 B     10.7       10.3
 C     22.3        9.6
 D     47.2        5.4
 (plus an assortment of pH and MW markers)

If it is there, which one is it?

 $ peptidesort/infile=class:protein1.pep/enzyme=nocut
 $ type protein1.pepsort

   Molecular weight  =   10699.76
   Isoelectric point =      10.28

These values are quite close to those of the "B" protein spot (10.7 kd, pH 10.3).
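The comparison against the gel can also be written out as a small calculation. The following is a minimal Python sketch, not part of GCG: the function name closest_spot and the choice of a summed relative difference as the distance measure are assumptions made for illustration. The spot values are the ones in the table above.

```python
# Gel spots from the table above: (label, MW in kd, isoelectric pH).
SPOTS = [("A", 8.1, 6.5), ("B", 10.7, 10.3), ("C", 22.3, 9.6), ("D", 47.2, 5.4)]


def closest_spot(mw_kd, pi, spots=SPOTS):
    """Return the label of the spot with the smallest combined relative
    difference in MW and pI (an arbitrary but simple distance measure)."""
    def distance(spot):
        _, spot_mw, spot_pi = spot
        return abs(spot_mw - mw_kd) / mw_kd + abs(spot_pi - pi) / pi
    return min(spots, key=distance)[0]


# PEPTIDESORT reports the weight in daltons, so convert to kd first.
print(closest_spot(10699.76 / 1000.0, 10.28))
```

With the PEPTIDESORT values this selects spot B. On a real gel the measured MW and pI carry enough error that a near match is only suggestive, which is the reason for the confirmation step in 1B.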

1B. You want to further confirm that the spot you see is actually the protein that you have isolated. You run a preparative isoelectric focusing gel and isolate the band at pH 10.3, then digest it with CNBr and run the fragments out on an SDS-PAGE gel. THEN you calculate what the expected fragment sizes should have been. What happens?

Probably you kick yourself.

$ peptidesort/infile=class:protein1.pep/enzyme=CNBr
$ type protein1.pepsort

CNBr doesn't cut this protein! If the protein in the gel was cut, then you have excluded the possibility that it is the protein of known sequence. However, if it didn't cut, you have not learned much. It would have been more productive to have done:

$ peptidesort/infile=class:protein1.pep/enzyme=*
$ type protein1.pepsort

*before* doing the digest, to see which of the enzymes would have been more informative. Trypsin, for instance, would cut the known protein into many pieces, most of which would have been too small to resolve on the gel. In this instance, an NH2OH digestion might have been best, since it would result in only two fairly large fragments. Or you might have used trypsin and then put the mix through an HPLC, where you could have resolved many of the small fragments.
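To see why the choice of reagent matters, here is a minimal Python sketch of in-silico digestion. It is not what PEPTIDESORT does internally: the cleave function, its simplified rules (CNBr cuts after Met; trypsin cuts after Lys or Arg unless Pro follows), and the TOY sequence are all assumptions for illustration. TOY is made up and is not CLASS:PROTEIN1.PEP.

```python
def cleave(seq, cut_after, not_before=""):
    """Split seq after any residue in cut_after, unless the next residue
    is listed in not_before. Returns the fragments in order."""
    fragments, start = [], 0
    for i, residue in enumerate(seq[:-1]):
        if residue in cut_after and seq[i + 1] not in not_before:
            fragments.append(seq[start:i + 1])
            start = i + 1
    fragments.append(seq[start:])
    return fragments


# Hypothetical, Met-free toy peptide (NOT the class sequence).
TOY = "GKRSAPKRGLKKPRTAW"

cnbr = cleave(TOY, "M")           # simplified CNBr rule: cut after Met
trypsin = cleave(TOY, "KR", "P")  # simplified trypsin rule: after K/R, not before P

print(len(cnbr), cnbr)
print(len(trypsin), trypsin)
```

A sequence with no Met comes back from the CNBr rule as a single uncut fragment, while the trypsin rule breaks the same basic peptide into many short pieces. That is the same contrast described above for the real protein.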

Problem group 2. What the heck is it?

After a bit more work you have isolated a second protein from this complex. One pass through the protein sequencer yielded this sequence: YAACSTPQ

2A. Design a primer that will be radioactively labeled and used to screen a rat cDNA library for this sequence.

$ create 2a.pep
$ reformat/protein/nodots 2a.pep
$ backtranslate/infile1=2a.pep/infile2=gcgdatafiledirs:rat.cod -
$ type 2a.seq

Which contains this:

  TAC 0.60   GCC 0.39   GCC 0.39   TGC 0.56   AGC 0.24   ACC 0.38   CCC 0.32  CAG 0.75
  TAT 0.40   GCT 0.31   GCT 0.31   TGT 0.44   TCC 0.23   ACA 0.27   CCA 0.29  CAA 0.25
             GCA 0.21   GCA 0.21              TCT 0.19   ACT 0.24   CCT 0.29
             GCG 0.09   GCG 0.09              TCA 0.14   ACG 0.10   CCG 0.10
                                              AGT 0.13
                                              TCG 0.06

If you want to try hybridizing with a family of degenerate primers:

    T  C  C  TTGC  C  C  G
       T  T     T  T  T
       G  G     G  G  G

Unfortunately, that primer is quite degenerate, and there isn't enough sequence to try shifting one way or the other to find a better region. Otherwise, the most likely sequence, found in the bottom of that file, is:


This is unlikely to be the correct sequence, but might still stick to the right place if the stringency was low enough.
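The codon table above is enough to work out both the single most probable back-translation and how degenerate a fully redundant primer would be. The following Python sketch copies the frequencies from that table; the variable names are made up, and multiplying the per-codon frequencies assumes that codon choices at different positions are independent, which is only an approximation.

```python
# Codon frequencies copied from the BACKTRANSLATE table above,
# one list per residue of the peptide YAACSTPQ.
CODONS = [
    [("TAC", 0.60), ("TAT", 0.40)],                                # Y
    [("GCC", 0.39), ("GCT", 0.31), ("GCA", 0.21), ("GCG", 0.09)],  # A
    [("GCC", 0.39), ("GCT", 0.31), ("GCA", 0.21), ("GCG", 0.09)],  # A
    [("TGC", 0.56), ("TGT", 0.44)],                                # C
    [("AGC", 0.24), ("TCC", 0.23), ("TCT", 0.19),
     ("TCA", 0.14), ("AGT", 0.13), ("TCG", 0.06)],                 # S
    [("ACC", 0.38), ("ACA", 0.27), ("ACT", 0.24), ("ACG", 0.10)],  # T
    [("CCC", 0.32), ("CCA", 0.29), ("CCT", 0.29), ("CCG", 0.10)],  # P
    [("CAG", 0.75), ("CAA", 0.25)],                                # Q
]

best_codons = [max(column, key=lambda c: c[1]) for column in CODONS]
most_likely = "".join(codon for codon, _ in best_codons)

probability = 1.0   # chance that every position uses its top codon
degeneracy = 1      # number of distinct sequences if every codon is allowed
for column, (_, freq) in zip(CODONS, best_codons):
    probability *= freq
    degeneracy *= len(column)

print(most_likely)
print(degeneracy)
print(round(probability, 5))
```

Under these assumptions the top codons join to TACGCCGCCTGCAGCACCCCCCAG, a fully degenerate 24-mer would be a mix of 12288 sequences, and the chance that the single most likely sequence is exactly right is on the order of 0.1%. That is why such a probe is only useful at low stringency.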

You have sequenced the cDNA clone and found an open reading frame that translates to CLASS:PROTEIN2.PEP. Assume that a database search failed to turn up any similar sequences (don't try it - this example sequence is made from bits and pieces of several proteins and will actually return hits.)

2B. What can you say about this protein?

Does it have a signal sequence?

$ sigcleave/infile=class:PROTEIN2.PEP/default
$ type protein2.sig

   Apparently not.

Does it have a PEST sequence?

$ pestfind class:protein2.pep
$ type pestfind.out

   Yes, several, many with very high scores.  It is likely an unstable protein.

Are there any antigenic regions?  (You are considering making
antibodies against peptides.)

$ antigenic/infile=class:protein2.pep/default
$ type protein2.anti

   Here are the top 4 scoring regions:
      (1) Score 1.199 length 34 at residue 167-200
                  |                                |
                167                                200
      (2) Score 1.191 length 25 at residue 289-313
                  |                       |
                289                       313
      (3) Score 1.185 length 16 at residue 67-82
       Sequence:  SGLCSPSYVAVTPFSL
                  |              |
                 67              82
      (4) Score 1.184 length 12 at residue 15-26
       Sequence:  DYDSVQPYFYCD
                  |          |
                 15          26

Are there any coiled coil regions?

$! set up your GCG graphics first, ie     REGIS VT241 TT
$ pepcoil/infile=class:protein2.pep/default

  Yes.  There are two coiled-coil locations which score
  at probabilities of 1.0:  403-444, 672-720

Are there any helix-turn-helix regions?

$ helixturnhelix/infile=class:protein2.pep/default
$ type protein2.hth

  Apparently not - no high scoring regions were found.

What other known signatures does it have?

$ motifs/infile=class:protein2.pep/default
$ type protein2.motifs

  This turns up two motifs:

      bZIP transcription factors basic domain signature
          at 670-684
      Myc-type, 'helix-loop-helix' dimerization domain signature
          at 391-406

Any statistical anomalies in the sequence?

$ readseq -f4 class:protein2.pep -oprotein2.saps
$ saps -s rat -o saps.results protein2.saps
$ type saps.results

      Summary (positive findings only)
       There are a couple of high scoring negative charge segments.
       There are beta compatible repeats (period 2) for E 5X at 253- 262
         and for S 4X at 277- 284.
       There are alpha compatible repeats (period 7 = 3.6 x 2) for
         L...... 4X at 413- 440 and for L...V.. 4X at 692- 719
       There are alpha compatible charge repeats (period 7) for
         (KRED).(neutral).... 6X at 357-398 (with 2 discrepancies in the
         neutral position).
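The "alpha compatible" period-7 repeats are the heptad pattern typical of leucine zippers and coiled coils, which is why they line up with the PEPCOIL regions reported earlier. As a rough illustration of what a pattern like "L...... 4X" means, here is a minimal Python sketch. It is not the SAPS algorithm; the function name and the TOY sequence are made up for the example.

```python
def heptad_leucine_runs(seq, repeats=4, period=7):
    """Return 1-based start positions where Leu recurs `repeats` times
    in a row at the given spacing (period 7 = one heptad)."""
    starts = []
    for i in range(len(seq) - period * (repeats - 1)):
        if all(seq[i + k * period] == "L" for k in range(repeats)):
            starts.append(i + 1)
    return starts


# Hypothetical toy sequence (NOT the class protein): four LEEKAAR heptads.
TOY = "MK" + "LEEKAAR" * 4 + "GS"

print(heptad_leucine_runs(TOY))
```

On the toy sequence this reports a single run starting at residue 3. A real scan would also need to tolerate mismatches, as SAPS does when it notes "2 discrepancies" in one of the repeats above.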

What is the protein's subcellular location?

$ tofasta/infile=class:protein2.pep/out=protein2.fsa
$ psort2 protein2.fsa


 psg: 0.44  gvh: 0.16  alm: 0.43  top: 0.53  tms: 0.00  mit: 0.22  mip: 0.03
 nuc: 0.14  erl: 0.00  erm: 0.00  pox: 0.00  px2: 0.00  vac: 0.00  rnp: 0.00
 act: 0.00  caa: 0.00  yqr: 0.00  tyr: 0.00  leu: 0.00  gpi: 0.00  myr: 0.00
 dna: 0.25  rib: 0.00  bac: 0.00  m1a: 0.00  m1b: 0.00  m2 : 0.00  mNt: 0.00
 m3a: 0.00  m3b: 0.00  m_ : 1.00  ncn: 1.00  lps: 0.06  len: 0.17  clr: 1.00

	 91.3 %: nuclear
	  4.3 %: vacuolar
	  4.3 %: plasma membrane

>> prediction for Protein2 is nuc (k=23)
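The percentages in that output are consistent with a simple vote among the k=23 nearest neighbors: 21 of 23 is 91.3% and 1 of 23 is 4.3%. The following Python sketch shows that arithmetic. The 21/1/1 split is inferred from the reported percentages rather than read from the PSORT output, and the function name is made up.

```python
from collections import Counter


def vote_percentages(neighbor_labels):
    """Turn a list of nearest-neighbor labels into percentage votes,
    listed from most to least common."""
    counts = Counter(neighbor_labels)
    total = len(neighbor_labels)
    return {label: round(100.0 * n / total, 1)
            for label, n in counts.most_common()}


# Assumed split of the 23 neighbors (inferred, not taken from PSORT itself).
neighbors = ["nuclear"] * 21 + ["vacuolar"] + ["plasma membrane"]

votes = vote_percentages(neighbors)
print(votes)
print(max(votes, key=votes.get))
```

Read this way, the "nuc" call means that nearly all of the known proteins with the most similar feature scores are nuclear. It does not mean that a nuclear targeting signal was found directly.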

Taken together, this set of characteristics suggests that the protein is 
likely a DNA binding/regulating protein. 

Problem group 3. Treading on thin ice - what is the secondary structure?

A third protein from this complex has been isolated and fully sequenced. The sequence is stored in the file CLASS:PROTEIN3.PEP. A series of mutations in this protein have been isolated, and all fall between amino acids 100 and 200.

3A. What is the secondary structure in that region?

This is actually NRL_3D:1GPAA, glycogen phosphorylase (and the
mutations' positions are completely made up.) Here is its real
secondary structure as determined from the X-ray structure: 

   alpha helix :  95-106, 109-115, 125-141
   beta  sheet : 153-154, 158-162, 165-169, 182-184, 189-200
   turn        : 162-165, 172-181, 184-187

Let's try a couple of the secondary structure servers and see how they
do.  Feed them the entire sequence, but only look at the region
100-200 in the results. 

$ em_phd
$ nnpredict

Wait around for the results to come back in the mail.  (For nnpredict
one could also use a Web browser to connect to
"" and then paste the PLAIN
text sequence in.)
HEL: H=helix, E=extended (sheet), T = turn,  -=other

       100       110       120       130       140       150       160       170       180       190       200
       |         |         |         |         |         |         |         |         |         |         |
NN     HHHHHH----HHHHHHHHHH---------HHHHHHHHHHHHHHHHH----EEEEE-------------HHH-HHHH---------------------EE--


So, in this instance, the two predictions are relatively close, giving
4 alpha helices, and 2 common sheet regions, plus some disagreement on
two other sheets. 

Comparison with the known structure shows that 3 of the 4 predicted
helices are quite close to those found in the model, with the fourth
helix corresponding to a region of turns. The sheet regions predicted
by PHD picked up 4 out of 5 real sheets; nnpredict did less well, with
only 2 out of 5. 
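The nnpredict part of this comparison can be checked mechanically. The following Python sketch uses the NN line shown above and the helix and sheet ranges from the X-ray structure. The function name is made up, and the scoring rule (a real element counts as found if at least one of its residues is predicted in the right state) is a deliberately lenient assumption.

```python
# nnpredict output for residues 100-200, copied from the NN line above.
NN = ("HHHHHH----HHHHHHHHHH---------HHHHHHHHHHHHHHHHH----EEEEE"
      "-------------HHH-HHHH---------------------EE--")
FIRST = 100  # residue number of the first character in NN

# Known elements from the X-ray structure, as (start, end) residue ranges.
REAL_HELICES = [(95, 106), (109, 115), (125, 141)]
REAL_SHEETS = [(153, 154), (158, 162), (165, 169), (182, 184), (189, 200)]


def elements_hit(prediction, first_residue, elements, state):
    """Count known elements with at least one residue predicted as `state`."""
    last_residue = first_residue + len(prediction) - 1
    hits = 0
    for start, end in elements:
        lo, hi = max(start, first_residue), min(end, last_residue)
        if any(prediction[r - first_residue] == state
               for r in range(lo, hi + 1)):
            hits += 1
    return hits


print(elements_hit(NN, FIRST, REAL_HELICES, "H"), "of", len(REAL_HELICES), "helices")
print(elements_hit(NN, FIRST, REAL_SHEETS, "E"), "of", len(REAL_SHEETS), "sheets")
```

By this measure nnpredict touches all 3 real helices but only 2 of the 5 real sheets, matching the tally above. A stricter per-residue score would penalize the extra helix predicted around 168-175, where the structure actually has sheet and turn.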

3B. Suppose that replacing any of the amino acids between 171 and 175 with a proline radically decreases the function of this enzyme. Propose a model based on the secondary structure predictions to explain this.

Both secondary structure prediction programs suggest that this region
is an alpha helix.  Naturally, if you believed these results, you
would propose that this helix is required for enzyme activity, and
that the proline disrupts it.  Unfortunately, in this case both
predictions are wrong, and so the model is incorrect.  (It is not
uncommon for multiple programs to give the same wrong prediction for a
region - they are "trained" or tested against similar sets of data,
and so tend to predict the same secondary structure for the same
sequence.  However, the protein may have evolved to force that region
into another secondary structure.) 

When it comes to secondary structure predictions, caveat emptor!