Fundamentals of Sequence Analysis, 1998-1999

Lecture 6. Tools for Molecular Biology III.


Introduction

Today we're going to continue examining tools that you can use for planning and analyzing your molecular biology experiments. The focus of this lecture will be on analyzing protein sequences. The protein analysis tools cover many properties and kinds of analysis that aren't necessarily related to each other, so this lecture will jump around a lot as we move from one kind of analysis to another. Here is an overview of what will be covered:


BackTranslate          Translate Protein to DNA
PeptideSort            Digest Peptides
PeptideMap             Map Peptide Digestions
CBRG server            Search database by fragment masses
Isoelectric            Plot charge properties
SigCleave              Find signal sites
PESTFIND               Find PEST sequences
Antigenic              Find (likely) antigenic sequences
PepCoil                Find coiled-coil regions
HelixTurnHelix         Find Helix-Turn-Helix regions
Peptide/PlotStructure  Predict/plot secondary structure and other properties
PepPlot                Predict/plot secondary structure and other properties
Moment                 Plot hydrophobic moment
HelicalWheel           Plot amino acids around alpha helix or beta sheet
PepNet                 Plot amino acids around alpha helix
SAPS                   Statistical properties of protein sequences
PSORT2                 Predict subcellular location

Back translation

Imagine that you have a chicken protein sequence and want to isolate a similar protein from a human cDNA library. To do so, back translate the protein sequence to a DNA sequence and make an oligonucleotide probe. Be sure to specify a codon frequency table that is appropriate for the target organism (in this instance, Homo sapiens). The default frequency table is for E. coli highly expressed genes. Use the /infile2=xxxxx.cod qualifier to specify the desired table.

The menu options for BackTranslate are:

  • A. make a table of all back-translations and most probable sequence
  • B. make a table of all back-translations and most ambiguous sequence
  • C. make the most probable sequence only
  • D. make the most ambiguous sequence only
  • Note that the most ambiguous sequence may not even code for the same protein. Such a sequence uses ambiguity codes, such as N or Y, at some positions, and sometimes the result is a sequence that will not translate back to the original amino acid. For instance, the degenerate code for leucine is YTN, because the 6 leucine codons are CTN, TTA and TTG. But YTN also includes TTC and TTT, which code for phenylalanine.

    
    
    leucine   =  CTA,CTG,CTC,CTT,TTA,TTG = {C,T}T{A,C,G,T}  =  YTN
    
    YTN  = {CTN, TTA, TTG} (leucine) + {TTC, TTT} (phenylalanine)
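
    The leucine/phenylalanine overlap can be checked mechanically. The following is a minimal sketch, not part of BackTranslate: it expands a degenerate codon into the concrete codons it matches and looks each one up in the standard genetic code. The helper name expand and the two small tables are introduced here for illustration only.

```python
from itertools import product

# IUPAC ambiguity codes needed for this example (Y = C or T, N = any base).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "CT", "N": "ACGT"}

# Only the slice of the standard genetic code that YTN can reach.
CODON_TO_AA = {
    "CTA": "Leu", "CTC": "Leu", "CTG": "Leu", "CTT": "Leu",
    "TTA": "Leu", "TTG": "Leu", "TTC": "Phe", "TTT": "Phe",
}


def expand(degenerate_codon):
    """Return every concrete codon matched by a degenerate codon."""
    choices = [IUPAC[base] for base in degenerate_codon]
    return ["".join(bases) for bases in product(*choices)]


codons = expand("YTN")

# Group the concrete codons by the amino acid they encode.
encoded = {}
for codon in codons:
    encoded.setdefault(CODON_TO_AA[codon], []).append(codon)

for amino_acid, matching in sorted(encoded.items()):
    print(amino_acid, sorted(matching))
```

    Of the eight codons matched by YTN, six encode leucine and two encode phenylalanine, which is why a fully ambiguous probe does not necessarily translate back to the starting protein.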
    
    

    In the following example, the known protein is chicken myelin protein and the back translation utilizes human coding preferences.

    $ backtranslate -
         /infile1=sw:mypr_chick -
         /infile2=data:hum.cod -
         /outfile=human.tbl -
         /menu=B -
         /begin=20/end=30
    $ type human.tbl
    
     BACKTRANSLATE of: : Mypr_Chick  check: 278  from: 20  to: 30
     
    ID   MYPR_CHICK     STANDARD;      PRT;   276 AA.
    AC   P23289;
    DT   01-NOV-1991 (REL. 20, CREATED)
    DT   01-NOV-1991 (REL. 20, LAST SEQUENCE UPDATE)
    DT   01-FEB-1996 (REL. 33, LAST ANNOTATION UPDATE)
    DE   MYELIN PROTEOLIPID PROTEIN (PLP) (LIPOPHILIN). . . . 
     
     Using codon frequencies from: Gencoredisk:[Gcgcore.Data.Rundata]Hum.Cod
     CheckFile: 5653
    
        Ala        Thr        Gly        Leu        Cys        Phe        Phe      
      GCC 0.40   ACC 0.38   GGC 0.33   CTG 0.43   TGC 0.58   TTC 0.57   TTC 0.57   
      GCT 0.28   ACA 0.27   GGA 0.26   CTC 0.20   TGT 0.42   TTT 0.43   TTT 0.43   
      GCA 0.22   ACT 0.23   GGG 0.23   TTG 0.12                                    
      GCG 0.10   ACG 0.12   GGT 0.18   CTT 0.12                                    
                                       CTA 0.07                                    
                                       TTA 0.06                                    
      22         31         47         81         62         51         36         
     
       27  -  30
        Gly        Val        Ala        Leu                                       
      GGC 0.33   GTG 0.48   GCC 0.40   CTG 0.43                                    
      GGA 0.26   GTC 0.25   GCT 0.28   CTC 0.20                                    
      GGG 0.23   GTT 0.17   GCA 0.22   TTG 0.12                                    
      GGT 0.18   GTA 0.10   GCG 0.10   CTT 0.12                                    
                                       CTA 0.07                                    
                                       TTA 0.06                                    
      27         0          0          0          
     
    Human.Tbl  Length: 33  February 23, 1999 10:27  Type: N  Check: 4383  ..
    
           1  GCNACNGGNY TNTGYTTYTT YGGNGTNGCN YTN
    
    
    

    The number under each column is the product of the probabilities of the most likely codon for that residue and for the next three residues, multiplied by 1000 and rounded so that it falls in the range 0 to 1000. For example, using the top codon in each of the first four columns above:

    .40 * .38 * .33 * .43 = 0.0215688  =>  22
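
    The whole row of numbers can be reproduced the same way. This is a small sketch of the arithmetic only, not GCG code; the function name window_scores is made up here, and returning 0 when fewer than four codons remain is an assumption that happens to agree with the trailing zeros in the table.

```python
# Probability of the most likely human codon at residues 20-30,
# copied from the top row of each column in the table above.
top_codon_prob = [0.40, 0.38, 0.33, 0.43, 0.58, 0.57, 0.57,
                  0.33, 0.48, 0.40, 0.43]


def window_scores(probs, window=4, scale=1000):
    """Product of `window` consecutive probabilities, scaled and rounded."""
    scores = []
    for start in range(len(probs)):
        chunk = probs[start:start + window]
        if len(chunk) < window:
            # Assumption: positions without a full window score 0.
            scores.append(0)
            continue
        product = 1.0
        for p in chunk:
            product *= p
        scores.append(round(product * scale))
    return scores


scores = window_scores(top_codon_prob)
print(scores)  # [22, 31, 47, 81, 62, 51, 36, 27, 0, 0, 0]
```

    A high number marks the start of a stretch of four residues whose preferred codons are all strongly favored, which is a reasonable place to look for a low-degeneracy probe.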

    From the codons shown you may design degenerate primers. For instance, if we wanted to start at the first position (amino acid 20) in this sequence, begin by editing out the frequencies and spaces to give:

    GCCACCGGCCTGTGCTTCTTC   
    GCTACAGGACTCTGTTTTTTT   
    GCAACTGGGTTG                                 
    GCGACGGGTCTT                                 
             CTA                                   
             TTA                                    
    

    then for each column, leave only the unique nucleotides, to give:

    GCCACCGGCCTGTGCTTCTTC   
      T  A  AT C  T  T  T   
      A  T  G  T                                 
      G  G  T  A                                 
    

    Most synthesizers can make a degenerate oligonucleotide like this. You might also examine the sequence table to see if there are any less degenerate regions. In either case, you would probably want to add some unique sequence on the 5' ends so that you could cut the ends, amplify it again, and so forth.
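
    The column-by-column reduction used above is easy to automate. The sketch below is an illustration under the assumption that the codon lists are typed in from the table (residues 20 to 26); the names degenerate_positions and as_rows are hypothetical and not part of any GCG program.

```python
# Codons for residues 20-26, in the order listed in the table above.
codons_per_residue = [
    ["GCC", "GCT", "GCA", "GCG"],                 # Ala
    ["ACC", "ACA", "ACT", "ACG"],                 # Thr
    ["GGC", "GGA", "GGG", "GGT"],                 # Gly
    ["CTG", "CTC", "TTG", "CTT", "CTA", "TTA"],   # Leu
    ["TGC", "TGT"],                               # Cys
    ["TTC", "TTT"],                               # Phe
    ["TTC", "TTT"],                               # Phe
]


def degenerate_positions(codon_lists):
    """For each nucleotide position, list the distinct bases in order seen."""
    positions = []
    for codons in codon_lists:
        for i in range(3):
            seen = []
            for codon in codons:
                if codon[i] not in seen:
                    seen.append(codon[i])
            positions.append("".join(seen))
    return positions


def as_rows(positions):
    """Lay the alternatives out in rows, like the hand-edited display."""
    depth = max(len(bases) for bases in positions)
    return ["".join(bases[k] if k < len(bases) else " " for bases in positions).rstrip()
            for k in range(depth)]


positions = degenerate_positions(codons_per_residue)
rows = as_rows(positions)
print("\n".join(rows))

# Number of distinct sequences in the synthesized mixture.
degeneracy = 1
for bases in positions:
    degeneracy *= len(bases)
print("degeneracy:", degeneracy)
```

    For these seven residues the mixture contains 4096 different sequences, which is one reason to scan the table for a less degenerate stretch before ordering the oligonucleotide.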


    Mapping digestion sites, finding fragment properties

    There is an assortment of methods for digesting proteins at sequence-specific sites. Often you will want to calculate the properties of the fragments that result from such a digestion, so that you can tell which is which on a gel or in another analysis. PeptideSort is analogous to the MapSort program, but for proteins. It reports where the cuts were and, for each fragment, the molecular weight, the HPLC retention at pH 2.1 and 7.4 (probably not accurate for fragments longer than 20 amino acids), and assorted compositional information. At the end is a summary of the composition of the full protein. Adding the qualifier /NOCut outputs only the summary of the composition of the whole protein.

    $ PEPTIDESORT/infile=pir1:a1hu/enzyme=tryp/default
    $ type a1hu.pepsort
    
     PEPTIDESORT of: Pir1:A1hu  check: 6072  from: 1  to: 353
    
    P1;A1HU - Ig alpha-1 chain C region - human
    C;Species: Homo sapiens (man)
    C;Date: 22-May-1981 #sequence_revision 03-Oct-1995 #text_change 02-Sep-1997
    C;Accession: A22360; A92249; A91662; S38979; B53110; A02171
    R;Flanagan, J.G.; Lefranc, M.P.; Rabbitts, T.H.
    Cell 36, 681-688, 1984 . . . 
    
     With Enzymes: TRYP 
    
                             February 17, 1999 09:58  ..
    
                   Digest with: Tryp.  Peptides Sorted by Position
    
    
    Pos  From     To   Mol Wt  Ret2.1  Ret7.4    Chg  Aro Acid Base Sulf Phil Phob
      1     1 -    7    686.8    13.3    17.3    1.0    0    0    1    0    6    1
      A1,K1,P2,S2,T1 Iso=9.67 Ext=0
      2     8 -   51   4666.3   115.4    83.9   -2.0    4    3    1    2   21   23
      A2,C2,D1,E2,F3,G4,I1,L4,N1,P4,Q4,R1,S5,T3,V6,W1 Iso=4.00 Ext=5930
      3    52 -   81   3112.4    62.4    51.0   -1.0    2    2    1    1   18   12
      A3,C1,D2,F1,G2,K1,L4,N1,P3,Q3,S4,T4,Y1 Iso=4.11 Ext=1400
      4    82 -   88    772.9     6.3    -1.5    1.0    0    0    1    1    4    3
      C1,H1,K1,S1,T1,V2 Iso=8.44 Ext=120
      5    89 -  126   3967.4    99.0    75.9    0.0    1    1    1    3   31    7
      C3,D1,H2,N1,P13,Q1,R1,S6,T6,V3,Y1 Iso=7.26 Ext=1640
      6   127 -  131    624.7    19.1    15.5    1.0    0    0    1    0    3    2
      H1,L2,R1,S1 Iso=10.53 Ext=0
      7   132 -  153   2300.6    58.0    28.6   -2.0    0    3    1    1   10   12
      A2,C1,D1,E2,G2,L7,N1,P1,R1,S1,T3 Iso=4.00 Ext=120
      8   154 -  168   1540.6    36.1    40.4    0.0    2    1    1    0    9    6
      A1,D1,F1,G2,K1,P1,S3,T3,V1,W1 Iso=6.31 Ext=5690
      9   169 -  177    940.0     7.0    -4.9    0.0    0    1    1    0    6    3
      A1,E1,G1,P2,Q1,R1,S1,V1 Iso=6.44 Ext=0
     10   178 -  200   2422.7    41.8    10.0   -1.0    2    2    1    3   10   13
      A1,C3,D1,E1,G3,H1,K1,L2,N1,P2,S3,V2,W1,Y1 Iso=5.48 Ext=7330
     11   201 -  212   1318.4    24.3    11.5    0.0    2    1    1    1    7    5
      A2,C1,E1,F1,K1,P1,S1,T3,Y1 Iso=6.22 Ext=1400
     12   213 -  221    931.1    32.0    33.0    1.0    0    0    1    0    6    3
      A1,K1,L2,P1,S1,T3 Iso=9.67 Ext=0
     13   222 -  227    680.7    11.6    16.5    1.0    1    0    1    0    4    2
      F1,G1,N1,R1,S1,T1 Iso=10.53 Ext=0
     14   228 -  253   2855.3    76.7    20.5   -3.0    0    4    1    1   14   12
      A2,C1,E4,H1,L7,N1,P4,R1,S1,T2,V2 Iso=4.31 Ext=120
     15   254 -  258    534.6    21.0    20.0    1.0    1    0    1    0    3    2
      F1,G1,K1,P1,S1 Iso=9.67 Ext=0
     16   259 -  263    600.7    15.8     6.2    0.0    0    1    1    0    2    3
      D1,L1,R1,V2 Iso=6.31 Ext=0
     17   264 -  273   1213.4    31.4    13.5    0.0    1    1    1    0    6    4
      E1,G1,L2,P1,Q2,R1,S1,W1 Iso=6.44 Ext=5690
     18   274 -  275    275.3    -4.2   -17.4    0.0    0    1    1    0    2    0
      E1,K1 Iso=6.44 Ext=0
     19   276 -  282    896.0    36.0    34.4    1.0    2    0    1    0    3    4
      A1,L1,R1,S1,T1,W1,Y1 Iso=9.75 Ext=6970
     20   283 -  299   1836.0    34.5    32.1    0.0    1    1    1    0   11    6
      A1,E1,F1,G1,I1,L1,P1,Q2,R1,S2,T4,V1 Iso=6.44 Ext=0
     21   300 -  306    817.9    14.2    -7.0   -1.0    1    2    1    0    3    4
      A2,D1,E1,K1,V1,W1 Iso=4.24 Ext=5690
     22   307 -  307    146.2     3.3    -0.5    1.0    0    0    1    0    1    0
      K1 Iso=9.67 Ext=0
     23   308 -  327   2152.5    53.4    24.5   -1.0    2    2    1    2    9   11
      A2,C1,D1,E1,F2,G2,H1,K1,L2,M1,P1,Q1,S1,T2,V1 Iso=5.49 Ext=120
     24   328 -  331    503.5    12.5     8.6    0.0    0    1    1    0    3    1
      D1,I1,R1,T1 Iso=6.31 Ext=0
     25   332 -  335    387.5    12.7     8.8    1.0    0    0    1    0    1    3
      A1,G1,K1,L1 Iso=9.67 Ext=0
     26   336 -  353   1921.2    31.7     0.8   -2.0    1    2    0    2    8   10
      A1,C1,D1,E1,G1,H1,M1,N1,P1,S1,T2,V5,Y1 Iso=4.23 Ext=1400
    
    
    
                  Digest with: Tryp.  Peptides Sorted by Weight
    
    
    Pos  From     To   Mol Wt  Ret2.1  Ret7.4    Chg  Aro Acid Base Sulf Phil Phob
     22   307 -  307    146.2     3.3    -0.5    1.0    0    0    1    0    1    0
     18   274 -  275    275.3    -4.2   -17.4    0.0    0    1    1    0    2    0
     25   332 -  335    387.5    12.7     8.8    1.0    0    0    1    0    1    3
     24   328 -  331    503.5    12.5     8.6    0.0    0    1    1    0    3    1
     15   254 -  258    534.6    21.0    20.0    1.0    1    0    1    0    3    2
     16   259 -  263    600.7    15.8     6.2    0.0    0    1    1    0    2    3
      6   127 -  131    624.7    19.1    15.5    1.0    0    0    1    0    3    2
     13   222 -  227    680.7    11.6    16.5    1.0    1    0    1    0    4    2
      1     1 -    7    686.8    13.3    17.3    1.0    0    0    1    0    6    1
      4    82 -   88    772.9     6.3    -1.5    1.0    0    0    1    1    4    3
     21   300 -  306    817.9    14.2    -7.0   -1.0    1    2    1    0    3    4
     19   276 -  282    896.0    36.0    34.4    1.0    2    0    1    0    3    4
     12   213 -  221    931.1    32.0    33.0    1.0    0    0    1    0    6    3
      9   169 -  177    940.0     7.0    -4.9    0.0    0    1    1    0    6    3
     17   264 -  273   1213.4    31.4    13.5    0.0    1    1    1    0    6    4
     11   201 -  212   1318.4    24.3    11.5    0.0    2    1    1    1    7    5
      8   154 -  168   1540.6    36.1    40.4    0.0    2    1    1    0    9    6
     20   283 -  299   1836.0    34.5    32.1    0.0    1    1    1    0   11    6
     26   336 -  353   1921.2    31.7     0.8   -2.0    1    2    0    2    8   10
     23   308 -  327   2152.5    53.4    24.5   -1.0    2    2    1    2    9   11
      7   132 -  153   2300.6    58.0    28.6   -2.0    0    3    1    1   10   12
     10   178 -  200   2422.7    41.8    10.0   -1.0    2    2    1    3   10   13
     14   228 -  253   2855.3    76.7    20.5   -3.0    0    4    1    1   14   12
      3    52 -   81   3112.4    62.4    51.0   -1.0    2    2    1    1   18   12
      5    89 -  126   3967.4    99.0    75.9    0.0    1    1    1    3   31    7
      2     8 -   51   4666.3   115.4    83.9   -2.0    4    3    1    2   21   23
    
    
    
                  Digest with: Tryp.  Peptides Sorted by Retention
    
    
    Pos  From     To   Mol Wt  Ret2.1  Ret7.4    Chg  Aro Acid Base Sulf Phil Phob
     18   274 -  275    275.3    -4.2   -17.4    0.0    0    1    1    0    2    0
     22   307 -  307    146.2     3.3    -0.5    1.0    0    0    1    0    1    0
      4    82 -   88    772.9     6.3    -1.5    1.0    0    0    1    1    4    3
      9   169 -  177    940.0     7.0    -4.9    0.0    0    1    1    0    6    3
     13   222 -  227    680.7    11.6    16.5    1.0    1    0    1    0    4    2
     24   328 -  331    503.5    12.5     8.6    0.0    0    1    1    0    3    1
     25   332 -  335    387.5    12.7     8.8    1.0    0    0    1    0    1    3
      1     1 -    7    686.8    13.3    17.3    1.0    0    0    1    0    6    1
     21   300 -  306    817.9    14.2    -7.0   -1.0    1    2    1    0    3    4
     16   259 -  263    600.7    15.8     6.2    0.0    0    1    1    0    2    3
      6   127 -  131    624.7    19.1    15.5    1.0    0    0    1    0    3    2
     15   254 -  258    534.6    21.0    20.0    1.0    1    0    1    0    3    2
     11   201 -  212   1318.4    24.3    11.5    0.0    2    1    1    1    7    5
     17   264 -  273   1213.4    31.4    13.5    0.0    1    1    1    0    6    4
     26   336 -  353   1921.2    31.7     0.8   -2.0    1    2    0    2    8   10
     12   213 -  221    931.1    32.0    33.0    1.0    0    0    1    0    6    3
     20   283 -  299   1836.0    34.5    32.1    0.0    1    1    1    0   11    6
     19   276 -  282    896.0    36.0    34.4    1.0    2    0    1    0    3    4
      8   154 -  168   1540.6    36.1    40.4    0.0    2    1    1    0    9    6
     10   178 -  200   2422.7    41.8    10.0   -1.0    2    2    1    3   10   13
     23   308 -  327   2152.5    53.4    24.5   -1.0    2    2    1    2    9   11
      7   132 -  153   2300.6    58.0    28.6   -2.0    0    3    1    1   10   12
      3    52 -   81   3112.4    62.4    51.0   -1.0    2    2    1    1   18   12
     14   228 -  253   2855.3    76.7    20.5   -3.0    0    4    1    1   14   12
      5    89 -  126   3967.4    99.0    75.9    0.0    1    1    1    3   31    7
      2     8 -   51   4666.3   115.4    83.9   -2.0    4    3    1    2   21   23
    
    
    
         Summary for whole sequence:
    
    Molecular weight =   37654.29     Residues =    353
    Average Residue Weight = 106.669     Charged =   -4
    Isoelectric point =  6.48
    Extinction coefficient =  42720
    
    Residue           Number      Mole Percent     ..
    
    A = Ala            24            6.799
    B = Asx             0            0.000
    C = Cys            15            4.249
    D = Asp            12            3.399
    E = Glu            17            4.816
    F = Phe            11            3.116
    G = Gly            22            6.232
    H = His             8            2.266
    I = Ile             3            0.850
    K = Lys            13            3.683
    L = Leu            36           10.198
    M = Met             2            0.567
    N = Asn             8            2.266
    P = Pro            39           11.048
    Q = Gln            14            3.966
    R = Arg            12            3.399
    S = Ser            38           10.765
    T = Thr            40           11.331
    V = Val            27            7.649
    W = Trp             6            1.700
    Y = Tyr             6            1.700
    Z = Glx             0            0.000
    
    A + G              46           13.031
    S + T              78           22.096
    D + E              29            8.215
    D + E + N +  Q     51           14.448
    H + K + R          33            9.348
    D + E + H + K + R  62           17.564
    I + L + M + V      68           19.263
    F + W + Y          23            6.516
    
     Enzymes that do cut:
    
         Tryp
    
     Enzymes that do not cut: 
    
       NONE
    
    

    Mapping digestion sites in detail

    PeptideMap is analogous to Map. Use it to make detailed maps of the locations in the protein sequence that are cut by site-specific enzymes.

    $ PEPTIDEMAP/infile=pir1:a1hu/enzyme=tryp/default
    $ type a1hu.map
    
     (Linear) (Peptide) MAP of: A1hu  check: 6072  from: 1  to: 353
    
     With 2 enzymes: TRYP 
    
                                 February 17, 1999 10:04  ..
     
                   T                                           T
                   r                                           r
                   y                                           y
                   p                                           p
             ASPTSPKVFPLSLCSTQPDGNVVIACLVQGFFPQEPLSVTWSESGQGVTARNFPPSQDAS
           1 ---------+---------+---------+---------+---------+---------+ 60
     
                                 T      T
                                 r      r
                                 y      y
                                 p      p
             GDLYTTSSQLTLPATQCLAGKSVTCHVKHYTNPSQDVTVPCPVPSTPPTPSPSTPPTPSP
          61 ---------+---------+---------+---------+---------+---------+ 120
     
                  T    T                     T              T        T
                  r    r                     r              r        r
                  y    y                     y              y        y
                  p    p                     p              p        p
             SCCHPRLSLHRPALEDLLLGSEANLTCTLTGLRDASGVTFTWTPSSGKSAVQGPPERDLC
         121 ---------+---------+---------+---------+---------+---------+ 180
     
                                T           T        T     T
                                r           r        r     r
                                y           y        y     y
                                p           p        p     p
             GCYSVSSVLPGCAEPWNHGKTFTCTAAYPESKTPLTATLSKSGNTFRPEVHLLPPPSEEL
         181 ---------+---------+---------+---------+---------+---------+ 240
     
                         T    T    T         T T      T                T
                         r    r    r         r r      r                r
                         y    y    y         y y      y                y
                         p    p    p         p p      p                p
             ALNELVTLTCLARGFSPKDVLVRWLQGSQELPREKYLTWASRQEPSQGTTTFAVTSILRV
         241 ---------+---------+---------+---------+---------+---------+ 300
     
                  TT                   T   T   T
                  rr                   r   r   r
                  yy                   y   y   y
                  pp                   p   p   p
             AAEDWKKGDTFSCMVGHEALPLAFTQKTIDRLAGKPTHVNVSVVMAEVDGTCY
         301 ---------+---------+---------+---------+---------+--- 353
    
     Enzymes that do cut:
    
         Tryp
    
     Enzymes that do not cut: 
    
       NONE
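
    The cut positions in the PeptideSort and PeptideMap output above are consistent with a simple rule: cut after every lysine (K) or arginine (R), including when the next residue is proline (see the cut between residues 131 and 132). The sketch below applies that rule to the first 88 residues of the sequence and computes average molecular weights. It is an illustration, not the GCG implementation; the function names are made up, and the residue masses are standard average values supplied here as an assumption about what PeptideSort uses.

```python
# Standard average residue masses in daltons (assumed, not taken from GCG).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153  # one water per free peptide (N-terminal H plus C-terminal OH)


def tryptic_fragments(sequence):
    """Cut after every K or R; return (start, end, peptide), 1-based positions."""
    fragments = []
    start = 0
    for i, residue in enumerate(sequence):
        if residue in "KR" or i == len(sequence) - 1:
            fragments.append((start + 1, i + 1, sequence[start:i + 1]))
            start = i + 1
    return fragments


def average_mass(peptide):
    """Average molecular weight of a free peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


# Residues 1-88 of PIR1:A1HU, copied from the PeptideMap output above.
a1hu_start = ("ASPTSPKVFPLSLCSTQPDGNVVIACLVQGFFPQEPLSVTWSESGQGVTARNFPPSQDAS"
              "GDLYTTSSQLTLPATQCLAGKSVTCHVK")

fragments = tryptic_fragments(a1hu_start)
for start, end, peptide in fragments:
    print(f"{start:4d} - {end:4d}  {average_mass(peptide):8.1f}")
```

    The four fragments span 1-7, 8-51, 52-81 and 82-88, the same boundaries as the first four rows of the PeptideSort table, and the masses should agree with the Mol Wt column to within rounding.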
    
    
    

    Finding a protein in a database with matching digest fragments

    A specialized type of database search is available for cases where you have either digest information or the total mass of a protein, and want to find out whether it is likely to be identical to a protein in the SWISS-PROT database. The key word here is "identical": it doesn't take many amino acid substitutions, and especially indels, to shift the molecular weights far enough that similar proteins won't be found. Also, the molecular weights must be determined on peptides that do not contain mass-altering modifications such as glycosylation. This search is provided by the CBRG (Computational Biochemistry Research Group, in Zurich), with results returned via e-mail. It may take several days for the results to be mailed back to you. Note that the masses in the following examples were calculated with PeptideSort, which assumes a protonated carboxyl group. The CBRG server converts all fragments to the unprotonated carboxyl form (it removes the proton), so 1.0 was added to each PeptideSort mass before entering it.

    $ cbrg
    7          for TotalMass
    37655      PIR1:A1HU mass
    
    or
    
    $ cbrg
    4          MassSearch, with fragments
    22         Trypsin, there are many other options
    601.        one fragment is .600 kD
    897.        one fragment is .896 kD
    1837.       one fragment is 1.836 kD
    2301.       one fragment is 2.3 kD
                blank line - terminate fragment list
                blank line - send query
    

    I won't show you the results of these searches - they just consist of a list of proteins, sorted by fit to the data provided, so that the best ones are first in the list.


    Calculating a protein's charge

    Use the GCG program Isoelectric to plot a protein's charge at various pHs.

    $ tektronix versaterm term
    $ isoelectric/infile=pir1:a1hu/outfile=a1hu.iso/default
    

    Adjust the vertical scale with /MINCharge=xxx/MAXCharge=yyy. The "+" and "-" in the graph refer to the net charge on the positively and negatively charged residues.

    The output file is optional. Here's what is in it:

    $ type a1hu.iso
    
     ISOELECTRIC of: Pir1:A1hu Check: 6072 from: 1 to: 353  February 17, 1999 10:17
    P1;A1HU - Ig alpha-1 chain C region - human
    
    Amino Acid         Number of
                       Residues
    -----------------  -----------
    Arginine              12
    Lysine                13
    Histidine              8
    Tyrosine               6
    Cysteine              15
    Glutamic Acid         17
    Aspartic Acid         12
    
    Amino Terminus         1
    Carboxyl Terminus      1
    
                           Number of Hydrogen Ions Bound
             ----------------------------------------------------------    Net
      pH     Arg   Lys   His   Tyr   Cys   Glu   Asp   NH2  COOH  Total   Charge ..
    
     1.00  12.00 13.00  8.00  6.00 15.00 16.99 11.99  1.00  1.00  84.97    33.97
     1.50  12.00 13.00  8.00  6.00 15.00 16.97 11.95  1.00  0.99  84.91    33.91
     2.00  12.00 13.00  8.00  6.00 15.00 16.90 11.85  1.00  0.97  84.73    33.73
     2.50  12.00 13.00  8.00  6.00 15.00 16.70 11.55  1.00  0.92  84.17    33.17
     3.00  12.00 13.00  8.00  6.00 15.00 16.09 10.69  1.00  0.78  82.56    31.56
     3.50  12.00 13.00  7.99  6.00 15.00 14.43  8.64  1.00  0.53  78.60    27.60
     4.00  12.00 13.00  7.97  6.00 15.00 10.88  5.38  1.00  0.27  71.50    20.50
     4.50  12.00 13.00  7.92  6.00 15.00  6.12  2.45  1.00  0.10  63.59    12.59
     5.00  12.00 13.00  7.75  6.00 14.99  2.57  0.90  1.00  0.04  58.25     7.25
     5.50  12.00 13.00  7.27  6.00 14.98  0.91  0.30  1.00  0.01  55.47     4.47
     6.00  12.00 13.00  6.08  6.00 14.93  0.30  0.10  1.00  0.00  53.40     2.40
     6.50  12.00 13.00  4.00  6.00 14.77  0.10  0.03  0.99  0.00  50.88    -0.12
     7.00  12.00 13.00  1.92  6.00 14.28  0.03  0.01  0.97  0.00  48.22    -2.78
     7.50  12.00 12.99  0.73  6.00 12.95  0.01  0.00  0.92  0.00  45.60    -5.40
     8.00  12.00 12.98  0.25  5.99  9.99  0.00  0.00  0.78  0.00  42.00    -9.00
     8.50  12.00 12.93  0.08  5.98  5.80  0.00  0.00  0.53  0.00  37.33   -13.67
     9.00  12.00 12.79  0.03  5.93  2.50  0.00  0.00  0.27  0.00  33.51   -17.49
     9.50  11.99 12.37  0.01  5.79  0.89  0.00  0.00  0.10  0.00  31.15   -19.85
    10.00  11.96 11.19  0.00  5.39  0.29  0.00  0.00  0.04  0.00  28.87   -22.13
    10.50  11.88  8.59  0.00  4.43  0.09  0.00  0.00  0.01  0.00  25.01   -25.99
    11.00  11.63  4.96  0.00  2.83  0.03  0.00  0.00  0.00  0.00  19.45   -31.55
    11.50  10.91  2.12  0.00  1.32  0.01  0.00  0.00  0.00  0.00  14.36   -36.64
    12.00   9.12  0.76  0.00  0.49  0.00  0.00  0.00  0.00  0.00  10.37   -40.63
    12.50   6.00  0.25  0.00  0.16  0.00  0.00  0.00  0.00  0.00   6.41   -44.59
    13.00   2.88  0.08  0.00  0.05  0.00  0.00  0.00  0.00  0.00   3.02   -47.98
    
                               Isoelectric Point
    
     6.48  12.00 13.00  4.10  6.00 14.78  0.10  0.03  0.99  0.00  51.00     0.00
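
    The table can be approximated with the Henderson-Hasselbalch relation: a group with a given pKa holds on average 1 / (1 + 10^(pH - pKa)) protons. In the table, the net charge equals the total number of bound protons minus 51, the number of groups (Tyr, Cys, Glu, Asp and the carboxyl terminus) that are uncharged when protonated. The sketch below reproduces that bookkeeping. The residue counts come from the table; the pKa values are assumptions chosen here to roughly match the table and are not taken from the GCG documentation, and the function names are illustrative.

```python
# name: (count, pKa, charge when protonated).
# Counts are from the Isoelectric output; pKa values are ASSUMED.
GROUPS = {
    "Arg": (12, 12.50, 1), "Lys": (13, 10.79, 1), "His": (8, 6.50, 1),
    "Tyr": (6, 10.95, 0), "Cys": (15, 8.30, 0), "Glu": (17, 4.25, 0),
    "Asp": (12, 3.91, 0), "NH2": (1, 8.56, 1), "COOH": (1, 3.56, 0),
}


def bound_protons(count, pka, ph):
    """Average number of protons held by `count` identical groups."""
    return count / (1.0 + 10.0 ** (ph - pka))


def net_charge(ph):
    """Net charge: +1 per protonated base, -1 per deprotonated acid."""
    charge = 0.0
    for count, pka, protonated_charge in GROUPS.values():
        bound = bound_protons(count, pka, ph)
        if protonated_charge == 1:
            charge += bound
        else:
            charge -= count - bound
    return charge


def isoelectric_point(lo=1.0, hi=13.0, tol=1e-4):
    """Bisect for the pH where the net charge crosses zero."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


for ph in (1.0, 7.0, 13.0):
    print(f"pH {ph:5.2f}  net charge {net_charge(ph):7.2f}")
print(f"estimated pI {isoelectric_point():.2f}")
```

    Because the net charge falls steadily as the pH rises, bisection is enough to locate the isoelectric point. With the assumed pKa values the estimate lands close to the 6.48 reported above; a different pKa set would shift it slightly.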
    
    

    Finding signal sequences

    In lecture two we discussed how to find PROSITE motifs in protein sequences using the Motifs program. There are a few other sorts of motifs that you may need to search for, but which are difficult or impossible to describe as a simple pattern.

    To find signal sequences use the EGCG SIGCLEAVE program. This is based on the work of von Heijne (1986) Nucleic Acids Res. 14:4683-4690.

    $ sigcleave/infile=sw:1a01_human/default
    $ type 1a01_human.sig
    SIGCLEAVE of SW:1A01_HUMAN Check: 1951 from: 1 to: 365
    
    ID   1A01_HUMAN     STANDARD;      PRT;   365 AA.
    DE   HLA CLASS I HISTOCOMPATIBILITY ANTIGEN, A-1 ALPHA CHAIN PRECURSOR.
    
    Report scores over 3.50
    Maximum score 10.5 at residue 23
    
     Sequence:  LLLLSGALALTQT-WAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQ
                | (signal)    | (mature peptide)
               10             23
    
     Other entries above 3.50
    
    Score 9.7 at residue 25
    
     Sequence:  LLSGALALTQTWA-GSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQKM
                | (signal)    | (mature peptide)
               12             25
    
    Score 9.0 at residue 19
    
     Sequence:  PRTLLLLLSGALA-LTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSD
                | (signal)    | (mature peptide)
                6             19
    
    Score 9.0 at residue 21
    
     Sequence:  TLLLLLSGALALT-QTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAA
                | (signal)    | (mature peptide)
                8             21
    
    Score 5.7 at residue 29
    
     Sequence:  ALALTQTWAGSHS-MRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQKMEPRA
                | (signal)    | (mature peptide)
               16             29
    
    Score 5.7 at residue 329
    
     Sequence:  VLLGAVITGAVVA-AVMWRRKSSDRKGGSYTQAASSDSAQGSDVSLTACKV             
                | (signal)    | (mature peptide)
              316             329
    
    Score 5.5 at residue 324
    
     Sequence:  IIAGLVLLGAVIT-GAVVAAVMWRRKSSDRKGGSYTQAASSDSAQGSDVSLTACKV        
                | (signal)    | (mature peptide)
              311             324
    
    Score 5.0 at residue 325
    
     Sequence:  IAGLVLLGAVITG-AVVAAVMWRRKSSDRKGGSYTQAASSDSAQGSDVSLTACKV         
                | (signal)    | (mature peptide)
              312             325
    
    Score 4.4 at residue 22
    
     Sequence:  LLLLLSGALALTQ-TWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAAS
                | (signal)    | (mature peptide)
                9             22
    
    Score 3.5 at residue 20
    
     Sequence:  RTLLLLLSGALAL-TQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDA
                | (signal)    | (mature peptide)
                7             20
    

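    The documentation for this entry says that the signal is residues 1 through 24. The two highest-scoring predictions place the start of the mature peptide at residues 23 and 25, which corresponds to signals of length 22 and 24, so the second-best prediction matches the documentation and the best one is off by two residues. It is claimed that the method correctly predicts the cleavage site in 75 to 80% of cases.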


    Finding PEST sequences

    PEST sequences are often found in short-lived proteins. The original program, from M. Rechsteiner's laboratory, was written in BASICA; the current version is my translation of the original into ANSI C. The example is for the human proto-oncogene c-myc. Reference: Rogers S., Wells R., Rechsteiner M. (1986) Science 234:364-368. PESTFIND outputs the potential PEST regions in the order in which they occur within the sequence; they are not sorted to show the best regions first.

    $ pestfind sw:Myc_Human
    Now processing:  SW:MYC_HUMAN
    
     myc_human 439 bp.
    
    Run completed, results are in PESTFIND.OUT
    $ type pestfind.out
    Pestfind analysis of SW:MYC_HUMAN
    Processing began at 17-FEB-1999 10:22:04.57
     
    Results on:  SW:MYC_HUMAN
    ====================================================
    Potential PEST sequence 10-51 (flank_dist=40)
      RNYDLDYDSVQPYFYCDEEENFYQQQQQSELQPPAPSEDIWK
      The weight percent of PEDST is: 31.469387
      The hydrophobicity index is: 34.330616
      The PEST-FIND score is: 0.142855
    ---------------------------------
    Poor PEST sequence 52-65 (flank_dist=12)
      KFELLPTPPLSPSR
      The best PEST-FIND score is: -3.863851
    ---------------------------------
    Poor PEST sequence 83-126 (flank_dist=42)
      RGDNDGGGGSFSTADQLEMVTELLGGDMVNQSFICDPDDETFIK
      The best PEST-FIND score is: -1.662364
    ---------------------------------
    Poor PEST sequence 157-166 (flank_dist=8)
      KDSGSPNPAR
      The best PEST-FIND score is: -2.641381
    ---------------------------------
    Poor PEST sequence 168-206 (flank_dist=37)
      HSVCSTSSLYLQDLSAAASECIDPSVVFPYPLNDSSSPK
      The best PEST-FIND score is: -2.391488
    ---------------------------------
    Potential PEST sequence 206-241 (flank_dist=34)
      KSCASQDSSAFSPSSDSLLSSTESSPQGSPEPLVLH
      The weight percent of PEDST is: 52.942924
      The hydrophobicity index is: 41.022591
      The PEST-FIND score is: 8.607313
    ---------------------------------
    Potential PEST sequence 241-269 (flank_dist=27)
      HEETPPTTSSDSEEEQEDEEEIDVVSVEK
      The weight percent of PEDST is: 71.338638
      The hydrophobicity index is: 27.603380
      The PEST-FIND score is: 25.434561
    ---------------------------------
    Poor PEST sequence 276-287 (flank_dist=10)
      RSESGSPSAGGH
      The best PEST-FIND score is: -0.579045
    ---------------------------------
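
    The three scores reported for the "Potential" regions above can be recovered from the two numbers printed with them. The sketch below uses the relation score = 0.55 * (weight percent of PEDST) - 0.5 * (hydrophobicity index). The coefficients are an assumption as far as this program's source is concerned, although they do reproduce the listed values; the function name is made up for the illustration.

```python
def pest_find_score(pedst_weight_percent, hydrophobicity_index):
    """Combine composition and hydrophobicity into a single score.

    Coefficients 0.55 and 0.5 are assumed; they reproduce the output above.
    """
    return 0.55 * pedst_weight_percent - 0.5 * hydrophobicity_index


# (region, weight % PEDST, hydrophobicity index, score printed by PESTFIND)
reported = [
    ("10-51", 31.469387, 34.330616, 0.142855),
    ("206-241", 52.942924, 41.022591, 8.607313),
    ("241-269", 71.338638, 27.603380, 25.434561),
]

for region, pedst, hydrophobicity, printed in reported:
    score = pest_find_score(pedst, hydrophobicity)
    print(f"{region:>8}  computed {score:10.6f}  printed {printed:10.6f}")
```

    Read this way, a region scores well when it is rich in P, E, D, S and T by weight and poor in hydrophobic residues, which is why 241-269 stands out in the c-myc example.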
    

    Finding antigenic sequences

    Antigenic sequences are regions which are likely to elicit an immune response. You might want to predict this if you needed to make antibodies against a protein by inoculating an animal with peptides rather than whole protein.

    The EGCG Antigenic program can find such sequences. The program follows from a study which examined the sequence of about 156 peptides of 20 amino acids or less from antigenic regions, yielding about 2000 total amino acids. The reference is: Kolaskar, AS and Tongaonkar, PC (1990). FEBS Letters 276, 172-174.

    $ antigenic/infile=sw:H2b0_Human/default
    $ type h2b0_human.anti
    ANTIGENIC of SW:H2B0_HUMAN Check: 9639 from: 1 to: 125
    
    ID   H2B0_HUMAN     STANDARD;      PRT;   125 AA.
    DE   HISTONE H2B.1. . . . 
    
    Length 125 residues, score calculated from 4 to 122
    
    Report all peptides over 6 residues
    
    Found 2 hits scoring over 1.00 (true average 1.01)
    Maximum length 21 at residues 95-115
    
     Sequence:  QTAVRLLLPGELAKHAVSEGT
                |                   |
               95                   115
    
     Entries in score order, max score at "*"
    
    
    (1) Score 1.203 length 15 at residue 37-51
                     *         
     Sequence:  YSIYVYKVLKQVHPD
                |             |
               37             51
    
    (2) Score 1.162 length 21 at residue 95-115
                     *               
     Sequence:  QTAVRLLLPGELAKHAVSEGT
                |                   |
               95                   115
    
    

    This program calculates an antigenicity value which is assigned to the center residue in a sliding window of 7 amino acids. The antigenicity value is based on the frequency with which each amino acid was found in antigenic determinants of fewer than 20 amino acids. Note also that antigenic in this sense means not only that the region elicited an immune response, but probably also that it was on the surface of the protein. If the average of these values over the whole protein is 1.00 or greater, then any residue with a value >= 1.0 is potentially antigenic. If the average is < 1.0, then residues with a value greater than the average are potentially antigenic.
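
    The windowing and cutoff rule just described can be sketched in Python. This is a minimal illustration, not the EGCG implementation: the function names are hypothetical, the propensity table is supplied by the caller, and the values in the usage lines are made-up placeholders rather than the published Kolaskar and Tongaonkar figures. With a 7-residue window, a 125-residue protein gets scores for residues 4 to 122, which matches the range in the report above.

```python
def window_scores(seq, propensity, window=7):
    """Average propensity over a sliding window, assigned to the centre residue.

    Residues too close to either end to sit at the centre of a full
    window get None.
    """
    half = window // 2
    scores = [None] * len(seq)
    for centre in range(half, len(seq) - half):
        residues = seq[centre - half:centre + half + 1]
        scores[centre] = sum(propensity[r] for r in residues) / window
    return scores


def antigenic_positions(scores):
    """Apply the cutoff rule: 1.0 if the mean score is >= 1.0, else the mean."""
    scored = [s for s in scores if s is not None]
    mean = sum(scored) / len(scored)
    if mean >= 1.0:
        return [i for i, s in enumerate(scores) if s is not None and s >= 1.0]
    return [i for i, s in enumerate(scores) if s is not None and s > mean]


# Made-up two-letter propensities, purely to exercise the code.
toy = {"A": 0.8, "V": 1.4}
seq = "A" * 7 + "V" * 7 + "A" * 7
print(antigenic_positions(window_scores(seq, toy)))
```

    For this toy input the positions reported are 0-based indices 6 through 14, that is, the valine block plus a residue or so of flank on each side where the window still holds enough valines.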


    Finding coiled coil regions

    PepCoil uses the method of Lupas, van Dyke, and Stock (1991) Science 252:1162-1164, to find coiled coil regions. It both plots the probability that a region is a coiled coil and puts the positions in a file.

    $ pepcoil/infile=SW:GCN4_YEAST/default
    

    $ type GCN4_YEAST.COIL
    PEPCOIL of SW:GCN4_YEAST February 17, 1999 10:26
       using a window of 28 residues
    
    Other structures from 1 to 232 (232 residues)
       Max score: 1.283 (probability 0.21)
    
    Prediction starts at 233
    Probable coiled-coil from 233 to 281 (49 residues)
       Max score: 1.910 (probability 1.00)
    
    
    

    There is a minor bug in this program that causes it to create TWO output files with the same name but different version numbers. The lower numbered one is empty, so you can ignore it.

    We also have a variant of this which is the original Pascal program written by Lupas. It requires a file in GCG format, but will not work with a GCG database reference such as that used above.

    $ fetch sw:gcn4_yeast
    $ coils
    Y
    Gcn4_Yeast.Sw
    GCN4_YEAST.COILS
    21
    
    $ type GCN4_YEAST.COILS
    Window size is         21
    Input file was GCN4_YEAST.SW                 
    
           Residue  Frame  Score        Probability
             1 M      f    1.95242E-01  1.45801E-06
             2 S      g    1.96838E-01  1.47430E-06
             3 E      a    2.11342E-01  1.63197E-06
             4 Y      b    2.15172E-01  1.67668E-06
             5 Q      c    2.37282E-01  1.96300E-06
    long region omitted for clarity
           237 E      b    1.55936E+00  7.85872E-01
           238 A      c    1.55936E+00  7.85872E-01
           239 A      d    1.63957E+00  9.23701E-01
           240 R      e    1.67121E+00  9.51408E-01
           241 R      f    1.67121E+00  9.51408E-01
           242 S      g    1.67121E+00  9.51408E-01
           243 R      a    1.67121E+00  9.51408E-01
           244 A      b    1.68228E+00  9.58676E-01
           245 R      c    1.68228E+00  9.58676E-01
           246 K      d    1.68228E+00  9.58676E-01
           247 L      e    1.68228E+00  9.58676E-01
           248 Q      f    1.83864E+00  9.96362E-01
           249 R      g    1.87778E+00  9.98078E-01
           250 M      a    1.92558E+00  9.99129E-01
           251 K      b    1.92558E+00  9.99129E-01
           252 Q      c    1.92558E+00  9.99129E-01
           253 L      d    1.92558E+00  9.99129E-01
           254 E      e    1.92558E+00  9.99129E-01
           255 D      f    1.92558E+00  9.99129E-01
           256 K      g    1.92558E+00  9.99129E-01
           257 V      a    1.92558E+00  9.99129E-01
           258 E      b    1.92558E+00  9.99129E-01
           259 E      c    1.92558E+00  9.99129E-01
           260 L      d    1.92558E+00  9.99129E-01
           261 L      e    1.92558E+00  9.99129E-01
           262 S      f    1.92558E+00  9.99129E-01
           263 K      g    1.92558E+00  9.99129E-01
           264 N      a    1.92558E+00  9.99129E-01
           265 Y      b    1.92558E+00  9.99129E-01
           266 H      c    1.92558E+00  9.99129E-01
           267 L      d    1.92558E+00  9.99129E-01
           268 E      e    1.92558E+00  9.99129E-01
           269 N      f    1.92558E+00  9.99129E-01
           270 E      g    1.92558E+00  9.99129E-01
           271 V      a    1.89857E+00  9.98635E-01
           272 A      b    1.85112E+00  9.97028E-01
           273 R      c    1.81750E+00  9.94887E-01
           274 L      d    1.81750E+00  9.94887E-01
           275 K      e    1.73705E+00  9.81892E-01
           276 K      f    1.73705E+00  9.81892E-01
           277 L      g    1.59774E+00  8.66090E-01
           278 V      a    1.59774E+00  8.66090E-01
           279 G      b    1.41555E+00  3.21084E-01
           280 E      c    1.41555E+00  3.21084E-01
           281 R      d    1.12442E+00  1.05666E-02
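
    If you want the boundaries of the high-probability stretch without reading the table by eye, a short script can pull them out. The sketch below assumes the column layout shown above (residue number, residue, frame, score, probability). The 0.5 cutoff and the function name are arbitrary choices made for this illustration, not anything defined by the Coils program.

```python
def coiled_regions(lines, cutoff=0.5):
    """Return (start, end) residue ranges whose probability exceeds cutoff."""
    regions = []
    start = prev = None
    for line in lines:
        fields = line.split()
        # Data rows have exactly five columns and begin with a residue number.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        pos, prob = int(fields[0]), float(fields[4])
        if prob > cutoff:
            if start is None or pos != prev + 1:
                # A gap in the numbering closes the current region.
                if start is not None:
                    regions.append((start, prev))
                start = pos
            prev = pos
        elif start is not None:
            regions.append((start, prev))
            start = prev = None
    if start is not None:
        regions.append((start, prev))
    return regions


# A few rows copied from the listing above.
sample = """\
       Residue  Frame  Score        Probability
         5 Q      c    2.37282E-01  1.96300E-06
       277 L      g    1.59774E+00  8.66090E-01
       278 V      a    1.59774E+00  8.66090E-01
       279 G      b    1.41555E+00  3.21084E-01
       280 E      c    1.41555E+00  3.21084E-01
""".splitlines()

print(coiled_regions(sample))
```

    In the excerpt shown above, rows 237 through 278 all have a probability above 0.5, and rows 279 onward fall below it.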
    

    Finding helix-turn-helix regions

    The EGCG program HelixTurnHelix finds helix-turn-helix regions using the method of Dodd and Egan (1990) Nucl. Acids Res. 18:5019-5026. Here is an example with the C2 repressor protein (from P22 rather than lambda):

    $ helixturnhelix/infile=SW:RPC2_BPP22/default
    $ type RPC2_BPP22.hth
    HELIXTURNHELIX of Sw:Rpc2_Bpp22 from: 1 to: 216
    
    Using distribution mean: 238.71 and SD: 293.61
    
    Report scores beyond +2.50 standard deviations
    
    Hits above +2.50 SD (972.73)
    
    Score 2035 (+6.12 SD) in SW:RPC2_BPP22 at residue 20
    P03035 bacteriophage p22, and bacteriophage p21 (bacteriophage 21). 
    repressor protein c2. 11/97
    
     Sequence:  RQAALGKMVGVSNVAISQWERS
                |                    |
               20                    41
    $ fetch sw:RPC2_BPP22
    $ search RPC2_BPP22.sw helix,turn
    FT   HELIX         6     17
    FT   TURN         18     18
    FT   HELIX        21     28
    FT   TURN         29     29
    FT   HELIX        32     39
    FT   TURN         40     41
    FT   HELIX        47     56
    FT   TURN         57     58
    FT   HELIX        61     66
    FT   TURN         67     67
    FT   TURN         73     75
    

    Note that there is also a helix-loop-helix motif in Prosite. Similar names notwithstanding, these are not the same thing!


    Predicting secondary structure

    There is also an assortment of properties that one can measure along the protein chain, as well as various predictions of secondary structure that can be made. The measurements are reasonably reliable, but the predictions are not. Part of the reason for this is that short peptides, in lengths that are common for a stretch of alpha helix or beta sheet in a protein, don't have defined structures in solution. That is, their structure in the protein depends in great measure on sequence outside of the subsequence itself. These algorithms work on small windows of a size that corresponds to these wobbly peptides, so they can at best only give indications of what sequences are likely to be in a helix or beta sheet in a protein.

    Here is a set of secondary structure predictions, made using a variety of methods:

    
    Comparisons of predicted secondary structure with that
    observed in an actual structure (C2 fragment.)  In each
    case, the first 100 amino acids were put into the predicting
    program, but only the first 70 are compared, as the NMR
    fragment only covered that region. 
    
    stru:       Actual structure, h = helix, t = turn, as marked
                in the PDB entry 1ADR (NMR structure).
    
    nn:         H = helix, E = strand, - = no prediction
                method: neural network
                where: http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html
    
    psa:        probabilities rounded to the nearest 10, ie 7 = 70%.
                method: Type-2 discrete state-space models
                Where:  http://bmerc-www.bu.edu/psa/request.htm
                        psa-request@darwin.bu.edu
    
    SSPREDICT
     Pred SS:   Raw prediction
     Clean SS:  After filters applied to raw prediction   
                method: Algorithm based on sequence usage in 3D alignments
                Where: http://www.embl-heidelberg.de/sspred/sspred_info.html
    PHD sec:    H = helix, E = strand, - = no prediction
                method: neural network
                PhD method       : Rost & Sander, (1994) Proteins, 19, 55-72
                Where:  phd@EMBL-Heidelberg.de, or via WWW
    
    SOPM
                H = helix, S = strand, C = coil, T = turn
                method: Self Optimized prediction method (plus others)
                Gibrat method    : Gibrat et al., (1987) J.Mol.Biol. 198, 425-443
                Levin method     : Levin et al., (1986) Febs Lett. 205, 303-308
                DPM method       : Deleage & Roux, (1987) Prot. Engng. 1, 289-294
                Where:  http://www.ibcp.fr/serv_pred.html
            1         11        21        31        41        51        61
            MNTQLMGERIRARRKKLKIRQAALGKMVGVSNVAISQWERSETEPNGENLLALSKALQCSPDYLLKGDLS
    stru         hhhhhhhhhhhht  hhhhhhhht   hhhhhhhtt     hhhhhhhhhht   hhhhhht
    
    nn      --------HHHHH---HHHHHHHH--EE-----EE-H-----------HHHHHHHH-------E------
    psa
     Loop   3332222211111222211222222222333333234333446665455433333335655545444445
     Helix  6667777777777777888777776766554444544444332211223445545544211112222222
     Turn   0000001110001110000011111100011110011122221234431100111111024320134311
     Strand 1100110011110000111100110112122223321211111100011222221111111123311122
    SSPRED
     Pred   HHEEHHEH-HHHHHHHHHHHHHHHHHEHHH-EEEEEEH----------EEHHHHHHHHH-HEEEHH----
     Clean  HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH-EEEEEE-------------HHHHHHHHHHHEEE------
    
    PHD sec -------HHHHHHHHHHHHHHHHHHHH--EEEEEEEEE---------HHHHHHHHHHH---HHHE-----
    SOPM:
     Gibrat HHHHHHHHHHHHHHHHHHHHHHHHCHEEEEEEEEEEEEECCCCCCCHHHHHHHHHHHHCCCCHEEECCHC
     Levin  CSHHHHHHHHHHHCSHHHHHHHHHHHHHECCCCEECCHCCCCCCCTCCHHHHHHHHCCCCCHEEEESCSC
     DPM    CCCHHHCHHHHHHHHHHHHHHHHHCHEEEEEEEEEEHHHHCCCTTTTCCHHHHHHHHHCTCCCCTCCCCC
     SOPMA  HHHHHHHHHHHHHHHHHHHHHHHHHHHEECCHHHHHHHCCCCCCCCCHHHHHHHHHHHHCCCEEEECCCC
     Cons.  CCHHHHHHHHHHHHHHHHHHHHHHCHHEECCEEEEEHHCCCCCCCCCCHH HHHHHHHHCCCEEEECCCC
    

    Most of these methods are about 70% accurate. So if you only need to predict a single region of helix or sheet, those are the odds that you'll get it right. However, if you want to build a model of a protein that has, say, 6 of those regions, your odds of having it all correct aren't very good: 0.7^6 is about 0.12, or only a 12% probability of it being right (treating the regions as independent). And that's only the chance of it being generally correct; the odds of getting the exact region boundaries right are essentially zero.
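
    The arithmetic is easy to reproduce. The sketch below treats each region as an independent 70% call, which is the simplifying assumption behind the figure quoted above. The function name is just for illustration.

```python
def chance_all_correct(per_region_accuracy, n_regions):
    """Probability that every one of n independent region calls is right."""
    return per_region_accuracy ** n_regions


# How quickly the odds fall as the number of regions grows.
for n in (1, 3, 6, 10):
    print(n, round(chance_all_correct(0.7, n), 3))
```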

    The take-home lesson from this is that you should really only base secondary structure predictions on sequence similarity to proteins whose structures have been determined. That being said, here are some tools which try to predict secondary structures directly from the amino acid sequence.


    Remote secondary structure prediction

    You can run some of the secondary structure prediction methods described above on remote servers, using the two commands below. In both cases, the results will be mailed back to you.

    $ nnpredict
    $ em_phd
    


    PeptideStructure and PlotStructure

    The GCG package provides a pair of programs which can make a plot representing the predicted secondary structure of a protein. These programs implement the methods described in Chou and Fasman (1978), Advances in Enzymol., 47:45-148; Kyte and Doolittle (1982), J. Mol. Biol. 157:105-132, and others. For more references, see

    $ GENHELP PEPTIDESTRUCTURE DESCRIPTION
    

    This is how these two programs are used:

    $ peptidestructure/infile=SW:RPC2_BPP22
    $ plotstructure/infile=Rpc2_Bpp22.p2s -
      /begin=1/end=70/menu=1/default
    

    $ plotstructure/infile=Rpc2_Bpp22.p2s -
      /begin=1/end=70/menu=2 -
      /number=10/default
    

    In the squiggly plot, the four predicted types of motifs are:

  • Helix = sine wave
  • Beta = sharp saw tooth
  • Turn = 180 degree turn in direction
  • Coils = (random coils) dull saw tooth

    Note that every region will be classified as one of these four. Also, it is hard to tell a sharp saw tooth from a dull one unless both are present in the plot.

    This is the same region we looked at when comparing the different secondary structure predictions. As that comparison showed, the predicted structures here are largely incorrect!


    Another GCG program which predicts secondary structure

    PepPlot is fairly similar to the above. It calculates a bunch of values which supposedly are indicators of probable secondary structure.

    $ pepplot/infile=SW:RPC2_BPP22/begin=175/end=187/default
    


    Hydrophobic moment of a peptide sequence

    The hydrophobic moment is a sort of vector average of the known hydrophobicities of the amino acids, assuming a fixed rotation angle between consecutive residues. Normally one plots it over the range of rotation angles from 0 to 180 degrees. In some instances the hydrophobic residues pile up on one side at "magic" angles, such as 100 degrees for alpha helix or 180 degrees for beta. The GCG program Moment will plot this value for a protein.

    $ moment/infile=SW:RPC2_BPP22/begin=150/end=200/default
    

    The plots tend to be fairly noisy, and rarely do they give as clean a result as that shown. Also, the ends tend to be a bit blurred since the method requires the use of a window, which changes values slowly, rather than abruptly, when it slides off the end of a high scoring region.
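
    For the curious, the quantity being plotted can be sketched in a few lines of Python. This follows the usual definition of the hydrophobic moment as the length of the vector sum of per-residue hydrophobicities, each rotated by a fixed angle from its predecessor. The Kyte and Doolittle hydropathy values are used here as an assumption made for the example. Moment may well use a different scale and normalization, so don't expect the numbers to match its plots. The function names are hypothetical.

```python
import math

# Kyte and Doolittle (1982) hydropathy values.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(peptide, angle_deg, scale=KD):
    """Per-residue hydrophobic moment of a peptide at one rotation angle."""
    step = math.radians(angle_deg)
    x = sum(scale[aa] * math.cos(i * step) for i, aa in enumerate(peptide))
    y = sum(scale[aa] * math.sin(i * step) for i, aa in enumerate(peptide))
    return math.hypot(x, y) / len(peptide)


def moment_profile(peptide, angles=range(0, 181, 10)):
    """Moment at each angle, as (angle, moment) pairs."""
    return [(a, hydrophobic_moment(peptide, a)) for a in angles]


# A strictly alternating hydrophobic/charged peptide, a textbook beta pattern.
profile = moment_profile("LKLKLKLK")
print(max(profile, key=lambda pair: pair[1]))
```

    For the alternating peptide the profile peaks at 180 degrees, as you would expect when every second residue points the same way.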


    Wheel plots of a peptide sequence

    HelicalWheel is another graphical tool. Here is the same region as in the previous example, but this time "zoomed in" on the region which generates the signal:

    $ helicalwheel/infile=SW:RPC2_BPP22 -
      /begin=175/end=187 -
      /default
    

    In HelicalWheel plots, even random sequences will often seem to have a hydrophobic side and a hydrophilic side. This is because the eye can pick any division plane to split the helix, which allows considerable latitude for finding the most extreme distribution.

    This tool may also be used to visualize beta sheets, once the /ANGLE and /RADIUS have been set appropriately. The following example shows the same region to see if a beta sheet with one hydrophobic side might be present here. The plot doesn't support that guess.

    $ helicalwheel/infile=SW:RPC2_BPP22 -
      /begin=175/end=187 -
      /angle=180/radius=5 -
      /default
    
    

    If two alpha-helices twist around each other, amino acids which are located every 7 residues along the helix can line up with similarly spaced residues on the other chain. The leucine zipper has this geometry, with Leucines spaced out at intervals of 7 residues. In the absence of wrapping, an offset of 7 residues corresponds to 700 degrees, 20 degrees shy of two full turns. It is possible to take the twist angle into account, sort of, by adjusting the angle to 103 degrees. The result is a plot viewed down the twisting axis of the pair of helices. Here's one such region:

    $ helicalwheel/infile=SW:ap1_chick -
      /begin=259/end=287 -
      /angle=103/radius=5 -
      /default
    

    which is really a much better illustration of the effect than is:

    $ helicalwheel/infile=SW:ap1_chick -
      /begin=259/end=287 -
      /default
    

    The latter form is, however, appropriate if, despite the presence of the 7 residue repeat pattern, the peptide is not in a coiled coil conformation.
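
    The arithmetic behind the 103 degree setting can be checked directly. In this sketch (plain Python, not tied to HelicalWheel itself, with made-up function names) each residue is placed at a multiple of the chosen angle around the wheel, and we look at where residues spaced 7 apart end up.

```python
def wheel_angle(index, angle_per_residue):
    """Angular position (degrees, 0-360) of residue `index` on the wheel."""
    return (index * angle_per_residue) % 360


def drift_per_heptad(angle_per_residue):
    """How far residue i+7 lands from residue i, in degrees (-180 to 180)."""
    drift = (7 * angle_per_residue) % 360
    return drift - 360 if drift > 180 else drift


# 100 = plain alpha helix, 103 = the adjusted value, 720/7 = exact two turns.
for angle in (100, 103, 720 / 7):
    print(angle, drift_per_heptad(angle))
```

    At 100 degrees each successive heptad position slips back by 20 degrees, so the leucines smear around the wheel. At 103 degrees they drift by only 1 degree per heptad and stack nearly on top of one another.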

    The EGCG program PepWheel may be used to make similar plots. PepWheel gives you a bit more control over the shapes drawn around each residue than does HelicalWheel.


    Plots of peptide sequence which are related to Wheel plots

    The EGCG program PepNet is essentially a variant on the HelicalWheel program. Instead of looking down the axis, it splits the sequence with a period of 3.6 - corresponding to the period of an alpha helix - and unwraps it.

    $ pepnet/infile=SW:RPC2_BPP22 -
      /begin=175/end=187 -
      /nocircles/nodiamonds -
      /squares="ILVMGAF" -
      /default
    

    The default list of hydrophobic amino acids used by PepNet is not the same as for HelicalWheel. There is also a slight bug in this version which causes it to draw a couple of extra amino acids past where you tell it to stop.


    Statistical analysis of protein sequences (SAPS)

    SAPS is a program from Karlin's laboratory at Stanford which does an assortment of statistical analyses on a protein sequence. It *may* give you a hint as to how the protein is structured that you would not otherwise have picked up using the other tools we have available. However, SAPS is very much a last resort - something you can fall back on when your protein has no homology to any other protein and contains none of the known motifs. Reference: Brendel, V., Bucher, P., Nourbakhsh, I., Blaisdell, B.E., Karlin, S. (1992) PNAS 89:2002-2006.

    SAPS is somewhat complicated to use. For more information, see the HELP listing for SAPS:

    $ help @saps saps
    

    First, convert a GCG formatted file into SAPS format (actually EMBL format). Then run the SAPS program, telling it which organism this sequence is from. The species it knows about are:

    
      BACSU   Bacillus subtilis
      CHICK   chicken
      DROME   Drosophila melanogaster
      ECOLI   Escherichia coli
      HUMAN   human
      MOUSE   mouse
      RAT     rat
      XENLA   frog
      YEAST   Saccharomyces  cerevisiae
      swp23s  random sample of proteins  from  SWISS-PROT
              Release  23.0 (default)
    

    $ readseq -f4 rpc2_bpp22.sw -orpc2_bpp22.saps
    $ saps -s ecoli -o saps.results rpc2_bpp22.saps
    $ type saps.results
    SAPS.  Version of March 26, 1993.
    Date run: Wed Feb 17 11:37:50 1999
    
    File: rpc2_bpp22.saps
    ID   Rpc2_Bpp22
    DE   Rpc2_Bpp22, 216 bases, B36B8C41 checksum.
    
    number of residues:  216
       1  MNTQLMGERI RARRKKLKIR QAALGKMVGV SNVAISQWER SETEPNGENL LALSKALQCS 
      61  PDYLLKGDLS QTNVAYHSRH EPRGSYPLIS WVSAGQWMEA VEPYHKRAIE NWHDTTVDCS 
     121  EDSFWLDVQG DSMTAPAGLS IPEGMIILVD PEVEPRNGKL VVAKLEGENE ATFKKLVMDA 
     181  GRKFLKPLNP QYPMIEINGN CKIIGVVVDA KLANLP
    -------------------------------------------------------------------------
    COMPOSITIONAL ANALYSIS (extremes relative to: ecoli.q)
    A  :17( 7.9%); C  : 3( 1.4%); D  :10( 4.6%); E  :17( 7.9%); F- : 3( 1.4%)
    G  :15( 6.9%); H  : 4( 1.9%); I  :12( 5.6%); K  :15( 6.9%); L  :21( 9.7%)
    M  : 8( 3.7%); N  :12( 5.6%); P  :13( 6.0%); Q  : 8( 3.7%); R  :11( 5.1%)
    S  :14( 6.5%); T  : 7( 3.2%); V  :16( 7.4%); W  : 5( 2.3%); Y  : 5( 2.3%)
    
    KR      :26 ( 12.0%);   ED      :27 ( 12.5%);   AGP     :45 ( 20.8%);
    KRED    :53 ( 24.5%);   KR-ED   :-1 ( -0.5%);   FIKMNY  :55 ( 25.5%);
    LVIFM   :60 ( 27.8%);   ST      :21 (  9.7%).
    -------------------------------------------------------------------------
    CHARGE DISTRIBUTIONAL ANALYSIS
       1  0000000-+0 +0++++0+0+ 00000+0000 00000000-+ 0-0-000-00 0000+00000 
      61  0-000+0-00 00000000+0 -0+0000000 00000000-0 0-000++00- 000-000-00 
     121  --0000-000 -000000000 00-000000- 0-0-0+00+0 000+0-0-0- 000++000-0 
     181  0++00+0000 00000-0000 0+000000-0 +00000
    A. CHARGE CLUSTERS.
      Positive charge clusters (cmin = 10/30 or 14/45 or 17/60):  none
      Negative charge clusters (cmin = 10/30 or 14/45 or 17/60):  none
      Mixed charge clusters (cmin = 16/30 or 22/45 or 28/60):  none
    B. HIGH SCORING (UN)CHARGED SEGMENTS.
      There are no high scoring positive charge segments.
      There are no high scoring negative charge segments.
      There are no high scoring mixed charge segments.
      There are no high scoring uncharged segments.
    C. CHARGE RUNS AND PATTERNS.
      pattern  (+)|  (-)|  (*)|  (0)| (+0)| (-0)| (*0)|(+00)|(-00)|(*00)|
      lmin0     5 |   5 |   7 |  30 |   9 |   9 |  12 |  10 |  11 |  14 | 
      lmin1     6 |   6 |   9 |  37 |  11 |  11 |  15 |  13 |  13 |  17 | 
      lmin2     7 |   7 |  10 |  41 |  12 |  12 |  17 |  15 |  15 |  19 | 
    There are no charge runs or patterns exceeding the given minimal lengths.
    Run count statistics:
      +  runs >=   3:   1, at   13;
      -  runs >=   3:   0
      *  runs >=   5:   0
      0  runs >=  20:   0
    -------------------------------------------------------------------------
    DISTRIBUTION OF OTHER AMINO ACID TYPES
    1. HIGH SCORING SEGMENTS.
      There are no high scoring hydrophobic segments.
      There are no high scoring transmembrane segments.
    2. SPACINGS OF C.
      H2N-58-C-59-C-81-C-15-COOH
    -------------------------------------------------------------------------
    REPETITIVE STRUCTURES.
    A. SEPARATED, TANDEM, AND PERIODIC REPEATS: amino acid alphabet.
      Repeat core block length:  4
    B. SEPARATED AND TANDEM REPEATS: 11-letter reduced alphabet.
      (i= LVIF; += KR; -= ED; s= AG; o= ST; n= NQ; a= YW; p= P; h= H; m= M; c= C)
      Repeat core block length:  8
    -------------------------------------------------------------------------
    MULTIPLETS.
    A. AMINO ACID ALPHABET.
      1. Total number of amino acid multiplets:  11  (Expected range:   2-- 23)
      2. Histogram of spacings between consecutive amino acid multiplets:
         (1-5) 2   (6-10) 2   (11-20) 4   (>=21) 4
      3. Clusters of amino acid multiplets (cmin = 10/30 or 13/45 or 15/60):  none
    B. CHARGE ALPHABET.
      1. Total number of charge multiplets:   5  (Expected range:   0-- 13)
         4 +plets (f+: 12.0%), 1 -plets (f-: 12.5%)
         Total number of charge altplets: 2 (Critical number: 16)
      2. Histogram of spacings between consecutive charge multiplets:
         (1-5) 0   (6-10) 1   (11-20) 2   (>=21) 3
    -------------------------------------------------------------------------
    PERIODICITY ANALYSIS.
    A. AMINO ACID ALPHABET (core:  4; !-core: 5)
      Location      Period  Element         Copies  Core    Errors
      There are no periodicities of the prescribed length.
    B. CHARGE ALPHABET ({+= KR; -= ED; 0}; core:  5; !-core: 6)
      and HYDROPHOBICITY ALPHABET ({*=KRED; i=LVIF; 0}; core: 6; !-core:9)
      Location      Period  Element         Copies  Core    Errors
       102- 125      4       *00.           6       6       /0/1/1/./
    -------------------------------------------------------------------------
    SPACING ANALYSIS.
    Location (Quartile) Spacing     Rank       P-value   Interpretation
      25-  29  (1.)     G(   4)G    16 of  16   0.0051   large minimal spacing
    

    In this instance, SAPS didn't tell us anything very useful.
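
    The CHARGE DISTRIBUTIONAL ANALYSIS rows in the report are simply the sequence rewritten in the three-letter charge alphabet that the report itself defines (+ for K or R, - for E or D, 0 for everything else). A few lines of Python reproduce the first row. The function name is made up for this example.

```python
def charge_string(seq, block=10):
    """Rewrite a protein sequence as +, - and 0, in space-separated blocks."""
    symbols = "".join(
        "+" if aa in "KR" else "-" if aa in "ED" else "0" for aa in seq
    )
    return " ".join(symbols[i:i + block] for i in range(0, len(symbols), block))


# First 60 residues of RPC2_BPP22, copied from the SAPS report.
first_60 = "MNTQLMGERIRARRKKLKIRQAALGKMVGVSNVAISQWERSETEPNGENLLALSKALQCS"
print(charge_string(first_60))
# 0000000-+0 +0++++0+0+ 00000+0000 00000000-+ 0-0-000-00 0000+00000
```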


    Subcellular localization

    PSORT2 is a program which attempts to classify the subcellular localization of a protein. It must first be trained with a set of proteins whose subcellular locations are known. Two databases have been created: the default is for animal/yeast proteins, and PLANTC is for plants (especially dicots). The reference for PSORT2 is Horton, P. and Nakai, K. (1996) Intelligent Systems for Molecular Biology 4:109-115. (Many thanks to K. Nakai for providing us with the source code and the chloroplast fixes!)

    $ tofasta/infile=sw:H2a2_Human/out=h2a2.fsa/default
    $ psort2 h2a2.fsa
    ---------------------------------------------------------------------------
    H2A2_HUMAN      P28001 homo sapiens (human). histone h2a.2. 8/92
    
     psg: 0.50  gvh: 0.52  alm: 0.39  top: 0.53  tms: 0.00  mit: 0.44  mip: 0.08
     nuc: 0.12  erl: 0.00  erm: 0.60  pox: 0.00  px2: 0.50  vac: 0.00  rnp: 0.00
     act: 0.00  caa: 0.00  yqr: 0.00  tyr: 0.00  leu: 0.00  gpi: 0.00  myr: 0.00
     dna: 0.13  rib: 0.00  bac: 0.00  m1a: 0.00  m1b: 0.00  m2 : 0.00  mNt: 0.00
     m3a: 0.00  m3b: 0.00  m_ : 1.00  ncn: 0.88  lps: 0.00  len: 0.03  clr: 1.00
    
    
             39.1 %: nuclear
             30.4 %: cytoplasmic
             30.4 %: mitochondrial
    
    >> prediction for H2A2_HUMAN is nuc (k=23)
    $ psort2 -d PLANTC h2a2.fsa
    ---------------------------------------------------------------------------
    H2A2_HUMAN      P28001 homo sapiens (human). histone h2a.2. 8/92
    
     psg: 0.37  gvh: 0.57  alm: 0.40  top: 0.51  tms: 0.00  mit: 0.45  mip: 0.21
     nuc: 0.18  erl: 0.00  erm: 0.60  pox: 0.00  px2: 0.50  vac: 0.00  rnp: 0.00
     act: 0.00  caa: 0.00  yqr: 0.00  tyr: 0.00  leu: 0.00  gpi: 0.00  myr: 0.00
     dna: 0.33  rib: 0.00  bac: 0.00  m1a: 0.00  m1b: 0.00  m2 : 0.00  mNt: 0.00
     m3a: 0.00  m3b: 0.00  m_ : 1.00  ncn: 0.88  lps: 0.00  len: 0.05  clr: 1.00
    
    
             87.0 %: nuclear
              8.7 %: mitochondrial
              4.3 %: cytoplasmic
    
    >> prediction for H2A2_HUMAN is nuc (k=23)
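
    The percentages in these reports are consistent with simple vote counts among the k=23 neighbours named on the prediction line: 9, 7 and 7 votes in the first run, and 20, 2 and 1 in the second. The sketch below shows that bookkeeping under that reading. The vote counts are inferred from the percentages, not taken from PSORT2's internals, and the function name is hypothetical.

```python
from collections import Counter


def vote_report(neighbour_labels):
    """Percentage of neighbours per class, largest first, plus the winner."""
    counts = Counter(neighbour_labels)
    k = len(neighbour_labels)
    ranked = counts.most_common()
    percents = [(label, round(100 * n / k, 1)) for label, n in ranked]
    return percents, ranked[0][0]


# Vote counts inferred from the default-database run above (k=23).
default_run = ["nuclear"] * 9 + ["cytoplasmic"] * 7 + ["mitochondrial"] * 7
print(vote_report(default_run))
```

    Seen this way, the first prediction is a fairly narrow win (9 of 23), while the PLANTC run is close to unanimous (20 of 23), even though both report the same answer.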
    

    Next week we'll cover some phylogenetic tools. Are there any questions?