Today we're going to continue examining tools that you can use for planning and analyzing your molecular biology experiments. The focus of this lecture will be on analyzing protein sequences. The protein analysis tools deal with a lot of different properties and kinds of analyses which aren't necessarily related to each other, so the focus of this lecture will jump around a lot as we move from one kind of analysis to another. Here is an overview of what will be covered:
| BackTranslate | Translate Protein to DNA |
| PeptideSort | Digest Peptides |
| PeptideMap | Map Peptide Digestions |
| CBRG server | Search database by fragment masses |
| Isoelectric | Plot charge properties |
| SigCleave | Find signal sites |
| PESTFIND | Find PEST sequences |
| Antigenic | Find (likely) antigenic sequences |
| PepCoil | Find coiled-coil regions |
| HelixTurnHelix | Find Helix-Turn-Helix regions |
| Peptide/PlotStructure | Predict/plot secondary structure and other properties |
| PepPlot | Predict/plot secondary structure and other properties |
| Moment | Plot hydrophobic moment |
| HelicalWheel | Plot amino acids around alpha helix or beta sheet |
| PepNet | Plot amino acids around alpha helix |
| SAPS | Statistical properties of protein sequences |
| PSORT2 | Predict subcellular location |
Imagine that you have a chicken protein sequence and want to isolate a similar protein from a human cDNA library. To do so, back-translate the protein sequence to a DNA sequence and make an oligonucleotide probe. Be sure to specify a codon frequency table that is appropriate for the target organism (in this instance, Homo sapiens). The default frequency table is for E. coli highly expressed genes. Use the /infile2=xxxxx.cod qualifier to specify the desired table.
The menu options for BackTranslate are:
Note that the most ambiguous sequence may not even code for the same protein. Such a sequence uses ambiguity codes for some positions, such as N or Y, and sometimes this results in a sequence which will not forward translate back to the original amino acid. For instance, the degenerate code for leucine is YTN , because the 6 leucine codons are CTN, TTA and TTG. But YTN includes TTC and TTT, which code for phenylalanine.
leucine = CTA,CTG,CTC,CTT,TTA,TTG = {C,T}T{A,C,G,T} = YTN
YTN = {CTN, TTA, TTG} (leucine) + {TTC, TTT} (phenylalanine)
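The leucine example can be checked mechanically. The following is a minimal Python sketch, not part of GCG: it expands a degenerate triplet into every codon it matches and translates each one with the standard genetic code. The function name `expand_codon` is made up for this illustration.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def expand_codon(degenerate):
    """Return {codon: amino acid} for every codon a degenerate triplet matches."""
    choices = [IUPAC[base] for base in degenerate.upper()]
    codons = ("".join(combo) for combo in product(*choices))
    return {codon: CODON_TABLE[codon] for codon in codons}


encoded = expand_codon("YTN")
print(sorted(encoded.items()))        # eight codons
print(sorted(set(encoded.values())))  # ['F', 'L']: leucine plus phenylalanine
```

Any position where the set of encoded amino acids has more than one member is a position where the most ambiguous sequence can translate to something other than the original residue.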
In the following example, the known protein is chicken myelin protein and the back translation utilizes human coding preferences.
$ backtranslate -
/infile1=sw:mypr_chick -
/infile2=data:hum.cod -
/outfile=human.tbl -
/menu=B -
/begin=20/end=30
$ type human.tbl
BACKTRANSLATE of: : Mypr_Chick check: 278 from: 20 to: 30
ID MYPR_CHICK STANDARD; PRT; 276 AA.
AC P23289;
DT 01-NOV-1991 (REL. 20, CREATED)
DT 01-NOV-1991 (REL. 20, LAST SEQUENCE UPDATE)
DT 01-FEB-1996 (REL. 33, LAST ANNOTATION UPDATE)
DE MYELIN PROTEOLIPID PROTEIN (PLP) (LIPOPHILIN). . . .
Using codon frequencies from: Gencoredisk:[Gcgcore.Data.Rundata]Hum.Cod
CheckFile: 5653
Ala Thr Gly Leu Cys Phe Phe
GCC 0.40 ACC 0.38 GGC 0.33 CTG 0.43 TGC 0.58 TTC 0.57 TTC 0.57
GCT 0.28 ACA 0.27 GGA 0.26 CTC 0.20 TGT 0.42 TTT 0.43 TTT 0.43
GCA 0.22 ACT 0.23 GGG 0.23 TTG 0.12
GCG 0.10 ACG 0.12 GGT 0.18 CTT 0.12
CTA 0.07
TTA 0.06
22 31 47 81 62 51 36
27 - 30
Gly Val Ala Leu
GGC 0.33 GTG 0.48 GCC 0.40 CTG 0.43
GGA 0.26 GTC 0.25 GCT 0.28 CTC 0.20
GGG 0.23 GTT 0.17 GCA 0.22 TTG 0.12
GGT 0.18 GTA 0.10 GCG 0.10 CTT 0.12
CTA 0.07
TTA 0.06
27 0 0 0
Human.Tbl Length: 33 February 23, 1999 10:27 Type: N Check: 4383 ..
1 GCNACNGGNY TNTGYTTYTT YGGNGTNGCN YTN
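The number under each column is the product of the probabilities of the most likely codon for that position and for the next three positions, multiplied by 1000 and rounded so that it falls in the range 0 to 1000. Example, using the top codon in each of the first four columns above (GCC, ACC, GGC, CTG):
0.40 * 0.38 * 0.33 * 0.43 = 0.0215688 => 22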
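The whole row of numbers can be reproduced the same way. This is a small Python sketch, not GCG code; the function name `window_scores` is invented here, and filling the last three positions with 0 is an assumption based on what the table above shows when fewer than four codons remain.

```python
def window_scores(best_freqs, window=4, scale=1000):
    """Product of the top-codon frequencies over each run of `window` codons."""
    scores = []
    for i in range(len(best_freqs)):
        chunk = best_freqs[i:i + window]
        if len(chunk) < window:
            # Assumption: incomplete windows at the end are reported as 0.
            scores.append(0)
            continue
        product = 1.0
        for freq in chunk:
            product *= freq
        scores.append(round(product * scale))
    return scores


# Frequency of the most likely codon at each of the 11 positions (residues 20-30),
# read from the first row of the table above.
best = [0.40, 0.38, 0.33, 0.43, 0.58, 0.57, 0.57, 0.33, 0.48, 0.40, 0.43]
print(window_scores(best))
# [22, 31, 47, 81, 62, 51, 36, 27, 0, 0, 0]
```

High numbers mark stretches where the single most likely codon is a good bet for four residues in a row, which is where a less degenerate probe is most likely to work.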
From the codons shown you may design degenerate primers. For instance, if we wanted to start at the first position (amino acid 20) in this sequence, begin by editing out the frequencies and spaces to give:
GCCACCGGCCTGTGCTTCTTC
GCTACAGGACTCTGTTTTTTT
GCAACTGGGTTG
GCGACGGGTCTT
CTA
TTA
then for each column, leave only the unique nucleotides, to give:
GCCACCGGCCTGTGCTTCTTC
  T  A  AT C  T  T  T
  A  T  G  T
  G  G  T  A
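The column-by-column reduction can also be written as a few lines of Python. This is only a sketch of the hand procedure just described, with an invented function name (`degenerate_sequence`); it writes each column as a single IUPAC ambiguity code instead of stacking the alternatives.

```python
# IUPAC ambiguity codes, keyed by the set of bases each one represents.
_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
BASES_TO_CODE = {frozenset(bases): code for code, bases in _CODES.items()}


def degenerate_sequence(codon_lists):
    """Collapse the codon choices for each residue into one degenerate sequence."""
    letters = []
    for codons in codon_lists:
        for position in range(3):
            bases = frozenset(codon[position] for codon in codons)
            letters.append(BASES_TO_CODE[bases])
    return "".join(letters)


# Codon choices for residues 20-26, copied from the table above.
choices = [
    ["GCC", "GCT", "GCA", "GCG"],                # Ala
    ["ACC", "ACA", "ACT", "ACG"],                # Thr
    ["GGC", "GGA", "GGG", "GGT"],                # Gly
    ["CTG", "CTC", "TTG", "CTT", "CTA", "TTA"],  # Leu
    ["TGC", "TGT"],                              # Cys
    ["TTC", "TTT"],                              # Phe
    ["TTC", "TTT"],                              # Phe
]
print(degenerate_sequence(choices))
# GCNACNGGNYTNTGYTTYTTY
```

The result agrees with the first 21 bases of the most ambiguous sequence written to Human.Tbl.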
Most synthesizers can make a degenerate oligonucleotide like this. You might also examine the sequence table to see if there are any less degenerate regions. In either case, you would probably want to add some unique sequence on the 5' ends so that you could cut the ends, amplify it again, and so forth.
There is an assortment of methods for digesting proteins at sequence-specific sites. Often you will want to calculate the properties of the fragments that result from such a digestion so that you can tell which is which on a gel or in another analysis. PeptideSort is analogous to the MapSort program, but for proteins. It reports where the cuts were made and, for each fragment, the molecular weight, the HPLC retention at pH 2.1 and 7.4 (probably not accurate for fragments longer than 20 amino acids), and assorted compositional information. At the end is a summary of the composition of the full protein. Adding the qualifier /NOCut outputs only the summary of the composition of the whole protein.
$ PEPTIDESORT/infile=pir1:a1hu/enzyme=tryp/default
$ type a1hu.pepsort
PEPTIDESORT of: Pir1:A1hu check: 6072 from: 1 to: 353
P1;A1HU - Ig alpha-1 chain C region - human
C;Species: Homo sapiens (man)
C;Date: 22-May-1981 #sequence_revision 03-Oct-1995 #text_change 02-Sep-1997
C;Accession: A22360; A92249; A91662; S38979; B53110; A02171
R;Flanagan, J.G.; Lefranc, M.P.; Rabbitts, T.H.
Cell 36, 681-688, 1984 . . .
With Enzymes: TRYP
February 17, 1999 09:58 ..
Digest with: Tryp. Peptides Sorted by Position
Pos From To Mol Wt Ret2.1 Ret7.4 Chg Aro Acid Base Sulf Phil Phob
1 1 - 7 686.8 13.3 17.3 1.0 0 0 1 0 6 1
A1,K1,P2,S2,T1 Iso=9.67 Ext=0
2 8 - 51 4666.3 115.4 83.9 -2.0 4 3 1 2 21 23
A2,C2,D1,E2,F3,G4,I1,L4,N1,P4,Q4,R1,S5,T3,V6,W1 Iso=4.00 Ext=5930
3 52 - 81 3112.4 62.4 51.0 -1.0 2 2 1 1 18 12
A3,C1,D2,F1,G2,K1,L4,N1,P3,Q3,S4,T4,Y1 Iso=4.11 Ext=1400
4 82 - 88 772.9 6.3 -1.5 1.0 0 0 1 1 4 3
C1,H1,K1,S1,T1,V2 Iso=8.44 Ext=120
5 89 - 126 3967.4 99.0 75.9 0.0 1 1 1 3 31 7
C3,D1,H2,N1,P13,Q1,R1,S6,T6,V3,Y1 Iso=7.26 Ext=1640
6 127 - 131 624.7 19.1 15.5 1.0 0 0 1 0 3 2
H1,L2,R1,S1 Iso=10.53 Ext=0
7 132 - 153 2300.6 58.0 28.6 -2.0 0 3 1 1 10 12
A2,C1,D1,E2,G2,L7,N1,P1,R1,S1,T3 Iso=4.00 Ext=120
8 154 - 168 1540.6 36.1 40.4 0.0 2 1 1 0 9 6
A1,D1,F1,G2,K1,P1,S3,T3,V1,W1 Iso=6.31 Ext=5690
9 169 - 177 940.0 7.0 -4.9 0.0 0 1 1 0 6 3
A1,E1,G1,P2,Q1,R1,S1,V1 Iso=6.44 Ext=0
10 178 - 200 2422.7 41.8 10.0 -1.0 2 2 1 3 10 13
A1,C3,D1,E1,G3,H1,K1,L2,N1,P2,S3,V2,W1,Y1 Iso=5.48 Ext=7330
11 201 - 212 1318.4 24.3 11.5 0.0 2 1 1 1 7 5
A2,C1,E1,F1,K1,P1,S1,T3,Y1 Iso=6.22 Ext=1400
12 213 - 221 931.1 32.0 33.0 1.0 0 0 1 0 6 3
A1,K1,L2,P1,S1,T3 Iso=9.67 Ext=0
13 222 - 227 680.7 11.6 16.5 1.0 1 0 1 0 4 2
F1,G1,N1,R1,S1,T1 Iso=10.53 Ext=0
14 228 - 253 2855.3 76.7 20.5 -3.0 0 4 1 1 14 12
A2,C1,E4,H1,L7,N1,P4,R1,S1,T2,V2 Iso=4.31 Ext=120
15 254 - 258 534.6 21.0 20.0 1.0 1 0 1 0 3 2
F1,G1,K1,P1,S1 Iso=9.67 Ext=0
16 259 - 263 600.7 15.8 6.2 0.0 0 1 1 0 2 3
D1,L1,R1,V2 Iso=6.31 Ext=0
17 264 - 273 1213.4 31.4 13.5 0.0 1 1 1 0 6 4
E1,G1,L2,P1,Q2,R1,S1,W1 Iso=6.44 Ext=5690
18 274 - 275 275.3 -4.2 -17.4 0.0 0 1 1 0 2 0
E1,K1 Iso=6.44 Ext=0
19 276 - 282 896.0 36.0 34.4 1.0 2 0 1 0 3 4
A1,L1,R1,S1,T1,W1,Y1 Iso=9.75 Ext=6970
20 283 - 299 1836.0 34.5 32.1 0.0 1 1 1 0 11 6
A1,E1,F1,G1,I1,L1,P1,Q2,R1,S2,T4,V1 Iso=6.44 Ext=0
21 300 - 306 817.9 14.2 -7.0 -1.0 1 2 1 0 3 4
A2,D1,E1,K1,V1,W1 Iso=4.24 Ext=5690
22 307 - 307 146.2 3.3 -0.5 1.0 0 0 1 0 1 0
K1 Iso=9.67 Ext=0
23 308 - 327 2152.5 53.4 24.5 -1.0 2 2 1 2 9 11
A2,C1,D1,E1,F2,G2,H1,K1,L2,M1,P1,Q1,S1,T2,V1 Iso=5.49 Ext=120
24 328 - 331 503.5 12.5 8.6 0.0 0 1 1 0 3 1
D1,I1,R1,T1 Iso=6.31 Ext=0
25 332 - 335 387.5 12.7 8.8 1.0 0 0 1 0 1 3
A1,G1,K1,L1 Iso=9.67 Ext=0
26 336 - 353 1921.2 31.7 0.8 -2.0 1 2 0 2 8 10
A1,C1,D1,E1,G1,H1,M1,N1,P1,S1,T2,V5,Y1 Iso=4.23 Ext=1400
Digest with: Tryp. Peptides Sorted by Weight
Pos From To Mol Wt Ret2.1 Ret7.4 Chg Aro Acid Base Sulf Phil Phob
22 307 - 307 146.2 3.3 -0.5 1.0 0 0 1 0 1 0
18 274 - 275 275.3 -4.2 -17.4 0.0 0 1 1 0 2 0
25 332 - 335 387.5 12.7 8.8 1.0 0 0 1 0 1 3
24 328 - 331 503.5 12.5 8.6 0.0 0 1 1 0 3 1
15 254 - 258 534.6 21.0 20.0 1.0 1 0 1 0 3 2
16 259 - 263 600.7 15.8 6.2 0.0 0 1 1 0 2 3
6 127 - 131 624.7 19.1 15.5 1.0 0 0 1 0 3 2
13 222 - 227 680.7 11.6 16.5 1.0 1 0 1 0 4 2
1 1 - 7 686.8 13.3 17.3 1.0 0 0 1 0 6 1
4 82 - 88 772.9 6.3 -1.5 1.0 0 0 1 1 4 3
21 300 - 306 817.9 14.2 -7.0 -1.0 1 2 1 0 3 4
19 276 - 282 896.0 36.0 34.4 1.0 2 0 1 0 3 4
12 213 - 221 931.1 32.0 33.0 1.0 0 0 1 0 6 3
9 169 - 177 940.0 7.0 -4.9 0.0 0 1 1 0 6 3
17 264 - 273 1213.4 31.4 13.5 0.0 1 1 1 0 6 4
11 201 - 212 1318.4 24.3 11.5 0.0 2 1 1 1 7 5
8 154 - 168 1540.6 36.1 40.4 0.0 2 1 1 0 9 6
20 283 - 299 1836.0 34.5 32.1 0.0 1 1 1 0 11 6
26 336 - 353 1921.2 31.7 0.8 -2.0 1 2 0 2 8 10
23 308 - 327 2152.5 53.4 24.5 -1.0 2 2 1 2 9 11
7 132 - 153 2300.6 58.0 28.6 -2.0 0 3 1 1 10 12
10 178 - 200 2422.7 41.8 10.0 -1.0 2 2 1 3 10 13
14 228 - 253 2855.3 76.7 20.5 -3.0 0 4 1 1 14 12
3 52 - 81 3112.4 62.4 51.0 -1.0 2 2 1 1 18 12
5 89 - 126 3967.4 99.0 75.9 0.0 1 1 1 3 31 7
2 8 - 51 4666.3 115.4 83.9 -2.0 4 3 1 2 21 23
Digest with: Tryp. Peptides Sorted by Retention
Pos From To Mol Wt Ret2.1 Ret7.4 Chg Aro Acid Base Sulf Phil Phob
18 274 - 275 275.3 -4.2 -17.4 0.0 0 1 1 0 2 0
22 307 - 307 146.2 3.3 -0.5 1.0 0 0 1 0 1 0
4 82 - 88 772.9 6.3 -1.5 1.0 0 0 1 1 4 3
9 169 - 177 940.0 7.0 -4.9 0.0 0 1 1 0 6 3
13 222 - 227 680.7 11.6 16.5 1.0 1 0 1 0 4 2
24 328 - 331 503.5 12.5 8.6 0.0 0 1 1 0 3 1
25 332 - 335 387.5 12.7 8.8 1.0 0 0 1 0 1 3
1 1 - 7 686.8 13.3 17.3 1.0 0 0 1 0 6 1
21 300 - 306 817.9 14.2 -7.0 -1.0 1 2 1 0 3 4
16 259 - 263 600.7 15.8 6.2 0.0 0 1 1 0 2 3
6 127 - 131 624.7 19.1 15.5 1.0 0 0 1 0 3 2
15 254 - 258 534.6 21.0 20.0 1.0 1 0 1 0 3 2
11 201 - 212 1318.4 24.3 11.5 0.0 2 1 1 1 7 5
17 264 - 273 1213.4 31.4 13.5 0.0 1 1 1 0 6 4
26 336 - 353 1921.2 31.7 0.8 -2.0 1 2 0 2 8 10
12 213 - 221 931.1 32.0 33.0 1.0 0 0 1 0 6 3
20 283 - 299 1836.0 34.5 32.1 0.0 1 1 1 0 11 6
19 276 - 282 896.0 36.0 34.4 1.0 2 0 1 0 3 4
8 154 - 168 1540.6 36.1 40.4 0.0 2 1 1 0 9 6
10 178 - 200 2422.7 41.8 10.0 -1.0 2 2 1 3 10 13
23 308 - 327 2152.5 53.4 24.5 -1.0 2 2 1 2 9 11
7 132 - 153 2300.6 58.0 28.6 -2.0 0 3 1 1 10 12
3 52 - 81 3112.4 62.4 51.0 -1.0 2 2 1 1 18 12
14 228 - 253 2855.3 76.7 20.5 -3.0 0 4 1 1 14 12
5 89 - 126 3967.4 99.0 75.9 0.0 1 1 1 3 31 7
2 8 - 51 4666.3 115.4 83.9 -2.0 4 3 1 2 21 23
Summary for whole sequence:
Molecular weight = 37654.29 Residues = 353
Average Residue Weight = 106.669 Charged = -4
Isoelectric point = 6.48
Extinction coefficient = 42720
Residue Number Mole Percent ..
A = Ala 24 6.799
B = Asx 0 0.000
C = Cys 15 4.249
D = Asp 12 3.399
E = Glu 17 4.816
F = Phe 11 3.116
G = Gly 22 6.232
H = His 8 2.266
I = Ile 3 0.850
K = Lys 13 3.683
L = Leu 36 10.198
M = Met 2 0.567
N = Asn 8 2.266
P = Pro 39 11.048
Q = Gln 14 3.966
R = Arg 12 3.399
S = Ser 38 10.765
T = Thr 40 11.331
V = Val 27 7.649
W = Trp 6 1.700
Y = Tyr 6 1.700
Z = Glx 0 0.000
A + G 46 13.031
S + T 78 22.096
D + E 29 8.215
D + E + N + Q 51 14.448
H + K + R 33 9.348
D + E + H + K + R 62 17.564
I + L + M + V 68 19.263
F + W + Y 23 6.516
Enzymes that do cut:
Tryp
Enzymes that do not cut:
NONE
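The fragment boundaries in this report are consistent with a very simple rule: cut after every lysine or arginine. Here is a minimal Python sketch of that rule (the function name `digest_after` is made up, and this is not how PeptideSort itself is implemented). Note that the rule cuts even when the next residue is proline, which matches the report above, where fragment 6 ends at R131 and fragment 7 begins at P132.

```python
def digest_after(sequence, residues="KR"):
    """Cut after every residue in `residues`; return (from, to, peptide), 1-based."""
    fragments = []
    start = 0
    for index, amino_acid in enumerate(sequence):
        if amino_acid in residues:
            fragments.append((start + 1, index + 1, sequence[start:index + 1]))
            start = index + 1
    if start < len(sequence):
        # Whatever is left after the last cut is the C-terminal fragment.
        fragments.append((start + 1, len(sequence), sequence[start:]))
    return fragments


# First 60 residues of PIR1:A1HU, as shown in the PeptideMap output in these notes.
first_60 = "ASPTSPKVFPLSLCSTQPDGNVVIACLVQGFFPQEPLSVTWSESGQGVTARNFPPSQDAS"
for begin, end, peptide in digest_after(first_60):
    print(begin, end, peptide)
```

The first two fragments come out as 1-7 and 8-51, as in the report. The third stops at 60 only because the input was truncated there.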
PeptideMap is analogous to Map. Use it to make detailed maps of the locations in the protein sequence which are cut by site specific enzymes.
$ PEPTIDEMAP/infile=pir1:a1hu/enzyme=tryp/default
$ type a1hu.map
(Linear) (Peptide) MAP of: A1hu check: 6072 from: 1 to: 353
With 2 enzymes: TRYP
February 17, 1999 10:04 ..
T T
r r
y y
p p
ASPTSPKVFPLSLCSTQPDGNVVIACLVQGFFPQEPLSVTWSESGQGVTARNFPPSQDAS
1 ---------+---------+---------+---------+---------+---------+ 60
T T
r r
y y
p p
GDLYTTSSQLTLPATQCLAGKSVTCHVKHYTNPSQDVTVPCPVPSTPPTPSPSTPPTPSP
61 ---------+---------+---------+---------+---------+---------+ 120
T T T T T
r r r r r
y y y y y
p p p p p
SCCHPRLSLHRPALEDLLLGSEANLTCTLTGLRDASGVTFTWTPSSGKSAVQGPPERDLC
121 ---------+---------+---------+---------+---------+---------+ 180
T T T T
r r r r
y y y y
p p p p
GCYSVSSVLPGCAEPWNHGKTFTCTAAYPESKTPLTATLSKSGNTFRPEVHLLPPPSEEL
181 ---------+---------+---------+---------+---------+---------+ 240
T T T T T T T
r r r r r r r
y y y y y y y
p p p p p p p
ALNELVTLTCLARGFSPKDVLVRWLQGSQELPREKYLTWASRQEPSQGTTTFAVTSILRV
241 ---------+---------+---------+---------+---------+---------+ 300
TT T T T
rr r r r
yy y y y
pp p p p
AAEDWKKGDTFSCMVGHEALPLAFTQKTIDRLAGKPTHVNVSVVMAEVDGTCY
301 ---------+---------+---------+---------+---------+--- 353
Enzymes that do cut:
Tryp
Enzymes that do not cut:
NONE
There is a specialized type of database search available for those cases where you have either digest information or the total mass of a protein, and want to find out whether it is likely to be identical to a protein in the SWISS-PROT database. The key word here is "identical": it doesn't take many amino acid substitutions, and especially not many indels, to shift the molecular weights far enough that you won't find similar proteins. Also, the molecular weights must be determined on peptides that do not contain mass-altering modifications such as glycosylation. This search is provided by the CBRG (Computational Biochemistry Research Group, in Zurich), with results returned via e-mail. It may take several days for the results to be mailed back to you. Note that the masses in the following example were calculated with PeptideSort, which assumes a protonated carboxyl group; all of them were increased by 1.0 because the CBRG server converts all fragments to the unprotonated carboxyl form (removes the proton).
$ cbrg
7 for TotalMass
37655 PIR1:A1HU mass
or
$ cbrg
4 MassSearch, with fragments
22 Trypsin, there are many other options
601. one fragment is .600 kD
897. one fragment is .896 kD
1837. one fragment is 1.836 kD
2301. one fragment is 2.3 kD
blank line - terminate fragment list
blank line - send query
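The numbers typed into the server above appear to be the PeptideSort masses plus 1.0, with the fraction dropped. A tiny Python sketch of that conversion follows; the truncation is inferred from this one example rather than from any server documentation, and the function name `to_cbrg_mass` is hypothetical.

```python
def to_cbrg_mass(peptidesort_mass, offset=1.0):
    """Add the 1.0 offset and drop the fraction (assumed from the example above)."""
    return int(peptidesort_mass + offset)


# Masses of fragments 16, 19, 20 and 7 from the PeptideSort report.
peptidesort_masses = [600.7, 896.0, 1836.0, 2300.6]
print([to_cbrg_mass(mass) for mass in peptidesort_masses])
# [601, 897, 1837, 2301]
```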
I won't show you the results of these searches - they just consist of a list of proteins, sorted by fit to the data provided, so that the best ones are first in the list.
Use the GCG program Isoelectric to plot a protein's charge at various pHs.
$ tektronix versaterm term
$ isoelectric/infile=pir1:a1hu/outfile=a1hu.iso/default
Adjust the vertical scale with /MINCharge=xxx/MAXCharge=yyy. The "+" and "-" in the graph refer to the net charge on the positively and negatively charged residues.
The output file is optional. Here's what is in it:
$ type a1hu.iso
ISOELECTRIC of: Pir1:A1hu Check: 6072 from: 1 to: 353 February 17, 1999 10:17
P1;A1HU - Ig alpha-1 chain C region - human
Amino Acid Number of
Residues
----------------- -----------
Arginine 12
Lysine 13
Histidine 8
Tyrosine 6
Cysteine 15
Glutamic Acid 17
Aspartic Acid 12
Amino Terminus 1
Carboxyl Terminus 1
Number of Hydrogen Ions Bound
---------------------------------------------------------- Net
pH Arg Lys His Tyr Cys Glu Asp NH2 COOH Total Charge ..
1.00 12.00 13.00 8.00 6.00 15.00 16.99 11.99 1.00 1.00 84.97 33.97
1.50 12.00 13.00 8.00 6.00 15.00 16.97 11.95 1.00 0.99 84.91 33.91
2.00 12.00 13.00 8.00 6.00 15.00 16.90 11.85 1.00 0.97 84.73 33.73
2.50 12.00 13.00 8.00 6.00 15.00 16.70 11.55 1.00 0.92 84.17 33.17
3.00 12.00 13.00 8.00 6.00 15.00 16.09 10.69 1.00 0.78 82.56 31.56
3.50 12.00 13.00 7.99 6.00 15.00 14.43 8.64 1.00 0.53 78.60 27.60
4.00 12.00 13.00 7.97 6.00 15.00 10.88 5.38 1.00 0.27 71.50 20.50
4.50 12.00 13.00 7.92 6.00 15.00 6.12 2.45 1.00 0.10 63.59 12.59
5.00 12.00 13.00 7.75 6.00 14.99 2.57 0.90 1.00 0.04 58.25 7.25
5.50 12.00 13.00 7.27 6.00 14.98 0.91 0.30 1.00 0.01 55.47 4.47
6.00 12.00 13.00 6.08 6.00 14.93 0.30 0.10 1.00 0.00 53.40 2.40
6.50 12.00 13.00 4.00 6.00 14.77 0.10 0.03 0.99 0.00 50.88 -0.12
7.00 12.00 13.00 1.92 6.00 14.28 0.03 0.01 0.97 0.00 48.22 -2.78
7.50 12.00 12.99 0.73 6.00 12.95 0.01 0.00 0.92 0.00 45.60 -5.40
8.00 12.00 12.98 0.25 5.99 9.99 0.00 0.00 0.78 0.00 42.00 -9.00
8.50 12.00 12.93 0.08 5.98 5.80 0.00 0.00 0.53 0.00 37.33 -13.67
9.00 12.00 12.79 0.03 5.93 2.50 0.00 0.00 0.27 0.00 33.51 -17.49
9.50 11.99 12.37 0.01 5.79 0.89 0.00 0.00 0.10 0.00 31.15 -19.85
10.00 11.96 11.19 0.00 5.39 0.29 0.00 0.00 0.04 0.00 28.87 -22.13
10.50 11.88 8.59 0.00 4.43 0.09 0.00 0.00 0.01 0.00 25.01 -25.99
11.00 11.63 4.96 0.00 2.83 0.03 0.00 0.00 0.00 0.00 19.45 -31.55
11.50 10.91 2.12 0.00 1.32 0.01 0.00 0.00 0.00 0.00 14.36 -36.64
12.00 9.12 0.76 0.00 0.49 0.00 0.00 0.00 0.00 0.00 10.37 -40.63
12.50 6.00 0.25 0.00 0.16 0.00 0.00 0.00 0.00 0.00 6.41 -44.59
13.00 2.88 0.08 0.00 0.05 0.00 0.00 0.00 0.00 0.00 3.02 -47.98
Isoelectric Point
6.48 12.00 13.00 4.10 6.00 14.78 0.10 0.03 0.99 0.00 51.00 0.00
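The table can be reproduced with the Henderson-Hasselbalch relation: a group with a given pKa holds a proton a fraction 1 / (1 + 10^(pH - pKa)) of the time. The Python sketch below is not the Isoelectric source code. The residue counts are those in the report, but the pKa values are assumptions that I back-calculated from the columns of the table, so treat them as approximate.

```python
# Number of each titratable group in PIR1:A1HU, from the report above.
COUNTS = {"Arg": 12, "Lys": 13, "His": 8, "Tyr": 6, "Cys": 15,
          "Glu": 17, "Asp": 12, "NH2": 1, "COOH": 1}

# Assumed pKa values, estimated from the table above (not taken from GCG).
PKA = {"Arg": 12.5, "Lys": 10.79, "His": 6.5, "Tyr": 10.95, "Cys": 8.3,
       "Glu": 4.25, "Asp": 3.91, "NH2": 8.56, "COOH": 3.56}

# Groups that are neutral when protonated and negative when not.
ACIDIC = ("Tyr", "Cys", "Glu", "Asp", "COOH")


def bound_protons(ph):
    """Total number of hydrogen ions bound at this pH (the 'Total' column)."""
    return sum(n / (1.0 + 10.0 ** (ph - PKA[group])) for group, n in COUNTS.items())


def net_charge(ph):
    """Net charge: bound protons minus the number of acidic groups."""
    return bound_protons(ph) - sum(COUNTS[group] for group in ACIDIC)


def isoelectric_point(low=0.0, high=14.0, tolerance=1e-4):
    """Bisect for the pH where the net charge crosses zero."""
    while high - low > tolerance:
        middle = (low + high) / 2.0
        if net_charge(middle) > 0:
            low = middle
        else:
            high = middle
    return (low + high) / 2.0


for ph in (1.0, 7.0, 13.0):
    print(f"pH {ph:5.2f}  bound {bound_protons(ph):6.2f}  net {net_charge(ph):7.2f}")
print(f"isoelectric point {isoelectric_point():.2f}")
```

With these pKa estimates the sketch gives values close to the table (for example a net charge near -2.78 at pH 7) and an isoelectric point near 6.48.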
In lecture two we discussed how to find PROSITE motifs in protein sequences using the Motifs program. There are a few other sorts of motifs that you may need to search for, but which are difficult or impossible to describe as a simple pattern.
To find signal sequences use the EGCG SIGCLEAVE program. This is based on the work of von Heijne (1986) Nucleic Acids Res. 14:4683-4690.
$ sigcleave/infile=sw:1a01_human/default
$ type 1a01_human.sig
SIGCLEAVE of SW:1A01_HUMAN Check: 1951 from: 1 to: 365
ID 1A01_HUMAN STANDARD; PRT; 365 AA.
DE HLA CLASS I HISTOCOMPATIBILITY ANTIGEN, A-1 ALPHA CHAIN PRECURSOR.
Report scores over 3.50
Maximum score 10.5 at residue 23
Sequence: LLLLSGALALTQT-WAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQ
| (signal) | (mature peptide)
10 23
Other entries above 3.50
Score 9.7 at residue 25
Sequence: LLSGALALTQTWA-GSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQKM
| (signal) | (mature peptide)
12 25
Score 9.0 at residue 19
Sequence: PRTLLLLLSGALA-LTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSD
| (signal) | (mature peptide)
6 19
Score 9.0 at residue 21
Sequence: TLLLLLSGALALT-QTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAA
| (signal) | (mature peptide)
8 21
Score 5.7 at residue 29
Sequence: ALALTQTWAGSHS-MRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQKMEPRA
| (signal) | (mature peptide)
16 29
Score 5.7 at residue 329
Sequence: VLLGAVITGAVVA-AVMWRRKSSDRKGGSYTQAASSDSAQGSDVSLTACKV
| (signal) | (mature peptide)
316 329
Score 5.5 at residue 324
Sequence: IIAGLVLLGAVIT-GAVVAAVMWRRKSSDRKGGSYTQAASSDSAQGSDVSLTACKV
| (signal) | (mature peptide)
311 324
Score 5.0 at residue 325
Sequence: IAGLVLLGAVITG-AVVAAVMWRRKSSDRKGGSYTQAASSDSAQGSDVSLTACKV
| (signal) | (mature peptide)
312 325
Score 4.4 at residue 22
Sequence: LLLLLSGALALTQ-TWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAAS
| (signal) | (mature peptide)
9 22
Score 3.5 at residue 20
Sequence: RTLLLLLSGALAL-TQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDA
| (signal) | (mature peptide)
7 20
The documentation for this entry says that the signal is 1 through 24. The two highest scoring predictions by the program are for signals of length 23 and 25, which is pretty close. It is claimed that the method correctly predicts the cleavage site in 75 to 80% of the cases.
PEST sequences are often found in short-lived proteins. The original program, from M. Rechsteiner's laboratory, was written in BASICA; the current version is my translation of the original into ANSI C. The example is for the human proto-oncogene c-myc. Reference: Rogers S., Wells R., Rechsteiner M. (1986) Science 234:364-368. PESTFIND outputs the potential PEST regions in the order in which they occur within the sequence; they are not sorted to show the best regions first.
$ pestfind sw:Myc_Human
Now processing: SW:MYC_HUMAN myc_human 439 bp.
Run completed, results are in PESTFIND.OUT
$ type pestfind.out
Pestfind analysis of SW:MYC_HUMAN
Processing began at 17-FEB-1999 10:22:04.57
Results on: SW:MYC_HUMAN
====================================================
Potential PEST sequence 10-51 (flank_dist=40)
RNYDLDYDSVQPYFYCDEEENFYQQQQQSELQPPAPSEDIWK
The weight percent of PEDST is: 31.469387
The hydrophobicity index is: 34.330616
The PEST-FIND score is: 0.142855
---------------------------------
Poor PEST sequence 52-65 (flank_dist=12)
KFELLPTPPLSPSR
The best PEST-FIND score is: -3.863851
---------------------------------
Poor PEST sequence 83-126 (flank_dist=42)
RGDNDGGGGSFSTADQLEMVTELLGGDMVNQSFICDPDDETFIK
The best PEST-FIND score is: -1.662364
---------------------------------
Poor PEST sequence 157-166 (flank_dist=8)
KDSGSPNPAR
The best PEST-FIND score is: -2.641381
---------------------------------
Poor PEST sequence 168-206 (flank_dist=37)
HSVCSTSSLYLQDLSAAASECIDPSVVFPYPLNDSSSPK
The best PEST-FIND score is: -2.391488
---------------------------------
Potential PEST sequence 206-241 (flank_dist=34)
KSCASQDSSAFSPSSDSLLSSTESSPQGSPEPLVLH
The weight percent of PEDST is: 52.942924
The hydrophobicity index is: 41.022591
The PEST-FIND score is: 8.607313
---------------------------------
Potential PEST sequence 241-269 (flank_dist=27)
HEETPPTTSSDSEEEQEDEEEIDVVSVEK
The weight percent of PEDST is: 71.338638
The hydrophobicity index is: 27.603380
The PEST-FIND score is: 25.434561
---------------------------------
Poor PEST sequence 276-287 (flank_dist=10)
RSESGSPSAGGH
The best PEST-FIND score is: -0.579045
---------------------------------
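The three "Potential PEST sequence" entries above are all consistent with the score being 0.55 times the weight percent of PEDST minus 0.5 times the hydrophobicity index. The short Python sketch below simply checks that relationship against the reported numbers; the function name `pest_find_score` is invented, and how the two inputs are themselves computed is not shown here.

```python
def pest_find_score(pedst_weight_percent, hydrophobicity_index):
    """Combine the two reported quantities into a PEST-FIND score."""
    return 0.55 * pedst_weight_percent - 0.5 * hydrophobicity_index


# (region, weight percent of PEDST, hydrophobicity index, reported score)
reported = [
    ("10-51", 31.469387, 34.330616, 0.142855),
    ("206-241", 52.942924, 41.022591, 8.607313),
    ("241-269", 71.338638, 27.603380, 25.434561),
]
for region, percent, hydrophobicity, score in reported:
    computed = pest_find_score(percent, hydrophobicity)
    print(f"{region:8s} computed {computed:10.6f}  reported {score:10.6f}")
```

A region scores well when it is rich in P, E, D, S and T by weight and is not very hydrophobic, which is why 241-269 stands out.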
Antigenic sequences are regions which are likely to elicit an immune response. You might want to predict this if you needed to make antibodies against a protein by inoculating an animal with peptides rather than whole protein.
The EGCG Antigenic program can find such sequences. The program follows from a study which examined the sequence of about 156 peptides of 20 amino acids or less from antigenic regions, yielding about 2000 total amino acids. The reference is: Kolaskar, AS and Tongaonkar, PC (1990). FEBS Letters 276, 172-174.
$ antigenic/infile=sw:H2b0_Human/default
$ type h2b0_human.anti
ANTIGENIC of SW:H2B0_HUMAN Check: 9639 from: 1 to: 125
ID H2B0_HUMAN STANDARD; PRT; 125 AA.
DE HISTONE H2B.1. . . .
Length 125 residues, score calculated from 4 to 122
Report all peptides over 6 residues
Found 2 hits scoring over 1.00 (true average 1.01)
Maximum length 21 at residues 95-115
Sequence: QTAVRLLLPGELAKHAVSEGT
| |
95 115
Entries in score order, max score at "*"
(1) Score 1.203 length 15 at residue 37-51
*
Sequence: YSIYVYKVLKQVHPD
| |
37 51
(2) Score 1.162 length 21 at residue 95-115
*
Sequence: QTAVRLLLPGELAKHAVSEGT
| |
95 115
This program calculates an antigenicity value which is assigned to the center residue in a sliding window of 7 amino acids. The antigenicity value is based on the frequency with which each amino acid was found in an antigenic determinant of less than 20 amino acids. Note also that antigenic in this sense means not only that it elicited an immune response, but probably that it was on the surface of the protein as well. If the average of these measurements over the whole protein is 1.00 or greater, then any residues having a value >= 1.0 are potentially antigenic. If the average is < 1.0, then residues having a value greater than the average are potentially antigenic.
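The windowing and threshold rule just described can be sketched in Python as follows. This is an illustration, not the EGCG source: the function names are made up, and the per-residue propensity values are supplied by the caller because the actual Kolaskar and Tongaonkar table is not reproduced in these notes. The numbers in the demonstration are hypothetical.

```python
def window_averages(propensities, window=7):
    """Average over a sliding window, keyed by the 1-based center residue."""
    half = window // 2
    scores = {}
    for center in range(half, len(propensities) - half):
        chunk = propensities[center - half:center + half + 1]
        scores[center + 1] = sum(chunk) / window
    return scores


def antigenic_positions(propensities, window=7):
    """Apply the threshold rule: >= 1.0 if the mean is >= 1.0, else > the mean."""
    scores = window_averages(propensities, window)
    mean = sum(scores.values()) / len(scores)
    if mean >= 1.0:
        hits = [pos for pos, score in scores.items() if score >= 1.0]
    else:
        hits = [pos for pos, score in scores.items() if score > mean]
    return mean, hits


# Hypothetical propensities for a 14 residue peptide (not real table values).
demo = [0.5] * 7 + [0.9] * 7
mean, hits = antigenic_positions(demo)
print(round(mean, 3), hits)
```

With a window of 7, the first and last three residues never sit at the center of a full window, which is why the report for the 125 residue histone says the score was calculated from 4 to 122.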
PepCoil uses the method of Lupas, van Dyke, and Stock (1991) Science 252:1162-1164, to find coiled coil regions. It both plots the probability that a region is a coiled coil and puts the positions in a file.
$ pepcoil/infile=SW:GCN4_YEAST/default
$ type GCN4_YEAST.COIL
PEPCOIL of SW:GCN4_YEAST February 17, 1999 10:26
using a window of 28 residues
Other structures from 1 to 232 (232 residues)
Max score: 1.283 (probability 0.21)
Prediction starts at 233
Probable coiled-coil from 233 to 281 (49 residues)
Max score: 1.910 (probability 1.00)
There is a minor bug in this program that causes it to create TWO output files with the same name but different version numbers. The lower numbered one is empty, so you can ignore it.
We also have a variant of this which is the original Pascal program written by Lupas. It requires a file in GCG format, but will not work with a GCG database reference such as that used above.
$ fetch sw:gcn4_yeast
$ coils
Y
Gcn4_Yeast.Sw
GCN4_YEAST.COILS
21
$ type GCN4_YEAST.COILS
Window size is 21
Input file was GCN4_YEAST.SW
Residue Frame Score Probability
1 M f 1.95242E-01 1.45801E-06
2 S g 1.96838E-01 1.47430E-06
3 E a 2.11342E-01 1.63197E-06
4 Y b 2.15172E-01 1.67668E-06
5 Q c 2.37282E-01 1.96300E-06
long region omitted for clarity
237 E b 1.55936E+00 7.85872E-01
238 A c 1.55936E+00 7.85872E-01
239 A d 1.63957E+00 9.23701E-01
240 R e 1.67121E+00 9.51408E-01
241 R f 1.67121E+00 9.51408E-01
242 S g 1.67121E+00 9.51408E-01
243 R a 1.67121E+00 9.51408E-01
244 A b 1.68228E+00 9.58676E-01
245 R c 1.68228E+00 9.58676E-01
246 K d 1.68228E+00 9.58676E-01
247 L e 1.68228E+00 9.58676E-01
248 Q f 1.83864E+00 9.96362E-01
249 R g 1.87778E+00 9.98078E-01
250 M a 1.92558E+00 9.99129E-01
251 K b 1.92558E+00 9.99129E-01
252 Q c 1.92558E+00 9.99129E-01
253 L d 1.92558E+00 9.99129E-01
254 E e 1.92558E+00 9.99129E-01
255 D f 1.92558E+00 9.99129E-01
256 K g 1.92558E+00 9.99129E-01
257 V a 1.92558E+00 9.99129E-01
258 E b 1.92558E+00 9.99129E-01
259 E c 1.92558E+00 9.99129E-01
260 L d 1.92558E+00 9.99129E-01
261 L e 1.92558E+00 9.99129E-01
262 S f 1.92558E+00 9.99129E-01
263 K g 1.92558E+00 9.99129E-01
264 N a 1.92558E+00 9.99129E-01
265 Y b 1.92558E+00 9.99129E-01
266 H c 1.92558E+00 9.99129E-01
267 L d 1.92558E+00 9.99129E-01
268 E e 1.92558E+00 9.99129E-01
269 N f 1.92558E+00 9.99129E-01
270 E g 1.92558E+00 9.99129E-01
271 V a 1.89857E+00 9.98635E-01
272 A b 1.85112E+00 9.97028E-01
273 R c 1.81750E+00 9.94887E-01
274 L d 1.81750E+00 9.94887E-01
275 K e 1.73705E+00 9.81892E-01
276 K f 1.73705E+00 9.81892E-01
277 L g 1.59774E+00 8.66090E-01
278 V a 1.59774E+00 8.66090E-01
279 G b 1.41555E+00 3.21084E-01
280 E c 1.41555E+00 3.21084E-01
281 R d 1.12442E+00 1.05666E-02
The EGCG program HelixTurnHelix finds helix-turn-helix regions using the method of Dodd and Egan (1990) Nucl. Acids. Res. 18:5019-5026. Here is an example with the C2 repressor protein (from P22 rather than lambda):
$ helixturnhelix/infile=SW:RPC2_BPP22/default
$ type RPC2_BPP22.hth
HELIXTURNHELIX of Sw:Rpc2_Bpp22 from: 1 to: 216
Using distribution mean: 238.71 and SD: 293.61
Report scores beyond +2.50 standard deviations
Hits above +2.50 SD (972.73)
Score 2035 (+6.12 SD) in SW:RPC2_BPP22 at residue 20
P03035 bacteriophage p22, and bacteriophage p21 (bacteriophage 21).
repressor protein c2. 11/97
Sequence: RQAALGKMVGVSNVAISQWERS
| |
20 41
$ fetch sw:RPC2_BPP22
$ search RPC2_BPP22.sw helix,turn
FT HELIX 6 17
FT TURN 18 18
FT HELIX 21 28
FT TURN 29 29
FT HELIX 32 39
FT TURN 40 41
FT HELIX 47 56
FT TURN 57 58
FT HELIX 61 66
FT TURN 67 67
FT TURN 73 75
Note that there is also a helix-loop-helix motif in PROSITE. Similar names notwithstanding, these are not the same thing!
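The SD figures in the HelixTurnHelix report above are ordinary standard scores against the distribution mean and SD that the program prints. Here is a minimal Python sketch of that arithmetic, using the numbers from the report as defaults; the function names are invented for the illustration.

```python
def sd_score(raw_score, mean=238.71, sd=293.61):
    """How many standard deviations a raw score lies above the mean."""
    return (raw_score - mean) / sd


def cutoff_score(n_sd=2.5, mean=238.71, sd=293.61):
    """Raw score that corresponds to a threshold of n_sd standard deviations."""
    return mean + n_sd * sd


print(f"{sd_score(2035):+.2f} SD")  # the hit at residue 20
print(f"{cutoff_score():.3f}")      # raw score needed to pass +2.50 SD
```

The hit at residue 20 comes out at about +6.12 SD, and the +2.50 SD cutoff corresponds to a raw score of roughly 972.7, as in the report.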
There is also an assortment of properties that one can measure along the protein chain, as well as various predictions of secondary structure that can be made. The measurements are reasonably reliable, but the predictions are not. Part of the reason for this is that short peptides, in lengths that are common for a stretch of alpha helix or beta sheet in a protein, don't have defined structures in solution. That is, their structure in the protein depends in great measure on sequence outside of the subsequence itself. These algorithms work on small windows of a size that corresponds to these wobbly peptides, so they can at best only give indications of what sequences are likely to be in a helix or beta sheet in a protein.
Here is a set of secondary structure predictions, made using a variety of methods:
Comparisons of predicted secondary structure with that
observed in an actual structure (C2 fragment.) In each
case, the first 100 amino acids were put into the predicting
program, but only the first 70 are compared, as the NMR
fragment only covered that region.
stru: Actual structure, h = helix, t = turn, as marked
in the PDB entry 1ADR (NMR structure).
nn: H = helix, E = strand, - = no prediction
method: neural network
where: http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html
psa: probabilities rounded to the nearest 10, ie 7 = 70%.
method: Type-2 discrete state-space models
Where: http://bmerc-www.bu.edu/psa/request.htm
psa-request@darwin.bu.edu
SSPREDICT
Pred SS: Raw prediction
Clean SS: After filters applied to raw prediction
method: Algorithm based on sequence usage in 3D alignments
Where: http://www.embl-heidelberg.de/sspred/sspred_info.html
PHD sec: H = helix, E = strand, - = no prediction
method: neural network
PhD method : Rost & Sander, (1994) Proteins, 19, 55-72
Where: phd@EMBL-Heidelberg.de, or via WWW
SOPM
H = helix, S = strand, C = coil, T = turn
method: Self Optimized prediction method (plus others)
Gibrat method : Gibrat et al., (1987) J.Mol.Biol. 198, 425-443
Levin method : Levin et al., (1986) Febs Lett. 205, 303-308
DPM method : Deleage & Roux, (1987) Prot. Engng. 1, 289-294
Where: http://www.ibcp.fr/serv_pred.html
1 11 21 31 41 51 61
MNTQLMGERIRARRKKLKIRQAALGKMVGVSNVAISQWERSETEPNGENLLALSKALQCSPDYLLKGDLS
stru hhhhhhhhhhhht hhhhhhhht hhhhhhhtt hhhhhhhhhht hhhhhht
nn --------HHHHH---HHHHHHHH--EE-----EE-H-----------HHHHHHHH-------E------
psa
Loop 3332222211111222211222222222333333234333446665455433333335655545444445
Helix 6667777777777777888777776766554444544444332211223445545544211112222222
Turn 0000001110001110000011111100011110011122221234431100111111024320134311
Strand 1100110011110000111100110112122223321211111100011222221111111123311122
SSPRED
Pred HHEEHHEH-HHHHHHHHHHHHHHHHHEHHH-EEEEEEH----------EEHHHHHHHHH-HEEEHH----
Clean HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH-EEEEEE-------------HHHHHHHHHHHEEE------
PHD sec -------HHHHHHHHHHHHHHHHHHHH--EEEEEEEEE---------HHHHHHHHHHH---HHHE-----
SOPM:
Gibrat HHHHHHHHHHHHHHHHHHHHHHHHCHEEEEEEEEEEEEECCCCCCCHHHHHHHHHHHHCCCCHEEECCHC
Levin CSHHHHHHHHHHHCSHHHHHHHHHHHHHECCCCEECCHCCCCCCCTCCHHHHHHHHCCCCCHEEEESCSC
DPM CCCHHHCHHHHHHHHHHHHHHHHHCHEEEEEEEEEEHHHHCCCTTTTCCHHHHHHHHHCTCCCCTCCCCC
SOPMA HHHHHHHHHHHHHHHHHHHHHHHHHHHEECCHHHHHHHCCCCCCCCCHHHHHHHHHHHHCCCEEEECCCC
Cons. CCHHHHHHHHHHHHHHHHHHHHHHCHHEECCEEEEEHHCCCCCCCCCCHH HHHHHHHHCCCEEEECCCC
Most of these methods are about 70% accurate, so if you only need to predict a single region of helix or sheet, those are the odds that you'll get it right. However, if you want to build a model of a protein that has, say, 6 of those regions, your odds of having it all correct aren't very good: 0.7^6 = 0.12, or only a 12% probability of it being right (treating the regions as independent). And that's only the chance of it being generally correct; the odds of getting the exact region boundaries right are essentially zero.
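To see how quickly the odds fall off as the number of regions grows, here is a two-line Python sketch. It assumes each region is predicted independently with the same accuracy, which is a simplification, and the function name is made up.

```python
def chance_all_correct(per_region_accuracy, n_regions):
    """Probability that every region is right, assuming independent predictions."""
    return per_region_accuracy ** n_regions


for n_regions in (1, 2, 4, 6, 10):
    print(n_regions, round(chance_all_correct(0.7, n_regions), 3))
```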
The take home lesson from this is that you should really only base secondary structure predictions on sequence similarity to proteins whose structures have been determined. That being said, here are some tools which try to predict secondary structures directly from the amino acid sequence.
You can run some of the secondary structure prediction methods described above on remote servers. In both cases, the results will be mailed back to you.
$ nnpredict
$ em_phd
The GCG package provides a pair of programs which can make a plot representing the predicted secondary structure of a protein. These programs implement the methods described in Chou and Fasman (1978), Advances in Enzymol., 47:45-148; Kyte and Doolittle (1982), J. Mol. Biol. 157:105-132, and others. For more references, see
$ GENHELP PEPTIDESTRUCTURE DESCRIPTION
This is how these two programs are used:
$ peptidestructure/infile=SW:RPC2_BPP22
$ plotstructure/infile=Rpc2_Bpp22.p2s -
/begin=1/end=70/menu=1/default
$ plotstructure/infile=Rpc2_Bpp22.p2s -
/begin=1/end=70/menu=2 -
/number=10/default
In the squiggly plot, the four predicted types of motifs are:
Note that every region will be classified as one of these types. Also, it is hard to tell a sharp saw tooth from a dull one unless both are present in the plot.
This is the same region as we looked at when comparing different secondary structure predictions. As you know from that, the predicted structures here are largely incorrect!!!!
PepPlot is fairly similar to the above. It calculates a bunch of values which supposedly are indicators of probable secondary structure.
$ pepplot/infile=SW:RPC2_BPP22/begin=175/end=187/default
The hydrophobic moment is a sort of vector average of the known hydrophobicities of the amino acids, assuming a fixed rotation angle between consecutive residues. Normally one plots it over the range of rotation angles from 0 to 180 degrees. In some instances this piles up hydrophobic residues at "magic" angles, such as 100 degrees for an alpha helix or 180 degrees for a beta sheet. The GCG program Moment will plot this value for a protein.
$ moment/infile=SW:RPC2_BPP22/begin=150/end=200/default
The plots tend to be fairly noisy, and rarely do they give as clean a result as that shown. Also, the ends tend to be a bit blurred since the method requires the use of a window, which changes values slowly, rather than abruptly, when it slides off the end of a high scoring region.
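To make the "vector average" idea concrete, here is a rough sketch of the calculation for a single window. It is not the Moment program's code: the Kyte and Doolittle hydropathy scale is used as a stand-in (Moment may use a different scale and window size), and the function names and test peptide are made up for illustration.

```python
import math

# Kyte-Doolittle hydropathy values (J. Mol. Biol. 157:105-132).
# Assumption: a stand-in scale; the GCG Moment program may use another one.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(peptide, angle_deg):
    """Length of the vector sum of hydrophobicities, per residue.

    Residue i points in direction i * angle_deg around the axis.
    """
    step = math.radians(angle_deg)
    x = sum(KYTE_DOOLITTLE[aa] * math.cos(i * step) for i, aa in enumerate(peptide))
    y = sum(KYTE_DOOLITTLE[aa] * math.sin(i * step) for i, aa in enumerate(peptide))
    return math.hypot(x, y) / len(peptide)


def moment_scan(peptide, step_deg=10):
    """Moment at each rotation angle from 0 to 180 degrees."""
    return {a: hydrophobic_moment(peptide, a) for a in range(0, 181, step_deg)}


# Hypothetical test peptide: strictly alternating hydrophobic and charged
# residues, the pattern expected for a strand with one hydrophobic face.
scan = moment_scan("LKLKLKLK")
best_angle = max(scan, key=scan.get)
print(best_angle, round(scan[best_angle], 2))
```

For this artificial peptide the scan peaks at 180 degrees, the beta "magic" angle. Real sequences give much messier curves, as noted above.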
HelicalWheel is another graphical tool. Here is the same region as in the previous example, but this time "zoomed in" on the region which generates the signal:
$ helicalwheel/infile=SW:RPC2_BPP22 -
    /begin=175/end=187 -
    /default
With HelicalWheel plots, even random sequences will often seem to have a hydrophobic side and a hydrophilic side. This is because the eye can pick any division plane to split the helix, which allows considerable latitude for finding the most extreme distribution.
This tool may also be used to visualize beta sheets, once the /ANGLE and /RADIUS have been set appropriately. The following example shows the same region to see if a beta sheet with one hydrophobic side might be present here. The plot doesn't support that guess.
$ helicalwheel/infile=SW:RPC2_BPP22 -
    /begin=175/end=187 -
    /angle=180/radius=5 -
    /default
If two alpha helices twist around each other, amino acids located every 7 residues along one helix can line up with similarly spaced residues on the other chain. The leucine zipper has this geometry, with leucines spaced at intervals of 7 residues. In the absence of wrapping, an offset of 7 residues corresponds to 700 degrees (7 x 100), 20 degrees shy of two full turns. It is possible to take the twist angle into account, sort of, by adjusting the angle to 103 degrees (720 / 7, rounded). The result is a plot viewed down the twisting axis of the pair of helices. Here's one such region:
$ helicalwheel/infile=SW:ap1_chick -
    /begin=259/end=287 -
    /angle=103/radius=5 -
    /default
which is really a much better illustration of the effect than is:
$ helicalwheel/infile=SW:ap1_chick -
    /begin=259/end=287 -
    /default
The latter form is, however, appropriate if, despite the presence of the 7 residue repeat pattern, the peptide is not in a coiled coil conformation.
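The effect of the angle setting on a 7 residue repeat can be worked out directly. This is only an illustrative sketch of the wheel geometry, not anything taken from HelicalWheel itself; the function name is made up.

```python
def wheel_positions(start_index, count, step, angle_per_residue):
    """Angular position (degrees, 0-359) of every `step`-th residue on the wheel."""
    return [((start_index + k * step) * angle_per_residue) % 360
            for k in range(count)]


# Four successive residues of a heptad repeat (positions 0, 7, 14, 21).
ideal_helix = wheel_positions(0, 4, 7, 100)  # default 100 degrees per residue
coiled_coil = wheel_positions(0, 4, 7, 103)  # angle adjusted for the twist

print(ideal_helix)  # [0, 340, 320, 300]
print(coiled_coil)  # [0, 1, 2, 3]
```

At 100 degrees each successive leucine slips back another 20 degrees around the wheel, while at 103 degrees they stay stacked within a few degrees of each other, which is why the adjusted plot shows the zipper face so much more clearly.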
The EGCG program PepWheel may be used to make similar plots. PepWheel gives you a bit more control over the shapes drawn around each residue than does HelicalWheel.
The EGCG program PepNet is essentially a variant on the HelicalWheel program. Instead of looking down the axis, it splits the sequence with a period of 3.6 residues, corresponding to the period of an alpha helix, and unwraps it.
$ pepnet/infile=SW:RPC2_BPP22 -
    /begin=175/end=187 -
    /nocircles/nodiamonds -
    /squares="ILVMGAF" -
    /default
The default list of hydrophobic amino acids used by PepNet is not the same as for HelicalWheel. There is also a slight bug in this version which causes it to draw a couple of extra amino acids past where you tell it to stop.
SAPS is a program from Karlin's laboratory at Stanford which does an assortment of statistical analyses on a protein sequence. It *may* give you a hint as to how the protein is structured that you would not otherwise have picked up using the other tools we have available. However, SAPS is very much a last resort - something you can fall back on when your protein has no homology to any other protein and contains none of the known motifs. Reference: Brendel, V., Bucher, P., Nourbakhsh, I., Blaisdell, B.E., Karlin, S. (1992) PNAS 89:2002-2006.
SAPS is somewhat complicated to use. For more information, see the HELP listing for SAPS:
$ help @saps saps
First, convert a GCG formatted file into SAPS format (actually EMBL format). Then run the SAPS program, telling it which organism this sequence is from. The species it knows about are:
BACSU Bacillus subtilis
CHICK chicken
DROME Drosophila melanogaster
ECOLI Escherichia coli
HUMAN human
MOUSE mouse
RAT rat
XENLA frog
YEAST Saccharomyces cerevisiae
swp23s random sample of proteins from SWISS-PROT Release 23.0 (default)
$ readseq -f4 rpc2_bpp22.sw -orpc2_bpp22.saps
$ saps -s ecoli -o saps.results rpc2_bpp22.saps
$ type saps.results
SAPS. Version of March 26, 1993.
Date run: Wed Feb 17 11:37:50 1999
File: rpc2_bpp22.saps
ID Rpc2_Bpp22
DE Rpc2_Bpp22, 216 bases, B36B8C41 checksum.
number of residues: 216
1 MNTQLMGERI RARRKKLKIR QAALGKMVGV SNVAISQWER SETEPNGENL LALSKALQCS
61 PDYLLKGDLS QTNVAYHSRH EPRGSYPLIS WVSAGQWMEA VEPYHKRAIE NWHDTTVDCS
121 EDSFWLDVQG DSMTAPAGLS IPEGMIILVD PEVEPRNGKL VVAKLEGENE ATFKKLVMDA
181 GRKFLKPLNP QYPMIEINGN CKIIGVVVDA KLANLP
-------------------------------------------------------------------------
COMPOSITIONAL ANALYSIS (extremes relative to: ecoli.q)
A :17( 7.9%); C : 3( 1.4%); D :10( 4.6%); E :17( 7.9%); F- : 3( 1.4%)
G :15( 6.9%); H : 4( 1.9%); I :12( 5.6%); K :15( 6.9%); L :21( 9.7%)
M : 8( 3.7%); N :12( 5.6%); P :13( 6.0%); Q : 8( 3.7%); R :11( 5.1%)
S :14( 6.5%); T : 7( 3.2%); V :16( 7.4%); W : 5( 2.3%); Y : 5( 2.3%)
KR :26 ( 12.0%); ED :27 ( 12.5%); AGP :45 ( 20.8%);
KRED :53 ( 24.5%); KR-ED :-1 ( -0.5%); FIKMNY :55 ( 25.5%);
LVIFM :60 ( 27.8%); ST :21 ( 9.7%).
-------------------------------------------------------------------------
CHARGE DISTRIBUTIONAL ANALYSIS
1 0000000-+0 +0++++0+0+ 00000+0000 00000000-+ 0-0-000-00 0000+00000
61 0-000+0-00 00000000+0 -0+0000000 00000000-0 0-000++00- 000-000-00
121 --0000-000 -000000000 00-000000- 0-0-0+00+0 000+0-0-0- 000++000-0
181 0++00+0000 00000-0000 0+000000-0 +00000
A. CHARGE CLUSTERS.
Positive charge clusters (cmin = 10/30 or 14/45 or 17/60): none
Negative charge clusters (cmin = 10/30 or 14/45 or 17/60): none
Mixed charge clusters (cmin = 16/30 or 22/45 or 28/60): none
B. HIGH SCORING (UN)CHARGED SEGMENTS.
There are no high scoring positive charge segments.
There are no high scoring negative charge segments.
There are no high scoring mixed charge segments.
There are no high scoring uncharged segments.
C. CHARGE RUNS AND PATTERNS.
pattern (+)| (-)| (*)| (0)| (+0)| (-0)| (*0)|(+00)|(-00)|(*00)|
lmin0 5 | 5 | 7 | 30 | 9 | 9 | 12 | 10 | 11 | 14 |
lmin1 6 | 6 | 9 | 37 | 11 | 11 | 15 | 13 | 13 | 17 |
lmin2 7 | 7 | 10 | 41 | 12 | 12 | 17 | 15 | 15 | 19 |
There are no charge runs or patterns exceeding the given minimal lengths.
Run count statistics:
+ runs >= 3: 1, at 13;
- runs >= 3: 0
* runs >= 5: 0
0 runs >= 20: 0
-------------------------------------------------------------------------
DISTRIBUTION OF OTHER AMINO ACID TYPES
1. HIGH SCORING SEGMENTS.
There are no high scoring hydrophobic segments.
There are no high scoring transmembrane segments.
2. SPACINGS OF C.
H2N-58-C-59-C-81-C-15-COOH
-------------------------------------------------------------------------
REPETITIVE STRUCTURES.
A. SEPARATED, TANDEM, AND PERIODIC REPEATS: amino acid alphabet.
Repeat core block length: 4
B. SEPARATED AND TANDEM REPEATS: 11-letter reduced alphabet.
(i= LVIF; += KR; -= ED; s= AG; o= ST; n= NQ; a= YW; p= P; h= H; m= M; c= C)
Repeat core block length: 8
-------------------------------------------------------------------------
MULTIPLETS.
A. AMINO ACID ALPHABET.
1. Total number of amino acid multiplets: 11 (Expected range: 2-- 23)
2. Histogram of spacings between consecutive amino acid multiplets:
(1-5) 2 (6-10) 2 (11-20) 4 (>=21) 4
3. Clusters of amino acid multiplets (cmin = 10/30 or 13/45 or 15/60): none
B. CHARGE ALPHABET.
1. Total number of charge multiplets: 5 (Expected range: 0-- 13)
4 +plets (f+: 12.0%), 1 -plets (f-: 12.5%)
Total number of charge altplets: 2 (Critical number: 16)
2. Histogram of spacings between consecutive charge multiplets:
(1-5) 0 (6-10) 1 (11-20) 2 (>=21) 3
-------------------------------------------------------------------------
PERIODICITY ANALYSIS.
A. AMINO ACID ALPHABET (core: 4; !-core: 5)
Location Period Element Copies Core Errors
There are no periodicities of the prescribed length.
B. CHARGE ALPHABET ({+= KR; -= ED; 0}; core: 5; !-core: 6)
and HYDROPHOBICITY ALPHABET ({*=KRED; i=LVIF; 0}; core: 6; !-core:9)
Location Period Element Copies Core Errors
102- 125 4 *00. 6 6 /0/1/1/./
-------------------------------------------------------------------------
SPACING ANALYSIS.
Location (Quartile) Spacing Rank P-value Interpretation
25- 29 (1.) G( 4)G 16 of 16 0.0051 large minimal spacing
In this instance, SAPS didn't tell us anything very useful.
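Two of the simpler parts of that report are easy to reproduce by hand, which helps in reading the rest. The sketch below recomputes the KR and ED totals and the start of the charge string from the sequence in the listing, using the charge alphabet the report itself states (+ = KR, - = ED, 0 = everything else). It is an illustration only, not SAPS code, and the names are made up.

```python
from collections import Counter

# Rpc2_Bpp22, copied from the SAPS listing above (216 residues).
SEQ = (
    "MNTQLMGERIRARRKKLKIRQAALGKMVGVSNVAISQWERSETEPNGENLLALSKALQCS"
    "PDYLLKGDLSQTNVAYHSRHEPRGSYPLISWVSAGQWMEAVEPYHKRAIENWHDTTVDCS"
    "EDSFWLDVQGDSMTAPAGLSIPEGMIILVDPEVEPRNGKLVVAKLEGENEATFKKLVMDA"
    "GRKFLKPLNPQYPMIEINGNCKIIGVVVDAKLANLP"
)


def charge_string(seq):
    """Map each residue to '+', '-' or '0' using the SAPS charge alphabet."""
    return "".join(
        "+" if aa in "KR" else "-" if aa in "ED" else "0" for aa in seq
    )


counts = Counter(SEQ)
kr = counts["K"] + counts["R"]
ed = counts["E"] + counts["D"]

print(len(SEQ), kr, ed, kr - ed)   # 216 26 27 -1
print(charge_string(SEQ[:20]))     # 0000000-+0+0++++0+0+
```

These match the KR, ED and KR-ED figures and the first two blocks of the charge line in the report.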
PSORT2 is a program which attempts to classify the subcellular localization of a protein. It must first be trained with a set of proteins whose subcellular locations are known. Two databases have been created: the default is for animal/yeast proteins, and PLANTC is for plants (especially dicots). The reference for PSORT2 is Horton, P. and Nakai, K. (1996) Intelligent Systems for Molecular Biology 4:109-115. (Many thanks to K. Nakai for providing us with the source code and the chloroplast fixes!)
$ tofasta/infile=sw:H2a2_Human/out=h2a2.fsa/default
$ psort2 h2a2.fsa
---------------------------------------------------------------------------
H2A2_HUMAN P28001 homo sapiens (human). histone h2a.2. 8/92
psg: 0.50 gvh: 0.52 alm: 0.39 top: 0.53 tms: 0.00 mit: 0.44 mip: 0.08
nuc: 0.12 erl: 0.00 erm: 0.60 pox: 0.00 px2: 0.50 vac: 0.00 rnp: 0.00
act: 0.00 caa: 0.00 yqr: 0.00 tyr: 0.00 leu: 0.00 gpi: 0.00 myr: 0.00
dna: 0.13 rib: 0.00 bac: 0.00 m1a: 0.00 m1b: 0.00 m2 : 0.00 mNt: 0.00
m3a: 0.00 m3b: 0.00 m_ : 1.00 ncn: 0.88 lps: 0.00 len: 0.03 clr: 1.00
39.1 %: nuclear
30.4 %: cytoplasmic
30.4 %: mitochondrial
>> prediction for H2A2_HUMAN is nuc (k=23)
$ psort2 -d PLANTC h2a2.fsa
---------------------------------------------------------------------------
H2A2_HUMAN P28001 homo sapiens (human). histone h2a.2. 8/92
psg: 0.37 gvh: 0.57 alm: 0.40 top: 0.51 tms: 0.00 mit: 0.45 mip: 0.21
nuc: 0.18 erl: 0.00 erm: 0.60 pox: 0.00 px2: 0.50 vac: 0.00 rnp: 0.00
act: 0.00 caa: 0.00 yqr: 0.00 tyr: 0.00 leu: 0.00 gpi: 0.00 myr: 0.00
dna: 0.33 rib: 0.00 bac: 0.00 m1a: 0.00 m1b: 0.00 m2 : 0.00 mNt: 0.00
m3a: 0.00 m3b: 0.00 m_ : 1.00 ncn: 0.88 lps: 0.00 len: 0.05 clr: 1.00
87.0 %: nuclear
8.7 %: mitochondrial
4.3 %: cytoplasmic
>> prediction for H2A2_HUMAN is nuc (k=23)
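The "k=23" in the output suggests that the final call is a vote among the 23 most similar training proteins. Assuming that is what is going on, the percentages are just vote counts divided by 23. The sketch below shows the bookkeeping; the 9/7/7 and 20/2/1 tallies are inferred from the percentages, not taken from the program, and the function name is made up.

```python
from collections import Counter


def vote_shares(neighbor_labels):
    """Percentage of the k neighbors voting for each location, largest first."""
    k = len(neighbor_labels)
    counts = Counter(neighbor_labels)
    return [(label, round(100 * n / k, 1)) for label, n in counts.most_common()]


# Inferred tallies (assumption): 23 neighbors in each run.
default_db = ["nuclear"] * 9 + ["cytoplasmic"] * 7 + ["mitochondrial"] * 7
plant_db = ["nuclear"] * 20 + ["mitochondrial"] * 2 + ["cytoplasmic"] * 1

print(vote_shares(default_db))
print(vote_shares(plant_db))
```

Read this way, the first run is a fairly weak call (9 of 23 neighbors), even though both runs report the same answer.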
Next week we'll cover some phylogenetic tools. Are there any questions?