SAPS. Version of March 24, 1993.
Date run: Wed Mar 24 16:12:59 1993

SAPS (Statistical Analysis of Protein Sequences) evaluates by statistical criteria a wide variety of protein sequence properties. A full description of the methods is given in the paper referred to below. The output is organized in the following sections: file name, sequence printout, compositional analysis, charge distributional analysis (charge clusters; high scoring (un)charged segments; charge runs and patterns), distribution of other amino acid types (high scoring hydrophobic and transmembrane segments; cysteine spacings), repetitive structures (in the amino acid alphabet and in an 11-letter reduced alphabet), multiplets (counts, spacings, and clusters in the amino acid and charge alphabets), periodicity analysis, spacing analysis. Each section is annotated below under its section title.

The SAPS program was developed in the group of Prof. Samuel Karlin at Stanford University. Correspondence relating to SAPS should be addressed to either Volker Brendel or Samuel Karlin at the Department of Mathematics, Stanford University, Stanford CA 94305, U.S.A.; phone: (415) 723-2209; fax: (415) 725-2040; email: volker@gnomic.stanford.edu.

Users of the program should cite the following reference:

Brendel, V., Bucher, P., Nourbakhsh, I., Blaisdell, B.E., Karlin, S. (1992) Methods and algorithms for statistical analysis of protein sequences. Proc. Natl. Acad. Sci. USA 89: 2002-2006.

File: testpro

SWISS-PROT ANNOTATION:
ID HMCU_DROME STANDARD; PRT; 2175 AA.
DE HOMEOBOX PROTEIN CUT.
RA BLOCHLINGER K., BODMER R., JACK J., JAN L.Y., JAN Y.N.;
RL NATURE 333:629-635(1988).
CC -!- FUNCTION: CUT IS INVOLVED IN SPECIFYING SENSORY ORGAN IDENTITY IN
CC FRUIT FLY. IN ABSENCE OF CUT GENE EXTERNAL SENSORY ORGANS ARE
CC TRANSFORMED INTO CHORDOTONAL ORGANS.
FT DOMAIN 194 210 ALA/GLN-RICH.
FT DOMAIN 235 243 ALA-RICH.
FT DOMAIN 271 293 ASP/GLU-RICH (ACIDIC).
FT DOMAIN 384 428 ASN-RICH.
FT DOMAIN 547 554 ASP/GLU-RICH (ACIDIC). FT DOMAIN 574 584 ASP/GLU-RICH (ACIDIC). FT DOMAIN 616 630 ALA-RICH. FT DOMAIN 665 699 HIS/GLN-RICH. FT REPEAT 886 945 'CUT'-REPEAT. FT REPEAT 1339 1398 'CUT'-REPEAT. FT REPEAT 1617 1676 'CUT'-REPEAT. FT DNA_BIND 1745 1804 HOMEOBOX. FT DOMAIN 2004 2014 ALA-RICH. FT DOMAIN 2071 2077 ASP/GLU-RICH (ACIDIC). FT DOMAIN 2124 2136 ALA/PRO-RICH. number of residues: 2175 1 MQPTLPQAAG TADMDLTAVQ SINDWFFKKE QIYLLAQFWQ QRATLAEKEV NTLKEQLSTG 61 NPDSNLNSEN SDTAAAAATA AAVAAVVAGA TATNDIEDEQ QQQLQQTASG GILESDSDKL 121 LNSSIVAAAI TLQQQNGSNL LANTNTPSPS PPLLSAEQQQ QLQSSLQQSG GVGGACLNPK 181 LFFNHAQQMM MMEAAAAAAA AALQQQQQQQ SPLHSPANEV AIPTEQPAAT VATGAAAAAA 241 AAATPIATGN VKSGSTTSNA NHTNSNNSHQ DEEELDDEEE DEEEDEDEDD EEENASMQSN 301 ADDMELDAQQ ETRTEPSATT QQQHQQQDTE DLEENKDAGE ASLNVSNNHN TTDSNNSCSR 361 KNNNGGNESE QHVASSAEDD DCANNNTNTS NNNNTSNTAT SNTNNNNNNN SSSGNSEKRK 421 KKNNNNNNGQ PAVLLAAKDK EIKALLDELQ RLRAQEQTHL VQIQRLEEHL EVKRQHIIRL 481 EARLDKQQIN EALAEATALS AAASTNNNNN SQSSDNNKKL NTAAERPMDA SSNADLPEST 541 KAPVPAEDDE EDEDQAMLVD SEEAEDKPED SHHDDDEDED EDREAVNATT TDSNELKIKK 601 EQHSPLDLNV LSPNSAIAAA AAAAAAAACA NDPNKFQALL IERTKALAAE ALKNGASDAL 661 SEDAHHQQQQ HHQQQHQHQQ QHHQQQHLHQ QHHHHLQQQP NSGSNSNPAS NDHHHGHHLH 721 GHGLLHPSSA HHLHHQTTES NSNSSTPTAA GNNNGSNNSS SNTNANSTAQ LAASLASTLN 781 GTKSLMQEDS NGLAAVAMAA HAQHAAALGP GFLPGLPAFQ FAAAQVAAGG DGRGHYRFAD 841 SELQLPPGAS MAGRLGESLI PKGDPMEAKL QEMLRYNMDK YANQALDTLH ISRRVRELLS 901 VHNIGQRLFA KYILGLSQGT VSELLSKPKP WDKLTEKGRD SYRKMHAWAC DDNAVMLLKS 961 LIPKKDSGLP QYAGRGAGGA GGDDSMSEDR IAHILSEASS LMKQSSVAQH REQERRSHGG 1021 EDSHSNEDSK SPPQSCTSPF FKVENQLKQH QHLNPEQAAA QQREREREQR EREQQQRLRH 1081 DDQDKMARLY QELIARTPRE TAFPSFLFSP SLFGGAAGMP GAASNAFPAM ADENMRHVFE 1141 REIAKLQQHQ QQQQAAQAQA QFPNFSSLMA LQQQVLNGAQ DLSLAAAAAK DIKLNGQRSS 1201 LEHSAGSSSC SKDGERDDAY PSSLHGRKSE GGGTPAPPAP PSGPGTGAGA PPTAAPPTGG 1261 ASSNSAAPSP LSNSILPPAL SSQGEEFAAT ASPLQRMASI TNSLITQPPV TPHHSTPQRP 1321 TKAVLPPITQ QQFDMFNNLN TEDIVRRVKE ALSQYSISQR 
LFGESVLGLS QGSVSDLLAR
1381 PKPWHMLTQK GREPFIRMKM FLEDENAVHK LVASQYKIAP EKLMRTGSYS GSPQMPQGLA
1441 SKMQAASLPM QKMMSELKLQ EPAQAQHLMQ QMQAAAMSAA MQQQQVAQAQ QQAQQAQQAQ
1501 QHLQQQAQQH LQQQQHLAQQ QHPHQQHHQA AAAAAALHHQ SMLLTSPGLP PQHAISLPPS
1561 AGGAQPGGPG GNQGSSNPSN SEKKPMLMPV HGTNAMRSLH QHMSPTVYEM AALTQDLDTH
1621 DITTKIKEAL LANNIGQKIF GEAVLGLSQG SVSELLSKPK PWHMLSIKGR EPFIRMQLWL
1681 SDANNVERLQ LLKNERREAS KRRRSTGPNQ QDNSSDTSSN DTNDFYTSSP GPGSVGSGVG
1741 GAPPSKKQRV LFSEEQKEAL RLAFALDPYP NVGTIEFLAN ELGLATRTIT NWFHNHRMRL
1801 KQQVPHGPAG QDNPIPSRES TSATPFDPVQ FRILLQQRLL ELHKERMGMS GAPIPYPPYF
1861 AAAAILGRSL AGIPGAAAAA GAAAAAAAVG ASGGDELQAL NQAFKEQMSG LDLSMPTLKR
1921 ERSDDYQDDL ELEGGGHNLS DNESLEGQEP EDKTTDYEKV LHKSALAAAA AYMSNAVRSS
1981 RRKPAAPQWV NPAGAVTNPS AVVAAVAAAA AAAADNERII NGVCVMQASE YGRDDTDSNK
2041 PTDGGNDSDH EHAQLEIDQR FMEPEVHIKQ EEDDDEEQSG SVNLDNEDNA TSEQKLKVIN
2101 EEKLRMVRVR RLSSTGGGSS EEMPAPLAPP PPPPAASSSI VSGESTTSSS SSSNTSSSTP
2161 AVTTAAATAA AGWNY

--------------------------------------------------------------------------------

COMPOSITIONAL ANALYSIS (extremes relative to: DROME.q)

The composition of the input sequence is evaluated relative to the residue usage quantile table specified with the `-s species' flag. Low usage in the 1% quantile is indicated by the label `--' (e.g., Y-- means that the input sequence uses tyrosine as little as the 1% least tyrosine-containing proteins in the reference set); low usage in the 5% quantile is indicated by the label `-' (e.g., L-); high usage above the 95% quantile point is indicated by the label `+' (e.g., A+); and high usage above the 99% quantile point is indicated by the label `++' (e.g., LIVFM++). The usage is evaluated for all 20 amino acids, positive (KR) and negative (ED) charge, total charge (KRED), net charge (KR-ED), major hydrophobics (LVIFM), and the groupings ST, AGP (encoded by CCN, GCN, and GGN codons), and FIKMNY (encoded by AAN, AUN, UAN, and UUN codons).
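The grouped usages named above can be recomputed from a sequence string in a few lines of Python. This is a minimal sketch and not the SAPS implementation: the function names `group_usage` and `quantile_label` are hypothetical, the quantile points passed to `quantile_label` are placeholders (the DROME.q table itself is not part of this output), and the handling of values that fall exactly on a quantile point is an assumption.

```python
from collections import Counter

# Residue groupings evaluated in the compositional analysis.
GROUPS = ("KR", "ED", "KRED", "LVIFM", "ST", "AGP", "FIKMNY")


def group_usage(seq):
    """Return {group: (count, percent)} for the groupings listed above."""
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    n = len(seq)
    usage = {g: sum(counts[r] for r in g) for g in GROUPS}
    usage["KR-ED"] = usage["KR"] - usage["ED"]  # net charge
    return {g: (c, 100.0 * c / n) for g, c in usage.items()}


def quantile_label(percent, q01, q05, q95, q99):
    """Map a usage percentage to '--', '-', '', '+', or '++'.

    q01..q99 are hypothetical quantile points of the reference set;
    the boundary handling (<= and >=) is an assumption of this sketch.
    """
    if percent <= q01:
        return "--"
    if percent <= q05:
        return "-"
    if percent >= q99:
        return "++"
    if percent >= q95:
        return "+"
    return ""


if __name__ == "__main__":
    # First 30 residues of the sequence printed above.
    usage = group_usage("MQPTLPQAAGTADMDLTAVQSINDWFFKKE")
    for name, (count, pct) in usage.items():
        print(f"{name:7s}: {count:3d} ({pct:5.1f}%)")
```

Applied to the full 2175-residue sequence, the group counts are simply sums of the single-residue counts in the table that follows, e.g. KR = 88 + 83 = 171 and AGP = 294 + 127 + 129 = 550.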
A+ :294(13.5%); C  :  8( 0.4%); D  :114( 5.2%); E  :150( 6.9%); F  : 39( 1.8%)
G  :127( 5.8%); H  : 85( 3.9%); I  : 54( 2.5%); K  : 88( 4.0%); L  :182( 8.4%)
M  : 58( 2.7%); N  :144( 6.6%); P  :129( 5.9%); Q  :203( 9.3%); R  : 83( 3.8%)
S  :215( 9.9%); T  :104( 4.8%); V  : 66( 3.0%); W  : 10( 0.5%); Y- : 22( 1.0%)

KR      :  171 (  7.9%);   ED      :  264 ( 12.1%);   AGP     :  550 ( 25.3%);
KRED    :  435 ( 20.0%);   KR-ED   :  -93 ( -4.3%);   FIKMNY  :  405 ( 18.6%);
LVIFM   :  399 ( 18.3%);   ST      :  319 ( 14.7%).

--------------------------------------------------------------------------------

CHARGE DISTRIBUTIONAL ANALYSIS

The distribution of charges in the protein sequence is evaluated in terms of clusters, high scoring segments, and runs and periodic patterns. Clusters indicate regions of typically 30 to 60 residues exhibiting a relatively high charge concentration. For high scoring charge segments, positive scores are assigned to charged residues of the appropriate type and negative scores to all other residues. A significant cumulative positive score again indicates a region of high charge concentration. The cluster method and the scoring method will generally pick out the same segments (with the scoring method often delimiting the segment to a narrower range), conferring robustness to the results. Short segments of high charge concentration are displayed as runs (with errors). Periodic patterns focus on those with charges every second or third position, with possible relevance to amphipathic secondary structures; other periodic patterns are displayed in the general periodicity analysis section of the output.
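The charge alphabet and the scoring idea described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the SAPS code: the function names are hypothetical, the score values are those of the negative charge scheme printed later in section B (ED = 2, KR = -2, BZX = 0, all others = -1), and the segment search is a plain maximum-sum scan without the significance thresholds M_0.01 and M_0.05.

```python
def to_charge_alphabet(seq):
    """Translate residues to '+' (K, R), '-' (E, D), or '0' (all others)."""
    return "".join("+" if r in "KR" else "-" if r in "ED" else "0" for r in seq)


# Scores of the negative charge scheme as listed in the output (section B).
NEGATIVE_SCHEME = {"E": 2, "D": 2, "K": -2, "R": -2, "B": 0, "Z": 0, "X": 0}


def residue_scores(seq, scheme, default=-1):
    """Score each residue; residues not named in the scheme get `default`."""
    return [scheme.get(r, default) for r in seq]


def max_scoring_segment(scores):
    """Return (best_score, start, end) of the maximal-sum segment.

    `start` is a 0-based index and `end` is exclusive. An empty or
    all-negative input yields (0, 0, 0).
    """
    best, best_start, best_end = 0, 0, 0
    run, run_start = 0, 0
    for i, s in enumerate(scores):
        if run <= 0:
            run, run_start = s, i  # restart the segment here
        else:
            run += s
        if run > best:
            best, best_start, best_end = run, run_start, i + 1
    return best, best_start, best_end


if __name__ == "__main__":
    print(to_charge_alphabet("MQPTLPQAAGTADMDLTAVQSINDWFFKKE"))
    # Residues 264-300 of the sequence: the acidic stretch 271-293 with flanks.
    region = "NSNNSHQ" + "DEEELDDEEEDEEEDEDEDDEEE" + "NASMQSN"
    print(max_scoring_segment(residue_scores(region, NEGATIVE_SCHEME)))
```

On the first 30 residues the translation reproduces the start of the charge printout below, and on the region around residues 271 to 293 the scan recovers the score of 43 reported for that segment (22 acidic residues at 2 each, one leucine at -1).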
1 0000000000 00-0-00000 000-000++- 0000000000 0+0000-+-0 000+-00000 61 00-00000-0 0-00000000 0000000000 0000-0---0 0000000000 000-0-0-+0 121 0000000000 0000000000 0000000000 000000-000 0000000000 000000000+ 181 0000000000 00-0000000 0000000000 00000000-0 0000-00000 0000000000 241 0000000000 0+00000000 0000000000 ----0----- ---------- ---0000000 301 0--0-0-000 -0+0-00000 0000000-0- -0--0+-00- 0000000000 00-000000+ 361 +000000-0- 0000000--- -000000000 0000000000 0000000000 000000-+++ 421 ++00000000 0000000+-+ -0+000--00 +0+00-0000 0000+0--00 -0++0000+0 481 -0+0-+0000 -000-00000 0000000000 0000-00++0 0000-+00-0 0000-00-00 541 +00000---- ----00000- 0--0--+0-- 000------- --+-000000 0-00-0+0++ 601 -00000-000 0000000000 0000000000 0-00+00000 0-+0+0000- 00+0000-00 661 0--0000000 0000000000 0000000000 0000000000 0000000000 0-00000000 721 0000000000 00000000-0 0000000000 0000000000 0000000000 0000000000 781 00+0000--0 0000000000 0000000000 0000000000 0000000000 -0+000+00- 841 0-00000000 000+00-000 0+0-00-0+0 0-00+000-+ 000000-000 00++0+-000 901 000000+000 +000000000 00-000+0+0 0-+00-+0+- 00++000000 --000000+0 961 000++-0000 0000+00000 00--000--+ 000000-000 00+0000000 +-0-++0000 1021 --0000--0+ 0000000000 0+0-000+00 00000-0000 00+-+-+-0+ -+-000+0+0 1081 --0-+00+00 0-000+00+- 0000000000 0000000000 0000000000 0--00+000- 1141 +-00+00000 0000000000 0000000000 0000000000 -00000000+ -0+0000+00 1201 0-00000000 0+-0-+--00 000000++0- 0000000000 0000000000 0000000000 1261 0000000000 0000000000 0000--0000 00000+0000 0000000000 00000000+0 1321 0+00000000 000-000000 0--00++0+- 000000000+ 000-000000 00000-000+ 1381 0+0000000+ 0+-000+0+0 00---0000+ 000000+000 -+00+00000 0000000000 1441 0+00000000 0+000-0+00 -000000000 0000000000 0000000000 0000000000 1501 0000000000 0000000000 0000000000 0000000000 0000000000 0000000000 1561 0000000000 0000000000 0-++000000 000000+000 00000000-0 00000-0-00 1621 -000+0+-00 0000000+00 0-00000000 000-000+0+ 0000000+0+ -000+00000 1681 0-0000-+00 00+0-++-00 
++++000000 0-000-0000 -00-000000 0000000000
1741 00000++0+0 000--0+-00 +00000-000 00000-0000 -00000+000 000000+0+0
1801 +000000000 0-00000+-0 000000-000 0+00000+00 -00+-+0000 0000000000
1861 0000000+00 0000000000 0000000000 0000--0000 0000+-0000 0-000000++
1921 -+0--00--0 -0-0000000 -0-00-00-0 --+00-0-+0 00+0000000 0000000+00
1981 +++0000000 0000000000 0000000000 0000-0-+00 000000000- 00+--0-00+
2041 00-000-0-0 -0000-0-0+ 00-0-000+0 -------000 0000-0--00 00-0+0+000
2101 --+0+00+0+ +000000000 --00000000 0000000000 000-000000 0000000000
2161 0000000000 00000

A. CHARGE CLUSTERS.

Positive, negative, and mixed charge clusters are distinguished. In each case, cmin indicates the minimum number of charges required for a significant charge cluster corresponding to the given window size; e.g., cmin = 9/30 or 12/45 or 15/60 means that significance requires at least 9 charges in a segment of 30 (or fewer) residues, or 12 charges in a segment of length 45, or 15 charges in a segment of length 60. In the case of positive and negative charge clusters, these counts refer to net charge, i.e., charges of the opposite sign within the window are counted as -1. The sizes of the clusters are optimized for display to indicate the segment of highest charge concentration, but a minimum size of 20 residues is required. A mixed charge cluster that begins and ends within 15 residues of the endpoints of a pure charge cluster is not displayed (since its significance rests mostly on the charged residues comprising the displayed pure charge cluster), unless the -v (verbose output) flag is set, in which case both the pure and the mixed charge cluster are displayed. On the other hand, pure charge clusters that are embedded in mixed charge clusters are displayed separately (indicated by a * preceding the specification of location).
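The cmin criterion for pure charge clusters can be illustrated with a sliding-window net charge count over a charge-alphabet string. The sketch below is a simplification under stated assumptions: the function names are hypothetical, only the cmin comparison is shown, and the t-value test and the display optimization performed by SAPS are not reproduced.

```python
def net_charge_windows(charges, window, sign="-"):
    """Yield (start, net) for every window of the charge string.

    Charges equal to `sign` count +1, charges of the opposite sign
    count -1, and '0' counts 0. `start` is a 0-based index.
    """
    other = "+" if sign == "-" else "-"
    weight = {sign: 1, other: -1}
    vals = [weight.get(c, 0) for c in charges]
    n = len(vals)
    if n == 0:
        return
    w = min(window, n)  # a segment of `window` or fewer residues
    net = sum(vals[:w])
    yield 0, net
    for start in range(1, n - w + 1):
        # Slide by one: add the entering position, drop the leaving one.
        net += vals[start + w - 1] - vals[start - 1]
        yield start, net


def passes_cmin(charges, cmin, window, sign="-"):
    """True if some window reaches a net charge of at least cmin."""
    return any(net >= cmin for _, net in net_charge_windows(charges, window, sign))


if __name__ == "__main__":
    # Charge pattern of residues 271-293 with ten uncharged flanking positions.
    row = "0" * 10 + "-" * 4 + "0" + "-" * 18 + "0" * 10
    print(passes_cmin(row, 12, 30, sign="-"))  # negative cluster, cmin = 12/30
    print(passes_cmin(row, 9, 30, sign="+"))   # positive cluster, cmin = 9/30
```

For the acidic stretch at 271 to 293 the best 30-residue window holds a net negative charge of 22, well above the 12/30 requirement quoted for negative clusters in this output.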
For each cluster, the output gives its location in the sequence (from, to), the quartile of the location (1st, 2nd, 3rd, or 4th quarter of the sequence), length, count, and t-value (standard deviations above the mean; to accommodate the multiple tests performed, the t-value significance threshold is set to 4.0 for sequences up to 750 residues, to 4.5 for sequences of length 750-1500 residues, and to 5.0 for longer sequences); also indicated are residues comprising at least 10% of the cluster.

Positive charge clusters (cmin = 9/30 or 12/45 or 15/60): none

Negative charge clusters (cmin = 12/30 or 16/45 or 19/60):

1) From 271 to 293: DEEELDDEEEDEEEDEDEDDEEE
                    ----0------------------
   quartile: 1; size: 23, +count: 0, -count: 22, 0count: 1; t-value: 12.26 *
   E: 14 (60.9%); D: 8 (34.8%);

2) From 547 to 582: EDDEEDEDQAMLVDSEEAEDKPEDSHHDDDEDEDED
                    --------00000-0--0--+0--000---------
   quartile: 2; size: 36, +count: 1, -count: 24, 0count: 11; t-value: 9.51 *
   E: 11 (30.6%); D: 13 (36.1%);

3) From 1924 to 1952: DDYQDDLELEGGGHNLSDNESLEGQEPED
                      --00--0-0-0000000-0-00-00-0--
   quartile: 4; size: 29, +count: 0, -count: 12, 0count: 17; t-value: 4.82
   L: 4 (13.8%); G: 4 (13.8%); E: 6 (20.7%); D: 6 (20.7%);

4)*From 2034 to 2088: DDTDSNKPTDGGNDSDHEHAQLEIDQRFMEPEVHIKQEEDDDEEQSGSVNLDNED
                      --0-00+00-000-0-0-0000-0-0+00-0-000+0-------0000000-0--
   quartile: 4; size: 55, +count: 3, -count: 21, 0count: 31; t-value: 4.68
   E: 9 (16.4%); D: 12 (21.8%);

Mixed charge clusters (cmin = 16/30 or 22/45 or 27/60):

1) From 1063 to 1085: REREREQREREQQQRLRHDDQDK
                      +-+-+-0+-+-000+0+0--0-+
   quartile: 2; size: 23, +count: 8, -count: 8, 0count: 7; t-value: 5.94 *
   E: 5 (21.7%); D: 3 (13.0%); R: 7 (30.4%); Q: 5 (21.7%);

2)*From 2033 to 2111: see sequence above
                      see sequence above
   quartile: 4; size: 79, +count: 11, -count: 24, 0count: 44; t-value: 5.40 *
   E: 12 (15.2%); D: 12 (15.2%);

B. HIGH SCORING (UN)CHARGED SEGMENTS.
For each scoring scheme (scores assigned to residues as displayed), SAPS displays segments of the sequence with aggregate score exceeding the particular threshold values M_0.01 (1% significance level, segments labeled with **), M_0.05 (5% significance level, segments labeled *), or otherwise as indicated. A minimal segment length is set as shown. The expected score/letter should be sufficiently large negative, and the average information per letter should be sufficiently large positive in order for the scoring statistics to apply properly (the program reports when the conditions are not met and skips the evaluation).

______________________________________
High scoring positive charge segments:

score=  2.00 frequency= 0.079 ( KR )
score=  0.00 frequency= 0.000 ( BZX )
score= -1.00 frequency= 0.800 ( LAGSVTIPNFQYHMCW )
score= -2.00 frequency= 0.121 ( ED )

Expected score/letter: -0.886; Average information/letter: 1.850
Minimal length of displayed segments set to: 20
M_0.01= 10.22 (cv= 6.93, lambda= 1.10938, k= 0.38710, x= 3.29;
        90% confidence interval for segment length: 9 +- 7)
M_0.05= 8.75 (x= 1.82)

# of segments (>=20 residues) exceeding M_0.05: none

______________________________________
High scoring negative charge segments:

score=  2.00 frequency= 0.121 ( ED )
score=  0.00 frequency= 0.000 ( BZX )
score= -1.00 frequency= 0.800 ( LAGSVTIPNFQYHMCW )
score= -2.00 frequency= 0.079 ( KR )

Expected score/letter: -0.714; Average information/letter: 1.064
Minimal length of displayed segments set to: 20
M_0.01= 13.36 (cv= 9.29, lambda= 0.82758, k= 0.29370, x= 4.08;
        90% confidence interval for segment length: 15 +- 13)
M_0.05= 11.39 (x= 2.11)

1) From 271 to 293: length= 23, score=43.00 **
    271 DEEELDDEEE DEEEDEDEDD EEE
   E: 14(60.9%); D: 8(34.8%);

2) From 547 to 582: length= 36, score=35.00 **
    547 EDDEEDEDQA MLVDSEEAED KPEDSHHDDD EDEDED
   E: 11(30.6%); D: 13(36.1%);

# of segments (>=20 residues) exceeding M_0.05: 2

___________________________________
High scoring mixed charge segments:

score=  1.00 frequency= 0.200 ( KEDR )
score=  0.00 frequency= 0.000 ( BZX )
score= -1.00 frequency= 0.800 ( LAGSVTIPNFQYHMCW )

Expected score/letter: -0.600; Average information/letter: 1.200
Minimal length of displayed segments set to: 20
M_0.01= 8.29 (cv= 5.54, lambda= 1.38629, k= 0.45000, x= 2.74;
        90% confidence interval for segment length: 14 +- 10)
M_0.05= 7.11 (x= 1.57)

1) From 271 to 293: length= 23, score=21.00 **
    271 DEEELDDEEE DEEEDEDEDD EEE
   E: 14(60.9%); D: 8(34.8%);

2) From 547 to 584: length= 38, score=16.00 **
   (pocket at 555 to 559: length= 5, score=-5.00)
    547 EDDEEDED |QA MLV| DSEEAED KPEDSHHDDD EDEDEDRE
   E: 12(31.6%); D: 13(34.2%);

3) From 1063 to 1082: length= 20, score= 8.00 *
   (pocket at 1074 to 1076: length= 3, score=-3.00)
   1063 REREREQRER E |QQQ| RLRHDD
   E: 5(25.0%); D: 2(10.0%); R: 7(35.0%); Q: 4(20.0%);

# of segments (>=20 residues) exceeding M_0.05: 3

________________________________
High scoring uncharged segments:

score=  1.00 frequency= 0.800 ( LAGSVTIPNFQYHMCW )
score=  0.00 frequency= 0.000 ( BZX )
score= -8.00 frequency= 0.200 ( KEDR )

Expected score/letter: -0.800; Average information/letter: 0.133
Minimal length of displayed segments set to: 20
M_0.01= 57.49 (cv= 45.00, lambda= 0.17079, k= 0.08492, x= 12.50;
        90% confidence interval for segment length: 106 +- 74)
M_0.05= 47.95 (x= 2.95)

1) From 120 to 270: length=151, score=97.00 **
    120 LLNSSIVAAA ITLQQQNGSN LLANTNTPSP SPPLLSAEQQ QQLQSSLQQS
    170 GGVGGACLNP KLFFNHAQQM MMMEAAAAAA AAALQQQQQQ QSPLHSPANE
    220 VAIPTEQPAA TVATGAAAAA AAAATPIATG NVKSGSTTSN ANHTNSNNSH
    270 Q
   A: 32(21.2%); S: 16(10.6%); Q: 21(13.9%);

2) From 664 to 830: length=167, score=122.00 **
    664 AHHQQQQHHQ QQHQHQQQHH QQQHLHQQHH HHLQQQPNSG SNSNPASNDH
    714 HHGHHLHGHG LLHPSSAHHL HHQTTESNSN SSTPTAAGNN NGSNNSSSNT
    764 NANSTAQLAA SLASTLNGTK SLMQEDSNGL AAVAMAAHAQ HAAALGPGFL
    814 PGLPAFQFAA AQVAAGG
   A: 25(15.0%); S: 19(11.4%); Q: 25(15.0%); H: 28(16.8%);

3) From 1231 to 1318: length= 88, score=61.00 **
   (pocket at 1285 to 1286: length= 2, score=-16.00)
   1231 GGGTPAPPAP PSGPGTGAGA PPTAAPPTGG ASSNSAAPSP LSNSILPPAL
   1281 SSQG |EE| FAAT ASPLQRMASI TNSLITQPPV TPHHSTPQ
   A: 14(15.9%); G: 10(11.4%); S: 13(14.8%); T: 9(10.2%); P: 19(21.6%);

4) From 1462 to 1581: length=120, score=120.00 **
   1462 PAQAQHLMQQ MQAAAMSAAM QQQQVAQAQQ QAQQAQQAQQ HLQQQAQQHL
   1512 QQQQHLAQQQ HPHQQHHQAA AAAAALHHQS MLLTSPGLPP QHAISLPPSA
   1562 GGAQPGGPGG NQGSSNPSNS
   A: 24(20.0%); Q: 38(31.7%);

# of segments (>=20 residues) exceeding M_0.05: 4

C. CHARGE RUNS AND PATTERNS.

The table below shows the charge runs and patterns searched for (* stands for + or -) and the required minimum number of matches to the pattern allowing for at most 0 (lmin0), 1 (lmin1), or 2 (lmin2) mismatches or insertions/deletions (1% significance level). Occurrences are arranged in the order in which they appear in the sequence. For each run or pattern, the output displays its length (number of matches) and a triplet giving the number of mismatches, insertions, and deletions. 0-runs are further characterized by their composition (residues comprising more than 10% of the run). Run count statistics are compiled for runs of lengths at least 2/3 of the minimal significant length (lmin0); given are the number and locations of such runs.

pattern   (+)|  (-)|  (*)|  (0)| (+0)| (-0)| (*0)|(+00)|(-00)|(*00)|
lmin0      5 |   6 |   7 |  48 |  10 |  11 |  14 |  12 |  14 |  17 |
lmin1      6 |   7 |   9 |  57 |  12 |  14 |  17 |  14 |  17 |  21 |
lmin2      7 |   9 |  11 |  64 |  13 |  15 |  19 |  16 |  19 |  23 |

(0) 59(1,0,0); at 120- 179: see sequence above (1. quartile)
    L: 10 (16.7%); A: 6 (10.0%); S: 9 (15.0%); N: 6 (10.0%); Q: 10 (16.7%); ST: 12 (20.0%);

(-) 22(1,0,0); at 271- 293: DEEELDDEEEDEEEDEDEDDEEE (1. quartile)
                            ----0------------------

(+) 5(0,0,0); at 418- 422: KRKKK (1. quartile)
                           +++++

(-) 8(0,0,0); at 547- 554: EDDEEDED (2. quartile)
                           --------

(-) 9(0,0,0); at 574- 582: DDDEDEDED (2. quartile)
                           ---------

(0) 74(1,0,0); at 664- 738: see sequence above (2. quartile)
    Q: 20 (26.7%); H: 26 (34.7%);

(0) 73(2,0,0); at 713- 787: see sequence above (2. quartile)
    L: 8 (10.7%); A: 8 (10.7%); S: 14 (18.7%); T: 8 (10.7%); N: 11 (14.7%); H: 12 (16.0%); ST: 22 (29.3%);

(*) 10(1,0,0); at 1063-1073: REREREQRERE (2. quartile)
                             +-+-+-0+-+-

(0) 54(0,0,0); at 1231-1284: see sequence above (3. quartile)
    A: 10 (18.5%); G: 10 (18.5%); S: 9 (16.7%); P: 14 (25.9%); ST: 13 (24.1%);

(0) 122(1,0,0); at 1459-1581: see sequence above (3. quartile)
    A: 24 (19.5%); Q: 39 (31.7%);

(-) 7(0,0,0); at 2071-2077: EEDDDEE (4. quartile)
                            -------

Run count statistics:

  + runs >=  3:  3, at 418; 1701; 1981;
  - runs >=  4:  6, at 271; 276; 378; 547; 574; 2071;
  * runs >=  5:  6, at 276; 417; 547; 574; 1063; 2071;
  0 runs >= 32:  8, at 120; 382; 664; 740; 790; 1146; 1231; 1462;

--------------------------------------------------------------------------------

DISTRIBUTION OF OTHER AMINO ACID TYPES

Routinely, SAPS indicates high scoring hydrophobic and transmembrane segments. The display is as described above for high scoring charge segments. The scores for the hydrophobic segments correspond to a digitized hydropathy scale. The transmembrane scores were derived from target frequencies in putative transmembrane proteins (see the paper referred to above; note, however, that the scores used in the program have been rederived and differ from the ones given in the paper). With the -a command line flag, the user can invoke a similar analysis for other residue types. In view of the special role of cysteines for protein structure, the spacings of the cysteine residues in the sequence are displayed separately, with particular emphasis on close pairs of cysteines and distances between such pairs.

1. HIGH SCORING SEGMENTS.
__________________________________
High scoring hydrophobic segments:

 2.00 (LVIFM)   1.00 (AGYCW)   0.00 (BZX)   -2.00 (PH)   -4.00 (STNQ)   -8.00 (KEDR)

Expected score/letter: -2.443; Average information/letter: 1.035
Minimal length of displayed segments set to: 15
M_0.01= 20.26 (cv= 13.73, lambda= 0.55979, k= 0.38863, x= 6.53;
        90% confidence interval for segment length: 16 +- 8)
M_0.05= 17.35 (x= 3.62)

1) From 792 to 830: length= 39, score=19.00 *
   (pocket at 801 to 804: length= 4, score=-7.00)
    792 GLAAVAMAA |H AQH| AAALGPG FLPGLPAFQF AAAQVAAGG
   L: 4(10.3%); A: 15(38.5%); G: 6(15.4%);

2) From 1870 to 1891: length= 22, score=22.00 **
   1870 LAGIPGAAAA AGAAAAAAAV GA
   A: 14(63.6%); G: 4(18.2%);

# of segments (>=15 residues) exceeding M_0.05: 2

____________________________________
High scoring transmembrane segments:

 5.00 (LVIF)   2.00 (AGM)   0.00 (BZX)   -1.00 (YCW)   -2.00 (ST)   -6.00 (P)
-8.00 (H)   -10.00 (NQ)   -16.00 (KR)   -17.00 (ED)

Expected score/letter: -4.673; Average information/letter: 0.890
Minimal length of displayed segments set to: 15
M_0.01= 47.40 (cv= 33.02, lambda= 0.23276, k= 0.28611, x= 14.39;
        90% confidence interval for segment length: 18 +- 10)
M_0.05= 40.40 (x= 7.38); M_0.30= 32.07 (x= -0.95)

1) From 74 to 90: length= 17, score=39.00
     74 AAAAATAAAV AAVVAGA
   A: 12(70.6%); V: 3(17.6%);

2) From 1860 to 1894: length= 35, score=54.00 **
   (pocket at 1868 to 1869: length= 2, score=-18.00)
   1860 FAAAAILG |RS| LAGIPGAAAA AGAAAAAAAV GASGG
   A: 18(51.4%); G: 7(20.0%);

3) From 2000 to 2014: length= 15, score=35.00
   2000 SAVVAAVAAA AAAAA
   A: 11(73.3%); V: 3(20.0%);

# of segments (>=15 residues) exceeding M_0.30: 3

2. SPACINGS OF C.

H2N-175-C-181-C-23-C-246-C-320-C-85-C-173-C-813-C-151-COOH

--------------------------------------------------------------------------------

REPETITIVE STRUCTURES.
Repeats are indicated for two alphabets: the 20-letter amino acid alphabet, and a reduced 11-letter alphabet in which the major hydrophobics LVIF, the charged residues KR and ED, the small residues AG, the hydroxyl group residues ST, the amide group residues NQ, and the aromatics YW are treated as combined letters. For each alphabet, three classes of repeats are distinguished: separated repeats, simple tandem repeats, and periodic repeats. The separated repeats are largely non-overlapping. They are displayed in groups of matching blocks (exceeding a given core block length of contiguous exact matches) and intervening spacer distances (which may be negative, signifying a partial overlap). The core block length in case of the amino acid alphabet is set to 4 for sequences up to 500 residues, to 5 for sequences between 500 and 2000 residues, and to 6 for longer sequences (same values increased by 4 for the reduced alphabet). Simple tandem repeats are displayed in similar layout, but separately. Sequence segments that are highly repetitive with relatively short repeats are displayed as periodic repeats.

A. SEPARATED, TANDEM, AND PERIODIC REPEATS: amino acid alphabet.
Repeat core block length: 6

Aligned matching blocks:

[  99- 105]   EQQQQLQ
[ 157- 163]   EQQQQLQ

______________________________

[ 284- 289]   EDEDED
[ 577- 582]   EDEDED

______________________________

[ 403- 413]   TNNNNNNNSSS
[ 505- 514]   TNNNNNSQSS

with superset:
[ 389- 394]   TSNNNN
[ 403- 408]   TNNNNN
[ 505- 510]   TNNNNN

______________________________

[ 667- 690]   QQ_QQHHQQQHQHQQQHHQQQHLHQ
[1505-1529]   QQAQQHLQQQQHLAQQQHPHQQHHQ

with superset:
[ 204- 209]   QQQQQQ
[ 322- 327]   QQHQQQ
[ 669- 674]   QQHHQQ
[ 674- 679]   QQHQHQ
[ 680- 685]   QQHHQQ
[1147-1152]   QQHQQQ
[1500-1505]   QQHLQQ
[1508-1513]   QQHLQQ

and:
[ 669- 680]   QQH_HQQQHQHQQ
[ 674- 685]   QQHQHQQQH_HQQ
[ 680- 691]   QQH_HQQQHLHQQ
[1508-1520]   QQHLQQQQHLAQQ

and:
[ 669- 687]   QQH_HQQQHQHQQQHHQQQH
[ 674- 692]   QQH_QHQQQHHQQQHLHQQH
[1508-1527]   QQHLQQQQHLAQQQHPHQQH

______________________________

[         ]--------[ 914- 931]
[1362-1398]-( -32)-[1367-1384]
[1640-1676]-( -32)-[1645-1662]

[1362-1398]   FGESVLGLSQGSVSDLLARPKPWHMLTQKGREPFIRM
[1640-1676]   FGEAVLGLSQGSVSELLSKPKPWHMLSIKGREPFIRM

[ 914- 931]   LGLSQGTVSELLSKPKPW
[1367-1384]   LGLSQGSVSDLLARPKPW
[1645-1662]   LGLSQGSVSELLSKPKPW

______________________________

[1152-1161]   QQQAAQA_QAQ
[1490-1500]   QQQAQQAQQAQ

with superset:
[ 101- 106]   QQQLQQ
[ 204- 209]   QQQQQQ
[ 321- 326]   QQQHQQ
[1152-1157]   QQQAAQ
[1490-1495]   QQQAQQ
[1504-1509]   QQQAQQ

______________________________

Simple tandem repeats:

[ 668- 676]   Q__QQHHQQQH
[ 677- 687]   QHQQQHHQQQH
[ 688- 693]   LHQ_QHH

[1490-1503]   QQQAQQAQQAQQHL
[1504-1517]   QQQAQQHLQQQQHL
[1518-1526]   AQ__QQHPHQQ

B. SEPARATED AND TANDEM REPEATS: 11-letter reduced alphabet.
(i= LVIF; += KR; -= ED; s= AG; o= ST; n= NQ; a= YW; p= P; h= H; m= M; c= C)

Repeat core block length: 10

Aligned matching blocks:

[ 576- 587]   -------+-s_in
[2071-2083]   -------nosoin

with superset:
[ 286- 295]   --------ns
[ 547- 556]   --------ns
[ 576- 585]   -------+-s
[2071-2080]   -------nos

______________________________

[1359-1398]   n+iis-oiisionsoio-iis+p+pahmion+s+-pii+m
[1637-1676]   n+iis-siisionsoio-iio+p+pahmioi+s+-pii+m

with superset:
[ 913- 930]   iisionsoio-iio+p+p
[1366-1383]   iisionsoio-iis+p+p
[1644-1661]   iisionsoio-iio+p+p

--------------------------------------------------------------------------------

MULTIPLETS.

Multiplets refer to homooligopeptides of any length (e.g., A2, Q7, etc.); altplets refer to reiterations of two different residues (e.g., RG, EAEAEA, etc.). The multiplet composition of the protein sequence is evaluated for both the amino acid and the charge alphabet. (High) Aggregate altplet counts are evaluated only for the charge alphabet. The multiplet sequence is displayed whenever the total multiplet count of the sequence falls outside the expected range (i.e., more than 3 standard deviations from the mean). Also printed are the histogram of the spacings between consecutive multiplets (differences between starting positions) as well as clusters of multiplets (multiplet clusters are determined in the same way as charge clusters are determined; the binomial test is applied to a compressed sequence over the alphabet {M,S}, where M signifies a multiplet and S signifies a singlet; i.e., the amino acid sequence AADFFFGHRRT... is translated as MSMSSMS..., and the binomial cluster test is applied to the latter sequence). Multiplets and altplets of specific residue content that individually show an unusually high count are indicated, and the positions of all multiplets of at least 5 residues are shown.

A. AMINO ACID ALPHABET.

1. Total number of amino acid multiplets: 247 (Expected range: 108--178) high

   1 .......AA. .......... .....FFKK.
...LL....Q Q......... .......... 61 .......... ...AAAAA.A AA.AAVV... .........Q QQQ.QQ...G G........L 121 L.SS..AAA. ..QQQ....L L......... PPLL...QQQ Q..SS.QQ.G G.GG...... 181 .FF...QQMM MM.AAAAAAA AA.QQQQQQQ .......... .......AA. ....AAAAAA 241 AAA....... .....TT... .....NN... .EEE.DDEEE .EEE....DD EEE....... 301 .DD.....QQ ........TT QQQ.QQQ... ..EE...... ......NN.. TT..NN.... 361 .NNNGG.... ....SS..DD D..NNN.... NNNN...... ...NNNNNNN SSS......K 421 KKNNNNNN.. ...LLAA... ....LL.... .......... ......EE.. ......II.. 481 ......QQ.. .......... AAA..NNNNN ..SS.NNKK. ..AA...... SS........ 541 .......DDE E......... .EE....... .HHDDD.... ........TT T.......KK 601 .......... .......AAA AAAAAAAA.. ........LL .......AA. .......... 661 ....HHQQQQ HHQQQ...QQ QHHQQQ...Q QHHHH.QQQ. .......... ..HHH.HH.. 721 ...LL..SS. HH.HH.TT.. ...SS...AA .NNN..NNSS S......... .AA....... 781 .......... ...AA...AA ....AAA... .......... .AAA..AAGG .......... 841 .....PP... .......... .......... .......... .......... ..RR...LL. 901 .......... .......... ...LL..... .......... .......... DD....LL.. 961 ...KK..... .......GG. GGDD...... ........SS ....SS.... ....RR..GG 1021 .......... .PP......F F......... .......AAA QQ........ ...QQQ.... 1081 DD........ .......... .......... ...GGAA... .AA....... .......... 1141 ......QQ.Q QQQQAA.... .....SS... .QQQ...... ....AAAAA. ........SS 1201 ......SSS. ......DD.. .SS....... GGG...PP.P P......... PP.AAPP.GG 1261 .SS..AA... ......PP.. SS..EE.AA. .......... .......PP. ..HH...... 1321 .....PP..Q QQ....NN.. .....RR... .......... .......... ......LL.. 1381 .......... .......... .......... .......... .......... .......... 1441 ....AA.... ..MM...... .........Q Q..AAA..AA .QQQQ....Q QQ.QQ.QQ.Q 1501 Q..QQQ.QQ. .QQQQ...QQ Q...QQHH.A AAAAAA.HH. ..LL.....P P......PP. 1561 .GG...GG.G G...SS.... ..KK...... .......... .......... AA........ 1621 ..TT.....L L.NN...... .......... ....LL.... .......... .......... 1681 ...NN..... LL...RR... .RRR.....Q Q..SS..SS. .......SS. 
.........G
1741  G.PP.KK... ...EE..... .......... .......... .......... ..........
1801  .QQ....... .......... .......... ...LLQQ.LL .......... ......PP..
1861  AAAA...... .....AAAAA .AAAAAAA.. ..GG...... .......... ..........
1921  ...DD..DD. ...GGG.... .......... ...TT..... ......AAAA A.......SS
1981  RR..AA.... .......... .VVAA.AAAA AAAA....II .......... ...DD.....
2041  ...GG..... .......... .......... EEDDDEE... .......... ..........
2101  EE.......R R.SS.GGGSS EE......PP PPPPAASSS. .....TTSSS SSS..SSS..
2161  ..TTAAA.AA A....

2. Histogram of spacings between consecutive amino acid multiplets:
   (1-5) 148   (6-10) 49   (11-20) 33   (>=21) 18

3. Clusters of amino acid multiplets (cmin = 17/30 or 22/45 or 27/60): none

4. Significant specific amino acid multiplet counts:

Letter   Count     %    Observed (Critical number)

G         127     5.8    19 (18)
   at  110 (l= 2)   170 (l= 2)   173 (l= 2)   365 (l= 2)   829 (l= 2)
       978 (l= 2)   981 (l= 2)  1019 (l= 2)  1114 (l= 2)  1231 (l= 3)
      1259 (l= 2)  1562 (l= 2)  1567 (l= 2)  1570 (l= 2)  1740 (l= 2)
      1893 (l= 2)  1934 (l= 3)  2044 (l= 2)  2116 (l= 3)

Q         203     9.3    38 (34)
   at   40 (l= 2)   100 (l= 4)   105 (l= 2)   133 (l= 3)   158 (l= 4)
       167 (l= 2)   187 (l= 2)   204 (l= 7)   309 (l= 2)   321 (l= 3)
       325 (l= 3)   487 (l= 2)   667 (l= 4)   673 (l= 3)   679 (l= 3)
       684 (l= 3)   690 (l= 2)   697 (l= 3)  1061 (l= 2)  1074 (l= 3)
      1147 (l= 2)  1150 (l= 5)  1172 (l= 3)  1330 (l= 3)  1470 (l= 2)
      1482 (l= 4)  1490 (l= 3)  1494 (l= 2)  1497 (l= 2)  1500 (l= 2)
      1504 (l= 3)  1508 (l= 2)  1512 (l= 4)  1519 (l= 3)  1525 (l= 2)
      1710 (l= 2)  1802 (l= 2)  1836 (l= 2)

H          85     3.9    12 (11)
   at  572 (l= 2)   665 (l= 2)   671 (l= 2)   682 (l= 2)   692 (l= 4)
       713 (l= 3)   717 (l= 2)   731 (l= 2)   734 (l= 2)  1313 (l= 2)
      1527 (l= 2)  1538 (l= 2)

5. Long amino acid multiplets (>= 5; Letter/Length/Position):
   A/5/74     A/9/194    Q/7/204    A/9/235    N/7/404    N/6/423
   N/5/506    A/11/618   Q/5/1150   A/5/1185   A/7/1530   A/5/1876
   A/7/1882   A/5/1967   A/8/2007   P/6/2129   S/6/2148

B. CHARGE ALPHABET.

1. Total number of charge multiplets: 56 (Expected range: 21-- 60)
   19 +plets (f+: 7.9%), 37 -plets (f-: 12.1%)
   Total number of charge altplets: 49 (Critical number: 62)

2. Histogram of spacings between consecutive charge multiplets:
   (1-5) 13   (6-10) 8   (11-20) 6   (>=21) 30

3. Long charge multiplets (>= 5; Letter/Length/Position):
   -/18/276   +/5/418   -/8/547   -/9/574   -/7/2071

--------------------------------------------------------------------------------
PERIODICITY ANALYSIS.

The program identifies periodic elements of periods between 1 and 10 for
the amino acid alphabet, for the charge alphabet, and for a hydrophobicity
alphabet. Each periodic element consists of an error-free core pattern (of
length at least 4 for the amino acid alphabet, 5 for the charge alphabet,
and 6 for the hydrophobicity alphabet) which is extended allowing for
errors. The numbers of errors are given for each position in the consensus
of a periodic pattern involving more than one letter. The displayed
periodic patterns would generally not be statistically significant but are
listed for the sake of a general interactive appraisal of the sequence.
Periodicities of exceptionally high copy number are indicated with a
!-mark.

A. AMINO ACID ALPHABET (core: 4; !-core: 6)

Location   Period  Element   Copies  Core    Errors
  74-  85       1  A             10     5    2
  74-  93       2  A.             9     6 !  1
 100- 106       1  Q              6     4    1
 189- 192       1  M              4     4    0
 194- 202       1  A              9     9 !  0
 204- 210       1  Q              7     7 !  0
 235- 243       1  A              9     9 !  0
 274- 289       4  E.ED           4     4    /0/./1/0/
 321- 328       2  Q.             4     4    0
 344- 367       6  N.....         4     4    0
 388- 432       9  N........      5     5    0
 391- 394       1  N              4     4    0
 402- 411       2  N.             5     5    0
 423- 428       1  N              6     6 !  0
 506- 510       1  N              5     5    0
 566- 585       4  D...           5     5    0
 616- 631       2  A.             8     8 !  0
 622- 653       8  A.......       4     4    0
 667- 670       1  Q              4     4    0
 669- 688       4  Q...           5     5    0
 670- 704       7  Q......        5     5    0
 692- 695       1  H              4     4    0
 714- 737       6  H.....         4     4    0
 754- 769       4  N...           4     4    0
1062-1089       7  QR.R...        4     4    /0/1/./1/./././
1147-1154       1  Q              7     5    1
1185-1189       1  A              5     5    0
1235-1246       3  P..            4     4    0
1243-1250       2  G.             4     4    0
1301-1325       5  T....          5     5    0
1464-1517       6  Q..Q..         8     6 !  /1/././3/././
1482-1485       1  Q              4     4    0
1482-1523       3  Q..           12     7 !  2
1482-1515       2  Q.            13     4    4
1497-1532       4  Q...           8     5    1
1512-1515       1  Q              4     4    0
1530-1536       1  A              7     7 !  0
1729-1760       8  S.......       4     4    0
1861-1864       1  A              4     4    0
1876-1888       1  A             12     7 !  1
1965-1972       2  A.             4     4    0
2004-2014       1  A             10     8 !  1
2129-2134       1  P              6     6 !  0
2148-2153       1  S              6     6 !  0

B. CHARGE ALPHABET ({+= KR; -= ED; 0}; core: 5; !-core: 7)
   and HYDROPHOBICITY ALPHABET ({*= KRED; i= LVIF; 0}; core: 6; !-core: 9)

Location   Period  Element   Copies  Core    Errors
 271- 293       1  -             22    18 !  1
 272- 311       5  --...          7     5    /1/2/./././
 328- 342       3  -0.            5     5    0
 418- 422       1  +              5     5    0
 547- 554       1  -              8     8 !  0
 550- 585       4  -...           8     6    1
 574- 584       1  *             11    11 !  0
1063-1073       1  *             10     6    1
2071-2077       1  -              7     7 !  0

--------------------------------------------------------------------------------
SPACING ANALYSIS.

The spacings between consecutive residues of the same type (all 20 amino
acids, + and - charge, and combined charge *) are evaluated for
significantly large or small maximal and minimal spacings. The output is
ordered by the beginning point of the significant spacing. Entries are
identified by the residue type, spacing (number of amino acids between the
identified positions), rank of the displayed spacing (e.g., 50 alanines in
the sequence induce 51 spacings, ranked by decreasing length from 1 to 51),
and p-value (probability of exceeding the displayed spacing). A maximal
spacing with p-value 0.01 or less is considered significantly large; a
maximal spacing with p-value 0.99 or larger is considered significantly
small. Similarly, a minimal spacing with p-value 0.99 or larger is
considered significantly small, and a minimal spacing with p-value 0.01 or
less is considered significantly large (excluding doublets). If the first
maximal spacing (rank 1) of a residue is significantly large or small, then
also the second maximal spacing (rank 2) is evaluated.
Large maximal and small minimal spacings indicate clustering effects,
whereas small maximal and large minimal spacings indicate excessive
evenness in the distribution of the residues.

Location   (Quartile) Spacing    Rank       P-value  Interpretation
  33- 836  (1.)       Y( 803)Y   1 of  23   0.0009   large 1. maximal spacing
  42- 313  (1.)       R( 271)R   1 of  84   0.0011   large 1. maximal spacing
 118- 271  (1.)       D( 153)D   2 of 115   0.0001   large 2. maximal spacing
 183- 636  (1.)       F( 453)F   1 of  40   0.0040   large 1. maximal spacing
 429- 655  (1.)       G( 226)G   1 of 128   0.0001   large 1. maximal spacing
 643- 833  (2.)       R( 190)R   2 of  84   0.0003   large 2. maximal spacing
 653- 783  (2.)       +( 130)+   1 of 172   0.0029   large 1. maximal spacing
1230-1285  (3.)       *(  55)*   2 of 436   0.0000   large 2. maximal spacing
1401-1640  (3.)       F( 239)F   2 of  40   0.0415   large 2. maximal spacing
1404-1616  (3.)       D( 212)D   1 of 115   0.0007   large 1. maximal spacing
1406-1572  (3.)       N( 166)N   1 of 145   0.0010   large 1. maximal spacing
1429-1608  (3.)       Y( 179)Y   2 of  23   0.9787   small 2. maximal spacing
1438-1548  (3.)       G( 110)G   2 of 128   0.0066   large 2. maximal spacing
1458-1583  (3.)       +( 125)+   2 of 172   0.0000   large 2. maximal spacing
1461-1582  (3.)       *( 121)*   1 of 436   0.0000   large 1. maximal spacing
1461-1582  (3.)       -( 121)-   1 of 265   0.0000   large 1. maximal spacing
1813-1901  (4.)       N(  88)N   2 of 145   0.0318   large 2. maximal spacing
1958-2015  (4.)       -(  57)-   2 of 265   0.0084   large 2. maximal spacing
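
As a rough illustration of the ranking step in the spacing analysis (a
sketch under stated assumptions, not the SAPS implementation), the
following Python code lists the spacings for one residue class and ranks
them by decreasing length. The function name ranked_spacings, the 1-based
positions, and the treatment of the two sequence ends as virtual
occurrences at positions 0 and len(seq) + 1 are assumptions made here so
that n occurrences give n + 1 spacings. A spacing is taken as the
difference between the two positions, which agrees with the displayed
table (e.g., 33- 836 is listed with spacing 803). The p-values are not
computed; they require the methods of the cited paper.

```python
from typing import List, Tuple


def ranked_spacings(seq: str, residues: str) -> List[Tuple[int, int, int]]:
    """Return (rank, start, spacing) tuples, largest spacing first.

    `residues` is the set of letters forming one residue class, e.g. "Y"
    or "KR". Positions are 1-based. The sequence ends are treated as
    virtual occurrences at 0 and len(seq) + 1 (an assumption of this
    sketch), so n occurrences yield n + 1 spacings.
    """
    # 1-based positions of every residue belonging to the class
    hits = [i for i, aa in enumerate(seq, start=1) if aa in residues]
    # add the two virtual end points
    marks = [0] + hits + [len(seq) + 1]
    # spacing = difference between consecutive marked positions
    gaps = [(b - a, a) for a, b in zip(marks, marks[1:])]
    # rank by decreasing spacing; ties are broken by start position
    gaps.sort(key=lambda g: (-g[0], g[1]))
    return [(rank, start, size)
            for rank, (size, start) in enumerate(gaps, start=1)]


if __name__ == "__main__":
    # toy sequence, not taken from the report
    for rank, start, size in ranked_spacings("AKTTTTKAAK", "K"):
        print(rank, start, size)
```

For the charge classes, the letter sets given in the periodicity section
can be passed directly: "KR" for +, "ED" for -, and "KRED" for *.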