Fundamentals of Sequence Analysis, 1998-1999 Problem set 3: Databases and Database searching If you get stuck, refer to the OpenVMS and GCG resources in the class home page. References: S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman (1990) Basic Local Alignment Search Tool, J. Mol. Biol. 215:403-410 S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, and D.J. Lipman (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nuc. Acids Res 25:3389-3402 M. Gribskov, M. Homyak, J. Edenfield, D. Eisenberg (1988) Profile scanning for 3-dimensional structural patterns in protein sequences. CABIOS 4(1):61-66 W.R. Pearson and D.J. Lipman (1988) Improved Tools for biological sequence comparison, PNAS 85:2444-2448 W.R. Pearson (1995) Comparison of methods for searching protein sequence databases. Prot. Science 4:1145-1160 W.R. Pearson (1998) Empirical Statistical Estimates for Sequence Similarity Searches. J. Mol. Biol. 276:71-84 Problem group 1. Databases Here is the recent history of the Genbank database: Release Date Base Pairs Entries 106 Apr 98 1502542306 2209232 107 Jun 98 1622041465 2355928 108 Aug 98 1797137713 2532359 109 Oct 98 2008761784 2837897 110 Dec 98 2162067871 3043729 1. Assuming that this growth rate is steady, estimate the size of the next Genbank release (111), in base pairs. Genbank is in exponential growth 1 1502542306 2209232 680 9.177 2 1622041465 2355928 688 9.210 3 1797137713 2532359 710 9.255 4 2008761784 2837897 708 9.303 5 2162067871 3043729 710 9.335 Rate of growth (roughly) (9.335 - 9.177)/4.0 = .0395/release (You could curve fit this, but it won't change things much) Next release = 9.335 + .0395 = 9.3745 -> 2368645132 bases Problem group 2. Sequence database searching with query sequences For this exercise use PEPCORRUPT to create these files derived from Pir1:A1HU (which is 353 amino acids in length) Substitutions Indels Average subs/residue A1HU_150.pep 525 0 1.5 A1HU_175.pep 613 0 1.75 BLAST each of these against the Swiss Protein database using the default settings. FASTA each of these against the local SwissProtein database using a wordsize of 1, with the standard matrix (Dayhoff) and the Blosum62 matrix. This will take a while, so run it in a batch queue using CLASS:db_search_test.com, which contains: $ blast_program="blastp" $ blast_datalib="swissprot" $ blast_descriptions=50 $ blast_alignments=0 $ blast2 a1hu_150.pep $ blast2 a1hu_175.pep $! $! Use Pearson's FASTA, not the GCG one $! $ fasta3 "-QO" a1hu_150_fasta_2.out -b 20 -d 1 a1hu_150.pep "c" $ delete .;* $ fasta3 "-QO" a1hu_175_fasta_2.out -b 20 -d 1 a1hu_175.pep "c" $ delete .;* $ fasta3 "-QO" a1hu_150_fasta_1.out -b 20 -d 1 a1hu_150.pep "c" 1 $ delete .;* $ fasta3 "-QO" a1hu_175_fasta_1.out -b 20 -d 1 a1hu_175.pep "c" 1 $ delete .;* Use the following command to start the job only AFTER you verify that you did, in fact, create the two .PEP files!!! $ submit/log/noprint class:db_search_test It may take a while for these calculations to complete - use $ show entry to see what is still running. 2A. Summarize the results. Your results may vary, since the substitutions produced are random. The results I had were: E <= .1 Program Query Tuple False Positives 0 BLAST2 150 - 0 1 FASTA3 150 2 0 3 FASTA3 150 1 0 3 BLAST2 175 - 0 4 FASTA3 175 2 1 (Q03401) 4 FASTA3 175 1 1 (Q03401) This somewhat mysterious result arises because both of the remaining sequences have only about 21% identity remaining, and by chance the 175 query had a region with slightly better conservation than was present in the 150 query. This was confirmed by running GAP with a huge gap weight between the original and corrupted sequences, which forces an ungapped alignment. The 175 query had a Quality score of 122.8,but the 150 query had a Quality score of only 120.4. A search through the EST database divisions found these six high scoring regions: Gb_Est1:H25698 H25698 yl54g08.r1 Homo sapiens cDNA clone 162110 5'... 638.0 Gb_Est1:H39741 H39741 yo53d04.r1 Homo sapiens cDNA clone 181639 5'... 638.0 Gb_Est2:T28687 T28687 EST51971 Homo sapiens cDNA 5' end similar to... 635.0 Gb_Est1:H45310 H45310 yo65b01.r1 Homo sapiens cDNA clone 182761 5'... 624.0 Gb_Est1:H39698 H39698 yo52e05.r1 Homo sapiens cDNA clone 181568 5'... 613.0 Gb_Est2:R83490 R83490 yp15b06.r1 Homo sapiens cDNA clone 187475 5'... 613.0 Using A1HU as a query, search through these six EST sequence files with FRAMESEARCH: $ framesearch/infile1=pir1:a1hu/infile2=@class:est6.fil- /out=est6.txt/default 2B. How many of the six contain at least one frameshift? What does that suggest to you about EST sequences? Five of the six have frameshifts. ESTs contain a lot of garbage and should be handled with care. Always check any EST hits for frameshifts before doing further work. Never translate an EST without checking this! Problem group 3. Patterns, Profiles, and Databases The PROSITE database is available in both pattern form (search with MOTIFS) and profile form (search with PROFILESCAN). Run both on the PIR1:A1HU sequence and you will find a hit only against the "ig_mhc" motif, this will be in the output .motifs and .scan files - you will need to search the latter for ig_mhc to find the hit. 3A. Compare the results of the two methods. The pattern method retrieves this region: Ig_Mhc (F,Y)xCx(V,A)xH (F)xCx(V)xH 311: KKGDT FSCMVGH EALPL whereas the profile method retrieves a much larger region: ******* <- piece found by Motifs 286 SQGTTTFAVTSILRVAAEDWKKGDTFSCMVGHEALPLAFTQKTIDRLAGK 335 . :|.:.|:| |::::|..:::|:|.|:|:: |: .. :::: .:. P 1 PTGDGTYSVFSVLTVTASDWKSGDTYTCRVTHEGSNLP.EPVLVSRTPSE 49 S 336 PTHVNVS 342 :. . P 50 SSGKSPS 56 The take home lesson is that most protein sequence motifs are better described by their profiles than by their simple pattern. Or put another way - "consensus" sequences are often a bad approximation for the true spectrum of allowed sequences. 3B. Use FINDPATTERNS to search the SwissProtein database for the Ig_Mhc pattern: (F,Y)xCx(V,A)xH Then use PROFILESEARCH with profiledir:ig_mhc.prf to search the same database. Use /BATCH on both. What do you see? The commands would have been: $ findpatterns/infile=sw:*/pattern="(F,Y)xCx(V,A)xH"/batch/default $ profilesearch/infile1=profiledir:ig_mhc.prf/infile2=sw:*/batch/default In this case, the results are roughly equivalent. If you cut off the profilesearch at a Z value of about 4, which is close to the last recognizable Ig-Mhc protein in the list, one finds 443 entries (Z values are (observed - mean)/ standard deviation). Conversely, findpatterns found 434 sequences, which is pretty similar. Looking more closely at what is found: findpatterns profilesearch what 218 224 "histoc" (for HISTOCompatability) 64 71 " ig" (various Ig entries) 1 1 "immunog" (for immunoglobulin) 151 147 everything else ---------------------------------------------------------------- 434 443 Sum Problem group 4. Finding database entries by documentation 4A. Use Entrez at http://www.ncbi.nlm.nih.gov/Entrez/ to search the Protein databases for: Organism: Drosophila All Fields: Immunoglobulin All Fields: Neural How many entries do you find? There were 11 when I did it, the number may change at any time as new entries are added. 4B. Repeat the search on one of the SRS servers. How many entries did you find? It depends on how many databases you selected, and which SRS server you used. Some have more databases than others. On the server in the Netherlands, I found 3 entries which matched, two in SwissProt and one in Pir.