Fundamentals of Sequence Analysis, 1995-1996 Problem set 3: Databases and Database searching If you get stuck, refer to the OpenVMS and GCG resources in the class home page. References: S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman (1990) Basic Local Alignment Search Tool, J. Mol. Biol. 215:403-410 W.R. Pearson and D.J. Lipman (1988) Improved Tools for biological sequence comparison, PNAS 85:2444-2448 M. Gribskov, M. Homyak, J. Edenfield, D. Eisenberg (1988) Profile scanning for 3-dimensional structural patterns in protein sequences. CABIOS 4(1):61-66 Problem group 1. Databases Here is the recent history of the Genbank database: Release Date Base Pairs Entries 88 Apr 95 286094556 352414 89 Jun 95 318624568 425211 90 Aug 95 353713490 492483 91 Oct 95 384939485 555694 92 Dec 95 425860958 620765 1A. Assuming that this growth rate is steady, estimate the size of the next Genbank release (93), in base pairs. 1B. What is the trend for the average size of a Genbank entry? Why? Problem group 2. Sequence database searching with query sequences For this exercise use PEPCORRUPT to create these files derived from Pir1:A1HU (which is 353 amino acids in length) Substitutions Indels Average subs/residue A1HU_150.pep 525 0 1.5 A1HU_175.pep 613 0 1.75 BLAST each of these against the Swiss Protein database using the default settings. FASTA each of these against the local SwissProtein database using a wordsize of 1, with the standard matrix (Dayhoff) and the Blosum62 matrix. (Hint - you want to use commands like these to run these jobs in the batch queues: $ BLAST/infile1=whatever/infile2=SwissProt/list=10/segment=10/default/batch $ FASTA/infile1=whatever/infile2=SWISS:*/word=1/default/batch $ FASTA/infile1=whatever/infile2=SWISS:*/word=1 - /data=genmoredata:blosum62.cmp/default/batch ) It may take a while for these calculations to complete - use $ show entry to see what is still running. 2A. Summarize the results. When PIR1:A1HU is searched through the GB_EST database divisions using FRAMESEARCH (don't try it - it takes 29 hours!) the six highest scoring results are: Gb_Est1:H25698 H25698 yl54g08.r1 Homo sapiens cDNA clone 162110 5'... 638.0 Gb_Est1:H39741 H39741 yo53d04.r1 Homo sapiens cDNA clone 181639 5'... 638.0 Gb_Est2:T28687 T28687 EST51971 Homo sapiens cDNA 5' end similar to... 635.0 Gb_Est1:H45310 H45310 yo65b01.r1 Homo sapiens cDNA clone 182761 5'... 624.0 Gb_Est1:H39698 H39698 yo52e05.r1 Homo sapiens cDNA clone 181568 5'... 613.0 Gb_Est2:R83490 R83490 yp15b06.r1 Homo sapiens cDNA clone 187475 5'... 613.0 Create a listfile (list.fil) containing these entries and search A1HU against just these six entries: $ framesearch/infile1=pir1:email@example.com/default 2B. How many of the six contain at least one frameshift? What does that suggest to you about EST sequences? Problem group 3. Patterns, Profiles, and Databases The PROSITE database is available in both pattern form (search with MOTIFS) and profile form (search with PROFILESCAN). Run both on the PIR1:A1HU sequence and you will find a hit only against the "ig_mhc" motif, this will be in the output .motifs and .scan files - you will need to search the latter for ig_mhc to find the hit. 3A. Compare the results of the two methods. 3B. Use FINDPATTERNS to search the SwissProtein database for the Ig_Mhc pattern: (F,Y)xCx(V,A)xH Then use PROFILESEARCH with profiledir:ig_mhc.prf to search the same database. Use /BATCH on both. What do you see? Problem group 4. Finding database entries by documentation 4A. The GCG program LOOKUP has indices into the various databases. Use it to find all entries that have: Organism: Drosophila AllText: Immunoglobulin AND AllText: Neural 4B. There are other databases around which are not indexed by our local copy of Lookup, for instance, the medline database. You can use any Web client to use the NCBI entrez server, look at http://www3.ncbi.nlm.nih.gov/Entrez/ Use the Entrez server to reproduce the search above looking only at the protein database. Then repeat the search on the medline database.