Fundamentals of Sequence Analysis, 1998-1999
Problem set 3:  Databases and Database searching

If you get stuck, refer to the OpenVMS and GCG resources in the 
class home page.

 S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman (1990)
  Basic Local Alignment Search Tool, J. Mol. Biol. 215:403-410

 S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, and
  D.J. Lipman (1997) Gapped BLAST and PSI-BLAST: a new generation
  of protein database search programs. Nuc. Acids Res 25:3389-3402

 M. Gribskov, M. Homyak, J. Edenfield, D. Eisenberg (1988)
  Profile scanning for 3-dimensional structural patterns in protein     
  sequences. CABIOS  4(1):61-66

 W.R. Pearson and D.J. Lipman (1988) Improved Tools for biological
  sequence comparison, PNAS 85:2444-2448

 W.R. Pearson (1995) Comparison of methods for searching protein
   sequence databases. Prot. Science 4:1145-1160

 W.R. Pearson (1998) Empirical Statistical Estimates for Sequence 
  Similarity Searches. J. Mol. Biol. 276:71-84

Problem group 1.  Databases

Here is the recent history of the Genbank database:

Release    Date     Base Pairs   Entries

   106   Apr 98     1502542306    2209232
   107   Jun 98     1622041465    2355928
   108   Aug 98     1797137713    2532359
   109   Oct 98     2008761784    2837897
   110   Dec 98     2162067871    3043729

1.  Assuming that this growth rate is steady, estimate the size of the
next Genbank release (111), in base pairs.

Genbank is in exponential growth

1 1502542306    2209232    680     9.177
2 1622041465    2355928    688     9.210
3 1797137713    2532359    710     9.255
4 2008761784    2837897    708     9.303
5 2162067871    3043729    710     9.335

Rate of growth (roughly) (9.335 - 9.177)/4.0 = .0395/release
(You could curve fit this, but it won't change things much)
Next release = 9.335 + .0395 = 9.3745 -> 2368645132 bases

Problem group 2.  Sequence database searching with query sequences

For this exercise use PEPCORRUPT to create these files derived from
Pir1:A1HU (which is 353 amino acids in length)

                    Substitutions   Indels  Average subs/residue
  A1HU_150.pep      525             0       1.5
  A1HU_175.pep      613             0       1.75
BLAST each of these against the Swiss Protein database using the default
settings. FASTA each of these against the local SwissProtein database
using a wordsize of 1, with the standard matrix (Dayhoff) and the Blosum62

This will take a while, so run it in a batch queue using, which contains:

$ blast_program="blastp"
$ blast_datalib="swissprot"
$ blast_descriptions=50
$ blast_alignments=0
$ blast2 a1hu_150.pep
$ blast2 a1hu_175.pep
$! Use Pearson's FASTA, not the GCG one
$ fasta3   "-QO" a1hu_150_fasta_2.out   -b 20 -d 1 a1hu_150.pep "c"
$ delete .;*
$ fasta3   "-QO" a1hu_175_fasta_2.out   -b 20 -d 1 a1hu_175.pep "c"
$ delete .;*
$ fasta3   "-QO" a1hu_150_fasta_1.out   -b 20 -d 1 a1hu_150.pep "c" 1
$ delete .;*
$ fasta3   "-QO" a1hu_175_fasta_1.out   -b 20 -d 1 a1hu_175.pep "c" 1
$ delete .;*

Use the following command to start the job only AFTER you verify
that you did, in fact, create the two .PEP files!!!

$ submit/log/noprint class:db_search_test

It may take a while for these calculations to complete - use 

   $ show entry 

to see what is still running.

2A.  Summarize the results.

Your results may vary, since the substitutions produced are random.  The
results I had were:

E <= .1  Program   Query  Tuple  False Positives
0        BLAST2    150    -       0
1        FASTA3    150    2       0
3        FASTA3    150    1       0
3        BLAST2    175    -       0
4        FASTA3    175    2       1 (Q03401)
4        FASTA3    175    1       1 (Q03401)

This somewhat mysterious result arises because both of the remaining
sequences have only about 21% identity remaining, and by chance the
175 query had a region with slightly better conservation than was
present in the 150 query.  This was confirmed by running GAP with a huge
gap weight between the original and corrupted sequences, which forces an 
ungapped alignment. The 175 query had a Quality score of 122.8,but the 150
query had a Quality score of only 120.4. 

A search through the EST database divisions found these six high
scoring regions: 

Gb_Est1:H25698  H25698 yl54g08.r1 Homo sapiens cDNA clone 162110 5'...  638.0
Gb_Est1:H39741  H39741 yo53d04.r1 Homo sapiens cDNA clone 181639 5'...  638.0
Gb_Est2:T28687  T28687 EST51971 Homo sapiens cDNA 5' end similar to...  635.0
Gb_Est1:H45310  H45310 yo65b01.r1 Homo sapiens cDNA clone 182761 5'...  624.0
Gb_Est1:H39698  H39698 yo52e05.r1 Homo sapiens cDNA clone 181568 5'...  613.0
Gb_Est2:R83490  R83490 yp15b06.r1 Homo sapiens cDNA clone 187475 5'...  613.0

Using A1HU as a query, search through these six EST sequence files with

  $ framesearch/infile1=pir1:a1hu/infile2=@class:est6.fil-

2B. How many of the six contain at least one frameshift?  What does that
suggest to you about EST sequences?

Five of the six have frameshifts.

ESTs contain a lot of garbage and should be handled with care.  Always
check any EST hits for frameshifts before doing further work.  Never
translate an EST without checking this!

Problem group 3.  Patterns, Profiles, and  Databases

The PROSITE database is available in both pattern form (search with MOTIFS)
and profile form (search with PROFILESCAN).  Run both on the PIR1:A1HU
sequence and you will find a hit only against the "ig_mhc" motif, this will
be in the output .motifs and .scan files - you will need to search the 
latter for ig_mhc to find the hit.

3A.  Compare the results of the two methods.

The pattern method retrieves this region:

Ig_Mhc                (F,Y)xCx(V,A)xH
           311: KKGDT     FSCMVGH     EALPL

whereas the profile method retrieves a much larger region:

                                 *******  <- piece found by Motifs
           .  :|.:.|:| |::::|..:::|:|.|:|::  |: .. :::: .:.

S    336 PTHVNVS 342
         :.  .
P     50 SSGKSPS 56

The take home lesson is that most protein sequence motifs are better
described by their profiles than by their simple pattern.  Or put another 
way - "consensus" sequences are often a bad approximation for the true
spectrum of allowed sequences.

3B.  Use FINDPATTERNS to search the SwissProtein database for
     the Ig_Mhc pattern:               (F,Y)xCx(V,A)xH
     Then use PROFILESEARCH with profiledir:ig_mhc.prf to search the same 
     database.  Use /BATCH on both.  What do you see?

The commands would have been:

 $ findpatterns/infile=sw:*/pattern="(F,Y)xCx(V,A)xH"/batch/default
 $ profilesearch/infile1=profiledir:ig_mhc.prf/infile2=sw:*/batch/default

In this case, the results are roughly equivalent. If you cut off the
profilesearch at a Z value of about 4, which is close to the last
recognizable Ig-Mhc protein in the list, one finds 443 entries (Z values
are (observed - mean)/ standard deviation). Conversely, findpatterns found
434 sequences, which is pretty similar.  Looking more closely at what is 

findpatterns   profilesearch     what
218            224               "histoc"  (for HISTOCompatability)
 64             71               " ig"     (various Ig entries)
  1              1               "immunog" (for immunoglobulin)
151            147               everything else
434            443               Sum

Problem group 4.  Finding database entries by documentation

4A.  Use Entrez at to search the Protein databases for:
     Organism:    Drosophila
     All Fields:  Immunoglobulin
     All Fields:  Neural

    How many entries do you find?

There were 11 when I did it, the number may change at any time as
new entries are added.

4B.  Repeat the search on one of the SRS servers.  How many entries 
     did you find?

It depends on how many databases you selected, and which SRS server
you used.  Some have more databases than others.  On the server
in the Netherlands, I found 3 entries which matched, two in SwissProt
and one in Pir.