Fundamentals of Sequence Analysis, 1995-1996
Problem set 3:  Databases and Database searching

If you get stuck, refer to the OpenVMS and GCG resources in the 
class home page.

References:
 
 S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman (1990)
  Basic Local Alignment Search Tool, J. Mol. Biol. 215:403-410

 W.R. Pearson and D.J. Lipman (1988) Improved Tools for biological sequence
  comparison, PNAS 85:2444-2448

 M. Gribskov, M. Homyak, J. Edenfield, D. Eisenberg (1988)
 Profile scanning for 3-dimensional structural patterns in protein     
 sequences. CABIOS  4(1):61-66


Problem group 1.  Databases

Here is the recent history of the Genbank database:

Release    Date     Base Pairs   Entries

   88    Apr 95      286094556    352414
   89    Jun 95      318624568    425211
   90    Aug 95      353713490    492483
   91    Oct 95      384939485    555694
   92    Dec 95      425860958    620765

1A.  Assuming that this growth rate is steady, estimate the size of the
     next Genbank release (93), in base pairs.
1B.  What is the trend for the average size of a Genbank entry?  Why?


Genbank is in exponential growth
#   size       entries average log(size)
1 286094556    352414      812      8.46
2 318624568    425211      749      8.50
3 353713490    492483      718      8.55
4 384939485    555694      692      8.59
5 425860958    620765      686      8.63

Rate of growth (roughly) (8.63 - 8.46)/4.0 = .0425/release
(You could curve fit this, but it won't change things much)
Next release = 8.63 + .0425 = 8.6725 -> 470435405 bases

The average size is falling primarily because the rate of growth of the EST
divisions is greater than for the other divisions, and because the average
EST is under 400 bp.


Problem group 2.  Sequence database searching with query sequences

For this exercise use PEPCORRUPT to create these files derived from
Pir1:A1HU (which is 353 amino acids in length)

                    Substitutions   Indels  Average subs/residue
  A1HU_150.pep      525             0       1.5
  A1HU_175.pep      613             0       1.75
  
BLAST each of these against the Swiss Protein database using the default
settings. FASTA each of these against the local SwissProtein database
using a wordsize of 1, with the standard matrix (Dayhoff) and the Blosum62
matrix.

(Hint - you want to use commands like these to run these jobs in the batch 
queues:

   $ BLAST/infile1=whatever/infile2=SwissProt/list=10/segment=10/default/batch
   $ FASTA/infile1=whatever/infile2=SWISS:*/word=1/default/batch
   $ FASTA/infile1=whatever/infile2=SWISS:*/word=1 -
     /data=genmoredata:blosum62.cmp/default/batch
)

It may take a while for these calculations to complete - use 

   $ show entry 

to see what is still running.

2A.  Summarize the results.


When FASTA sensitivity is optimized (using the Blosum62 matrix and
a word size of 1) it is roughly as sensitive as BLAST.  Both should find
roughly 3 hits at 1.5 subs/residue, and none at 1.75 subs/residue.  Your
results may vary slightly, because PEPCORRUPT makes random changes, and so
your corrupted test sequences may be bit more or less similar to the starting
sequence.


When PIR1:A1HU is searched through the GB_EST database divisions using
FRAMESEARCH (don't try it - it takes 29 hours!) the six highest scoring
results are:

Gb_Est1:H25698  H25698 yl54g08.r1 Homo sapiens cDNA clone 162110 5'...  638.0
Gb_Est1:H39741  H39741 yo53d04.r1 Homo sapiens cDNA clone 181639 5'...  638.0
Gb_Est2:T28687  T28687 EST51971 Homo sapiens cDNA 5' end similar to...  635.0
Gb_Est1:H45310  H45310 yo65b01.r1 Homo sapiens cDNA clone 182761 5'...  624.0
Gb_Est1:H39698  H39698 yo52e05.r1 Homo sapiens cDNA clone 181568 5'...  613.0
Gb_Est2:R83490  R83490 yp15b06.r1 Homo sapiens cDNA clone 187475 5'...  613.0

Create a listfile (list.fil) containing these entries and search A1HU
against just these six entries: 

  $ framesearch/infile1=pir1:a1hu/infile2=@list.fil/default

2B. How many of the six contain at least one frameshift?  What does that
suggest to you about EST sequences?

Five of the six.

They contain a lot of garbage and should be handled with care.  Always
check any EST hits for frameshifts before going on to align them or use
them in other ways.


Problem group 3.  Patterns, Profiles, and  Databases

The PROSITE database is available in both pattern form (search with MOTIFS)
and profile form (search with PROFILESCAN).  Run both on the PIR1:A1HU
sequence and you will find a hit only against the "ig_mhc" motif, this will
be in the output .motifs and .scan files - you will need to search the 
latter for ig_mhc to find the hit.

3A.  Compare the results of the two methods.

The pattern method retrieves this region:

Ig_Mhc                (F,Y)xCx(V,A)xH
                        (F)xCx(V)xH
           311: KKGDT     FSCMVGH     EALPL

whereas the profile method retrieves a much larger region:

                                 *******  <- piece found by Motifs
    286 SQGTTTFAVTSILRVAAEDWKKGDTFSCMVGHEALPLAFTQKTIDRLAGK 335
           .  :|.:.|:| |::::|..:::|:|.|:|::  |: .. :::: .:.
P      1 PTGDGTYSVFSVLTVTASDWKSGDTYTCRVTHEGSNLP.EPVLVSRTPSE 49

S    336 PTHVNVS 342
         :.  .
P     50 SSGKSPS 56

The take home lesson is that most protein sequence motifs are better
described by their profiles than by their simple pattern.  Or put another 
way - "consensus" sequences are often a bad approximation for the true
spectrum of allowed sequences.


3B.  Use FINDPATTERNS to search the SwissProtein database for
     the Ig_Mhc pattern:               (F,Y)xCx(V,A)xH
     Then use PROFILESEARCH with profiledir:ig_mhc.prf to search the same 
     database.  Use /BATCH on both.  What do you see?

The commands would have been:

 $ findpatterns/infile=sw:*/pattern="(F,Y)xCx(V,A)xH"/batch/default
 $ profilesearch/infile1=profiledir:ig_mhc.prf/infile2=sw:*/batch/default

In this case, the results are roughly equivalent. If you cut off the
profilesearch at a Z value of about 4, which is close to the last
recognizable Ig-Mhc protein in the list, one finds 392 entries (Z values
are (observed - mean)/ standard deviation). Conversely, findpatterns found
385 sequences, which is pretty similar.  Looking more closely at what is 
found:

findpatterns   profilesearch     what
219            224               "histoc"  (for HISTOCompatability)
 62             72               " ig"     (various Ig entries)
  1              2               "immunog" (for immunoglobulin)
----------------------------------------------------------------
282            298               Sum


Problem group 4.  Finding database entries by documentation

4A.  The GCG program LOOKUP has indices into the various databases.  Use it
     to find all entries that have:
 
     Organism: Drosophila
     AllText:  Immunoglobulin

       AND

     AllText:  Neural


This can be done interactively from within Lookup as well, but here's how
from the command line: 

    $ lookup/organism=Drosophila/alltext=Immunoglobulin/outf=dros_ig.list/default
    $ lookup/alltext=neural/infile=@dros_ig.list/outfile=final.list/default
    $ Type final.list

There are four entries in this list:
   SWISSPROT:IML2_DROME
   PIR2:A32579
   GB_IN:DROIML2
   GB_IN:DRONRGAA

The key point here is not just that this can be done, but the final list
can be fed straight into a GCG program (assuming that it consists of just
proteins or just nucleic acids, which isn't the case here!) For instance:

   $ fasta/infile1=query.pep/infile2=@result.list


4B.  There are other databases around which are not indexed by our local
     copy of Lookup, for instance, the medline database.  You can use any
     Web client to use the NCBI entrez server, look at   

         http://www3.ncbi.nlm.nih.gov/Entrez/ 

     Use the Entrez server to reproduce the search above looking only at
     the protein database.  Then repeat the search on the medline database.


There is no answer - it's an exercise.  In the protein database you should
find the two proteins, and a bunch of links to nucleotide entries.  In the
medline database you will find 9 entries.