Fundamentals of Sequence Analysis, 1998-1999

Fundamentals of Sequence Analysis, 1998-1999

Lecture 3. Databases and Database searching

Database names, locations, and contents

Today we'll start with an overview of the major databases and then go on to discuss methods for searching them. These are the relations between the major databases:

Nucleic acid:  GenBank <--> EMBL <--> DDBJ
                  |          |
                  V          V
Protein:       GenPept ~= SwissProtein  <->  PIR   NRL_3D
3D Structure:                                       PDB

There are three nucleic acid sequence databases primarily for political and historical reasons. They share data daily, so effectively this represents a single database in three different data formats - anything submitted to one ends up in all the others.

  1. Genbank: American, maintained by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health in Bethesda, Maryland.
  2. EMBL: European, maintained by the European Molecular Biology Laboratories, or more specifically, the European Bioinformatics Institute (EBI), at Hinxton Hall, U.K.
  3. DDBJ: Japanese, stands for DNA database of Japan, in Mishima, Japan.

We keep a local copy of Genbank at the SAF, it is formatted for GCG access. Genbank full releases come out every 2 months. The 109.0 release had 2,162,067,871 bases, from 3,043,729 sequences. In between full releases there are daily incremental releases which we don't keep locally, but which can be searched at the NCBI. Like most of the other databases, it is experiencing exponential growth, with lately a huge number of Expressed Sequence Tag sequences coming in. Genbank is divided into the following pieces:

GCG     Genbank  Description
gb_pr*  gbpri*   Primate sequence, parts 1-3.
gb_ro   gbrod    Rodent sequence.
gb_om   gbmam    Other mammalian sequence.
gb_ov   gbvrt    Other vertebrate sequence.
gb_in   gbinv    Invertebrate sequence.
gb_pl*  gbpln*   Plant sequence (including fungi and algae), parts 1-2.
gb_ba*  gbbct*   Bacterial sequence, parts 1-2.
gb_st   gbrna    Structural RNA sequence.
gb_vi   gbvrl    Viral sequence.
gb_ph   gbphg    Phage sequence.
gb_sy   gbsyn    Synthetic and chimeric sequence.
gb_un   gbuna    Unannotated sequence.
gb_est* gbest*   EST (expressed sequence tag) sequence, parts 1-22
gb_pt   gbpat    Patent sequence.
gb_sts  gbsts    STS (sequence tagged site) sequence.
gb_gss* gbgss*   GSS (genome survey sequence) sequence, part 1-4.
gb_htg  gbhtg    HTGS (high throughput genomic sequencing) sequence.

Related items, specific to the SAF installation:

genbank -        all of the above, except the ESTs.
hs_u    -        Unigene database, Human.
mm_u    -        Unigene database, Mouse.
rn_u    -        Unigene database, Rat.
unigene -        All unigene entries.

WARNING! The HTGS division entries are not simple sequences! They are essentially snapshots of the assembly process for a YAC, BAC, P1 or other large clone, and contain a set of contigs, all spliced together in arbitrary order and orientation. If you get a hit on an HTGS entry, be sure that you read the file header, which will say where that particular contig begins and ends. Then extract just that contig for further work.

The EST divisions contain 762,208,762 bases, or about a third of the total sequence. The EST division entries are often present many times, and are often duplicates of well characterized genomic sequences. They also contain many more sequencing errors than do other entries (see the homework.) To save space, and to make searching easier, the SAF has installed the Unigene database. For the three organisms which it encompasses, this is essentially a nonredundant set of sequences, drawn from both the regular divisions and from the EST divisions. For the human division of Unigene 58, the included sequences were selected from:

     17571     mRNAs + gene CDSs
    410887     EST, 3'reads
    299197     EST, 5'reads
  + 127901     EST, other/unknown
    855556     total sequences in clusters

These were reduced to only 50846 sequences, by clustering, and retaining only one final sequence per cluster. The cluster sizes were as follows:

   Cluster      Number
   size         of clusters
       1        11331
       2         8208
     3-4         8614
     5-8         7216
     9-16        5049
    17-32        4203
    33-64        3313
    65-128       1943
   129-256        670
   257-512        211
   513-1024        65
  1025-2048        19
  2049-4096         3
  4097-8192         1

There are four major protein sequence databases:

  1. SwissProtein: European, maintained by Amos Bairoch's group, in Geneva, Switzerland, along with the EBI.
  2. GenPept American, maintained by a group at the Frederick Biomedical Supercomputing Center, Frederick, Maryland.
  3. PIR: American, from the National Biotechnology Research Foundation.
  4. NRL_3D: American, from the Naval Research Laboratory and the NBRF.

SwissProtein consists of peptides translated from the EMBL database, GenPept consists of peptides translated from Genbank. Since EMBL and Genbank are essentially identical, you would expect that GenPept and SwissProtein would be so as well. In one sense they are - most of the proteins found in one will be found in the other. But there are more sequences in Genpept (109.0 has 343,349 entries) than in SwissProtein (release 36.0, has 74,019 entries, 26,840,295 amino acids worth of sequence.) This comes about because the DNA databases have multiple independent sequences of the same genes, and Genpept faithfully translates all of these copies, whereas SwissProtein tends to merge them into a single entry. The result is that Genpept has many more repeats of the exact same sequence than does SwissProtein. Genpept is essentially unannotated, whereas the annotation in SwissProtein is very complete.

PIR is an independent protein database - some of the entries come from protein sequencing and so forth, and so are not present in SwissProtein. However, overall, most of what is in PIR is also in SwissProtein, and vice versa. The SAF keeps a local copy of the current PIR release, ( 58.0, which has 116738 sequences, 37,460,341 residues.)

The NRL_3D is a useful database - it contains the sequences for every unique structure in the PDB (protein databank) - a database of protein structures. It is often a very good idea to search this to find out if a structure homologous to one's query sequence is known. The SAF keeps a local copy of the current NRL_3D relaase. (35.0, which has 14,791 sequences, 2,636,724 residues.)

There are several pattern databases:

  • REBASE: American, maintained by Richard Roberts at New England Biolabs. Restriction Enzyme Database Consists of all known restriction enzyme sites, the 902 release contains 561 enzymes.
  • PROSITE European, again, from Amos Bairoch, University of Geneva. Consists of annotated peptide patterns, for instance, indicating active sites, structural sites, modification sites, and binding sites. Release 15.0 held 1352 such patterns.
  • TFD (Transcription Factor Database) American, from David Ghosh at the NIH. Consists of most known DNA binding sites or otherwise characterized DNA sites. The latest release contained 2797 such sites. Still used, but no longer updated.
  • TRANSFAC: European. From the GBF (Gesellschaft für Biotechnologische Forschung mbH). Release 3.2 contains 4045 entries, experimentally determined binding sites for various DNA binding factors.
  • Other databases the SAF keeps current are:

  • EC Enzyme: European, again, from Amos Bairoch. Contains descriptions of enzymes by Enzyme Commission number, also cross references to SwissProtein entries.
  • EPD: Eukaryotic Promoter Database, from Philipp Bucher, Lausanne, Switzerland. Consists of -499 to 100 on many eukaryotic promoters found in the EMBL database.
  • SRS/Lookup. This isn't so much an independent database as a set of cross indices into all of the other databases we have. We'll talk about this one later.
  • There are dozens and dozens of other biologically relevant databases, and in particular, many organism specific sequence databases. These tend to slowly percolate into GENBANK, but they are sometimes handy to have broken out separately, so that you can search just the entries for those organisms. Here are some of the databases we have loaded - others can be installed by request:

    GCG           Description
    YEAST         Yeast DNA (complete genome): 110 Kb pieces, 10 Kb overlaps
    YP            Yeast proteins
    Y_ORF         Yeast ORFs
    NRAT          Arabidopsis thaliana, nonredundant peptides
    H_INFLUENZAE  H. influenzae DNA (complete genome)
    M_JANNASCHII  M. jannaschii DNA (complete genome)

    It is worth remembering that GCG can work with lists, so that you can specify a search for a single organism even if they are not broken out into their own database. Example:

    $ stringsearch/infile=sw:*/string="bos taurus"/out=cow.lis/menu=a/noscreen
    $ fasta/infile1=query.pep/infile2=@cow.lis/batch/default

    This would create a list file cow.lis which has one line for each Bos taurus entry, and the subsequent search would only look at those entries. However, this method is actually several times slower than just searching the whole database, so unless the list is quite short, it is usually better to search the whole database and then search the result for the organism of interest.

    Submitting, correcting, and retrieving entries

    While you might conceivably want to submit data to almost any of these databases, in practice most of you will only ever submit data to Genbank, so I'll quickly go over how you do that.

    Up until December 1998 it was possible to obtain a simple form, fill it out, and email it back to Genbank. Since then, the only submissions they will accept must come from either BankIt or Sequin BankIt is a web interface, Sequin is software you install on your computer. The NCBI would prefer that you use Sequin, but if you only rarely submit sequences, and they are not overly complex you might be able to get by with BankIt, which will save you the trouble of setting up Sequin on your computer. However, Sequin, once properly configured, is less prone to glitches than is BankIt, and it would be a good idea to bite the bullet and configure at least one computer in your lab to run Sequin.

    With either tool, after you submit an entry, they will email you back an accession number, usually within a couple of hours. If you have your mail forwarded, remember to look there for your accession number. Remember, they will not release your entry until you tell them to, so don't think that just because your paper has come out in print that Genbank will automatically release your entry! You can get them to release your entry through the BankIt interface, but it might be easiest to just send email to Supply in the body of the message:

    1. The accession number they sent you.
    2. Your contact information (same as in your submission).
    3. The revision you want to make - in this case "ok to release entry now".

    File Formats:

    There are several more or less "standard" sequence file formats which you will encounter when doing sequence analysis work. Is isn't always clear from the name which type a file is, so you should know what they look like so that you will know when to convert them. The most common ones are:

    1. Plain Text, which looks like:
    2. FASTA adds a single comment line, and looks like:
      >gi|4102777 homeobox ultrabithorax protein
    3. GCG, can have any number of comments, separated from the sequence itself by a ".." delimiter line. The following example was converted from a GENBANK file, and so looks much like that, except note that ".." in the header have been converted to ". ." so they won't be recognized as delimiter lines. The comment area can be anything, even free text. It looks like:
      LOCUS       4102763        27 aa                 17-DEC-1997
      DEFINITION  homeobox ultrabithorax protein.
      ACCESSION   4102763
      PID         g4102763
      DBSOURCE    GENBANK: locus AF015936, accession AF015936
      KEYWORDS    .
      SOURCE      Trypetesa lampas.
        ORGANISM  Trypetesa lampas
                  Eukaryotae; Metazoa; Arthropoda; Crustacea; Cirripedia;
                  Acrothoracica; Apygophora; Trypetesidae; Trypetesa.
      REFERENCE   1  (residues 1 to 27)
        AUTHORS   Mouchel-Vielh,E., Rigolot,C. and Deutsch,J.
        TITLE     Acrothoracican homeobox genes
        JOURNAL   Unpublished
      REFERENCE   2  (residues 1 to 27)
        AUTHORS   Mouchel-Vielh,E., Rigolot,C. and Deutsch,J.
        TITLE     Direct Submission
        JOURNAL   Submitted (25-JUL-1997) Developement et evolution, CNRS URA1135, 9
                  Quai Saint Bernard, Paris 75005, France
      COMMENT     Method: conceptual translation supplied by author.
      FEATURES             Location/Qualifiers
           source          1. .27
                           /organism="Trypetesa lampas"
           Protein         <1. .>27
                           /product="homeobox ultrabithorax protein"
           CDS             1. .27
                           /coded_by="AF015936:<1. .>81"
      4102763.Pep  Length: 27  February  3, 1999 11:24  Type: P  Check: 8741  ..
    4. GENBANK has a very rigid structure, with specific keywords and formatting. It looks like:
      LOCUS       4102763        27 aa                 17-DEC-1997
      DEFINITION  homeobox ultrabithorax protein.
      ACCESSION   4102763
      PID         g4102763
      DBSOURCE    GENBANK: locus AF015936, accession AF015936
      KEYWORDS    .
      SOURCE      Trypetesa lampas.
        ORGANISM  Trypetesa lampas
                  Eukaryotae; Metazoa; Arthropoda; Crustacea; Cirripedia;
                  Acrothoracica; Apygophora; Trypetesidae; Trypetesa.
      REFERENCE   1  (residues 1 to 27)
        AUTHORS   Mouchel-Vielh,E., Rigolot,C. and Deutsch,J.
        TITLE     Acrothoracican homeobox genes
        JOURNAL   Unpublished
      REFERENCE   2  (residues 1 to 27)
        AUTHORS   Mouchel-Vielh,E., Rigolot,C. and Deutsch,J.
        TITLE     Direct Submission
        JOURNAL   Submitted (25-JUL-1997) Developement et evolution, CNRS URA1135, 9
                  Quai Saint Bernard, Paris 75005, France
      COMMENT     Method: conceptual translation supplied by author.
      FEATURES             Location/Qualifiers
           source          1..27
                           /organism="Trypetesa lampas"
           Protein         <1..>27
                           /product="homeobox ultrabithorax protein"
           CDS             1..27
              1 htnhyltrrr riemahqlcl terqiki
    5. ASN.1 has an even more complex syntax than does GENBANK. It isn't really a human readable format, even though it is a text file. It looks like
      Seq-entry ::= set {
        class nuc-prot ,
        descr {
            std {
              year 1997 ,
              month 7 ,
                etc. etc. ad nauseum

    The programs FromGenbank, ToGenbank, FromFasta, ToFasta, FromText, ToText are available for converting between the named format and the GCG format. Additionally, D. Gilbert's program READSEQ can convert the sequence, but not usually the comments, between 18 different sequence formats.

    Searching sequence databases:

    Now that you have an overview of the databases that are available lets consider how you would go about searching them.

    Finding entries homologous to a query sequence

    General considerations

    First lets consider searching sequence databases with a single sequence. That is, you've cloned something and want to know if that or something like it is already known. Last week we discussed the Smith Waterman method for aligning two sequences. That method is very computationally intensive, and it isn't usually used fpr routine database searching on a normal computer. On the other hand, it will complete in a reasonable time for some situations. For instance the commands:

    $ fasta3   "-Q" "-O" a1hu.fasta3   "-E" 0.1 -b 20 -d 1 a1hu.fasta "c"
    $ ssearch3 "-Q" "-O" a1hu.ssearch3 "-E" 0.1 -b 20 -d 1 a1hu.fasta "c"

    Do a FASTA and Smith Waterman search against the Swiss-Protein database using a query sequence in fasta format. In both cases, only hits expected to occur 0.1 times in the database, or less are returned, and the list of hits is restricted to 20, the number of alignments returned, to 1. "-Q" suppresses messages during the run, "-O" specifies the output filename. The last two arguments are the query sequence file name, and the letter of the library to search. (Run it with no command line arguments once to get the full list.) FASTA3 took 173 seconds to scan this database, but the Smith Waterman progam needed 1000 seconds.

    The equivalent GCG FASTA command is:

    $ FASTA/infile=pir1:a1hu/infile2=sw: -

    However, the Smith Waterman method can be run quickly on some massively parallel computers, for instance, those made by MasPar, which have been programmed to use this algorithm, and which you can access over the Internet. All the servers that I'm aware of only search the SwissProtein database. You can use this method on SEQAXP by issuing the command:

    $  CBRG

    and then selecting Option 1, PepPepSearch. When the search is completed it will be mailed back to you. It may take hours to days for this to occur though.

    When comparing only two sequences, efficiency isn't all that important, but when comparing 50,000 sequences, as in SwissProtein, or 3 million, as in Genbank, efficiency is very important! The Smith-Waterman method, while being very thorough, is not very efficient, and so the common database search methods are based on other algorithms that are much faster.

    FASTA searches
    0 0 0 0 0 0 0 5 0 0 0 1 0 0 0 0 2 0
    | | | | | | | | | | | | | | | | | |
    . . . . . . . . . . . . . . . . . . - 0
    . . . . . . . . . . . . . . . . . . - 0
    . . . . . + . . . . . . . . . . . . - 0
    . . . . + . . . . . . . . . . . . . - 0
    . . . + . . . . . . . . . . . . . . - 0
    . . + . . . . . . . + . . . . . . . - 1
    . + . . . . . . . + . . . . . . . . - 0
    . . . . . . . . . . . . . . . + . . - 0
    . . . . . . . . . . . . . . . . . . - 0
    . . + . . . . . . . . . . . . . . . - 0
    . . . . . . . . . . . . . . . . . . - 0

    One such method is Pearson's FASTA (and variants of it). FASTA starts by making a generalization from the idea of dotplots. As we discussed last week, In a dotplot regions of similarity between two sequences show up as diagonal streaks. In FASTA, the sum of the dots along each diagonal is calculated, as shown in the diagram above. If there is a region of high similarity the diagonal sum will be higher than the background level found in surrounding diagonals. The fast in FASTA comes from the method of calculating those sums. If it were necessary to actually construct the dotplot matrix and then add along the diagonals FASTA wouldn't be much quicker than Smith-Waterman. Instead, it uses a word or tuple based method. That is, it makes a list of all W-symbol long words, where W is usually 1 or 2 for amino acids, or 5 or 6 for nucleotides, and their positions, for each sequence. It then goes through, and for each identical word in each list, it calculates which diagonals each pair of positions would fall into, but it only counts nonoverlapping words. It then rescores the highest scoring regions using a replacement matrix such as the BLOSUM50 - the best of these scores is called init1. It then tries to join together several high scoring diagonals, the best score from that is called initn. Finally, the program will try to make an optimal local alignment around the regions it has discovered, and that score is called opt. Depending on the version of FASTA and the command line parameters, that final optimization will either be by Smith-Waterman or another similar algorithm, in either case, it is applied near the segments of similarity found already - not the whole protein. This last optimization step is only applied to a small number of "hit" sequences, which had high initn values after the database search.

    Recent versions of FASTA, such as FASTA3, but not the FASTA in the GCG package, also have methods of estimating the statistical likelihood of a match with a given score. Refer to the most recent Pearson reference in the homework for a discussion of the Z and E values. Here is what the output from FASTA3 looks like:

     a1hu.fasta, 353 aa
     vs Swissprot           library
           opt      E()
    < 20   176     0:==
      22     0     0:           one = represents 130 library sequences
      24     0     0:
      26     2     2:*
      28     1    17:*
      30    18   102:*
      32   125   393:=  *
      34   540  1065:=====   *
      36  1738  2187:==============  *
      38  3743  3614:===========================*=
      40  5825  5041:======================================*======
      42  7267  6163:===============================================*========
      44  7747  6798:====================================================*=======
      46  7354  6924:=====================================================*===
      48  6988  6629:==================================================*===
      50  6047  6049:==============================================*
      52  5391  5318:========================================*=
      54  4220  4542:================================= *
      56  3356  3794:==========================   *
      58  2909  3115:=======================*
      60  2188  2523:=================  *
      62  1739  2023:============== *
      64  1293  1609:==========  *
      66   983  1272:======== *
      68   728  1000:====== *
      70   646   784:===== *
      72   506   612:====*
      74   387   478:===*
      76   354   372:==*
      78   300   289:==*
      80   220   224:=*
      82   181   172:=*
      84   164   136:=*
      86   134   105:*=
      88   102    81:*          inset = represents 4 library sequences
      90    66    63:*
      92    62    49:*         :============*===
      94    53    38:*         :=========*====
      96    56    29:*         :=======*======
      98    53    23:*         :=====*========
     100    30    17:*         :====*===
     102    37    14:*         :===*======
     104    20    10:*         :==*==
     106    24     8:*         :=*====
     108    22     6:*         :=*====
     110    15     5:*         :=*==
     112    17     4:*         :*====
     114    18     3:*         :*====
     116    13     2:*         :*===
     118     8     2:*         :*=
    >120   153     1:*=        :*======================================
    26840295 residues in 74019 sequences
     statistics extrapolated from 50000 to 73710 sequences
     Expectation fit: rho(ln(x))= 6.4951+/-0.000554; mu= 2.3794+/- 0.031;
     Kolmogorov-Smirnov  statistic: 0.0337 (N=29) at  52
    FASTA (3.08 July, 1997) function (optimized, BL50 matrix) ktup: 2
     join: 37, opt: 25, gap-pen: -12/ -2, width:  16 reg.-scaled
    The best scores are:                        initn init1 opt z-sc E(73710)
    ALC1_HUMAN P01876 homo sapiens (human  ( 353) 2435 2435 2435 2238.4 5.3e-118
    ALC1_GORGO P20758 gorilla gorilla gor  ( 353) 2377 2377 2377 2185.4 4.7e-115
    ALC2_HUMAN P01877 homo sapiens (human  ( 340) 2214 1549 1584 1460.9 1.1e-74
    ALC_MOUSE P01878 mus musculus (mouse)  ( 344) 1212 1070 1253 1158.3 7.7e-58
    ALC_RABIT P01879 oryctolagus cuniculu  ( 299) 1080  966 1106 1024.8 2.1e-50
    MUC_MOUSE P01872 mus musculus (mouse)  ( 455)  533  236  612  570.8 4.0e-25
    MUC_HUMAN P01871 homo sapiens (human)  ( 454)  599  227  596  556.2 2.6e-24
    MUC_RABIT P03988 oryctolagus cuniculu  ( 458)  486  220  584  545.2 1.1e-23
    MUCB_HUMAN P04220 homo sapiens (human  ( 391)  438  227  580  542.5 1.5e-23
    GC2_HUMAN P01859 homo sapiens (human)  ( 326)  315  132  563  528.0 9.8e-23
    GC4_HUMAN P01861 homo sapiens (human)  ( 327)  270  123  551  517.0 4.0e-22
    GC1_HUMAN P01857 homo sapiens (human)  ( 330)  305  127  549  515.1 5.1e-22
    MUC_MESAU P06337 mesocricetus auratus  ( 454)  396  165  550  514.2 5.8e-22
    MUC_CHICK P01875 gallus gallus (chick  ( 446)  434  232  545  509.7 1.0e-21
    GCAA_MOUSE P01863 mus musculus (mouse  ( 330)  273  122  536  503.3 2.3e-21
    MUC_SUNMU P20768 suncus murinus (hous  ( 457)  524  242  538  503.2 2.4e-21
    GCAB_MOUSE P01864 mus musculus (mouse  ( 335)  273  116  534  501.3 3.0e-21
    GCAM_MOUSE P01865 mus musculus (mouse  ( 399)  246  122  530  496.7 5.5e-21
    MUCM_MOUSE P01873 mus musculus (mouse  ( 476)  422  207  525  491.0 1.1e-20
    GC1_RAT P20759 rattus norvegicus (rat  ( 326)  180  138  516  485.1 2.4e-20
    >>ALC1_HUMAN P01876 homo sapiens (human). ig alpha-1 chai (353 aa)
     initn: 2435 init1: 2435 opt: 2435 Z-score: 2238.4 expect() 5.3e-118
    Smith-Waterman score: 2435;  100.000% identity in 353 aa overlap
                   10        20        30        40        50        60
                   10        20        30        40        50        60
                   70        80        90       100       110       120
                   70        80        90       100       110       120
                  130       140       150       160       170       180
                  130       140       150       160       170       180
                  190       200       210       220       230       240
                  190       200       210       220       230       240
                  250       260       270       280       290       300
                  250       260       270       280       290       300
                  310       320       330       340       350   
                  310       320       330       340       350   
    353 residues in 1 query   sequences
    26840295 residues in 74019 library sequences
     Scomplib [version 3.0t81 August 28, 1997]
     start: Wed Feb  3 12:29:24 1999 done: Wed Feb  3 12:32:12 1999
     Scan time: 167.000 Display time:  1.000
    Function used was  FASTA 

    A few things to keep in mind with FASTA:

    1. It is most sensitive with a word size of 1, but the default is 2 (for proteins). Conversely, it will take longer to run a search with a word size of 1 than with one of 2. For instance, in a test run where all of SwissProtein was searched with a 350 amino acid protein, it took 1 minute with a word size of 2, just under 3 minutes with a word size of 1.
    2. The output file contains a histogram. If your best scoring regions are past the end of the tail of the distribution shown you can be fairly confident that they are significantly similar.
    3. The Opt value should be roughly as high as the initn value, or higher. If it is more than a few percent lower it probably indicates a spurious match.

    BLAST searches

    Another searching algorithm that most of you have used, or will use, is BLAST, which stands for Basic Local Alignment Search Tool. BLAST is also a word or tuple based method. However, the method is considerably different than that used in FASTA. One major difference is that BLAST requires a preformatted search database. That is, one has to take all of SwissProtein, for instance, and convert it to a different format before it can be searched. Because this would effectively require a second copy of every database, BLAST searches are run remotely, on a pair of servers located at the NCBI. These servers apparently have a tremendous amount of RAM, and so are able to keep entire databases in memory at one time, and not have to read it in from disk for each search - this is probably one reason why BLAST is so fast. It is possible to create a local BLAST database though, so if need be, one can BLAST against one's own personal database of sequences.

    There are currently three major BLAST variants: original, gapped, and psi. The original variant did not handle gaps directly, but did have a way of summing scores from a series of ungapped ordered similarities. The gapped variant does handle gaps, and is about three times faster than the original. PSI-BLAST does a first pass using gapped blast, then prepares a profile from the results and does a second pass through the database looking for matches to the profile.

    All three start work in roughly the same way. The program takes each word from the query sequence (for proteins, that is 3 symbols long, for nucleotides 11 or 13), and then locates all of the words in the neighborhood of the current test sequence. A neighborhood consists of all those words that score greater than a certain value against the query word, for a given comparison matrix. Once the neighborhood words are ascertained, BLAST tries to expand the alignment around both the query and test sequence locations. For the original variant, it tries very hard to expand the regions, for the gapped variant it tries somewhat less hard. The "magic" in BLAST involves the heavy use of statistics in deciding when to stop expanding one of these seed regions, and that revolves around two parameters called lambda and K, which are calculated from first principles in the original BLAST, but must be estimated for gapped BLAST. Read the papers in the reference for more information. It then stores that score and the segment position and tests another word. At the end of this process, it selects a set of high-scoring segment pairs (HSPs). The gapped variant then tries to find two HSPs which are close to each other, and to merge those into one larger alignment. What it gives up in extension it gains back in the ability to close gaps. At the end of the database search the program has a list of the largest HSPs, or the largest aligned regions, and returns that. The original variant will check to see if there are several HSPs in order and nonoverlapping, and will then calculate an aggregate score - effectively allowing for certain types of gaps. It will do this even if these may not be significant scores by themselves, which is why it tries harder than the gapped variant to extend the HSPs. You will see these in the output file as a series of segments marked with P(N), where N is 2,3, or whatever.

    Here is how the three BLAST variants stack up against each other when tested against a random sequence database (negative control) and SwissProtein (experiment).

    The three BLAST variants against a database of random peptides.

    Query       Original           Gapped         PSI-BLAST
                E-Value:          E-Value:         E-Value:     
             Low   <=1 <=10    Low   <=1 <=10   Low   <=1 <=10
    P00762   0.86    1    7    3.0   0      4   0.94    1    8
    P01008   3.9     0    4    0.078 1      9   1.5     0    9
    P01111   3.4     0    8    3.4   0      7   1.1     0    9
    P02232   2.4     0    7    2.8   0      5   8.2     0    2
    P03435   0.11    2   11    0.46  3     16   0.87    1    8
    P05013   2.4     0    6    0.27  2      4   0.11    2   11
    P07327   1.5     0    2    0.80  1      5   1.5     0    9
    P10318   0.91    1    7    0.13  1      7   0.0031  2    6
    P10635   0.84    2    5    8.5   0      3   0.46    1   15
    P14942   1.0     1   10    3.3   0      3   0.30    2    9
    P20705   0.012   1    8    0.26  2     14   0.79    2   10
    Average  1.0     0.7  6.8  0.80  0.9   7.0  0.87    1.0  8.7

    From this you can see that they all do about equally well in rejecting false matches, although PSI-BLAST does return one result which would probably be misinterpreted as being significant when it is not.

    Number of matches with E <= .01 for Smith-Waterman and three BLAST variants against SwissProtein.

    Query     Smith-   Original  Gapped  PSI-BLAST
    Sequence  Waterman                           
    P00762      275       273      275      286
    P01008      108       105      108      111
    P01111      255       249      252      375
    P02232       28        26       28      623
    P03435      128       114      128      130
    P05013       53        53       53       53
    P07327      138       128      137      160
    P10318      262       241      261      338
    P10635      211       197      211      224
    P14942       83        79       81      142
    P20705      198       191      197      207
    Run times  3600       100       34       87 (Normalized)

    Data in preceding two tables is from Altschul et al. (1997) Nucleic Acids Research 25(17):3389-3402. The queries sequences were: P00762 Serine protease, P01008 Serine protease inhibitor, PO1111 Ras, P02232 Globin, P03435 Hemagglutinin, P05013 Interferon a, P07327 Alcohol dehydrogenase, P10318 Histocompatibility antigen, P10635 Cytochrome P450, P14942 Glutathione transferase, P20705 H+-transporting ATPsynthase.

    So in terms of sensitivity, there isn't much reason to prefer the gapped one over the original. On the other hand, the gapped one is faster, so all else being equal, you should use that. PSI-BLAST usually returns a few more hits than either of the other two variants, and even finds more things than does Smith-Waterman. It can do this because there is more information in a profile method than in the simple pairwise comparisons that all of the others use. In the one highlighted case above, the query was a plant globin, and PSI-BLAST managed to pick up some animal globins, which then came to dominate the profile. After a few more iterations it comes back with all animal globins, but the original query is no longer on the list. That is, the profile has converged on another family of related genes, to the exclusion of the original!

    BLAST considerations. BLAST is a very complicated tool. Some of the things you should keep in mind.

    1. Default is to search the nr database, which does not include the EST database.
    2. It works much better for proteins than for nucleotides. Mostly this arises from the requirement for 11 consecutive identical nucleotides to start a HSP. The estimate is that BLAST can find protein sequences that have diverged by as much as 250 substitutions per 100 amino acids, but only 50 substitutions for 100 nucleotides. If you must search nucleotide sequence against nucleotide databases, and you are looking for something remote, as opposed to an overlapping clone, then use FASTA instead.
    3. The output from BLAST can be enormous. You don't want to blindly PRINT it, as it will often run to a couple of hundred pages. Rather go in with an editor, delete the pieces you don't need, and then print what's left. Additionally, for first runs, you might want to restrict the number of hits returned, both in the initial list and the number of alignments.
    4. BLAST is particularly sensitive to short repeats and skewed composition. This is because the model it uses for extending HSPs doesn't handle these common sequence anomolies correctly. By default these are filtered out, which will cause your query sequence to come back with a region or two to be full of XXXXXXX instead of sequence.
    5. If your sequence comes from a system that does not use the standard genetic code, be sure to use the appropriate translation option to specify the proper table.
    6. There are a whole bunch of parameters you can twiddle on BLAST that change the way it behaves in one way or another. It is a good idea to read the entire manual through at least once, to get some idea what some of these are. There is a link to it on the NCBI web site.
    7. Practical considerations - BLAST files lay around and take up an immense amount of disk space. Periodically, for computer housekeeping purposes, there is a process that makes a sweep through all of the user directories and deletes any BLAST output file that is more than one month old.
    8. You can run all BLAST programs through the web interface at the NCBI. You may also submit BLAST jobs from the Seqaxp command line through the BLAST and BLAST2 commands. These are actually scripts which queue jobs onto the NCBI servers, issue the command with no arguments to see how they are used. The advantage of using the command line variants is that you can include them in batch jobs and send large numbers of sequences through the NCBI server, without having to sit at your computer pasting them into a browser. Lastly, we have a Linux server which may be used by special arrangement to BLAST thousands or tens of thousands of sequences against GenPept.

    Here's an example run of BLAST2, from the command line, on Seqaxp.

    $ blast_datalib="swissprot"
    $ blast_program="blastp"
    $ blast_alignments=1
    $ blast_descriptions=20
    $ blast2 pir1:a1hu
     353 symbols written into "Kill_12470457.Seq".
    output will be A1HU.blastp
    %DCL-S-SPAWNED, process K89A412470457 spawned
    Now processing job, subprocess is 20209180
    still waiting at  3-FEB-1999 12:47:27.21
    BLAST job completed at  3-FEB-1999 12:47:33.22, results in A1HU.blastp
    $ type a1hu.blastp
       HTTP/1.0 200 OK MIME-Version: 1.0 Content-type: text/html 
    Commencing search, please wait for results.
    BLASTP 2.0.7 [Dec-21-1998]
    Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer,
    Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
    "Gapped BLAST and PSI-BLAST: a new generation of protein database search
    programs",  Nucleic Acids Res. 25:3389-3402.
    Query= A1HU Ig alpha-1 chain C region - human
             (353 letters)
    Database: Non-redundant SwissProt sequences
               77,419 sequences; 27,864,727 total letters
    If you have any problems or questions with the results of this search
    please refer to the BLAST FAQs
                                                                       Score     E
    Sequences producing significant alignments:                        (bits)  Value
    sp|P01876|ALC1_HUMAN  IG ALPHA-1 CHAIN C REGION                       669  0.0
    sp|P20758|ALC1_GORGO  IG ALPHA-1 CHAIN C REGION                       656  0.0
    sp|P01877|ALC2_HUMAN  IG ALPHA-2 CHAIN C REGION                       633  0.0
    sp|P01878|ALC_MOUSE  IG ALPHA CHAIN C REGION                          393  e-109
    sp|P01879|ALC_RABIT  IG ALPHA CHAIN C REGION                          311  1e-84
    sp|P01871|MUC_HUMAN  IG MU CHAIN C REGION                             186  5e-47
    sp|P04220|MUCB_HUMAN  IG MU HEAVY CHAIN DISEASE PROTEIN (BOT)         183  4e-46
    sp|P03988|MUC_RABIT  IG MU CHAIN C REGION                             181  2e-45
    sp|P01872|MUC_MOUSE  IG MU CHAIN C REGION                             180  3e-45
    sp|P06337|MUC_MESAU  IG MU CHAIN C REGION                             167  3e-41
    sp|P20768|MUC_SUNMU  IG MU CHAIN C REGION                             160  3e-39
    sp|P01874|MUC_CANFA  IG MU CHAIN C REGION                             159  1e-38
    sp|P01875|MUC_CHICK  IG MU CHAIN C REGION                             158  2e-38
    sp|P23084|HVC1_HETFR  IG HEAVY CHAIN C REGION (CLONE 6125)            149  8e-36
    sp|P23085|HVC2_HETFR  IG HEAVY CHAIN C REGION (CLONE 12022)           145  9e-35
    sp|P23086|HVC3_HETFR  IG HEAVY CHAIN C REGION (CLONE 6121)            143  6e-34
    sp|P01857|GC1_HUMAN  IG GAMMA-1 CHAIN C REGION                        133  5e-31
               Length = 353
     Score =  669 bits (1708), Expect = 0.0
     Identities = 326/353 (92%), Positives = 326/353 (92%)
    CPU time:     0.12 user secs.       0.06 sys. secs          0.18 total secs.
      Database: Non-redundant SwissProt sequences
        Posted date:  Feb 2, 1999  3:30 AM
      Number of letters in database: 27,864,727
      Number of sequences in database:  77,419
    Lambda     K      H
       0.315    0.131     0.00
    Lambda     K      H
       0.270   0.0470 7.29e-304
    Matrix: BLOSUM62
    Gap Penalties: Existence: 11, Extension: 1
    Number of Hits to DB: 19690490
    Number of Sequences: 77419
    Number of extensions: 807756
    Number of successful extensions: 2530
    Number of sequences better than  1.0: 154
    Number of HSP's better than  1.0 without gapping: 52
    Number of HSP's successfully gapped in prelim test: 102
    Number of HSP's that attempted gapping in prelim test: 1951
    Number of HSP's gapped (non-prelim): 369
    length of query: 353
    length of database: 27864727
    effective HSP length: 51
    effective length of query: 302
    effective length of database: 23916358
    effective search space: 7222740116
    effective search space used: 7222740116
    T: 11
    A: 40
    X1: 16 ( 7.3 bits)
    X2: 38 (14.8 bits)
    X3: 64 (24.9 bits)
    S1: 41 (21.6 bits)
    S2: 73 (32.8 bits)

    FRAMESEARCH searches

    The last sequence searching method I'm going to talk about today is called FRAMESEARCH. It is specifically designed for dealing with bad data. That is, it can find an alignment between a protein query and a nucleotide test sequence even if the latter contains frameshifting gaps. The main reason you need to know about this program is that there are a lot of sequences in the databases, primarily in the EST divisions, that do contain frameshifts in the sequence. A fair estimate is that over half of the EST sequences have these errors. Since many protein coding regions are open in all frames these can still show up in a search with BLAST or FASTA, when the in frame part scores highly, but they will mess up any subsequent multiple sequence alignment because the section on the other side of the frameshift will be incorrect. In a really junky sequence there might be multiple frame shifts, so that nothing is long enough to show up in BLAST or FASTA. FRAMESEARCH, unfortunately, is a very slow program, as it is an extended version of the Smith-Waterman method. One thing that you might want to do, which is much less time consuming, is to run FRAMESEARCH or FRAMEALIGN directly against any EST hit you find with either FASTA or BLAST. That is usually acceptable because you will find the sequences which interest you with these two general tools, but can then get the nucleic acid lined up better with the protein with the more specialized tool. The FASTX3 and TFASTX3 programs from Pearson can also handle frameshifts.

    The case for iterated searches

    Before we move on, I'd like to make one more important point. All of these search algorithms will at some "distance" between two sequences be unable to find one using the other as a query. Consider three different sequences A,B, and C. For any sequence comparison method, it is common to find this situation:

     A  finds   B
        misses  C
     B  finds   A
        finds   C
     C  finds   B
        misses  A

    In the example shown above, using A as a query we may find B, but not be able to find C, yet C may really be distantly related to A. Therefore, it is important to take the sequences that are found in a database search, and to search the database again using those. In the example shown, B would find A, but also C. Obviously, you want to redo searches with the more divergent hits - if the original query was mouse trypsin, there isn't much point redoing the search with rat trypsin!

    Finding entries containing a query pattern

    So far we've discussed how to find a sequence query in a sequence database. Sometimes instead you want to find a pattern query in a sequence database. There are basically two kinds of patterns that the GCG package can search for. The first is called a pattern, and it consists of string that describes the pattern, for instance, Threonine or Serine, a gap of 3 to 5, an Alanine then a Glycine. The second is called a profile, and it consists of an array of weights for each type of residue for each position in a window of some length. Profiles are similar to what we saw in the last lecture with the Gibbs program.

    The program to use to find patterns is called, naturally, FINDPATTERNS. You use this when you have a pattern of your own to search a database with. You need only know the correct syntax to express your pattern, the key pieces of that syntax are:

      AGCT         Sequence "AGCT" (also Y,R and other degenerate codes)
      ~C           Not C
      {n,m}        previous symbol or expression repeated n to m times
      (A,G,C)      A or G or C
      (AG,G,C)     AG or G or C
      <            Beginning of sequence
      >            End of sequence
    $ findpatterns/infile=gb:dmwhite/pattern="AGCTN{1,10}(AG,GT)~YAA"
    $ type dmwhite.find
    ! FINDPATTERNS on Gb_In:Dmwhite allowing 0 mismatches
    !        1 AGCTN{1,10}(AG,GT)~YAA    February  2, 1999 15:02 ..
                 Dmwhite  ck: 9858  len: 14,245 ! X02974
      Drosophila melanogaster DNA sequence of white locus. 2/93
    1                     AGCTN{1,10}(AG,GT)~YAA
             4,041: TAGTT    AGCTAGAGAGAGAAA     CGGCA
    1 /Rev                TT~R(AC,CT)N{1,10}AGCT
             1,146: TGGAT     TTTACAGCCAAGCT     TAGCT
            12,685: AGGTG     TTCACCTCAGAGCT     GCCAG
         Total finds:          3
        Total length:     14,245
     Total sequences:          1
            CPU time:      00.33

    Things to watch out for with FINDPATTERNS:

    1. Be sure that your peptide query goes against a protein database. Otherwise, the amino acid symbols will be interpreted as degenerate nucleotide symbols, and a million hits will result.
    2. Usually you have to include the entire string in double quotes.
    3. You can allow mismatches. (/mismatch=)
    4. If you specify the option /NAMES a list file is generated which can be used for subsequent searches. That is, if you have one pattern that defines a group of proteins, you can make a list that contains only those names, then search just that list for subsequent patterns.
    5. There are several locally written variants:
      RFINDPATTERNS: finds the patterns, then extracts all pieces into separate files, along with a specified amount of flanking sequence.
      CFINDPATTERNS: inverts the meaning of the /EXCLUDE option
      NFINDPATTERNS: outputs just the positions of the pieces found

    There is a very complete discussion of profiles in the GCG manual, so this will only be a quick overview. The profile methods work this way:

    1. Generate a multiple sequence alignment.
    2. Use PROFILEMAKE to create a profile from that alignment.
    3. Use PROFILESEARCH to find other sequences consistent with that profile in the database.

    Searching a sequence with a pattern database

    Sometimes the problem is inverted, you want to test a new sequence to see if it contains a known pattern or profile. For instance, you might want to check a protein for a match in PROSITE, or a piece of DNA for a match in the TFD or TRANSFAC databases.

    To see if a protein contains a PROSITE motif, use the program MOTIFS. It is pretty much self explanatory.

    $ motifs/infile=pir1:a1hu
    $ type a1hu.motifs
     MOTIFS from: Pir1:A1hu
     Mismatches: 0                February  2, 1999 14:50  ..
    A1hu  Check: 6072  Length: 353   ! Ig alpha-1 chain C region - human
    Ig_Mhc                (F,Y)xCx(V,A)xH
               311: KKGDT     FSCMVGH     EALPL
    Immunoglobulins and major histocompatibility complex proteins signature
    The basic structure  of immunoglobulin (Ig) [1] molecules is a
    tetramer of two light chains  and two heavy chains linked  by
    disulfide  bonds.  There are two types of  light chains:  kappa and
    lambda,  each composed of a constant domain (CL) and a variable domain
    (VL).  There are five types of heavy chains: alpha, delta, epsilon, 
    gamma  and  mu,  all consisting of a variable domain (VH) and three
    (in alpha,  delta  and  gamma)  or  four (in  epsilon and  mu) 
    constant domains (CH1 to CH4). 
    The major histocompatibility complex  (MHC) molecules  are made of two
    chains. In class I [2] the alpha  chain is composed of three 
    extracellular domains, a transmembrane   region   and a   cytoplasmic
    tail.   The  beta  chain (beta-2- microglobulin) is  composed of a
    single extracellular domain. In class II [3], both the  alpha and the
    beta chains are composed of two extracellular domains, a transmembrane
    region and a cytoplasmic tail. 
    It is  known  [4,5]    that  the  Ig  constant  chain  domains  and  
    a single extracellular   domain   in each  type  of  MHC   chains are
    related.    These homologous domains  are  approximately    one   
    hundred amino  acids long and include a conserved intradomain
    disulfide bond.  We developed  a small pattern around the C-terminal
    cysteine involved  in  this  disulfide bond which can be used to
    detect these category of Ig related proteins. 
    -Consensus pattern: [FY]-x-C-x-[VA]-x-H
    -Sequences known to belong to this class detected by the pattern:
     Ig heavy chains type Alpha C region  : All, in CH2 and CH3.
     Ig heavy chains type Delta C region  : All, in CH3.
     Ig heavy chains type Epsilon C region: All, in CH1, CH3 and CH4.
     Ig heavy chains type Gamma C region  : All, in CH3 and also CH1 in some cases
     Ig heavy chains type Mu C region     : All, in CH2, CH3 and CH4.
     Ig light chains type Kappa C region  : In all CL except rabbit and Xenopus.
     Ig light chains type Lambda C region : In all CL except rabbit.
     MHC class I alpha chains : All,  in   alpha-3  domains,   including   
       in  the  cytomegalovirus MHC-1 homologous protein [6].
     Beta-2-microglobulin     : All.
     MHC class II alpha chains: All, in alpha-2 domains.
     MHC class II beta  chains: All, in beta-2 domains.
    -Other sequence(s) detected in SWISS-PROT: 68.
    -Last update: May 1991 / Text revised.
    [ 1] Gough N.
         Trends Biochem. Sci. 6:203-205(1981).
    [ 2] Klein J., Figueroa F.
         Immunol. Today 7:41-44(1986).
    [ 3] Figueroa F., Klein J.
         Immunol. Today 7:78-81(1986).
    [ 4] Orr H.T., Lancet D., Robb R.J., Lopez de Castro J.A., Strominger J.L.
         Nature 282:266-270(1979).
    [ 5] Cushley W., Owen M.J.
         Immunol. Today 4:88-92(1983).
    [ 6] Beck S., Barrel B.G.
         Nature 331:269-272(1988).

    Keep in mind is that some motifs occur so frequently that they are normally ignored in searches. For instance, the pattern for one of the Glycosylation sites is

    Assuming equal distribution of all amino acid types, this site would be flagged on .95 * .90 * .95 = 81% of all asparagine occurences. The PROSITE patterns are all stored in the file PROSITEDIR:PROSITE.PATTERNS, and those that would be ignored are preceded by a semicolon. If you do want to see these anyway, add the qualifier /FREQUENT to the command line to include these. MOTIFS only takes a few seconds to run.

    There is also a Profile version of PROSITE available. It was constructed by finding proteins that matched each PROSITE pattern, and making a profile from them. So you can also do:

    $ profilescan/infile=pir1:a1hu

    to find the profiles that match in that protein sequence. Two files will be produced: a1hu.scan and a1hu.sum. The former gives a detailed acount of what was found, the latter a brief (relatively speaking) summary. The output tends to be on the verbose side, so it isn't shown here.

    To check a DNA sequence against the TFD database, use one of the commands TFMAP, TFMAPPLOT, or TFMAPSORT. These format the output in various ways, but essentially provide the same information. These are exactly the same programs used to do restriction digests, but in that case the database of patterns is restriction enzymes rather than the TFD.

    $ tektronix versaterm term
    $ tfmapplot/infile=gb_in:dmwhite
    $ mapplot/infile=gb_in:dmwhite/ -

    Finding entries with specific documented properties

    Sometimes you want to find a database entry by the documentary information within it. For instance, by author, or by gene name. There are a lot of ways to do that. The slowest way is to use the GCG tool STRINGSEARCH which just does a serial search of the reference information in databases. An example of that was introduced above.

    The fastest method is to use one of the database index tools. The two primary ones are:

  • Entrez at the NCBI.
  • SRS at a variety of sites, mostly in Europe.
  • NCBI_RETRIEVE,NCBI_SEARCH which run through an email server at the NCBI.
  • More than any other category of tool, this particular group is in flux. For instance, the unsupported GCG program LOOKUP, which is based on an older version of SRS, is currently in an odd state, as it could not index several of the current databases, so the indices it does have are incompatible with our current databases. If you ignore the warnings, it is still fairly useful. Plans are to replace it soon with a local copy of SRS. The NCBI is also replacing the RETRIEVE server this month with one called QUERY, which is essentially a variant of ENTREZ. When that change occurs, the syntax used with NCBI_RETRIEVE will have to change.

    Next week we'll start covering tools for Molecular Biology.