Today we'll start with an overview of the major databases and then go on to discuss methods for searching them. These are the relations between the major databases:
Nucleic acid: GenBank <--> EMBL <--> DDBJ
| |
V V
Protein: GenPept ~= SwissProtein <-> PIR NRL_3D
^
|
3D Structure: PDB
There are three nucleic acid sequence databases primarily for political and historical reasons. They share data daily, so effectively this represents a single database in three different data formats - anything submitted to one ends up in all the others.
We keep a local copy of Genbank at the SAF, it is formatted for GCG access. Genbank full releases come out every 2 months. The 109.0 release had 2,162,067,871 bases, from 3,043,729 sequences. In between full releases there are daily incremental releases which we don't keep locally, but which can be searched at the NCBI. Like most of the other databases, it is experiencing exponential growth, with lately a huge number of Expressed Sequence Tag sequences coming in. Genbank is divided into the following pieces:
GCG Genbank Description gb_pr* gbpri* Primate sequence, parts 1-3. gb_ro gbrod Rodent sequence. gb_om gbmam Other mammalian sequence. gb_ov gbvrt Other vertebrate sequence. gb_in gbinv Invertebrate sequence. gb_pl* gbpln* Plant sequence (including fungi and algae), parts 1-2. gb_ba* gbbct* Bacterial sequence, parts 1-2. gb_st gbrna Structural RNA sequence. gb_vi gbvrl Viral sequence. gb_ph gbphg Phage sequence. gb_sy gbsyn Synthetic and chimeric sequence. gb_un gbuna Unannotated sequence. gb_est* gbest* EST (expressed sequence tag) sequence, parts 1-22 gb_pt gbpat Patent sequence. gb_sts gbsts STS (sequence tagged site) sequence. gb_gss* gbgss* GSS (genome survey sequence) sequence, part 1-4. gb_htg gbhtg HTGS (high throughput genomic sequencing) sequence. Related items, specific to the SAF installation: genbank - all of the above, except the ESTs. hs_u - Unigene database, Human. mm_u - Unigene database, Mouse. rn_u - Unigene database, Rat. unigene - All unigene entries.
WARNING! The HTGS division entries are not simple sequences! They are essentially snapshots of the assembly process for a YAC, BAC, P1 or other large clone, and contain a set of contigs, all spliced together in arbitrary order and orientation. If you get a hit on an HTGS entry, be sure that you read the file header, which will say where that particular contig begins and ends. Then extract just that contig for further work.
The EST divisions contain 762,208,762 bases, or about a third of the total sequence. The EST division entries are often present many times, and are often duplicates of well characterized genomic sequences. They also contain many more sequencing errors than do other entries (see the homework.) To save space, and to make searching easier, the SAF has installed the Unigene database. For the three organisms which it encompasses, this is essentially a nonredundant set of sequences, drawn from both the regular divisions and from the EST divisions. For the human division of Unigene 58, the included sequences were selected from:
17571 mRNAs + gene CDSs
410887 EST, 3'reads
299197 EST, 5'reads
+ 127901 EST, other/unknown
----------
855556 total sequences in clusters
These were reduced to only 50846 sequences, by clustering, and retaining only one final sequence per cluster. The cluster sizes were as follows:
Cluster Number
size of clusters
1 11331
2 8208
3-4 8614
5-8 7216
9-16 5049
17-32 4203
33-64 3313
65-128 1943
129-256 670
257-512 211
513-1024 65
1025-2048 19
2049-4096 3
4097-8192 1
There are four major protein sequence databases:
SwissProtein consists of peptides translated from the EMBL database, GenPept consists of peptides translated from Genbank. Since EMBL and Genbank are essentially identical, you would expect that GenPept and SwissProtein would be so as well. In one sense they are - most of the proteins found in one will be found in the other. But there are more sequences in Genpept (109.0 has 343,349 entries) than in SwissProtein (release 36.0, has 74,019 entries, 26,840,295 amino acids worth of sequence.) This comes about because the DNA databases have multiple independent sequences of the same genes, and Genpept faithfully translates all of these copies, whereas SwissProtein tends to merge them into a single entry. The result is that Genpept has many more repeats of the exact same sequence than does SwissProtein. Genpept is essentially unannotated, whereas the annotation in SwissProtein is very complete.
PIR is an independent protein database - some of the entries come from protein sequencing and so forth, and so are not present in SwissProtein. However, overall, most of what is in PIR is also in SwissProtein, and vice versa. The SAF keeps a local copy of the current PIR release, ( 58.0, which has 116738 sequences, 37,460,341 residues.)
The NRL_3D is a useful database - it contains the sequences for every unique structure in the PDB (protein databank) - a database of protein structures. It is often a very good idea to search this to find out if a structure homologous to one's query sequence is known. The SAF keeps a local copy of the current NRL_3D relaase. (35.0, which has 14,791 sequences, 2,636,724 residues.)
There are several pattern databases:
Other databases the SAF keeps current are:
There are dozens and dozens of other biologically relevant databases, and in particular, many organism specific sequence databases. These tend to slowly percolate into GENBANK, but they are sometimes handy to have broken out separately, so that you can search just the entries for those organisms. Here are some of the databases we have loaded - others can be installed by request:
GCG Description YEAST Yeast DNA (complete genome): 110 Kb pieces, 10 Kb overlaps YP Yeast proteins Y_ORF Yeast ORFs NRAT Arabidopsis thaliana, nonredundant peptides H_INFLUENZAE H. influenzae DNA (complete genome) M_JANNASCHII M. jannaschii DNA (complete genome)
It is worth remembering that GCG can work with lists, so that you can specify a search for a single organism even if they are not broken out into their own database. Example:
$ stringsearch/infile=sw:*/string="bos taurus"/out=cow.lis/menu=a/noscreen $ fasta/infile1=query.pep/infile2=@cow.lis/batch/default
This would create a list file cow.lis which has one line for each Bos taurus entry, and the subsequent search would only look at those entries. However, this method is actually several times slower than just searching the whole database, so unless the list is quite short, it is usually better to search the whole database and then search the result for the organism of interest.
While you might conceivably want to submit data to almost any of these databases, in practice most of you will only ever submit data to Genbank, so I'll quickly go over how you do that.
Up until December 1998 it was possible to obtain a simple form, fill it out, and email it back to Genbank. Since then, the only submissions they will accept must come from either BankIt http://www.ncbi.nlm.nih.gov/BankIt or Sequin http://www.ncbi.nlm.nih.gov/Sequin/index.html. BankIt is a web interface, Sequin is software you install on your computer. The NCBI would prefer that you use Sequin, but if you only rarely submit sequences, and they are not overly complex you might be able to get by with BankIt, which will save you the trouble of setting up Sequin on your computer. However, Sequin, once properly configured, is less prone to glitches than is BankIt, and it would be a good idea to bite the bullet and configure at least one computer in your lab to run Sequin.
With either tool, after you submit an entry, they will email you back an accession number, usually within a couple of hours. If you have your mail forwarded, remember to look there for your accession number. Remember, they will not release your entry until you tell them to, so don't think that just because your paper has come out in print that Genbank will automatically release your entry! You can get them to release your entry through the BankIt interface, but it might be easiest to just send email to gb-admin@ncbi.nlm.nih.gov. Supply in the body of the message:
There are several more or less "standard" sequence file formats which you will encounter when doing sequence analysis work. Is isn't always clear from the name which type a file is, so you should know what they look like so that you will know when to convert them. The most common ones are:
HTNHYLTRRRRIEMAHALCLTERQIKI
>gi|4102777 homeobox ultrabithorax protein HTNHYLTRRRRIEMAHALCLTERQIKI
LOCUS 4102763 27 aa 17-DEC-1997
DEFINITION homeobox ultrabithorax protein.
ACCESSION 4102763
PID g4102763
DBSOURCE GENBANK: locus AF015936, accession AF015936
KEYWORDS .
SOURCE Trypetesa lampas.
ORGANISM Trypetesa lampas
Eukaryotae; Metazoa; Arthropoda; Crustacea; Cirripedia;
Acrothoracica; Apygophora; Trypetesidae; Trypetesa.
REFERENCE 1 (residues 1 to 27)
AUTHORS Mouchel-Vielh,E., Rigolot,C. and Deutsch,J.
TITLE Acrothoracican homeobox genes
JOURNAL Unpublished
REFERENCE 2 (residues 1 to 27)
AUTHORS Mouchel-Vielh,E., Rigolot,C. and Deutsch,J.
TITLE Direct Submission
JOURNAL Submitted (25-JUL-1997) Developement et evolution, CNRS URA1135, 9
Quai Saint Bernard, Paris 75005, France
COMMENT Method: conceptual translation supplied by author.
FEATURES Location/Qualifiers
source 1. .27
/organism="Trypetesa lampas"
/db_xref="taxon:37713"
Protein <1. .>27
/product="homeobox ultrabithorax protein"
CDS 1. .27
/gene="Ubx"
/coded_by="AF015936:<1. .>81"
ORIGIN
4102763.Pep Length: 27 February 3, 1999 11:24 Type: P Check: 8741 ..
1 HTNHYLTRRR RIEMAHQLCL TERQIKI
LOCUS 4102763 27 aa 17-DEC-1997
DEFINITION homeobox ultrabithorax protein.
ACCESSION 4102763
PID g4102763
DBSOURCE GENBANK: locus AF015936, accession AF015936
KEYWORDS .
SOURCE Trypetesa lampas.
ORGANISM Trypetesa lampas
Eukaryotae; Metazoa; Arthropoda; Crustacea; Cirripedia;
Acrothoracica; Apygophora; Trypetesidae; Trypetesa.
REFERENCE 1 (residues 1 to 27)
AUTHORS Mouchel-Vielh,E., Rigolot,C. and Deutsch,J.
TITLE Acrothoracican homeobox genes
JOURNAL Unpublished
REFERENCE 2 (residues 1 to 27)
AUTHORS Mouchel-Vielh,E., Rigolot,C. and Deutsch,J.
TITLE Direct Submission
JOURNAL Submitted (25-JUL-1997) Developement et evolution, CNRS URA1135, 9
Quai Saint Bernard, Paris 75005, France
COMMENT Method: conceptual translation supplied by author.
FEATURES Location/Qualifiers
source 1..27
/organism="Trypetesa lampas"
/db_xref="taxon:37713"
Protein <1..>27
/product="homeobox ultrabithorax protein"
CDS 1..27
/gene="Ubx"
/coded_by="AF015936:<1..>81"
ORIGIN
1 htnhyltrrr riemahqlcl terqiki
//
Seq-entry ::= set {
class nuc-prot ,
descr {
create-date
std {
year 1997 ,
month 7 ,
etc. etc. ad nauseum
The programs FromGenbank, ToGenbank, FromFasta, ToFasta, FromText, ToText are available for converting between the named format and the GCG format. Additionally, D. Gilbert's program READSEQ can convert the sequence, but not usually the comments, between 18 different sequence formats.
Now that you have an overview of the databases that are available lets consider how you would go about searching them.
First lets consider searching sequence databases with a single sequence. That is, you've cloned something and want to know if that or something like it is already known. Last week we discussed the Smith Waterman method for aligning two sequences. That method is very computationally intensive, and it isn't usually used fpr routine database searching on a normal computer. On the other hand, it will complete in a reasonable time for some situations. For instance the commands:
$ fasta3 "-Q" "-O" a1hu.fasta3 "-E" 0.1 -b 20 -d 1 a1hu.fasta "c" $ ssearch3 "-Q" "-O" a1hu.ssearch3 "-E" 0.1 -b 20 -d 1 a1hu.fasta "c"
Do a FASTA and Smith Waterman search against the Swiss-Protein database using a query sequence in fasta format. In both cases, only hits expected to occur 0.1 times in the database, or less are returned, and the list of hits is restricted to 20, the number of alignments returned, to 1. "-Q" suppresses messages during the run, "-O" specifies the output filename. The last two arguments are the query sequence file name, and the letter of the library to search. (Run it with no command line arguments once to get the full list.) FASTA3 took 173 seconds to scan this database, but the Smith Waterman progam needed 1000 seconds.
The equivalent GCG FASTA command is:
$ FASTA/infile=pir1:a1hu/infile2=sw: -
/outfile=a1hu.gcg_fasta/list=20/default
However, the Smith Waterman method can be run quickly on some massively parallel computers, for instance, those made by MasPar, which have been programmed to use this algorithm, and which you can access over the Internet. All the servers that I'm aware of only search the SwissProtein database. You can use this method on SEQAXP by issuing the command:
$ CBRG
and then selecting Option 1, PepPepSearch. When the search is completed it will be mailed back to you. It may take hours to days for this to occur though.
When comparing only two sequences, efficiency isn't all that important, but when comparing 50,000 sequences, as in SwissProtein, or 3 million, as in Genbank, efficiency is very important! The Smith-Waterman method, while being very thorough, is not very efficient, and so the common database search methods are based on other algorithms that are much faster.
0 0 0 0 0 0 0 5 0 0 0 1 0 0 0 0 2 0 | | | | | | | | | | | | | | | | | | . . . . . . . . . . . . . . . . . . - 0 . . . . . . . . . . . . . . . . . . - 0 . . . . . + . . . . . . . . . . . . - 0 . . . . + . . . . . . . . . . . . . - 0 . . . + . . . . . . . . . . . . . . - 0 . . + . . . . . . . + . . . . . . . - 1 . + . . . . . . . + . . . . . . . . - 0 . . . . . . . . . . . . . . . + . . - 0 . . . . . . . . . . . . . . . . . . - 0 . . + . . . . . . . . . . . . . . . - 0 . . . . . . . . . . . . . . . . . . - 0
One such method is Pearson's FASTA (and variants of it). FASTA starts by making a generalization from the idea of dotplots. As we discussed last week, In a dotplot regions of similarity between two sequences show up as diagonal streaks. In FASTA, the sum of the dots along each diagonal is calculated, as shown in the diagram above. If there is a region of high similarity the diagonal sum will be higher than the background level found in surrounding diagonals. The fast in FASTA comes from the method of calculating those sums. If it were necessary to actually construct the dotplot matrix and then add along the diagonals FASTA wouldn't be much quicker than Smith-Waterman. Instead, it uses a word or tuple based method. That is, it makes a list of all W-symbol long words, where W is usually 1 or 2 for amino acids, or 5 or 6 for nucleotides, and their positions, for each sequence. It then goes through, and for each identical word in each list, it calculates which diagonals each pair of positions would fall into, but it only counts nonoverlapping words. It then rescores the highest scoring regions using a replacement matrix such as the BLOSUM50 - the best of these scores is called init1. It then tries to join together several high scoring diagonals, the best score from that is called initn. Finally, the program will try to make an optimal local alignment around the regions it has discovered, and that score is called opt. Depending on the version of FASTA and the command line parameters, that final optimization will either be by Smith-Waterman or another similar algorithm, in either case, it is applied near the segments of similarity found already - not the whole protein. This last optimization step is only applied to a small number of "hit" sequences, which had high initn values after the database search.
Recent versions of FASTA, such as FASTA3, but not the FASTA in the GCG package, also have methods of estimating the statistical likelihood of a match with a given score. Refer to the most recent Pearson reference in the homework for a discussion of the Z and E values. Here is what the output from FASTA3 looks like:
a1hu.fasta, 353 aa
vs Swissprot library
opt E()
< 20 176 0:==
22 0 0: one = represents 130 library sequences
24 0 0:
26 2 2:*
28 1 17:*
30 18 102:*
32 125 393:= *
34 540 1065:===== *
36 1738 2187:============== *
38 3743 3614:===========================*=
40 5825 5041:======================================*======
42 7267 6163:===============================================*========
44 7747 6798:====================================================*=======
46 7354 6924:=====================================================*===
48 6988 6629:==================================================*===
50 6047 6049:==============================================*
52 5391 5318:========================================*=
54 4220 4542:================================= *
56 3356 3794:========================== *
58 2909 3115:=======================*
60 2188 2523:================= *
62 1739 2023:============== *
64 1293 1609:========== *
66 983 1272:======== *
68 728 1000:====== *
70 646 784:===== *
72 506 612:====*
74 387 478:===*
76 354 372:==*
78 300 289:==*
80 220 224:=*
82 181 172:=*
84 164 136:=*
86 134 105:*=
88 102 81:* inset = represents 4 library sequences
90 66 63:*
92 62 49:* :============*===
94 53 38:* :=========*====
96 56 29:* :=======*======
98 53 23:* :=====*========
100 30 17:* :====*===
102 37 14:* :===*======
104 20 10:* :==*==
106 24 8:* :=*====
108 22 6:* :=*====
110 15 5:* :=*==
112 17 4:* :*====
114 18 3:* :*====
116 13 2:* :*===
118 8 2:* :*=
>120 153 1:*= :*======================================
26840295 residues in 74019 sequences
statistics extrapolated from 50000 to 73710 sequences
Expectation fit: rho(ln(x))= 6.4951+/-0.000554; mu= 2.3794+/- 0.031;
mean_var=119.7253+/-23.985
Kolmogorov-Smirnov statistic: 0.0337 (N=29) at 52
FASTA (3.08 July, 1997) function (optimized, BL50 matrix) ktup: 2
join: 37, opt: 25, gap-pen: -12/ -2, width: 16 reg.-scaled
The best scores are: initn init1 opt z-sc E(73710)
ALC1_HUMAN P01876 homo sapiens (human ( 353) 2435 2435 2435 2238.4 5.3e-118
ALC1_GORGO P20758 gorilla gorilla gor ( 353) 2377 2377 2377 2185.4 4.7e-115
ALC2_HUMAN P01877 homo sapiens (human ( 340) 2214 1549 1584 1460.9 1.1e-74
ALC_MOUSE P01878 mus musculus (mouse) ( 344) 1212 1070 1253 1158.3 7.7e-58
ALC_RABIT P01879 oryctolagus cuniculu ( 299) 1080 966 1106 1024.8 2.1e-50
MUC_MOUSE P01872 mus musculus (mouse) ( 455) 533 236 612 570.8 4.0e-25
MUC_HUMAN P01871 homo sapiens (human) ( 454) 599 227 596 556.2 2.6e-24
MUC_RABIT P03988 oryctolagus cuniculu ( 458) 486 220 584 545.2 1.1e-23
MUCB_HUMAN P04220 homo sapiens (human ( 391) 438 227 580 542.5 1.5e-23
GC2_HUMAN P01859 homo sapiens (human) ( 326) 315 132 563 528.0 9.8e-23
GC4_HUMAN P01861 homo sapiens (human) ( 327) 270 123 551 517.0 4.0e-22
GC1_HUMAN P01857 homo sapiens (human) ( 330) 305 127 549 515.1 5.1e-22
MUC_MESAU P06337 mesocricetus auratus ( 454) 396 165 550 514.2 5.8e-22
MUC_CHICK P01875 gallus gallus (chick ( 446) 434 232 545 509.7 1.0e-21
GCAA_MOUSE P01863 mus musculus (mouse ( 330) 273 122 536 503.3 2.3e-21
MUC_SUNMU P20768 suncus murinus (hous ( 457) 524 242 538 503.2 2.4e-21
GCAB_MOUSE P01864 mus musculus (mouse ( 335) 273 116 534 501.3 3.0e-21
GCAM_MOUSE P01865 mus musculus (mouse ( 399) 246 122 530 496.7 5.5e-21
MUCM_MOUSE P01873 mus musculus (mouse ( 476) 422 207 525 491.0 1.1e-20
GC1_RAT P20759 rattus norvegicus (rat ( 326) 180 138 516 485.1 2.4e-20
>>ALC1_HUMAN P01876 homo sapiens (human). ig alpha-1 chai (353 aa)
initn: 2435 init1: 2435 opt: 2435 Z-score: 2238.4 expect() 5.3e-118
Smith-Waterman score: 2435; 100.000% identity in 353 aa overlap
10 20 30 40 50 60
A1HU ASPTSPKVFPLSLCSTQPDGNVVIACLVQGFFPQEPLSVTWSESGQGVTARNFPPSQDAS
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
ALC1_H ASPTSPKVFPLSLCSTQPDGNVVIACLVQGFFPQEPLSVTWSESGQGVTARNFPPSQDAS
10 20 30 40 50 60
70 80 90 100 110 120
A1HU GDLYTTSSQLTLPATQCLAGKSVTCHVKHYTNPSQDVTVPCPVPSTPPTPSPSTPPTPSP
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
ALC1_H GDLYTTSSQLTLPATQCLAGKSVTCHVKHYTNPSQDVTVPCPVPSTPPTPSPSTPPTPSP
70 80 90 100 110 120
130 140 150 160 170 180
A1HU SCCHPRLSLHRPALEDLLLGSEANLTCTLTGLRDASGVTFTWTPSSGKSAVQGPPERDLC
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
ALC1_H SCCHPRLSLHRPALEDLLLGSEANLTCTLTGLRDASGVTFTWTPSSGKSAVQGPPERDLC
130 140 150 160 170 180
190 200 210 220 230 240
A1HU GCYSVSSVLPGCAEPWNHGKTFTCTAAYPESKTPLTATLSKSGNTFRPEVHLLPPPSEEL
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
ALC1_H GCYSVSSVLPGCAEPWNHGKTFTCTAAYPESKTPLTATLSKSGNTFRPEVHLLPPPSEEL
190 200 210 220 230 240
250 260 270 280 290 300
A1HU ALNELVTLTCLARGFSPKDVLVRWLQGSQELPREKYLTWASRQEPSQGTTTFAVTSILRV
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
ALC1_H ALNELVTLTCLARGFSPKDVLVRWLQGSQELPREKYLTWASRQEPSQGTTTFAVTSILRV
250 260 270 280 290 300
310 320 330 340 350
A1HU AAEDWKKGDTFSCMVGHEALPLAFTQKTIDRLAGKPTHVNVSVVMAEVDGTCY
:::::::::::::::::::::::::::::::::::::::::::::::::::::
ALC1_H AAEDWKKGDTFSCMVGHEALPLAFTQKTIDRLAGKPTHVNVSVVMAEVDGTCY
310 320 330 340 350
353 residues in 1 query sequences
26840295 residues in 74019 library sequences
Scomplib [version 3.0t81 August 28, 1997]
start: Wed Feb 3 12:29:24 1999 done: Wed Feb 3 12:32:12 1999
Scan time: 167.000 Display time: 1.000
Function used was FASTA
A few things to keep in mind with FASTA:
Another searching algorithm that most of you have used, or will use, is BLAST, which stands for Basic Local Alignment Search Tool. BLAST is also a word or tuple based method. However, the method is considerably different than that used in FASTA. One major difference is that BLAST requires a preformatted search database. That is, one has to take all of SwissProtein, for instance, and convert it to a different format before it can be searched. Because this would effectively require a second copy of every database, BLAST searches are run remotely, on a pair of servers located at the NCBI. These servers apparently have a tremendous amount of RAM, and so are able to keep entire databases in memory at one time, and not have to read it in from disk for each search - this is probably one reason why BLAST is so fast. It is possible to create a local BLAST database though, so if need be, one can BLAST against one's own personal database of sequences.
There are currently three major BLAST variants: original, gapped, and psi. The original variant did not handle gaps directly, but did have a way of summing scores from a series of ungapped ordered similarities. The gapped variant does handle gaps, and is about three times faster than the original. PSI-BLAST does a first pass using gapped blast, then prepares a profile from the results and does a second pass through the database looking for matches to the profile.
All three start work in roughly the same way. The program takes each word from the query sequence (for proteins, that is 3 symbols long, for nucleotides 11 or 13), and then locates all of the words in the neighborhood of the current test sequence. A neighborhood consists of all those words that score greater than a certain value against the query word, for a given comparison matrix. Once the neighborhood words are ascertained, BLAST tries to expand the alignment around both the query and test sequence locations. For the original variant, it tries very hard to expand the regions, for the gapped variant it tries somewhat less hard. The "magic" in BLAST involves the heavy use of statistics in deciding when to stop expanding one of these seed regions, and that revolves around two parameters called lambda and K, which are calculated from first principles in the original BLAST, but must be estimated for gapped BLAST. Read the papers in the reference for more information. It then stores that score and the segment position and tests another word. At the end of this process, it selects a set of high-scoring segment pairs (HSPs). The gapped variant then tries to find two HSPs which are close to each other, and to merge those into one larger alignment. What it gives up in extension it gains back in the ability to close gaps. At the end of the database search the program has a list of the largest HSPs, or the largest aligned regions, and returns that. The original variant will check to see if there are several HSPs in order and nonoverlapping, and will then calculate an aggregate score - effectively allowing for certain types of gaps. It will do this even if these may not be significant scores by themselves, which is why it tries harder than the gapped variant to extend the HSPs. You will see these in the output file as a series of segments marked with P(N), where N is 2,3, or whatever.
Here is how the three BLAST variants stack up against each other when tested against a random sequence database (negative control) and SwissProtein (experiment).
The three BLAST variants against a database of random peptides.
Query Original Gapped PSI-BLAST
E-Value: E-Value: E-Value:
Low <=1 <=10 Low <=1 <=10 Low <=1 <=10
P00762 0.86 1 7 3.0 0 4 0.94 1 8
P01008 3.9 0 4 0.078 1 9 1.5 0 9
P01111 3.4 0 8 3.4 0 7 1.1 0 9
P02232 2.4 0 7 2.8 0 5 8.2 0 2
P03435 0.11 2 11 0.46 3 16 0.87 1 8
P05013 2.4 0 6 0.27 2 4 0.11 2 11
P07327 1.5 0 2 0.80 1 5 1.5 0 9
P10318 0.91 1 7 0.13 1 7 0.0031 2 6
P10635 0.84 2 5 8.5 0 3 0.46 1 15
P14942 1.0 1 10 3.3 0 3 0.30 2 9
P20705 0.012 1 8 0.26 2 14 0.79 2 10
Average 1.0 0.7 6.8 0.80 0.9 7.0 0.87 1.0 8.7
From this you can see that they all do about equally well in rejecting false matches, although PSI-BLAST does return one result which would probably be misinterpreted as being significant when it is not.
Number of matches with E <= .01 for Smith-Waterman and three BLAST variants against SwissProtein.Query Smith- Original Gapped PSI-BLAST Sequence Waterman P00762 275 273 275 286 P01008 108 105 108 111 P01111 255 249 252 375 P02232 28 26 28 623 P03435 128 114 128 130 P05013 53 53 53 53 P07327 138 128 137 160 P10318 262 241 261 338 P10635 211 197 211 224 P14942 83 79 81 142 P20705 198 191 197 207 Run times 3600 100 34 87 (Normalized)
Data in preceding two tables is from Altschul et al. (1997) Nucleic Acids Research 25(17):3389-3402. The queries sequences were: P00762 Serine protease, P01008 Serine protease inhibitor, PO1111 Ras, P02232 Globin, P03435 Hemagglutinin, P05013 Interferon a, P07327 Alcohol dehydrogenase, P10318 Histocompatibility antigen, P10635 Cytochrome P450, P14942 Glutathione transferase, P20705 H+-transporting ATPsynthase.
So in terms of sensitivity, there isn't much reason to prefer the gapped one over the original. On the other hand, the gapped one is faster, so all else being equal, you should use that. PSI-BLAST usually returns a few more hits than either of the other two variants, and even finds more things than does Smith-Waterman. It can do this because there is more information in a profile method than in the simple pairwise comparisons that all of the others use. In the one highlighted case above, the query was a plant globin, and PSI-BLAST managed to pick up some animal globins, which then came to dominate the profile. After a few more iterations it comes back with all animal globins, but the original query is no longer on the list. That is, the profile has converged on another family of related genes, to the exclusion of the original!
BLAST considerations. BLAST is a very complicated tool. Some of the things you should keep in mind.
Here's an example run of BLAST2, from the command line, on Seqaxp.
$ blast_datalib="swissprot"
$ blast_program="blastp"
$ blast_alignments=1
$ blast_descriptions=20
$ blast2 pir1:a1hu
353 symbols written into "Kill_12470457.Seq".
output will be A1HU.blastp
%DCL-S-SPAWNED, process K89A412470457 spawned
Now processing job, subprocess is 20209180
still waiting at 3-FEB-1999 12:47:27.21
BLAST job completed at 3-FEB-1999 12:47:33.22, results in A1HU.blastp
$ type a1hu.blastp
HTTP/1.0 200 OK MIME-Version: 1.0 Content-type: text/html
Commencing search, please wait for results.
BLASTP 2.0.7 [Dec-21-1998]
Reference:
Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
Query= A1HU Ig alpha-1 chain C region - human
(353 letters)
Database: Non-redundant SwissProt sequences
77,419 sequences; 27,864,727 total letters
Searching..................................................done
If you have any problems or questions with the results of this search
please refer to the BLAST FAQs
Score E
Sequences producing significant alignments: (bits) Value
sp|P01876|ALC1_HUMAN IG ALPHA-1 CHAIN C REGION 669 0.0
sp|P20758|ALC1_GORGO IG ALPHA-1 CHAIN C REGION 656 0.0
sp|P01877|ALC2_HUMAN IG ALPHA-2 CHAIN C REGION 633 0.0
sp|P01878|ALC_MOUSE IG ALPHA CHAIN C REGION 393 e-109
sp|P01879|ALC_RABIT IG ALPHA CHAIN C REGION 311 1e-84
sp|P01871|MUC_HUMAN IG MU CHAIN C REGION 186 5e-47
sp|P04220|MUCB_HUMAN IG MU HEAVY CHAIN DISEASE PROTEIN (BOT) 183 4e-46
sp|P03988|MUC_RABIT IG MU CHAIN C REGION 181 2e-45
sp|P01872|MUC_MOUSE IG MU CHAIN C REGION 180 3e-45
sp|P06337|MUC_MESAU IG MU CHAIN C REGION 167 3e-41
sp|P20768|MUC_SUNMU IG MU CHAIN C REGION 160 3e-39
sp|P01874|MUC_CANFA IG MU CHAIN C REGION 159 1e-38
sp|P01875|MUC_CHICK IG MU CHAIN C REGION 158 2e-38
sp|P04221|MUCM_RABIT IG MU CHAIN C REGION MEMBRANE-BOUND FORM 157 4e-38
sp|P01873|MUCM_MOUSE IG MU CHAIN C REGION MEMBRANE-BOUND FORM 153 4e-37
sp|P23084|HVC1_HETFR IG HEAVY CHAIN C REGION (CLONE 6125) 149 8e-36
sp|P23085|HVC2_HETFR IG HEAVY CHAIN C REGION (CLONE 12022) 145 9e-35
sp|P23087|HVCS_HETFR IG HEAVY CHAIN C REGION, SECRETED FORM (CL... 143 5e-34
sp|P23086|HVC3_HETFR IG HEAVY CHAIN C REGION (CLONE 6121) 143 6e-34
sp|P01857|GC1_HUMAN IG GAMMA-1 CHAIN C REGION 133 5e-31
sp|P01876|ALC1_HUMAN IG ALPHA-1 CHAIN C REGION
Length = 353
Score = 669 bits (1708), Expect = 0.0
Identities = 326/353 (92%), Positives = 326/353 (92%)
Query: 1 ASPTSPKVFPLSLCSTQPDGNVVIACLVQGFFPQEPLSVTWSESGQGVTARNFPPSQDAS 60
ASPTSPKVFPLSLCSTQPDGNVVIACLVQGFFPQEPLSVTWSESGQGVTARNFPPSQDAS
Sbjct: 1 ASPTSPKVFPLSLCSTQPDGNVVIACLVQGFFPQEPLSVTWSESGQGVTARNFPPSQDAS 60
Query: 61 GDLYTTSSQLTLPATQCLAGKSVTCHVKHYTNPSQDXXXXXXXXXXXXXXXXXXXXXXXX 120
GDLYTTSSQLTLPATQCLAGKSVTCHVKHYTNPSQD
Sbjct: 61 GDLYTTSSQLTLPATQCLAGKSVTCHVKHYTNPSQDVTVPCPVPSTPPTPSPSTPPTPSP 120
Query: 121 XXXHPRLSLHRPALEDLLLGSEANLTCTLTGLRDASGVTFTWTPSSGKSAVQGPPERDLC 180
HPRLSLHRPALEDLLLGSEANLTCTLTGLRDASGVTFTWTPSSGKSAVQGPPERDLC
Sbjct: 121 SCCHPRLSLHRPALEDLLLGSEANLTCTLTGLRDASGVTFTWTPSSGKSAVQGPPERDLC 180
Query: 181 GCYSVSSVLPGCAEPWNHGKTFTCTAAYPESKTPLTATLSKSGNTFRPEVHLLPPPSEEL 240
GCYSVSSVLPGCAEPWNHGKTFTCTAAYPESKTPLTATLSKSGNTFRPEVHLLPPPSEEL
Sbjct: 181 GCYSVSSVLPGCAEPWNHGKTFTCTAAYPESKTPLTATLSKSGNTFRPEVHLLPPPSEEL 240
Query: 241 ALNELVTLTCLARGFSPKDVLVRWLQGSQELPREKYLTWASRQEPSQGTTTFAVTSILRV 300
ALNELVTLTCLARGFSPKDVLVRWLQGSQELPREKYLTWASRQEPSQGTTTFAVTSILRV
Sbjct: 241 ALNELVTLTCLARGFSPKDVLVRWLQGSQELPREKYLTWASRQEPSQGTTTFAVTSILRV 300
Query: 301 AAEDWKKGDTFSCMVGHEALPLAFTQKTIDRLAGKPTHVNVSVVMAEVDGTCY 353
AAEDWKKGDTFSCMVGHEALPLAFTQKTIDRLAGKPTHVNVSVVMAEVDGTCY
Sbjct: 301 AAEDWKKGDTFSCMVGHEALPLAFTQKTIDRLAGKPTHVNVSVVMAEVDGTCY 353
CPU time: 0.12 user secs. 0.06 sys. secs 0.18 total secs.
Database: Non-redundant SwissProt sequences
Posted date: Feb 2, 1999 3:30 AM
Number of letters in database: 27,864,727
Number of sequences in database: 77,419
Lambda K H
0.315 0.131 0.00
Gapped
Lambda K H
0.270 0.0470 7.29e-304
Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Hits to DB: 19690490
Number of Sequences: 77419
Number of extensions: 807756
Number of successful extensions: 2530
Number of sequences better than 1.0: 154
Number of HSP's better than 1.0 without gapping: 52
Number of HSP's successfully gapped in prelim test: 102
Number of HSP's that attempted gapping in prelim test: 1951
Number of HSP's gapped (non-prelim): 369
length of query: 353
length of database: 27864727
effective HSP length: 51
effective length of query: 302
effective length of database: 23916358
effective search space: 7222740116
effective search space used: 7222740116
T: 11
A: 40
X1: 16 ( 7.3 bits)
X2: 38 (14.8 bits)
X3: 64 (24.9 bits)
S1: 41 (21.6 bits)
S2: 73 (32.8 bits)
The last sequence searching method I'm going to talk about today is called FRAMESEARCH. It is specifically designed for dealing with bad data. That is, it can find an alignment between a protein query and a nucleotide test sequence even if the latter contains frameshifting gaps. The main reason you need to know about this program is that there are a lot of sequences in the databases, primarily in the EST divisions, that do contain frameshifts in the sequence. A fair estimate is that over half of the EST sequences have these errors. Since many protein coding regions are open in all frames these can still show up in a search with BLAST or FASTA, when the in frame part scores highly, but they will mess up any subsequent multiple sequence alignment because the section on the other side of the frameshift will be incorrect. In a really junky sequence there might be multiple frame shifts, so that nothing is long enough to show up in BLAST or FASTA. FRAMESEARCH, unfortunately, is a very slow program, as it is an extended version of the Smith-Waterman method. One thing that you might want to do, which is much less time consuming, is to run FRAMESEARCH or FRAMEALIGN directly against any EST hit you find with either FASTA or BLAST. That is usually acceptable because you will find the sequences which interest you with these two general tools, but can then get the nucleic acid lined up better with the protein with the more specialized tool. The FASTX3 and TFASTX3 programs from Pearson can also handle frameshifts.
Before we move on, I'd like to make one more important point. All of these search algorithms will at some "distance" between two sequences be unable to find one using the other as a query. Consider three different sequences A,B, and C. For any sequence comparison method, it is common to find this situation:
A finds B
misses C
B finds A
finds C
C finds B
misses A
In the example shown above, using A as a query we may find B, but not be able to find C, yet C may really be distantly related to A. Therefore, it is important to take the sequences that are found in a database search, and to search the database again using those. In the example shown, B would find A, but also C. Obviously, you want to redo searches with the more divergent hits - if the original query was mouse trypsin, there isn't much point redoing the search with rat trypsin!
So far we've discussed how to find a sequence query in a sequence database. Sometimes instead you want to find a pattern query in a sequence database. There are basically two kinds of patterns that the GCG package can search for. The first is called a pattern, and it consists of string that describes the pattern, for instance, Threonine or Serine, a gap of 3 to 5, an Alanine then a Glycine. The second is called a profile, and it consists of an array of weights for each type of residue for each position in a window of some length. Profiles are similar to what we saw in the last lecture with the Gibbs program.
The program to use to find patterns is called, naturally, FINDPATTERNS. You use this when you have a pattern of your own to search a database with. You need only know the correct syntax to express your pattern, the key pieces of that syntax are:
AGCT Sequence "AGCT" (also Y,R and other degenerate codes)
~C Not C
{n,m} previous symbol or expression repeated n to m times
(A,G,C) A or G or C
(AG,G,C) AG or G or C
< Beginning of sequence
> End of sequence
$ findpatterns/infile=gb:dmwhite/pattern="AGCTN{1,10}(AG,GT)~YAA"
$ type dmwhite.find
! FINDPATTERNS on Gb_In:Dmwhite allowing 0 mismatches
! 1 AGCTN{1,10}(AG,GT)~YAA February 2, 1999 15:02 ..
Dmwhite ck: 9858 len: 14,245 ! X02974
Drosophila melanogaster DNA sequence of white locus. 2/93
1 AGCTN{1,10}(AG,GT)~YAA
AGCTN{4}(AG)~YAA
4,041: TAGTT AGCTAGAGAGAGAAA CGGCA
1 /Rev TT~R(AC,CT)N{1,10}AGCT
TT~R(AC)N{5}AGCT
1,146: TGGAT TTTACAGCCAAGCT TAGCT
TT~R(AC)N{5}AGCT
12,685: AGGTG TTCACCTCAGAGCT GCCAG
Total finds: 3
Total length: 14,245
Total sequences: 1
CPU time: 00.33
Things to watch out for with FINDPATTERNS:
There is a very complete discussion of profiles in the GCG manual, so this will only be a quick overview. The profile methods work this way:
Sometimes the problem is inverted, you want to test a new sequence to see if it contains a known pattern or profile. For instance, you might want to check a protein for a match in PROSITE, or a piece of DNA for a match in the TFD or TRANSFAC databases.
To see if a protein contains a PROSITE motif, use the program MOTIFS. It is pretty much self explanatory.
$ motifs/infile=pir1:a1hu
$ type a1hu.motifs
MOTIFS from: Pir1:A1hu
Mismatches: 0 February 2, 1999 14:50 ..
A1hu Check: 6072 Length: 353 ! Ig alpha-1 chain C region - human
__________________________________________________________________________
Ig_Mhc (F,Y)xCx(V,A)xH
(F)xCx(V)xH
311: KKGDT FSCMVGH EALPL
***********************************************************************
Immunoglobulins and major histocompatibility complex proteins signature
***********************************************************************
The basic structure of immunoglobulin (Ig) [1] molecules is a
tetramer of two light chains and two heavy chains linked by
disulfide bonds. There are two types of light chains: kappa and
lambda, each composed of a constant domain (CL) and a variable domain
(VL). There are five types of heavy chains: alpha, delta, epsilon,
gamma and mu, all consisting of a variable domain (VH) and three
(in alpha, delta and gamma) or four (in epsilon and mu)
constant domains (CH1 to CH4).
The major histocompatibility complex (MHC) molecules are made of two
chains. In class I [2] the alpha chain is composed of three
extracellular domains, a transmembrane region and a cytoplasmic
tail. The beta chain (beta-2- microglobulin) is composed of a
single extracellular domain. In class II [3], both the alpha and the
beta chains are composed of two extracellular domains, a transmembrane
region and a cytoplasmic tail.
It is known [4,5] that the Ig constant chain domains and
a single extracellular domain in each type of MHC chains are
related. These homologous domains are approximately one
hundred amino acids long and include a conserved intradomain
disulfide bond. We developed a small pattern around the C-terminal
cysteine involved in this disulfide bond which can be used to
detect these category of Ig related proteins.
-Consensus pattern: [FY]-x-C-x-[VA]-x-H
-Sequences known to belong to this class detected by the pattern:
Ig heavy chains type Alpha C region : All, in CH2 and CH3.
Ig heavy chains type Delta C region : All, in CH3.
Ig heavy chains type Epsilon C region: All, in CH1, CH3 and CH4.
Ig heavy chains type Gamma C region : All, in CH3 and also CH1 in some cases
Ig heavy chains type Mu C region : All, in CH2, CH3 and CH4.
Ig light chains type Kappa C region : In all CL except rabbit and Xenopus.
Ig light chains type Lambda C region : In all CL except rabbit.
MHC class I alpha chains : All, in alpha-3 domains, including
in the cytomegalovirus MHC-1 homologous protein [6].
Beta-2-microglobulin : All.
MHC class II alpha chains: All, in alpha-2 domains.
MHC class II beta chains: All, in beta-2 domains.
-Other sequence(s) detected in SWISS-PROT: 68.
-Last update: May 1991 / Text revised.
[ 1] Gough N.
Trends Biochem. Sci. 6:203-205(1981).
[ 2] Klein J., Figueroa F.
Immunol. Today 7:41-44(1986).
[ 3] Figueroa F., Klein J.
Immunol. Today 7:78-81(1986).
[ 4] Orr H.T., Lancet D., Robb R.J., Lopez de Castro J.A., Strominger J.L.
Nature 282:266-270(1979).
[ 5] Cushley W., Owen M.J.
Immunol. Today 4:88-92(1983).
[ 6] Beck S., Barrel B.G.
Nature 331:269-272(1988).
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Keep in mind is that some motifs occur so frequently that they are normally ignored in searches. For instance, the pattern for one of the Glycosylation sites is
There is also a Profile version of PROSITE available. It was constructed by finding proteins that matched each PROSITE pattern, and making a profile from them. So you can also do:
$ profilescan/infile=pir1:a1hu
to find the profiles that match in that protein sequence. Two files will be produced: a1hu.scan and a1hu.sum. The former gives a detailed acount of what was found, the latter a brief (relatively speaking) summary. The output tends to be on the verbose side, so it isn't shown here.
To check a DNA sequence against the TFD database, use one of the commands TFMAP, TFMAPPLOT, or TFMAPSORT. These format the output in various ways, but essentially provide the same information. These are exactly the same programs used to do restriction digests, but in that case the database of patterns is restriction enzymes rather than the TFD.
$ tektronix versaterm term $ tfmapplot/infile=gb_in:dmwhite or $ mapplot/infile=gb_in:dmwhite/ - data=transfac:transfac_athr.dat
Sometimes you want to find a database entry by the documentary information within it. For instance, by author, or by gene name. There are a lot of ways to do that. The slowest way is to use the GCG tool STRINGSEARCH which just does a serial search of the reference information in databases. An example of that was introduced above.
The fastest method is to use one of the database index tools. The two primary ones are:
More than any other category of tool, this particular group is in flux. For instance, the unsupported GCG program LOOKUP, which is based on an older version of SRS, is currently in an odd state, as it could not index several of the current databases, so the indices it does have are incompatible with our current databases. If you ignore the warnings, it is still fairly useful. Plans are to replace it soon with a local copy of SRS. The NCBI is also replacing the RETRIEVE server this month with one called QUERY, which is essentially a variant of ENTREZ. When that change occurs, the syntax used with NCBI_RETRIEVE will have to change.
Next week we'll start covering tools for Molecular Biology.