Before we get into the techniques of sequence alignment, I want to make two points clearly, and I want you to try to keep them in the back of your heads any time you do sequence comparisons.
In general, one can either talk about the similarity between two sequences or the distance between them. These are very closely related concepts, when distance goes up, similarity goes down, and vice versa. But keep in mind that the relationship isn't necessarily linear, which you will discover for yourself when you do the homework.
Let's start with the simplest sequence alignment case: comparing two nucleic acid or protein sequences where no gaps are allowed. That is, just come up with a way to score differences when the sequences are already aligned. To measure distance one needs a distance metric, which is an operation that has certain properties. When gaps have been eliminated from consideration, it is easy to come up with a distance metric.
D(a,b) is a distance metric if all of the following hold:
D(a,b) >= 0 for all sequences a,b
D(a,b) = 0 only if a=b
D(a,b) = D(b,a) commutative
D(a,b) + D(b,c) >= D(a,c) triangle inequality
Here is a simple distance metric: No gaps, Identity = 0, mismatch = 1.
Name Matches Score (D) Triangle inequality
----------------------------------------------------------------
A a c g t a c g t
| | | 3
B a c t t c t g t
------------------------------------ 8
B a c t t c t g t
| | | | | 5
C t c g t a a a t
------------------------------------------------- >= ---------
A a c g t a c g t
| | | 3 3
C t c g t a a a t
----------------------------------------------------------------
In general, and especially with proteins, the metric tends to be a bit more complicated. To begin with, the scores for "identity" and "mismatch" offen differ for each pair of amino acids. That is, the score for a Serine/Threonine comparison is different than that for a Arginine/Phenylalanine comparison. These scores are commonly placed into a matrix or table.
The oldest commonly used such matrix, in GCG parlance, a comparison matrix, is the PAM, which was originally worked out by Dayhoff and coworkers in the mid-seventies. PAM is a pseudoacronym for Accepted Point Mutation. At the time this matrix was derived there was very little sequence available, and so the substitutions they found were almost entirely from globins of various types. Anyway, then and now, a comparison matrix is found by measuring all of the residue changes in a group of aligned proteins (if they weren't aligned, you couldn't tell what was replacing what!), and then normalizing the resulting matrix in one manner or another. You will usually see a number on the end of comparison matrices, like PAM250. What this usually means is that the base matrix, which is a sort of unit matrix for substitution rates, has been multiplied by itself that many times. This is, in one sense, a correction for evolutionary time - it gives a matrix corresponding to an average divergence time. Practically speaking, one often uses different numbered versions with different programs because they have been shown to be somehow "optimal" for the algorithms used in those programs. There is almost a cottage industry devoted to these matrices, and one can find dozens in the literature, a good place to start is with the Pearson paper that is referenced in the homework. Probably the most commonly encountered modern matrix is the BLOSUM, which is the default matrix for a lot of programs, notably BLAST. The differences between matrices tend to show up at the limits of sensitivity, when you are aligning distantly related proteins.
One last point about matrices - they need to be normalized or scaled appropriately for particular programs. Do not assume that a matrix you find in the literature can be typed in and used verbatim.
Enough about comparison matrices, back to sequence alignment. If two or more sequences have been aligned, the resulting insertions and deletions which appear are both referred to by the common term "indel". This is because an insertion into one aligned sequence is the equivalent of a deletion in another.
Once indels are allowed the alignment problem becomes more complicated; not only must the distance metric incorporate indel scoring, but a method is required to place the indels so as to find the minimum value of that distance metric for the two sequences being aligned. The last part of this is quite a difficult problem to solve, since there are far too many possible combinations of indel placements to try them all. It turns out that this particular mathematical problem does have a solution, or rather, two solutions, depending on exactly what is being aligned:
/NOBIGGAPS suppresses a shorthand that GAP and BESTFIT
use for representing long gaps, where they write a sort of
ellipsis (...) instead of a long row of dots (gaps). This
qualifier forces them to explicitly show all sequence.
$ GAP /INFILE1=sw:myg_human /INFILE2=sw:hba_eleel -
/OUTFILE=TEST_GAP.PAIR /NOBIGGAPS /DEFAULT
$ BESTFIT /INFILE1=sw:myg_human /INFILE2=sw:hba_eleel -
/OUTFILE=TEST_BESTFIT.PAIR /NOBIGGAPS /DEFAULT
Unless the two sequences under scrutiny are known to be cognates for their entire lengths, the tool that would generally be used is BESTFIT, which will find the most similar region between two sequences. It is very common for there to be slightly different alignments of this same region which score as high, or very nearly as high, as the BESTFIT alignment. There are often also other regions with significantly high similarity. BESTFIT won't tell you any of this - it always returns the single best result. So it is a good idea to rerun BESTFIT on the regions before and after the first region it finds, to see if you can find any of these other regions. You aren't giving up as much as it would seem by using BESTFIT rather than GAP, because it will align the entire length of both sequences if they really are quite similar. For instance, there aren't any significant differences in the two alignments produced in the preceding two examples:
$ DIFFerence test_gap.pair test_bestfit.pair Irrelevant header differences omitted File TEST_BESTFIT.PAIR;1 39 . . . . 40 101 VKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYK 147 41 . :.::.. :| ||. |:||.::.: |::| :: . ..:...|: 42 96 PANFKILAHNLIVVLALFFPADFTPEVHMAVDKFFQNVASALSEKYR 142 ****** File TEST_GAP.PAIR;1 39 . . . . . 40 101 VKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELG 150 41 . :.::.. :| ||. |:||.::.: |::| :: . ..:...|: 42 96 PANFKILAHNLIVVLALFFPADFTPEVHMAVDKFFQNVASALSEKYR... 142 43 44 151 FQG 153 45 46 143 ... 143The only difference between these is that BESTFIT terminated the first sequence at position 147, whereas GAP continued it all the way to the end of the sequence.
We won't go through the proof that the alignments found by these methods are optimal. However, it is instructive to see how they work, so I'll show you a simple Smith and Waterman analysis. We won't go through the Needleman Wunsch because it is quite similar. Both of these are dynamic programming methods, which sounds complicated, but actually turns out to be pretty easy to comprehend.
Alignment example from Smith and Waterman, 1981:
Sequences: a(i), b(j); i = 1->N, j = 1->M
D(i,j) = maximum of:
Zero
D(i-1,j-1) + Compare( a(i),b(j) ) (along diagonal)
D(i-k,j) - Gap(k);k=1....j-1 (gap on one sequence)
D(i,j-k) - Gap(k);k=1....i-1 (gap on the other)
Compare()= 1 for identities, -1/3 for mismatches
Gap(k) = 1 + k/3
D(1,*) = D(*,1) = 0.0 ( = " . ", to simplify diagram)
5' C A G C C U C G C U U A G 3'
. . . . . . . . . . . . . .
A . . 1.0 . . . . . . . . . 1.0 .
A . . 1.0 0.7 . . . . . . . . 1.0 0.7
U . . . 0.7 0.3 . 1.0 . . . 1.0 1.0 . 0.7
G . . . 1.0 0.3 . . 0.7 1.0 . . 0.7 0.7 1.0
C . 1.0 . . 2.0 1.3 0.3 1.0 0.3 2.0 0.7 0.3 0.3 0.3
C . 1.0 0.7 . 1.0 3.0 1.7 1.3 1.0 1.3 1.7 0.3 . .
A . . 2.0 0.7 0.3 1.7 2.7 1.3 1.0 0.7 1.0 1.3 1.3 .
U . . 0.7 1.7 0.3 1.3 2.7 2.3 1.0 0.7 1.7 2.0 1.0 1.0
U . . 0.3 0.3 1.3 1.0 2.3 2.3 2.0 0.7 1.7 2.7 1.7 1.0
G . . . 1.3 . 1.0 1.0 2.0 3.3 2.0 1.7 1.3 2.3 2.7
A . . 1.0 . 1.0 0.3 0.7 0.7 2.0 3.0 1.7 1.3 2.3 2.0
C . 1.0 . 0.7 1.0 2.0 0.7 1.7 1.7 3.0 2.7 1.3 1.0 2.0
G . . 0.7 1.0 0.3 0.7 1.7 0.3 2.7 1.7 2.7 2.3 1.0 2.0
G . . . 1.7 0.7 0.3 0.3 1.3 1.3 2.3 1.3 2.3 2.0 2.0
3'
Optimal subfragment: GCCAUUG
||| | |
GCC.UCG
The idea is to calculate all of the values in an array, which has traditionally been called D, which is pretty confusing because in this case at the end of the analysis it will consist essentially of an array of similarity values for subalignments. The similarity calculation that we use is shown at the top of the figure, it has scores for matches and mismatches, but also a variable indel score, where the first indel scores 1, but subsequent adjacent indels only 1/3. That's the similarity metric. Now lets go through the method used to calculate it and to find where the indels go.
Make the array one row and one column bigger than it needs to be, then place one sequence along the top, the other down the side as shown. Set the top row and left column to zero. Next, starting at the upper left corner, where the first pair of residues meet, use the formulation of the similarity metric D(i,j) that is shown at the top of the figure. Step one to the right and repeat. Drop down a row, all the way to the left, and repeat. Keep going until the whole array is filled. Then find the highest value. Work back up the chain until the last value that is >= 1.0. That path indicates the single best alignment value.
Some things to keep in mind.
How do you tell if an alignment is significant? One rule of thumb suggested by Russell Doolittle, is that if after alignment you have > 25% identity on > 100 amino acids, then it's probably real. Below that is the twilight zone. (Except, do not apply this rule if the composition of the sequence is highly biased towards a few amino acid types!) Some of the database search tools, which will be discussed in another lecture, have ways of either estimating or measuring the distribution of alignment scores for a query against a database, and can in many cases discover much smaller and weaker similarities which are statistically significant.
Some of the homework problems deal with this problem.
Another tool that is commonly used for aligning a pair of sequences is the dot plot. The basic dotplot is constructed by putting a dot on a 2D plot everywhere the sequences match, and no dot where they don't. In practice, one generally filters the plot a bit by requiring a short region of similarity for each dot, which cuts down on the visual noise. Regions of identity show up as diagonals, often quite strikingly. Dot plots can also be used to look for repeats or inversions within a single sequence. Two GCG programs are used to make a dotplot:
Set up GCG graphics, compare the two proteins, then display the results. This GCG graphics setting stays in effect until it is changed - it doesn't need to be set before each program is run. Instruct COMPARE to require 50% homology over a 30 amino acid window, compare human myoglobin versus electric eel hemoglobin A. $ tektronix verstarm-tek4105 term: $ compare /infile1=sw:myg_human/infile2=sw:hba_eleel - /window=30/string=15.0/default $ dotplot /infile=myg_human.pnt Which produces this graphic on the output device:
COMPARE/word=N must be used to produce dotplots of larger sequences. The word method looks for exact matches of subsequences of length N, which is a much quicker operation than the default sliding window method. In either case, the .pnt file which is produced can be very large, so delete it when you're done or you'll use up your disk quota in a hurry.
We've discussed how one aligns two sequences, or parts of two sequences. What happens when similar information is needed for multiple sequences? The mathematical formalisms of the Smith Waterman and Needleman Wunsch methods do still apply in multiple dimensions, however the problem is computationally expensive in multiple dimensions with typical numbers of average length sequences. That is, while one could in theory do an N-dimensional Smith-Waterman analysis, in practice, except on the simplest cases, it would take more computer power than is generally available.
Instead, most of the available multiple alignment programs use some sort of incremental method. I'll discuss PILEUP, the multiple alignment program that comes with the GCG package, but you should also at some point give CLUSTALW a try, as it is also commonly used. First, let's run a quick alignment, this between human myoglobin, electric eel hemoglobin alpha, echidna alpha-2, and beaver myoglobin. There's nothing special about these - I just want to show what happens when you align a bunch of proteins that are fairly closely related.
$ create pileup.lis This list file instructs PILEUP to align the two sequences it names. The next line after this contains two periods and separates the comments from the list of sequences .. sw:myg_human sw:hba_eleel sw:hba1_tacac sw:myg_casfi ^Z $ pileup/infile=@pileup.lis
PILEUP makes an initial pass where it aligns all pairs of sequences and scores the alignments. This initial alignment is by the method of Needleman and Wunsch, that is, it aligns the entire sequence. Think about that - if you are aligning a bunch of different proteins, and you know some regions are just not at all similar, cut those regions out before you do the alignment, especially if it is easy to do so because they are on the ends of the protein. Similarly, if you are interested just in some particular repeat or motif, extract them from the original sequence as best you can and then do the alignment. The reason is, everything you throw into PILEUP that doesn't belong there just acts to gunk up the works. The final alignment may still come out right, but then, it might not.
At the end of this pass PILEUP has a table of overall similarities between all pairs of sequences. It then incrementally aligns them, starting with the two most similar sequences. It then looks for the next closest pair, considering the weighted distance from the combined first pair. It might add another sequence to that cluster, or it might align another pair, and then align the resulting alignments, and so forth. The program has an option to output a figure of the tree it used. It's usually a good idea to look at it, just to make sure that the order of alignment makes some sort of sense - it can help you catch misnamed sequences, for instance. One more note about that tree - it may look like a phylogenetic tree, but it isn't one. There are times, in fact, most of the time, that if you do a phylogenetic analysis on the aligned sequences the phylogenetic tree that results will be similar to the PILEUP one, but this isn't always the case.
Here is how you look at this figure:
$ figure/infile=pileup.figureThis is what it looks like:
The final product of a PILEUP run is a set of aligned sequences, which are stored in a Multiple Sequence File. As when aligning pairs, the final alignment will usually change somewhat if a different comparison matrix is used, or the gap weights changed.
This is what the final alignment looks like:
$ type pileup.msf
PileUp of: @Pileup.Lis
Symbol comparison table: GenRunData:Pileuppep.Cmp CompCheck: 1254
GapWeight: 3.000
GapLengthWeight: 0.100
Test.Msf MSF: 153 Type: P January 26, 1999 14:59 Check: 4798 ..
Name: Myg_Human Len: 153 Check: 4188 Weight: 1.00
Name: Myg_Casfi Len: 153 Check: 2412 Weight: 1.00
Name: Hba_Eleel Len: 153 Check: 4688 Weight: 1.00
Name: Hba1_Tacac Len: 153 Check: 3510 Weight: 1.00
//
1 50
Myg_Human GLSDGEWQLV LNVWGKVEAD IPGHGQEVLI RLFKGHPETL EKFDKFKHLK
Myg_Casfi GLSDGEWQLV LHVWGKVEAD LAGHGQEVLI RLFKGHPETL EKFNKFKHIK
Hba_Eleel SLTAKSKSIV KAFWGKIGSR ADDIGAEAFG RMLTVYPETK TYFASWSDLS
Hba1_Tacac VLTDAEKKEV TSLWGKASGH AEEYGAEALE RLFLSFPTTK TYF.SHMDLS
51 100
Myg_Human SEDEMKASED LKKHGATVLT ALGGILKKKG HHEAEIKPLA QSHATKHKIP
Myg_Casfi SEDEMKASED LKKHGVTVLT ALGGVLKKKG HHEAEIKPLA QSHATKHKIP
Hba_Eleel .....PGSAA VKKHGKTIMG GIAEAVGHID DLTGGLASLS ELHAFKLRVD
Hba1_Tacac .....KGSAQ VKAHGKRVAD ALTTAAGHFN DMDSALSALS DLHAHKLRVD
101 150
Myg_Human VKYLEFISEC IIQVLQSKHP GDFGADAQGA MNKALELFRK DMASNYKELG
Myg_Casfi IKYLEFISEA IIHVLQSKHP GBFGADABGA MNKALELFRK DIAAKYKELG
Hba_Eleel PANFKILAHN LIVVLALFFP ADFTPEVHMA VDKFFQNVAS ALSEKYR...
Hba1_Tacac PVNFKLLAHC FLVVLARHHP AEFTPSAHAA MDKFLSRVAT VLTSKYR...
151
Myg_Human FQG
Myg_Casfi FQG
Hba_Eleel ...
Hba1_Tacac ...
In a later lecture we'll discuss the tools which you can use to format a multiple sequence alignment for publication. For right now though, you might just want to know which regions of the alignment have the most similarities, and which the least. To answer this you can use the PLOTSIMILARITY program. The output will show the arithmetic average of the scores of all possible pairwise symbol comparisons among the sequence symbols at that position. When all sequences at a position are the same, the value will be 1.5.
Note the {*} in the next command - leaving it out is
one of the most common mistakes people make when using GCG
$ plotsimilarity/infile=pileup.msf{*}/minscale=0.0/wind=1/bar/default
There is a different multiple alignment problem which one sometimes encounters. Imagine that you suspect that each member of a set of proteins contains a well conserved core region, but that the rest of the protein is highly diverged. If you knew ahead of time the sequence pattern that characterized this region you could search for it, but if you don't, what then? Comparing all small subsequences combinatorially is obvious, but impractical.
A group at the NCBI wrote an interesting program to find just these sorts of motifs using a technique known as the Gibbs sampling strategy. There is rather a lot of math involved in understanding exactly what it does, but in a nutshell, it tries to find the best scoring common ungapped segment of width W in a set of N sequences.
Refer to the following figure, which illustrates how the method works. First the program places one such segment at random on each of the sequences (yellow boxes) and initializes a weighted sequence profile (blue box, Matrix Version 1). A weighted sequence profile is just that - a set of weights to apply for each type of amino acid at each position in the scoring window. It then selects one such sequence at random (Test), recalculates the weighted sequence profile from the remaining N-1 sequences (Matrix, Version 2), then scans it across all N-W+1 positions in the test sequence until it finds the highest scoring segment (red box). It continues in this manner (next version)for 500 cycles (or as specified by the user.)
Here is an example of how to use it. The proteins in this example come are the same as those in the original paper.
$ create list_for_gibbs.fil
This is a list of helix-turn-helix proteins which
we will convert into FASTA format for use with the
gibbs program.
..
PIR:A25944
PIR:A28627
PIR:A32837
PIR:A23450
PIR:B26499
PIR:BVECDA
PIR:C29010
PIR:DNECFS
PIR:JEBY1
PIR:QCBP2L
PIR:QRECC
PIR:RCBPL
PIR:RGBP22
PIR:RGECA
PIR:RGECF
PIR:RGECH
PIR:RGKBCP
PIR:RPECCT
PIR:RPECDO
PIR:RPECG
PIR:RPECL
PIR:RPECTN
PIR:RPECW
PIR:S02513
PIR:S07337
PIR:S07958
PIR:S08477
PIR:S09205
PIR:S11945
PIR:Z1BPC2
^Z
Now create a command file, it contains the answers
to the questions:
How many types of elements?
For the first (only, in this case) element,
how long is the site?
Do all sequences have the same number of sites?
How many is that?
$ create cmd.hth
1
18
y
1
^Z
$ tofasta/infile=@list_for_gibbs.fil/out=hth.30
The GIBBS command on the next line actually starts a
wrapper procedure which then runs the actual GIBBS
program. If there were no parameters supplied, it would
have sent usage information to the screen instead.
$ gibbs hth.30 cmd.hth
Output is in: HTH.out
Messages are in:HTH.msg
Gibbs processing: completed
$ type hth.msg
How many types of elements? For element 1...
How long are the sites?
Do all seqs have the same # of sites (Y/N)?
How many sites in each seq? 1 sites for each running gibbs...
nconv = 500
** 1 **
5
10
15
20
25
** 2 **
5
[---] 0.0522946 motif A cycle 10
10
15
** 3 **
5
[+] 2.18188 motif A cycle 7
10
15
[--] 18.9218 motif A cycle 18
[-] 1.40928 motif A cycle 20
20
[-] 18.6323 motif A cycle 22
[-] 195.936 motif A cycle 24
25
30
[---] 12972.7 motif A cycle 32
35
40 seed: 458719990
time: 3 seconds (0.05 minutes)
$ type hth.out
#1 A25944 transcription initiation factor sigma B - Bacillus s
#2 A28627 transcription initiation factor sigma K precursor -
#3 A32837 transcription activator nahR - Pseudomonas putida pl
#4 A23450 homeotic protein Antennapedia - fruit fly (Drosophil
#5 B26499 nitrogen assimilation regulatory protein ntrC - Brad
#6 BVECDA cell division control protein dicA (gene dicB repres
#7 C29010 mercuric resistance operon regulatory merD protein -
#8 DNECFS DNA-binding protein fis - Escherichia coli
#9 JEBY1 mating-type regulatory protein a1 - yeast (Saccharomy
#10 QCBP2L regulatory protein cII - phage lambda
#11 QRECC cAMP receptor protein - Escherichia coli
#12 RCBPL regulatory protein cro - phage lambda
#13 RGBP22 regulatory protein cro - phage P22
#14 RGECA arabinose operon transcription regulator - Escherichi
#15 RGECF fumarate and nitrate reduction regulatory protein - E
#16 RGECH transcription initiation factor sigma 32 - Escherichi
#17 RGKBCP nitrogen regulation protein I ntrC - Klebsiella pneu
#18 RPECCT cyt transcription repressor cytr - Escherichia coli
#19 RPECDO deoxyribose operon repressor - Escherichia coli
#20 RPECG gal operon repressor - Escherichia coli
#21 RPECL lactose operon repressor - Escherichia coli
#22 RPECTN repressor tetR - Escherichia coli transposon Tn10
#23 RPECW trp operon repressor - Escherichia coli
#24 S02513 nif-specific regulatory protein - Klebsiella pneumon
#25 S07337 transcription initiation factor sigma E precursor -
#26 S07958 DNA-invertase - Escherichia coli
#27 S08477 pur operon repressor - Escherichia coli
#28 S09205 ebg repressor - Escherichia coli
#29 S11945 lexA repressor - Escherichia coli
#30 Z1BPC2 regulatory protein cI - phage P22
MOTIF A
1-1 225 iidltyiqnk SQKETGDILGISQMHVSR lqrkavkklr 242
2-1 198 rfgldlkkek TQREIAKELGISRSYVSR iekralmkmf 215
3-1 22 vvfnqllvdr RVSITAENLGLTQPAVSN alkrlrtslq 39
4-1 326 fhfnryltrr RRIEIAHALCLTERQIKI wfqnrrmkwk 343
5-1 449 ltaalaatrg NQIRAADLLGLNRNTLRK kirdldiqvy 466
6-1 22 iryrrknlkh TQRSLAKALKISHVSVSQ wergdseptg 39
7-1 5 mnay TVSRLALDAGVSVHIVRD yllrgllrpv 22
8-1 73 ldmvmqytrg NQTRAALMMGINRGTLRK klkkygmn 90
9-1 99 frrkqslnsk EKEEVAKKCGITPLQVRV wfinkrmrsk 116
10-1 25 sallnkiaml GTEKTAEAVGVDKSQISR wkrdwipkfs 42
11-1 169 thpdgmqiki TRQEIGQIVGCSRETVGR ilkmledqnl 186
12-1 15 itlkdyamrf GQTKTAKDLGVYQSAINK aihagrkifl 32
13-1 12 ykkdvidhfg TQRAVAKALGISDAAVSQ wkevipekda 29
14-1 196 isdhladsnf DIASVAQHVCLSPSRLSH lfrqqlgisv 213
15-1 196 fsprefrltm TRGDIGNYLGLTVETISR llgrfqksgm 213
16-1 252 arwldednks TLQELADRYGVSAERVRQ leknamkklr 269
17-1 444 lttalrhtqg HKQEAARLLGWGRNTLTR klkelgme 461
18-1 11 mkakkqetaa TMKDVALKAKVSTATVSR almnpdkvsq 28
19-1 23 lqelkrsdkl HLKDAAALLGVSEMTIRR dlnnhsapvv 40
20-1 3 ma TIKDVARLAGVSVATVSR vinnspkase 20
21-1 5 mkpv TLYDVAEYAGVSYQTVSR vvnqashvsa 22
22-1 26 llnevgiegl TTRKLAQKLGVEQPTLYW hvknkralld 43
23-1 67 iveellrgem SQRELKNELGAGIATITR gsnslkaapv 84
24-1 495 liaalekagw VQAKAARLLGMTPRQVAY riqimditmp 512
25-1 205 rfglvgeeek TQKDVADMMGISQSYISR lekriikrlr 222
26-1 160 qagrliaagt PRQKVAIIYDVGVSTLYK rfpagdk 177
27-1 3 ma TIKDVAKRANVSTTTVSH vinktrfvae 20
28-1 3 ma TLKDIAIEAGVSLATVSR vlnddptlnv 20
29-1 27 dhisqtgmpp TRAEIAQRLGFRSPNAAE ehlkalarkg 44
30-1 25 ssilnriair GQRKVADALGINESQISR wkgdfipkmg 42
5 10 15
POS A R N D C Q E G H I L K M F P S T W Y V
1 1 7 7 4 0 0 4 11 7 1 1 0 0 0 4 7 50 0 0 4
2 1 17 0 0 0 37 1 1 0 11 14 7 4 0 0 1 7 0 0 7
3 11 21 0 0 0 14 7 4 0 7 1 24 0 0 0 7 7 0 4 1
4 4 11 0 27 0 0 31 1 0 4 1 20 0 0 0 7 0 0 0 1
5 17 1 0 0 0 0 1 1 0 21 18 0 0 0 0 1 14 0 0 34
6 87 1 0 0 0 0 1 11 0 1 1 4 0 0 0 1 0 0 0 1
7 4 11 7 17 0 14 11 1 4 7 11 20 0 0 0 1 0 0 0 1
8 17 11 4 7 0 0 11 1 4 11 18 10 7 0 0 1 0 0 7 1
9 21 1 0 0 3 0 1 1 0 1 54 0 7 0 0 1 0 0 7 11
10 1 1 4 4 7 0 1 81 0 1 1 7 0 0 0 1 0 0 0 1
11 4 1 0 0 3 0 1 1 0 27 18 0 4 4 0 1 0 3 0 41
12 1 4 10 4 0 0 4 11 0 1 1 0 0 0 0 51 17 0 4 1
13 4 17 0 4 0 17 11 1 4 4 4 4 0 0 10 4 7 0 4 14
14 17 7 7 0 0 4 11 4 4 1 4 0 7 0 10 24 4 0 0 4
15 11 7 4 0 0 17 1 1 4 4 1 0 0 0 0 4 47 0 7 1
16 4 1 0 0 0 0 1 1 0 27 21 0 0 0 0 1 0 0 0 51
17 7 21 4 0 0 0 1 4 0 1 1 4 0 0 0 51 7 0 7 1
18 1 47 4 4 0 10 4 1 7 4 1 14 0 0 0 1 0 3 4 4
non-
site 9 7 4 5 1 6 7 6 3 6 11 5 3 3 4 6 5 1 2 6
Complete log-likelihood ratio = 861 bits
Missing position information = 227 bits
Log-likelihood ratio statistic = 633 bits
Degrees of freedom = 342
Information per parameter = 1.85 bits
Best Sites (std. dev. above mean)
1-1 225 iidltyiqnk SQKETGDILGISQMHVSR lqrkavkklr 242 (6.47)
2-1 198 rfgldlkkek TQREIAKELGISRSYVSR iekralmkmf 215 (7.59)
3-1 22 vvfnqllvdr RVSITAENLGLTQPAVSN alkrlrtslq 39 (5.54)
4-1 326 fhfnryltrr RRIEIAHALCLTERQIKI wfqnrrmkwk 343 (6.05)
5-1 449 ltaalaatrg NQIRAADLLGLNRNTLRK kirdldiqvy 466 (6.27)
6-1 22 iryrrknlkh TQRSLAKALKISHVSVSQ wergdseptg 39 (6.15)
7-1 5 mnay TVSRLALDAGVSVHIVRD yllrgllrpv 22 (5.25)
8-1 73 ldmvmqytrg NQTRAALMMGINRGTLRK klkkygmn 90 (5.71)
9-1 99 frrkqslnsk EKEEVAKKCGITPLQVRV wfinkrmrsk 116 (6.08)
10-1 25 sallnkiaml GTEKTAEAVGVDKSQISR wkrdwipkfs 42 (6.97)
11-1 169 thpdgmqiki TRQEIGQIVGCSRETVGR ilkmledqnl 186 (6.12)
12-1 15 itlkdyamrf GQTKTAKDLGVYQSAINK aihagrkifl 32 (6.97)
13-1 12 ykkdvidhfg TQRAVAKALGISDAAVSQ wkevipekda 29 (6.06)
14-1 196 isdhladsnf DIASVAQHVCLSPSRLSH lfrqqlgisv 213 (5.69)
15-1 196 fsprefrltm TRGDIGNYLGLTVETISR llgrfqksgm 213 (6.88)
16-1 252 arwldednks TLQELADRYGVSAERVRQ leknamkklr 269 (5.97)
17-1 444 lttalrhtqg HKQEAARLLGWGRNTLTR klkelgme 461 (6.63)
18-1 11 mkakkqetaa TMKDVALKAKVSTATVSR almnpdkvsq 28 (7.02)
19-1 23 lqelkrsdkl HLKDAAALLGVSEMTIRR dlnnhsapvv 40 (6.96)
20-1 3 ma TIKDVARLAGVSVATVSR vinnspkase 20 (8.14)
21-1 5 mkpv TLYDVAEYAGVSYQTVSR vvnqashvsa 22 (6.61)
22-1 26 llnevgiegl TTRKLAQKLGVEQPTLYW hvknkralld 43 (7.33)
23-1 67 iveellrgem SQRELKNELGAGIATITR gsnslkaapv 84 (5.57)
24-1 495 liaalekagw VQAKAARLLGMTPRQVAY riqimditmp 512 (5.78)
25-1 205 rfglvgeeek TQKDVADMMGISQSYISR lekriikrlr 222 (8.19)
26-1 160 qagrliaagt PRQKVAIIYDVGVSTLYK rfpagdk 177 (6.00)
27-1 3 ma TIKDVAKRANVSTTTVSH vinktrfvae 20 (7.14)
28-1 3 ma TLKDIAIEAGVSLATVSR vlnddptlnv 20 (6.42)
29-1 27 dhisqtgmpp TRAEIAQRLGFRSPNAAE ehlkalarkg 44 (5.19)
30-1 25 ssilnriair GQRKVADALGINESQISR wkgdfipkmg 42 (6.11)
5 10 15
Total log-likelihood ratio = 634 bits
This is, in fact, a good predictor for the helix turn helix motif, as shown in the original paper. The quality of the predictor is indicated by the value of "Information per parameter", which is 1.85 bits. For a set of random sequences (no pattern) that value would instead have been on the order of 1.2. The numbers to the right of the best sites show standard deviations above the mean for that profile at this site, the numbers to the left, like "20-1", mean sequence # - pattern #, ie, the first pattern location in the 20th sequence.
I encourage you all to do the homework, which will hopefully help clarify some of the points that we covered here today. Next week we'll go over databases and database searching.
Are there any questions?