Sequence Alignment Fundamentals

Fundamentals of Sequence Analysis, 1998-1999

Lecture 2: Sequence alignment fundamentals

Introduction

Before we get into the techniques of sequence alignment, I want to make two points clearly, and I want you to try to keep them in the back of your heads any time you do sequence comparisons.

  1. Whenever possible, work on the protein sequence instead of the nucleic acid sequence. It's a long way from the DNA sequence to the protein sequence. Through alternate codon usage for the same amino acid, completely different codon usage for a similar amino acid, and in eukaryotes, differences in intron/exon boundaries, there can be a tremendous amount of change in the DNA that has little or no effect on the final protein product. Nucleic acid comparisons also tend to have more problems due to skewed composition and small repeats, both of which may cause gaps to be placed improperly in alignments, as compared to the "correct" protein level alignment.
  2. When comparing two or more proteins, what is really of interest is how similar their 3D structures are, since that is the best indicator of functional similarity. Unfortunately, more often that not, there will be no experimentally determined structure for any of the proteins you are working on. So you must compare the sequences by making use of the approximation that the structure is correlated with the underlying sequence. This is an approximation, and when dealing with proteins that have had a long time to diverge, and so have very different sequences, it's often not a very good one.

Distance

In general, one can either talk about the similarity between two sequences or the distance between them. These are very closely related concepts, when distance goes up, similarity goes down, and vice versa. But keep in mind that the relationship isn't necessarily linear, which you will discover for yourself when you do the homework.

Let's start with the simplest sequence alignment case: comparing two nucleic acid or protein sequences where no gaps are allowed. That is, just come up with a way to score differences when the sequences are already aligned. To measure distance one needs a distance metric, which is an operation that has certain properties. When gaps have been eliminated from consideration, it is easy to come up with a distance metric.


D(a,b) is a distance metric if all of the following hold:

  D(a,b) >= 0                  for all sequences a,b
  D(a,b)  = 0                  only if a=b
  D(a,b)  = D(b,a)             commutative
  D(a,b) + D(b,c) >= D(a,c)    triangle inequality

Here is a simple distance metric:  No gaps, Identity = 0, mismatch = 1.

Name   Matches               Score (D)        Triangle inequality
----------------------------------------------------------------
A      a c g t a c g t
           |   | |              3
B      a c t t c t g t
------------------------------------                8
B      a c t t c t g t
       |   |   | | |            5
C      t c g t a a a t
-------------------------------------------------  >=  ---------
A      a c g t a c g t
       |         | |            3                   3
C      t c g t a a a t
----------------------------------------------------------------

Comparison Matrices

In general, and especially with proteins, the metric tends to be a bit more complicated. To begin with, the scores for "identity" and "mismatch" offen differ for each pair of amino acids. That is, the score for a Serine/Threonine comparison is different than that for a Arginine/Phenylalanine comparison. These scores are commonly placed into a matrix or table.

The oldest commonly used such matrix, in GCG parlance, a comparison matrix, is the PAM, which was originally worked out by Dayhoff and coworkers in the mid-seventies. PAM is a pseudoacronym for Accepted Point Mutation. At the time this matrix was derived there was very little sequence available, and so the substitutions they found were almost entirely from globins of various types. Anyway, then and now, a comparison matrix is found by measuring all of the residue changes in a group of aligned proteins (if they weren't aligned, you couldn't tell what was replacing what!), and then normalizing the resulting matrix in one manner or another. You will usually see a number on the end of comparison matrices, like PAM250. What this usually means is that the base matrix, which is a sort of unit matrix for substitution rates, has been multiplied by itself that many times. This is, in one sense, a correction for evolutionary time - it gives a matrix corresponding to an average divergence time. Practically speaking, one often uses different numbered versions with different programs because they have been shown to be somehow "optimal" for the algorithms used in those programs. There is almost a cottage industry devoted to these matrices, and one can find dozens in the literature, a good place to start is with the Pearson paper that is referenced in the homework. Probably the most commonly encountered modern matrix is the BLOSUM, which is the default matrix for a lot of programs, notably BLAST. The differences between matrices tend to show up at the limits of sensitivity, when you are aligning distantly related proteins.

One last point about matrices - they need to be normalized or scaled appropriately for particular programs. Do not assume that a matrix you find in the literature can be typed in and used verbatim.

Gaps (indels)

Enough about comparison matrices, back to sequence alignment. If two or more sequences have been aligned, the resulting insertions and deletions which appear are both referred to by the common term "indel". This is because an insertion into one aligned sequence is the equivalent of a deletion in another.

Aligning two sequences

Once indels are allowed the alignment problem becomes more complicated; not only must the distance metric incorporate indel scoring, but a method is required to place the indels so as to find the minimum value of that distance metric for the two sequences being aligned. The last part of this is quite a difficult problem to solve, since there are far too many possible combinations of indel placements to try them all. It turns out that this particular mathematical problem does have a solution, or rather, two solutions, depending on exactly what is being aligned:

/NOBIGGAPS suppresses a shorthand that GAP and BESTFIT
use for representing long gaps, where they write a sort of
ellipsis (...) instead of a long row of dots (gaps).  This
qualifier forces them to explicitly show all sequence.
 
$ GAP     /INFILE1=sw:myg_human /INFILE2=sw:hba_eleel -
     /OUTFILE=TEST_GAP.PAIR        /NOBIGGAPS  /DEFAULT
$ BESTFIT /INFILE1=sw:myg_human /INFILE2=sw:hba_eleel -
     /OUTFILE=TEST_BESTFIT.PAIR    /NOBIGGAPS  /DEFAULT

Unless the two sequences under scrutiny are known to be cognates for their entire lengths, the tool that would generally be used is BESTFIT, which will find the most similar region between two sequences. It is very common for there to be slightly different alignments of this same region which score as high, or very nearly as high, as the BESTFIT alignment. There are often also other regions with significantly high similarity. BESTFIT won't tell you any of this - it always returns the single best result. So it is a good idea to rerun BESTFIT on the regions before and after the first region it finds, to see if you can find any of these other regions. You aren't giving up as much as it would seem by using BESTFIT rather than GAP, because it will align the entire length of both sequences if they really are quite similar. For instance, there aren't any significant differences in the two alignments produced in the preceding two examples:


$ DIFFerence test_gap.pair test_bestfit.pair

Irrelevant header differences omitted

File TEST_BESTFIT.PAIR;1
   39                     .         .         .         .
   40        101 VKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYK 147
   41            .  :.::.. :| ||.   |:||.::.: |::| :: . ..:...|:
   42         96 PANFKILAHNLIVVLALFFPADFTPEVHMAVDKFFQNVASALSEKYR 142
******
File TEST_GAP.PAIR;1
   39                     .         .         .         .         .
   40        101 VKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELG 150
   41            .  :.::.. :| ||.   |:||.::.: |::| :: . ..:...|:
   42         96 PANFKILAHNLIVVLALFFPADFTPEVHMAVDKFFQNVASALSEKYR... 142
   43
   44        151 FQG 153
   45
   46        143 ... 143

The only difference between these is that BESTFIT terminated the first sequence at position 147, whereas GAP continued it all the way to the end of the sequence.

We won't go through the proof that the alignments found by these methods are optimal. However, it is instructive to see how they work, so I'll show you a simple Smith and Waterman analysis. We won't go through the Needleman Wunsch because it is quite similar. Both of these are dynamic programming methods, which sounds complicated, but actually turns out to be pretty easy to comprehend.

Alignment example from Smith and Waterman, 1981:

  Sequences:     a(i), b(j); i = 1->N, j = 1->M 
  D(i,j) = maximum of:
    Zero
    D(i-1,j-1) + Compare( a(i),b(j) )     (along diagonal)
    D(i-k,j)   - Gap(k);k=1....j-1        (gap on one sequence)
    D(i,j-k)   - Gap(k);k=1....i-1        (gap on the other)

       Compare()= 1 for identities, -1/3 for mismatches
       Gap(k)   = 1 + k/3
       D(1,*) = D(*,1) = 0.0 ( = " . ", to simplify diagram)

5'      C   A   G   C   C   U   C   G   C   U   U   A   G   3'
    .   .   .   .   .   .   .   .   .   .   .   .   .   . 
A   .   .  1.0  .   .   .   .   .   .   .   .   .  1.0  . 
A   .   .  1.0 0.7  .   .   .   .   .   .   .   .  1.0 0.7
U   .   .   .  0.7 0.3  .  1.0  .   .   .  1.0 1.0  .  0.7
G   .   .   .  1.0 0.3  .   .  0.7 1.0  .   .  0.7 0.7 1.0
C   .  1.0  .   .  2.0 1.3 0.3 1.0 0.3 2.0 0.7 0.3 0.3 0.3
C   .  1.0 0.7  .  1.0 3.0 1.7 1.3 1.0 1.3 1.7 0.3  .   . 
A   .   .  2.0 0.7 0.3 1.7 2.7 1.3 1.0 0.7 1.0 1.3 1.3  . 
U   .   .  0.7 1.7 0.3 1.3 2.7 2.3 1.0 0.7 1.7 2.0 1.0 1.0 
U   .   .  0.3 0.3 1.3 1.0 2.3 2.3 2.0 0.7 1.7 2.7 1.7 1.0
G   .   .   .  1.3  .  1.0 1.0 2.0 3.3 2.0 1.7 1.3 2.3 2.7
A   .   .  1.0  .  1.0 0.3 0.7 0.7 2.0 3.0 1.7 1.3 2.3 2.0
C   .  1.0  .  0.7 1.0 2.0 0.7 1.7 1.7 3.0 2.7 1.3 1.0 2.0 
G   .   .  0.7 1.0 0.3 0.7 1.7 0.3 2.7 1.7 2.7 2.3 1.0 2.0
G   .   .   .  1.7 0.7 0.3 0.3 1.3 1.3 2.3 1.3 2.3 2.0 2.0
3'

Optimal subfragment:  GCCAUUG
                      ||| | |
                      GCC.UCG

The idea is to calculate all of the values in an array, which has traditionally been called D, which is pretty confusing because in this case at the end of the analysis it will consist essentially of an array of similarity values for subalignments. The similarity calculation that we use is shown at the top of the figure, it has scores for matches and mismatches, but also a variable indel score, where the first indel scores 1, but subsequent adjacent indels only 1/3. That's the similarity metric. Now lets go through the method used to calculate it and to find where the indels go.

Make the array one row and one column bigger than it needs to be, then place one sequence along the top, the other down the side as shown. Set the top row and left column to zero. Next, starting at the upper left corner, where the first pair of residues meet, use the formulation of the similarity metric D(i,j) that is shown at the top of the figure. Step one to the right and repeat. Drop down a row, all the way to the left, and repeat. Keep going until the whole array is filled. Then find the highest value. Work back up the chain until the last value that is >= 1.0. That path indicates the single best alignment value.

Some things to keep in mind.

  1. While the alignment found may be the best, it does not mean that there aren't others that are nearly as good. Also, there may be several different paths that all give the same highest alignment value, for instance a gap-mismatch pair on one sequence might be just as likely to be a mismatch-gap. Like TT/-A vs. TT/A-.
  2. These programs always find an optimal alignment, even if the sequences are completely dissimilar. BESTFIT/RANDOM=N randomizes the sequences and then aligns them N times. It derives from that the mean and standard deviation for the Quality score. I'm mentioning this option primarily to warn you not to put too much credence in the mean and standard deviations it reports. Often the alignment score may be 4 or 5 deviations above the mean, but it isn't necessarily biologically significant. This is because the scrambling method, which is what BESTFIT uses, is a terrible model for biological sequences. It has been shown that such controls often make alignments appear to be significant when they are not, which GCG acknowledges in the manual, if you read that section carefully.
  3. The output of BESTFIT lists %SIMILARITY and %IDENTITY. For nucleic acids, these are the same thing. For Proteins they are different, the first being the sum of the weights from the comparison matrix along the best path, the second being the sum of the identities along that path. Note that neither value is dependent on the number of gaps, which is somewhat misleading since allowing more gaps will tend to increase these scores. You want to look at the QUALITY and RATIO values, which are scores that do take gaps into account.
  4. Changing the gap weights can change the alignment!
  5. Changing the comparison matrix can change the alignment!
  6. Remember the point I made at the beginning of the lecture - the alignment may not bear much resemblance to what a structural alignment would show. The more similar the sequences are, the closer the sequence alignment will be to a structure alignment.
  7. If you start with two similar sequences, and add random sequence to each, and then run BESTFIT, it will (usually) still find the original two similar regions. However, if you add too much garbage sequence, GAP may no longer align those original two sequence regions. GAP has to align everything, and the noise can overwhelm the signal.

How do you tell if an alignment is significant? One rule of thumb suggested by Russell Doolittle, is that if after alignment you have > 25% identity on > 100 amino acids, then it's probably real. Below that is the twilight zone. (Except, do not apply this rule if the composition of the sequence is highly biased towards a few amino acid types!) Some of the database search tools, which will be discussed in another lecture, have ways of either estimating or measuring the distribution of alignment scores for a query against a database, and can in many cases discover much smaller and weaker similarities which are statistically significant.

Some of the homework problems deal with this problem.

Dot Plots

Another tool that is commonly used for aligning a pair of sequences is the dot plot. The basic dotplot is constructed by putting a dot on a 2D plot everywhere the sequences match, and no dot where they don't. In practice, one generally filters the plot a bit by requiring a short region of similarity for each dot, which cuts down on the visual noise. Regions of identity show up as diagonals, often quite strikingly. Dot plots can also be used to look for repeats or inversions within a single sequence. Two GCG programs are used to make a dotplot:

Set up GCG graphics, compare the two proteins, then
display the results.  This GCG graphics setting stays in
effect until it is changed - it doesn't need to be set
before each program is run. Instruct COMPARE to require 50%
homology over a 30 amino acid window, compare human
myoglobin versus electric eel hemoglobin A. 


$ tektronix verstarm-tek4105 term:
$ compare /infile1=sw:myg_human/infile2=sw:hba_eleel -
  /window=30/string=15.0/default
$ dotplot /infile=myg_human.pnt

Which produces this graphic on the output device:

COMPARE/word=N must be used to produce dotplots of larger sequences. The word method looks for exact matches of subsequences of length N, which is a much quicker operation than the default sliding window method. In either case, the .pnt file which is produced can be very large, so delete it when you're done or you'll use up your disk quota in a hurry.

Multiple Sequence Alignment - whole sequence

We've discussed how one aligns two sequences, or parts of two sequences. What happens when similar information is needed for multiple sequences? The mathematical formalisms of the Smith Waterman and Needleman Wunsch methods do still apply in multiple dimensions, however the problem is computationally expensive in multiple dimensions with typical numbers of average length sequences. That is, while one could in theory do an N-dimensional Smith-Waterman analysis, in practice, except on the simplest cases, it would take more computer power than is generally available.

Instead, most of the available multiple alignment programs use some sort of incremental method. I'll discuss PILEUP, the multiple alignment program that comes with the GCG package, but you should also at some point give CLUSTALW a try, as it is also commonly used. First, let's run a quick alignment, this between human myoglobin, electric eel hemoglobin alpha, echidna alpha-2, and beaver myoglobin. There's nothing special about these - I just want to show what happens when you align a bunch of proteins that are fairly closely related.

$ create pileup.lis
This list file instructs PILEUP to align the two sequences
it names.  The next line after this contains two periods
and separates the comments from the list of sequences
..
sw:myg_human
sw:hba_eleel
sw:hba1_tacac
sw:myg_casfi
^Z

$ pileup/infile=@pileup.lis

PILEUP makes an initial pass where it aligns all pairs of sequences and scores the alignments. This initial alignment is by the method of Needleman and Wunsch, that is, it aligns the entire sequence. Think about that - if you are aligning a bunch of different proteins, and you know some regions are just not at all similar, cut those regions out before you do the alignment, especially if it is easy to do so because they are on the ends of the protein. Similarly, if you are interested just in some particular repeat or motif, extract them from the original sequence as best you can and then do the alignment. The reason is, everything you throw into PILEUP that doesn't belong there just acts to gunk up the works. The final alignment may still come out right, but then, it might not.

At the end of this pass PILEUP has a table of overall similarities between all pairs of sequences. It then incrementally aligns them, starting with the two most similar sequences. It then looks for the next closest pair, considering the weighted distance from the combined first pair. It might add another sequence to that cluster, or it might align another pair, and then align the resulting alignments, and so forth. The program has an option to output a figure of the tree it used. It's usually a good idea to look at it, just to make sure that the order of alignment makes some sort of sense - it can help you catch misnamed sequences, for instance. One more note about that tree - it may look like a phylogenetic tree, but it isn't one. There are times, in fact, most of the time, that if you do a phylogenetic analysis on the aligned sequences the phylogenetic tree that results will be similar to the PILEUP one, but this isn't always the case.

Here is how you look at this figure:


$ figure/infile=pileup.figure

This is what it looks like:

The final product of a PILEUP run is a set of aligned sequences, which are stored in a Multiple Sequence File. As when aligning pairs, the final alignment will usually change somewhat if a different comparison matrix is used, or the gap weights changed.

This is what the final alignment looks like:

$ type pileup.msf
PileUp of: @Pileup.Lis

 Symbol comparison table: GenRunData:Pileuppep.Cmp  CompCheck: 1254

                   GapWeight: 3.000
             GapLengthWeight: 0.100 

 Test.Msf  MSF: 153  Type: P  January 26, 1999 14:59  Check: 4798 ..

 Name: Myg_Human        Len:   153  Check: 4188  Weight:  1.00
 Name: Myg_Casfi        Len:   153  Check: 2412  Weight:  1.00
 Name: Hba_Eleel        Len:   153  Check: 4688  Weight:  1.00
 Name: Hba1_Tacac       Len:   153  Check: 3510  Weight:  1.00

//

            1                                                   50
 Myg_Human  GLSDGEWQLV LNVWGKVEAD IPGHGQEVLI RLFKGHPETL EKFDKFKHLK 
 Myg_Casfi  GLSDGEWQLV LHVWGKVEAD LAGHGQEVLI RLFKGHPETL EKFNKFKHIK 
 Hba_Eleel  SLTAKSKSIV KAFWGKIGSR ADDIGAEAFG RMLTVYPETK TYFASWSDLS 
Hba1_Tacac  VLTDAEKKEV TSLWGKASGH AEEYGAEALE RLFLSFPTTK TYF.SHMDLS 

            51                                                 100
 Myg_Human  SEDEMKASED LKKHGATVLT ALGGILKKKG HHEAEIKPLA QSHATKHKIP 
 Myg_Casfi  SEDEMKASED LKKHGVTVLT ALGGVLKKKG HHEAEIKPLA QSHATKHKIP 
 Hba_Eleel  .....PGSAA VKKHGKTIMG GIAEAVGHID DLTGGLASLS ELHAFKLRVD 
Hba1_Tacac  .....KGSAQ VKAHGKRVAD ALTTAAGHFN DMDSALSALS DLHAHKLRVD 

            101                                                150
 Myg_Human  VKYLEFISEC IIQVLQSKHP GDFGADAQGA MNKALELFRK DMASNYKELG 
 Myg_Casfi  IKYLEFISEA IIHVLQSKHP GBFGADABGA MNKALELFRK DIAAKYKELG 
 Hba_Eleel  PANFKILAHN LIVVLALFFP ADFTPEVHMA VDKFFQNVAS ALSEKYR... 
Hba1_Tacac  PVNFKLLAHC FLVVLARHHP AEFTPSAHAA MDKFLSRVAT VLTSKYR... 

            151  
 Myg_Human  FQG
 Myg_Casfi  FQG
 Hba_Eleel  ...
Hba1_Tacac  ...

In a later lecture we'll discuss the tools which you can use to format a multiple sequence alignment for publication. For right now though, you might just want to know which regions of the alignment have the most similarities, and which the least. To answer this you can use the PLOTSIMILARITY program. The output will show the arithmetic average of the scores of all possible pairwise symbol comparisons among the sequence symbols at that position. When all sequences at a position are the same, the value will be 1.5.


Note the {*} in the next command - leaving it out is
one of the most common mistakes people make when using GCG

$ plotsimilarity/infile=pileup.msf{*}/minscale=0.0/wind=1/bar/default

Multiple Sequence Alignment - subsequences

There is a different multiple alignment problem which one sometimes encounters. Imagine that you suspect that each member of a set of proteins contains a well conserved core region, but that the rest of the protein is highly diverged. If you knew ahead of time the sequence pattern that characterized this region you could search for it, but if you don't, what then? Comparing all small subsequences combinatorially is obvious, but impractical.

A group at the NCBI wrote an interesting program to find just these sorts of motifs using a technique known as the Gibbs sampling strategy. There is rather a lot of math involved in understanding exactly what it does, but in a nutshell, it tries to find the best scoring common ungapped segment of width W in a set of N sequences.

Refer to the following figure, which illustrates how the method works. First the program places one such segment at random on each of the sequences (yellow boxes) and initializes a weighted sequence profile (blue box, Matrix Version 1). A weighted sequence profile is just that - a set of weights to apply for each type of amino acid at each position in the scoring window. It then selects one such sequence at random (Test), recalculates the weighted sequence profile from the remaining N-1 sequences (Matrix, Version 2), then scans it across all N-W+1 positions in the test sequence until it finds the highest scoring segment (red box). It continues in this manner (next version)for 500 cycles (or as specified by the user.)

This method need not always converge, but it often does when there is really a common pattern among the sequences for it to detect. The manner in which the matrix is recalculated is critical though, it must be done a certain way or the process won't converge on a final answer - refer to the paper for the details. Effectively, after a random number of cycles the program stumbles upon a common region in one sequence then finds it on another. This weights the profile towards this common region, so that it tends to be found in subsequent rounds on subsequent sequences - the more sequences having been found, the better the predictor. There is a figure in the paper that illustrates this - it shows the score wandering around while the program searches for a good region, and then rapidly converging once one is found. The number of passes it takes to stumble onto such a region varies tremendously from run to run, even for the same run parameters. Note also that even when patterns are found, they may be shifted several positions left or right in different program runs, as the window size W may not match the pattern exactly, allowing it to score highly in multiple locations.

Here is an example of how to use it. The proteins in this example come are the same as those in the original paper.

$ create list_for_gibbs.fil
This is a list of helix-turn-helix proteins which
we will convert into FASTA format for use with the
gibbs program.
..
PIR:A25944
PIR:A28627
PIR:A32837
PIR:A23450
PIR:B26499
PIR:BVECDA
PIR:C29010
PIR:DNECFS
PIR:JEBY1
PIR:QCBP2L
PIR:QRECC
PIR:RCBPL
PIR:RGBP22
PIR:RGECA
PIR:RGECF
PIR:RGECH
PIR:RGKBCP
PIR:RPECCT
PIR:RPECDO
PIR:RPECG
PIR:RPECL
PIR:RPECTN
PIR:RPECW
PIR:S02513
PIR:S07337
PIR:S07958
PIR:S08477
PIR:S09205
PIR:S11945
PIR:Z1BPC2
^Z

Now create a command file, it contains the answers
to the questions:

  How many types of elements?
  For the first (only, in this case) element,
    how long is the site?
  Do all sequences have the same number of sites?
  How many is that?

  
$ create cmd.hth
1
18
y
1
^Z
$ tofasta/infile=@list_for_gibbs.fil/out=hth.30

The GIBBS command on the next line actually starts a
wrapper procedure which then runs the actual GIBBS
program.  If there were no parameters supplied, it would
have sent usage information to the screen instead. 

$ gibbs hth.30 cmd.hth
Output is in:   HTH.out
Messages are in:HTH.msg
Gibbs processing: completed
$ type hth.msg
How many types of elements? For element 1...
	How long are the sites?
 	Do all seqs have the same # of sites (Y/N)?
 	How many sites in each seq? 1 sites for each running gibbs...
nconv = 500
** 1 **
5
10
15
20
25
** 2 **
5
[---] 0.0522946  motif A cycle 10
10
15
** 3 **
5
[+] 2.18188  motif A cycle 7
10
15
[--] 18.9218  motif A cycle 18
[-] 1.40928  motif A cycle 20
20
[-] 18.6323  motif A cycle 22
[-] 195.936  motif A cycle 24
25
30
[---] 12972.7  motif A cycle 32
35
40	seed: 458719990
	time: 3 seconds (0.05 minutes)
$ type hth.out

#1   A25944 transcription initiation factor sigma B - Bacillus s
#2   A28627 transcription initiation factor sigma K precursor - 
#3   A32837 transcription activator nahR - Pseudomonas putida pl
#4   A23450 homeotic protein Antennapedia - fruit fly (Drosophil
#5   B26499 nitrogen assimilation regulatory protein ntrC - Brad
#6   BVECDA cell division control protein dicA (gene dicB repres
#7   C29010 mercuric resistance operon regulatory merD protein -
#8   DNECFS DNA-binding protein fis - Escherichia coli
#9   JEBY1 mating-type regulatory protein a1 - yeast (Saccharomy
#10  QCBP2L regulatory protein cII - phage lambda
#11  QRECC cAMP receptor protein - Escherichia coli
#12  RCBPL regulatory protein cro - phage lambda
#13  RGBP22 regulatory protein cro - phage P22
#14  RGECA arabinose operon transcription regulator - Escherichi
#15  RGECF fumarate and nitrate reduction regulatory protein - E
#16  RGECH transcription initiation factor sigma 32 - Escherichi
#17  RGKBCP nitrogen regulation protein I ntrC - Klebsiella pneu
#18  RPECCT cyt transcription repressor cytr - Escherichia coli
#19  RPECDO deoxyribose operon repressor - Escherichia coli
#20  RPECG gal operon repressor - Escherichia coli
#21  RPECL lactose operon repressor - Escherichia coli
#22  RPECTN repressor tetR - Escherichia coli transposon Tn10
#23  RPECW trp operon repressor - Escherichia coli
#24  S02513 nif-specific regulatory protein - Klebsiella pneumon
#25  S07337 transcription initiation factor sigma E precursor - 
#26  S07958 DNA-invertase - Escherichia coli
#27  S08477 pur operon repressor - Escherichia coli
#28  S09205 ebg repressor - Escherichia coli
#29  S11945 lexA repressor - Escherichia coli
#30  Z1BPC2 regulatory protein cI - phage P22




                            MOTIF A

 1-1   225  iidltyiqnk SQKETGDILGISQMHVSR lqrkavkklr  242
 2-1   198  rfgldlkkek TQREIAKELGISRSYVSR iekralmkmf  215
 3-1    22  vvfnqllvdr RVSITAENLGLTQPAVSN alkrlrtslq   39
 4-1   326  fhfnryltrr RRIEIAHALCLTERQIKI wfqnrrmkwk  343
 5-1   449  ltaalaatrg NQIRAADLLGLNRNTLRK kirdldiqvy  466
 6-1    22  iryrrknlkh TQRSLAKALKISHVSVSQ wergdseptg   39
 7-1     5        mnay TVSRLALDAGVSVHIVRD yllrgllrpv   22
 8-1    73  ldmvmqytrg NQTRAALMMGINRGTLRK klkkygmn     90
 9-1    99  frrkqslnsk EKEEVAKKCGITPLQVRV wfinkrmrsk  116
10-1    25  sallnkiaml GTEKTAEAVGVDKSQISR wkrdwipkfs   42
11-1   169  thpdgmqiki TRQEIGQIVGCSRETVGR ilkmledqnl  186
12-1    15  itlkdyamrf GQTKTAKDLGVYQSAINK aihagrkifl   32
13-1    12  ykkdvidhfg TQRAVAKALGISDAAVSQ wkevipekda   29
14-1   196  isdhladsnf DIASVAQHVCLSPSRLSH lfrqqlgisv  213
15-1   196  fsprefrltm TRGDIGNYLGLTVETISR llgrfqksgm  213
16-1   252  arwldednks TLQELADRYGVSAERVRQ leknamkklr  269
17-1   444  lttalrhtqg HKQEAARLLGWGRNTLTR klkelgme    461
18-1    11  mkakkqetaa TMKDVALKAKVSTATVSR almnpdkvsq   28
19-1    23  lqelkrsdkl HLKDAAALLGVSEMTIRR dlnnhsapvv   40
20-1     3          ma TIKDVARLAGVSVATVSR vinnspkase   20
21-1     5        mkpv TLYDVAEYAGVSYQTVSR vvnqashvsa   22
22-1    26  llnevgiegl TTRKLAQKLGVEQPTLYW hvknkralld   43
23-1    67  iveellrgem SQRELKNELGAGIATITR gsnslkaapv   84
24-1   495  liaalekagw VQAKAARLLGMTPRQVAY riqimditmp  512
25-1   205  rfglvgeeek TQKDVADMMGISQSYISR lekriikrlr  222
26-1   160  qagrliaagt PRQKVAIIYDVGVSTLYK rfpagdk     177
27-1     3          ma TIKDVAKRANVSTTTVSH vinktrfvae   20
28-1     3          ma TLKDIAIEAGVSLATVSR vlnddptlnv   20
29-1    27  dhisqtgmpp TRAEIAQRLGFRSPNAAE ehlkalarkg   44
30-1    25  ssilnriair GQRKVADALGINESQISR wkgdfipkmg   42
                           5   10   15

POS    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
   1   1  7  7  4  0  0  4 11  7  1  1  0  0  0  4  7 50  0  0  4
   2   1 17  0  0  0 37  1  1  0 11 14  7  4  0  0  1  7  0  0  7
   3  11 21  0  0  0 14  7  4  0  7  1 24  0  0  0  7  7  0  4  1
   4   4 11  0 27  0  0 31  1  0  4  1 20  0  0  0  7  0  0  0  1
   5  17  1  0  0  0  0  1  1  0 21 18  0  0  0  0  1 14  0  0 34
   6  87  1  0  0  0  0  1 11  0  1  1  4  0  0  0  1  0  0  0  1
   7   4 11  7 17  0 14 11  1  4  7 11 20  0  0  0  1  0  0  0  1
   8  17 11  4  7  0  0 11  1  4 11 18 10  7  0  0  1  0  0  7  1
   9  21  1  0  0  3  0  1  1  0  1 54  0  7  0  0  1  0  0  7 11
  10   1  1  4  4  7  0  1 81  0  1  1  7  0  0  0  1  0  0  0  1
  11   4  1  0  0  3  0  1  1  0 27 18  0  4  4  0  1  0  3  0 41
  12   1  4 10  4  0  0  4 11  0  1  1  0  0  0  0 51 17  0  4  1
  13   4 17  0  4  0 17 11  1  4  4  4  4  0  0 10  4  7  0  4 14
  14  17  7  7  0  0  4 11  4  4  1  4  0  7  0 10 24  4  0  0  4
  15  11  7  4  0  0 17  1  1  4  4  1  0  0  0  0  4 47  0  7  1
  16   4  1  0  0  0  0  1  1  0 27 21  0  0  0  0  1  0  0  0 51
  17   7 21  4  0  0  0  1  4  0  1  1  4  0  0  0 51  7  0  7  1
  18   1 47  4  4  0 10  4  1  7  4  1 14  0  0  0  1  0  3  4  4
non-
site   9  7  4  5  1  6  7  6  3  6 11  5  3  3  4  6  5  1  2  6

Complete log-likelihood ratio  =  861 bits
Missing position information   =  227 bits
Log-likelihood ratio statistic =  633 bits
Degrees of freedom             =  342
Information per parameter      = 1.85 bits

                           Best Sites (std. dev. above mean)

 1-1   225  iidltyiqnk SQKETGDILGISQMHVSR lqrkavkklr  242 (6.47)
 2-1   198  rfgldlkkek TQREIAKELGISRSYVSR iekralmkmf  215 (7.59)
 3-1    22  vvfnqllvdr RVSITAENLGLTQPAVSN alkrlrtslq   39 (5.54)
 4-1   326  fhfnryltrr RRIEIAHALCLTERQIKI wfqnrrmkwk  343 (6.05)
 5-1   449  ltaalaatrg NQIRAADLLGLNRNTLRK kirdldiqvy  466 (6.27)
 6-1    22  iryrrknlkh TQRSLAKALKISHVSVSQ wergdseptg   39 (6.15)
 7-1     5        mnay TVSRLALDAGVSVHIVRD yllrgllrpv   22 (5.25)
 8-1    73  ldmvmqytrg NQTRAALMMGINRGTLRK klkkygmn     90 (5.71)
 9-1    99  frrkqslnsk EKEEVAKKCGITPLQVRV wfinkrmrsk  116 (6.08)
10-1    25  sallnkiaml GTEKTAEAVGVDKSQISR wkrdwipkfs   42 (6.97)
11-1   169  thpdgmqiki TRQEIGQIVGCSRETVGR ilkmledqnl  186 (6.12)
12-1    15  itlkdyamrf GQTKTAKDLGVYQSAINK aihagrkifl   32 (6.97)
13-1    12  ykkdvidhfg TQRAVAKALGISDAAVSQ wkevipekda   29 (6.06)
14-1   196  isdhladsnf DIASVAQHVCLSPSRLSH lfrqqlgisv  213 (5.69)
15-1   196  fsprefrltm TRGDIGNYLGLTVETISR llgrfqksgm  213 (6.88)
16-1   252  arwldednks TLQELADRYGVSAERVRQ leknamkklr  269 (5.97)
17-1   444  lttalrhtqg HKQEAARLLGWGRNTLTR klkelgme    461 (6.63)
18-1    11  mkakkqetaa TMKDVALKAKVSTATVSR almnpdkvsq   28 (7.02)
19-1    23  lqelkrsdkl HLKDAAALLGVSEMTIRR dlnnhsapvv   40 (6.96)
20-1     3          ma TIKDVARLAGVSVATVSR vinnspkase   20 (8.14)
21-1     5        mkpv TLYDVAEYAGVSYQTVSR vvnqashvsa   22 (6.61)
22-1    26  llnevgiegl TTRKLAQKLGVEQPTLYW hvknkralld   43 (7.33)
23-1    67  iveellrgem SQRELKNELGAGIATITR gsnslkaapv   84 (5.57)
24-1   495  liaalekagw VQAKAARLLGMTPRQVAY riqimditmp  512 (5.78)
25-1   205  rfglvgeeek TQKDVADMMGISQSYISR lekriikrlr  222 (8.19)
26-1   160  qagrliaagt PRQKVAIIYDVGVSTLYK rfpagdk     177 (6.00)
27-1     3          ma TIKDVAKRANVSTTTVSH vinktrfvae   20 (7.14)
28-1     3          ma TLKDIAIEAGVSLATVSR vlnddptlnv   20 (6.42)
29-1    27  dhisqtgmpp TRAEIAQRLGFRSPNAAE ehlkalarkg   44 (5.19)
30-1    25  ssilnriair GQRKVADALGINESQISR wkgdfipkmg   42 (6.11)
                           5   10   15


Total log-likelihood ratio  =  634 bits

This is, in fact, a good predictor for the helix turn helix motif, as shown in the original paper. The quality of the predictor is indicated by the value of "Information per parameter", which is 1.85 bits. For a set of random sequences (no pattern) that value would instead have been on the order of 1.2. The numbers to the right of the best sites show standard deviations above the mean for that profile at this site, the numbers to the left, like "20-1", mean sequence # - pattern #, ie, the first pattern location in the 20th sequence.

I encourage you all to do the homework, which will hopefully help clarify some of the points that we covered here today. Next week we'll go over databases and database searching.

Are there any questions?