Fundamentals of Sequence Analysis, 1998-1999
Answer set 2:  Sequence Alignment Fundamentals

Problem group 1.  Working sequences

1A.  How much remaining identity is there in each of A1HU_*.pep files?
     (Hint, don't type them all, use SEARCH).

A1HU has length 353, for this protein, the answer is:

Steps       % remaining identity (not all digits significant)
100         74.220970
200         57.790367
300         44.475918
400         34.560905
500         28.611898
600         19.263458
700         16.430593

1B.  If your sequence had the same number of each type of amino acid
     and you told PEPCORRUPT to do 100 X Length substitutions, what
     would the remaining identity be?  (Don't do the experiment, think
     about it!)

5%.  There are 20 amino acids.  If each of the amino acids is equally
represented in the original sequence, and the corrupted sequence is random,
then one in 20 will match just by chance.  Note that few if any proteins
are composed of an equal mix of all amino acids!

1C.  In the A1HU_*.pep files there is a line giving the number of
     "Effective Substitutions", which is smaller than the number of
     total substitutions, and represents the number of sequence positions
     that are different than in the original sequence.  Yet each random
     substitution did in fact change one residue to another.  Why aren't
     these numbers the same? 

Because some sites have been mutated more than once.  Of those, about
5% back mutate to the original amino acid.  You can use Poisson statistics
to figure this out (see any statistics book for the formula.)

Problem Group 2.  Smith Waterman alignment

2A.  At what level of substitution do things start going wrong, as 
     indicated by the introduction of gaps.  (Recall, the corrupted
     sequence has residue changes, but no indels added.)

Gaps will start to be introduced at somewhere between 35% and 30% identity.
(In this particular case we know that gapping is wrong since the test 
sequences have substitutions but no indels.)

2B.  Look at the Quality, Ratio, Gaps, and Similarity scores in the *.pair
     files (use SEARCH).  What can you say about each value as progressively
     corrupted sequences are aligned?

Steps       % identity  Quality Ratio Gaps Similarity
100         74.220970   392.2   1.111    0     77.620
200         57.790367   313.2   0.895    0     65.429
300         44.475918   237.4   0.676    0     52.707
400         34.560905   184.4   0.524    0     45.455 
500         28.611898   162.0   0.474    2     41.176
600         19.263458   117.9   0.343    8     38.438
700         16.430593    97.3   0.290    7     34.862

The obvious point is that Quality, Ratio, and Similarity go down
with the % identity, and the number of Gaps goes up.  A more
interesting observation is that the alignments are incorrect
(given our prior knowledge of what the result should be) for a ratio
of <= 0.5.  Note also that the Similarity score is "inflated"
by the addition of gaps - it doesn't fall as far as it would if
gaps were not allowed.  Don't look at the Similarity score - look
instead at the Quality score, or better, the ratio score, which
is Quality/length.

2C.  Use PEPCORRUPT to generate TWO corrupted sequences that each
     have 50% remaining identity with the same starting sequence
     (for A1HU that would be about 250 substitutions).  Then use BESTFIT
     to align those with each other and with the starting sequence.

The mutations in each are independent, so when the two corrupted sequences
are aligned with each other, even though each has 50% identity with the
starting sequence, they are only 25% identical to each other, and a rather
poor alignment results.  Conversely, each aligns without gapping with the
starting sequence.  

Problem Group 3.  Multiple alignment

3A.  Are the results from the multiple alignment any better than from
     the pairwise alignments?  Why, or why not?

No, they are the same.  This is because the closest sequence to each 
corrupted sequence is the starting sequence.  There is little or no useful
information to be gained by comparing the corrupted sequences to each other
because because they are all quite different from each other (see 2C above).

3B.  Is the quality of the alignment at each step against the starting 
     sequence better, worse, or the same, versus the first set of
     sequences (from Problem group 1 and 2)?

The BESTFITs come out about the same.  This is because each step in 
both series is mutated the same amount against the starting sequence
(within the vagaries of the random number generator, statistical 
fluctuations, and so forth.)

     Are the results from the multiple alignment any better than from the
     pairwise alignments?  Why, or why not? 

This time the multiple alignment is much better.  The sequences are closer
together so that comparing the corrupted sequences to each other yields
a significant amount of information, and the multiple alignment program
takes advantage of it. Note also that the distances between successive steps
in this case are small, whereas in the previous case, these distances were
very large.  Lastly, notice that the order of alignment in the two cases
is different.  This is because in the first case, each sequence was closest 
to the starting sequence, whereas in the second case, each was closest to
the sequences immediately before and after it in the series.

4A.  Look at the output file and compare the locations with the reference
     material in the SW:finc_human file (use FETCH to put a copy in
     your directory so that you can read it.)  What did it find?  Run
     the program again, with the exact same command line.  Try it a couple
     of times. Did it find the same thing, why or why not?

Most likely you found motifs that correlate with the notations:
DISULFIDE or DOMAIN (Fibronectin type III).  Any of these score high
enough for the program to find, but which you get depends upon which
one it stumbles upon first, which is random given the algorithm used
by the program.

4B.  Are the results consistent between runs?

Yes and no.  If you find a particular Motif among the 4 it will generally
score about the same and be shown at the same position (possibly shifted 
left or right by one).  However, there may be more than 4 Motifs that score
well, in which case, different runs will select at random from this set.

4C.  Compare the information per parameter value for each motif from the
     test run versus the control run.  Are these values significantly
     higher in the experimental case than in the controls?

Yes, although you can't tell which control goes with which Motif!
Sort them and align them (your values may vary a bit from these):

  Exp.  Control
  2.08  1.38
  1.93  1.34
  1.90  1.32
  1.70  1.30

All of the experimental values are greater than all of the controls.  Keep
in mind though, that scrambling controls are often a poor indicator of
biological significance, since the scrambled sequence is usually more
random than an unrelated protein sequence would be.

4D.  Are there only 4 motifs, or are there more?  Try increasing the number
     of motifs that you search for.  (But first reduce your run priority
     as this can get time consuming:  Set proc/prio=2).  How many motifs
     are present at least 15 times?  (Hint, when the scrambled controls
     score as highly as the worst motif, you are at the sensitivity limit.)

There are 7 motifs that consistently score higher than the controls (max
1.39),and 5 motifs that score comfortably higher (>1.5).  That doesn't
exactly answer the question though, since we *forced* the motifs to
match 15 positions, some of them may really only have been present at
14 or 13 positions, with the other positions filled in by chance.  To try
to estimate the real number of positions, one can look at the standard
deviation measure associated with each position to see if any is
suspiciously low.  The command file can then be modified and the process
rerun.  This can be a bit painful though, since you can't force the Motif
that you are trying to control to be found in the right order, and you may
have to run the program several times until by chance it comes up
as the Motif with only N positions.