Fundamentals of Sequence Analysis, 1998-1999 Answer set 2: Sequence Alignment Fundamentals Problem group 1. Working sequences 1A. How much remaining identity is there in each of A1HU_*.pep files? (Hint, don't type them all, use SEARCH). A1HU has length 353, for this protein, the answer is: Steps % remaining identity (not all digits significant) 100 74.220970 200 57.790367 300 44.475918 400 34.560905 500 28.611898 600 19.263458 700 16.430593 1B. If your sequence had the same number of each type of amino acid and you told PEPCORRUPT to do 100 X Length substitutions, what would the remaining identity be? (Don't do the experiment, think about it!) 5%. There are 20 amino acids. If each of the amino acids is equally represented in the original sequence, and the corrupted sequence is random, then one in 20 will match just by chance. Note that few if any proteins are composed of an equal mix of all amino acids! 1C. In the A1HU_*.pep files there is a line giving the number of "Effective Substitutions", which is smaller than the number of total substitutions, and represents the number of sequence positions that are different than in the original sequence. Yet each random substitution did in fact change one residue to another. Why aren't these numbers the same? Because some sites have been mutated more than once. Of those, about 5% back mutate to the original amino acid. You can use Poisson statistics to figure this out (see any statistics book for the formula.) Problem Group 2. Smith Waterman alignment 2A. At what level of substitution do things start going wrong, as indicated by the introduction of gaps. (Recall, the corrupted sequence has residue changes, but no indels added.) Gaps will start to be introduced at somewhere between 35% and 30% identity. (In this particular case we know that gapping is wrong since the test sequences have substitutions but no indels.) 2B. Look at the Quality, Ratio, Gaps, and Similarity scores in the *.pair files (use SEARCH). What can you say about each value as progressively corrupted sequences are aligned? Steps % identity Quality Ratio Gaps Similarity 100 74.220970 392.2 1.111 0 77.620 200 57.790367 313.2 0.895 0 65.429 300 44.475918 237.4 0.676 0 52.707 400 34.560905 184.4 0.524 0 45.455 500 28.611898 162.0 0.474 2 41.176 600 19.263458 117.9 0.343 8 38.438 700 16.430593 97.3 0.290 7 34.862 The obvious point is that Quality, Ratio, and Similarity go down with the % identity, and the number of Gaps goes up. A more interesting observation is that the alignments are incorrect (given our prior knowledge of what the result should be) for a ratio of <= 0.5. Note also that the Similarity score is "inflated" by the addition of gaps - it doesn't fall as far as it would if gaps were not allowed. Don't look at the Similarity score - look instead at the Quality score, or better, the ratio score, which is Quality/length. 2C. Use PEPCORRUPT to generate TWO corrupted sequences that each have 50% remaining identity with the same starting sequence (for A1HU that would be about 250 substitutions). Then use BESTFIT to align those with each other and with the starting sequence. The mutations in each are independent, so when the two corrupted sequences are aligned with each other, even though each has 50% identity with the starting sequence, they are only 25% identical to each other, and a rather poor alignment results. Conversely, each aligns without gapping with the starting sequence. Problem Group 3. Multiple alignment 3A. Are the results from the multiple alignment any better than from the pairwise alignments? Why, or why not? No, they are the same. This is because the closest sequence to each corrupted sequence is the starting sequence. There is little or no useful information to be gained by comparing the corrupted sequences to each other because because they are all quite different from each other (see 2C above). 3B. Is the quality of the alignment at each step against the starting sequence better, worse, or the same, versus the first set of sequences (from Problem group 1 and 2)? The BESTFITs come out about the same. This is because each step in both series is mutated the same amount against the starting sequence (within the vagaries of the random number generator, statistical fluctuations, and so forth.) Are the results from the multiple alignment any better than from the pairwise alignments? Why, or why not? This time the multiple alignment is much better. The sequences are closer together so that comparing the corrupted sequences to each other yields a significant amount of information, and the multiple alignment program takes advantage of it. Note also that the distances between successive steps in this case are small, whereas in the previous case, these distances were very large. Lastly, notice that the order of alignment in the two cases is different. This is because in the first case, each sequence was closest to the starting sequence, whereas in the second case, each was closest to the sequences immediately before and after it in the series. 4A. Look at the output file and compare the locations with the reference material in the SW:finc_human file (use FETCH to put a copy in your directory so that you can read it.) What did it find? Run the program again, with the exact same command line. Try it a couple of times. Did it find the same thing, why or why not? Most likely you found motifs that correlate with the notations: DISULFIDE or DOMAIN (Fibronectin type III). Any of these score high enough for the program to find, but which you get depends upon which one it stumbles upon first, which is random given the algorithm used by the program. 4B. Are the results consistent between runs? Yes and no. If you find a particular Motif among the 4 it will generally score about the same and be shown at the same position (possibly shifted left or right by one). However, there may be more than 4 Motifs that score well, in which case, different runs will select at random from this set. 4C. Compare the information per parameter value for each motif from the test run versus the control run. Are these values significantly higher in the experimental case than in the controls? Yes, although you can't tell which control goes with which Motif! Sort them and align them (your values may vary a bit from these): Exp. Control 2.08 1.38 1.93 1.34 1.90 1.32 1.70 1.30 All of the experimental values are greater than all of the controls. Keep in mind though, that scrambling controls are often a poor indicator of biological significance, since the scrambled sequence is usually more random than an unrelated protein sequence would be. 4D. Are there only 4 motifs, or are there more? Try increasing the number of motifs that you search for. (But first reduce your run priority as this can get time consuming: Set proc/prio=2). How many motifs are present at least 15 times? (Hint, when the scrambled controls score as highly as the worst motif, you are at the sensitivity limit.) There are 7 motifs that consistently score higher than the controls (max 1.39),and 5 motifs that score comfortably higher (>1.5). That doesn't exactly answer the question though, since we *forced* the motifs to match 15 positions, some of them may really only have been present at 14 or 13 positions, with the other positions filled in by chance. To try to estimate the real number of positions, one can look at the standard deviation measure associated with each position to see if any is suspiciously low. The command file can then be modified and the process rerun. This can be a bit painful though, since you can't force the Motif that you are trying to control to be found in the right order, and you may have to run the program several times until by chance it comes up as the Motif with only N positions.