Fundamentals of Sequence Analysis, 1998-1999
Problem set 2:  Sequence Alignment Fundamentals

If you get stuck, refer to the OpenVMS and GCG resources in the 
class home page.

References:
 
  T.F. Smith and M.S. Waterman (1981) "Identification of Common
  Molecular Subsequences", Journal Molecular Biology 147:195-197

  S.B. Needleman and C.D. Wunsch (1970) "A General Method Applicable
  to the Search for Similarities in the Amino Acid Sequence of Two
  Proteins", Journal Molecular Biology 48:443-453.

  C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu,
  A. F. Neuwald, J. C. Wootton (1993) "Detecting Subtle Sequence
  Signals: A Gibbs Sampling Strategy for Multiple Alignment",
  Science 262:208-214.

  W.R. Pearson (1995) "Comparison of methods for searching protein
  sequence databases", Protein Science 4:1145-1160.

Problem group 1.  Working sequences

For all of these problem sets you will need a consistent
set of data.  Pick a protein that interests you that is at least several
hundred amino acids long.  (In the examples I use PIR1:A1HU, but you can
use any that you like).  Progressively corrupt the sequence without
inducing gaps using the program PEPCORRUPT.

PEPCORRUPT was written locally, it is not yet part of either GCG or EGCG. 
It will do random substitutions or indel insertions.  Note that the "seed"
for the random number generator changes on each run, so two runs with
identical substition and indel numbers will produce different output.

 $ X = 100
 $ PEPCORRUPT/INFILE=PIR1:A1HU/outfile=A1HU_'x'.pep -
 $_   /SUBS='x'/indels=0/default
 $ X = X + 100

  remember the up-arrow key , you don't need to type any more lines!

 $ PEPCORRUPT/INFILE=PIR1:A1HU/outfile=A1HU_'x'.pep -
 $_   /SUBS='x'/indels=0/default

and so forth, up to substitutions = roughly 2X the length of the protein.

Think about what these are - they represent a set of sequences that
diverged different amounts from a core sequence, and none of them share
any sequence changes except by chance.  You might see something like
this if N species radiated from a single progenitor, but each had a 
different level of selective pressure on this one protein.

1A.  How much remaining identity is there in each of A1HU_*.pep files?
     (Hint, don't type them all, use SEARCH).

1B.  If your sequence had the same number of each type of amino acid
     and you told PEPCORRUPT to do 100 X Length substitutions, what
     would the remaining identity be?  (Don't do the experiment, think
     about it!)

1C.  In the A1HU_*.pep files there is a line giving the number of
     "Effective Substitutions", which is smaller than the number of
     total substitutions, and represents the number of sequence positions
     that are different than in the original sequence.  Yet each random
     substitution did in fact change one residue to another.  Why aren't
     these numbers the same? 

Problem Group 2.  Smith Waterman alignment

Use Bestfit to align the starting sequence with each of the corrupted
sequences using the default matrix and gap weights.  Hint, use
the same trick with "X" as above, and a series of commands like:

 $ bestfit/infile1=a1hu_'x'.pep/infile2=pir1:a1hu/default

The output will be in a series of files called *.PAIR.

2A.  At what level of substitution do things start going wrong, as 
     indicated by the introduction of gaps.  (Recall, the corrupted
     sequence has residue changes, but no indels added.)

2B.  Look at the Quality, Ratio, Gaps, and Similarity scores in the *.pair
     files (use SEARCH).  What can you say about each value as progressively
     corrupted sequences are aligned?

2C.  Use PEPCORRUPT to generate TWO corrupted sequences that each
     have 50% remaining identity with the same starting sequence
     (for A1HU that would be about 250 substitutions).  Then use BESTFIT
     to align those with each other and with the starting sequence.
     What happens?

Problem Group 3.  Multiple alignment

Use pileup to align all of the corrupted files plus the original.

3A.  Are the results from the multiple alignment any better than from
     the pairwise alignments?  Why, or why not?

Make another series of corrupted protein sequences, but this time
make the N'th sequence by running PEPCORRUPT on the N-1th sequence, 
iteratively, up to the same level as before.  Example:

     $ Y = 0
     $ X = 100
     $ PEPCORRUPT/inf=pir1:a1hu/outf=ser_'x'.pep/subs=100/indels=0/default
     $! Repeat the following three lines, in order, until you are done
     $ Y = X
     $ X = X + 100
     $ PEPCORRUPT/inf=ser_'y'.pep/outf=ser_'x'.pep/subs=100/indels=0/default
 
This series has the same number of mutations at each step, with respect to
the starting sequence, as did the first series, but in this case mutations
are retained until mutated again. 

Use Bestfit to align each of these new sequences against the original.  Then
use Pileup to align all of these new sequences with the original.

3B.  Is the quality of the alignment at each step against the starting 
     sequence better, worse, or the same, versus the first set of
     sequences (from Problem group 1 and 2)?

     Are the results from the multiple alignment any better than from the
     pairwise alignments?  Why, or why not? 

Problem Group 4.  Local Multiple alignments

The point of this exercise is to learn how to use the GIBBS program
for local alignment, and how to evaluate what comes out of it.  For this
example, we'll take a look at Fibronectin, a single protein which contains
multiple subunits.  First, we need to put it into FASTA format so that
the GIBBS program can work on it:

  $ tofasta/infile=sw:finc_human/outfile=fibronectin.fasta

Next we need to put together a command file.  We're going to pretend that
we have some sort of experimental data that indicates that there are
at least 15 sites that bind something or other on this protein.

Now we need to create a command file to tell the GIBBS program what to do:

  $ create fibronectin.cmd
  1           How many types of sites?
  10          How long is the first type of site?  (We're guessing)
  Y           Do all of the sequences have the same number of sites?
  15          How many sites in Sequence 1?
  {^Z}        Press the ^Z key to terminate input

Now run the analysis:

  $ gibbs fibronectin.fasta fibronectin.cmd

4A.  Look at the output file and compare the locations with the reference
     material in the SW:finc_human file (use FETCH to put a copy in
     your directory so that you can read it.)  What did it find?  Run
     the program again, with the exact same command line.  Try it a couple
     of times. Did it find the same thing, why or why not?

Modify the fibronectin.cmd to search for 4 motifs, all of size 10 and 
present 15 times.  Then run the Gibbs program with that command file.  Do
it a couple of times.

4B.  Are the results consistent between runs?

It is usually extremely difficult to prove that a local alignment is
significant without some form of experimental evidence to support it. 
However, you can try scrambling the input data by adding the "-r" option
and running the analysis again: 

  $ gibbs fibronectin.fasta fibronectin.cmd "-r"

4C.  Compare the information per parameter value for each motif from the
     test run versus the control run.  Are these values significantly
     higher in the experimental case than in the controls?

4D.  Are there only 4 motifs, or are there more?  Try increasing the number
     of motifs that you search for.  (But first reduce your run priority
     as this can get time consuming:  Set proc/prio=2).  How many motifs
     are present at least 15 times?  (Hint, when the scrambled controls
     score as highly as the worst motif, you are at the sensitivity limit.)