Fundamentals of Sequence Analysis, 1995-1996
Problem set 7:  Phylogenetic analysis.

If you get stuck, refer to the OpenVMS and GCG resources in the 
class home page.

References:
 
  See the GCG and PHYLIP documentation.

  Reviews:

      Felsenstein, J.  Phylogenies from Molecular Sequences: Inference
      and Reliability (1988)  Ann. Rev. of Genetics 22:521-565

Problem group 1.  Artificial data

Create two sets of data (STAR and FORK)  as for Problem set 2.  For the
STAR set start with PIR1:A1HU and produce sequences that differ from it by
50, 100, 150, 200, 250, 300, 350, and 400 substitutions (no indels). That
is A1HU->A50, A1HU->A100, A1HU->A150, etc.. For the FOR set,  again start
with PIR1:A1HU, but this time have each sequence in the order differ from
the preceding one by 50.  That is A1HU->A50, A50->A100, A100->A150 and so
forth. 

Use PEPCORRUPT to do this.  This can be done manually, but the following
procedures indicate the method.  The wait statements are required because 
on the GCG random number generator gets its seed from the time() function,
which returns seconds.  A fast machine can easily run several PEPCORRUPT
passes in a second, and without the waits they would have a false 
correlation due to reuse of the same random number generator seed.

$! make the STAR set, of up to 400 subs starting from A1HU in each case
$ X = 0
$ top:
$ wait 00:00:02.00
$ PEPCORRUPT/INFILE=PIR1:A1HU/outfile=STAR_'x'.pep /SUBS='x'/indels=0/default
$ X = X + 50
$ if (X .le. 400) then goto top
$ exit


$! make the FORK set, of up to 400 subs, sequentially from A1HU 
$ X = 0
$ Y = 50
$ reformat/infile=pir1:a1hu/out=FORK_0.pep
$ top:
$ wait 00:00:02.00
$ PEPCORRUPT/INFILE=FORK_'x'.pep/outfile=FORK_'y'.pep /SUBS=50/indels=0/default
$ X = Y
$ Y = Y + 50
$ if (Y .le. 400) then goto top
$ exit



Align each set separately using PILEUP.  Set the gap penalty so that
no gaps are introduced.  Save the tree that results from each run.


$ pileup/infile=star*.pep/gap=1000/outf=star.msf/figure=star.figure/default
$ pileup/infile=fork*.pep/gap=1000/outf=fork.msf/figure=fork.figure/default


1A.  Use the PARSIMONY method to derive a phylogenetic tree.  How do the
     trees produced by pileup and the phylogenetic tree compare?


$ usephylip
3
2
star.msf
0
6
3
1
PROTPARS
3
STAR
0
2
Y

Same for fork.

Then use DRAWTREE to look at the tree on whatever graphics device you 
have.

Here is the STAR tree:

(((Star_200,Star_100),(((Star_350,Star_250),((Star_400,Star_300),Star_150)),
Star_50)),Star_0);

Here is the FORK tree:
(((Fork_250,(Fork_200,(Fork_150,(Fork_100,(Fork_50,Fork_0))))),(Fork_400,
Fork_350)),Fork_300);

Of the four trees, only the PROTPARS tree for the FORK set correctly
represents the (in this case, known) phylogenetic relationships.

Note that these data sets imply either varying rates of substitution or
sequences collected at different points in time.  


1B.  Use the FITCH, NEIGHBOR joining, and UPGMA methods to derive
     phylogenetic trees.


As above, but first get distances using PROTDIST.

Then use programs FITCH and NEIGHBOR, and change one option in
NEIGHBOR to use UPGMA.

Here are the trees:
Fork:
 Fitch:
((Fork_250:0.00000,(Fork_200:0.00000,(Fork_150:0.00000,(Fork_100:0.00000,
(Fork_50:0.00000,Fork_0:0.14758):0.15922):0.16049):0.17109):0.15522):0.15115,
(Fork_350:0.00000,Fork_400:0.16044):0.14738,Fork_300:0.00000);

  That is, Fitch correctly figures out that there is about 15% substitution
  between each step, and that the side branches are of length zero.  (This is
  slightly artificial, it probably truncated negative branches to 0.0)

Neighbor Joining:
((Fork_350:-0.01663,Fork_400:0.16708):0.16071,(((((Fork_0:0.16497,
Fork_50:-0.02487):0.17336,Fork_100:-0.01436):0.17346,Fork_150:-0.01667):0.18860,
Fork_200:-0.01771):0.16553,Fork_250:-0.01392):0.16368,Fork_300:-0.01417);

  Also quite good.  The average substitution, ignoring the negative branches,
  is slightly higher though.

UPGMA:
((((Fork_300:0.07005,Fork_350:0.07005):0.04342,Fork_400:0.11347):0.07704,
(Fork_200:0.07349,Fork_250:0.07349):0.11702):0.18394,((Fork_0:0.07005,
Fork_50:0.07005):0.08526,(Fork_100:0.07523,Fork_150:0.07523):0.08009):0.21914);

  Very bad - UPGMA requires ultrametric data, and this example is
  far, far, far, from being ultrametric!  In particular, note that the
  enormous distance from Fork_150 to Fork_200.

Star data set:

Fitch:
((((Star_300:1.07725,(Star_50:0.03750,Star_350:1.35190):0.06937):0.03638,
(Star_400:1.51219,Star_100:0.19956):0.11322):0.00000,(Star_250:0.79743,
Star_150:0.45817):0.03335):0.00000,Star_200:0.74766,Star_0:0.00000);

  Again, quite good. the result shows the expected star topology (within
  the constraint that there can only be bifurcations - here all sequences
  derive directly from star_0.)


Neighbor-Joining:
((Star_50:-0.08271,Star_350:1.47211):0.11812,(Star_200:0.71642,
Star_250:0.88127):0.05312,((Star_0:-0.23490,(Star_100:0.00949,
Star_400:1.70226):0.20177):0.08965,
(Star_150:0.36481,Star_300:1.13103):0.06950):0.04987);

  Not too bad, it gets the general trend about right, although there is
  at least one fairly large negative internal branch.

UPGMA:
((((((((Star_0:0.07349,Star_50:0.07349):0.11726,Star_100:0.19075):0.13626,
Star_150:0.32700):0.16676,Star_200:0.49376):0.07529,Star_250:0.56905):0.21137,
Star_300:0.78042):0.22673,Star_350:1.00715):0.27447,Star_400:1.28163);

  This result is interesting.  The program has to assume that all sequences
  are contemporary and that they have the same mutation rate.  However the
  data as generated is not consistent with both of those assumptions.  
  Since the algorithm cannot alter those two assumptions it compensates by
  adding a time equivalent to the 50 substitution difference between each
  branch point.

Problem group 2.  Real data

2A. The phylogenetic relationships between humans and the great apes has 
    been a topic of great debate.  Solve it to your own satisfication using
    the protein sequences in Swiss-Protein.


The first step is to figure out which proteins have been sequenced in all
four species  (PANTR= chimpanzee , GORGO=gorilla, HUMAN=human, PONPY=orang)
The list is actually quite short.  In many of these the sequence is the
same for 2 or more species, as indicated by names in parenthesis that 
follow.  There are several other proteins which are known for 2 or 3 of
these species, but they are not considered here.

 B2MG_HUMAN
      HUMAN  (GORGO)
      HUMAN  (PANTR)
      PONPY
  HBA_HUMAN
      GORGO
      HUMAN  (PANTR)
      PONPY
 HSP1_GORGO
      HUMAN
      PANTR
      PONPY
 HSP2_GORGO
      HUMAN
      PANTR
      PONPY
 NU4M_GORGO
      HUMAN
      PANTR
      PONPY
 NU5M_GORGO
      HUMAN
      PANTR
      PONPY
 PRIO_GORGO
      HUMAN
      PANTR
      PONPY

A procedure that aligns all of these, then reformats the resulting
MSF files so that they are merged into a single large file containing all 
sequences, can be found at the end of this document (MULTI.COM)

The result of this is 1727 residues of sequence.  However, it turns
out that for both NU4M and NU5M the human sequence contains much more of
the gene than do the species for the other sequences.  These artificial
gaps need to be removed, it just so happens that the two gaps are put
together in the result in positions 333-1163, so:

$ reformat/infile=all.msf{*}/begin=333/end=1163/delete/msf

Here are the trees that result from:

PROTPARS:

((Pantr,(Ponpy,Human)),Gorgo);

(Make distances using Kimura assumption for the following methods:)

    4
Gorgo          0.00000  0.04134  0.03546  0.09296
Human          0.04134  0.00000  0.03429  0.08412
Pantr          0.03546  0.03429  0.00000  0.08916
Ponpy          0.09296  0.08412  0.08916  0.00000


FITCH:
((Ponpy:0.06873,Human:0.01539):0.00465,Pantr:0.01445,Gorgo:0.02101);

KITSCH:
(((Pantr:0.01715,Human:0.01715):0.00183,Gorgo:0.01898):0.02525,Ponpy:0.04422);

Neighbor-Joining:
(Human:0.01544,Ponpy:0.06868,(Gorgo:0.02044,Pantr:0.01502):0.00465);

UPGMA:
((Gorgo:0.01920,(Human:0.01715,Pantr:0.01715):0.00206):0.02517,Ponpy:0.04437);

These are in pretty good agreement - Orangutan is clearly the most
dissimilar of the four, but Human, Chimp and Gorilla are roughly
equidistant from the branch point, so the exact order of branching is
not well defined, and comes out different ways for different algorithms.
This can lead to seemingly odd results, such as Protein Parsimony
putting Humans closest to Orangutans (best understood this way - the 
algorithm has to put *something* with Orangutans, and because of the 
uncertainty about the exact order of branching, it can place humans there.)
In terms of distance, Orangutan is closer to Human than to Chimps or
Gorilla, but you need to keep in mind that the limiting factor in this
analysis is the lack of data - there are less than 100 residues for which
differences exist between at least one species and the others.  Or put
another way, the determining factor in what is closest to Orangutan comes
down to the identities of just a couple of residues.

So, if you really are interested in this problem - sequence more Orangutan
and Gorilla genes!


 
$! Multi.com
$! 21-FEB-1996
$! David Mathog, Biology Division, Caltech
$!
$! align and reformat into a single MSF file a bunch of
$! human, gorilla, chimp, and orang sequences.
$! In each instance, the sequence is kept in that order
$! even if a single common entry holds the common sequence
$!
$!
$ call doit "b2mg" "sw:b2mg_human,sw:b2mg_human,sw:b2mg_human,sw:b2mg_ponpy"
$ call doit "hba"  "sw:hba_human,sw:hba_gorgo,sw:hba_human,sw:hba_ponpy"
$ call doit "hsp1" "sw:hsp1_human,sw:hsp1_gorgo,sw:hsp1_pantr,sw:hsp1_ponpy"
$ call doit "hsp2" "sw:hsp2_human,sw:hsp2_gorgo,sw:hsp2_pantr,sw:hsp2_ponpy"
$ call doit "nu4m" "sw:nu4m_human,sw:nu4m_gorgo,sw:nu4m_pantr,sw:nu4m_ponpy"
$ call doit "nu5m" "sw:nu5m_human,sw:nu5m_gorgo,sw:nu5m_pantr,sw:nu5m_ponpy"
$ call doit "prio" "sw:prio_human,sw:prio_gorgo,sw:prio_pantr,sw:prio_ponpy"
$!
$! merge all of the .MSF files
$!
$ call no_dot_txt
$!
$ write sys$output "Splitting MSF files out into text files"
$ totext/infile=b2mg.msf{*}/default
$ totext/infile=hba.msf{*}/default
$ totext/infile=hsp1.msf{*}/default
$ totext/infile=hsp2.msf{*}/default
$ totext/infile=nu4m.msf{*}/default
$ totext/infile=nu5m.msf{*}/default
$ totext/infile=prio.msf{*}/default
$!
$ write sys$output "Merging all text files to one/species"
$ append/new human.txt;* human.all
$ append/new gorgo.txt;* gorgo.all
$ append/new pantr.txt;* pantr.all
$ append/new ponpy.txt;* ponpy.all
$ write sys$output "Merging all species into one .MSF file"
$!
$! Note to other sites, the /NODots qualifier is a local modification
$! to REFORMAT - write if you want it.
$!
$ reformat/msf/protein/infile=*.all/outfile=all.msf/nodots
$!
$ call no_dot_txt
$!
$ write sys$output "Final MSF file is called ALL.MSF"
$ exit
$!
$!
$!
$ doit:  SUBROUTINE
$ write sys$output "Processing ''P1' = ''P2'"
$!
$! first extract the pieces into standard named files
$!
$ entry = f$element(0,",",P2)
$ reformat/infile='entry'/outfile=human.killme/default
$ entry = f$element(1,",",P2)
$ reformat/infile='entry'/outfile=gorgo.killme/default
$ entry = f$element(2,",",P2)
$ reformat/infile='entry'/outfile=pantr.killme/default
$ entry = f$element(3,",",P2)
$ reformat/infile='entry'/outfile=ponpy.killme/default
$!
$! now run pileup
$!
$ pileup/infile=*.killme/outfile='P1'.msf/noplot/default
$!
$! then clean up the temporary files and return
$!
$ delete *.killme.*
$ exit
$ ENDSUBROUTINE 
$!
$ no_dot_txt: SUBROUTINE
$ if(f$search("human.txt") .nes. "")then delete/nolog human.txt.*
$ if(f$search("gorgo.txt") .nes. "")then delete/nolog gorgo.txt.*
$ if(f$search("pantr.txt") .nes. "")then delete/nolog pantr.txt.*
$ if(f$search("ponpy.txt") .nes. "")then delete/nolog ponpy.txt.*
$ exit
$ ENDSUBROUTINE