Fundamentals of Sequence Analysis, 1998-1999

Lecture 7. Phylogenetic analysis.


Overview of phylogenetic analysis

Before we start discussing phylogeny programs, we should think for a minute about what we are asking these tools to do. In general, the problem we are addressing is this: we have a collection of "things" (an intentionally vague term) which we believe to have been all derived from ancestral "things", which were present in some common progenitor. We want to derive from these "things" a tree of some sort which shows how all of these "things" derived from that ancestor, or in some cases, just shows how they relate to each other.

These "things" are now most typically sequences, but might also be traits of various types, such as the existence of particular restriction enzyme locations, or anatomical traits, such as having a particular tooth or skull shape.

There is a bit of nomenclature you should be aware of when dealing with trees. Trees are composed of branches which connect at nodes. Most of the trees that these programs generate are strictly bifurcating, meaning that each internal node connects to exactly three branches, one of which can generally be thought of as "in", and the other two as "out". Tip nodes, or end nodes consist of the end of a branch and are usually marked with the name of the sequence or species that corresponds to it. Trees may be either unrooted or rooted. Unrooted trees indicate the relationship between the end nodes, rooted trees indicate, in addition to that, the relationship between a root end node and all of the other end nodes. The lengths of the branches of the tree may or may not be meaningful, depending on how the tree was generated. Strictly bifurcating trees cannot correctly represent a true multispecies (>2) radiation from an internal node; they have to represent these with extra internal nodes connected by extra branches. These extra branches typically have small positive or negative lengths.

There are an assortment of problems when it comes time to do a phylogenetic analysis. The most important of these are:

  1. The number of trees, just speaking of the manner of arranging the branches without regard to their length, grows astronomically with the number of end nodes. It is often not possible to construct all trees to test against a given hypothesis and so various approximations are often required - guaranteeing at best only a "good" tree, not the "best" tree.
  2. Due to statistical limitations it is often impossible to determine which of several alternate trees is "best".
  3. The amount of informative data present in a collection of "things" is often much less than one would first suppose. For instance, when comparing protein sequences, if an amino acid is 100% conserved at some site it provides no information that could be used to infer the phylogenetic relationship between the members of that data set.
  4. The definition of "best tree" is ambiguous. It might mean the most likely tree, or the tree with the fewest changes (most parsimonious), or the tree that best fits a molecular clock model, and so forth. One or more of the assumptions in each of these models may be incorrect. The trees that result from each of these methods may differ from each other. Nevertheless, in order to compare trees one must assume some model so that one tree may be tested against the other.
  5. Population effects are often difficult to take into account, yet they may account for all of the putative phylogenetic information in the data. For instance, consider a population with 10 alleles for some protein which over time spins off three isolated populations which subsequently diverge into different species, each having resulted from a founder population with a different mix of the same 10 alleles. If one attempted to deduce the order of these branches from one sequence per population the resulting tree would most likely be incorrect. For instance, the single sequence from species 1 and 2 might both share allele A, whereas that from species 3 might be for allele B. If the two alleles diverged from each other at a time 10 times that of the species divergence, and the rate of change was constant, one might conclude that species 3 diverged from 1 and 2 at that ancient time, and 1 and 2 diverged more recently. Recombination between alleles in this example would further tend to obfuscate the evolutionary relationships.

Input data for phylogenetic programs is of several types.

  1. Aligned sequences. Gaps are problematical. Most programs treat them as an extra character in the DNA or Protein alphabet. It is sometimes correct in the PHYLIP package to replace all characters but one in the gap with a question mark (?), which effectively removes them from the calculation. For instance, if this isn't done, distance methods score sequences with larger gaps as being more distant than those with an equivalent number of smaller gaps.
  2. Discrete state traits. Written as "0" (lacking) or "1" (having) for each species. Most programs also accept "B" or "P" to indicate that the organism was polymorphic for that trait. "?" indicates no information.
  3. Distances. Usually derived from aligned sequences. These give a distance between each pair of organisms.
  4. Trees. Some methods will test a user supplied tree, usually in addition to generating some of their own.

Overview of phylogeny programs

There are many, many programs available for performing various sorts of phylogenetic analyses. Mostly these are for PCs and Macintoshes, and most address a single type of phylogenetic analysis. There is an extensive list of these tools at:

Notable among these are PHYLIP (free) and PAUP (commercial) which have both Macintosh and PC versions.

Today we will cover those tools which are available on Seqaxp. These consist of a few programs in the GCG package, Phylo_Win, and the comprehensive PHYLIP package, from Joe Felsenstein at U. Washington. We also have the MOLPHY programs available, but I won't talk about those today. Here is a description of the programs we have, and what each does:


Phylip programs:

Programs that work on molecular sequence data
  DNAPARS   Parsimony method for DNA
  DNAMOVE   Interactive DNA parsimony
  DNAPENNY  Branch and bound for DNA
  DNACOMP   Compatibility for DNA
  DNAINVAR  Phylogenetic invariants
  DNAML     Maximum likelihood method
  DNAMLK    DNA ML with molecular clock
  DNADIST   Distances from sequences
  PROTDIST  Distances from proteins
  PROTPARS  Protein parsimony
  PROTML    Maximum likelihood method for protein (unsupported)
  RESTML    ML for restriction sites
  SEQBOOT   Bootstraps sequence data sets

Programs that work on distance matrix data
  FITCH     Fitch-Margoliash and least-squares methods
  KITSCH    Fitch-Margoliash and least squares methods with evolutionary clock
  NEIGHBOR  Neighbor-joining and UPGMA methods

Programs that work on gene frequencies and continuous characters 
  CONTML    Maximum likelihood method 
  GENDIST   Computes genetic distances
  CONTRAST  Computes contrasts and correlations for comparative method studies

Programs that work on 0-1 discrete state data
  MIX       Wagner, Camin-Sokal, and mixed parsimony criteria
  MOVE      Interactive Wagner, C-S, mixed parsimony program
  PENNY     Finds all most parsimonious trees by branch-and-bound
  DOLLOP, DOLMOVE, DOLPENNY   same as preceding four programs, but for
     the Dollo and polymorphism parsimony criteria
  CLIQUE    Compatibility method
  FACTOR    recode multistate characters

Programs that plot or modify trees 
  DRAWGRAM  Draws cladograms and phenograms on screens, plotters and printers
  DRAWTREE  Draws unrooted phylogenies on screens, plotters and printers
  CONSENSE  Majority-rule and strict consensus trees
  RETREE    Reroots, changes names and branch lengths, and flips trees

GCG programs:

  DISTANCES Generate a distance matrix from a sequence alignment
  DIVERGE   Measures percent divergence between two aligned sequences
              by the method of Perler
  NEWDIVERGE  Like DIVERGE, but uses the method of Li. (Called
                 DIVERGE in release 9.0).
  GROWTREE  Makes a phylogenetic tree using the UPGMA or Neigbour-joining
              method.  Input is a distance matrix from DISTANCES.

EGCG programs:
  TOPHYLIP  Convert GCG .MSF files to PHYLIP format.

Miscellaneous:
  PHYLO_WIN Neighbor-Joining, UPGMA, and others.  X11 GUI.
  USEPHYLIP Interactive shell for use with all of the PHYLIP programs.

Running the Phylip programs

The standard method

At this point we'll start running through how to use these, beginning with the PHYLIP programs. The programs in this package are designed to be controlled through strings stored in logical names (on OpenVMS) or shell environmental variables (on Unix). Here is the standard method you would use to run an arbitrary PHYLIP program:

$ define INFILE   your_in_file
$ define OUTFILE  your_out_file
$ define TREEFILE your_in_or_out_tree_file
$ define FONTFILE PHYEXEDIR:FONT1.;
$ run PHYEXEDIR:PROGNAME

The documentation for each PHYLIP program states explicitly which assumptions the algorithm requires. You can use HELP to read the documentation from version 3.4. If you want to read the current documentation, which isn't much different, it is in one of these files:

$ DIR PHYDOCDIR:*.doc
CLIQUE.DOC     CONSENSE.DOC   CONTCHAR.DOC   CONTML.DOC  
CONTRAST.DOC   DISCRETE.DOC   DISTANCE.DOC   DNACOMP.DOC  
DNADIST.DOC    DNAINVAR.DOC   DNAML.DOC      DNAMLK.DOC  
DNAMOVE.DOC    DNAPARS.DOC    DNAPENNY.DOC   DOLLOP.DOC  
DOLMOVE.DOC    DOLPENNY.DOC   DRAW.DOC       DRAWGRAM.DOC  
DRAWTREE.DOC   FACTOR.DOC     FITCH.DOC      GENDIST.DOC  
KITSCH.DOC     MAIN.DOC       MAKEINF.DOC    MIX.DOC  
MOVE.DOC       NEIGHBOR.DOC   PENNY.DOC      PROTDIST.DOC  
PROTML.DOC     PROTPARS.DOC   RESTML.DOC     RETREE.DOC  
SEQBOOT.DOC    SEQUENCE.DOC  

Using EGCG versions

We have EGCG versions of many of the Phylip programs. These have a standard GCG command line interface. These versions are unsupported - if they work for you, great, if not, use the regular Phylip variant. To see if there is a version of the program which interests you, use commands like these:

$ SHOW SYMBOL *DNADIST*
  EDNADIST == "$ EGENUTIL:EDNADIST"
$ SHOW SYMBOL *DRAWTREE*
%DCL-W-UNDSYM, undefined symbol - check validity and spelling

Using option menus

Since the standard method of running these programs is a bit alien for users who are accustomed to working through GUIs, we'll use the USEPHYLIP menu driven interface intstead. If you start USEPHYLIP, then immediately exit from it, the logical names PHYDOCDIR and PHYEXEDIR will be defined for you. At that point you may use the standard method of running these programs if you so desire - and you may want to if you have to run one or more of them many times in a batch job. USEPHYLIP is menu driven (as are the PHYLIP programs themselves). It has two sections:

  1. Status: shows the names of the selected program and input,output tree, and font files.
  2. Options: the menu options which are available

USEPHYLIP has only two menus, the main one has these options:


0          Exit
1          Help  (currently only ver3.4 help is available)
2          Run program
3          Change Options
4          List current directory
5          Spawn a subprocess  (use this to rename, delete, etc.)
6          Convert Infile, in GCG .MSF format, to Phylip format
i          Type the input file
o          Type the output file
t          Type the tree file

and the Change Options menu has these values:


0        Return to main menu
1        change program
2        change infile
3        change root  (->root.plot,root.out,root.tree)
4        change fontfile
5        change text output method

Input data formats

PHYLIP programs require aligned sequences to do their work. Consequently, the first thing you need to know to use PHYLIP is how to convert MSF files to PHYLIP format. You can do that from the command line with the EGCG program TOPHYLIP. Or, you can run that program indirectly from within USEPHYLIP as in this example:

$ usephylip
Status:
  Program    none
  Infile     none
  Outfile    none.out
  Plotfile   none.plot
  Treefile   none.tree
  Fontfile   PHYEXEDIR:FONT1..
  root       none
  Text Out   TYPE 
3
2
azurin.msf
Status:
  Program    none
  Infile     AZURIN.MSF
  Outfile    none.out
  Plotfile   none.plot
  Treefile   none.tree
  Fontfile   PHYEXEDIR:FONT1..
  root       none
  Text Out   TYPE 
0
6
Status:
  Program    none
  Infile     AZURIN.PHYLIP
  Outfile    none.out
  Plotfile   none.plot
  Treefile   none.tree
  Fontfile   PHYEXEDIR:FONT1..
  root       none
  Text Out   TYPE

The Infile now shows AZURIN.PHYLIP where AZURIN.MSF used to be. If you select option i you will see that the format has indeed changed to the PHYLIP format.


Parsimony

The PARSIMONY programs make the assumption that the desired tree is the one with the fewest changes in it to get from one sequence to another. The DNAPARS program also makes some assumptions about the rate of evolution - that is, that it is pretty much the same at all sites and all lineages.

We'll use the example data provided with the PHYLIP program, and run it from within the USEPHYLIP shell.

$ Usephylip
3
1
DNAPARS
2
CLASS:EXAMPLE_1.INPUT
3
DNAPARS
0
Status:
  Program    DNAPARS
  Infile     EXAMPLE_1.INPUT
  Outfile    DNAPARS.out
  Plotfile   DNAPARS.plot
  Treefile   DNAPARS.tree
  Fontfile   PHYEXEDIR:FONT1..
  root       DNAPARS
  Text Out   TYPE

i
   5   13
Alpha     AACGUGGCCAAAU
Beta      AAGGUCGCCAAAC
Gamma     CAUUUCGUCACAA
Delta     GGUAUUUCGGCCU
Epsilon   GGGAUCUCGGCCC
2
5           DNAPARS prompt: print sequence
Y           DNAPARS prompt: all settings are correct
o
DNA parsimony algorithm, version 3.572c

One most parsimonious tree found:

           +--Epsilon
        +--4
     +--3  +--Delta
     !  !
  +--2  +-----Gamma
  !  !
--1  +--------Beta
  !
  +-----------Alpha

  remember: this is an unrooted tree!


requires a total of     19.000

From    To     Any Steps?    State at upper node
                             ( . means same as in the node below it on tree)

          1                AABGTSGCCA AAY
   1      2        maybe   .....C.... ...
   2      3         yes    V.KD...... C..
   3      4         yes    GG.A..T.GG .C.
   4   Epsilon     maybe   ..G....... ..C
   4   Delta        yes    ..T..T.... ..T
   3   Gamma        yes    C.TT...T.. ..A
   2   Beta        maybe   ..G....... ..C
   1   Alpha       maybe   ..C..G.... ..T


t

((((Epsilon,Delta),Gamma),Beta),Alpha);

There is a variant of this program called DNAMOVE which allows you to manually adjust pieces of the tree, and then rerun the analysis on just some parts of it.

$ UsePhylip
3
1
dnamove
2
CLASS:EXAMPLE_1.INPUT
3
dnamove
0
2
Y          DNAMOVE prompt: settings are correct


  +---------------5:Epsilon                       (unrooted)
--9
  | +------------ 4:Delta
  +-8
    |  +--------- 3:Gamma        Steps:  19.00000
    +- 7
       |  +------ 2:Beta
       +- 6
          +------ 1:Alpha           11 sites compatible

NEXT? (Options: R # + - S . T U W O F C H ? X Q) (H or ? for Help)

Use the T command to see whether a better score could result if a particular node was moved.

T
Try other positions for which node? 7
WAIT ...
       BETTER:  NONE
       TIED:     5 9: 19.00

Shows that the node 7 could swap with nodes 5 or 9 without changing the score. To actually make that rearrangement, use


R
Remove everything to the right of which node? 7
Add before which node? 5

        +-------- 3:Gamma                         (unrooted)
     +- 7
     |  | +------ 2:Beta
  +- 8  +-6
  |  |    +------ 1:Alpha        Steps:  19.00000
--9  |
  |  +----------- 5:Epsilon        (as good as best)
  | 
  +-------------- 4:Delta           11 sites compatible

For simple cosmetic changes (flipping branches) use F


F
9
F
8

  +---------------4:Delta                         (unrooted)
--9
  | +------------ 5:Epsilon
  +-8
    |  +--------- 3:Gamma        Steps:  19.00000
    +- 7
       |  +------ 2:Beta
       +- 6
          +------ 1:Alpha           11 sites compatible

This makes it a bit clearer that the position of the Epsilon and Delta nodes in this tree may be reversed and the resulting tree will have the same score. To see what is the expected state of each position use S (+,- are like S, but move relative to the current site)

S
2
  gggggggggggg4:Delta       SITE   2         (unrooted)
gg8
  g  ggggggggg5:Epsilon      a:A, c:C, g:G, t:T, ?:?
  ggg9
     a  aaaaaa3:Gamma        Steps:  19.00000
     aaa7
        a  aaa2:Beta
        aaa6
           aaa1:Alpha           11 sites compatible
X
N                    Write the tree to a file?

DNAPENNY will find all most parsimonious trees, as opposed to DNAPARS, which will only find one. It uses a search method called "branch and bound" which is said to be very efficient. The method is described in the program's documentation so I won't go over it here. The main problem is that the space it is searching is quite large, and expands factorially with the number of species, so that it is only practical for small data sets. Above 12 species it is very slow.

$ UsePhylip
3
1
DNAPENNY
2
CLASS:EXAMPLE_2.INPUT
3
DNAPENNY
0
i
    8    6
Alpha1    AAGAAG
Alpha2    AAGAAG
Beta1     AAGGGG
Beta2     AAGGGG
Gamma1    AGGAAG
Gamma2    AGGAAG
Delta     GGAGGA
Epsilon   GGAAAG
2
4    print out steps in each site
5    print sequences at all nodes
Y    all settings are correct
o    to save space, only the first 3 trees are shown

Penny algorithm for DNA, version 3.572c
 branch-and-bound to find all most parsimonious trees



requires a total of              8.000

     9 trees in all found




  +--------------------Alpha1    
  !  
  !                 +--Delta     
  !              +--3  
  !           +--7  +--Epsilon   
--1           !  !  
  !     +-----6  +-----Gamma2    
  !     !     !  
  !  +--4     +--------Gamma1    
  !  !  !  
  !  !  !           +--Beta2     
  +--2  +-----------5  
     !              +--Beta1     
     !  
     +-----------------Alpha2    

  remember: this is an unrooted tree!


 steps in each site:
         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0!       1   1   1   2   2   1            

From    To     Any Steps?    State at upper node
                             ( . means same as in the node below it on tree)


          1                AAGAAG
   1   Alpha1       no     ......
   1      2         no     ......
   2      4         no     ......
   4      6         yes    .G....
   6      7         no     ......
   7      3         yes    G.A...
   3   Delta        yes    ...GGA
   3   Epsilon      no     ......
   7   Gamma2       no     ......
   6   Gamma1       no     ......
   4      5         yes    ...GG.
   5   Beta2        no     ......
   5   Beta1        no     ......
   2   Alpha2       no     ......





  +--------------------Alpha1    
  !  
  !                 +--Delta     
  !           +-----3  
  !           !     +--Epsilon   
--1     +-----6  
  !     !     !     +--Gamma2    
  !     !     +-----7  
  !  +--4           +--Gamma1    
  !  !  !  
  !  !  !           +--Beta2     
  +--2  +-----------5  
     !              +--Beta1     
     !  
     +-----------------Alpha2    

  remember: this is an unrooted tree!


 steps in each site:
         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0!       1   1   1   2   2   1            

From    To     Any Steps?    State at upper node
                             ( . means same as in the node below it on tree)


          1                AAGAAG
   1   Alpha1       no     ......
   1      2         no     ......
   2      4         no     ......
   4      6         yes    .G....
   6      3         yes    G.A...
   3   Delta        yes    ...GGA
   3   Epsilon      no     ......
   6      7         no     ......
   7   Gamma2       no     ......
   7   Gamma1       no     ......
   4      5         yes    ...GG.
   5   Beta2        no     ......
   5   Beta1        no     ......
   2   Alpha2       no     ......





  +--------------------Alpha1    
  !  
  !                 +--Delta     
  !              +--3  
  !           +--6  +--Epsilon   
--1           !  !  
  !     +-----7  +-----Gamma1    
  !     !     !  
  !  +--4     +--------Gamma2    
  !  !  !  
  !  !  !           +--Beta2     
  +--2  +-----------5  
     !              +--Beta1     
     !  
     +-----------------Alpha2    

  remember: this is an unrooted tree!


 steps in each site:
         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0!       1   1   1   2   2   1            

From    To     Any Steps?    State at upper node
                             ( . means same as in the node below it on tree)


          1                AAGAAG
   1   Alpha1       no     ......
   1      2         no     ......
   2      4         no     ......
   4      7         yes    .G....
   7      6         no     ......
   6      3         yes    G.A...
   3   Delta        yes    ...GGA
   3   Epsilon      no     ......
   6   Gamma1       no     ......
   7   Gamma2       no     ......
   4      5         yes    ...GG.
   5   Beta2        no     ......
   5   Beta1        no     ......
   2   Alpha2       no     ......

If you are doing anything large, you will want to change the F option to something much larger, so that the program won't spend all of its time reporting its progress. The test case completes instantaneously on an Alphaserver 2100 4/200.


Compatibility

The compatibility method is this: if a position has 3 of the 4 possible bases, then evaluate different trees subject to the criterion that the minimum number of changes required in the tree is two (one less than three).

$ UsePhylip
3
1
DNACOMP
2
CLASS:EXAMPLE_1.INPUT
3
DNACOMP
0
2
4       Print data for each site
5       Print sequence at each node
Y       Settings are correct
o
DNA compatibility algorithm, version 3.572c

One most parsimonious tree found:

           +--Epsilon
        +--4
     +--3  +--Delta
     !  !
  +--2  +-----Gamma
  !  !
--1  +--------Beta
  !
  +-----------Alpha

  remember: this is an unrooted tree!

total number of compatible sites is       11.0

 steps in each site:
         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0!       2   1   3   2   0   2   1   1   1
   10!   1   1   1   3

 compatibility (Y or N) of each site with this tree:

      0123456789
     *----------
   0 ! YYNYYYYYY
  10 !YYYN

From    To     Any Steps?    State at upper node
                             ( . means same as in the node below it on tree)

          1                AABGTSGCCA AAY
   1      2        maybe   .....C.... ...
   2      3         yes    V.KD...... C..
   3      4         yes    GG.A..T.GG .C.
   4   Epsilon     maybe   ..G....... ..C
   4   Delta        yes    ..T..T.... ..T
   3   Gamma        yes    C.TT...T.. ..A
   2   Beta        maybe   ..G....... ..C
   1   Alpha       maybe   ..C..G.... ..T

In this case, 11 of the 13 bases were "compatible" with the best tree, that is, the tree could be constructed for the minimum number of substitutions. Sites that weren't compatible required more than this number of substitutions on the tree.


Maximum likelihood

DNAML implements a maximum likelihood method. It tries to find the most likely tree given a model of how nucleotides mutate. It makes certain assumptions about the frequencies of transitions (C<->T,A<->G) and transversions (C or T<->A or G). It also allows you to specify that there are up to 9 classes of sites which mutate at different rates (but you don't tell it which site is which).

$ UsePhylip
3
1
DNAML
2
CLASS:EXAMPLE_3.INPUT
3
DNAML
0
i
   5   13
Alpha     AACGTGGCCAAAT
Beta      AAGGTCGCCAAAC
Gamma     CATTTCGTCACAA
Delta     GGTATTTCGGCCT
Epsilon   GGGATCTCGGCCC
2
Y       Settings are correct
o
Nucleic acid sequence Maximum Likelihood method, version 3.572c

Empirical Base Frequencies:

   A       0.24615
   C       0.29231
   G       0.24615
  T(U)     0.21538

Transition/transversion ratio =   2.000000

(Transition/transversion parameter =   1.523077)
     +Beta
  +--1
  !  !                                            +Epsilon
  !  +--------------------------------------------3
  !                                               +--------Delta
  !
--2------------------------------Gamma
  !
  +-----Alpha


remember: this is an unrooted tree!

Ln Likelihood =   -72.25079

Examined   17 trees

 Between        And            Length      Approx. Confidence Limits
 -------        ---            ------      ------- ---------- ------
   2             1              0.09407     (     zero,     0.40908)
   1          Beta              0.00003     (     zero,     0.32890)
   1             3              1.50823     (     zero,     3.29744) **
   3          Epsilon           0.00006     (     zero,     0.34286)
   3          Delta             0.28112     (     zero,     0.62627) **
   2          Gamma             1.01708     (     zero,     2.33211) **
   2          Alpha             0.20748     (     zero,     0.56581)

     *  = significantly positive, P < 0.05
     ** = significantly positive, P < 0.01

Maximum likelihood with a molecular clock

DNAMLK implements a maximum likelihood method assuming a molecular clock. That is, the distance from the root of the tree to each of the end nodes is the same. Like DNAML, it makes certain assumptions about the frequencies of transitions and transversions.

$ UsePhylip
3
1
DNAMLK
2
CLASS:EXAMPLE_3.INPUT
3
DNAMLK
0
2
C
2
1.0 3.2
0.4 0.6
R
1.5
Y       Settings are correct
o

   5 Species,   13 Sites

Nucleic acid sequence
   Maximum Likelihood method with molecular clock, version 3.572c

Site category   Rate of change

          1        1.000
          2        3.200

Empirical Base Frequencies:

   A       0.24615
   C       0.29231
   G       0.24615
  T(U)     0.21538

Transition/transversion ratio =   2.000000

(Transition/transversion parameter =   1.523077)


                                                           +--Delta     
  +--------------------------------------------------------4  
  !                                                        +--Epsilon   
--3  
  !                                             +-------------Gamma     
  +---------------------------------------------2  
                                                !          +--Alpha     
                                                +----------1  
                                                           +--Beta      

Ln Likelihood =   -72.40499

 Ancestor      Node      Node Time     Length
 --------      ----      ---- ----     ------
 root            3      
   3             4         3.40036     3.40036
   4          Delta        3.55871     0.15834
   4          Epsilon      3.55871     0.15834
   3             2         2.75249     2.75249
   2          Gamma        3.55871     0.80621
   2             1         3.38652     0.63403
   1          Alpha        3.55871     0.17218
   1          Beta         3.55871     0.17218

Combination of categories that contributes the most to the likelihood:
             2222221111 112

Protein parsimony

Protein Parsimony is pretty much the same thing as DNA parsimony, except that it works on protein sequence. The program is called PROTPARS.


Protein maximum likelihood

Protein Maximum likelihood is unsupported, however it does run. Instead of modelling nucleotide mutation it uses the amino acid replacement frequencies in the Dayhoff matrix.

$ UsePhylip
3
1
PROTML
2
CLASS:EXAMPLE_4.INPUT
3
PROTML
0
0 
$ type CLASS:EXAMPLE_4.INPUT
     5   40   S
 species1 ARNDCQEGHILKAFPMTWYVARNDCQEGHISKMFGWTWYV
 species2 ARNHNQCGHILKMFPMTSYVARNCCAEHHILKHFPSTWIV
 species3 AINDCQEGHHLKMFPMTMYSVRNRIQEMHIQKHCPHTHYV
 species4 AINHCQCEHILWMFPSTPYVARNDIQNYHILKMPPSTWWV
 species5 AINDCSCGHHLWMFPSLCYVRRNECQGGHIWKMFPLTVCA

 ((species1,species2,species3),species4,species5)
$ define/user sys$output protml.out
$ run phyexedir:protml
$ type protml.out
 ---------------------------------------------------------
              PROTML Ver.1.00b  in MOLPHY
    Maximum Likelihood Inference of Protein Phylogeny
                based on Dayhoff model
 ---------------------------------------------------------

 Sequences Data    5 Species,   40 Sites

 Species     Amino Acid Sequences
 -------     --------------------

 Consensus   AINDCQCGHI LKMFPMTWYV ARNDCQEGHI LKMFPSTWYV 
 species1    .R....E... ..A....... .......... S...GW.... 
 species2    .R.HN..... .......S.. ...C.A.H.. ..H.....I. 
 species3    ......E..H .......M.S V..RI..M.. Q.HC.H.H.. 
 species4    ...H...E.. .W...S.P.. ....I.NY.. ...P....W. 
 species5    .....S...H .W...SLC.. R..E..G... W....L.VCA 


                   :-----------------------------1.species1  
  :----------------6
  :                :----------------------------------2.species2  
  :                :
  :                :+++++++++++++++++++++++++++--3.species3  
 0:
  :-------------------------4.species4  
  :
  :+++++++++++++++++++++++++++--5.species5  

 No.  0 -  1    number   Length       S.E.      non convergence
 ----------------------------------------------
     species1      1    27.42084  ( 10.35696 )
     species2      2    32.65231  ( 10.71586 )
     species3      3    66.83496  ( 17.16932 )
     species4      4    23.81696  (  9.64345 )
     species5      5    54.67129  ( 15.74179 )
                   6    14.74376  (  7.55932 )
 ----------------------------------------------
  ln L :  -450.137 (  35.632 )  AIC :  912.274
 ----------------------------------------------

                       :----------------------------1.species1  
                 :-----7
                 :     :---------------------------------2.species2  
  :--------------6
  :              :+++++++++++++++++++++++++++--3.species3  
 0:
  :-------------------------4.species4  
  :
  :+++++++++++++++++++++++++++--5.species5  

 ( ((species1,species2),species3), species4, species5 )

 No.  1 -  1    number   Length       S.E.      non convergence
 ----------------------------------------------
     species1      1    26.21945  (  9.93006 )
     species2      2    31.67416  ( 10.37910 )
     species3      3    64.36898  ( 16.54411 )
     species4      4    23.95289  (  9.69605 )
     species5      5    54.66043  ( 15.74315 )
                   6    12.39223  (  6.88107 )
                   7     3.15279  (  3.70619 )
 ----------------------------------------------
  ln L :  -449.197 (  35.804 )  AIC :  912.394
 ----------------------------------------------

                                   :-------------------1.species1  
                  :----------------7
                  :                :+++++++++++++++++++++++++++--3.species3  
  :---------------6
  :               :--------------------------------2.species2  
 0:
  :-----------------------4.species4  
  :
  :+++++++++++++++++++++++++++--5.species5  

 ( ((species1,species3),species2), species4, species5 )

 No.  1 -  2    number   Length       S.E.      non convergence
 ----------------------------------------------
     species1      1    17.00191  (  8.75742 )
     species2      2    30.29431  ( 10.24536 )
     species3      3    56.89282  ( 15.54253 )
     species4      4    21.27627  (  9.11808 )
     species5      5    58.60585  ( 16.48033 )
                   6    13.80483  (  7.44834 )
                   7    14.48062  (  7.71363 )
 ----------------------------------------------
  ln L :  -447.247 (  35.645 )  AIC :  908.494
 ----------------------------------------------

                  :----------------------------1.species1  
  :---------------6
  :               :            :-------------------------2.species2  
  :               :------------7
  :                            :+++++++++++++++++++++++++++--3.species3  
 0:
  :--------------------------4.species4  
  :
  :+++++++++++++++++++++++++++--5.species5  

 ( (species1,(species2,species3)), species4, species5 )

 No.  1 -  3    number   Length       S.E.      non convergence
 ----------------------------------------------
     species1      1    26.98879  ( 10.21729 )
     species2      2    23.84699  (  9.53037 )
     species3      3    57.08666  ( 15.78534 )
     species4      4    24.16718  (  9.71627 )
     species5      5    54.29548  ( 15.68420 )
                   6    13.76134  (  7.34501 )
                   7    10.78488  (  6.22688 )
 ----------------------------------------------
  ln L :  -448.573 (  35.661 )  AIC :  911.146
 ----------------------------------------------

       NO.  1 stage    3 trees
 -----------------------------------------------------------------------
  Tree  ln L   Diff ln L       #Para   AIC    Diff AIC           Boot P
 -----------------------------------------------------------------------
   1  -449.20    -1.95 +-  5.05   7   912.39     3.90 +- 10.10
   2  -447.25 <-- best            7   908.49 <-- best         
   3  -448.57    -1.33 +-  5.95   7   911.15     2.65 +- 11.90
 -----------------------------------------------------------------------

                                   :-------------------1.species1  
                  :----------------7
                  :                :+++++++++++++++++++++++++++--3.species3  
  :---------------6
  :               :--------------------------------2.species2  
 0:
  :-----------------------4.species4  
  :
  :+++++++++++++++++++++++++++--5.species5  

 No.  1 -  0    number   Length       S.E.      non convergence
 ----------------------------------------------
     species1      1    17.00218  (  8.75739 )
     species2      2    30.29451  ( 10.24535 )
     species3      3    56.89304  ( 15.54246 )
     species4      4    21.27633  (  9.11799 )
     species5      5    58.60578  ( 16.48025 )
                   6    13.80468  (  7.44826 )
                   7    14.47989  (  7.71304 )
 ----------------------------------------------
  ln L :  -447.247 (  35.645 )  AIC :  908.494
 ----------------------------------------------

This program does not ask for parameters, and always sends its output to the screen. That is why in the example above it was configured with USEPHYLIP, but the actual run was performed outside of that menu environment.


Phylogeny from restriction site data

If you have only data on the existence of particular restriction sites, that is still enough to allow a phylogenetic analysis with the program RESTML. The input file contains "+" if a site is present, "-" if not, and "?" if it isn't known.

$ UsePhylip
3
1
RESTML
2
CLASS:EXAMPLE_5.INPUT
3
RESTML
0
i
   5   13   2
Alpha     ++-+-++--+++-
Beta      ++++--+--+++-
Gamma     -+--+-++-+-++
Delta     ++-+----++---
Epsilon   ++++----++---
2
Y       Settings are correct
o
Restriction site Maximum Likelihood method, version 3.572c

  Recognition sequences all 6 bases long

Sites absent from all species are assumed to have been omitted

      +Epsilon   
  +---3  
  !   +Delta     
  !  
  !  +-----Gamma     
--2--1  
  !  +Beta      
  !  
  +Alpha     

remember: this is an unrooted tree!

Ln Likelihood =   -40.35858

Examined   15 trees
 
 Between        And            Length      Approx. Confidence Limits
 -------        ---            ------      ------- ---------- ------
   2             3            0.05855     (  0.05681,     0.06030) **
   3          Epsilon         0.00005     (  0.00000,     0.00010)
   3          Delta           0.01458     (  0.01374,     0.01543) **
   2             1            0.00002     (     zero,     0.00006)
   1          Gamma           0.11430     (  0.11177,     0.11683) **
   1          Beta            0.00008     (  0.00002,     0.00014)
   2          Alpha           0.02468     (  0.02357,     0.02578) **

     *  = significantly positive, P < 0.05
     ** = significantly positive, P < 0.01


Methods based on distances

There are two programs in the PHYLIP package for calculating the required distance matrix from aligned sequence, DNADIST and PROTDIST. Here is an example with DNADIST:

$ UsePhylip
3
1
DNADIST
2
CLASS:EXAMPLE_3.INPUT
3
DNADIST
0
2
Y       Settings are correct
o

There are several ways of defining "distance" between sequences, and choosing different ones may change the trees that you get later. For proteins, the methods available are uncorrected, Jukes-Cantor, or Kimura [default]. For nucleotide sequences, the available methods are uncorrected, Tajima-Nei, Kimura 2-parameter, Jin-Nei Gamma. The uncorrected distances underestimate the divergence between two sequences once they begin to accumulate multiple substitution events at single sites. Therefore you should only ever use the uncorrected distances when there are very few differences between the sequences, and so multiple events are unlikely to have occurred. The other distance methods listed above attempt to correct for the multiple substitution problem by taking into account the properties of various models of mutation. More complete descriptions of these methods are presented in the GCG help section on the program GrowTree. The unit of "distance" that comes out of these can be thought of as "the amount of time required for X substitutions per site to have occurred." For instance, if X = 0.01 then one would expect one substitution per 100 sites. X can be greater than 1.0! This means, that on average, each residue was substituted more than once (some of these can be back mutations.)

There are two terms that you need to be familiar with when working with distance programs. The first is "additive distances", which means that the distance measured between two sequences is the same as that calculated along the path in the tree that separates their respective nodes. "Ultrametric distances" means that, in addition to the previous constraint, the distance from the root to each end node is the same. You should realize that it is unlikely that real data will be consistent with "additive distances", and even less likely that it will be consistent with "ultrametric distances." One way that this tends to show up is as branches that have negative lengths. If these are only a little negative you might be willing to write it off to small statistical fluctuations, or the presence of a multifurcating internal node, and accept the tree. However, if these negative distances are large, you might take the hint and realize that your data does not fit the model which you are trying to impose on it.

Probably the simplest of the distance methods to understand is that implemented in the FITCH program. It tests the fit of trees by calculating a least squares error value, which sums the squares of the differences between the distances measured between each pair of sequences, and the distances computed on the tree.

$ UsePhylip
3
1
FITCH
2
CLASS:EXAMPLE_6.INPUT
3
FITCH
0
i
    5
Alpha      0.000 1.000 2.000 3.000 3.000
Beta       1.000 0.000 2.000 3.000 3.000
Gamma      2.000 2.000 0.000 3.000 3.000
Delta      3.000 3.000 3.000 0.000 1.000
Epsilon    3.000 3.000 3.000 1.000 0.000
2
Y       Settings are correct
o

The KITSCH program is similar to FITCH, above, except that it requires that all sequences tested are for a single time point, and that the data is ultrametric.

$ UsePhylip
3
1
KITSCH
2
CLASS:EXAMPLE_6.INPUT
3
KITSCH
0
2
Y       Settings are correct
o

The NEIGHBOR program implements both the neighbor-joining and unweighted pair group method with arithmetic mean, or UPGMA. The GCG program GROWTREE also implmenents these two methods. Effectively, both of these methods are distance based clustering algorithms, with somewhat different weightings and end results. Neigbor-joining clusters to a form a tree connecting all end nodes, whereas UPGMA clusters to construct a tree leading from the end nodes to a single root node. Both work by reducing a distance matrix sequentially, taking two carefully chosen entries at a time and merging them, then recalculating the distances from the merged node to all of the remaining sequences. It is important to note that the neighbor-joining method requires additive distances, and UPGMA ultrametric distances. As noted above, these constraints are unlikely to be satisfied with real data. On the other hand, these methods are reasonably fast even with huge numbers of sequences as a result of the assumptions that they make. They deduce the tree algorithmically from the starting distances, rather than having to test all possible trees.

$ UsePhylip
3
1
NEIGHBOR
2
CLASS:EXAMPLE_6.INPUT
3
NEIGHBOR
0
2
Y       Settings are correct
o

To change from Neighbor joining to UPGMA do

$ UsePhylip
2
N
Y       Settings are correct
o

Remember, the UPGMA tree is rooted, the Neighbor-Joining tree is not.


A comparison of five methods

Let's have a look now at the sort of results one gets if a single data set is run through a variety of phylogeny programs. The following set of azurins were aligned with PILEUP:


Azurin protein sequences from several bacterial species.


 H81_Neigo      neisseria gonorrhoeae. h.8 outer membrane protein precursor. 
                  2/94 183bp
 H8_Neime       neisseria meningitidis. h.8 outer membrane protein precursor.
                 10/93 183bp
 Azur_Alcde     alcaligenes denitrificans. azurin precursor. 2/95 149bp
 Azur_Alcfa     alcaligenes faecalis. azurin. 10/89 128bp
 Azur_Alcsp     alcaligenes sp. azurin. 2/95 129bp
 Azur_Borbr     bordetella bronchiseptica. azurin. 2/95 129bp
 Azur_Pseae     pseudomonas aeruginosa. azurin precursor. 11/95 148bp
 Azur_Psede     pseudomonas denitrificans. azurin. 11/95 128bp
 Azur_Psefb     pseudomonas fluorescens biotype b. azurin. 2/95 128bp
 Azur_Psefc     pseudomonas fluorescens biotype c. azurin. 2/95 128bp
 Azur_Psefd     pseudomonas fluorescens biotype d. azurin. 2/95 128bp
 Azur_Psepu     pseudomonas putida. azurin. 2/95 128bp
 Azu1_Metj      methylomonas j. azurin iso-1. 2/95 128bp
 Azu2_Metj      methylomonas j. azurin iso-2. 2/95 129bp

After alignment, the first 57 residues were removed, as some of these sequences are precursor forms, some aren't. If this (artificially) variable region had not been removed the longer precursor sequences would have been "closer" to each other than is otherwise warranted, based solely on the inclusion of the leader sequence. The remaining aligned sequence was converted to PHYLIP format with the EGCG TOPHYLIP program. This alignment was then fed into PROTDIST, which used the Kimura method to generate a distance matrix. Then FITCH, KITSCH, and NEIGHBOR were run with all variables at their default setting. NEIGHBOR was also used to generate an UPGMA alignment, and PROTPARS was run on the trimmed alignment. Starting from the trimmed MSF file, here are the commands which were used:

$ USEPHYLIP
0
$ TOPHYLIP/infile=AZURIN.MSF{*}/out=AZURIN.PHYLIP
$ create plotit.com
$!
$! run drawtree or drawgram to make postscript plot
$!
r phyexedir:'P1'
L
N
Y
$ fname = f$trnlnm("PLOTFILE")
$ printg/noflag/delete 'fname'
^Z
$ create setlogs.com
$ define outfile   'P1'.out
$ define treefile  'P1'.tree
$ define plotfile  'P1'.plot
$ exit
^Z
$ define infile   azurin.phylip
$ @setlogs protpars
$ r phyexedir:protpars
Y          run PROTPARS, all defaults
$ @plotit drawtree
$ define outfile  azurin.distances
$ r phyexedir:protdist
P          Kimura distances
Y
$ define infile   azurin.distances
$ @setlogs fitch
$ r phyexedir:fitch
Y
$ @plotit drawtree
$ @setlogs kitsch
$ r phyexedir:kitsch
Y
$ @plotit drawtree
$ @setlogs NJ
$ r phyexedir:neighbor
Y
$ @plotit drawtree
$ @setlogs UPGMA
$ r phyexedir:neighbor
N      use UPGMA instead of NJ
Y
$ @plotit drawtree

Protpars

Fitch

Kitsch

Neighbor-Joining

UPGMA

Here are the contents of the tree files:

Protpars Fitch Kitsch Neighbor-Joining UPGMA

Notice that while all methods agree on some of the terminal branchings, the order of the most ancient branchings differs from one method to another. For instance, look at the Azu1/Azu2 Metj versus neisseria positions. Also, the two neisseria isolates are identical in the region analyzed - the distance programs correctly draw them right on top of each other, but PROTPARS doesn't. All trees were drawn in the same style to make comparisons easier. The Kitsch and UPGMA trees have a root near the node which connects Azu2 Metj to the other entries. A black rectangle has been placed in the illustrations to show its approximate position - the DRAWTREE program does not mark the location of this node.


Discrete state methods

There are several programs in the PHYLIP package for working with discrete state data. The only program that we've covered so far that uses this sort of data is RESTML, and it makes certain assumptions that the others don't make, since the only trait it allows is the existence, or not, of restriction sites. Since I think it unlikely that many of you will ever use these other programs we won't go through them in class.


Manipulating trees

The next topic that I want to cover is how to get the trees that these programs produce into a graphics format.

The first program, is RETREE. It allows you to flip the tree at any node. This changes the look of the tree, but not what it says.

$ UsePhylip
3
1
RETREE
2
NEIGHBOR.TREE
3
RETREE
0
2
Y       Settings are correct

Another program deals with a slightly different problem - let's say that you have run 5 or 6 different methods - how do you figure out which parts of the various trees are in agreement? For small trees you can do this by inspection, but for bigger trees that can be really tedious. The CONSENSE program will do this for you. First you need to put all of the trees into one file, which has the format shown here:

$ UsePhylip
i
3
1
CONSENSE
2
CLASS:EXAMPLE_7.input
3
CONSENSE
0
i
(A,(B,(H,(D,(J,(((G,E),(F,I)),C))))));
(A,(B,(D,((J,H),(((G,E),(F,I)),C)))));
(A,(B,(D,(H,(J,(((G,E),(F,I)),C))))));
(A,(B,(E,(G,((F,I),((J,(H,D)),C))))));
(A,(B,(E,(G,((F,I),(((J,H),D),C))))));
(A,(B,(E,((F,I),(G,((J,(H,D)),C))))));
(A,(B,(E,((F,I),(G,(((J,H),D),C))))));
(A,(B,(E,((G,(F,I)),((J,(H,D)),C)))));
(A,(B,(E,((G,(F,I)),(((J,H),D),C)))));
2
Y       Settings are correct
o

Majority-rule and strict consensus tree program, version 3.572c

Species in order: 

  A
  B
  H
  D
  J
  G
  E
  F
  I
  C


Sets included in the consensus tree

Set (species in order)     How many times out of   9.00

.......**.                   9.00
..********                   9.00
..***....*                   6.00
..****.***                   6.00
..***.....                   6.00
..*.*.....                   4.00
..***..***                   2.00


Sets NOT included in consensus tree:

Set (species in order)     How many times out of   9.00

.....**...                   3.00
.....****.                   3.00
..**......                   3.00
.....*****                   3.00
..*.******                   2.00
.....*.**.                   2.00
..****...*                   2.00
....******                   2.00
...*******                   1.00


CONSENSUS TREE:
the numbers at the forks indicate the number
of times the group consisting of the species
which are to the right of that fork occurred
among the trees, out of   9.00 trees

  +---------------------------------------A                   
  !
  !         +-----------------------------E                   
  !         !
  !         !                        +----I                   
  !         !         +------------9.0
  !         !         !              +----F                   
  !    +--9.0         !
  !    !    !    +--2.0         +---------D                   
  !    !    !    !    !    +--6.0
  !    !    !    !    !    !    !    +----J                   
  !    !    !    !    +--6.0    +--4.0
  +--9.0    +--6.0         !         +----H                   
       !         !         !
       !         !         +--------------C                   
       !         !
       !         +------------------------G                   
       !
       +----------------------------------B                   


  remember: this is an unrooted tree!


t
(A:9.0,((E:9.0,(((I:9.0,F:9.0):9.0,((D:9.0,(J:9.0,H:9.0):4.0)
:6.0,C:9.0):6.0):2.0,G:9.0):6.0):9.0,B:9.0):9.0);


Drawing trees

To draw the trees which have been generated use DRAWGRAM for trees that are rooted, and DRAWTREE for trees that are unrooted. Let's first draw the unrooted tree that CONSENSE made, we will generate final graphics in Postscript format, and preview in Tektronix mode. USEPHYLIP requires that INFILE point to a valid file, but the draw programs will ignore it, they only convert the TREEFILE to a PLOTFILE. You can use the preview mode to make changes iteratively to the tree that will be reflected in the final plot.

$ UsePhylip
3
1
DRAWTREE
2
CLASS:EXAMPLE_1.INPUT
3
CONSENSE
0
2
L       Set final output graphics format (postscript)
K       Set preview graphics format (tektronix)
Y       Settings are correct
0

Y Make plot (N to change a variable and re-preview)

If you selected postscript output, and generally you do, then print it with:

$ printg/noflag consense.plot

DRAWGRAM works similarly, but on rooted trees.

$ UsePhylip
3
1
DRAWGRAM
2
CLASS:EXAMPLE_1.INPUT
3
DNAMLK
0
2
L
K
Y

Y 0

That's pretty much it for the mechanics of using the PHYLIP package. It is not difficult to use - but it very important that you verify that the assumptions that each program makes are at least approximately correct for the data that you are trying to analyze.


Phylogeny programs in the GCG package

Briefly, here is how to use the GCG programs. Use the Distances program to find the distances between sequences. (the distance matrix file you use with the PHYLIP package won't work verbatim in the GCG package or vice versa.) Once a distance file has been generated, use Growtree to do either a Neighbor-Joining (menu=1) or UPGMA (menu=2) analysis.

$ distances/infile=azurin.msf{*}/default
$ setplot
$ growtree/infile=azurin.distances/color=2/menu=1
$ type azurin.trees
#NEXUS
[ Trees from file: Azurin.Distances ]
begin trees;
utree Tree_1 = ((((('H81_Neigo':0.00,'H8_Neime':0.00):39.66,'Azu1_Metj':21.95)
:5.46,(('Az_Alcsp':10.25,'Az_Borbr':13.50):7.04,(('Az_Pseae'
:9.79,'Az_Psede':5.96):7.10,((('Az_Psefb':7.76,'Az_Psepu'
:3.27):6.97,'Az_Psefc':9.88):2.93,'Az_Psefd':10.45):7.65)
:4.30):3.59):2.09,('Az_Alcde':18.36,'Azu2_Metj':53.53):3.98)
:12.38,'Az_Alcfa':12.38):0.00;
endblock;

$ distances/infile=azurin.msf{*}/default
$ setplot
$ growtree/infile=azurin.distances/color=2/menu=2
$ type azurin.trees
#NEXUS
[ Trees from file: Azurin.Distances ]
begin trees;
tree Tree_1 = ((('H81_Neigo':0.00,'H8_Neime':0.00):33.73,((('Az_Alcde':21.51,
'Az_Alcfa':21.51):25.71,(('Az_Alcsp':11.88,'Az_Borbr':11.88):22.18,
(('Az_Pseae':7.88,'Az_Psede':7.88):16.84,((('Az_Psefb':5.51,
'Az_Psepu':5.51):11.18,'Az_Psefc':11.18):12.55,'Az_Psefd':12.55)
:16.84):22.18):25.71):27.44,'Azu1_Metj':27.44):33.73):42.24,'Azu2_Metj'
:42.24):0.00;
endblock;


The Phylo_Win program

Nicholas Galtier's Phylo_Win program is a relatively simple to use phylogeny program with a graphics user interface. It is straightforward in this program to determine which regions and sequences to analyze, and also to rearrange trees interactively. The version we have on Seqaxp requires an X11 server though. The program will accept a PAUP formatted file, along with several others, but not MSF format. Here is an example of how to start the program.

$ readseq -f17 -oazurin.paup -a azurin.msf
$ phylo_win azurin.paup

Here is an example of the node swapping mode, showing a tree produced by a maximum parsimony analysis. Clicking on one of the black boxes swaps the positions of the two branches to the right of that box.


Next week we'll go over how to format data for publication. Are there any questions?