Before we start discussing phylogeny programs, we should think for a minute about what we are asking these tools to do. In general, the problem we are addressing is this: we have a collection of "things" (an intentionally vague term) which we believe to have been all derived from ancestral "things", which were present in some common progenitor. We want to derive from these "things" a tree of some sort which shows how all of these "things" derived from that ancestor, or in some cases, just shows how they relate to each other.
These "things" are now most typically sequences, but might also be traits of various types, such as the existence of particular restriction enzyme locations, or anatomical traits, such as having a particular tooth or skull shape.
There is a bit of nomenclature you should be aware of when dealing with trees. Trees are composed of branches which connect at nodes. Most of the trees that these programs generate are strictly bifurcating, meaning that each internal node connects to exactly three branches, one of which can generally be thought of as "in", and the other two as "out". Tip nodes, or end nodes consist of the end of a branch and are usually marked with the name of the sequence or species that corresponds to it. Trees may be either unrooted or rooted. Unrooted trees indicate the relationship between the end nodes, rooted trees indicate, in addition to that, the relationship between a root end node and all of the other end nodes. The lengths of the branches of the tree may or may not be meaningful, depending on how the tree was generated. Strictly bifurcating trees cannot correctly represent a true multispecies (>2) radiation from an internal node; they have to represent these with extra internal nodes connected by extra branches. These extra branches typically have small positive or negative lengths.
There are an assortment of problems when it comes time to do a phylogenetic analysis. The most important of these are:
Input data for phylogenetic programs is of several types.
There are many, many programs available for performing various sorts of phylogenetic analyses. Mostly these are for PCs and Macintoshes, and most address a single type of phylogenetic analysis. There is an extensive list of these tools at:
Today we will cover those tools which are available on Seqaxp. These consist of a few programs in the GCG package, Phylo_Win, and the comprehensive PHYLIP package, from Joe Felsenstein at U. Washington. We also have the MOLPHY programs available, but I won't talk about those today. Here is a description of the programs we have, and what each does:
Phylip programs:
Programs that work on molecular sequence data
DNAPARS Parsimony method for DNA
DNAMOVE Interactive DNA parsimony
DNAPENNY Branch and bound for DNA
DNACOMP Compatibility for DNA
DNAINVAR Phylogenetic invariants
DNAML Maximum likelihood method
DNAMLK DNA ML with molecular clock
DNADIST Distances from sequences
PROTDIST Distances from proteins
PROTPARS Protein parsimony
PROTML Maximum likelihood method for protein (unsupported)
RESTML ML for restriction sites
SEQBOOT Bootstraps sequence data sets
Programs that work on distance matrix data
FITCH Fitch-Margoliash and least-squares methods
KITSCH Fitch-Margoliash and least squares methods with evolutionary clock
NEIGHBOR Neighbor-joining and UPGMA methods
Programs that work on gene frequencies and continuous characters
CONTML Maximum likelihood method
GENDIST Computes genetic distances
CONTRAST Computes contrasts and correlations for comparative method studies
Programs that work on 0-1 discrete state data
MIX Wagner, Camin-Sokal, and mixed parsimony criteria
MOVE Interactive Wagner, C-S, mixed parsimony program
PENNY Finds all most parsimonious trees by branch-and-bound
DOLLOP, DOLMOVE, DOLPENNY same as preceding four programs, but for
the Dollo and polymorphism parsimony criteria
CLIQUE Compatibility method
FACTOR recode multistate characters
Programs that plot or modify trees
DRAWGRAM Draws cladograms and phenograms on screens, plotters and printers
DRAWTREE Draws unrooted phylogenies on screens, plotters and printers
CONSENSE Majority-rule and strict consensus trees
RETREE Reroots, changes names and branch lengths, and flips trees
GCG programs:
DISTANCES Generate a distance matrix from a sequence alignment
DIVERGE Measures percent divergence between two aligned sequences
by the method of Perler
NEWDIVERGE Like DIVERGE, but uses the method of Li. (Called
DIVERGE in release 9.0).
GROWTREE Makes a phylogenetic tree using the UPGMA or Neigbour-joining
method. Input is a distance matrix from DISTANCES.
EGCG programs:
TOPHYLIP Convert GCG .MSF files to PHYLIP format.
Miscellaneous:
PHYLO_WIN Neighbor-Joining, UPGMA, and others. X11 GUI.
USEPHYLIP Interactive shell for use with all of the PHYLIP programs.
At this point we'll start running through how to use these, beginning with the PHYLIP programs. The programs in this package are designed to be controlled through strings stored in logical names (on OpenVMS) or shell environmental variables (on Unix). Here is the standard method you would use to run an arbitrary PHYLIP program:
$ define INFILE your_in_file $ define OUTFILE your_out_file $ define TREEFILE your_in_or_out_tree_file $ define FONTFILE PHYEXEDIR:FONT1.; $ run PHYEXEDIR:PROGNAME
The documentation for each PHYLIP program states explicitly which assumptions the algorithm requires. You can use HELP to read the documentation from version 3.4. If you want to read the current documentation, which isn't much different, it is in one of these files:
$ DIR PHYDOCDIR:*.doc CLIQUE.DOC CONSENSE.DOC CONTCHAR.DOC CONTML.DOC CONTRAST.DOC DISCRETE.DOC DISTANCE.DOC DNACOMP.DOC DNADIST.DOC DNAINVAR.DOC DNAML.DOC DNAMLK.DOC DNAMOVE.DOC DNAPARS.DOC DNAPENNY.DOC DOLLOP.DOC DOLMOVE.DOC DOLPENNY.DOC DRAW.DOC DRAWGRAM.DOC DRAWTREE.DOC FACTOR.DOC FITCH.DOC GENDIST.DOC KITSCH.DOC MAIN.DOC MAKEINF.DOC MIX.DOC MOVE.DOC NEIGHBOR.DOC PENNY.DOC PROTDIST.DOC PROTML.DOC PROTPARS.DOC RESTML.DOC RETREE.DOC SEQBOOT.DOC SEQUENCE.DOC
We have EGCG versions of many of the Phylip programs. These have a standard GCG command line interface. These versions are unsupported - if they work for you, great, if not, use the regular Phylip variant. To see if there is a version of the program which interests you, use commands like these:
$ SHOW SYMBOL *DNADIST* EDNADIST == "$ EGENUTIL:EDNADIST" $ SHOW SYMBOL *DRAWTREE* %DCL-W-UNDSYM, undefined symbol - check validity and spelling
Since the standard method of running these programs is a bit alien for users who are accustomed to working through GUIs, we'll use the USEPHYLIP menu driven interface intstead. If you start USEPHYLIP, then immediately exit from it, the logical names PHYDOCDIR and PHYEXEDIR will be defined for you. At that point you may use the standard method of running these programs if you so desire - and you may want to if you have to run one or more of them many times in a batch job. USEPHYLIP is menu driven (as are the PHYLIP programs themselves). It has two sections:
USEPHYLIP has only two menus, the main one has these options:
0 Exit 1 Help (currently only ver3.4 help is available) 2 Run program 3 Change Options 4 List current directory 5 Spawn a subprocess (use this to rename, delete, etc.) 6 Convert Infile, in GCG .MSF format, to Phylip format i Type the input file o Type the output file t Type the tree file
and the Change Options menu has these values:
0 Return to main menu 1 change program 2 change infile 3 change root (->root.plot,root.out,root.tree) 4 change fontfile 5 change text output method
PHYLIP programs require aligned sequences to do their work. Consequently, the first thing you need to know to use PHYLIP is how to convert MSF files to PHYLIP format. You can do that from the command line with the EGCG program TOPHYLIP. Or, you can run that program indirectly from within USEPHYLIP as in this example:
$ usephylip Status: Program none Infile none Outfile none.out Plotfile none.plot Treefile none.tree Fontfile PHYEXEDIR:FONT1.. root none Text Out TYPE 3 2 azurin.msf Status: Program none Infile AZURIN.MSF Outfile none.out Plotfile none.plot Treefile none.tree Fontfile PHYEXEDIR:FONT1.. root none Text Out TYPE 0 6 Status: Program none Infile AZURIN.PHYLIP Outfile none.out Plotfile none.plot Treefile none.tree Fontfile PHYEXEDIR:FONT1.. root none Text Out TYPE
The Infile now shows AZURIN.PHYLIP where AZURIN.MSF used to be. If you select option i you will see that the format has indeed changed to the PHYLIP format.
The PARSIMONY programs make the assumption that the desired tree is the one with the fewest changes in it to get from one sequence to another. The DNAPARS program also makes some assumptions about the rate of evolution - that is, that it is pretty much the same at all sites and all lineages.
We'll use the example data provided with the PHYLIP program, and run it from within the USEPHYLIP shell.
$ Usephylip
3
1
DNAPARS
2
CLASS:EXAMPLE_1.INPUT
3
DNAPARS
0
Status:
Program DNAPARS
Infile EXAMPLE_1.INPUT
Outfile DNAPARS.out
Plotfile DNAPARS.plot
Treefile DNAPARS.tree
Fontfile PHYEXEDIR:FONT1..
root DNAPARS
Text Out TYPE
i
5 13
Alpha AACGUGGCCAAAU
Beta AAGGUCGCCAAAC
Gamma CAUUUCGUCACAA
Delta GGUAUUUCGGCCU
Epsilon GGGAUCUCGGCCC
2
5 DNAPARS prompt: print sequence
Y DNAPARS prompt: all settings are correct
o
DNA parsimony algorithm, version 3.572c
One most parsimonious tree found:
+--Epsilon
+--4
+--3 +--Delta
! !
+--2 +-----Gamma
! !
--1 +--------Beta
!
+-----------Alpha
remember: this is an unrooted tree!
requires a total of 19.000
From To Any Steps? State at upper node
( . means same as in the node below it on tree)
1 AABGTSGCCA AAY
1 2 maybe .....C.... ...
2 3 yes V.KD...... C..
3 4 yes GG.A..T.GG .C.
4 Epsilon maybe ..G....... ..C
4 Delta yes ..T..T.... ..T
3 Gamma yes C.TT...T.. ..A
2 Beta maybe ..G....... ..C
1 Alpha maybe ..C..G.... ..T
t
((((Epsilon,Delta),Gamma),Beta),Alpha);
There is a variant of this program called DNAMOVE which allows you to manually adjust pieces of the tree, and then rerun the analysis on just some parts of it.
$ UsePhylip
3
1
dnamove
2
CLASS:EXAMPLE_1.INPUT
3
dnamove
0
2
Y DNAMOVE prompt: settings are correct
+---------------5:Epsilon (unrooted)
--9
| +------------ 4:Delta
+-8
| +--------- 3:Gamma Steps: 19.00000
+- 7
| +------ 2:Beta
+- 6
+------ 1:Alpha 11 sites compatible
NEXT? (Options: R # + - S . T U W O F C H ? X Q) (H or ? for Help)
Use the T command to see whether a better score could result if a particular node was moved.
T
Try other positions for which node? 7
WAIT ...
BETTER: NONE
TIED: 5 9: 19.00
Shows that the node 7 could swap with nodes 5 or 9 without changing the score. To actually make that rearrangement, use
R
Remove everything to the right of which node? 7
Add before which node? 5
+-------- 3:Gamma (unrooted)
+- 7
| | +------ 2:Beta
+- 8 +-6
| | +------ 1:Alpha Steps: 19.00000
--9 |
| +----------- 5:Epsilon (as good as best)
|
+-------------- 4:Delta 11 sites compatible
For simple cosmetic changes (flipping branches) use F
F
9
F
8
+---------------4:Delta (unrooted)
--9
| +------------ 5:Epsilon
+-8
| +--------- 3:Gamma Steps: 19.00000
+- 7
| +------ 2:Beta
+- 6
+------ 1:Alpha 11 sites compatible
This makes it a bit clearer that the position of the Epsilon and Delta nodes in this tree may be reversed and the resulting tree will have the same score. To see what is the expected state of each position use S (+,- are like S, but move relative to the current site)
S
2
gggggggggggg4:Delta SITE 2 (unrooted)
gg8
g ggggggggg5:Epsilon a:A, c:C, g:G, t:T, ?:?
ggg9
a aaaaaa3:Gamma Steps: 19.00000
aaa7
a aaa2:Beta
aaa6
aaa1:Alpha 11 sites compatible
X
N Write the tree to a file?
DNAPENNY will find all most parsimonious trees, as opposed to DNAPARS, which will only find one. It uses a search method called "branch and bound" which is said to be very efficient. The method is described in the program's documentation so I won't go over it here. The main problem is that the space it is searching is quite large, and expands factorially with the number of species, so that it is only practical for small data sets. Above 12 species it is very slow.
$ UsePhylip
3
1
DNAPENNY
2
CLASS:EXAMPLE_2.INPUT
3
DNAPENNY
0
i
8 6
Alpha1 AAGAAG
Alpha2 AAGAAG
Beta1 AAGGGG
Beta2 AAGGGG
Gamma1 AGGAAG
Gamma2 AGGAAG
Delta GGAGGA
Epsilon GGAAAG
2
4 print out steps in each site
5 print sequences at all nodes
Y all settings are correct
o to save space, only the first 3 trees are shown
Penny algorithm for DNA, version 3.572c
branch-and-bound to find all most parsimonious trees
requires a total of 8.000
9 trees in all found
+--------------------Alpha1
!
! +--Delta
! +--3
! +--7 +--Epsilon
--1 ! !
! +-----6 +-----Gamma2
! ! !
! +--4 +--------Gamma1
! ! !
! ! ! +--Beta2
+--2 +-----------5
! +--Beta1
!
+-----------------Alpha2
remember: this is an unrooted tree!
steps in each site:
0 1 2 3 4 5 6 7 8 9
*-----------------------------------------
0! 1 1 1 2 2 1
From To Any Steps? State at upper node
( . means same as in the node below it on tree)
1 AAGAAG
1 Alpha1 no ......
1 2 no ......
2 4 no ......
4 6 yes .G....
6 7 no ......
7 3 yes G.A...
3 Delta yes ...GGA
3 Epsilon no ......
7 Gamma2 no ......
6 Gamma1 no ......
4 5 yes ...GG.
5 Beta2 no ......
5 Beta1 no ......
2 Alpha2 no ......
+--------------------Alpha1
!
! +--Delta
! +-----3
! ! +--Epsilon
--1 +-----6
! ! ! +--Gamma2
! ! +-----7
! +--4 +--Gamma1
! ! !
! ! ! +--Beta2
+--2 +-----------5
! +--Beta1
!
+-----------------Alpha2
remember: this is an unrooted tree!
steps in each site:
0 1 2 3 4 5 6 7 8 9
*-----------------------------------------
0! 1 1 1 2 2 1
From To Any Steps? State at upper node
( . means same as in the node below it on tree)
1 AAGAAG
1 Alpha1 no ......
1 2 no ......
2 4 no ......
4 6 yes .G....
6 3 yes G.A...
3 Delta yes ...GGA
3 Epsilon no ......
6 7 no ......
7 Gamma2 no ......
7 Gamma1 no ......
4 5 yes ...GG.
5 Beta2 no ......
5 Beta1 no ......
2 Alpha2 no ......
+--------------------Alpha1
!
! +--Delta
! +--3
! +--6 +--Epsilon
--1 ! !
! +-----7 +-----Gamma1
! ! !
! +--4 +--------Gamma2
! ! !
! ! ! +--Beta2
+--2 +-----------5
! +--Beta1
!
+-----------------Alpha2
remember: this is an unrooted tree!
steps in each site:
0 1 2 3 4 5 6 7 8 9
*-----------------------------------------
0! 1 1 1 2 2 1
From To Any Steps? State at upper node
( . means same as in the node below it on tree)
1 AAGAAG
1 Alpha1 no ......
1 2 no ......
2 4 no ......
4 7 yes .G....
7 6 no ......
6 3 yes G.A...
3 Delta yes ...GGA
3 Epsilon no ......
6 Gamma1 no ......
7 Gamma2 no ......
4 5 yes ...GG.
5 Beta2 no ......
5 Beta1 no ......
2 Alpha2 no ......
If you are doing anything large, you will want to change the F option to something much larger, so that the program won't spend all of its time reporting its progress. The test case completes instantaneously on an Alphaserver 2100 4/200.
The compatibility method is this: if a position has 3 of the 4 possible bases, then evaluate different trees subject to the criterion that the minimum number of changes required in the tree is two (one less than three).
$ UsePhylip
3
1
DNACOMP
2
CLASS:EXAMPLE_1.INPUT
3
DNACOMP
0
2
4 Print data for each site
5 Print sequence at each node
Y Settings are correct
o
DNA compatibility algorithm, version 3.572c
One most parsimonious tree found:
+--Epsilon
+--4
+--3 +--Delta
! !
+--2 +-----Gamma
! !
--1 +--------Beta
!
+-----------Alpha
remember: this is an unrooted tree!
total number of compatible sites is 11.0
steps in each site:
0 1 2 3 4 5 6 7 8 9
*-----------------------------------------
0! 2 1 3 2 0 2 1 1 1
10! 1 1 1 3
compatibility (Y or N) of each site with this tree:
0123456789
*----------
0 ! YYNYYYYYY
10 !YYYN
From To Any Steps? State at upper node
( . means same as in the node below it on tree)
1 AABGTSGCCA AAY
1 2 maybe .....C.... ...
2 3 yes V.KD...... C..
3 4 yes GG.A..T.GG .C.
4 Epsilon maybe ..G....... ..C
4 Delta yes ..T..T.... ..T
3 Gamma yes C.TT...T.. ..A
2 Beta maybe ..G....... ..C
1 Alpha maybe ..C..G.... ..T
In this case, 11 of the 13 bases were "compatible" with the best tree, that is, the tree could be constructed for the minimum number of substitutions. Sites that weren't compatible required more than this number of substitutions on the tree.
DNAML implements a maximum likelihood method. It tries to find the most likely tree given a model of how nucleotides mutate. It makes certain assumptions about the frequencies of transitions (C<->T,A<->G) and transversions (C or T<->A or G). It also allows you to specify that there are up to 9 classes of sites which mutate at different rates (but you don't tell it which site is which).
$ UsePhylip
3
1
DNAML
2
CLASS:EXAMPLE_3.INPUT
3
DNAML
0
i
5 13
Alpha AACGTGGCCAAAT
Beta AAGGTCGCCAAAC
Gamma CATTTCGTCACAA
Delta GGTATTTCGGCCT
Epsilon GGGATCTCGGCCC
2
Y Settings are correct
o
Nucleic acid sequence Maximum Likelihood method, version 3.572c
Empirical Base Frequencies:
A 0.24615
C 0.29231
G 0.24615
T(U) 0.21538
Transition/transversion ratio = 2.000000
(Transition/transversion parameter = 1.523077)
+Beta
+--1
! ! +Epsilon
! +--------------------------------------------3
! +--------Delta
!
--2------------------------------Gamma
!
+-----Alpha
remember: this is an unrooted tree!
Ln Likelihood = -72.25079
Examined 17 trees
Between And Length Approx. Confidence Limits
------- --- ------ ------- ---------- ------
2 1 0.09407 ( zero, 0.40908)
1 Beta 0.00003 ( zero, 0.32890)
1 3 1.50823 ( zero, 3.29744) **
3 Epsilon 0.00006 ( zero, 0.34286)
3 Delta 0.28112 ( zero, 0.62627) **
2 Gamma 1.01708 ( zero, 2.33211) **
2 Alpha 0.20748 ( zero, 0.56581)
* = significantly positive, P < 0.05
** = significantly positive, P < 0.01
DNAMLK implements a maximum likelihood method assuming a molecular clock. That is, the distance from the root of the tree to each of the end nodes is the same. Like DNAML, it makes certain assumptions about the frequencies of transitions and transversions.
$ UsePhylip
3
1
DNAMLK
2
CLASS:EXAMPLE_3.INPUT
3
DNAMLK
0
2
C
2
1.0 3.2
0.4 0.6
R
1.5
Y Settings are correct
o
5 Species, 13 Sites
Nucleic acid sequence
Maximum Likelihood method with molecular clock, version 3.572c
Site category Rate of change
1 1.000
2 3.200
Empirical Base Frequencies:
A 0.24615
C 0.29231
G 0.24615
T(U) 0.21538
Transition/transversion ratio = 2.000000
(Transition/transversion parameter = 1.523077)
+--Delta
+--------------------------------------------------------4
! +--Epsilon
--3
! +-------------Gamma
+---------------------------------------------2
! +--Alpha
+----------1
+--Beta
Ln Likelihood = -72.40499
Ancestor Node Node Time Length
-------- ---- ---- ---- ------
root 3
3 4 3.40036 3.40036
4 Delta 3.55871 0.15834
4 Epsilon 3.55871 0.15834
3 2 2.75249 2.75249
2 Gamma 3.55871 0.80621
2 1 3.38652 0.63403
1 Alpha 3.55871 0.17218
1 Beta 3.55871 0.17218
Combination of categories that contributes the most to the likelihood:
2222221111 112
Protein Parsimony is pretty much the same thing as DNA parsimony, except that it works on protein sequence. The program is called PROTPARS.
Protein Maximum likelihood is unsupported, however it does run. Instead of modelling nucleotide mutation it uses the amino acid replacement frequencies in the Dayhoff matrix.
$ UsePhylip
3
1
PROTML
2
CLASS:EXAMPLE_4.INPUT
3
PROTML
0
0
$ type CLASS:EXAMPLE_4.INPUT
5 40 S
species1 ARNDCQEGHILKAFPMTWYVARNDCQEGHISKMFGWTWYV
species2 ARNHNQCGHILKMFPMTSYVARNCCAEHHILKHFPSTWIV
species3 AINDCQEGHHLKMFPMTMYSVRNRIQEMHIQKHCPHTHYV
species4 AINHCQCEHILWMFPSTPYVARNDIQNYHILKMPPSTWWV
species5 AINDCSCGHHLWMFPSLCYVRRNECQGGHIWKMFPLTVCA
((species1,species2,species3),species4,species5)
$ define/user sys$output protml.out
$ run phyexedir:protml
$ type protml.out
---------------------------------------------------------
PROTML Ver.1.00b in MOLPHY
Maximum Likelihood Inference of Protein Phylogeny
based on Dayhoff model
---------------------------------------------------------
Sequences Data 5 Species, 40 Sites
Species Amino Acid Sequences
------- --------------------
Consensus AINDCQCGHI LKMFPMTWYV ARNDCQEGHI LKMFPSTWYV
species1 .R....E... ..A....... .......... S...GW....
species2 .R.HN..... .......S.. ...C.A.H.. ..H.....I.
species3 ......E..H .......M.S V..RI..M.. Q.HC.H.H..
species4 ...H...E.. .W...S.P.. ....I.NY.. ...P....W.
species5 .....S...H .W...SLC.. R..E..G... W....L.VCA
:-----------------------------1.species1
:----------------6
: :----------------------------------2.species2
: :
: :+++++++++++++++++++++++++++--3.species3
0:
:-------------------------4.species4
:
:+++++++++++++++++++++++++++--5.species5
No. 0 - 1 number Length S.E. non convergence
----------------------------------------------
species1 1 27.42084 ( 10.35696 )
species2 2 32.65231 ( 10.71586 )
species3 3 66.83496 ( 17.16932 )
species4 4 23.81696 ( 9.64345 )
species5 5 54.67129 ( 15.74179 )
6 14.74376 ( 7.55932 )
----------------------------------------------
ln L : -450.137 ( 35.632 ) AIC : 912.274
----------------------------------------------
:----------------------------1.species1
:-----7
: :---------------------------------2.species2
:--------------6
: :+++++++++++++++++++++++++++--3.species3
0:
:-------------------------4.species4
:
:+++++++++++++++++++++++++++--5.species5
( ((species1,species2),species3), species4, species5 )
No. 1 - 1 number Length S.E. non convergence
----------------------------------------------
species1 1 26.21945 ( 9.93006 )
species2 2 31.67416 ( 10.37910 )
species3 3 64.36898 ( 16.54411 )
species4 4 23.95289 ( 9.69605 )
species5 5 54.66043 ( 15.74315 )
6 12.39223 ( 6.88107 )
7 3.15279 ( 3.70619 )
----------------------------------------------
ln L : -449.197 ( 35.804 ) AIC : 912.394
----------------------------------------------
:-------------------1.species1
:----------------7
: :+++++++++++++++++++++++++++--3.species3
:---------------6
: :--------------------------------2.species2
0:
:-----------------------4.species4
:
:+++++++++++++++++++++++++++--5.species5
( ((species1,species3),species2), species4, species5 )
No. 1 - 2 number Length S.E. non convergence
----------------------------------------------
species1 1 17.00191 ( 8.75742 )
species2 2 30.29431 ( 10.24536 )
species3 3 56.89282 ( 15.54253 )
species4 4 21.27627 ( 9.11808 )
species5 5 58.60585 ( 16.48033 )
6 13.80483 ( 7.44834 )
7 14.48062 ( 7.71363 )
----------------------------------------------
ln L : -447.247 ( 35.645 ) AIC : 908.494
----------------------------------------------
:----------------------------1.species1
:---------------6
: : :-------------------------2.species2
: :------------7
: :+++++++++++++++++++++++++++--3.species3
0:
:--------------------------4.species4
:
:+++++++++++++++++++++++++++--5.species5
( (species1,(species2,species3)), species4, species5 )
No. 1 - 3 number Length S.E. non convergence
----------------------------------------------
species1 1 26.98879 ( 10.21729 )
species2 2 23.84699 ( 9.53037 )
species3 3 57.08666 ( 15.78534 )
species4 4 24.16718 ( 9.71627 )
species5 5 54.29548 ( 15.68420 )
6 13.76134 ( 7.34501 )
7 10.78488 ( 6.22688 )
----------------------------------------------
ln L : -448.573 ( 35.661 ) AIC : 911.146
----------------------------------------------
NO. 1 stage 3 trees
-----------------------------------------------------------------------
Tree ln L Diff ln L #Para AIC Diff AIC Boot P
-----------------------------------------------------------------------
1 -449.20 -1.95 +- 5.05 7 912.39 3.90 +- 10.10
2 -447.25 <-- best 7 908.49 <-- best
3 -448.57 -1.33 +- 5.95 7 911.15 2.65 +- 11.90
-----------------------------------------------------------------------
:-------------------1.species1
:----------------7
: :+++++++++++++++++++++++++++--3.species3
:---------------6
: :--------------------------------2.species2
0:
:-----------------------4.species4
:
:+++++++++++++++++++++++++++--5.species5
No. 1 - 0 number Length S.E. non convergence
----------------------------------------------
species1 1 17.00218 ( 8.75739 )
species2 2 30.29451 ( 10.24535 )
species3 3 56.89304 ( 15.54246 )
species4 4 21.27633 ( 9.11799 )
species5 5 58.60578 ( 16.48025 )
6 13.80468 ( 7.44826 )
7 14.47989 ( 7.71304 )
----------------------------------------------
ln L : -447.247 ( 35.645 ) AIC : 908.494
----------------------------------------------
This program does not ask for parameters, and always sends its output to the screen. That is why in the example above it was configured with USEPHYLIP, but the actual run was performed outside of that menu environment.
If you have only data on the existence of particular restriction sites, that is still enough to allow a phylogenetic analysis with the program RESTML. The input file contains "+" if a site is present, "-" if not, and "?" if it isn't known.
$ UsePhylip
3
1
RESTML
2
CLASS:EXAMPLE_5.INPUT
3
RESTML
0
i
5 13 2
Alpha ++-+-++--+++-
Beta ++++--+--+++-
Gamma -+--+-++-+-++
Delta ++-+----++---
Epsilon ++++----++---
2
Y Settings are correct
o
Restriction site Maximum Likelihood method, version 3.572c
Recognition sequences all 6 bases long
Sites absent from all species are assumed to have been omitted
+Epsilon
+---3
! +Delta
!
! +-----Gamma
--2--1
! +Beta
!
+Alpha
remember: this is an unrooted tree!
Ln Likelihood = -40.35858
Examined 15 trees
Between And Length Approx. Confidence Limits
------- --- ------ ------- ---------- ------
2 3 0.05855 ( 0.05681, 0.06030) **
3 Epsilon 0.00005 ( 0.00000, 0.00010)
3 Delta 0.01458 ( 0.01374, 0.01543) **
2 1 0.00002 ( zero, 0.00006)
1 Gamma 0.11430 ( 0.11177, 0.11683) **
1 Beta 0.00008 ( 0.00002, 0.00014)
2 Alpha 0.02468 ( 0.02357, 0.02578) **
* = significantly positive, P < 0.05
** = significantly positive, P < 0.01
There are two programs in the PHYLIP package for calculating the required distance matrix from aligned sequence, DNADIST and PROTDIST. Here is an example with DNADIST:
$ UsePhylip 3 1 DNADIST 2 CLASS:EXAMPLE_3.INPUT 3 DNADIST 0 2 Y Settings are correct o
There are several ways of defining "distance" between sequences, and choosing different ones may change the trees that you get later. For proteins, the methods available are uncorrected, Jukes-Cantor, or Kimura [default]. For nucleotide sequences, the available methods are uncorrected, Tajima-Nei, Kimura 2-parameter, Jin-Nei Gamma. The uncorrected distances underestimate the divergence between two sequences once they begin to accumulate multiple substitution events at single sites. Therefore you should only ever use the uncorrected distances when there are very few differences between the sequences, and so multiple events are unlikely to have occurred. The other distance methods listed above attempt to correct for the multiple substitution problem by taking into account the properties of various models of mutation. More complete descriptions of these methods are presented in the GCG help section on the program GrowTree. The unit of "distance" that comes out of these can be thought of as "the amount of time required for X substitutions per site to have occurred." For instance, if X = 0.01 then one would expect one substitution per 100 sites. X can be greater than 1.0! This means, that on average, each residue was substituted more than once (some of these can be back mutations.)
There are two terms that you need to be familiar with when working with distance programs. The first is "additive distances", which means that the distance measured between two sequences is the same as that calculated along the path in the tree that separates their respective nodes. "Ultrametric distances" means that, in addition to the previous constraint, the distance from the root to each end node is the same. You should realize that it is unlikely that real data will be consistent with "additive distances", and even less likely that it will be consistent with "ultrametric distances." One way that this tends to show up is as branches that have negative lengths. If these are only a little negative you might be willing to write it off to small statistical fluctuations, or the presence of a multifurcating internal node, and accept the tree. However, if these negative distances are large, you might take the hint and realize that your data does not fit the model which you are trying to impose on it.
Probably the simplest of the distance methods to understand is that implemented in the FITCH program. It tests the fit of trees by calculating a least squares error value, which sums the squares of the differences between the distances measured between each pair of sequences, and the distances computed on the tree.
$ UsePhylip
3
1
FITCH
2
CLASS:EXAMPLE_6.INPUT
3
FITCH
0
i
5
Alpha 0.000 1.000 2.000 3.000 3.000
Beta 1.000 0.000 2.000 3.000 3.000
Gamma 2.000 2.000 0.000 3.000 3.000
Delta 3.000 3.000 3.000 0.000 1.000
Epsilon 3.000 3.000 3.000 1.000 0.000
2
Y Settings are correct
o
The KITSCH program is similar to FITCH, above, except that it requires that all sequences tested are for a single time point, and that the data is ultrametric.
$ UsePhylip 3 1 KITSCH 2 CLASS:EXAMPLE_6.INPUT 3 KITSCH 0 2 Y Settings are correct o
The NEIGHBOR program implements both the neighbor-joining and unweighted pair group method with arithmetic mean, or UPGMA. The GCG program GROWTREE also implmenents these two methods. Effectively, both of these methods are distance based clustering algorithms, with somewhat different weightings and end results. Neigbor-joining clusters to a form a tree connecting all end nodes, whereas UPGMA clusters to construct a tree leading from the end nodes to a single root node. Both work by reducing a distance matrix sequentially, taking two carefully chosen entries at a time and merging them, then recalculating the distances from the merged node to all of the remaining sequences. It is important to note that the neighbor-joining method requires additive distances, and UPGMA ultrametric distances. As noted above, these constraints are unlikely to be satisfied with real data. On the other hand, these methods are reasonably fast even with huge numbers of sequences as a result of the assumptions that they make. They deduce the tree algorithmically from the starting distances, rather than having to test all possible trees.
$ UsePhylip 3 1 NEIGHBOR 2 CLASS:EXAMPLE_6.INPUT 3 NEIGHBOR 0 2 Y Settings are correct o
To change from Neighbor joining to UPGMA do
$ UsePhylip 2 N Y Settings are correct o
Remember, the UPGMA tree is rooted, the Neighbor-Joining tree is not.
Let's have a look now at the sort of results one gets if a single data set is run through a variety of phylogeny programs. The following set of azurins were aligned with PILEUP:
Azurin protein sequences from several bacterial species.
H81_Neigo neisseria gonorrhoeae. h.8 outer membrane protein precursor.
2/94 183bp
H8_Neime neisseria meningitidis. h.8 outer membrane protein precursor.
10/93 183bp
Azur_Alcde alcaligenes denitrificans. azurin precursor. 2/95 149bp
Azur_Alcfa alcaligenes faecalis. azurin. 10/89 128bp
Azur_Alcsp alcaligenes sp. azurin. 2/95 129bp
Azur_Borbr bordetella bronchiseptica. azurin. 2/95 129bp
Azur_Pseae pseudomonas aeruginosa. azurin precursor. 11/95 148bp
Azur_Psede pseudomonas denitrificans. azurin. 11/95 128bp
Azur_Psefb pseudomonas fluorescens biotype b. azurin. 2/95 128bp
Azur_Psefc pseudomonas fluorescens biotype c. azurin. 2/95 128bp
Azur_Psefd pseudomonas fluorescens biotype d. azurin. 2/95 128bp
Azur_Psepu pseudomonas putida. azurin. 2/95 128bp
Azu1_Metj methylomonas j. azurin iso-1. 2/95 128bp
Azu2_Metj methylomonas j. azurin iso-2. 2/95 129bp
After alignment, the first 57 residues were removed, as some of these sequences are precursor forms, some aren't. If this (artificially) variable region had not been removed the longer precursor sequences would have been "closer" to each other than is otherwise warranted, based solely on the inclusion of the leader sequence. The remaining aligned sequence was converted to PHYLIP format with the EGCG TOPHYLIP program. This alignment was then fed into PROTDIST, which used the Kimura method to generate a distance matrix. Then FITCH, KITSCH, and NEIGHBOR were run with all variables at their default setting. NEIGHBOR was also used to generate an UPGMA alignment, and PROTPARS was run on the trimmed alignment. Starting from the trimmed MSF file, here are the commands which were used:
$ USEPHYLIP
0
$ TOPHYLIP/infile=AZURIN.MSF{*}/out=AZURIN.PHYLIP
$ create plotit.com
$!
$! run drawtree or drawgram to make postscript plot
$!
r phyexedir:'P1'
L
N
Y
$ fname = f$trnlnm("PLOTFILE")
$ printg/noflag/delete 'fname'
^Z
$ create setlogs.com
$ define outfile 'P1'.out
$ define treefile 'P1'.tree
$ define plotfile 'P1'.plot
$ exit
^Z
$ define infile azurin.phylip
$ @setlogs protpars
$ r phyexedir:protpars
Y run PROTPARS, all defaults
$ @plotit drawtree
$ define outfile azurin.distances
$ r phyexedir:protdist
P Kimura distances
Y
$ define infile azurin.distances
$ @setlogs fitch
$ r phyexedir:fitch
Y
$ @plotit drawtree
$ @setlogs kitsch
$ r phyexedir:kitsch
Y
$ @plotit drawtree
$ @setlogs NJ
$ r phyexedir:neighbor
Y
$ @plotit drawtree
$ @setlogs UPGMA
$ r phyexedir:neighbor
N use UPGMA instead of NJ
Y
$ @plotit drawtree
Protpars
Fitch
Kitsch
Neighbor-Joining
UPGMA
Here are the contents of the tree files:
ProtparsNotice that while all methods agree on some of the terminal branchings, the order of the most ancient branchings differs from one method to another. For instance, look at the Azu1/Azu2 Metj versus neisseria positions. Also, the two neisseria isolates are identical in the region analyzed - the distance programs correctly draw them right on top of each other, but PROTPARS doesn't. All trees were drawn in the same style to make comparisons easier. The Kitsch and UPGMA trees have a root near the node which connects Azu2 Metj to the other entries. A black rectangle has been placed in the illustrations to show its approximate position - the DRAWTREE program does not mark the location of this node.
There are several programs in the PHYLIP package for working with discrete state data. The only program that we've covered so far that uses this sort of data is RESTML, and it makes certain assumptions that the others don't make, since the only trait it allows is the existence, or not, of restriction sites. Since I think it unlikely that many of you will ever use these other programs we won't go through them in class.
The next topic that I want to cover is how to get the trees that these programs produce into a graphics format.
The first program, is RETREE. It allows you to flip the tree at any node. This changes the look of the tree, but not what it says.
$ UsePhylip 3 1 RETREE 2 NEIGHBOR.TREE 3 RETREE 0 2 Y Settings are correct
Another program deals with a slightly different problem - let's say that you have run 5 or 6 different methods - how do you figure out which parts of the various trees are in agreement? For small trees you can do this by inspection, but for bigger trees that can be really tedious. The CONSENSE program will do this for you. First you need to put all of the trees into one file, which has the format shown here:
$ UsePhylip
i
3
1
CONSENSE
2
CLASS:EXAMPLE_7.input
3
CONSENSE
0
i
(A,(B,(H,(D,(J,(((G,E),(F,I)),C))))));
(A,(B,(D,((J,H),(((G,E),(F,I)),C)))));
(A,(B,(D,(H,(J,(((G,E),(F,I)),C))))));
(A,(B,(E,(G,((F,I),((J,(H,D)),C))))));
(A,(B,(E,(G,((F,I),(((J,H),D),C))))));
(A,(B,(E,((F,I),(G,((J,(H,D)),C))))));
(A,(B,(E,((F,I),(G,(((J,H),D),C))))));
(A,(B,(E,((G,(F,I)),((J,(H,D)),C)))));
(A,(B,(E,((G,(F,I)),(((J,H),D),C)))));
2
Y Settings are correct
o
Majority-rule and strict consensus tree program, version 3.572c
Species in order:
A
B
H
D
J
G
E
F
I
C
Sets included in the consensus tree
Set (species in order) How many times out of 9.00
.......**. 9.00
..******** 9.00
..***....* 6.00
..****.*** 6.00
..***..... 6.00
..*.*..... 4.00
..***..*** 2.00
Sets NOT included in consensus tree:
Set (species in order) How many times out of 9.00
.....**... 3.00
.....****. 3.00
..**...... 3.00
.....***** 3.00
..*.****** 2.00
.....*.**. 2.00
..****...* 2.00
....****** 2.00
...******* 1.00
CONSENSUS TREE:
the numbers at the forks indicate the number
of times the group consisting of the species
which are to the right of that fork occurred
among the trees, out of 9.00 trees
+---------------------------------------A
!
! +-----------------------------E
! !
! ! +----I
! ! +------------9.0
! ! ! +----F
! +--9.0 !
! ! ! +--2.0 +---------D
! ! ! ! ! +--6.0
! ! ! ! ! ! ! +----J
! ! ! ! +--6.0 +--4.0
+--9.0 +--6.0 ! +----H
! ! !
! ! +--------------C
! !
! +------------------------G
!
+----------------------------------B
remember: this is an unrooted tree!
t
(A:9.0,((E:9.0,(((I:9.0,F:9.0):9.0,((D:9.0,(J:9.0,H:9.0):4.0)
:6.0,C:9.0):6.0):2.0,G:9.0):6.0):9.0,B:9.0):9.0);
$ UsePhylip 3 1 DRAWTREE 2 CLASS:EXAMPLE_1.INPUT 3 CONSENSE 0 2 L Set final output graphics format (postscript) K Set preview graphics format (tektronix) Y Settings are correct 0
If you selected postscript output, and generally you do, then print it with:
$ printg/noflag consense.plot
DRAWGRAM works similarly, but on rooted trees.
$ UsePhylip 3 1 DRAWGRAM 2 CLASS:EXAMPLE_1.INPUT 3 DNAMLK 0 2 L K YY 0
![]()
That's pretty much it for the mechanics of using the PHYLIP package. It is not difficult to use - but it very important that you verify that the assumptions that each program makes are at least approximately correct for the data that you are trying to analyze.
Briefly, here is how to use the GCG programs. Use the Distances program to find the distances between sequences. (the distance matrix file you use with the PHYLIP package won't work verbatim in the GCG package or vice versa.) Once a distance file has been generated, use Growtree to do either a Neighbor-Joining (menu=1) or UPGMA (menu=2) analysis.
$ distances/infile=azurin.msf{*}/default
$ setplot
$ growtree/infile=azurin.distances/color=2/menu=1
$ type azurin.trees
#NEXUS
[ Trees from file: Azurin.Distances ]
begin trees;
utree Tree_1 = ((((('H81_Neigo':0.00,'H8_Neime':0.00):39.66,'Azu1_Metj':21.95)
:5.46,(('Az_Alcsp':10.25,'Az_Borbr':13.50):7.04,(('Az_Pseae'
:9.79,'Az_Psede':5.96):7.10,((('Az_Psefb':7.76,'Az_Psepu'
:3.27):6.97,'Az_Psefc':9.88):2.93,'Az_Psefd':10.45):7.65)
:4.30):3.59):2.09,('Az_Alcde':18.36,'Azu2_Metj':53.53):3.98)
:12.38,'Az_Alcfa':12.38):0.00;
endblock;
$ distances/infile=azurin.msf{*}/default
$ setplot
$ growtree/infile=azurin.distances/color=2/menu=2
$ type azurin.trees
#NEXUS
[ Trees from file: Azurin.Distances ]
begin trees;
tree Tree_1 = ((('H81_Neigo':0.00,'H8_Neime':0.00):33.73,((('Az_Alcde':21.51,
'Az_Alcfa':21.51):25.71,(('Az_Alcsp':11.88,'Az_Borbr':11.88):22.18,
(('Az_Pseae':7.88,'Az_Psede':7.88):16.84,((('Az_Psefb':5.51,
'Az_Psepu':5.51):11.18,'Az_Psefc':11.18):12.55,'Az_Psefd':12.55)
:16.84):22.18):25.71):27.44,'Azu1_Metj':27.44):33.73):42.24,'Azu2_Metj'
:42.24):0.00;
endblock;
The Phylo_Win program
Nicholas Galtier's Phylo_Win program is a relatively simple to use phylogeny program with a graphics user interface. It is straightforward in this program to determine which regions and sequences to analyze, and also to rearrange trees interactively. The version we have on Seqaxp requires an X11 server though. The program will accept a PAUP formatted file, along with several others, but not MSF format. Here is an example of how to start the program.
$ readseq -f17 -oazurin.paup -a azurin.msf $ phylo_win azurin.paupHere is an example of the node swapping mode, showing a tree produced by a maximum parsimony analysis. Clicking on one of the black boxes swaps the positions of the two branches to the right of that box.
![]()
Next week we'll go over how to format data for publication. Are there any questions?