fpromlk 
Please help by correcting and extending the Wiki pages.
Estimates phylogenies from protein amino acid sequences by maximum likelihood. The PAM, JTT, or PMB models can be employed, and also use of a Hidden Markov model of rates, with the program inferring which sites have which rates. This also allows gammadistribution and gammaplusinvariant sites distributions of rates across sites. It also allows different rates of change at known sites.
The assumptions of the model are:
Note the assumption that we are looking at all positions, including those that have not changed at all. It is important not to restrict attention to some positions based on whether or not they have changed; doing that would bias branch lengths by making them too long, and that in turn would cause the method to misinterpret the meaning of those positions that had changed.
This program uses a Hidden Markov Model (HMM) method of inferring different rates of evolution at different amino acid positions. This was described in a paper by me and Gary Churchill (1996). It allows us to specify to the program that there will be a number of different possible evolutionary rates, what the prior probabilities of occurrence of each is, and what the average length of a patch of positions all having the same rate. The rates can also be chosen by the program to approximate a Gamma distribution of rates, or a Gamma distribution plus a class of invariant positions. The program computes the likelihood by summing it over all possible assignments of rates to positions, weighting each by its prior probability of occurrence.
For example, if we have used the C and A options (described below) to specify that there are three possible rates of evolution, 1.0, 2.4, and 0.0, that the prior probabilities of a position having these rates are 0.4, 0.3, and 0.3, and that the average patch length (number of consecutive positions with the same rate) is 2.0, the program will sum the likelihood over all possibilities, but giving less weight to those that (say) assign all positions to rate 2.4, or that fail to have consecutive positions that have the same rate.
The Hidden Markov Model framework for rate variation among positions was independently developed by Yang (1993, 1994, 1995). We have implemented a general scheme for a Hidden Markov Model of rates; we allow the rates and their prior probabilities to be specified arbitrarily by the user, or by a discrete approximation to a Gamma distribution of rates (Yang, 1995), or by a mixture of a Gamma distribution and a class of invariant positions.
This feature effectively removes the artificial assumption that all positions have the same rate, and also means that we need not know in advance the identities of the positions that have a particular rate of evolution.
Another layer of rate variation also is available. The user can assign categories of rates to each positions (for example, we might want amino acid positions in the active site of a protein to change more slowly than other positions. This is done with the categories input file and the C option. We then specify (using the menu) the relative rates of evolution of amino acid positions in the different categories. For example, we might specify that positions in the active site evolve at relative rates of 0.2 compared to 1.0 at other positions. If we are assuming that a particular position maintains a cysteine bridge to another, we may want to put it in a category of positions (including perhaps the initial position of the protein sequence which maintains methionine) which changes at a rate of 0.0.
If both userassigned rate categories and Hidden Markov Model rates are allowed, the program assumes that the actual rate at a position is the product of the userassigned category rate and the Hidden Markov Model regional rate. (This may not always make perfect biological sense: it would be more natural to assume some upper bound to the rate, as we have discussed in the Felsenstein and Churchill paper). Nevertheless you may want to use both types of rate variation.
% fpromlk Protein phylogeny by maximum likelihood Input (aligned) protein sequence set(s): promlk.dat Phylip tree file (optional): Phylip promlk program output file [promlk.fpromlk]: Adding species: 1. Alpha 2. Beta 3. Gamma 4. Delta 5. Epsilon Output written to file "promlk.fpromlk" Tree also written onto file "promlk.treefile" Done. 
Go to the input files for this example
Go to the output files for this example
Protein phylogeny by maximum likelihood Version: EMBOSS:6.4.0.0 Standard (Mandatory) qualifiers: [sequence] seqsetall File containing one or more sequence alignments [intreefile] tree Phylip tree file (optional) [outfile] outfile [*.fpromlk] Phylip promlk program output file Additional (Optional) qualifiers (* if not always prompted): ncategories integer [1] Number of substitution rate categories (Integer from 1 to 9) * rate array Rate for each category * categories properties File of substitution rate categories weights properties Weights file * lengths boolean [N] Use branch lengths from user trees model menu [JonesTaylorThornton] Probability model for amino acid change (Values: j (JonesTaylorThornton); h (Henikoff/Tillier PMBs); d (Dayhoff PAM)) gamma menu [n] Rate variation among sites (Values: g (Gamma distributed rates); i (Gamma+invariant sites); h (User defined HMM of rates); n (Constant rate)) * gammacoefficient float [1] Coefficient of variation of substitution rate among sites (Number 0.001 or more) * ngammacat integer [1] Number of categories (19) (Integer from 1 to 9) * invarcoefficient float [1] Coefficient of variation of substitution rate among sites (Number 0.001 or more) * ninvarcat integer [1] Number of categories (19) including one for invariant sites (Integer from 1 to 9) * invarfrac float [0.0] Fraction of invariant sites (Number from 0.000 to 1.000) * nhmmcategories integer [1] Number of HMM rate categories (Integer from 1 to 9) * hmmrates array [1.0] HMM category rates * hmmprobabilities array [1.0] Probability for each HMM category * adjsite boolean [N] Rates at adjacent sites correlated * lambda float [1.0] Mean block length of sites having the same rate (Number 1.000 or more) * njumble integer [0] Number of times to randomise (Integer 0 or more) * seed integer [1] Random number seed between 1 and 32767 (must be odd) (Integer from 1 to 32767) * global boolean [N] Global rearrangements outgrno integer [0] Species number to use as outgroup (Integer 0 or more) [no]trout toggle [Y] Write out trees to tree file * outtreefile outfile [*.fpromlk] Phylip tree output file (optional) printdata boolean [N] Print data at start of run [no]progress boolean [Y] Print indications of progress of run [no]treeprint boolean [Y] Print out tree hypstate boolean [N] Reconstruct hypothetical sequence Advanced (Unprompted) qualifiers: (none) Associated qualifiers: "sequence" associated qualifiers sbegin1 integer Start of each sequence to be used send1 integer End of each sequence to be used sreverse1 boolean Reverse (if DNA) sask1 boolean Ask for begin/end/reverse snucleotide1 boolean Sequence is nucleotide sprotein1 boolean Sequence is protein slower1 boolean Make lower case supper1 boolean Make upper case sformat1 string Input sequence format sdbname1 string Database name sid1 string Entryname ufo1 string UFO features fformat1 string Features format fopenfile1 string Features file name "outfile" associated qualifiers odirectory3 string Output directory "outtreefile" associated qualifiers odirectory string Output directory General qualifiers: auto boolean Turn off prompts stdout boolean Write first file to standard output filter boolean Read first file from standard input, write first file to standard output options boolean Prompt for standard and additional values debug boolean Write debug output to program.dbg verbose boolean Report some/full command line options help boolean Report command line options and exit. More information on associated and general qualifiers can be found with help verbose warning boolean Report warnings error boolean Report errors fatal boolean Report fatal errors die boolean Report dying program messages version boolean Report version number and exit 
Qualifier  Type  Description  Allowed values  Default  

Standard (Mandatory) qualifiers  
[sequence] (Parameter 1) 
seqsetall  File containing one or more sequence alignments  Readable sets of sequences  Required  
[intreefile] (Parameter 2) 
tree  Phylip tree file (optional)  Phylogenetic tree  
[outfile] (Parameter 3) 
outfile  Phylip promlk program output file  Output file  <*>.fpromlk  
Additional (Optional) qualifiers  
ncategories  integer  Number of substitution rate categories  Integer from 1 to 9  1  
rate  array  Rate for each category  List of floating point numbers  
categories  properties  File of substitution rate categories  Property value(s)  
weights  properties  Weights file  Property value(s)  
lengths  boolean  Use branch lengths from user trees  Boolean value Yes/No  No  
model  list  Probability model for amino acid change 

JonesTaylorThornton  
gamma  list  Rate variation among sites 

n  
gammacoefficient  float  Coefficient of variation of substitution rate among sites  Number 0.001 or more  1  
ngammacat  integer  Number of categories (19)  Integer from 1 to 9  1  
invarcoefficient  float  Coefficient of variation of substitution rate among sites  Number 0.001 or more  1  
ninvarcat  integer  Number of categories (19) including one for invariant sites  Integer from 1 to 9  1  
invarfrac  float  Fraction of invariant sites  Number from 0.000 to 1.000  0.0  
nhmmcategories  integer  Number of HMM rate categories  Integer from 1 to 9  1  
hmmrates  array  HMM category rates  List of floating point numbers  1.0  
hmmprobabilities  array  Probability for each HMM category  List of floating point numbers  1.0  
adjsite  boolean  Rates at adjacent sites correlated  Boolean value Yes/No  No  
lambda  float  Mean block length of sites having the same rate  Number 1.000 or more  1.0  
njumble  integer  Number of times to randomise  Integer 0 or more  0  
seed  integer  Random number seed between 1 and 32767 (must be odd)  Integer from 1 to 32767  1  
global  boolean  Global rearrangements  Boolean value Yes/No  No  
outgrno  integer  Species number to use as outgroup  Integer 0 or more  0  
[no]trout  toggle  Write out trees to tree file  Toggle value Yes/No  Yes  
outtreefile  outfile  Phylip tree output file (optional)  Output file  <*>.fpromlk  
printdata  boolean  Print data at start of run  Boolean value Yes/No  No  
[no]progress  boolean  Print indications of progress of run  Boolean value Yes/No  Yes  
[no]treeprint  boolean  Print out tree  Boolean value Yes/No  Yes  
hypstate  boolean  Reconstruct hypothetical sequence  Boolean value Yes/No  No  
Advanced (Unprompted) qualifiers  
(none)  
Associated qualifiers  
"sequence" associated seqsetall qualifiers  
sbegin1 sbegin_sequence 
integer  Start of each sequence to be used  Any integer value  0  
send1 send_sequence 
integer  End of each sequence to be used  Any integer value  0  
sreverse1 sreverse_sequence 
boolean  Reverse (if DNA)  Boolean value Yes/No  N  
sask1 sask_sequence 
boolean  Ask for begin/end/reverse  Boolean value Yes/No  N  
snucleotide1 snucleotide_sequence 
boolean  Sequence is nucleotide  Boolean value Yes/No  N  
sprotein1 sprotein_sequence 
boolean  Sequence is protein  Boolean value Yes/No  N  
slower1 slower_sequence 
boolean  Make lower case  Boolean value Yes/No  N  
supper1 supper_sequence 
boolean  Make upper case  Boolean value Yes/No  N  
sformat1 sformat_sequence 
string  Input sequence format  Any string  
sdbname1 sdbname_sequence 
string  Database name  Any string  
sid1 sid_sequence 
string  Entryname  Any string  
ufo1 ufo_sequence 
string  UFO features  Any string  
fformat1 fformat_sequence 
string  Features format  Any string  
fopenfile1 fopenfile_sequence 
string  Features file name  Any string  
"outfile" associated outfile qualifiers  
odirectory3 odirectory_outfile 
string  Output directory  Any string  
"outtreefile" associated outfile qualifiers  
odirectory  string  Output directory  Any string  
General qualifiers  
auto  boolean  Turn off prompts  Boolean value Yes/No  N  
stdout  boolean  Write first file to standard output  Boolean value Yes/No  N  
filter  boolean  Read first file from standard input, write first file to standard output  Boolean value Yes/No  N  
options  boolean  Prompt for standard and additional values  Boolean value Yes/No  N  
debug  boolean  Write debug output to program.dbg  Boolean value Yes/No  N  
verbose  boolean  Report some/full command line options  Boolean value Yes/No  Y  
help  boolean  Report command line options and exit. More information on associated and general qualifiers can be found with help verbose  Boolean value Yes/No  N  
warning  boolean  Report warnings  Boolean value Yes/No  Y  
error  boolean  Report errors  Boolean value Yes/No  Y  
fatal  boolean  Report fatal errors  Boolean value Yes/No  Y  
die  boolean  Report dying program messages  Boolean value Yes/No  Y  
version  boolean  Report version number and exit  Boolean value Yes/No  N 
5 13 Alpha AACGTGGCCAAAT Beta AAGGTCGCCAAAC Gamma CATTTCGTCACAA Delta GGTATTTCGGCCT Epsilon GGGATCTCGGCCC 
If the R (HMM rates) option is used a table of the relative rates of expected substitution at each category of positions is printed, as well as the probabilities of each of those rates.
There then follow the data sequences, if the user has selected the menu option to print them out, with the base sequences printed in groups of ten amino acids. The trees found are printed as a rooted tree topology. The internal nodes are numbered arbitrarily for the sake of identification. The number of trees evaluated so far and the log likelihood of the tree are also given. The branch lengths in the diagram are roughly proportional to the estimated branch lengths, except that very short branches are printed out at least three characters in length so that the connections can be seen. The unit of branch length is the expected fraction of amino acids changed (so that 1.0 is 100 PAMs).
A table is printed showing the length of each tree segment, and the time (in units of expected amino acid substitutions per position) of each fork in the tree, measured from the root of the tree. I have not attempted in include code for approximate confidence limits on branch points, as I have done for branch lengths in PROML, both because of the extreme crudeness of that test, and because the variation of times for different forks would be highly correlated.
The log likelihood printed out with the final tree can be used to perform various likelihood ratio tests. One can, for example, compare runs with different values of the relative rate of change in the active site and in the rest of the protein to determine which value is the maximum likelihood estimate, and what is the allowable range of values (using a likelihood ratio test, which you will find described in mathematical statistics books). One could also estimate the base frequencies in the same way. Both of these, particularly the latter, require multiple runs of the program to evaluate different possibl values, and this might get expensive.
This program makes possible a (reasonably) legitimate statistical test of the molecular clock. To do such a test, run PROML and PROMLK on the same data. If the trees obtained are of the same topology (when considered as unrooted), it is legitimate to compare their likelihoods by the likelihood ratio test. In PROML the likelihood has been computed by estimating 2n3 branch lengths, if their are n tips on the tree. In PROMLK it has been computed by estimating n1 branching times (in effect, n1 branch lengths). The difference in the number of parameters is (2n3)(n1) = n2. To perform the test take the difference in log likelihoods between the two runs (PROML should be the higher of the two, barring numerical iteration difficulties) and double it. Look this up on a chisquare distribution with n2 degrees of freedom. If the result is significant, the log likelihood has been significantly increased by allowing all 2n3 branch lengths to be estimated instead of just n1, and molecular clock may be rejected.
If the U (User Tree) option is used and more than one tree is supplied, and the program is not told to assume autocorrelation between the rates at different amino acid positions, the program also performs a statistical test of each of these trees against the one with highest likelihood. If there are two user trees, the test done is one which is due to Kishino and Hasegawa (1989), a version of a test originally introduced by Templeton (1983). In this implementation it uses the mean and variance of loglikelihood differences between trees, taken across amino acid positions. If the two trees' means are more than 1.96 standard deviations different then the trees are declared significantly different. This use of the empirical variance of loglikelihood differences is more robust and nonparametric than the classical likelihood ratio test, and may to some extent compensate for the any lack of realism in the model underlying this program.
If there are more than two trees, the test done is an extension of the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out that a correction for the number of trees was necessary, and they introduced a resampling method to make this correction. In the version used here the variances and covariances of the sum of log likelihoods across amino acid positions are computed for all pairs of trees. To test whether the difference between each tree and the best one is larger than could have been expected if they all had the same expected loglikelihood, loglikelihoods for all trees are sampled with these covariances and equal means (Shimodaira and Hasegawa's "least favorable hypothesis"), and a P value is computed from the fraction of times the difference between the tree's value and the highest loglikelihood exceeds that actually observed. Note that this sampling needs random numbers, and so the program will prompt the user for a random number seed if one has not already been supplied. With the twotree KHT test no random numbers are used.
In either the KHT or the SH test the program prints out a table of the loglikelihoods of each tree, the differences of each from the highest one, the variance of that quantity as determined by the loglikelihood differences at individual sites, and a conclusion as to whether that tree is or is not significantly worse than the best one. However the test is not available if we assume that there is autocorrelation of rates at neighboring positions (option A) and is not done in those cases.
The branch lengths printed out are scaled in terms of 100 times the expected numbers of amino acid substitutions, scaled so that the average rate of change, averaged over all the positions analyzed, is set to 100.0, if there are multiple categories of positions. This means that whether or not there are multiple categories of positions, the expected percentage of change for very small branches is equal to the branch length. Of course, when a branch is twice as long this does not mean that there will be twice as much net change expected along it, since some of the changes occur in the same position and overlie or even reverse each other. underlying numbers of changes. That means that a branch of length 26 is 26 times as long as one which would show a 1% difference between the amino acid sequences at the beginning and end of the branch, but we would not expect the sequences at the beginning and end of the branch to be 26% different, as there would be some overlaying of changes.
Because of limitations of the numerical algorithm, branch length estimates of zero will often print out as small numbers such as 0.00001. If you see a branch length that small, it is really estimated to be of zero length.
Another possible source of confusion is the existence of negative values for the log likelihood. This is not really a problem; the log likelihood is not a probability but the logarithm of a probability. When it is negative it simply means that the corresponding probability is less than one (since we are seeing its logarithm). The log likelihood is maximized by being made more positive: 30.23 is worse than 29.14.
At the end of the output, if the R option is in effect with multiple HMM rates, the program will print a list of what amino acid position categories contributed the most to the final likelihood. This combination of HMM rate categories need not have contributed a majority of the likelihood, just a plurality. Still, it will be helpful as a view of where the program infers that the higher and lower rates are. Note that the use in this calculations of the prior probabilities of different rates, and the average patch length, gives this inference a "smoothed" appearance: some other combination of rates might make a greater contribution to the likelihood, but be discounted because it conflicts with this prior information. See the example output below to see what this printout of rate categories looks like. A second list will also be printed out, showing for each position which rate accounted for the highest fraction of the likelihood. If the fraction of the likelihood accounted for is less than 95%, a dot is printed instead.
Option 3 in the menu controls whether the tree is printed out into the output file. This is on by default, and usually you will want to leave it this way. However for runs with multiple data sets such as bootstrapping runs, you will primarily be interested in the trees which are written onto the output tree file, rather than the trees printed on the output file. To keep the output file from becoming too large, it may be wisest to use option 3 to prevent trees being printed onto the output file.
Option 4 in the menu controls whether the tree estimated by the program is written onto a tree file. The default name of this output tree file is "outtree". If the U option is in effect, all the userdefined trees are written to the output tree file.
Option 5 in the menu controls whether ancestral states are estimated at each node in the tree. If it is in effect, a table of ancestral sequences is printed out (including the sequences in the tip species which are the input sequences). The symbol printed out is for the amino acid which accounts for the largest fraction of the likelihood at that position. In that table, if a position has an amino acid which accounts for more than 95% of the likelihood, its symbol printed in capital letters (W rather than w). One limitation of the current version of the program is that when there are multiple HMM rates (option R) the reconstructed amino acids are based on only the single assignment of rates to positions which accounts for the largest amount of the likelihood. Thus the assessment of 95% of the likelihood, in tabulating the ancestral states, refers to 95% of the likelihood that is accounted for by that particular combination of rates.
Amino acid sequence Maximum Likelihood method with molecular clock, version 3.69 JonesTaylorThornton model of amino acid change +Epsilon +4 ! +Delta 3 ! +Gamma +2 ! +Beta +1 +Alpha Ln Likelihood = 134.70332 Ancestor Node Node Height Length      root 3 3 4 0.66464 0.66464 4 Epsilon 0.85971 0.19507 4 Delta 0.85971 0.19507 3 2 0.37420 0.37420 2 Gamma 0.85971 0.48551 2 1 0.70208 0.32788 1 Beta 0.85971 0.15763 1 Alpha 0.85971 0.15763 
((Epsilon:0.19507,Delta:0.19507):0.66464,(Gamma:0.48551, (Beta:0.15763,Alpha:0.15763):0.32788):0.37420); 
Program name  Description 

distmat  Create a distance matrix from a multiple sequence alignment 
ednacomp  DNA compatibility algorithm 
ednadist  Nucleic acid sequence Distance Matrix program 
ednainvar  Nucleic acid sequence Invariants method 
ednaml  Phylogenies from nucleic acid Maximum Likelihood 
ednamlk  Phylogenies from nucleic acid Maximum Likelihood with clock 
ednapars  DNA parsimony algorithm 
ednapenny  Penny algorithm for DNA 
eprotdist  Protein distance algorithm 
eprotpars  Protein parsimony algorithm 
erestml  Restriction site Maximum Likelihood method 
eseqboot  Bootstrapped sequences algorithm 
fdiscboot  Bootstrapped discrete sites algorithm 
fdnacomp  DNA compatibility algorithm 
fdnadist  Nucleic acid sequence Distance Matrix program 
fdnainvar  Nucleic acid sequence Invariants method 
fdnaml  Estimates nucleotide phylogeny by maximum likelihood 
fdnamlk  Estimates nucleotide phylogeny by maximum likelihood 
fdnamove  Interactive DNA parsimony 
fdnapars  DNA parsimony algorithm 
fdnapenny  Penny algorithm for DNA 
fdolmove  Interactive Dollo or Polymorphism Parsimony 
ffreqboot  Bootstrapped genetic frequencies algorithm 
fproml  Protein phylogeny by maximum likelihood 
fprotdist  Protein distance algorithm 
fprotpars  Protein parsimony algorithm 
frestboot  Bootstrapped restriction sites algorithm 
frestdist  Distance matrix from restriction sites or fragments 
frestml  Restriction site maximum Likelihood method 
fseqboot  Bootstrapped sequences algorithm 
fseqbootall  Bootstrapped sequences algorithm 
Please report all bugs to the EMBOSS bug team (embossbug © emboss.openbio.org) not to the original author.
Converted (August 2004) to an EMBASSY program by the EMBOSS team.