Please help by correcting and extending the Wiki pages.
The default (gene frequency) mode treats the input as gene frequencies at a series of loci, and square-root-transforms the allele frequencies (constructing the frequency of the missing allele at each locus first). This enables us to use the Brownian motion model on the resulting coordinates, in an approximation equivalent to using Cavalli-Sforza and Edwards's (1967) chord measure of genetic distance and taking that to give distance between particles undergoing pure Brownian motion. It assumes that each locus evolves independently by pure genetic drift.
The alternative continuous characters mode (menu option C) treats the input as a series of coordinates of each species in N dimensions. It assumes that we have transformed the characters to remove correlations and to standardize their variances.
Many current users of CONTML use it to analyze microsatellite data. There are three ways to do this:
Those distance measures allow for mutation during the divergence of the populations. But even they are not perfect -- they do not allow us to use all the information contained in the gene frequency differences of within a copy number allele. There is a need for a more complete statistical treatment of inference of phylogenies from microsatellite models, ones that take both mutation and genetic drift fully into account.
0.10 X 18 + 0.24 X 19 + 0.60 X 20 + 0.06 X 21 = 19.62
copies. These values can, I believe, be calculated by a spreadsheet program. Each microsatellite is represented by one character, and the continuous character mode of CONTML is used (not the gene frequencies mode). This coding allows for mutation that changes copy number. It does not make complete use of all data, but neither does the treatment of microsatellite gene frequencies as changing only genetic drift. frequency
% fcontml -printdata Gene frequency and continuous character Maximum Likelihood Input file: contml.dat Phylip tree file (optional): Phylip contml program output file [contml.fcontml]: Adding species: 1. European 2. African 3. Chinese 4. American 5. Australian Output written to file "contml.fcontml" Tree also written onto file "contml.treefile" Done.
Go to the input files for this example
Go to the output files for this example
Gene frequency and continuous character Maximum Likelihood Version: EMBOSS:188.8.131.52 Standard (Mandatory) qualifiers: [-infile] frequencies File containing one or more sets of data [-intreefile] tree Phylip tree file (optional) [-outfile] outfile [*.fcontml] Phylip contml program output file Additional (Optional) qualifiers (* if not always prompted): -datatype menu [g] Input type in infile (Values: g (Gene frequencies); i (Continuous characters)) * -lengths boolean [N] Use branch lengths from user trees * -njumble integer  Number of times to randomise (Integer 0 or more) * -seed integer  Random number seed between 1 and 32767 (must be odd) (Integer from 1 to 32767) * -global boolean [N] Global rearrangements -outgrno integer  Species number to use as outgroup (Integer 0 or more) -[no]trout toggle [Y] Write out trees to tree file * -outtreefile outfile [*.fcontml] Phylip tree output file (optional) -printdata boolean [N] Print data at start of run -[no]progress boolean [Y] Print indications of progress of run -[no]treeprint boolean [Y] Print out tree Advanced (Unprompted) qualifiers: (none) Associated qualifiers: "-outfile" associated qualifiers -odirectory3 string Output directory "-outtreefile" associated qualifiers -odirectory string Output directory General qualifiers: -auto boolean Turn off prompts -stdout boolean Write first file to standard output -filter boolean Read first file from standard input, write first file to standard output -options boolean Prompt for standard and additional values -debug boolean Write debug output to program.dbg -verbose boolean Report some/full command line options -help boolean Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose -warning boolean Report warnings -error boolean Report errors -fatal boolean Report fatal errors -die boolean Report dying program messages -version boolean Report version number and exit
|Standard (Mandatory) qualifiers|
|frequencies||File containing one or more sets of data||Frequency value(s)|
|tree||Phylip tree file (optional)||Phylogenetic tree|
|outfile||Phylip contml program output file||Output file||<*>.fcontml|
|Additional (Optional) qualifiers|
|-datatype||list||Input type in infile||
|-lengths||boolean||Use branch lengths from user trees||Boolean value Yes/No||No|
|-njumble||integer||Number of times to randomise||Integer 0 or more||0|
|-seed||integer||Random number seed between 1 and 32767 (must be odd)||Integer from 1 to 32767||1|
|-global||boolean||Global rearrangements||Boolean value Yes/No||No|
|-outgrno||integer||Species number to use as outgroup||Integer 0 or more||0|
|-[no]trout||toggle||Write out trees to tree file||Toggle value Yes/No||Yes|
|-outtreefile||outfile||Phylip tree output file (optional)||Output file||<*>.fcontml|
|-printdata||boolean||Print data at start of run||Boolean value Yes/No||No|
|-[no]progress||boolean||Print indications of progress of run||Boolean value Yes/No||Yes|
|-[no]treeprint||boolean||Print out tree||Boolean value Yes/No||Yes|
|Advanced (Unprompted) qualifiers|
|"-outfile" associated outfile qualifiers|
|string||Output directory||Any string|
|"-outtreefile" associated outfile qualifiers|
|-odirectory||string||Output directory||Any string|
|-auto||boolean||Turn off prompts||Boolean value Yes/No||N|
|-stdout||boolean||Write first file to standard output||Boolean value Yes/No||N|
|-filter||boolean||Read first file from standard input, write first file to standard output||Boolean value Yes/No||N|
|-options||boolean||Prompt for standard and additional values||Boolean value Yes/No||N|
|-debug||boolean||Write debug output to program.dbg||Boolean value Yes/No||N|
|-verbose||boolean||Report some/full command line options||Boolean value Yes/No||Y|
|-help||boolean||Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose||Boolean value Yes/No||N|
|-warning||boolean||Report warnings||Boolean value Yes/No||Y|
|-error||boolean||Report errors||Boolean value Yes/No||Y|
|-fatal||boolean||Report fatal errors||Boolean value Yes/No||Y|
|-die||boolean||Report dying program messages||Boolean value Yes/No||Y|
|-version||boolean||Report version number and exit||Boolean value Yes/No||N|
When the gene frequencies data are used in CONTML or GENDIST, this involves the following assumptions:
How these assumptions affect the methods will be seen in my papers on inference of phylogenies from gene frequency and continuous character data (Felsenstein, 1973b, 1981c, 1985c).
The input formats are fairly similar to the discrete-character programs, but with one difference. When CONTML is used in the gene-frequency mode (its usual, default mode), or when GENDIST is used, the first line contains the number of species (or populations) and the number of loci and the options information. There then follows a line which gives the numbers of alleles at each locus, in order. This must be the full number of alleles, not the number of alleles which will be input: i. e. for a two-allele locus the number should be 2, not 1. There then follow the species (population) data, each species beginning on a new line. The first 10 characters are taken as the name, and thereafter the values of the individual characters are read free-format, preceded and separated by blanks. They can go to a new line if desired, though of course not in the middle of a number. Missing data is not allowed - an important limitation. In the default configuration, for each locus, the numbers should be the frequencies of all but one allele. The menu option A (All) signals that the frequencies of all alleles are provided in the input data -- the program will then automatically ignore the last of them. So without the A option, for a three-allele locus there should be two numbers, the frequencies of two of the alleles (and of course it must always be the same two!). Here is a typical data set without the A option:
5 3 2 3 2 Alpha 0.90 0.80 0.10 0.56 Beta 0.72 0.54 0.30 0.20 Gamma 0.38 0.10 0.05 0.98 Delta 0.42 0.40 0.43 0.97 Epsilon 0.10 0.30 0.70 0.62
whereas here is what it would have to look like if the A option were invoked:
5 3 2 3 2 Alpha 0.90 0.10 0.80 0.10 0.10 0.56 0.44 Beta 0.72 0.28 0.54 0.30 0.16 0.20 0.80 Gamma 0.38 0.62 0.10 0.05 0.85 0.98 0.02 Delta 0.42 0.58 0.40 0.43 0.17 0.97 0.03 Epsilon 0.10 0.90 0.30 0.70 0.00 0.62 0.38
The first line has the number of species (or populations) and the number of loci. The second line has the number of alleles for each of the 3 loci. The species lines have names (filled out to 10 characters with blanks) followed by the gene frequencies of the 2 alleles for the first locus, the 3 alleles for the second locus, and the 2 alleles for the third locus. You can start a new line after any of these allele frequencies, and continue to give the frequencies on that line (without repeating the species name).
If all alleles of a locus are given, it is important to have them add up to 1. Roundoff of the frequencies may cause the program to conclude that the numbers do not sum to 1, and stop with an error message.
While many compilers may be more tolerant, it is probably wise to make sure that each number, including the first, is preceded by a blank, and that there are digits both preceding and following any decimal points.
CONTML and CONTRAST also treat quantitative characters (the continuous-characters mode in CONTML, which is option C). It is assumed that each character is evolving according to a Brownian motion model, at the same rate, and independently. In reality it is almost always impossible to guarantee this. The issue is discussed at length in my review article in Annual Review of Ecology and Systematics (Felsenstein, 1988a), where I point out the difficulty of transforming the characters so that they are not only genetically independent but have independent selection acting on them. If you are going to use CONTML to model evolution of continuous characters, then you should at least make some attempt to remove genetic correlations between the characters (usually all one can do is remove phenotypic correlations by transforming the characters so that there is no within-population covariance and so that the within-population variances of the characters are equal -- this is equivalent to using Canonical Variates). However, this will only guarantee that one has removed phenotypic covariances between characters. Genetic covariances could only be removed by knowing the coheritabilities of the characters, which would require genetic experiments, and selective covariances (covariances due to covariation of selection pressures) would require knowledge of the sources and extent of selection pressure in all variables.
CONTRAST is a program designed to infer, for a given phylogeny that is provided to the program, the covariation between characters in a data set. Thus we have a program in this set that allow us to take information about the covariation and rates of evolution of characters and make an estimate of the phylogeny (CONTML), and a program that takes an estimate of the phylogeny and infers the variances and covariances of the character changes. But we have no program that infers both the phylogenies and the character covariation from the same data set.
In the quantitative characters mode, a typical small data set would be:
5 6 Alpha 0.345 0.467 1.213 2.2 -1.2 1.0 Beta 0.457 0.444 1.1 1.987 -0.2 2.678 Gamma 0.6 0.12 0.97 2.3 -0.11 1.54 Delta 0.68 0.203 0.888 2.0 1.67 Epsilon 0.297 0.22 0.90 1.9 1.74
Note that in the latter case, there is no line giving the numbers of alleles at each locus. In this latter case no square-root transformation of the coordinates is done: each is assumed to give directly the position on the Brownian motion scale.
For further discussion of options and modifiable constants in CONTML, GENDIST, and CONTRAST see the documentation files for those programs.
5 10 2 2 2 2 2 2 2 2 2 2 European 0.2868 0.5684 0.4422 0.4286 0.3828 0.7285 0.6386 0.0205 0.8055 0.5043 African 0.1356 0.4840 0.0602 0.0397 0.5977 0.9675 0.9511 0.0600 0.7582 0.6207 Chinese 0.1628 0.5958 0.7298 1.0000 0.3811 0.7986 0.7782 0.0726 0.7482 0.7334 American 0.0144 0.6990 0.3280 0.7421 0.6606 0.8603 0.7924 0.0000 0.8086 0.8636 Australian 0.1211 0.2274 0.5821 1.0000 0.2018 0.9000 0.9837 0.0396 0.9097 0.2976
The units of length are amounts of expected accumulated variance (not time). The log likelihood (natural log) of each tree is also given, and it is indicated how many topologies have been tried. The tree does not necessarily have all tips contemporary, and the log likelihood may be either positive or negative (this simply corresponds to whether the density function does or does not exceed 1) and a negative log likelihood does not indicate any error. The log likelihood allows various formal likelihood ratio hypothesis tests. The description of the tree includes approximate standard errors on the lengths of segments of the tree. These are calculated by considering only the curvature of the likelihood surface as the length of the segment is varied, holding all other lengths constant. As such they are most probably underestimates of the variance, and hence may give too much confidence in the given tree.
One should use caution in interpreting the likelihoods that are printed out. If the model is wrong, it will not be possible to use the likelihoods to make formal statistical statements. Thus, if gene frequencies are being analyzed, but the gene frequencies change not only by genetic drift, but also by mutation, the model is not correct. It would be as well-justified in this case to use GENDIST to compute the Nei (1972) genetic distance and then use FITCH, KITSCH or NEIGHBOR to make a tree. If continuous characters are being analyzed, but if the characters have not been transformed to new coordinates that evolve independently and at equal rates, then the model is also violated and no statistical analysis is possible. Doing such a transformation is not easy, and usually not even possible.
If the U (User Tree) option is used and more than one tree is supplied, the program also performs a statistical test of each of these trees against the one with highest likelihood. If there are two user trees, the test done is one which is due to Kishino and Hasegawa (1989), a version of a test originally introduced by Templeton (1983). In this implementation it uses the mean and variance of log-likelihood differences between trees, taken across loci. If the two trees means are more than 1.96 standard deviations different then the trees are declared significantly different. This use of the empirical variance of log-likelihood differences is more robust and nonparametric than the classical likelihood ratio test, and may to some extent compensate for the any lack of realism in the model underlying this program.
If there are more than two trees, the test done is an extension of the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out that a correction for the number of trees was necessary, and they introduced a resampling method to make this correction. The version used here is a multivariate normal approximation to their test; it is due to Shimodaira (1998). The variances and covariances of the sum of log likelihoods across loci are computed for all pairs of trees. To test whether the difference between each tree and the best one is larger than could have been expected if they all had the same expected log-likelihood, log-likelihoods for all trees are sampled with these covariances and equal means (Shimodaira and Hasegawa's "least favorable hypothesis"), and a P value is computed from the fraction of times the difference between the tree's value and the highest log-likelihood exceeds that actually observed. Note that this sampling needs random numbers, and so the program will prompt the user for a random number seed if one has not already been supplied. With the two-tree KHT test no random numbers are used.
In either the KHT or the SH test the program prints out a table of the log-likelihoods of each tree, the differences of each from the highest one, the variance of that quantity as determined by the log-likelihood differences at individual sites, and a conclusion as to whether that tree is or is not significantly worse than the best one.
One problem which sometimes arises is that the program is fed two species (or populations) with identical transformed gene frequencies: this can happen if sample sizes are small and/or many loci are monomorphic. In this case the program "gets its knickers in a twist" and can divide by zero, usually causing a crash. If you suspect that this has happened, check for two species with identical coordinates. If you find them, eliminate one from the problem: the two must always show up as being at the same point on the tree anyway.
Continuous character Maximum Likelihood method version 3.69 5 Populations, 10 Loci Numbers of alleles at the loci: ------- -- ------- -- --- ----- 2 2 2 2 2 2 2 2 2 2 Name Gene Frequencies ---- ---- ----------- locus: 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 European 0.28680 0.71320 0.56840 0.43160 0.44220 0.55780 0.42860 0.57140 0.38280 0.61720 0.72850 0.27150 0.63860 0.36140 0.02050 0.97950 0.80550 0.19450 0.50430 0.49570 African 0.13560 0.86440 0.48400 0.51600 0.06020 0.93980 0.03970 0.96030 0.59770 0.40230 0.96750 0.03250 0.95110 0.04890 0.06000 0.94000 0.75820 0.24180 0.62070 0.37930 Chinese 0.16280 0.83720 0.59580 0.40420 0.72980 0.27020 1.00000 0.00000 0.38110 0.61890 0.79860 0.20140 0.77820 0.22180 0.07260 0.92740 0.74820 0.25180 0.73340 0.26660 American 0.01440 0.98560 0.69900 0.30100 0.32800 0.67200 0.74210 0.25790 0.66060 0.33940 0.86030 0.13970 0.79240 0.20760 0.00000 1.00000 0.80860 0.19140 0.86360 0.13640 Australian 0.12110 0.87890 0.22740 0.77260 0.58210 0.41790 1.00000 0.00000 0.20180 0.79820 0.90000 0.10000 0.98370 0.01630 0.03960 0.96040 0.90970 0.09030 0.29760 0.70240 +-----------------------------------------------------------African ! ! +-------------------------------Australian 1-------------3 ! ! +-----------------------American ! +-----2 ! +Chinese ! +European remember: this is an unrooted tree! Ln Likelihood = 38.71914 Between And Length Approx. Confidence Limits ------- --- ------ ------- ---------- ------ 1 African 0.09693444 ( 0.03123910, 0.19853604) 1 3 0.02252816 ( 0.00089799, 0.05598045) 3 Australian 0.05247405 ( 0.01177094, 0.11542374) 3 2 0.00945315 ( -0.00897717, 0.03795670) 2 American 0.03806240 ( 0.01095938, 0.07997877) 2 Chinese 0.00208822 ( -0.00960622, 0.02017433) 1 European 0.00000000 ( -0.01627246, 0.02516630)
|egendist||Genetic Distance Matrix program|
|fgendist||Compute genetic distances from gene frequencies|
Please report all bugs to the EMBOSS bug team (emboss-bug © emboss.open-bio.org) not to the original author.
Converted (August 2004) to an EMBASSY program by the EMBOSS team.