Phylogenetic Reconstruction from Molecular Data Using Maximum Parsimony and Maximum Likelihood on iNquiry
Phylogenetic trees represent the inferred relationships between species or individuals, here reconstructed from molecular data. Two commonly used methods for reconstructing these relationships are maximum likelihood (ML) and maximum parsimony (MP). Maximum likelihood evaluates a hypothesis about evolutionary history in terms of the probability that the proposed model and the hypothesized history would give rise to the observed data set; the topology with the highest likelihood is chosen. Maximum parsimony infers a phylogenetic tree by minimizing the total number of evolutionary steps required to explain a given set of data.
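To make "minimizing the total number of evolutionary steps" concrete, the following minimal sketch counts the steps a given tree requires, using Fitch's small-parsimony algorithm on a rooted binary tree. This is an illustration only: it is not necessarily how the programs described in this document are implemented, and the taxon names, sequences and function names are hypothetical.

```python
def fitch_steps(tree, states):
    """Minimum number of state changes for one site (Fitch's algorithm).

    tree:   nested 2-tuples of leaf names, e.g. (("A", "B"), ("C", "D"))
    states: dict mapping each leaf name to its character state at this site
    """
    def visit(node):
        if isinstance(node, str):            # leaf: its own state, no steps
            return {states[node]}, 0
        left, right = node
        left_set, left_steps = visit(left)
        right_set, right_steps = visit(right)
        shared = left_set & right_set
        if shared:                           # children agree: no extra step
            return shared, left_steps + right_steps
        # children disagree: take the union and count one step
        return left_set | right_set, left_steps + right_steps + 1

    return visit(tree)[1]


def tree_length(tree, alignment):
    """Sum the per-site step counts over all columns of an alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_steps(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )


# Hypothetical four-taxon alignment (not data from this document).
alignment = {"A": "AAG", "B": "AAG", "C": "GAT", "D": "GAT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

print(tree_length(tree_1, alignment))  # 2
print(tree_length(tree_2, alignment))  # 4
```

Under parsimony, tree_1 would be preferred over tree_2 for this toy data set because it explains the same sequences with fewer steps.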
Maximum Likelihood
Maximum likelihood methods have several advantages over other methods: they may have lower variance (they are the least affected by sampling error), they tend to be robust to violations of the assumptions of the evolutionary model, they are statistically well founded, they can statistically evaluate different tree topologies, and they use all of the sequence information. There are also disadvantages: they are very computationally intensive (slow), and the result depends on the model of evolution (Opperdoes, 1997a).
There are four maximum likelihood programs available on iNquiry.
| Program | Data |
|---|---|
| Tree-Puzzle | DNA or Protein sequence |
| PAML (CODEML) | DNA or Protein sequence |
| DNAML (PHYLIP) | DNA sequence |
| fastDNAml | DNA sequence |
Tree-Puzzle
Tree-Puzzle is a program for maximum likelihood analysis of DNA or protein sequence data. It implements a fast tree search algorithm, quartet puzzling, that allows analysis of large data sets and automatically assigns estimates of branch support to each internal branch. It also computes pairwise maximum likelihood distances as well as branch lengths for user-specified trees.
Input: Sequence input is requested as an alignment file in PHYLIP interleaved format. The user must choose the type of sequence, DNA or protein.
Options: The user may choose the model of substitution to be applied; HKY (Hasegawa et al. 1985) is the default for DNA and Dayhoff (Dayhoff et al. 1978) is the default for protein sequence. The user may input the transition/transversion ratio and nucleotide frequencies; if these are left blank, the program estimates them from the data set. There are options for the model of rate heterogeneity; the default is a uniform rate. The last two options are for a user-specified tree and the output. In the output options the user may designate a sequence as the outgroup by giving its position in the alignment file (for example, the first sequence is 1 and the fourth sequence is 4).
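As a rough illustration of what the nucleotide frequencies and the transition/transversion ratio describe, the sketch below computes observed base frequencies and counts transitions and transversions between two aligned sequences. This is a naive count for intuition only; it is an assumption-laden simplification and not the estimation procedure Tree-Puzzle uses. The sequences and function names are hypothetical.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def base_frequencies(seqs):
    """Observed frequency of A, C, G and T over all sequences (gaps ignored)."""
    counts = Counter(b for s in seqs for b in s.upper() if b in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}


def ts_tv_counts(seq1, seq2):
    """Count transitions and transversions between two aligned sequences."""
    ts = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue                      # identical site, gap or ambiguity
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ts += 1                       # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1                       # purine<->pyrimidine
    return ts, tv


# Hypothetical aligned sequences (not data from this document).
seqs = ["ACGTACGT", "ACGTGCGA"]
freqs = base_frequencies(seqs)
ts, tv = ts_tv_counts(seqs[0], seqs[1])
print(freqs)
print(ts, tv)  # 1 1
```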
Output: With the default options, Tree-Puzzle gives a summary of the sequence data input, the maximum likelihood distances, a quartet puzzling tree, and any other trees that occurred in more than 5% of the 1000 (default) puzzling steps.
MAXIMUM LIKELIHOOD BRANCH LENGTHS ON QUARTET PUZZLING TREE (NO CLOCK)
Branch lengths are computed using the selected model of
substitution and rate heterogeneity.
           :----3 AF157877
      :----6
      :    :----------4 AF157953
:-----7
:     :----------------------5 GVO389531
:
:---2 AF157941
:
:--1 AF157928
           branch  length     S.E.   branch  length     S.E.
AF157928        1  0.01919  0.00460       6  0.02588  0.00694
AF157941        2  0.02246  0.00491       7  0.03991  0.00797
AF157877        3  0.03741  0.00725
AF157953        4  0.10455  0.01119   8 iterations until convergence
GVO389531       5  0.23022  0.01835   log L: -3347.60
Quartet puzzling tree with maximum likelihood branch lengths
(in CLUSTAL W notation):
(AF157928:0.01919,((AF157877:0.03741,AF157953:0.10455)51:0.02588,
GVO389531:0.23022)100:0.03991,AF157941:0.02246);
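The parenthesized tree string above can be read programmatically. The following minimal sketch pulls the terminal branch lengths and the internal branch support values out of this particular string with regular expressions. It assumes simple unquoted labels, as in this example, and is not a general parser for this notation; the variable names are hypothetical.

```python
import re

# Tree string copied from the Tree-Puzzle output above (joined onto one line).
tree_string = (
    "(AF157928:0.01919,((AF157877:0.03741,AF157953:0.10455)51:0.02588,"
    "GVO389531:0.23022)100:0.03991,AF157941:0.02246);"
)

# Terminal branches: a label starting with a letter, right after "(" or ",".
leaf_lengths = {
    name: float(length)
    for name, length in re.findall(r"[(,]([A-Za-z]\w*):([0-9.]+)", tree_string)
}

# Internal branches: ")" followed by a support value and a branch length.
internal_branches = [
    (int(support), float(length))
    for support, length in re.findall(r"\)(\d+):([0-9.]+)", tree_string)
]

for name, length in leaf_lengths.items():
    print(f"{name:<10} {length:.5f}")
for support, length in internal_branches:
    print(f"support {support:>3}  length {length:.5f}")
```

The extracted values match the branch length table in the output: for example, GVO389531 has a length of 0.23022, and the two internal branches carry support values of 51 and 100.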
PAML
The PAML package on iNquiry provides the program codeml, which performs maximum likelihood analysis of DNA or protein sequence. Two older PAML programs, codonml and aaml, were combined to create codeml.
Input: DNA or protein sequence may be pasted in directly or a file may be specified. The sequence data must begin with the number of sequences and the number of characters, followed by each sequence name and then its sequence (see the input description for PROTPARS). The user may also input a tree structure file.
Options: There are options for the general run of the program and options specific to DNA and to protein. The common options cover the output file names, the type of sequence, the tree, and other parameters for estimating trees. It is very important to specify the tree to be used: the user must choose an option from the pull-down list, and because the default is 0 (user-specified tree), a different option must be selected when no tree is supplied. Codon sequence options apply to DNA sequence data and include the model, codon frequency, genetic code, and kappa and omega values. It is very important to specify the genetic code to be used (the default of 0, the universal code, does not work for mammalian mitochondrial DNA sequence). Amino acid sequence options are the model, alpha and the matrix. If an empirical model is chosen from the pull-down menu, the user must specify a matrix file.
Output: There are three output files from PAML. Of these, rst gives codon sites with position differences and star trees, and mlc gives site patterns, sequence differences, codon usage in sequences, a distance matrix and the best tree.
best tree: (((1, 2), 4), 3, 5); lnL: -2853.476553
DNAml
DNAml is part of the PHYLIP package; fastDNAml performs the same functions using less memory.
fastDNAml
fastDNAml infers unrooted maximum likelihood trees from aligned DNA sequence. It is faster than DNAml and can save its progress toward finding a tree, so a run can be restarted from a checkpoint.
Input: Aligned DNA sequence.
Options: The user may specify the base frequencies or check the box for the program to derive them from the sequence data. The user may specify an outgroup (by the order of the sequences as in Tree-Puzzle) and the transition/transversion ratio. If the interleaved box is left checked the program will convert the sequence from FASTA format to PHYLIP interleaved format. There are options for bootstrapping the tree(s) found by the program. There are also options for the display of the output and the rearrangements of trees. The last two options are for user-specified weights and trees.
Output: The first output file is a tree and the second is a summary of the results. It gives the aligned sequences with any variable positions, which are called distinct data patterns. The program finds an unrooted tree and gives branch length values and approximate confidence limits.
Maximum Parsimony
Maximum parsimony methods search all possible tree topologies for the optimal (minimal) tree. Advantages of maximum parsimony are that it is based on shared and derived characters (and is therefore a cladistic method), it tries to provide information on the ancestral sequences, and it evaluates different trees. Disadvantages are that it does not use all the sequence information (only informative sites), does not correct for multiple mutations (no model of evolution), does not provide information on branch lengths, and is sensitive to codon bias (Opperdoes, 1997b). For more information on parsimony see Felsenstein (2004). There are two maximum parsimony programs for sequence data available on iNquiry; both are from the PHYLIP package.
| Program | Data |
|---|---|
| PROTPARS | Protein sequence |
| DNAPARS | DNA sequence |
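The remark that parsimony uses only informative sites can be illustrated with a short sketch. A site is parsimony-informative when at least two different character states each occur in at least two sequences; only such sites can favor one topology over another. The function name and the toy alignment below are hypothetical, and treating "-" and "?" as missing data is an assumption of this sketch.

```python
from collections import Counter


def informative_sites(seqs):
    """Return the 0-based indices of parsimony-informative columns."""
    sites = []
    for i, column in enumerate(zip(*seqs)):
        # Count each state in the column, skipping gaps and unknowns.
        counts = Counter(c for c in column if c not in "-?")
        states_seen_twice = sum(1 for n in counts.values() if n >= 2)
        if states_seen_twice >= 2:
            sites.append(i)
    return sites


# Hypothetical aligned sequences (not data from this document).
seqs = ["AAGT", "AAGC", "GATC", "GATA"]
print(informative_sites(seqs))  # [0, 2]
```

Column 1 is constant and column 3 has only one state that occurs twice, so neither contributes to choosing between topologies.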
PROTPARS
This program applies a novel method for inferring unrooted phylogeny from protein sequences. The user should consult the fine manual for the program for the assumptions of the method.
Input: Aligned protein sequence, where the first line contains the number of species and the number of amino acid positions, then the species data. Each sequence starts on a new line, has a ten-character species name, immediately followed by the species data in one-letter code.
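The input layout described above can be produced with a few lines of code. This minimal sketch writes a header with the number of species and the number of positions, then one line per sequence with the name padded (or truncated) to ten characters and immediately followed by the data. The function name, taxon names and sequences are hypothetical, and the sketch does not check for names that become identical after truncation.

```python
def to_phylip_sequential(records):
    """Format aligned sequences in the layout described above.

    records: dict mapping species name -> aligned sequence (one-letter code)
    """
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    n_positions = lengths.pop()

    # First line: number of species and number of positions.
    lines = [f" {len(records)} {n_positions}"]
    for name, seq in records.items():
        # Name occupies exactly ten characters, then the data follows directly.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


# Hypothetical aligned protein fragments (not data from this document).
records = {"taxon_A": "MTNIRKSHPL", "taxon_B": "MTNIRKTHPL"}
phylip_text = to_phylip_sequential(records)
print(phylip_text)
```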
Options: There is an option for using threshold parsimony and specifying the threshold value as well as specifying the genetic code to be used. There are also options for randomizing and bootstrapping as well as input for a user-specified tree. The user may choose the output options and specify an outgroup, by designating the sequence by the order (the first sequence is 1, etc.).
Output: The program gives the most parsimonious tree (or trees).
One most parsimonious tree found:
  +-----------05CYB_GLV
  4
  !  +--------04CYB_MAM
  +--3
     !  +-----03CYB_SPT
     +--2
        !  +--02CYB_SPT
        +--1
           +--01CYB_SPM
remember: (although rooted by outgroup) this is an unrooted tree!
requires a total of 64.000
DNAPARS
This program searches bifurcating and multifurcating trees for the most parsimonious trees, saves a number of trees tied for best, and rearranges all of the saved trees.
Input: Aligned DNA sequence, where the first line contains the number of species and the number of nucleotide positions, followed by the species data. Each sequence starts on a new line and has a ten-character species name, immediately followed by the nucleotide sequence.
Options: There is an option for using threshold parsimony and specifying the threshold value. There are also options for randomizing and bootstrapping as well as input for a user-specified tree. The user may choose the weight and output options and specify an outgroup, by designating the sequence by the order (the first sequence is 1, etc.).
Output: The program gives the most parsimonious tree (or trees) and distances.
References
All information contained in this document was obtained from the respective fine manual of the program or Nei, M. and Kumar, S. 2000. Molecular Evolution and Phylogenetics. Oxford University Press, Inc., New York, unless cited otherwise.
Felsenstein, J. 2004. Statistical properties of parsimony, pp. 97-122 and A digression on history and philosophy, pp. 123-146, in Inferring Phylogenies. Sinauer Associates, Inc., Sunderland, Massachusetts.
Opperdoes, F. 1997a. Maximum Likelihood. Retrieved 20 April 2004. http://www.icp.ucl.ac.be/~opperd/private/max_likeli.html
Opperdoes, F. 1997b. Maximum Parsimony Analysis. Retrieved 20 April 2004. http://www.icp.ucl.ac.be/~opperd/private/parsimony.html