Skip to content

Welcome courses Facilities research Members News Search
  You are not logged in Log in
You are here: Home » biocourses » Bioinformatics for Biologists - BIOS 599 - Spring 2004 » projects » Bayesian Phylogenetic Inference & MrBayes

« September 2008 »
Su Mo Tu We Th Fr Sa
  1 2 3 4 5 6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30        

Bayesian Phylogenetic Inference & MrBayes

Bayesian Phylogenetic Inference using MrBayes

 

 

This summary will provide a brief introduction to Bayesian theory, in particular its relationship to maximum likelihood techniques, and summarize the most important commands used in the most common phylogenetic software using the Bayesian approach: MrBayes.

Background: Probability and Likelihood

While both probability and likelihood are related, they are subtly different in meaning. First, some basic terminology must be defined:

P (A) = probability of A occurring

P (A,B) = probability of both A and B occurring (joint probability)

P (A|B) = probability of A occurring given that B has occurred (conditional probability)

In phylogenetics we wish to calculate a tree using data and a model of evolution. In the maximum likelihood approach to tree building, we calculate P(data|model + tree), which is the likelihood of the tree given the data (as well as the probability of the data given the tree). Unfortunately, this implies that the data are dependent on the tree. A better analysis would be to calculate P(model + tree|data), which is the probability of the tree. This correctly assumes the data are independent.

Bayes’ Theorem

Bayes’ Theorem, named after Reverend Thomas Bayes (1702-1760), is a simple statistical proof. Let’s say for example that we have a matrix of students’ characteristics broken by sex and by hair color:

 

Red Hair

Black Hair

Row Sum

Male

10

31

41

Female

36

23

59

Column Sum

46

54

100

We can easily start calculating probabilities:

P (male) = 41/100

P (red) = 46/100

P (red,male) = 10/100

P (red|male) = 10/41

P (male|red) = 10/46

Notice that P (red,male) = P (red|male) x P (male) = 10/41 x 41/100 = 10/100, and that this equals P (male,red) = P (male|red) x P (red) = 10/46 x 46/100 = 10/100.

This leads to the result that P (A|B) x P (B) = P (B|A) x P (A), or

P (A|B) = P (B|A) x P (A) / P (B) (1)

This is Bayes’ theorem.

Bayes’ Theorem and Phylogenetics

Recall that we want to calculate the probability of the tree (and model) given the data, or P (tree + model|data). We can calculate this using Bayes’ theorem:

P (tree + model|data) = P (data|tree + model) x P (tree + model) / P (data) (2)

The term P (data|tree + model) is the likelihood, and can be calculated using maximum likelihood phylogenetic methods. The term P (tree + model) is the prior probability of the tree and model, and has to be assumed. For example, all possible trees are usually considered equally probable. The denominator P (data) is the problematic term, as this equals the sum of the likelihood x prior probability of the tree + model for all possible trees.

P (data) = S P(data|tree + model) x P(tree + model) (3)

This term is impossible to calculate, but the Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970) can get around this problem. The Metropolis-Hastings algorithm is a type of Markov chain Monte Carlo (MCMC) algorithm, and can be summarized in seven steps:

Step 1: start at some tree Ti

Step 2: pick a neighbor of this tree, Tj

Step 3: calculate ratio of likelihoods for Ti & Tj

R = likelihood (Tj) / likelihood (Ti)

Step 4: if R > 1, accept new tree

Step 5: if R < 1, pick random number between 0 and 1: if random number < R, accept new tree

Step 6: otherwise, reject new tree and continue with Ti

Step 7: return to step 2

To solve equation 2 listed above, we can use it as the ratio used in step 3 of the algorithm. Because only the ratio is examined, the denominator of equation 2 does not need to be calculated as it cancels out. As a result, we can use the Metropolis-Hastings algorithm to maximize the posterior probability of a universe of trees.

Metropolis-Hastings algorithm: multiple chains and support indices

As the Metropolis-Hastings algorithm searches through phylogenetic tree space, we hope that eventually it will converge on the set of optimal trees. One way to make this more likely is to run multiple algorithms (chains). The picking of a neighbor tree in step two of the algorithm can be adjusted. Usually one chain is constrained to pick very near tree neighbors (the cold chain), while the rest are free to make more drastic changes to the tree (heated chains). After every iteration, the score of two chains are compared, and if a heated chain has a higher score than the cold chain, this ‘better’ chain becomes the new cold chain. This algorithm of comparing scores and rejecting or swapping chain states is very similar to the Metropolis-Hastings algorithm. This variation of the Metropolis-Hastings algorithm lets the search avoid becoming trapped at relatively sub-optimal local optima in the phylogenetic tree space.

Once the Metropolis-Hastings algorithm has found the optimal neighborhood of tree space, a large number of iterations are still conducted by the algorithm. As this continues, the algorithm (specifically the cold chain) occasionally saves trees to a file. These trees, if not sampled so close as to be non-randomly chosen, make up a random sample of the optimal tree for the dataset. The program MrBayes uses this fact to make a support index for the phylogenetic analysis. This support index is more straightforward than the bootstrapping method used in other phylogenetic approaches, as it is a probability that a particular phylogenetic clade is supported by all of the optimal trees. This sampling of trees is only optimal if the initial iterations of the algorithm are removed (the ‘burn-in’). Usually the amount of burn-in is calculated by examining the likelihood scores over the course of the MCMC run. At some point the likelihood scores plateau around an optimum, and all of the iterations before this point are removed as burn-in.

Summary of advantages and disadvantages in Bayesian phylogenetic analysis

Advantages:

  • More straightforward statistical measure of phylogeny compared to likelihood
  • Avoids bootstrapping and provides straightforward support values
  • Much faster than maximum likelihood bootstrap analysis

Disadvantages:

  • Needs a prior probability for the tree and model

This point may seem minor, but it is a long-standing debate among statisticians. It is actually a debate on the nature of knowledge, and therefore a philosophical issue!

MrBayes 3.0

This program is only available in command-line version from the authors, but Luobin has graciously developed a web interface for the EGG server. The program is freely distributed in Macintosh, Windows, and Unix formats at http://morphbank.ebc.uu.se/mrbayes3/info.php

Figure 1. MrBayes web interface on EGG.

 

MrBayes requires an input file in NEXUS format. This is a fairly standard format required by PAUP* (Swofford 2003), a popular phylogenetics inference package.

 

Figure 2. Example of input file in NEXUS format.

Although there are many commands in MrBayes, three important commands are needed to start a MrBayes run, ‘Lset’, ‘Prset’, and ‘Mcmc’. These commands can be appended to your NEXUS file (see Figure 2 for examples).

Lset

This command sets the parameters of the likelihood model. There is a large body of literature on models of molecular evolution as applied to neighbor-joining and maximum likelihood models. For a good summary see Felsenstein (2004). Among the many parameters are:

Nucmodel – specifies the general type of substitution model (4by4: standard DNA; codon: codon model)

Nst – specifies the number of substitution types (1: Jukes Cantor; 2: HKY85; 6: General time-reversible). For a good explanation of models of molecular evolution, see Felsenstein (2004).

Code – specifies the genetic code (universal; vertmt: vertebrate mitochondrial)

Rates – specifies the model for among-site rate variation (equal; gamma: incorporates gamma distribution for rate variation; propinv: incorporates a proportion of invariable sites; invgamma: incorporates both gamma distribution and invariable sites). Again, for a good explanation of these parameters, see Felsenstein (2004).

Prset

This command sets the prior probabilities for the parameters. In reality, this line could be omitted and priors left to default settings. It is important that the user have a knowledge of Bayesian priors and their data before changing the default settings. Some of the parameters include:

Tratiopr – changes the prior probability of the transition/transversion ratio. It can be fixed (fixed), or set to a beta distribution with mean x and variance y (beta(x,y)).

Revmatpr – if using the the General time-reversible model of evolution, allows one to change the prior on these values. Again they can be fixed, or set to a Dirichlet distribution (dirichlet(six numbers here)).

Omegapr – allows one to change the prior of the synonymous/non-synonymous rate ratio. It can be fixed, set to a flat or uniform distribution (uniform), or an exponential distribution.

Shapepr – allows the user to change the prior of the gamma distribution shape. Again it can be fixed, uniform, or exponentially distributed.

Pinvarpr – allows the user to set the prior for the proportion of invariable sites. This can be set to be fixed, or a uniformly distributed.

Topologypr – allows the user to constrain the search to a particular tree topology or topologies. In this case, the command is set to = constraints (list of trees).

Mcmc

This command sets the parameters for the search algorithm, making it the most important command. Key parameters include:

Ngen – specifies the number of iterations for the algorithm. This number is usually in the millions.

Samplefreq – specifies how often a tree is sampled from the chain. It is important to not sample too frequently, or the trees may not be random. If millions of generations are run, sampling every 1000 or 10000 generations is fine.

Printfreq – specifies how often the program displays info on the algorithm to the screen.

Nchains – specifies the number of chains in the algorithm. The default is 4 chains, and this is usually sufficient.

Temp – defines the heating parameter for the heated chains, which regulates how distantly across tree space the heated chains swap trees in the MCMC algorithm. The default is 0.2, and should probably not be tinkered with.

Execute

The command used to start the algorithm is ‘execute’. The run can take anywhere from minutes to days, but average runs on large datasets are usually done in less than 24 hours.

Post-algorithm analyses

At the end of the MCMC search, MrBayes will output two files, a file containing all of the sample trees (.t), and a file listing the parameters after each printed iteration (.p). The .t file needs to be processed by the ‘Sumt’ command, which results in three more output files (.con, .parts, .trprobs). The .con file is the consensus tree of all trees in the .p file, and displays the support values for each clade. This file can be viewed and saved in the phylogenetic software package PAUP*, or the program TREEVIEW.

Figure 3. Example tree from MrBayes analysis. Numbers on branches indicate Bayesian support values.

Sumt

This command summarizes the raw tree file, generating three files including the consensus tree file (.con). As explained earlier, it is important for the initial portion of the MCMC search to be removed from subsequent analysis, as the trees saved in this period are not optimal. This period is called the ‘burn-in’, and is set by changing the parameter ‘burnin’ = (number).

 

Literature Cited

Durbin, R., S. Eddy, A. Krogh, and G. Mitchison. 1998. Biological Sequence Analysis. Cambridge University Press, Cambridge.

Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland, MA.

Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97-109.

Huelsenbeck, J. P., and F. Ronquist. 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754-755.

Huelsenbeck, J. P., F. Ronquist, R. Nielsen, and J. P. Bollback. 2001. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294:2310-2314.

Lewis, P. O. 2001. Phylogenetic systematics turns over a new leaf. Trends in Ecology and Evolution 16:30-37.

Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. 1953. Equation of state calculations by fast computing machines. Journal of Chemical Physics 21:1087-1092.

Shoemaker, J. S., I. S. Painter, and B. S. Weir. 1999. Bayesian statistics in genetics: a guide for the uninitiated. Trends in Genetics 15:354-358.

Swofford, D. L. 2003. Phylogenetic analysis using parsimony (*and other methods). Sinauer Associates, Sunderland, MA.