PAML ACTIVITIES


2018 Workshop on Molecualr Evolution at the MBL


Follow this link to go to back to workshop wiki page.

Overview

The objective of this activity is to help you understand how to use different codon models, and how to test for selection using PAML (and specifically the CODEML program). The activities are designed to build general analytical skills, and are just as relevant to analyses carried out using other software packages, such as HyPhy.

Here are the slides for the PAML learning activity: Demo.pdf.

Here is a copy of the book chapter that accompanies these exercises: Book_Chapter.pdf.

You can download the necessary files from an archive here, or you can download individual files here. Either way, you must copy the files to your home directory on the cluster to do the excercises at the workshop.

Alternatively, you can use wget to download an archive to your home directoy on the cluster as follows:

wget https://molevol.mbl.edu/images/d/df/PamlLab.zip
unzip PamlLab.zip
cd PamlLab

This page has links to useful documents and links to other websites.

The tutorial is divided into 4 excercises.

  1. Maximum likelihood estimation (by hand)
  2. Sensitivity of ω
  3. LRTs for alternative hypothises about temporal changes in selection pressure
  4. Test for sites evolving by postive selection in the nef gene of HIV-2


Exercise 1

The objective of this activity is to use CODEML to evaluate the likelihood of the GstD1 sequences for a variety of ω values. Plot log-likelihood scores against the values of ω and determine the maximum likelihood estimate of ω. Check your finding by running CODEML’s hill-climbing algorithm.

  1. Find the files for exercise 1 on this workshop web-site (ex1_codeml.ctl, seqfile.txt) and familiarize yourself with them. Pay close attention to the modified control file called ex1_codeml.ctl. When you are ready to run CODEML, delete the ex1_ prefix (the control file must be called codeml.ctl).

  2. Create a directory where you want your results to go, and place all your files within it. Now open a terminal, move to the directory that contains your files, and run CODEML.

  3. Familiarize yourself with the results (see annotations in ex1_HelpFile.pdf). If you have not edited the control file the results will be written to a file called results.txt. Identify the line within the results file that gives the likelihood score for the example dataset.

  4. Now change the control files and re-run CODEML. The objective is to compute the likelihood of the example dataset given a fixed value of omega.

  5. Repeat Step 4 for each value of ω given in the comments of the example control file.

  6. Use your favorite spread sheet or statistical package to plot the likelihood score (y-axis) against the fixed value for omega (x-axis). Use a logarithmic scale for the x-axis (do not transform ω). Your figure should look something like this: FigE1.pdf (note: the data points have been intentionally omitted from this version of the plot; you need to generate the data for yourself).

  7. From the plot, try to guess the value of ω that will maximize the likelihood score (i.e., the MLE).

  8. Now change the control file so that CODEML will use its hill-climbing algorithm to find the MLE; set fix_omega=0 in the control file. Compare the result to your guess from Step 7.


Exercise 2

in this exercise you will investigate the sensitivity of ω to the transition/transversion ratio (k), and to the assumed codon frequencies (π's). After you collect the required data you will determine which assumptions yield the largest and smallest values of S, and what effect it as on the value of ω.

  1. Find the files for Exercise 2 on the workshop web-site ex2 codml.ctl, seqfile.txt and familiarize yourself with them. It would be best to create a new directory for exercise 2.

  2. Run CODEML using the settings in the control file for exercise 2. Familiarize yourself with the results (ex2 HelpFile.pdf).In addition to the likelihood score you must be able to identify the part of the result file that provides estimates of the following:

  3. As in exercise 1, you will need to change the control files and re-run CODEML. The objective is to compute the likelihood of the example dataset under different model assumptions. To do this you must:

  4. Repeat step 3 for each set of assumptions about codon frequencies and kappa given as comments at the bottom of the example control file.

  5. In your favorite spreadsheet program create a table like Table E2 in the slides (TableE2.pdf) and fill it in with your results.

  6. se your table to determine which assumptions yield the largest and smallest values of S. What is the effect on ω?


Exercise 3

The objective of this exercise is to use three LRTs to evaluate the following hypotheses: (1) the mutation rate of Ldh-C has increased relative to Ldh-A, (2) a burst of positive selection for functional divergence occurred following the duplication event that gave rise to Ldh-C, and (3) there was a long term shift in selective constraints following the duplication event that gave rise to Ldh-C.

  1. Obtain the files for Exercise 3 from the course web-site (ex3_codeml.ctl, seqfile.txt, treeH0.txt, treeH1.txt, treeH2.txt, treeH3.txt). The tree files represent different hypotheses denoted H0, H1, H2 & H3 (LDHgenetree.pdf). These hypotheses represent the following concepts:

  2. Run CODEML using the settings in the control file for Exercise 3. Familiarize yourself with the results (ex3_HelpFile.pdf). In addition to the likelihood score you must be able to identify the branch-specific estimates of the ω parameter. (In the first run, the branch specific values for omega will all be the same. In later runs there will be differences among some branches).

  3. As in the previous exercises, you will need to change the control files and re-run CODEML. The objective is to compute the likelihood, and estimate ω parameters, under different models of how selection pressure changes in different parts of the tree. Because the relevant model information is contained in the tree file, you will need several tree files (obtained from the course web site) and change the control file so that it reads the different tree files.

  4. Repeat step 3 for each of the four tree files that have been provided to you. Again, keep track of your results by using a table like Table E3 shown in the slides (TableE3.pdf). In addition, carry out likelihood ratio tests (LRT) of the hypotheses below. See the lecture notes for additional details about LRTs. Use 1 degree of freedom to obtain the P-value for each LRT. Helpful: Chi-Square Calculator.


Exercise 4

The objective of this exercise is to use a series of LRTs to test for sites evolving under positive selection in the nef gene. If you find significant evidence for positive selection, then identify the involved sites by using empirical Bayes methods.

  1. Obtain all the files for exercise 4 from the course website (ex4_codeml.ctl.txt, ex4_seqfile.txt, treeM0.txt, treeM1.txt, treeM2.txt, treeM3.txt, treeM7.txt, treeM8.txt).

  2. If you plan to run two or more models at the same time, then create a separate directory for each run and place a sequence file, control file and tree file in each one.

  3. As in all the previous exercises, you will need to change the control file and re-run CODEML several times. In this case you will be fitting six different codon models (M0, M1a, M2a, M3, M7 & M8) to the example dataset.

  4. Keep track of your results (ex4_HelpFile.pdf) by using a table like Table E4 shown in the slides (TableE4.pdf).

  5. In addition, carry out the following likelihood ratio tests:

  6. Lastly, open the rst file generated when you ran model M3 (ex4_rst-HelpFile.pdf). Locate the columns of posterior probabilities for each site under the three site-categories of this model. Use these data to reproduce the plot shown in the slides.