Improved Bayesian methods for detecting recombination and rate heterogeneity in DNA sequence alignments

Alexander Vassilios Mantzaris

Research output: Thesis (Doctoral Thesis)

Abstract

DNA sequence alignments are usually not homogeneous. Mosaic structures may result from recombination or rate heterogeneity. Interspecific recombination, in which DNA subsequences are transferred between different (typically viral or bacterial) strains, may change the topology of the underlying phylogenetic tree. Rate heterogeneity corresponds to a change of the nucleotide substitution rate. Various methods for simultaneously detecting recombination and rate heterogeneity in DNA sequence alignments have recently been proposed, based on complex probabilistic models that combine phylogenetic trees with factorial hidden Markov models or multiple changepoint processes. The objective of my thesis is to identify potential shortcomings of these models and to explore ways to improve them.
One shortcoming that I have identified is related to an approximation made in various recently proposed Bayesian models. The Bayesian paradigm requires the solution of an integral over the space of parameters. To render this integration analytically tractable, these models assume that the vectors of branch lengths of the phylogenetic tree are independent among sites. While this approximation reduces the computational complexity considerably, I show that it leads to the systematic prediction of spurious topology changes in the Felsenstein zone, that is, the region of the branch-length configuration space where maximum parsimony consistently infers the wrong topology due to long-branch attraction. I demonstrate these failures using two Bayesian hypothesis tests, based on an inter-model and an intra-model approach to estimating the marginal likelihood. I then propose a revised model that addresses these shortcomings, and demonstrate its improved performance on a set of synthetic DNA sequence alignments systematically generated around the Felsenstein zone.
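As a rough illustration of why the independence assumption makes the integration tractable, the quantities involved can be written as follows. The notation is an assumption introduced here, not taken from the thesis: D is the alignment with N sites, y_t is the alignment column at site t, S_t is the tree topology at site t, w_t is the vector of branch lengths at site t, and M_0, M_1 are two competing models. The factorisation additionally assumes the usual conditional independence of sites given topology and branch lengths.

```latex
\begin{align*}
% Marginal likelihood: integrate the likelihood over all parameters \theta.
p(D \mid M) &= \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta \\
% With branch-length vectors w_1, \dots, w_N independent among sites, the
% high-dimensional integral splits into one small integral per site.
p(D \mid S_1, \dots, S_N) &= \prod_{t=1}^{N} \int p(y_t \mid S_t, w_t)\, p(w_t)\, dw_t \\
% A Bayesian hypothesis test compares marginal likelihoods via the Bayes factor.
B_{10} &= \frac{p(D \mid M_1)}{p(D \mid M_0)}
\end{align*}
```

The price of the per-site factorisation is that branch lengths are no longer shared along the alignment, which is the approximation the paragraph above identifies as the source of spurious topology changes.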
The core model explored in my thesis is a phylogenetic factorial hidden Markov model (FHMM) for detecting two types of mosaic structures in DNA sequence alignments, related to recombination and rate heterogeneity. The focus of my work is on improving the modelling of the latter. Earlier work by other authors modelled different degrees of rate heterogeneity with separate hidden states of the FHMM. That work does not distinguish between two intrinsically different types of rate heterogeneity: long-range regional effects, which are potentially related to differences in selective pressure, and short-range periodic patterns within codons, which merely reflect the signature of the genetic code.
I have improved these earlier phylogenetic FHMMs in two respects. Firstly, by sampling the rate vector from the posterior distribution with reversible-jump Markov chain Monte Carlo (RJMCMC), I have made the modelling of regional rate heterogeneity more flexible, and I infer the number of different degrees of divergence directly from the DNA sequence alignment, thereby dispensing with the need to select this quantity arbitrarily in advance. Secondly, I explicitly model within-codon rate heterogeneity via a separate rate modification vector. In this way, the within-codon effect of rate heterogeneity is imposed on the model a priori, which facilitates the learning of the biologically more interesting effect of regional rate heterogeneity a posteriori. Simulations on synthetic DNA sequence alignments have borne out my conjecture. The existing model, which does not explicitly include within-codon rate variation, has to capture both effects with the same mechanism and, as expected, fails to disentangle them. By contrast, my new model clearly separates within-codon rate variation from regional rate heterogeneity, resulting in more accurate predictions.
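To make the idea of a separate within-codon rate modification vector concrete, here is a minimal sketch in Python. It is not the thesis's implementation: the function name `site_rates`, the multiplicative combination of the two effects, and the assumption that the reading frame starts at the first site are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def site_rates(
    regional_rates: Sequence[float],
    regional_states: Sequence[int],
    codon_modifiers: Sequence[float],
) -> List[float]:
    """Combine regional and within-codon rate effects into per-site rates.

    regional_rates[k] is the rate of regional state k, regional_states[t]
    is the regional state at site t, and codon_modifiers holds one factor
    per codon position. ASSUMPTIONS (not from the source): the two effects
    combine multiplicatively and site 0 is the first codon position.
    """
    if len(codon_modifiers) != 3:
        raise ValueError("expected one modifier per codon position (3 values)")
    return [
        regional_rates[state] * codon_modifiers[t % 3]
        for t, state in enumerate(regional_states)
    ]


if __name__ == "__main__":
    # Two regional states (slow, fast) over two codons; third positions
    # are given a higher modifier purely as an illustrative choice.
    rates = site_rates([0.5, 2.0], [0, 0, 0, 1, 1, 1], [1.0, 0.5, 1.5])
    print(rates)  # [0.5, 0.25, 0.75, 2.0, 1.0, 3.0]
```

Because the period-3 pattern is fixed by the modifier vector, the regional states only need to explain the slower, long-range variation, which mirrors the separation of effects described above.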
Language: English
Qualification: PhD
Awarding Institution
  • UNIVERSITY OF EDINBURGH
Supervisors/Advisors
  • Husmeier, Dirk, Supervisor, External person
Award date: 24 Nov 2011
Publication status: Published - 2011

Keywords

  • Bayesian
  • recombination
  • rate heterogeneity
  • DNA

Cite this

@phdthesis{b758c7095b904d5a8fdd56463dca9c5d,
title = "Improved Bayesian methods for detecting recombination and rate heterogeneity in DNA sequence alignments",
keywords = "Bayesian, recombination, rate heterogeneity, DNA",
author = "Mantzaris, {Alexander Vassilios}",
year = "2011",
language = "English",
school = "UNIVERSITY OF EDINBURGH",

}

TY - THES

T1 - Improved Bayesian methods for detecting recombination and rate heterogeneity in DNA sequence alignments

AU - Mantzaris, Alexander Vassilios

PY - 2011

Y1 - 2011

KW - Bayesian

KW - recombination

KW - rate heterogeneity

KW - DNA

M3 - Doctoral Thesis

ER -