Abstract
DNA sequence alignments are usually not homogeneous. Mosaic structures
may result as a consequence of recombination or rate heterogeneity. Interspecific
recombination, in which DNA subsequences are transferred between different
(typically viral or bacterial) strains, may result in a change of the topology of
the underlying phylogenetic tree. Rate heterogeneity corresponds to a change of
the nucleotide substitution rate. Various methods for simultaneously detecting
recombination and rate heterogeneity in DNA sequence alignments have recently
been proposed, based on complex probabilistic models that combine phylogenetic trees with factorial hidden Markov models or multiple changepoint processes. The objective of my thesis is to identify potential shortcomings of these models and to explore ways to improve them.
One shortcoming that I have identified is related to an approximation made in
various recently proposed Bayesian models. The Bayesian paradigm requires the
solution of an integral over the space of parameters. To render this integration
analytically tractable, these models assume that the vectors of branch lengths
of the phylogenetic tree are independent among sites. While this approximation
reduces the computational complexity considerably, I show that it leads to the
systematic prediction of spurious topology changes in the Felsenstein zone, that
is, the area in the branch lengths configuration space where maximum parsimony
consistently infers the wrong topology due to long-branch attraction. I demonstrate these failures by using two Bayesian hypothesis tests, based on an inter-model and an intra-model approach to estimating the marginal likelihood. I then propose a revised model that addresses these shortcomings, and demonstrate its improved performance on a set of synthetic DNA sequence alignments systematically generated around the Felsenstein zone.
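The independence approximation described above can be written schematically as follows. The notation is an assumption introduced here for illustration and is not taken from the thesis: $y_t$ denotes the $t$-th alignment column, $S_t$ the tree topology at site $t$, and $\mathbf{w}$ the vector of branch lengths with prior $P(\mathbf{w})$.

```latex
% Marginal likelihood with one branch-length vector shared by all N sites:
P(y_1,\dots,y_N \mid S_1,\dots,S_N)
  = \int \Bigl[\, \prod_{t=1}^{N} P(y_t \mid S_t, \mathbf{w}) \Bigr] P(\mathbf{w}) \, d\mathbf{w}

% Approximation: an independent branch-length vector w_t for each site,
% so that the integral factorises into N low-dimensional integrals:
P(y_1,\dots,y_N \mid S_1,\dots,S_N)
  \approx \prod_{t=1}^{N} \int P(y_t \mid S_t, \mathbf{w}_t) \, P(\mathbf{w}_t) \, d\mathbf{w}_t
```

In the second form each site is scored against its own branch lengths, so no information about branch lengths is shared between neighbouring sites.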
The core model explored in my thesis is a phylogenetic factorial hidden Markov
model (FHMM) for detecting two types of mosaic structures in DNA sequence
alignments, related to recombination and rate heterogeneity. The focus of my
work is on improving the modelling of the latter aspect. Earlier research efforts by
other authors have modelled different degrees of rate heterogeneity with separate hidden states of the FHMM. Their work fails to appreciate the intrinsic difference between two types of rate heterogeneity: long-range regional effects, which are potentially related to differences in the selective pressure, and the short-term periodic patterns within the codons, which merely capture the signature of the genetic code.
I have improved these earlier phylogenetic FHMMs in two respects. Firstly,
by sampling the rate vector from the posterior distribution with RJMCMC I
have made the modelling of regional rate heterogeneity more flexible, and I infer
the number of different degrees of divergence directly from the DNA sequence
alignment, thereby dispensing with the need to arbitrarily select this quantity
in advance. Secondly, I explicitly model within-codon rate heterogeneity via a
separate rate modification vector. In this way, the within-codon effect of rate
heterogeneity is imposed on the model a priori, which facilitates the learning of
the biologically more interesting effect of regional rate heterogeneity a posteriori.
I have carried out simulations on synthetic DNA sequence alignments, which have
borne out my conjecture. The existing model, which does not explicitly include the within-codon rate variation, has to model both effects with the same modelling mechanism. As expected, it was found to fail to disentangle these two effects. In contrast, I have found that my new model clearly separates within-codon rate
variation from regional rate heterogeneity, resulting in more accurate predictions.
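As a rough illustration of keeping the two effects apart, the sketch below combines a regional rate per site with a fixed, period-three rate modification vector. It assumes the two factors combine multiplicatively, which the abstract does not state; the function name `effective_site_rates`, the `frame_offset` parameter, and the modifier values are all hypothetical.

```python
from typing import List, Sequence


def effective_site_rates(
    regional_rates: Sequence[float],
    codon_modifiers: Sequence[float] = (0.5, 0.3, 2.2),  # hypothetical values, mean 1.0
    frame_offset: int = 0,
) -> List[float]:
    """Combine regional rates with a within-codon rate modification vector.

    Assumption (not from the thesis): the rate at site t is the product of
    the regional rate at t and the modifier for the codon position of t.
    """
    if len(codon_modifiers) != 3:
        raise ValueError("codon_modifiers must have one entry per codon position")
    return [
        rate * codon_modifiers[(t + frame_offset) % 3]
        for t, rate in enumerate(regional_rates)
    ]


# Two codons: the first in a slowly evolving region, the second in a faster one.
regional = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
rates = effective_site_rates(regional)
```

Because the period-three pattern is fixed in advance, any remaining variation between the two codons is attributed to the regional rate alone.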
Original language  English 

Qualification  PhD 
Awarding Institution 

Supervisors/Advisors 

Award date  24 Nov 2011 
Publication status  Published  2011 
Keywords
 Bayesian
 recombination
 rate heterogeneity
 DNA