Co-author: Vincent Plagnol (University College London Genetics Institute)
For many practical applications, for example uncovering the pathogen that caused an infection after the acute phase, very deep short-read sequencing can be effective provided that we can reliably assign short sequencing reads to species. This assignment problem is complicated by the fact that, in the absence of very large contigs, most short reads match multiple species. It is essentially a mixture model, in which complete knowledge of the species present in the mixture informs the assignment of each individual read. However, metagenomic data analysis rarely formulates the problem in these terms, because the very large number of potential species typically makes the inference intractable. Here, we propose a Bayesian model averaging strategy designed to explore the high-dimensional space of species present in a metagenomic mixture. We use approximate Bayesian computation and a Monte Carlo strategy to search for the most appropriate mixture models. Because the work is computationally intensive, we use a population Markov chain Monte Carlo scheme to take advantage of parallel computing. We find that the methodology is effective at providing full Bayesian inference for samples with > 10M reads, hence yielding interpretable Bayes factors and posterior probabilities for practical problems that regularly arise in a clinical context.
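To make the mixture-model view concrete, the sketch below estimates species proportions and per-read assignment probabilities with a plain EM algorithm for a fixed, known set of candidate species. This is an illustrative simplification and not the Bayesian model averaging procedure described in this abstract: the function name `em_mixture`, the toy likelihood matrix, and the assumption that every read has non-zero likelihood under at least one species are all hypothetical choices made for the example.

```python
import numpy as np

def em_mixture(lik, n_iter=200, tol=1e-10):
    """EM for mixture weights given a (reads x species) likelihood matrix.

    lik[i, j] is the (assumed known) probability of read i under species j.
    Every row is assumed to have at least one non-zero entry.
    """
    lik = np.asarray(lik, dtype=float)
    n_species = lik.shape[1]
    w = np.full(n_species, 1.0 / n_species)  # start from uniform proportions
    for _ in range(n_iter):
        # E-step: posterior probability that each read comes from each species
        weighted = lik * w
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: proportions are the average responsibilities
        new_w = resp.mean(axis=0)
        converged = np.abs(new_w - w).max() < tol
        w = new_w
        if converged:
            break
    # Recompute responsibilities under the final weights
    weighted = lik * w
    resp = weighted / weighted.sum(axis=1, keepdims=True)
    return w, resp

# Toy data (hypothetical): species A has three unique reads, species C has one,
# and two reads match A and B equally well. B has no unique support.
lik = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
])
weights, resp = em_mixture(lik)
print("estimated proportions:", np.round(weights, 3))
print("assignment of an ambiguous read:", np.round(resp[3], 3))
```

In this toy case the ambiguous reads are pulled toward species A because A is independently supported by unique reads, which is the sense in which knowledge of the whole mixture informs each individual assignment. Deciding which species belong in the mixture at all is the harder model-selection problem that the abstract addresses.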