Large-scale phylodynamics under non-neutral and non-treelike models of evolution

Project Summary

Technological breakthroughs such as next-generation sequencing have recently led to the creation of immense "BioBanks" featuring genomic information collected from hundreds of thousands of people, and the ongoing pandemic has produced an even larger repository containing over 10 million SARS-CoV-2 genomes. Unfortunately, existing techniques for inferring evolutionary models can, in most cases, analyze only a tiny fraction of the information contained in these datasets. At a time when we should be able to use vast quantities of data to answer increasingly nuanced evolutionary questions, the lack of adequate methods has limited our opportunities for discovery and hampered our ability to respond to the ongoing pandemic. The proposed research addresses this problem through the creation of novel statistical and computational methods designed to study targeted evolutionary hypotheses using BioBank- and pandemic-scale datasets. First, we will develop new phylodynamic methods for epidemiological inference using tens of thousands of sampled pathogen genomes. Apart from being more scalable, these methods will improve on previous work by being more biologically realistic and making fewer simplifying assumptions about the data. In particular, we will study systems where multiple strains co-circulate and have differential fitness, and we will use this model to improve our understanding of the role that natural selection has played in shaping the pandemic. We will further extend this method to integrate non-genetic sources of information such as case count data, which will enable public health researchers to partition case counts among variants and estimate variant-specific effective reproduction numbers.
Second, we will develop improved methods for inferring phylogenetic networks, and use them to understand the role that recombination has played in the evolution of SARS-CoV-2, as well as its role in confounding earlier studies that incorrectly assumed that SARS-CoV-2 evolution could be represented by a single tree. All of these advances will be implemented and released as easy-to-use, open-source software packages. In summary, this work represents advances in several areas of statistical genetics, including phylodynamic modeling, genetic epidemiology, inference of natural selection, and phylogenetic network analysis, and will provide empirical researchers with the modern tools needed to propel the next generation of discoveries in these fields.
