Efficient Statistical Learning Methods for Personalized Medicine Using Large Scale Biomedical Data

Project Summary: Coronavirus disease 2019 (COVID-19) has created a major public health crisis around the world. The novel coronavirus has been observed to have a long incubation period and to be extremely infectious during this period. No proven effective treatment or vaccine is available. Massive public interventions have been implemented in many countries and in states across the United States (US) at different phases of the outbreak, with varying combinations of social distancing, mobility restriction, and population behavioral change. Decisions on how to implement these interventions (e.g., when to impose and relax mitigation measures) rely on important statistics of COVID-19 epidemiology (e.g., the effective reproduction number) that characterize and predict the course of the COVID-19 outbreak. However, there is a lack of robust and parsimonious models of the COVID-19 epidemic that can accurately reflect the heterogeneity between susceptible populations and regions (e.g., demographics, healthcare capacity, social and economic determinants). There is no rigorous study to guide precision public health interventions that are tailored to a population or region depending on its characteristics. Furthermore, due to the non-randomized nature of public health interventions, it is critical to account for biases and confounding when comparing mitigation measures of COVID-19 across regions. To address these challenges, this project develops robust and generalizable analytic methods to evaluate public health interventions and assess individual patient risks of COVID-19 infection and complications. In Aim 1, we will develop dynamic and robust statistical models to predict the course of the epidemic. The models will estimate the date of the first unknown infection case and the instantaneous effective reproduction number, and will account for the incubation period of the COVID-19 virus.
Furthermore, heterogeneity in populations' demographics, social and economic indicators, healthcare capacity, and geographic locations will be incorporated to reflect their impact on the COVID-19 epidemic. Under a longitudinal quasi-experimental design, we will provide valid inference for comparing public health interventions implemented in different regions while accounting for confounding bias. Multiple sources of data from different states in the US will be analyzed to empirically test which states' response strategies are more effective and in which subpopulations. In Aim 2, we will focus on developing a precise risk assessment tool for individual COVID-19 patients using electronic health records (EHRs) collected at New York Presbyterian Hospital in New York City, an epicenter of COVID-19. We will engineer features of patients' pre-existing conditions associated with severe COVID-19 complications, recovery, or death. More importantly, we will engineer features that represent proxies of virus exposure from patients' geographic information. We will use machine learning techniques to create quantitative summaries of patient prognosis (e.g., transitioning to serious clinical stages, discharge, death). We will use internal cross-validation and external calibration to validate the developed algorithms. The project will generate evidence to guide precision public health intervention, optimal patient care, and efficient healthcare resource allocation in anticipation of a second wave of the COVID-19 epidemic and in preparation for other infectious disease outbreaks.
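
For readers unfamiliar with the instantaneous effective reproduction number named in Aim 1, the following is a minimal illustrative sketch only. It assumes the common renewal-equation ratio, in which new cases on day t are divided by past cases weighted by a serial-interval distribution. The summary does not specify the project's actual model, so the function name, the inputs, and the numbers are all hypothetical and are not the project's method.

```python
def instantaneous_rt(incidence, serial_interval):
    """Renewal-equation ratio R_t = I_t / sum_s w_s * I_{t-s}.

    incidence: daily new case counts (hypothetical input).
    serial_interval: weights w_1..w_k, assumed to sum to 1.
    Returns one value per day; None where no past cases contribute.
    """
    estimates = []
    for t in range(len(incidence)):
        # Infection pressure from earlier cases, weighted by the serial interval.
        pressure = sum(
            w * incidence[t - s]
            for s, w in enumerate(serial_interval, start=1)
            if t - s >= 0
        )
        estimates.append(incidence[t] / pressure if pressure > 0 else None)
    return estimates


# Illustrative numbers only, not real surveillance data.
cases = [1, 2, 4, 8, 16]
weights = [0.5, 0.5]  # assumed two-day serial interval
rt = instantaneous_rt(cases, weights)
```

In this toy series the cases double each day, so the ratio stays above 1. A constant case count with the same weights gives a ratio of 1 once two days of history are available. A practical estimator would also smooth over a time window and report uncertainty, which this sketch omits.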
