Using data to improve public health: COVID-19 secondment

This project will develop computational methods to predict the severity and duration of COVID-19 by applying machine learning to metabolic biomarker data from cohort studies. Accurate predictions are crucial for identifying the individuals most at risk of serious effects of COVID-19. The data consist of sociodemographic information (age, sex, ethnicity, etc.), information on health conditions before COVID-19, and metabolic markers from biofluids including blood, urine and faeces. Incorporating metabolomic data into the analysis is expected to significantly enhance our ability to predict the severity of COVID-19 compared with methods that rely on, for example, sociodemographic data only. The project will study both the severity of COVID-19 and the duration of symptoms.

The specific aims of the project are the following:

Aim 1. To identify metabolic biomarkers associated with severe COVID-19 and long COVID.

Aim 2. To train computer programs to predict the susceptibility of individuals to severe COVID-19 and long COVID.

In practice, the aims will be addressed separately for the severity of COVID-19 and the duration of symptoms. The overall goal, however, is to integrate the results for both characteristics and provide a general view of how metabolomics can help understand the manifestations of COVID-19.

The severity of COVID-19 will be quantified in terms of whether or not patients show symptoms. For Aim 1, associations between the characteristics of individuals and the presence or absence of symptoms will be explored using statistical methods such as graphical visualisation, hypothesis testing and logistic regression. Feature selection and dimensionality reduction strategies will be used to identify the features most relevant to symptoms. For Aim 2, machine learning models will be trained to classify individuals automatically as symptomatic or asymptomatic. A variety of machine learning techniques will be implemented; partial least squares discriminant analysis, support vector machines and artificial neural networks are expected to be particularly suitable for the high dimensionality and correlated nature of metabolomic data.

Several descriptions of the duration of symptoms will be considered, each requiring a different degree of statistical power to be feasible. If the data provide enough statistical power, the most natural approach is to treat the duration as a continuous random variable. In this case, Aim 1 will be addressed by using regression methods to assess the statistical significance of the different predictor variables measured for each individual. A range of machine learning methods will be explored to train a predictor of the duration of symptoms; suitable candidates may include partial least squares regression, principal component regression and artificial neural networks.

An alternative description that requires less statistical power consists of discretising the duration into several categories, for example short (≤10 days) and long (>10 days), corresponding to short and long COVID, respectively. In this case, Aims 1 and 2 can be achieved using methods similar to those described above for the presence or absence of symptoms.
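
As an illustration of the discretised approach, the following minimal Python sketch labels durations as short or long using the 10-day cut-off mentioned above and then fits a logistic regression by plain gradient descent. It is only a sketch under stated assumptions, not the project's actual pipeline: the data are synthetic, the single "marker" feature is made up, and the function names (discretise_duration, fit_logistic, predict) are hypothetical. A real analysis would more likely rely on an established library and on many correlated metabolomic features.

```python
import math
import random

LONG_THRESHOLD_DAYS = 10  # cut-off from the text: short (<=10 days), long (>10 days)


def discretise_duration(days):
    """Map a symptom duration in days to 0 (short) or 1 (long)."""
    return 1 if days > LONG_THRESHOLD_DAYS else 0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression by batch gradient descent.

    X is a list of feature rows and y a list of 0/1 labels.
    Returns the tuple (weights, bias).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - label
            for j in range(d):
                grad_w[j] += err * row[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, row):
    """Return the predicted class (0 = short, 1 = long) for one feature row."""
    z = sum(wi * xi for wi, xi in zip(w, row)) + b
    return 1 if sigmoid(z) >= 0.5 else 0


# Synthetic data for illustration only: one made-up standardised marker
# whose value is (by construction) positively related to symptom duration.
random.seed(0)
marker = [random.gauss(0.0, 1.0) for _ in range(200)]
durations = [max(1.0, 10.0 + 4.0 * m + random.gauss(0.0, 2.0)) for m in marker]

X = [[m] for m in marker]
y = [discretise_duration(d) for d in durations]

w, b = fit_logistic(X, y)
accuracy = sum(predict(w, b, r) == t for r, t in zip(X, y)) / len(y)
print(f"weight={w[0]:.2f}, bias={b:.2f}, training accuracy={accuracy:.2f}")
```

The accuracy printed here is measured on the training data and is therefore optimistic; with real cohort data, performance would need to be assessed on held-out individuals or by cross-validation.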

Publications linked via Europe PMC

Impact of heterogeneity on infection probability: Insights from single-hit dose-response models.

ESPClust: unsupervised identification of modifiers for the effect size profile in omics association studies.

Age-specific all-cause mortality trends in the UK: Pre-pandemic increases and the complex impact of COVID-19.
