A simulation-based pre-training framework for building more robust and trustworthy machine learning-based clinical prediction models

Research Question -- Can we provide a general framework for building more robust machine learning-based clinical prediction models?

Background -- Machine learning-based clinical prediction models learn the relationship between predictors and a future outcome by looking for patterns in previously collected historical data. However, the historical data may not be fully representative of the wider population, may be biased, or may change over time. As a consequence, prediction models may fail to perform under conditions for which they have not been trained. There is a pressing need for methodologies to build more robust machine learning-based clinical prediction models.

Aims and objectives -- Our aim is to develop a general pre-training framework for clinical prediction models that augments real historical data with simulated data at training time. The simulated data will describe challenging scenarios or constraints not seen in the historical data. By introducing these synthetic examples at training time, we aim to make the prediction models more robust, so that when they encounter unusual data in real-world use they already have mechanisms to handle it. Our objectives are to (i) embed approaches to simulate complex high-dimensional data types to enable a richer range of applications, and to demonstrate how the framework can be used to build improved prediction models in the presence of (ii) data drift and (iii) algorithmic fairness constraints.

Methods -- We will explore deep learning-based techniques for synthetic high-dimensional data generation and, using both molecular and image data, develop an example use of our pre-training framework to construct an ovarian cancer prognosis model with improved out-of-distribution consistency compared to the original published model. We will review and embed recent approaches to data drift simulation in our framework and, using a published COVID-19 prediction modelling example, demonstrate how a prediction model can be made resilient to different forms of data drift for longer. Finally, we will explore the algorithmic fairness literature to identify common fairness constraints and build these into prediction models as pre-trained properties. We will illustrate the utility of this pre-training for gender and ethnicity-related fairness using a recent Welsh Childhood Mental Health study example.

Timelines for delivery -- This is a 36-month project and we will broadly address each of the three objectives consecutively in 12-month blocks.

Anticipated impact and dissemination -- We have already developed two application examples that fit within this pre-training framework, and our objectives seek to develop three further examples to demonstrate the broad utility of the framework. We will also seek to develop training materials and toolkits in the use of these techniques and serve them through a national learning platform. Our aim is to encourage wider adoption of these techniques by prediction model developers.
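The core idea in the aims, augmenting real historical data with simulated data at training time, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the project's actual method: the two-predictor risk model, the hypothetical helpers `sample_cohort` and `fit_logistic`, the size of the simulated covariate shift, and the 0.5 weight given to simulated rows.

```python
import numpy as np

rng = np.random.default_rng(0)

def outcome_probability(X):
    # Assumed ground-truth risk model, used only to label the toy data.
    return 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - 1.0 * X[:, 1])))

def sample_cohort(n, mean, rng):
    # Draw two predictors around `mean` and sample binary outcomes
    # from the assumed risk model.
    X = rng.normal(loc=mean, scale=1.0, size=(n, 2))
    y = rng.binomial(1, outcome_probability(X))
    return X, y

def fit_logistic(X, y, sample_weight, lr=0.1, epochs=500):
    # Weighted logistic regression fitted by plain gradient descent.
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        grad = Xb.T @ (sample_weight * (p - y)) / sample_weight.sum()
        w -= lr * grad
    return w

# Stand-in for "historical" data, plus a simulated scenario whose
# predictors are shifted away from anything seen historically.
X_real, y_real = sample_cohort(500, mean=[0.0, 0.0], rng=rng)
X_sim, y_sim = sample_cohort(200, mean=[2.0, -1.0], rng=rng)

# Augment: stack real and simulated rows, down-weighting the simulated
# rows (the 0.5 weight is an arbitrary illustrative choice).
X_train = np.vstack([X_real, X_sim])
y_train = np.concatenate([y_real, y_sim])
weights = np.concatenate([np.ones(len(y_real)), np.full(len(y_sim), 0.5)])

w = fit_logistic(X_train, y_train, weights)
```

In a real application the simulated cohort would come from a purpose-built simulator (for example a generative model or a drift simulator) and the model would typically be far richer than a logistic regression; the sketch only shows where simulated rows enter the training set.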

1 Publication linked via Europe PMC

Understanding the spatial determinants of the Oxford Classic prognostic signature for high-grade serous ovarian cancer.
