Pandemic Pact

A simulation-based pre-training framework for building more robust and trustworthy machine learning-based clinical prediction models

  • Funded by Department of Health and Social Care / National Institute for Health and Care Research (DHSC-NIHR)
  • Total publications: 1

Grant number: NIHR173695

Key facts

  • Disease

    COVID-19
  • Start & end year

    2025
    2028
  • Known Financial Commitments (USD)

    $585,693.25
  • Funder

    Department of Health and Social Care / National Institute for Health and Care Research (DHSC-NIHR)
  • Principal Investigator

    N/A

  • Research Location

    United Kingdom
  • Lead Research Institution

    University of Oxford
  • Research Priority Alignment

    N/A
  • Research Category

    N/A

  • Research Subcategory

    N/A

  • Special Interest Tags

    N/A

  • Study Type

    Non-Clinical

  • Clinical Trial Details

    N/A

  • Broad Policy Alignment

    Pending

  • Age Group

    Not Applicable

  • Vulnerable Population

    Not applicable

  • Occupations of Interest

    Not applicable

Abstract

Research Question -- Can we provide a general framework for building more robust machine learning-based clinical prediction models?

Background -- Machine learning-based clinical prediction models learn the relationship between predictors and a future outcome by looking for patterns in historical data. However, the historical data may not be fully representative of the wider population, may be biased, or may change over time. As a consequence, prediction models may fail to perform under conditions for which they have not been trained. There is a pressing need for methodologies to build more robust machine learning-based clinical prediction models.

Aims and objectives -- Our aim is to develop a general pre-training framework for clinical prediction models that augments real historical data with simulated data at training time. Simulated data will describe challenging scenarios or constraints not seen in the historical data. By introducing these synthetic illustrations at training time, we aim to make the prediction models more robust, so that when they encounter unusual data in real-world use they already have resilient mechanisms to handle such situations. Our objectives are to (i) embed approaches to simulate complex high-dimensional data types to enable a richer range of applications, and to demonstrate how the framework can be used to build improved prediction models in the presence of (ii) data drift and (iii) algorithmic fairness constraints.

Methods -- We will explore deep learning-based techniques for synthetic high-dimensional data generation and, using both molecular and image data, develop an example use of our pre-training framework to construct an ovarian cancer prognosis model that has improved out-of-distribution consistency compared to the original published model. We will review and embed recent approaches to data drift simulation in our framework and, using a published COVID-19 prediction modelling example, demonstrate how a prediction model can be made to remain resilient to different forms of data drift for longer. Finally, we will explore the algorithmic fairness literature to identify common fairness constraints and build these into prediction models as pre-trained properties. We will illustrate the utility of this pre-training for gender and ethnicity-related fairness using a recent Welsh Childhood Mental Health study example.

Timelines for delivery -- This is a 36-month project and we will broadly address each of the three objectives consecutively in 12-month blocks.

Anticipated impact and dissemination -- We have already developed two application examples that fit within this pre-training framework, and our objectives seek to develop three further examples to demonstrate the broad utility of the framework. We will also seek to develop training materials and toolkits for the use of these techniques and deliver them through a national learning platform. Our aim is to encourage wider adoption of these techniques by prediction model developers.
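Illustrative sketch (not part of the funded project's code): the core idea in the abstract, augmenting real historical data with simulated data at training time, can be shown on a toy example. Everything here is an assumption made for illustration: the function names `simulate_shifted`, `fit_logistic` and `predict_proba` are hypothetical, the "real" data are randomly generated, the simulated drift is a simple mean shift in one predictor with the outcome left unchanged, and a plain logistic regression stands in for whatever prediction model the framework would actually use.

```python
import numpy as np

def simulate_shifted(X, y, shift, rng):
    """Hypothetical simulator: resample real rows, then add a fixed
    offset plus small noise to the predictors (outcomes are unchanged)."""
    idx = rng.integers(0, len(X), size=len(X))
    X_sim = X[idx] + shift + rng.normal(scale=0.1, size=X.shape)
    return X_sim, y[idx]

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Gradient-descent logistic regression, a stand-in for any model."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    """Predicted outcome probabilities for the rows of X."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

rng = np.random.default_rng(0)

# Toy "historical" data: 200 patients, 3 predictors, binary outcome.
X_real = rng.normal(size=(200, 3))
y_real = (X_real[:, 0] + 0.5 * X_real[:, 1] > 0).astype(float)

# Simulated scenario not seen in the historical data: predictor 0 drifts by +1.
X_sim, y_sim = simulate_shifted(
    X_real, y_real, shift=np.array([1.0, 0.0, 0.0]), rng=rng
)

# Augment the real data with the simulated data, then train on both.
X_aug = np.vstack([X_real, X_sim])
y_aug = np.concatenate([y_real, y_sim])
w_aug = fit_logistic(X_aug, y_aug)
probs = predict_proba(w_aug, X_real)
```

Whether such augmentation actually improves robustness depends on how well the simulator reflects the drift met in deployment, so a model trained this way would need to be evaluated on held-out data that exhibits the drift of interest.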

1 Publication linked via Europe PMC

Understanding the spatial determinants of the Oxford Classic prognostic signature for high-grade serous ovarian cancer.