Addressing Algorithmic Unreliability and Dataset Shift in EHR-based Risk Prediction Models

Project Summary

Predictive analytic algorithms built on electronic health record (EHR) inputs, such as patient characteristics, administrative codes, and lab values, are increasingly used in health care settings to direct resources to high-risk patients. Data play an indispensable role in the development and deployment of effective predictive models. The greatest, yet understudied, challenge in maintaining these tools is a data-related concern, namely dataset shift, in which the distribution of the training data differs from that of the population on which the algorithm is deployed, leading to model deterioration and inaccurate risk predictions. Dataset shift is a pervasive cause of algorithmic unreliability in EHR-based models because inevitable changes in physician behaviors and health system operations alter (1) the input distribution (covariate drift) and (2) the relationship between predictors and outcome (concept drift). Sudden changes in health care utilization during the COVID-19 pandemic may have affected the data generation process and the performance of clinical predictive models. Our preliminary study showed that decreased collection of patient labs during the COVID-19 quarantine period led to sparse data for important predictors in a single-institution EHR-based mortality risk prediction algorithm, which consequently underpredicted risk for patients with advanced cancers. Despite the increasing use of predictive tools in high-stakes clinical applications and growing recognition of dataset shift, we lack a framework for reasoning about shift and its effects on care delivery, and for proactively addressing shift to maintain performance over time. In Aim 1, we propose to extend prior work on shift to a nationally deployed risk prediction algorithm, the VA Care Assessment Need (CAN) model, which is used on millions of VA beneficiaries each year.
The VA CAN model predicts the likelihood of hospitalization within 90 days or 1 year after a primary care encounter to identify high-risk patients who would benefit from additional outpatient interventions. We also investigate covariate and concept drift as two possible mechanisms for COVID-19-associated dataset shift. In Aim 2, we apply an interrupted time series design to study the association between sudden shift at the onset of the pandemic and case-management decisions. Current solutions to dataset shift have primarily been reactive (i.e., model retraining with recent data) and fail to be robust in new testing environments. In Aim 3, we consider revising the VA CAN model via machine learning and the inclusion of variables that reflect potential drivers of shift. This project is innovative in that it is the first to leverage a rigorous statistical framework to study the extent and mechanisms of shift and to develop proactive guidelines for model maintenance. The training plan is rigorous for Ms. Kolla, an MD-PhD student in biostatistics. She is strongly supported by her department and institution as well as by her two highly qualified sponsors: Dr. Jinbo Chen, an expert in EHR-based risk prediction modeling, and Dr. Ravi Parikh, an expert in the implementation of predictive analytics. The proposed research and career development plan will be an essential step in Ms. Kolla's development as an interdisciplinary and independent physician-scientist.
