CAREER: Advancing the Role of Ontologies for Data Science in Biomedicine

  • Funded by National Science Foundation (NSF)
  • Total publications: 7

Grant number: 2047001

Key facts

  • Disease

    COVID-19
  • Start & end year

    2021
    2026
  • Known Financial Commitments (USD)

    $213,408
  • Funder

    National Science Foundation (NSF)
  • Principal Investigator

    Licong Cui
  • Research Location

    United States of America
  • Lead Research Institution

    The University of Texas Health Science Center at Houston
  • Research Priority Alignment

    N/A
  • Research Category

    13

  • Research Subcategory

    N/A

  • Special Interest Tags

    Data Management and Data Sharing

  • Study Type

    Non-Clinical

  • Clinical Trial Details

    N/A

  • Broad Policy Alignment

    Pending

  • Age Group

    Not applicable

  • Vulnerable Population

    Not applicable

  • Occupations of Interest

    Not applicable

Abstract

An ontology is a formal representation of concepts (or classes), properties, and relationships between concepts within a knowledge domain. Ontologies and terminologies have played a vital role in biomedical research for coding, managing, sharing, and exchanging the vast amounts of heterogeneous biomedical data that are continuously generated, such as in Electronic Health Records (EHRs). EHRs have been widely used in translational research to learn predictive models for discovery and disease management across varying patient cohorts. The first step in such EHR-based applications is often patient cohort identification. Cohort identification involves specifying a collection of eligibility criteria that must be transformed into a computable representation using the EHR's semantic backbone (i.e., coding systems or ontologies) before queries can run against the EHR database. However, there are two critical barriers to effective cohort identification from large-scale EHRs. The first is data (or semantic) heterogeneity, caused by the mixed use of coding systems. The second is the quality of the semantic backbone or ontology hierarchy, which is essential for translating patient eligibility criteria into executable database queries. To address these challenges, this project will develop new methods for ontology matching and for ontology quality enhancement that directly impact data science practice in biomedicine, such as patient cohort identification. In addition, this project will incorporate the proposed computational aspects into data science-based courses to train the next generation of data scientists.
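As a rough illustration of why hierarchy quality matters for cohort queries, the sketch below expands an eligibility concept to all of its is-a descendants and matches patient records against the expanded set. It is a minimal sketch under stated assumptions, not the project's method: the concept names, the toy hierarchy, the patient records, and the helpers `descendants` and `cohort` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy is-a hierarchy (child -> list of parents).
# Concept names are illustrative and not taken from any real terminology release.
IS_A = {
    "viral pneumonia": ["pneumonia"],
    "COVID-19 pneumonia": ["viral pneumonia"],
    "bacterial pneumonia": ["pneumonia"],
}

# Hypothetical patient records: (patient id, coded diagnosis).
RECORDS = [
    ("p1", "COVID-19 pneumonia"),
    ("p2", "bacterial pneumonia"),
    ("p3", "asthma"),
]


def descendants(concept, is_a):
    """Return the concept itself plus every concept below it via is-a links."""
    children = defaultdict(set)
    for child, parents in is_a.items():
        for parent in parents:
            children[parent].add(child)
    seen, stack = {concept}, [concept]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def cohort(criterion, is_a, records):
    """Select patients whose code is the criterion concept or one of its descendants."""
    codes = descendants(criterion, is_a)
    return sorted(pid for pid, code in records if code in codes)


# Query against the complete hierarchy.
full = cohort("pneumonia", IS_A, RECORDS)

# Simulate a quality defect: the is-a link for "COVID-19 pneumonia" is missing.
defective = {k: v for k, v in IS_A.items() if k != "COVID-19 pneumonia"}
partial = cohort("pneumonia", defective, RECORDS)

print(full)     # ['p1', 'p2']
print(partial)  # ['p2']
```

In this toy setting, a single missing subclass relation silently drops patient `p1` from the cohort, which is the kind of effect that defect detection in ontology hierarchies aims to prevent.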

This project consists of three research objectives. In Objective 1, the PI will develop new graph neural network (GNN)-based learning methods for matching biomedical ontologies by harnessing knowledge embedded in sources such as the Unified Medical Language System. This will address the heterogeneity issue and achieve semantic interoperability. In Objective 2, the PI will develop learning-based methods for detecting quality defects in subclass relations. This will address the quality issue and support continued enhancement of ontology hierarchies. In Objective 3, the PI will develop an ontology-based COVID-19 query engine for patient cohort identification, a real-world application of enhanced semantic interoperability in support of data-driven COVID-19 research. To evaluate the proposed methods, domain experts will validate the resulting matched concepts and the detected quality issues. The PI will communicate validated quality issues to the respective ontology owners for correction in subsequent ontology versions.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Publications linked via Europe PMC

Quantitatively assessing the impact of the quality of SNOMED CT subtype hierarchy on cohort queries.

Leveraging logical definitions and lexical features to detect missing IS-A relations in biomedical terminologies.

Logical definition-based identification of potential missing concepts in SNOMED CT.

A deep learning approach to identify missing is-a relations in SNOMED CT.

Towards quality improvement of vaccine concept mappings in the OMOP vocabulary with a semi-automated method.

Identification of missing hierarchical relations in the vaccine ontology using acquired term pairs.

An evidence-based lexical pattern approach for quality assurance of Gene Ontology relations.