CAREER: Discourse Processing and Content Generation for Document Simplification

This award is funded in whole or in part under the American Rescue Plan Act of 2021 (Public Law 117-2).

Simplification is the process of making a text more accessible to a target audience (e.g., language learners, children, and individuals with language impairments) while preserving its meaning and content. The lack of accessible material can exacerbate social issues. For example, the complexity of the language used in college admission and financial aid applications has contributed to lagging access to higher education among emergent bilingual students, and the WHO has recognized the urgency of accessible technical information, given the rise of medical misinformation, especially in the wake of the COVID-19 pandemic. While there has been much work on sentence simplification, very few datasets are large enough to train supervised models. Simplifying a document also involves operations that differ from those at the sentence level, including adding content and managing how sentences connect with each other. This project aims to develop new resources and data-driven approaches for document simplification, with the potential to address information transparency and fair access across a range of high-stakes domains. This project will also support the education and training of a diverse body of undergraduate and graduate students across disciplines.

To substantially advance document simplification, this CAREER project will tackle several key issues in existing simplification work, including corpus diversity, explanation generation, and document-level approaches. This will be achieved through the following research activities: (1) introducing new corpora that address the pressing challenge of data diversity in simplification research and enable new application scenarios, especially in the accessibility of technical and jargon-laden texts; (2) tackling content addition and elaboration during simplification, a previously little-explored challenge, and proposing a novel, linguistically informed framework that characterizes and generates elaborations; and (3) developing models for document simplification that are informed by discourse structure, using both coherence structure and entity salience. These innovative ways of integrating discourse target a larger challenge for models: taking stretches of discourse into account.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
