No more business as usual: responding to the COVID-19 pandemic through open data genome and evolutionary analytics.

  • Funded by National Institutes of Health (NIH)
  • Total publications: 0

Grant number: 3R01AI134384-05S1

Key facts

  • Disease

    COVID-19
  • Start & end year

    2020
    2022
  • Known Financial Commitments (USD)

    $368,281
  • Funder

    National Institutes of Health (NIH)
  • Principal Investigator

    Anton Nekrutenko
  • Research Location

    United States of America
  • Lead Research Institution

    N/A
  • Research Priority Alignment

    N/A
  • Research Category

    13

  • Research Subcategory

    N/A

  • Special Interest Tags

    Data Management and Data Sharing

  • Study Type

    Not applicable

  • Clinical Trial Details

    N/A

  • Broad Policy Alignment

    Pending

  • Age Group

    Not applicable

  • Vulnerable Population

    Not applicable

  • Occupations of Interest

    Not applicable

Abstract

The rapid worldwide spread and severe regional outbreaks of COVID-19 following its emergence in Wuhan in November 2019 have created a sense of urgency and alarm. There are many more cases (>100,000) and deaths (~5,000) than in other recent viral outbreaks/epidemics (SARS, MERS, Ebola and Zika viruses), but in many other respects the epidemic is "typical": zoonotic introduction from an (as yet undetermined) animal reservoir, followed by a period of undetected transmission among humans (with possible adaptation to the new host), and then generalized transmission. The same types of questions arise during each of these emerging outbreaks: Where did the pathogen come from? Is it evolving in the human population? How is it spreading? How can reliable diagnostics be developed? What are promising vaccine targets? The answers to many, if not all, of these questions depend on rapid and reliable genomic analysis of diverse viral sample sequences by multiple laboratories. Yet, time and time again, including with COVID-19, we encounter the same avoidable shortcomings early in the viral investigation: lack of reproducibility, rigor, and data/analytic sharing. The initial publications describing genomic features of COVID-19 [1-4] used Illumina and Oxford Nanopore data to elucidate the sequence composition of patient specimens (although only Wu et al. [3] explicitly provided the accession numbers for their raw short-read sequencing data). However, their approaches to processing, assembly, and analysis of raw data differed widely and ranged from transparent [3] to entirely opaque [4]. Such lack of analytical transparency sets a dangerous precedent. Infectious disease outbreaks often occur in locations where the infrastructure necessary for data analysis may be inaccessible or unbiased interpretation of results may be politically untenable.
Essential questions such as the extent of intra-host genomic variability (indicative of adaptation or multiple infection), viral evolution (selection, recombination), and transmission (phylogenetic and phylogeographic) cannot be answered reliably if researchers cannot trust or replicate the source data and analytical approaches. The key goals/deliverables of this supplement will be open analytic workflows that can be used to curate and standardize genomic data, and high-quality annotated variation data for SARS-CoV-2 and potential future outbreaks. These workflows will be distributed through proven, fully open, and highly used infrastructure provided by the Galaxy (http://covid19.galaxyproject.org) and HyPhy/Datamonkey (http://covid19.datamonkey.org/) projects.