Tuning big data analysis infrastructure for HIV research

  • Funded by National Institutes of Health (NIH)
  • Total publications: 0 publications

Grant number: unknown

Key facts

  • Disease

  • Start & end year

  • Known Financial Commitments (USD)

  • Funder

    National Institutes of Health (NIH)
  • Principal Investigator

  • Research Location

    United States of America, Americas
  • Lead Research Institution

  • Research Category

    Pathogen: natural history, transmission and diagnostics

  • Research Subcategory

    Pathogen genomics, mutations and adaptations

  • Special Interest Tags


  • Study Subject


  • Clinical Trial Details


  • Broad Policy Alignment


  • Age Group

    Not applicable

  • Vulnerable Population

    Not applicable

  • Occupations of Interest

    Not applicable


Summary

The COVID-19/SARS-CoV-2 pandemic is a once-in-a-generation, "all-hands-on-deck" event for the scientific community. This pandemic is also the first in which real-time genomic data are available, e.g. via GISAID [1], where genomic sequences are deposited daily. Vital insights about the virus and the epidemic depend on rapid and reliable genomic analysis of diverse viral sample sequences by multiple laboratories. Yet we repeatedly encounter the same avoidable shortcomings early in viral investigations, including COVID-19: lack of reproducibility, rigor, and data/analytic sharing. Only about 10% of the published genomes have quality metrics, primary data (read files), or any level of detail on analytics, making these data irreproducible and unverifiable; over 40% of GISAID submissions to date provide no information about how the sequences were generated. Essential questions about the extent of intra-host genomic variability (indicative of adaptation or multiple infection), viral evolution (selection, recombination), and transmission (phylogenetic and phylogeographic) cannot be answered reliably if researchers cannot trust or replicate the source data and analytical approaches. One of the key goals and deliverables of this supplement will be open analytic workflows that can be used to curate and standardize genomic data, together with high-quality annotated variation data.