No more business as usual: responding to the COVID-19 pandemic through open data genome and evolutionary analytics.

The rapid worldwide spread and severe regional outbreaks of COVID-19 following its emergence in Wuhan in November 2019 have created a sense of urgency and alarm. There are many more cases (>100,000) and deaths (~5,000) than in other recent viral outbreaks/epidemics (SARS, MERS, Ebola and Zika viruses), but in many other respects the epidemic is "typical": zoonotic introduction from an (as yet undetermined) animal reservoir, followed by a period of undetected transmission among humans (with possible adaptation to the new host), and then generalized transmission. The same types of questions arise during each of these emerging outbreaks: Where did the pathogen come from? Is it evolving in the human population? How is it spreading? How can reliable diagnostics be developed? What are promising vaccine targets? Many, if not all, of these questions depend on rapid and reliable genomic analysis of diverse viral sample sequences by multiple laboratories. Yet, time and time again, including with COVID-19, we encounter the same avoidable shortcomings early in the viral investigation: lack of reproducibility, rigor, and data/analytic sharing. The initial publications describing genomic features of COVID-19 [1-4] used Illumina and Oxford Nanopore data to elucidate the sequence composition of patient specimens (although only Wu et al. [3] explicitly provided the accession numbers for their raw short-read sequencing data). However, their approaches to processing, assembly, and analysis of raw data differed widely and ranged from transparent [3] to entirely opaque [4]. Such lack of analytical transparency sets a dangerous precedent. Infectious disease outbreaks often occur in locations where the infrastructure necessary for data analysis may be inaccessible, or where unbiased interpretation of results may be politically untenable.
Essential questions such as the extent of intra-host genomic variability (indicative of adaptation or multiple infection), viral evolution (selection, recombination), and transmission (phylogenetics and phylogeography) cannot be answered reliably if researchers cannot trust or replicate the source data and analytical approaches. The key goals/deliverables of this supplement will be open analytic workflows that can be used to curate and standardize genomic data, and high-quality annotated variation data for SARS-CoV-2 and potential future outbreaks. These workflows will be distributed through proven, fully open, and widely used infrastructure provided by the Galaxy (http://covid19.galaxyproject.org) and HyPhy/Datamonkey (http://covid19.datamonkey.org/) projects.

