the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Now, the two researchers used genomic sequencing to compare the DNA of the new coronavirus in humans with that in animals and found a 99% match with pangolins. Holmes, E. C. The Evolution and Emergence of RNA Viruses (Oxford Univ. Ji, W., Wang, W., Zhao, X., Zai, J. 24, 490502 (2016). B., Weaver, S. & Sergei, L. Evidence of significant natural selection in the evolution of SARS-CoV-2 in bats, not humans. NTD, N-terminal domain; CTD, C-terminal domain. Softw. Nature 579, 270273 (2020). Isolation and characterization of a bat SARS-like coronavirus that uses the ACE2 receptor. It compares the new genome against the large, diverse population of sequenced strains using a 2a. RegionC showed no PI signals within it. Posterior distributions were approximated through Markov chain Monte Carlo sampling, which were run sufficiently long to ensure effective sampling sizes >100. Regions AC were further examined for mosaic signals by 3SEQ, and all showed signs of mosaicism. PubMed Our approach resulted in similar posterior rates using two different prior means, implying that the sarbecovirus data do inform the rate estimate even though a root-to-tip temporal signal was not apparent. performed codon usage analysis. To employ phylogenetic dating methods, recombinant regions of a 68-genome sarbecovirus alignment were removed with three independent methods. Curr. Share . PubMed M.F.B. Instead, similarity in codon usage metrics between the SARS-CoV-2 and eukaryotes analyzed was correlated with coding sequence GC content of the eukaryote, with more similar codon usage being identified in eukaryotes with low GC content similar to that of the coronavirus (b). The command line tool is open source software available under the GNU General Public License v3.0. The red and blue boxplots represent the divergence time estimates for SARS-CoV-2 (red) and the 2002-2003 SARS-CoV (blue) from their most closely related bat virus, with the light- and dark-colored versions based on the HCoV-OC43 and MERS-CoV centered priors, respectively. PANGOLIN lineage database (15, 16) was used to analyze the frequency of lineages among countries. 82, 48074811 (2008). 30, 21962203 (2020). Viruses 11, 979 (2019). Given what was known about the origins of SARS, as well as identification of SARS-like viruses circulating in bats that had binding sites adapted to human receptors29,30,31, appropriate measures should have been in place for immediate control of outbreaks of novel coronaviruses. Virus Evol. Lam, H. M., Ratmann, O. Robertson, D. nCoVs relationship to bat coronaviruses & recombination signals (no snakes) no evidence the 2019-nCoV lineage is recombinant. Intragenomic rearrangements involving 5-untranslated region segments in SARS-CoV-2, other betacoronaviruses, and alphacoronaviruses, Crystal structure of the CoV-Y domain of SARS-CoV-2 nonstructural protein 3, Association of underlying comorbidities and progression of COVID-19 infection amongst 2586 patients hospitalised in the National Capital Region of India: a retrospective cohort study, Molecular characterization of horse nettle virus A, a new member of subgroup B of the genus Nepovirus, Molecular phylogeny of coronaviruses and host receptors among domestic and close-contact animals reveals subgenome-level conservation, crossover, and divergence. PI signals were identified (with bootstrap support >80%) for seven of these eight breakpoints: positions 1,684, 3,046, 9,237, 11,885, 21,753, 22,773 and 24,628. Early detection via genomics was not possible during Southeast Asias initial outbreaks of avian influenza H5N1 (1997 and 20032004) or the first SARS outbreak (20022003). Pangolin relies on a novel algorithm called pangoLEARN. However, the coronavirus isolated from pangolin is similar at 99% in a specific region of the S protein, which corresponds to the 74 amino acids involved in the ACE (Angiotensin Converting Enzyme . Proc. The boxplots show divergence time estimates (posterior medians) for SARS-CoV-2 (red) and the 20022003 SARS-CoV virus (blue) from their most closely related bat virus. Maclean, O. CAS Epidemiology, genetic recombination, and pathogenesis of coronaviruses. It is available as a command line tool and a web application. It performs: K-mer based detection Map/align, variant calling Consensus sequence generation Lineage/clade analysis using Pangolin and NextClade Access the DRAGEN COVID Lineage App on BaseSpace Sequence Hub Trends Microbiol. Mol. Zhang, Y.-Z. Specifically, progenitors of the RaTG13/SARS-CoV-2 lineage appear to have recombined with the Hong Kong clade (with inferred breakpoints at 11.9 and 20.8kb) to form the CoVZXC21/CoVZC45-lineage. It allows a user to assign a SARS-CoV-2 genome sequence the most likely lineage (Pango lineage) to SARS-CoV-2 query sequences. Sorting these breakpoint-free regions (BFRs) by length results in two segments >5kb: an ORF1a subregion spanning nucleotides (nt) 3,6259,150 and the first half of ORF1b spanning nt13,29119,628 (sequence numbering given in Source Data, https://github.com/plemey/SARSCoV2origins). A reduced sequence set of 25sequences chosen to capture the breadth of diversity in the sarbecoviruses (obvious recombinants not involving the SARS-CoV-2 lineage were also excluded) was used because GARD is computationally intensive. Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic. Posterior means with 95% HPDs are shown in Supplementary Information Table 2. J. Virol. Researchers in the UK had just set the scientific world . Virology 507, 110 (2017). We thank all authors who have kindly deposited and shared genome data on GISAID. By 2009, however, rapid genomic analysis had become a routine component of outbreak response. S. China corresponds to Guangxi, Yunnan, Guizhou and Guangdong provinces. The divergence time estimates for SARS-CoV-2 and SARS-CoV from their respective most closely related bat lineages are reasonably consistent among the three approaches we use to eliminate the effects of recombination in the alignment. MERS-CoV data were subsampled to match sample sizes with SARS-CoV and HCoV-OC43. The first available sequence data6 placed this novel human pathogen in the Sarbecovirus subgenus of Coronaviridae7, the same subgenus as the SARS virus that caused a global outbreak of >8,000 cases in 20022003. Add entries for pangolin-data/-assignment 1.18.1.1 (, Really add a document on testing strategy. Anderson, K. G. nCoV-2019 codon usage and reservoir (not snakes v2). Google Scholar. Kosakovsky Pond, S. L., Posada, D., Gravenor, M. B., Woelk, C. H. & Frost, S. D. W. Automated phylogenetic detection of recombination using a genetic algorithm. Phylogenetic supertree reveals detailed evolution of SARS-CoV-2, Origin and cross-species transmission of bat coronaviruses in China, Emerging SARS-CoV-2 variants follow a historical pattern recorded in outgroups infecting non-human hosts, Inferring the ecological niche of bat viruses closely related to SARS-CoV-2 using phylogeographic analyses of Rhinolophus species, Genomic recombination events may reveal the evolution of coronavirus and the origin of SARS-CoV-2, A Bayesian approach to infer recombination patterns in coronaviruses, Metagenomic identification of a new sarbecovirus from horseshoe bats in Europe, A comparative recombination analysis of human coronaviruses and implications for the SARS-CoV-2 pandemic, Pandemic-scale phylogenomics reveals the SARS-CoV-2 recombination landscape, https://github.com/plemey/SARSCoV2origins, https://doi.org/10.1101/2020.04.20.052019, https://doi.org/10.1101/2020.02.10.942748, https://doi.org/10.1101/2020.05.28.122366, http://virological.org/t/ncov-2019-codon-usage-and-reservoir-not-snakes-v2/339, http://virological.org/t/ncovs-relationship-to-bat-coronaviruses-recombination-signals-no-snakes-no-evidence-the-2019-ncov-lineage-is-recombinant/331. Humans' selfish, speciesist treatment of these animals could be the very reason why the novel coronavirus exists. Sibling lineages to RaTG13/SARS-CoV-2 include a pangolin sequence sampled in Guangdong Province in March 2019 and a clade of pangolin sequences from Guangxi Province sampled in 2017. 3). Sequencing from Malayan pangolins collected during anti-smuggling operations in southern China detected coronavirus lineages related to SARS-CoV-2. Methods Ecol. M.F.B., P.L. 21, 255265 (2004). For coronaviruses, however, recombination means that small genomic subregions can have independent origins, identifiable if sufficient sampling has been done in the animal reservoirs that support the endemic circulation, co-infection and recombination that appear to be common. The difficulty in inferring reliable evolutionary histories for coronaviruses is that their high recombination rate48,49 violates the assumption of standard phylogenetic approaches because different parts of the genome have different histories. We find that the sarbecovirusesthe viral subgenus containing SARS-CoV and SARS-CoV-2undergo frequent recombination and exhibit spatially structured genetic diversity on a regional scale in China. While pangolins could be acting as intermediate hosts for bat viruses to get into humansthey develop severe respiratory disease38 and commonly come into contact with people through traffickingthere is no evidence that pangolin infection is a requirement for bat viruses to cross into humans. B 281, 20140732 (2014). Biol. Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. COVID-19 lineage names can be confusing to navigate; there are many aliases and if you want to catch them all to examine further in data analyses it helps to Allen O'Brien on LinkedIn: #r #rstudio #rstats #pangolin #covid19 #datascience #epidemiology The SARS-CoV divergence times are somewhat earlier than dates previously estimated15 because previous estimates were obtained using a collection of SARS-CoV genomes from human and civet hosts (as well as a few closely related bat genomes), which implies that evolutionary rates were predominantly informed by the short-term SARS outbreak scale and probably biased upwards. If stopping an outbreak in its early stages is not possibleas was the case for the COVID-19 epidemic in Hubeiidentification of origins and point sources is nevertheless important for containment purposes in other provinces and prevention of future outbreaks. Virological.org http://virological.org/t/ncov-2019-codon-usage-and-reservoir-not-snakes-v2/339 (2020). Identifying the origins of an emerging pathogen can be critical during the early stages of an outbreak, because it may allow for containment measures to be precisely targeted at a stage when the number of daily new infections is still low. acknowledges support by the Research FoundationFlanders (Fonds voor Wetenschappelijk OnderzoekVlaanderen (nos. DRAGEN COVID Lineage App This app aligns reads to a SARS-CoV-2 reference genome and reports coverage of targeted regions. PubMed Biol. Five example sequences with incongruent phylogenetic positions in the two trees are indicated by dashed lines. 6, 8391 (2015). Indeed, the rates reported by these studies are in line with the short-term SARS rates that we estimate (Fig. 26, 450452 (2020). 1) and thus likely to be the product of recombination, acquiring a divergent variable loop from a hitherto unsampled bat sarbecovirus28. Syst. Since the release of Version 2.0 in July 2020, however, it has used the 'pangoLEARN' machine-learning-based assignment algorithm to assign lineages to new SARS-CoV-2 genomes. This underscores the need for a global network of real-time human disease surveillance systems, such as that which identified the unusual cluster of pneumonia in Wuhan in December 2019, with the capacity to rapidly deploy genomic tools and functional studies for pathogen identification and characterization. 2, bottom) show that SARS-CoV-2 is unlikely to have acquired the variable loop from an ancestor of Pangolin-2019 because these two sequences are approximately 1015% divergent throughout the entire Sprotein (excluding the N-terminal domain). As a proxy, it would be possible to model the long-term purifying selection dynamics as a major source of time-dependent rates43,44,52, but this is beyond the scope of the current study. Anyone you share the following link with will be able to read this content: Sorry, a shareable link is not currently available for this article. Boni, M. F., Zhou, Y., Taubenberger, J. K. & Holmes, E. C. Homologous recombination is very rare or absent in human influenza A virus. =0.00025. Wu, Y. et al. 6, e14 (2017). 13, e1006698 (2017). Conservatively, we combined the three BFRs >2kb identified above into non-recombining region1 (NRR1). Using a third consensus-based approach for identifying recombinant regions in individual sequenceswith six different recombination detection methods in RDP5 (ref. The consistency of the posterior rates for the different prior means also implies that the data do contribute to the evolutionary rate estimate, despite the fact that a temporal signal was visually not apparent (Extended Data Fig. Biol. 725422-ReservoirDOCS). master 4 branches 94 tags Code AngieHinrichs Add entries for pangolin-data/-assignment 1.18.1.1 ( #512) ad16752 4 days ago 990 commits .github/ workflows Update pangolin.yml 7 months ago docs docs need guide tree now 3 years ago pangolin Anderson, K. G., Rambaut, A., Lipkin, W. I., Holmes, E. C. & Garry, R. F. The proximal origin of SARS-CoV-2. 16, e1008421 (2020). Two other bat viruses (CoVZXC21 and CoVZC45) from Zhejiang Province fall on this lineage as recombinants of the RaTG13/SARS-CoV-2 lineage and the clade of Hong Kong bat viruses sampled between 2005 and 2007 (Fig. Using the most conservative approach (NRR1), the divergence time estimate for SARS-CoV-2 and RaTG13 is 1969 (95% HPD: 19302000), while that between SARS-CoV and its most closely related bat sequence is 1962 (95% HPD: 19321988); see Fig. 1c). Extended Data Fig. Calibration of priors can be performed using other coronaviruses (SARS-CoV, MERS-CoV and HCoV-OC43), but estimated rates vary with the timescale of sample collection. Bioinformatics 22, 26882690 (2006). Pangolin was developed to implement the dynamic nomenclature of SARS-CoV-2 lineages, known as the Pango nomenclature. Membrebe, J. V., Suchard, M. A., Rambaut, A., Baele, G. & Lemey, P. Bayesian inference of evolutionary histories under time-dependent substitution rates. In Extended Data Fig. Liu, P. et al. 88, 70707082 (2014). J. Virol. PLoS Pathog. These shy, quirky but cute mammals are one of the most heavily trafficked yet least understood animals in the world. The pangolin coronaviruses show lower similarity to SARS-CoV-2 than bat coronavirus RaTG13 across the whole genome, but higher similarity in the spike receptor binding domain, although the similarity at either scale remains too low to implicate . Rev. Bruen, T. C., Philippe, H. & Bryant, D. A simple and robust statistical test for detecting the presence of recombination. 4), that region and shorter BFRs were not included in combined putative non-recombinant regions. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. 3) to examine the sensitivity of date estimates to this prior specification. Microbiol. If the latter still identified non-negligible recombination signal, we removed additional genomes that were identified as major contributors to the remaining signal. Of importance for future spillover events is the appreciation that SARS-CoV-2 has emerged from the same horseshoe bat subgenus that harbours SARS-like coronaviruses. Smuggled pangolins were carrying viruses closely related to the one sweeping the world, say scientists. The assumption of long-term purifying selection would imply that coronaviruses are in endemic equilibrium with their natural host species, horseshoe bats, to which they are presumably well adapted. & Andersen, K. G. The evolution of Ebola virus: insights from the 20132016 epidemic. The latter was reconstructed using IQTREE66 v.2.0 under a general time-reversible (GTR) model with a discrete gamma distribution to model inter-site rate variation. Nat Microbiol 5, 14081417 (2020). We used TreeAnnotator to summarize posterior tree distributions and annotated the estimated values to a maximum clade credibility tree, which was visualized using FigTree. the development of viral diversity. In other words, a true breakpoint is less likely to be called as such (this is breakpoint-conservative), and thus the construction of a non-recombining region may contain true recombination breakpoints (with insufficient evidence to call them as such). Visual exploration using TempEst39 indicates that there is no evidence for temporal signal in these datasets (Extended Data Fig. Nucleotide positions for phylogenetic inference are 147695, 9621,686 (first tree), 3,6259,150 (second tree, also BFR B), 9,26111,795 (third tree, also BFR C), 12,44319,638 (fourth tree) and 23,63124,633, 24,79525,847, 27,70228,843 and 29,57430,650 (fifth tree). 62,63), the GTR+ model and 100bootstrap replicateswas inferred for each BFR >500nt. These rate priors are subsequently used in the Bayesian inference of posterior rates for NRR1, NRR2, and NRA3 as indicated by the solid arrows. and D.L.R. These residues are also in the Pangolin Guangdong 2019 sequence. Sarbecovirus, HCoV-OC43 and SARS-CoV data were assembled from GenBank to be as complete as possible, with sampling year as an inclusion criterion. 382, 11991207 (2020). However, on closer inspection, the relative divergences in the phylogenetic tree (Fig. Lin, X. et al. Developed by the Centre for Genomic Pathogen Surveillance. Yu, H. et al. collected SARS-CoV data and assisted in analyses of SARS-CoV and SARS-CoV-2 data. 4 TMRCAs for SARS-CoV and SARS-CoV-2. Mol. Furthermore, the other key feature thought to be instrumental in the ability of SARS-CoV-2 to infect humansa polybasic cleavage site insertion in the Sproteinhas not yet been seen in another close bat relative of the SARS-CoV-2 virus. SARS-CoV-2 itself is not a recombinant of any sarbecoviruses detected to date, and its receptor-binding motif, important for specificity to human ACE2 receptors, appears to be an ancestral trait shared with bat viruses and not one acquired recently via recombination. BEAGLE 3: improved performance, scaling, and usability for a high-performance computing library for statistical phylogenetics. A., Filip, I., AlQuraishi, M. & Rabadan, R. Recombination and lineage-specific mutations led to the emergence of SARS-CoV-2. Are pangolins the intermediate host of the 2019 novel coronavirus (SARS-CoV-2)? Pink, green and orange bars show BFRs, with regionA (nt 13,29119,628) showing two trimmed segments yielding regionA (nt13,29114,932, 15,40517,162, 18,00919,628). In the presence of time-dependent rate variation, a widely observed phenomenon for viruses43,44,52, slower prior rates appear more appropriate for sarbecoviruses that currently encompass a sampling time range of about 18years. Boni, M. F., de Jong, M. D., van Doorn, H. R. & Holmes, E. C. Guidelines for identifying homologous recombination events in influenza A virus. Holmes, E. C., Dudas, G., Rambaut, A. J. Virol. Phylogenies of subregions of NRR1 depict an appreciable degree of spatial structuring of the bat sarbecovirus population across different regions (Fig. Rambaut, A., Lam, T. T., Carvalho, L. M. & Pybus, O. G. Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen). PubMed Phylogenetic Assignment of Named Global Outbreak LINeages, The pangolin web app is maintained by the Centre for Genomic Pathogen Surveillance. In our analyses of the sarbecovirus datasets, we incorporated the uncertainty of the sampling dates when exact dates were not available. The estimated divergence times for the pangolin virus most closely related to the SARS-CoV-2/RaTG13 lineage range from 1851 (1730-1958) to 1877 (1746-1986), indicating that these pangolin . Divergence dates between SARS-CoV-2 and the bat sarbecovirus reservoir were estimated as 1948 (95% highest posterior density (HPD): 18791999), 1969 (95% HPD: 19302000) and 1982 (95% HPD: 19482009), indicating that the lineage giving rise to SARS-CoV-2 has been circulating unnoticed in bats for decades. (Yes, Pango is a tongue-in-cheek reference to pangolins, which were briefly suspected to have had a role in the coronavirus's originseveral of the team's computational tools are named after. All authors contributed to analyses and interpretations. GARD identified eight breakpoints that were also within 50nt of those identified by 3SEQ. A SARS-like cluster of circulating bat coronaviruses shows potential for human emergence. Since experts have suggested that pangolins may be the reservoir species for COVID-19, the scaly anteater has been catapulted into headlines, news reports, and conversationsand some are calling COVID-19 "the revenge of the . The fact that these estimates lie between the rates for MERS-CoV and HCoV-OC43 is consistent with the intermediate sampling time range of about 18years (Fig. matics program called Pangolin was developed. While such models have recently been made available, we lack the information to calibrate the rate decline over time (for example, through internal node calibrations44). Lu, R. et al. Zhou, P. et al. A third approach attempted to minimize the number of regions removed while also minimizing signals of mosaicism and homoplasy. Menachery, V. D. et al. 36)gives a putative recombination-free alignment that we call non-recombinant alignment3 (NRA3) (see Methods). TMRCA estimates for SARS-CoV-2 and SARS-CoV from their respective most closely related bat lineages are reasonably consistent for the different data sets and different rate priors in our analyses. The Artic Network receives funding from the Wellcome Trust through project no. 190, 20882095 (2004). Results and discussion Genomic surveillance has been a hallmark of the COVID-19 pandemic that, in contrast to other pandemics, achieves tracking of the virus evolution and spread worldwide almost in real-time ( 4 ). At present, we analyzed the diversity of SARS-CoV-2 viral genomes in India to know the evolutionary patterns of viruses in the country through their pangolin lineage and GISAID-Clade. Nature 579, 265269 (2020). To gauge the length of time this lineage has circulated in bats, we estimate the time to the most recent common ancestor (TMRCA) of SARS-CoV-2 and RaTG13. Its genome is closest to that of severe acute respiratory syndrome-related coronaviruses from horseshoe bats, and its receptor-binding domain is closest to that of pangolin viruses. Menachery, V. D. et al. With horseshoe bats currently the most plausible origin of SARS-CoV-2, it is important to consider that sarbecoviruses circulate in a variety of horseshoe bat species with widely overlapping species ranges57. Python 379 102 pangoLEARN Public Store of the trained model for pangolin to access. Sci. Because coronaviruses are known to be highly recombinant, we used three different approaches to identify non-recombinant regions for use in our Bayesian time-calibrated phylogenetic inference. 53), this is inferred to have occurred before the divergence of RaTG13 and SARS-CoV-2 and thus should not influence our inferences. Google Scholar. Decimal years are shown on the x axis for the 1.2 years of SARS sampling in c. d, Mean evolutionary rate estimates plotted against sampling time range for the same three datasets (represented by the same colour as the data points in their respective RtT divergence plots), as well as for the comparable NRA3 using the two different priors for the rate in the Bayesian inference (red points). Time-measured phylogenetic reconstruction was performed using a Bayesian approach implemented in BEAST42 v.1.10.4. The construction of NRR1 is the most conservative as it is least likely to contain any remaining recombination signals. Xiao, K. et al. To examine temporal signal in the sequenced data, we plotted root-to-tip divergence against sampling time using TempEst39 v.1.5.3 based on a maximum likelihood tree. Concatenated region ABC is NRR1. It is clear from our analysis that viruses closely related to SARS-CoV-2 have been circulating in horseshoe bats for many decades. Mol. 21, 15081514 (2015). Bioinformatics 28, 32483256 (2012). Emerg. We showed that severe acute respiratory syndrome coronavirus 2 is probably a novel recombinant virus. Annu Rev. Webster, R. G., Bean, W. J., Gorman, O. T., Chambers, T. M. & Kawaoka, Y. Evolution and ecology of influenza A viruses. BEAST inferences made use of the BEAGLE v.3 library68 for efficient likelihood computations. Another similarity between SARS-CoV and SARS-CoV-2 is their divergence time (4070years ago) from currently known extant bat virus lineages (Fig. We compiled a set of 69SARS-CoV genomes including 58 sampled from humans and 11 sampled from civets and raccoon dogs. 110. Chernomor, O. et al. We say that this approach is conservative because sequences and subregions generating recombination signals have been removed, and BFRs were concatenated only when no PI signals could be detected between them.