Answer:- A total of 48,635 SARS-CoV-2 genomic sequences were downloaded from GISAID (Shu and McCauley, 2017) on June 26, 2020 (Supplementary File 1). Only sequences from human hosts were retained; low-quality sequences (>5% N bases) were removed and only full-length sequences (>29,000 nt) were kept. Of these, 48,624 sequences were associated with a geographic region: 514 from Africa, 3,340 from Asia, 31,818 from Europe, 10,250 from North America, 2,127 from Oceania and 575 from South America. The remaining eleven sequences were not associated with any continent. Supplementary File 2 gives a full geographic description of each sample used in the study.
SARS-CoV-2 genomic variations: Samples of this virus from different geographical locations show variations in their genomic sequences.
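The selection criteria above (human host, no more than 5% N bases, longer than 29,000 nt) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the function names, the host label "human" and the way sequences are passed in as plain strings are all hypothetical.

```python
def n_fraction(seq: str) -> float:
    """Fraction of ambiguous 'N' bases in a sequence (case-insensitive)."""
    if not seq:
        return 1.0  # treat an empty record as fully ambiguous
    return seq.upper().count("N") / len(seq)


def passes_filter(seq: str, host: str,
                  min_len: int = 29_000, max_n: float = 0.05) -> bool:
    """Apply the three criteria from the text.

    Assumptions (not from the source): the host field is the literal
    string "human", and the thresholds are exclusive for length
    (> 29,000 nt) and inclusive for N content (<= 5%).
    """
    is_human = host.strip().lower() == "human"
    is_full_length = len(seq) > min_len
    is_high_quality = n_fraction(seq) <= max_n
    return is_human and is_full_length and is_high_quality
```

In practice the sequences and host labels would be read from the downloaded FASTA and metadata files; the sketch leaves that parsing out on purpose, because the file layout is not described in the text.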
The reference NC_045512.2 SARS-CoV-2 Wuhan genome (Coronaviridae Study Group of the International Committee on Taxonomy of Viruses, 2020), 29,903 nucleotides long, was obtained from NCBI GenBank. A GFF3 annotation associated with the reference, giving the genomic coordinates of all SARS-CoV-2 protein sequences, is provided as Supplementary File 3. The large ORF1 polyprotein was split into its constituent non-structural proteins (NSPs). NSP12, which encodes the viral RNA-dependent RNA polymerase, was annotated as two regions, NSP12a and NSP12b, corresponding to the regions before and after a ribosomal frameshift at nucleotide 13,468, which is read both as the last nucleotide of one codon and as the first nucleotide of the next.
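The effect of the frameshift on the coding sequence can be shown with a small Python sketch. It assumes 1-based inclusive coordinates, as in GFF3, and uses a toy sequence; the function name and its parameters are hypothetical, and the real start and end coordinates of NSP12a and NSP12b would come from the annotation file.

```python
def join_frameshift(genome: str, start: int, slip: int, end: int) -> str:
    """Build the coding sequence of a region with a -1 ribosomal frameshift.

    Coordinates are 1-based and inclusive. Part a covers start..slip and
    part b covers slip..end, so the nucleotide at `slip` is included twice:
    once as the last base of a codon and once as the first base of the next.
    """
    part_a = genome[start - 1:slip]   # start..slip
    part_b = genome[slip - 1:end]     # slip..end (re-reads the slip base)
    return part_a + part_b


# Toy example, not real SARS-CoV-2 coordinates.
toy = "ATGAAACCCGGG"
cds = join_frameshift(toy, start=1, slip=6, end=11)
```

Because the slip nucleotide is counted twice, the joined sequence is one base longer than the genomic span it covers, which is what keeps it a whole number of codons.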
NUCMER version 3.1 (Delcher, 2002) was used to align all 48,635 genome sequences against the NC_045512.2 reference. The alignment output was converted to an annotated list of all mutational events using an in-house SARS-CoV-2 annotation algorithm written in R, provided as Supplementary File 4.
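The authors' annotation code is in R (Supplementary File 4) and is not reproduced here. As a rough illustration of one step such an annotation has to perform, the Python sketch below classifies a single-nucleotide substitution inside a coding sequence as synonymous or nonsynonymous using the standard genetic code. The function name, the input format and the output fields are assumptions; insertions, deletions, non-coding positions and the frameshift region are not handled.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def annotate_snp(cds: str, pos: int, alt: str) -> dict:
    """Annotate a substitution at 1-based position `pos` of an in-frame CDS."""
    idx = pos - 1
    codon_start = idx - idx % 3           # first base of the affected codon
    offset = idx - codon_start            # 0, 1 or 2 within the codon
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    aa_pos = codon_start // 3 + 1         # 1-based residue number
    return {
        "nucleotide": f"{pos}{cds[idx]}>{alt}",
        "protein": f"{ref_aa}{aa_pos}{alt_aa}",
        "type": "synonymous" if ref_aa == alt_aa else "nonsynonymous",
    }


# Toy CDS: Met-Asp-Phe. Changing the middle base of GAT gives GGT (Asp to Gly).
toy_cds = "ATGGATTTT"
missense = annotate_snp(toy_cds, 5, "G")
silent = annotate_snp(toy_cds, 6, "C")
```

The toy example mirrors the kind of change behind a label such as D614G, where an A to G substitution in the second codon position turns an aspartate codon into a glycine codon.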
The SARS-CoV-2 5′UTR RNA secondary structure was predicted with the RNAfold WebServer using default settings (Gruber et al., 2008), which combines free energy minimization with the equilibrium partition function and base-pair probability algorithms.
We identified six major clades with 14 subclades (Fig. 1 and Table 4). The largest is the D614G clade, with five subclades. Most samples in the D614G clade also carry the non-coding variant 241C > T, the synonymous variant 3037C > T and ORF1ab P4715L. Within the D614G clade, D614G/Q57H/T265I is the largest subclade, with 2391 samples. The second largest major clade is the L84S clade, which was observed among travellers from Wuhan in the early days of the outbreak; it consists of 1662 samples in 2 subclades. The L84S/P5828L/ subclade is predominantly observed in the United States. Among the L3606F subclades, L3606F/G251V/ forms the largest group, with 419 samples. G251V frequently appears in samples from the United Kingdom (329 samples), Australia (95 samples), the United States (80 samples) and Iceland (76 samples). The basal clade, however, now accounts for only a small fraction of genomes (670 samples, mainly from China). The remaining two clades, D448del and G392D, are small and have no significant subclades at this point.
Variants recurring in more than 100 samples are shown in Table 3. The most common variants were the synonymous variant 3037C > T (6334 samples), ORF1ab P4715L (RdRp P323L; 6319 samples) and S D614G (6294 samples). They occur together in over 3000 samples, mainly from Europe and the United States. Other variants, including ORF3a Q57H (2893 samples), ORF1ab T265I (NSP3 T85I; 2442 samples), ORF8 L84S (1669 samples), N203_204delinsKR (1573 samples) and ORF1ab L3606F (NSP6 L37F; 1070 samples), were the key variants for identifying clades.
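The recurrence and co-occurrence counts quoted above can be computed from a per-sample list of variants along these lines. This is a minimal Python sketch, assuming each sample is represented as a set of variant labels; the function names, sample identifiers and toy data are hypothetical and do not come from the study.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def recurrent_variants(sample_variants: Dict[str, Set[str]],
                       min_samples: int) -> List[Tuple[str, int]]:
    """Variants seen in more than `min_samples` samples, most frequent first."""
    counts = Counter(v for variants in sample_variants.values() for v in variants)
    return [(v, n) for v, n in counts.most_common() if n > min_samples]


def cooccurring(sample_variants: Dict[str, Set[str]], required: Set[str]) -> int:
    """Number of samples that carry every variant in `required`."""
    return sum(1 for variants in sample_variants.values() if required <= variants)


# Toy data; variant labels follow the naming used in the text.
toy_samples = {
    "s1": {"3037C>T", "P4715L", "D614G"},
    "s2": {"3037C>T", "P4715L", "D614G", "Q57H"},
    "s3": {"L84S"},
    "s4": {"3037C>T", "D614G"},
}
frequent = dict(recurrent_variants(toy_samples, min_samples=1))
together = cooccurring(toy_samples, {"3037C>T", "P4715L", "D614G"})
```

With the real data, `min_samples=100` would correspond to the threshold used for Table 3.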