Phylogenetic analysis of SARS-CoV-2 genomes in Turkey

COVID-19 has effectively spread worldwide. As of May 2020, Turkey is among the top ten countries with the most cases. A comprehensive genomic characterization of the virus isolates in Turkey is yet to be carried out. Here, we built a phylogenetic tree with globally obtained 15,277 severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes. We identified the subtypes based on the phylogenetic clustering in comparison with the previously annotated classifications. We performed a phylogenetic analysis of the first 30 SARS-CoV-2 genomes isolated and sequenced in Turkey. We suggest that the first introduction of the virus to the country is earlier than the first reported case of infection. Virus genomes isolated from Turkey are dispersed among most types in the phylogenetic tree. We find 2 of the seventeen subclusters enriched with the isolates of Turkey, which likely have spread expansively in the country. Finally, we traced virus genomes based on their phylogenetic placements. This analysis suggested multiple independent international introductions of the virus and revealed a hub for the inland transmission. We released a web application to track the global and interprovincial virus spread of the isolates from Turkey in comparison to thousands of genomes worldwide.


Introduction
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has emerged in Wuhan , spread across continents and eventually resulted in the COVID-19 pandemic. Although there are significant differences between the current and previously known SARS-CoV genomes, the reason behind its pandemic behaviour is still unclear. Genome sequences around the world were revealed and deposited into public databases such as GISAID (Shu and McCauley, 2017). With those genomic datasets, it is possible, in fact crucial to reveal the evolutionary events of SARS-CoV-2 to understand the types of the circulating genomes as well as in which parts of the genome differ across these types.
The SARS-CoV-2 virus is homologous to SARS-CoV, and its closer versions were characterized in bats and pangolins . The virus has been under a strong purifying selection . With the isolates obtained so far, the sequences of SARS-CoV-2 genomes showed more than 99.9% percent identity indicating a recent shift to the human species (Tang et al., 2020). Yet, there are unambiguous evolutionary clusters in the genome pool. Various studies use SNP (Tang et al., 2020) or entropy (Zhao et al., 2020) based methods to identify evolving virus types to reveal genomic regions responsible for transmission and evolution. Tang et. al identified S and L types among 103 SARS-CoV-2 genomes based on 2 SNPs at ORF1ab and ORF8 regions which encode replicase/transcriptase and ATF6, respectively (Tang et al., 2020). The entropy-based approach generated informative subtype markers from 17 informative positions to cluster evolving virus genomes (Zhao et al., 2020). Another study defined a competitive subtype based on the D614G mutation in the spike protein which facilitates binding to ACE2 to receptor on the host cell surface (Bhattacharyya et al., 2020). Although whether there is any effect of D614G substitution on the transmissibility is inconclusive (Van Dorp et al., 2020), this mutation has been 1 of the landmarks for major groupings of the virus family.
In this work, we used publicly available SARS-CoV-2 genome datasets. We aligned the sequences of more than 15,000 whole genomes and built a phylogenetic tree with the maximum likelihood method. We clustered the genomes based on their clade distribution in the phylogenetic tree, identified their genomic characteristics and linked them with the previous studies. We further analysed clusters, mutations and transmission patterns of the genomes from Turkey.

Materials and methods
To perform our analyses we retrieved virus genomes, aligned them to each other and revealed the evolutionary relationships between them through phylogenetic trees. We assigned the clusters based on the mutations for each genome. We further analyzed the phylogenetic tree with respect to neighbor samples of our genomes of interest to identify possible transmission patterns.

Data retrieval, multiple sequence alignment and phylogenomic tree generation
The entire SARS-CoV-2 genome sequences, along with their metadata were retrieved from the GISAID database (Table S1) (Shu and McCauley, 2017). We retrieved the initial batch of genomes (3,228) from GISAID on 02/04/2020. We used Augur toolkit to align whole genome sequences using mafft algorithm (--reorder --anysymbolnomemsave) (Katoh and Standley, 2016). The SARS-CoV2 isolate Wuhan-Hu-1 genome (GenBank: NC_045512.2) was used as a reference genome to trim the sequence and remove insertions in the genomes. Since the initial batch, the new sequences in GISAID were periodically added to the preexisting multiple sequence alignment (--existingalignment). The final multiple sequence alignment (MSA) contained 15,501 genomes that were available on May 1st 2020. In the metadata file, some genomes lacked month and day information and contained the year of the sample collection date. The genomes with incomplete metadata were filtered out and the filtered MSA consisted of 15,277 sequences. Maximum likelihood phylogenetic tree was built with IQ-TREE with the following options: -nt AUTO (on a 112-core server) -m GTR -fast. Augur was used to estimate the molecular clock through TimeTree (Sagulenko et al., 2018). For the sample EPI-ISL-428718 we additionally built a separate maximum likelihood phylogenetic tree by using IQ-TREE multicore version 1.6.1 with ultra-fast bootstrapping option and 1000 bootstraps.
The subtree consisting of Turkey isolates (Table 1) were retrieved from the master time-resolved tree by removing the rest of the genomes with the 'pruning' method from ete3 toolkit (Huerta-Cepas et al., 2016). The tree was visualized in FigTree v1.4.4 (http://tree.bio. ed.ac.uk/software/figtree/), and rerooted by selecting EPI_ ISL_428718 as an outgroup. The branch lengths of EPI-ISL-417413 and EPI-ISL-428713 samples were shortened for better visualization. Ggtree (Yu et al., 2017) package in R was used to generate the tree and corresponding clusters.

Genome clustering
We generated phylo-clusters with TreeCluster (Balaban et al., 2019) which is specifically designed to group viral genomes. The tool supports different clustering options and we used the default option, Max Clade, which identifies clusters based on 2 parameters, "-t" and "-s". These parameters define the threshold that two leaf nodes can be distant from each other and assign a minimum support value that connects the two leaf nodes or clades, respectively. For this analysis, we only used the distance threshold. The Max Clade algorithm requires leaves to form a clade and satisfy the distance threshold. The number of clusters that can be generated using a phylogenetic tree depends on the pair wise leaf distance cutoff. We manually searched for a meaningful cutoff for the number of phyloclusters and phylo-subgroups based on their similarity with the previously reported clusters (see below). We used the -t parameter as 0.0084 and 0.00463 for phylo-clusters and phylo-subclusters, respectively. After retrieving the groupings from TreeCluster, we eliminated clusters containing less than 100 sequences (except a subcluster with 99 sequences). We categorized those clusters having less than 100 sequences as not clustered. As a result, we obtained 4 primary clusters and seventeen subclusters.
L/S types of the SARS-CoV-2 genomes were previously defined based on the nucleotides at 8782nd and 28144th positions (Tang et al., 2020) . We categorized "TC" and "CT" haplotypes S and L type, respectively. In the cases both these positions correspond to a gap, the sequences as well as all other cases were classified as unknown. 614 G/D clustering was applied based on the amino acid at the 614th position of the Spike protein (Jaimes et al., 2020). Combinations of the nucleotides at positions 241; 1059; 3037; 8782; 11083; 14408; 14805; 17747; 17858; 18060; 23403; 25563; 26144; 28144; 28881; 28882; 28883 determined the subtypes for barcode clustering. Sequences that belong to the 10 major subtypes (with more than 100 sequences) which constitute 86 percent of all sequences were labelled with their respective 17 nucleotides (Zhao et al., 2020). All other sequences were categorized as unknown for barcode classification. Six major clusters (Morais Júnior et al., 2020) were assigned by the previously determined 12 positions (3037; 8782; 11083; 14408; 17747; 17858; 18060; 23403; 28144; 28881; 28882; 28883). The lineages were assigned using the proposed nomenclature by  Lineage Assigner web server (Rambaut et al., 2020). Sequences that cannot be assigned to any group were categorized as unknown for each classification scheme.

Distance calculations
We rooted the maximum-likelihood tree for distance calculations by selecting samples that belong to bats and pangolin as an outgroup, namely EPI-ISL-412976, EPI-ISL-412977, and EPI-ISL-412860. We measured the distance from leaf to root for every leaf node that is present in the phylogenetic tree with the ete3 toolkit (Huerta-Cepas et al., 2016).

Variant information processing
Mutations for each position relative to the reference genome (GenBank: NC_045512.2) were catalogued in a table with a custom script. A table of all the mutations of selected sequences was created and ordered according to the phylogenetic tree of the corresponding genomes (Table  S2). Mutations that do not correspond to a nucleotide such as a gap or N were labeled as "Gap or N"; the other mutations were marked as nongap. For variations that do not correspond to gap or N, respective nucleotides in the reference genome were obtained and added to the table. The GFF file of the reference genome (GCF_009858895.2) was extracted from NCBI genome database. Open reading frame (ORF) information of each mutation was retrieved from the GFF file and added to the table. Positions that are not in the range of any ORF were labelled as "noncoding region". Codon information and position of each mutation in the reference genome were retrieved according to their respective ORF start positions and frame. In this process, reported frameshifts in ORF1ab (Dos Ramos et al., 2004; Kelly and Dinman, 2020) and the overlap between ORF7a 3' and 7b 5' ends were taken into account. Coding information was used to assign amino acid substitution information to the variations. Eventually, the variants were categorized as nonsynonymous, synonymous and noncoding regions.

Migration analysis
The maximum-likelihood phylodynamic analysis was performed with TreeTime (Sagulenko et al., 2018) to estimate likely times of whole-genome sequences of SARS-CoV-2 by computing confidence intervals of node dates and reconstruct phylogenetic tree into the timeresolved tree. The slope of the root-to-tip regression was set to 0.0008 to avoid inaccurate inferences of substitution rates. With this model, we eliminated the variation of rapid changes in clock rates by integration along branches (standard deviation of the fixed clock rate estimate was set to 0.0004). The coalescent likelihood was performed with the Skyline (Strimmer and Pybus, 2001) model to optimize branch lengths and dates of ancestral nodes and infer the evolutionary history of population size. The marginal maximum likelihood assignment was used to assign internal nodes to their most likely dates. Clock rates were filtered by removing tips that deviate more than four interquartile ranges from the root-to-tip versus time regression. JC69 model was used as general time-reversible (GTR) substitution models to calculate transition probability matrix, actual substitution rate matrix, and equilibrium frequencies of given attributes of sequences. The distribution of subleading migration states and entropies were recorded for each location through Augur trait module (sampling bias correction was set to 2.5). Closest child-parent pairs that do not go beyond their given locations were identified and evaluated as transmissions using Auspice (Hadfield et al., 2018).

Phylogenetic map of the virus subtypes
The first COVID-19 case in Turkey was reported on March 10th , 2020, later than the reported earliest incidents in Asian and European countries. Since then, the number of cases increased dramatically. We used all the genomes available in the GISAID database as of May 1st , 2020 and built a phylogenetic tree. After we filtered out the samples with incomplete date or location information, the total number of samples we eventually used was 15,277. The phylogenetic tree was built with the maximum likelihood method and a time-resolved tree was generated (Figure 1).
To verify the accuracy of the phylogenetic tree as well as to assess the distribution of well-characterized genomic features, we mapped several classification schemes on the tree; (i) S/L type (Tang et al., 2020); (ii) D614G type (Bhattacharyya et al., 2020); (iii) barcodes (Zhao et al., 2020); (iv) six major clusters (Morais Júnior et al., 2020). Although the methodologies of the clustering attempts were different between these studies, in general, the previously established groups were in line with our phylogenetic tree. Besides the already established clustering methods, we classified the clades based on the phylogenetic tree only.
There are 2 levels of clustering; we termed phylo-clusters and phylo-subclusters. Small clusters were not taken into account (see methods). The phylogenetic map of the virus genomes clearly shows the 2 major S and L type clades. As the ancestral clade, S-type is seen as limited in the number of genomes. Twenty-nine of the 30 isolates in Turkey were classified in the L-type group.
The samples from Turkey are dispersed throughout the phylogenetic tree (Figure 1). The 30 samples are classified in 3 out of 4 different phylo-clusters and one remained unclassified. The dispersed groups suggested multiple independent introductions to the country. Seven of the 30 genomes encode aspartic acid (D) at the 614th position of the spike protein. The remaining 23 genomes encode glycine (G) in the same position. The D614G mutation is hypothesized to dominate because it enables smoother transmission of the virus (Bhattacharyya et al., 2020). However, this correlation might simply be a founder effect which is basically the loss or gain of genetic information when large population arises from a single individual.

A transient genome between S and L strain suggests early introduction
One of the genomes isolated in Turkey (EPI-ISL-428718) clustered together with the early subtypes of the virus. This isolate contains T at the position 8782, which is a characteristic of the S-type; however, it has T at the position 28144, which coincides with the L-type. Therefore, we characterized this sample as neither S-nor L-type. In the phylogenetic tree, this genome is placed between S and L strains, which suggests a transitioning genome from S to L strain ( Figure 2). The number of variant nucleotides between this sample and root is lower relative to other Turkey samples. Phylogenetic placement in the earliest cluster, which is closer to the root, suggests that the lineage of EPI-ISL-428718 entered Turkey as one of the earliest genomes. By the time this sample was isolated in Turkey, the L-strain had started to spread in Europe, primarily in Italy. Although the isolation date of this early sample is one week later than the first reported case, the existence of an ancestral genome sequence suggests an earlier introduction of SARS-CoV-2 to Turkey.

Cluster profiles of the sample sets
Turkey has genome samples from at least 3 of the 4 major clusters. By taking the transitioning genome into account, samples of Turkey are genuinely scattered in the phylogenetic tree. Based on the groupings applied, we analyzed the relative abundances of the clusters in Turkey and other countries ( Figure 3A). The most samples of Turkey belong to cluster 4. Iran, Denmark and France are also enriched in cluster 4. Most European countries are enriched in cluster 3. Although Turkey has cluster 3 genomes, the fraction of them is lower compared to European countries. With the available genome sequences, the overallcluster profile of Turkey does not resemble any country.
The divergence of the samples from to tree root was calculated for each subcluster. The subclusters observed in Turkey were analyzed along with the other countries ( Figure 3B). The divergence rates are comparable in general. However, within the same subclusters (4 and 8), virus genomes collected in Turkey have averagely more diverged than their relatives in other countries. The isolated genomes assigned to subcluster 4 and 8 show higher divergence rates in Turkey compared to the others in the same cluster (P-values: 0.00001 and 0.006, respectively, one-tailed t-test between Turkey and the rest).

Mutation analysis of the genomes retrieved in Turkey
We used the Turkey isolates (30) to analyze their mutational patterns and their corresponding clusters. From the master tree, we pruned all the leaves except for the samples of interest. We rooted the subtree at the transitioning sample. We aligned the assigned clusters and all the mutations relative to the reference genome (Figure 4), illustrating a correlation between the mutation pattern and the phylogenetic tree clades. Observation of no recurrence of a mutationshows that the multiple mutations are the results of founder effects.
In total, 55 unique mutations were detected, 2 and 20 of which are noncoding and synonymous, respectively. Thirty-three unique amino acid substitutions were detected (Table 2). Twenty-three out of 30 genomes we analyzed have the 614G mutation. The D614G mutation seems to have mutated with the 2 synonymous mutations in ORF1ab (Figure 4). Besides 614G, 3 more amino acid substitutions were identified in the spike protein (Table  2). G206A, T951I, G227S, S911F, A1420V, A3995F in ORF1a and V772I, T1238I in Spike protein, V66L in ORF5 and S54L in ORF8 were found to be specific to some isolates in Turkey ( Table 2). The most abundant amino acid substitutions (23/30) are P314L (ORF1b) and D614G (Spike), which are dispersed worldwide and not specifically enriched in Turkey. ORF1a V378I and ORF9 S194L are found in 7 and 6 of the 30 isolates, respectively, and show high frequency (15 folds with respect to general) in Turkey.
The mutational landscape represents the natural classifications of major groupings and subclusters. These mutational footprints can be used to identify the clusters of the future genomes. The combinations of mutations can be used as barcodes to group upcoming virus genomes efficiently without a need for establishing evolutionary associations across lineages, which is a computationally expensive procedure considering the accumulating genomic data.

Trace of the spread
The number of mutations since December 2019 indicates that the SARS-CoV-2 genome mutates twice a month, on average. Genome sequencing reveals mutations and enables a better understanding of the epidemiology by revealing the patterns of virus transmission. The timeresolved phylogenetic distributions of the genomes collected in Turkey suggested multiple independent sources of introduction ( Figure 5A). Out of the 30 genomes analyzed in this work, the earliest introduction seems to have originated from China. Other international imports include the US, Australia and Europe, probably from the UK. There is a connection between Saudi Arabia and the 2 cities in Turkey. Based on the model, this association is reciprocal. The Europe-based introductions are observed with the genomes isolated in İstanbul. Within Turkey, a transmission hub appears to be Ankara ( Figure 5B). The Figure 2. Phylogenetic tree of the transient type (EPI-ISL-428718) from S to L strain.The maximum likelihood tree was built with IQ-TREE. 10 S-type and 10 L-type sequences were randomly selected from the assigned samples. The tree was rooted at the genomes obtained from bat and pangolin.
isolates in 5 cities are associated with genomes collected in Ankara ( Figure 5C).

Web application to trace virus transmission
We have published a web application powered by Auspice (sarscov2.adebalilab.org/latest). We employed the frontend package (Auspice) that Nextstrain uses (Hadfield, et al. 2018). With increasing number of virus strains, not far from now, it will be infeasible to display the entire phylogenetic tree even in modern browsers. Nextstrain handles this problem by grouping the datasets based on the continents. As the aim of this platform is to trace the spread of virus genomes associated with Turkey, we will use representatives in the phylogenetic tree. The representative sequences will cover all the subtypes. The genomes of the samples collected in Turkey and their neighboring sequences will be kept in the tree. With this approach, the web application will always contain the genome data from Turkey and necessary information of the subtypes with the representative sequences. An additional dimension we added to the application is that it enables to trace virus across the cities of Turkey. This approach is applicable to create a comprehensive platform for migration analysis for any country or region of choice.

Discussion
There are 2 most abundant lineages of isolates in Turkey: subclusters 4 and 8. If the 30 samples unbiasedly represent the overall distribution of the strains in Turkey, subclusters 4 and 8 might comprise approximately 80% of the genomes in the country. The high divergence of the samples in these subclusters in Turkey relative to their equivalents in other countries ( Figure 3B) possibly suggests either or both of the 2 scenarios; (i) the viruses dominantly circulating in Turkey were introduced to the country later than other countries or (ii) this subcluster has been circulating in Turkey at a relatively higher rate than other countries and because of that, it is more likely to select the more diverged isolates by random sampling. Much more genomes should be sequenced and analyzed to gain more insight into virus evolution. It is essential to continuously follow up on the upcoming mutations when new samples are added to GISAID database.
The phylogenetic analysis of the circulating genomes in a country is necessary to identify the specific groups and their unique mutational patterns. The success of the COVID-19 diagnosis test kits, antibody tests and proteintargeting drugs possibly depend on genomic variations. For antibody tests, if a mutation affects protein recognition, the sensitivity of the test might drastically reduce. Therefore, mutation profiles of the isolates abundantly circulating in the country should be taken into account to modify these tests. As international travels are limited, viral genome profiles of the countries differ from each other, which is known as bottleneck effect. If international transmissions are kept being restricted, distinct cluster profiles might establish. Therefore, each country might need to develop their specific tests targeting the abundant genomes circulating in local.
We must note that sample distribution is not in line with the case distribution across Turkish cities. Due to this sampling bias as well as the low number of genomes, the spread history is undoubtedly incomplete. For instance, only 3 of the 30 samples were collected in İstanbul, which hosts approximately 60% of the COVID-19 cases. It is highly probable that İstanbul will be revealed as the central hub when additional genomes are sequenced. Moreover, there was no sample from İzmir, 3rd largest city. It should also be noted that the lack of a sufficient number of genomes could have resulted in indirect associations between the cities. More genomes are needed to complement this study with confidence.
The spread of the virus is traced by the personal declarations and travel history of the infected people. As SARS-CoV-2 genomes spread, they leave foot prints behind (mutations) allowing us to trace them. It is feasible to complement the conventional approach with genome sequencing in an unbiased way. Implemented feature of city-based tracing of the virus should be useful for authorities to take necessary measures to prevent spread. This approach will be automated with a standard pipeline. We aim to eliminate the technical limitations (because of the size) by applying filtering methods without losing any relevant information.
We thank Dr. Barış Süzek and Dr. Batu Erman for their helpful comments on the manuscript. We would like to acknowledge Cem Azgari for his contributions throughout the project. We thank Molecular Biology Association for their leadership in taking the initiative of forming a pool of volunteers for COVID-19 testing. Finally, we would like to thank the members of Ecology and Evolutionary Biology Association in Turkey for fruitful discussions regarding the preliminary analysis of the SARS-CoV-2 genomes.

Authors' contribution
OA conceived the study, designed the analysis, interpreted the results and wrote the first draft. AB generated the multiple sequence alignments, Bİ generated the visualization pipeline with auspice. BS generated the clusters based on the phylogenetic tree and plotted cluster graphs. DÇ, ZK and BT assigned previously identified clusters to the genomes, visualized the clusters aligned with the tree and identified mutations per sample. All authors contributed to manuscript writing and revising.