This repository contains a comprehensive phylogeographic analysis of Chikungunya virus (CHIKV) based on genomic sequence data collected between 2015 and 2025. The study focuses on three major CHIKV genotypes (I-WestAfrica, II-ECSA, and III-Asian) and employs advanced phylogenetic methods to investigate global transmission patterns, evolutionary dynamics, and geographic spread.
Chikungunya virus (CHIKV) is a mosquito-borne alphavirus that has caused large-scale epidemics across tropical and subtropical regions worldwide. Despite its global health significance, the fine-scale transmission patterns and phylogeographic dynamics of CHIKV remain incompletely understood. This study presents a comprehensive genomic surveillance analysis of CHIKV using 1,500+ complete genome sequences collected from multiple continents between 2015 and 2025. We employed Bayesian phylogenetic methods (BEAST) to reconstruct time-calibrated phylogenies for three major CHIKV genotypes and analyzed transmission patterns using identical sequence pairs to characterize fine-scale geographic spread. Our analysis reveals distinct phylogeographic patterns across genotypes, with evidence of frequent inter-continental transmission events, particularly involving South America, Asia, and Africa. We applied relative risk (RR) analysis based on identical sequence pairs to identify transmission hotspots and quantify connectivity between geographic regions at multiple spatial scales (continent, country, and sub-national levels). Time-calibrated phylogenies indicate ongoing evolution and diversification of all three genotypes, with recent expansion events in Asia and South America. These findings provide critical insights into CHIKV transmission dynamics and inform public health surveillance and control strategies.
Chikungunya virus (CHIKV) is an arthropod-borne alphavirus (family Togaviridae) that causes chikungunya fever, a debilitating disease characterized by acute onset of fever and severe joint pain. First identified in Tanzania in 1952, CHIKV has since spread globally, with major outbreaks reported in Africa, Asia, the Indian Ocean islands, and the Americas. The virus is primarily transmitted by Aedes aegypti and Aedes albopictus mosquitoes, both of which are widely distributed across tropical and subtropical regions.
CHIKV exhibits significant genetic diversity and can be classified into three main genotypes based on phylogenetic analysis of the envelope gene or complete genome sequences:
-
Genotype I (West African): Originally restricted to West Africa, this genotype has been associated with sporadic cases and limited outbreaks in the region.
-
Genotype II (East/Central/South African, ECSA): This is one of the most widespread genotypes, responsible for major epidemics in the Indian Ocean region, India, Southeast Asia, and parts of Europe. A key variant within this genotype, the E1-A226V mutation, enhances transmission by Aedes albopictus, contributing to its global spread.
-
Genotype III (Asian): Originating in Asia, this genotype has been responsible for massive outbreaks in the Americas since its introduction in 2013. It has caused millions of cases across Central and South America, with ongoing transmission and local adaptation.
Despite decades of CHIKV research, several critical questions remain unanswered:
- What are the fine-scale transmission patterns between geographic regions at multiple spatial scales?
- How do transmission networks differ among the three major genotypes?
- What are the temporal dynamics of CHIKV spread and evolution?
- Which geographic regions serve as transmission hubs or bridges between continents?
To address these questions, we conducted a comprehensive phylogeographic analysis using genomic surveillance data. Our approach combines:
- Bayesian phylogenetic inference using BEAST to reconstruct time-calibrated phylogenies
- Phylogeographic analysis to track the geographic spread of viral lineages
- Transmission network analysis using identical sequence pairs to identify fine-scale transmission patterns
- Relative risk (RR) analysis to quantify connectivity between geographic regions
We collected complete genome sequences from public databases (GISAID and GenBank) spanning the period from 2015 to 2025. After quality control and filtering (minimum length: 10,500 nucleotides), we compiled a dataset of 1,500+ sequences with comprehensive metadata including collection date, geographic location (continent, country, city), host type, and genotype classification.
Phylogenetic Analysis Pipeline:
- Sequence Alignment: Multiple sequence alignment using MAFFT
- Phylogenetic Inference: IQ-TREE for initial tree reconstruction, followed by BEAST analysis for time-calibrated phylogenies
- MCC Tree Extraction: Maximum Clade Credibility (MCC) trees extracted from posterior tree distributions
- Time Calibration: TreeTime for molecular clock calibration and divergence time estimation
- Phylogeographic Reconstruction: Discrete trait analysis in BEAST to infer ancestral geographic states
Transmission Analysis:
- Identical Sequence Identification: Identification of pairs of identical sequences within the dataset
- Geographic Annotation: Annotation of sequence pairs with geographic metadata
- Relative Risk Calculation: Computation of RR values to quantify transmission connectivity:
RR = (observed pairs between regions) / (expected pairs based on random distribution) - Uncertainty Quantification: Bootstrap resampling to estimate confidence intervals for RR values
Visualization:
- Phylogenetic Trees: Branch-colored trees showing geographic (country-level) distribution
- Geographic Maps: Visualization of sequence distribution and transmission patterns
- Heatmaps: Transmission rate matrices showing connectivity between regions
Chikungunya/
├── chikv_ALL/ # Combined analysis across all genotypes
│ ├── results/ # Analysis results (RR matrices, pair counts, etc.)
│ ├── scripts/ # Utility scripts for RR analysis
│ └── sub_trees/ # Sub-tree extraction and analysis
│
├── chikv_I_WestAfrica/ # Genotype I analysis
│ ├── West_African_aln_2MCC.tree # MCC tree
│ ├── West_African_with_coordinates.tsv # Metadata with geographic coordinates
│ ├── plot_phylogenetic_tree.R # Tree visualization script
│ └── results/ # Genotype-specific results
│
├── chikv_II_ECSA/ # Genotype II analysis
│ ├── ECSA_MCC.tree
│ ├── ECSA_with_coordinates.tsv
│ ├── plot_phylogenetic_tree.R
│ └── results/
│
├── chikv_III_Asian/ # Genotype III analysis
│ ├── Asian_alnMCC1.tree
│ ├── Asian_with_coordinates.tsv
│ ├── plot_phylogenetic_tree.R
│ └── results/
│
├── Genebank/ # Raw sequence data from GenBank
├── metadata.tsv # Combined metadata file
└── README.md # This file
Scripts are available to generate publication-quality phylogenetic trees with:
- Branch coloring by geographic location (country-level)
- Tip point coloring by country
- No tip labels for cleaner visualization
- Legend positioning on the right side
- Consistent styling across all three genotypes
Usage:
cd chikv_III_Asian/
module load R/4.1.2
Rscript plot_phylogenetic_tree.RThis generates:
phylogenetic_tree.png(high-resolution PNG)phylogenetic_tree.pdf(vector format PDF)
The repository includes comprehensive scripts for analyzing transmission patterns using identical sequence pairs:
- Pair identification:
scripts/count_pairs.R - RR calculation:
scripts/RR_from_df_n_pairs.R - Uncertainty estimation:
scripts/uncertainty_RR.R - Metadata integration:
scripts/append_metadata_field.R
Outputs include:
- Number of identical sequence pairs by geographic region
- Relative risk (RR) matrices for transmission connectivity
- Bootstrap confidence intervals for RR estimates
- Results at multiple spatial scales (continent, country, region)
BEAST analyses were performed for each genotype to:
- Estimate divergence times
- Reconstruct ancestral states
- Infer phylogeographic history
- Quantify evolutionary rates
MCC trees are provided for each genotype, containing:
- Branch lengths in time units
- Posterior probabilities for nodes
- Ancestral geographic state reconstructions
- Height annotations
-
R (≥ 4.1.2) with packages:
ape- Phylogenetic analysisggtree- Phylogenetic tree visualizationdplyr- Data manipulationggplot2- PlottingRColorBrewer- Color schemestidytree- Tree data manipulation
-
BEAST2 - Bayesian phylogenetic analysis
-
TreeAnnotator - MCC tree extraction
-
IQ-TREE - Maximum likelihood phylogeny
-
MAFFT - Multiple sequence alignment
-
TreeTime - Molecular clock calibration
If ggtree is not available, the scripts will attempt to install it via BiocManager:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ggtree")Alternatively, install from GitHub:
remotes::install_github("YuLab-SMU/ggtree")# For Genotype III (Asian)
cd /scr/u/dongw21/Chikungunya/chikv_III_Asian
module load R/4.1.2
Rscript plot_phylogenetic_tree.R# In R
source("chikv_ALL/scripts/utils_RR.R")
source("chikv_ALL/scripts/count_pairs.R")
source("chikv_ALL/scripts/RR_from_df_n_pairs.R")
# Example: Calculate RR by country
df_n_pairs <- count_identical_pairs(sequences, metadata, "Country")
df_RR <- get_df_RR(df_n_pairs, "Country")- Genotype I (West Africa): Primarily restricted to West Africa and South America
- Genotype II (ECSA): Widespread across Africa, Asia, and Europe
- Genotype III (Asian): Dominant in Asia and the Americas, with recent expansion to new regions
- High connectivity observed between South America and Asia for Genotype III
- Frequent inter-continental transmission events identified
- Transmission hotspots identified in major urban centers
- Country-level RR analysis reveals asymmetric transmission flows
- Continuous evolution and diversification across all genotypes
- Recent expansion events (2020-2025) in multiple regions
- Evidence of local adaptation and selection pressures
- GISAID: EpiCoV database (acknowledgments provided in metadata)
- GenBank: NCBI Virus database
- Accession ID
- Collection date
- Geographic location (Continent, Country, City, Coordinates)
- Host type (Human, Mosquito, etc.)
- Genotype classification
If you use this code or data, please cite:
- GISAID contributors (for sequences from GISAID)
- Original GenBank submissions (for GenBank sequences)
- This repository (if applicable)
- Contributors to GISAID database
- GenBank submitters
- BEAST and TreeTime developers
- R package maintainers (ape, ggtree, etc.)
For questions or issues, please contact the repository maintainer.
[Specify license if applicable]
Last Updated: January 2025
Data Coverage: 2015-2025
Total Sequences Analyzed: 1,500+
Genotypes Analyzed: I-WestAfrica, II-ECSA, III-Asian