Description

This is the companion repository for our publication Genomic GC bias correction improves species abundance estimation from metagenomic data (Holcik, von Haeseler & Pflug 2025, Nature Communications 16, 10526), in which we discuss and evaluate our GC-bias correction algorithm GuaCAMOLE using a large number of datasets. This repository contains all raw data and scripts necessary to reproduce the main analyses and figures shown in our publication.

The GuaCAMOLE algorithm is available at https://github.com/Cibiv/GuaCAMOLE.

The script simulate_communities.py simulates communities with different numbers of taxa and different genomic GC compositions as described in the manuscript. Each simulated community is represented by a file in parameters/ which contains the RefSeq identifier of the genome assembly used to represent each taxon and the taxon's abundance in the community. All assemblies are downloaded into assemblies_merged in a format suitable for read simulation. The script simulate_reads.py uses InSilicoSeq-GCBias to simulate sequencing reads for each library and places the simulated libraries in libraries/. InSilicoSeq-GCBias is a version of InSilicoSeq modified to simulate GC bias, and must be installed from https://github.com/Cibiv/InSilicoSeq-GCBias before running simulate_reads.py. The script evaluate.py reads the output of GuaCAMOLE and MetaPhlAn4 from results/ and reproduces the simulation results shown in Fig. 2. Due to size constraints, this repository contains the community descriptions and the GuaCAMOLE/MetaPhlAn4 results, but neither the genome assemblies (these are available from RefSeq) nor the simulated reads.
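
To make concrete the effect that the simulation is designed to probe, the following toy sketch shows how a GC-dependent sequencing efficiency distorts observed read fractions. It is not the code of simulate_communities.py, InSilicoSeq-GCBias or GuaCAMOLE: the taxon names, GC contents, abundances and the `efficiency` function are hypothetical values chosen for illustration, and equal genome lengths are assumed.

```python
# Toy illustration only: taxa, GC contents, abundances and the efficiency
# curve below are made-up values, not taken from the repository.
# Equal genome lengths are assumed for all taxa.

def efficiency(gc):
    """Hypothetical sequencing efficiency that peaks at 50% GC."""
    return 1.0 - 2.0 * abs(gc - 0.5)

def normalize(weights):
    """Scale a dict of non-negative weights so that the values sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# taxon -> (true abundance, genomic GC content)
community = {
    "taxon_low_gc": (0.25, 0.30),
    "taxon_mid_gc": (0.50, 0.50),
    "taxon_high_gc": (0.25, 0.70),
}

# Expected read fractions are proportional to abundance times efficiency.
observed = normalize({t: a * efficiency(gc) for t, (a, gc) in community.items()})

# Dividing by the (here: known) efficiency and renormalizing undoes the bias.
corrected = normalize({t: observed[t] / efficiency(community[t][1]) for t in community})

for taxon, (true_abundance, _) in community.items():
    print(f"{taxon}: true={true_abundance:.3f} "
          f"observed={observed[taxon]:.3f} corrected={corrected[taxon]:.3f}")
```

In this toy setting the taxon with balanced GC content appears over-represented in the observed fractions. With real data the efficiency is not known in advance, which is why the sketch should not be read as a description of how GuaCAMOLE works.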

Fig. 3, data from Tourlousse et al., 2024 (`tourlousse/`)

The file sra_ids.txt lists the SRA IDs of the samples of Tourlousse et al. (2021) used to evaluate GuaCAMOLE. The folders guacamole_results, motus_results, singlem_results and sylph_results contain the output produced by these tools for the data of Tourlousse et al. The script evaluate_efficiencies.py reads the GuaCAMOLE results and reproduces Fig. 3A. The script evaluate.py reads the output of all tools and reproduces panels B-E of Fig. 3.
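
When fetching the listed samples, a small sanity check on the accession list can catch copy-and-paste errors early. The sketch below assumes that the list contains one accession per line, which may not match the actual layout of sra_ids.txt; the function name `parse_accessions` and the example accessions are placeholders.

```python
import re

# Assumption: one accession per line; blank lines and '#' comments are skipped.
# The actual layout of sra_ids.txt may differ.
# The pattern accepts SRA/ENA/DDBJ run, sample, experiment and study accessions.
ACCESSION = re.compile(r"^[SED]R[RSXP]\d+$")

def parse_accessions(text):
    """Return the accessions found in text, raising on malformed lines."""
    accessions = []
    for number, line in enumerate(text.splitlines(), start=1):
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        if not ACCESSION.match(entry):
            raise ValueError(f"line {number}: unexpected accession {entry!r}")
        accessions.append(entry)
    return accessions

# Placeholder accessions, not the ones listed in the repository.
example = "SRR0000001\n\n# comment\nDRR0000002\n"
print(parse_accessions(example))
```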

Fig. 4, colorectal cancer datasets (`crc/`)

The files gupta2020_s3.csv, gupta2020_s6.csv, murovec2024_s2.csv, PRJDB4176.csv, PRJDB4176.attributes.tab and PRJDB4176.sample_disease_status.csv contain the supplemental tables of Gupta et al. (2020) and Murovec et al. (2024) from which the list of samples used (samples.yaml) was derived as described in the publication. The RStudio Notebook evaluate.Rmd reads these sample lists and the raw GuaCAMOLE results from guacamole_results and performs the clustering and analysis described in the publication. It produces the plots shown in Fig. 4 and additionally generates listings of all individual abundance estimates (abundances.csv.bz2) and of all studies including their cluster assignment (studies.csv.bz2).
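
Files such as abundances.csv.bz2 are bzip2-compressed CSV files and can be read without decompressing them on disk first. The following sketch uses only the Python standard library; the helper name `read_bz2_csv` and the column names in the stand-in file are hypothetical, since the actual columns of the generated files are not described here.

```python
import bz2
import csv
import os
import tempfile

def read_bz2_csv(path):
    """Read a bzip2-compressed CSV file into a list of dicts keyed by header."""
    with bz2.open(path, mode="rt", newline="") as handle:
        return list(csv.DictReader(handle))

# Write a tiny stand-in file so that the example is self-contained.
# The column names are hypothetical and not those of abundances.csv.bz2.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv.bz2")
    with bz2.open(path, mode="wt", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", "taxon", "abundance"])
        writer.writerow(["s1", "t1", "0.75"])
    rows = read_bz2_csv(path)

print(rows)
```

All values are returned as strings, so numeric columns need an explicit conversion such as `float(row["abundance"])` before further analysis.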

Fig. S1, data from Mori et al., 2023 (`mori/`)

The file samples.csv lists the samples from Mori et al. (2023), which compare abundance estimates across sequencing platforms and DNA extraction methods. The notebook evaluate.Rmd reads the raw GuaCAMOLE results for these samples and reproduces the panels shown in Fig. S1.
