1 change: 1 addition & 0 deletions docs/usage.md
@@ -12,6 +12,7 @@ The pipeline is designed to be a companion pipeline to [nf-core/taxprofiler](htt

In addition to this page, you can find additional usage information on the following pages:

- [Tutorials](usage/tutorials.md)
- [FAQ and troubleshooting](usage/faq.md)
- [Development documentation](usage/dev.md) (only relevant for people contributing to code to the pipeline!)

227 changes: 227 additions & 0 deletions docs/usage/tutorials.md
@@ -0,0 +1,227 @@
# Tutorials

## Convert NCBI assembly_summary file to nf-core/createtaxdb samplesheet

A common source of reference genomes for building taxonomic classification databases is the NCBI suite of databases.

Conveniently, NCBI provides 'assembly summary' tables for different taxonomic groups that contain all the information needed for an nf-core/createtaxdb samplesheet.
Using this file as a source of reference FASTAs provides two primary benefits:

- The genomes will be automatically compatible with NCBI taxonomy files
- They provide URLs that can be directly used by Nextflow to download the genome FASTA files for you

The goals of this tutorial are:

- Use standard terminal commands to convert an NCBI `assembly_summary.txt` file to an nf-core/createtaxdb compatible samplesheet
- Build a DNA-based Kraken2 database and an amino acid-based Kaiju database with the pipeline using the generated samplesheet

:::info
This tutorial was tested with NCBI assembly_summary files from January 2026.

You may need to modify commands if NCBI changes the format of these files in the future.
:::
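Because the commands in this tutorial select columns by position, one way to check whether the format has changed is to list the header's column names next to their positions. The following is a minimal sketch run against a small mock file; the file name `mock_assembly_summary.txt`, its reduced set of columns, and the `list_columns` helper are assumptions for illustration and are not part of the pipeline.

```bash
# Mock two-line excerpt standing in for the top of an assembly_summary.txt file.
# Real files have many more columns; this reduced set is an assumption.
printf '## See README for column descriptions\n#assembly_accession\tbioproject\ttaxid\tftp_path\n' > mock_assembly_summary.txt

# Hypothetical helper: find the header line (first field "#assembly_accession")
# and print each column name next to its 1-based position.
list_columns() {
  awk -F '\t' '$1 == "#assembly_accession" { for (i = 1; i <= NF; i++) print i "\t" $i; exit }' "$1"
}

list_columns mock_assembly_summary.txt
```

On a real download, `list_columns assembly_summary.txt` can be used to confirm that the columns referenced by number later in this tutorial (for example `assembly_level` at position 12 and `ftp_path` at position 20) are still where the commands expect them.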

### Prerequisites

1. Internet connection
2. A Unix terminal (Linux or macOS)
3. Software installed:
   1. `curl` (tested version: `8.5.0`)
   2. `awk` (tested version: `mawk 1.3.4 20240123`)
   3. `sed` (tested version: `GNU sed 4.9`)
   4. `nextflow` (tested version: `25.10.2`)
   5. A Nextflow compatible environment system (for example `conda`, `singularity`, `docker`) (tested version: `docker 27.2.1, build 9e34c9b`)

### Download, filter, and convert the assembly_summary file

1. Download the assembly_summary file for your taxonomic group of interest.

As an example, we will use the Genome RefSeq database's fungi assembly summary:

```bash
curl -O https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/assembly_summary.txt
```

2. Optionally filter the assembly_summary file to only include certain genomes of interest.

For example, you might want to only include assemblies built to a "Complete" or "Chromosome" level.
You could do this with command line tools, or in a spreadsheet program.

Here is an example with `awk` to filter for:
- Only "Complete Genome"-level assemblies
- First three genomes

```bash
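# Keep the header line plus rows whose assembly_level (column 12) is "Complete Genome";
# head then keeps the header and the first three matching genomes.
awk -F '\t' '$1 == "#assembly_accession" || $12 == "Complete Genome"' assembly_summary.txt | head -n 4 > assembly_summary_filtered.txt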
```

3. Simplify the assembly_summary file to only include the columns we need for the nf-core/createtaxdb samplesheet, namely `#assembly_accession` (column 1), `species_taxid` (column 7, used as the samplesheet's `taxid`), and `ftp_path` (column 20).
Additionally, replace the first line with the expected nf-core/createtaxdb samplesheet headers: `id`, `taxid`, `fasta_dna`, `fasta_aa`.

```bash
cut -f 1,7,20 assembly_summary_filtered.txt | sed 's/#assembly_accession.*/id\ttaxid\tfasta_dna\tfasta_aa/' > assembly_summary_simplified.txt
```

This results in a file such as the following (the exact genomes depend on the current contents of the NCBI file):

```tsv
id taxid fasta_dna fasta_aa
GCF_041956525.1 4840 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/041/956/525/GCF_041956525.1_Rhipu1
GCF_000002945.2 4896 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/945/GCF_000002945.2_ASM294v3
GCF_003054445.1 4909 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/054/445/GCF_003054445.1_ASM305444v1
```

4. Reconstruct the complete URLs of the relevant FASTA files to make them downloadable.

```bash
awk 'BEGIN { FS=OFS="\t" } NR > 1 { n=split($3,p,"/"); $4=$3"/"p[n]"_protein.faa.gz"; $3=$3"/"p[n]"_genomic.fna.gz"} {print $1","$2","$3","$4}' assembly_summary_simplified.txt > samplesheet.csv
```

> Review comment from @dialvarezs (Member), Jan 8, 2026. Suggested change:

```bash
awk 'BEGIN { FS="\t"; OFS="," }
NR>1 {
  base=$3; sub(".*/","",base)
  $4=$3 "/" base "_protein.faa.gz"
  $3=$3 "/" base "_genomic.fna.gz"
}
{ print $1,$2,$3,$4 }' assembly_summary_simplified.txt > samplesheet.csv
```

:::info{collapse="true" title="Explanation of `awk` command"}
This `awk` command works as follows:

1. Specify tab as the field delimiter (`FS=OFS="\t"`)
2. Leave the header line (`NR == 1`) unmodified, so that only the data lines are rewritten
3. For each data line, split the base URL column (`$3`) on `/` into an array called `p`, and store the number of elements in the variable `n`
4. Construct a new protein FASTA URL column (`$4`) from the base URL by appending `/`, the last element of the array (`p[n]`), and `_protein.faa.gz`
5. Replace the existing base URL column (`$3`) with a DNA FASTA URL constructed in the same way as the protein FASTA URL, but with the suffix `_genomic.fna.gz`
6. Print the four columns of every line, separated by commas, to create the expected nf-core/createtaxdb CSV file

:::

This results in:

```csv
id,taxid,fasta_dna,fasta_aa
GCF_041956525.1,4840,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/041/956/525/GCF_041956525.1_Rhipu1/GCF_041956525.1_Rhipu1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/041/956/525/GCF_041956525.1_Rhipu1/GCF_041956525.1_Rhipu1_protein.faa.gz
GCF_000002945.2,4896,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/945/GCF_000002945.2_ASM294v3/GCF_000002945.2_ASM294v3_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/945/GCF_000002945.2_ASM294v3/GCF_000002945.2_ASM294v3_protein.faa.gz
GCF_003054445.1,4909,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/054/445/GCF_003054445.1_ASM305444v1/GCF_003054445.1_ASM305444v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/054/445/GCF_003054445.1_ASM305444v1/GCF_003054445.1_ASM305444v1_protein.faa.gz
```

:::tip
If you only want to build DNA-based databases (for example, Kraken2), omit the `$4` variable definition and printing.

```bash
awk 'BEGIN { FS=OFS="\t" } NR > 1 { n=split($3,p,"/"); $3=$3"/"p[n]"_genomic.fna.gz"} {print $1","$2","$3}' assembly_summary_simplified.txt > samplesheet_dna.csv
```

This results in:

```csv
id,taxid,fasta_dna
GCF_041956525.1,4840,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/041/956/525/GCF_041956525.1_Rhipu1/GCF_041956525.1_Rhipu1_genomic.fna.gz
GCF_000002945.2,4896,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/945/GCF_000002945.2_ASM294v3/GCF_000002945.2_ASM294v3_genomic.fna.gz
GCF_003054445.1,4909,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/054/445/GCF_003054445.1_ASM305444v1/GCF_003054445.1_ASM305444v1_genomic.fna.gz
```

If you only want to build amino acid-based databases (for example, Kaiju), omit the `$4` variable definition and printing, and append `_protein.faa.gz` instead of `_genomic.fna.gz` when rebuilding the `$3` field.
You will also need to rename the `fasta_dna` header to `fasta_aa`:

```bash
awk 'BEGIN { FS=OFS="\t" } NR > 1 { n=split($3,p,"/"); $3=$3"/"p[n]"_protein.faa.gz"} {print $1","$2","$3}' assembly_summary_simplified.txt > samplesheet_aa.csv
sed -i '1s/fasta_dna/fasta_aa/' samplesheet_aa.csv
```

This results in:

```csv
id,taxid,fasta_aa
GCF_041956525.1,4840,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/041/956/525/GCF_041956525.1_Rhipu1/GCF_041956525.1_Rhipu1_protein.faa.gz
GCF_000002945.2,4896,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/945/GCF_000002945.2_ASM294v3/GCF_000002945.2_ASM294v3_protein.faa.gz
GCF_003054445.1,4909,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/054/445/GCF_003054445.1_ASM305444v1/GCF_003054445.1_ASM305444v1_protein.faa.gz
```

:::
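Before moving on, it can be worth confirming that the generated samplesheet is internally consistent. The following is a minimal sketch of such a check, not a replacement for the pipeline's own samplesheet validation; the `check_samplesheet` helper, the mock file name, and the `example.org` URLs are assumptions for illustration. It checks that every row has as many comma-separated fields as the header and that every FASTA field (column 3 onwards) starts with `https://`.

```bash
# Hypothetical helper: exit non-zero if any row has a different number of
# fields than the header, or if a FASTA field is not an https:// URL.
check_samplesheet() {
  awk -F ',' '
    NR == 1 { ncol = NF; next }   # remember the header width
    NF != ncol {
      printf "line %d: expected %d fields, found %d\n", NR, ncol, NF
      bad = 1
    }
    {
      for (i = 3; i <= NF; i++)
        if ($i !~ /^https:\/\//) {
          printf "line %d: field %d is not an https URL\n", NR, i
          bad = 1
        }
    }
    END { exit bad + 0 }
  ' "$1"
}

# Mock samplesheet with made-up URLs, used only to demonstrate the helper.
printf 'id,taxid,fasta_dna,fasta_aa\nGCF_000000000.1,1234,https://example.org/a_genomic.fna.gz,https://example.org/a_protein.faa.gz\n' > mock_samplesheet.csv

if check_samplesheet mock_samplesheet.csv; then
  echo "samplesheet looks consistent"
fi
```

Because the expected width is taken from the header, the same check applies to the three-column DNA-only or amino acid-only samplesheets.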

### Download taxonomy files

- Download the NCBI taxonomy files required by Kraken2 with:

```bash
curl -O https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip
rm taxdmp.zip
curl -O https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
gunzip nucl_gb.accession2taxid.gz
```

- Kaiju does not require taxonomy files for database construction.
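Since the pipeline run below refers to these files by name, a quick check that they exist and are not empty can catch an interrupted download early. The following is a minimal sketch under that assumption; the `check_taxonomy_files` helper is hypothetical and simply tests for the three file names used in this tutorial.

```bash
# Hypothetical helper: report any taxonomy file that is missing or empty
# in the given directory (defaults to the current directory).
check_taxonomy_files() {
  dir="${1:-.}"
  missing=0
  for f in nodes.dmp names.dmp nucl_gb.accession2taxid; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

if check_taxonomy_files .; then
  echo "taxonomy files found"
else
  echo "download the taxonomy files before running the pipeline"
fi
```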

### Run the pipeline

- Run the nf-core/createtaxdb pipeline to build our databases as [normal](../usage.md).
Specify the samplesheet and taxonomy files we created and downloaded with their respective parameters.
Here we use `docker` as our environment manager:

```bash
nextflow run nf-core/createtaxdb \
-r 2.0.0 \
-profile docker \
--input samplesheet.csv \
--outdir ./results \
--dbname ncbi_fungi \
--nodesdmp nodes.dmp \
--namesdmp names.dmp \
--accession2taxid nucl_gb.accession2taxid \
--build_kraken2 --build_kaiju
```

:::note
By default the pipeline assumes you need 72 GB of RAM to build a Kraken2 database.
However, the test run can fit within approximately 8 GB of RAM due to the small number of genomes.

If running on a smaller machine, you may get an error such as `Process requirement exceeds available memory -- req: 36 GB; avail: 31 GB`.
To resolve this, create a custom config file (for example, `custom_config.config`) with the following contents to cap the resources that processes can request.
In this example, the machine has 16 GB of RAM:

```groovy
process {
    resourceLimits = [
        cpus: 4,
        memory: '15.GB',
        time: '1.h',
    ]
}
```

Then append the parameter `-c custom_config.config` to the end of the Nextflow command.
:::

Once the run has completed successfully, you can check the database files in the `results/` directory with:

```bash
ls results/{kaiju,kraken2}/*
```

And we can see the Kaiju `.fmi` file and the Kraken2 database directory:

```bash
results/kaiju/ncbi_fungi-kaiju.fmi

results/kraken2/ncbi_fungi-kraken2:
hash.k2d opts.k2d taxo.k2d
```

### Bonus: NCBI assembly_summary to samplesheet one-liner

In fact, we can execute all the commands to generate the samplesheet described [above](#convert-ncbi-assembly_summary-file-to-nf-corecreatetaxdb-samplesheet) in one go as a single Unix one-liner:

```bash
curl -s https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/assembly_summary.txt | \
awk -F '\t' '$1 == "#assembly_accession" || $12 == "Complete Genome"' | \
head -n 4 | cut -f 1,7,20 | \
sed 's/#assembly_accession.*/id\ttaxid\tfasta_dna\tfasta_aa/' | \
awk 'BEGIN { FS=OFS="\t" } NR > 1 { n=split($3,p,"/"); $4=$3"/"p[n]"_protein.faa.gz"; $3=$3"/"p[n]"_genomic.fna.gz"} {print $1","$2","$3","$4}' > samplesheet.csv
```

:::info
Here `curl -s` streams the file directly into the first `awk` command through a pipe, so no intermediate file is written to disk.
Because `head` exits once it has received four lines, the download may stop before the whole file has been transferred; this does not affect the resulting samplesheet.
:::
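If you build samplesheets regularly, the same logic can be wrapped in a shell function that reads an assembly summary on standard input. The following is a minimal sketch under the column layout assumed throughout this tutorial (accession in column 1, taxonomy ID in column 7, assembly level in column 12, base URL in column 20); the function name `assembly_summary_to_samplesheet` and the mock row are hypothetical and not part of nf-core/createtaxdb.

```bash
# Hypothetical helper (not part of nf-core/createtaxdb): read an assembly
# summary on stdin and write a samplesheet on stdout.
# Usage: assembly_summary_to_samplesheet [max_genomes] < assembly_summary.txt
assembly_summary_to_samplesheet() {
  awk -F '\t' -v max="${1:-3}" '
    BEGIN { print "id,taxid,fasta_dna,fasta_aa" }
    /^#/ { next }                        # skip the comment and header lines
    $12 == "Complete Genome" {
      base = $20; sub(/.*\//, "", base)  # last path element of the ftp_path column
      url = $20 "/" base
      print $1 "," $7 "," url "_genomic.fna.gz," url "_protein.faa.gz"
      if (++n >= max) exit               # stop after max matching genomes
    }
  '
}

# Mock 20-column row with made-up values; only columns 1, 7, 12 and 20 are used.
mock_row='GCF_000000000.1\t-\t-\t-\t-\t111\t222\t-\t-\t-\t-\tComplete Genome\t-\t-\t-\t-\t-\t-\t-\thttps://example.org/GCF_000000000.1_mock\n'
printf "$mock_row" | assembly_summary_to_samplesheet 1
```

With a network connection, the helper could be used as `curl -s https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/assembly_summary.txt | assembly_summary_to_samplesheet 3 > samplesheet.csv`. Writing the header in `BEGIN` and skipping lines that start with `#` avoids the separate `sed` step, at the cost of hard-coding the column names.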

### Summary

In this tutorial we went through how to convert an NCBI assembly_summary file to an nf-core/createtaxdb samplesheet.

We used standard command line tools to download, filter, and reformat the assembly_summary file in a reproducible manner.
We then used the resulting samplesheet to generate databases for two different taxonomic classification tools with nf-core/createtaxdb.

Use these steps to quickly build custom taxonomic classification databases for your metagenomic analyses from one of the most popular sources of reference genomes.

_Note: The `awk` command in step 4 was partly written with the assistance of AI (Claude Haiku 4.5). Documentation style review with GPT-5.1-Codex-Max_