TrufflR is an R-based command-line tool which extracts clean target sequences from NCBI's nuccore database for specific taxa.
- Search NCBI nuccore for specific genes in specific taxa using taxids and gene synonyms.
- Extract and save nucleotide and amino acid sequences for target genes.
- Filter by feature type (CDS, gene, rRNA, tRNA, or all).
- Combine all nucleotide or amino acid sequences into single multifasta files.
- Generate summary tables and logs for each run.
- R (≥ 4.0 recommended)
- R packages: optparse, rentrez, seqinr, geneviewer, Biostrings, BiocManager
- Internet connection (for NCBI queries)
- Clone or download TrufflR. Download this repository or clone it using git: `git clone https://github.com/yourusername/TrufflR.git`, then `cd TrufflR`.
- Check Rscript availability. Make sure `Rscript` is available in your PATH (it comes with most R installations).
- Make the script executable. If you want to run it as `./trufflR.R`, make it executable: `chmod +x trufflR.R`
- Test the installation. Run `Rscript trufflR.R --help`. This should print the help message and available options.
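The Rscript availability check can also be scripted before a run. The following is a minimal sketch, assuming a POSIX shell; `have_cmd` is a hypothetical helper name and is not part of TrufflR.

```shell
# Return success if the named command can be found on PATH.
# have_cmd is a hypothetical helper, not something TrufflR ships.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd Rscript; then
  echo "Rscript found: $(command -v Rscript)"
else
  echo "Rscript not found on PATH; install R or add it to PATH" >&2
fi
```

If the check fails even though R is installed, the directory containing `Rscript` probably needs to be added to PATH.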
Go forth and find your truffles, little piggy :)
Rscript trufflR.R \
--taxids=taxids.txt \
--genes="COI[Gene],COX1[Gene],cytochrome c oxidase subunit I[Gene]" \
--output-dir=results \
--feature-type=CDS \
--retmax=5 \
--combine-nt \
--combine-aa \
-v
| Option | Description |
|---|---|
| -t, --taxids | Path to text file containing taxids (one per line) [required] |
| -g, --genes | Comma-separated list of gene search terms (with NCBI field tags, e.g. COI[Gene]) [required] |
| -o, --output-dir | Output directory (default: trufflr_output) |
| -f, --feature-type | Feature type to extract: CDS, gene, rRNA, tRNA, or all (default: all). NOTE: extraction of non-CDS features is a work in progress and still needs amending. |
| -r, --retmax | Max number of sequences to retrieve per taxid (default: 5) |
| -c, --combine-nt | Combine all nucleotide sequences into one file |
| -a, --combine-aa | Combine all amino acid sequences into one file |
| --nt-file | Name for combined nucleotide file (default: combined_nucleotide_seqs.fna) |
| --aa-file | Name for combined amino acid file (default: combined_aminoacid_seqs.faa) |
| -v, --verbose | Print extra output |
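
The two required inputs can be prepared from the shell. This is a minimal sketch under stated assumptions: the taxids 9606 (Homo sapiens) and 10090 (Mus musculus) are illustrative picks, and `build_gene_terms` is a hypothetical helper, not part of TrufflR.

```shell
# Write one taxid per line, the format --taxids expects.
# 9606 (Homo sapiens) and 10090 (Mus musculus) are illustrative only.
printf '%s\n' 9606 10090 > taxids.txt

# Join gene synonyms into a comma-separated list, tagging each with [Gene].
# build_gene_terms is a hypothetical helper, not part of TrufflR.
build_gene_terms() {
  out=""
  for term in "$@"; do
    # Prepend a comma only when out already holds an earlier term.
    out="${out:+$out,}${term}[Gene]"
  done
  printf '%s\n' "$out"
}

genes=$(build_gene_terms "COI" "COX1" "cytochrome c oxidase subunit I")
echo "$genes"
```

The resulting values can then be passed as `--taxids=taxids.txt --genes="$genes"`. Quoting `"$genes"` matters because the synonyms may contain spaces and square brackets.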
- Raw files: GenBank and FASTA files for each accession in output-dir/raw_files/
- Extracted sequences: Nucleotide and amino acid FASTA files in output-dir/nucleotide/ and output-dir/protein/
- Combined files: If requested, combined nucleotide (.fna) and amino acid (.faa) multifasta files in output-dir/
- Summary: Per-taxon and overall summary tables and logs in output-dir/
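
A quick sanity check after a run is to count the records in a combined multifasta file. The sketch below assumes the run used `--output-dir=results` and the default combined file name; `count_fasta_records` is a hypothetical helper, not part of TrufflR.

```shell
# Count FASTA records by counting header lines (those starting with ">").
# count_fasta_records is a hypothetical helper, not part of TrufflR.
count_fasta_records() {
  # grep -c exits non-zero when it finds no matches; "|| true" keeps
  # the function usable in scripts that run under "set -e".
  grep -c '^>' "$1" || true
}

outdir=results   # assumed to match the --output-dir value used for the run
combined="$outdir/combined_nucleotide_seqs.fna"

if [ -f "$combined" ]; then
  echo "$combined: $(count_fasta_records "$combined") sequences"
fi
```

The same function works on the combined amino acid file or on any per-accession FASTA file in the output directory.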
- Srishti Arya at [email protected]
- Morgan Jones at [email protected]
- or open a GitHub issue :)