TrufflR

TrufflR is an R-based command-line tool that extracts clean target sequences for specific taxa from NCBI's nuccore database.

Features

Search NCBI nuccore for specific genes in specific taxa using taxids and gene synonyms.
Extract and save nucleotide and amino acid sequences for target genes.
Filter by feature type (CDS, gene, rRNA, tRNA, or all).
Combine all nucleotide or amino acid sequences into single multifasta files.
Generate summary tables and logs for each run.

Requirements

R (≥ 4.0 recommended)
R packages: optparse, rentrez, seqinr, geneviewer, Biostrings, BiocManager
Internet connection (for NCBI queries)

Installation

Clone or download TrufflR. Download this repository, or clone it using git:
```
git clone https://github.com/srisarya/TrufflR.git
cd TrufflR
```
Check Rscript availability. Make sure Rscript is available in your PATH (it comes with most R installations).

Make the script executable. If you want to run it as ./trufflR.R, make it executable:
```
chmod +x trufflR.R
```
Test the installation. Run:
```
Rscript trufflR.R --help
```
This should print the help message and available options.
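The Rscript availability check from the steps above can be sketched as a small shell pre-flight test. The helper name `have_cmd` is hypothetical and not part of TrufflR; it only reports whether a command resolves on your PATH.

```shell
#!/bin/sh
# Hypothetical helper (not part of TrufflR): succeeds if a command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Report whether Rscript can be found before trying to run trufflR.R.
if have_cmd Rscript; then
  echo "Rscript found at: $(command -v Rscript)"
else
  echo "Rscript not found; install R or add its bin directory to PATH"
fi
```

If the check fails even though R is installed, the directory containing Rscript is most likely missing from PATH.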

Go forth and find your truffles, little piggy :)

Command-line usage

```
Rscript trufflR.R \
  --taxids=taxids.txt \
  --genes="COI[Gene],COX1[Gene],cytochrome c oxidase subunit I[Gene]" \
  --output-dir=results \
  --feature-type=CDS \
  --retmax=5 \
  --combine-nt \
  --combine-aa \
  -v
```
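The `--taxids` file used above is a plain text file with one taxid per line. As a minimal sketch, the snippet below writes such a file and sanity-checks it. The three taxids (9606 human, 10090 mouse, 7227 fruit fly) are illustrative only, and `validate_taxids` is a hypothetical helper, not part of TrufflR. It assumes every line must be a bare integer; whether TrufflR itself tolerates blank lines or comments is not stated here.

```shell
#!/bin/sh
# Hypothetical helper (not part of TrufflR): succeeds only if the file is
# non-empty and every line consists solely of digits.
validate_taxids() {
  # Reject a missing or empty file.
  [ -s "$1" ] || return 1
  # grep -v selects lines that are NOT all digits; -q keeps it silent.
  if grep -qvE '^[0-9]+$' "$1"; then
    return 1
  fi
  return 0
}

# Example taxids, one per line (illustrative values only).
printf '%s\n' 9606 10090 7227 > taxids.txt

if validate_taxids taxids.txt; then
  echo "taxids.txt looks fine"
else
  echo "taxids.txt contains lines that are not plain taxids"
fi
```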

Options

| Option | Description |
| --- | --- |
| `-t, --taxids` | Path to text file containing taxids, one per line (required) |
| `-g, --genes` | Comma-separated list of gene search terms with NCBI field tags, e.g. `COI[Gene]` (required) |
| `-o, --output-dir` | Output directory (default: trufflr_output) |
| `-f, --feature-type` | Feature type to extract: CDS, gene, rRNA, tRNA, or all (default: all). Note: non-CDS extraction is a work in progress and still needs amending. |
| `-r, --retmax` | Maximum number of sequences to retrieve per taxid (default: 5) |
| `-c, --combine-nt` | Combine all nucleotide sequences into one file |
| `-a, --combine-aa` | Combine all amino acid sequences into one file |
| `--nt-file` | Name for combined nucleotide file (default: combined_nucleotide_seqs.fna) |
| `--aa-file` | Name for combined amino acid file (default: combined_aminoacid_seqs.faa) |
| `-v, --verbose` | Print extra output |
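Because `--genes` takes a comma-separated list of terms that each carry an NCBI field tag, it can be convenient to build the value from a plain list of synonyms. The sketch below is one way to do that in shell. `join_gene_terms` is a hypothetical helper, not part of TrufflR, and it assumes every term should get the `[Gene]` tag.

```shell
#!/bin/sh
# Hypothetical helper (not part of TrufflR): appends [Gene] to each argument
# and joins the results with commas.
join_gene_terms() {
  out=""
  for term in "$@"; do
    if [ -z "$out" ]; then
      out="${term}[Gene]"
    else
      out="${out},${term}[Gene]"
    fi
  done
  printf '%s\n' "$out"
}

# Build the same value used in the command-line usage example.
genes=$(join_gene_terms "COI" "COX1" "cytochrome c oxidase subunit I")
echo "$genes"
```

The resulting string can be passed as `--genes="$genes"`. The quotes matter, since some synonyms contain spaces.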

Output

Raw files: GenBank and FASTA files for each accession in output-dir/raw_files/
Extracted sequences: Nucleotide and amino acid FASTA files in output-dir/nucleotide/ and output-dir/protein/
Combined files: If requested, combined nucleotide (.fna) and amino acid (.faa) multifasta files in output-dir/
Summary: Per-taxon and overall summary tables and logs in output-dir/
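A quick way to inspect a combined multifasta file is to count its header lines, since each FASTA record starts with `>`. The sketch below assumes a run with `--output-dir=results` and `--combine-nt` using the default file name. `count_fasta_records` is a hypothetical helper, not part of TrufflR.

```shell
#!/bin/sh
# Hypothetical helper (not part of TrufflR): prints the number of FASTA
# records in a file by counting lines that start with '>'.
# grep -c exits non-zero when the count is 0, so '|| true' keeps the
# function from reporting failure in that case.
count_fasta_records() {
  grep -c '^>' "$1" || true
}

# Path assumes --output-dir=results and the default --nt-file name.
combined="results/combined_nucleotide_seqs.fna"
if [ -f "$combined" ]; then
  echo "$combined: $(count_fasta_records "$combined") sequences"
else
  echo "$combined not found; run TrufflR with --combine-nt first"
fi
```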

Contact

Srishti Arya at [email protected]
Morgan Jones at [email protected]
or open a git issue :)
