GitHub - markschl/seqtool: Fast and flexible tool for reading, modifying and writing biological sequences

Seqtool is a fast and flexible command line program for dealing with large amounts of biological sequences. It provides different subcommands for converting, inspecting and modifying sequences. The standalone binary (5-7 MB) is simply named st to save some typing.

Documentation

See https://markschl.github.io/seqtool-docs

Downloads

Binaries are available for Linux, OS X and Windows

📥 download the latest version (v0.4.0)

Feature overview

File formats

Reads and writes FASTA, FASTQ and CSV/TSV, optionally compressed with GZIP, BZIP2, or the faster and more modern Zstandard or LZ4 formats

Example: compressed FASTQ to FASTA

Combine multiple compressed FASTQ files, converting them to FASTA, using pass.

st pass file1.fastq.gz file2.fastq.gz -o output.fasta

Note: almost every command can read multiple input files and convert between formats, but pass does nothing other than reading and writing while other command perform certain actions.

Example: FASTA to tab-separated list

Aside from ID and sequence, any variable/function such as the sequence length (seqlen) can be written to delimited text.

st pass input.fasta --to-tsv id,seq,seqlen

id1	ACG	3
id1	ACGTACGT	7
id1	ACGTA	5

Highly versatile thanks to variables/functions

See also variables/functions for more details.

Example: count sequences in a large set of FASTQ files

st count -k path data/*.fastq.gz

data/sample1.fastq.gz	30601
data/sample2.fastq.gz	15702
data/sample3.fastq.gz	264965
data/sample4.fastq.gz	1120
data/sample5.fastq.gz	7021
(...)

In count, one or several categorical variables/functions can be specified with -k/--key.

Example: summarize the GC content in 10% intervals

The function bin(variable, interval) groups continuous numeric values into intervals

st count -k 'bin(gc_percent, 10)' sequences.fasta

(10, 20]	57
(20, 30]	2113
(30, 40]	11076
(40, 50]	7184
(50, 60]	12

Example: Assign new sequence IDs

st set -i 'seq_{num}' seqs.fasta > renamed.fasta

>seq_1
SEQUENCE
>seq_2
SEQUENCE
>seq_3
SEQUENCE
(...)

Example: De-replicate by description and sequence

seqs.fasta with a 'group' annotation in the header:

>id1 group1
SEQUENCE1
>id2 group1
SEQUENCE2
>id3 group1
SEQUENCE2
>id4 group2
SEQUENCE1
>id5 group2
SEQUENCE1

st unique 'desc,seq' seqs.fasta > grouped_uniques.fasta

>id1 group1
SEQUENCE1
>id2 group1
SEQUENCE2
>id4 group2
SEQUENCE1

Expressions

From simple math to complicated filter expressions, the tiny integrated JavaScript engine (QuickJS) offers countless possibilities for customized sequence processing.

Example: filter FASTQ sequences by quality and length

This filter command removes sequencing reads with more than one expected sequencing error (like USEARCH can do) or sequence length of <100 bp.

st filter 'exp_err < 1 && seqlen >= 100' reads.fastq > filtered.fastq

Header attributes for metadata storage

key=value header attributes allow storing and passing on all kinds of information

Example: De-replicate by sequence (seq variable) and/or other properties

The unique command returns all unique sequences and annotates the number of records with the same sequence in the header:

st unique seq -a abund={n_duplicates} input.fasta > uniques.fasta

>id1 abund=3
TCTTTAATAACCTGATTAG
>id3 abund=1
GGAGGATCCGAGCG
(...)

It is also possible to de-replicate by multiple keys, e.g. by sequence, but grouped by a sample attribute in the header:

st unique 'seq,attr(sample)' input.fasta > uniques.fasta

>id1 sample=1
SEQUENCE1
>id3 sample=2
SEQUENCE2
>id10 sample=1
SEQUENCE3
>id11 sample=3
SEQUENCE4
(...)

Fast pattern search

The find command enables flexible and fast pattern search, where the optimal algorithm is chosen automatically. Primer/adapter searches usually yield the same results as Cutadapt, while being faster.

Example: pre-processing of mixed multi-marker amplicon sequences (primer trimming, grouping by amplicon)

These steps could be part of an amplicon pipeline that de-multiplexes multi-marker amplicons.

Three different seqtool commands are chained, whereby find searches for a set of primers, which are removed by trim and finally split distributes the sequences into different files named by the forward primer. The primer names and positions are passed to the next command in the form of header attributes (-a/--attr option).

primers.fasta

>prA
PRIMER
>prB
PRIMER

Command for searching/trimming

st find file:primers.fasta -a primer='{pattern_name}' -a end='{match_end}' sequences.fasta |
  st trim -e '{attr(end)}..' | 
  st split -o '{attr(primer)}'

prA.fasta prB.fasta undefined.fasta

>id1 primer=prA end=22
SEQUENCE
>id4 primer=prA end=21
SEQUENCE
(...)

>id2 primer=prB end=20
SEQUENCE
>id3 primer=prB end=22
SEQUENCE
(...)

>id5 primer=undefined end=undefined
UNTRIMMEDSEQUENCE
(...)

Note: no primer, sequence not trimmed since end=undefined (see ranges).

Integration of external metadata

Integration of sequence metadata sources in the form of delimited text

Example: Add Genus names from a separate tab-separated list

input.fasta	genus.tsv
`>id1 SEQUENCE >id2 SEQUENCE (...)`	`id genus seq1 Actinomyces seq2 Amycolatopsis (...)`

Using -m/--meta to include genus.tsv as metadata source:

st set -m genus.tsv --desc '{meta(genus)}' input.fasta > with_genus.fasta

with_genus.fasta
`>seq1 Actinomyces SEQUENCE >seq2 Amycolatopsis SEQUENCE (...)`

Example: Choose specific sequences given a separate file with an ID list

input.fasta	id_list.txt
`>id1 SEQUENCE >id2 SEQUENCE >id3 SEQUENCE >id4 SEQUENCE`	`id1 id4`

st filter -m id_list.txt 'has_meta()' input.fasta > subset.fasta

subset.fasta
`>id1 SEQUENCE >id4 SEQUENCE`

Commands

Basic conversion/editing

pass: Directly pass input to output without any processing, useful for converting and attribute setting

Information about sequences

view: View biological sequences, colored by base / amino acid, or by sequence quality
count: Count all records in the input (total or categorized by variables/functions)
stat: Return per-sequence statistics as tab delimited list

Subset/reorganize/compare

sort: Sort records by sequence or any other criterion
unique: De-replicate by sequence and/or other properties, returning only unique records
filter: Keep/exclude sequences based on different properties with a mathematical (JavaScript) expression
split: Distribute sequences into multiple files based on a variable/function or advanced expression
cmp: Compare two input files/streams
sample: Get a random subset of sequences; either a fixed number or an approximate fraction of the input
slice: Return a range of sequence records from the input
head: Return the first N sequences
tail: Return the last N sequences
interleave: Interleave records of all files in the input

Search and replace

find: Search for pattern(s) in sequences or sequene headers for record filtering, pattern replacement or passing hits to next command
replace: Fast and simple pattern replacement in sequences or headers

Modifying commands

del: Delete header ID/description and/or attributes
set: Replace the header, header attributes or sequence with new content
trim: Trim sequences on the left and/or right (single range) or extract and concatenate several ranges
mask: Soft or hard mask sequence ranges
upper: Convert sequences to uppercase
lower: Convert sequences to lowercase
revcomp: Reverse complements DNA or RNA sequences
concat: Concatenates sequences/alignments from different files

Comparison with other tools

There are other tools with a similar focus such as Seqtk, SeqKit, the FASTX-Toolkit, as well as the more specialized USEARCH and VSEARCH offering some of the functions as well.

Seqtool performs well compared to these tools on a selection of diverse tasks:

Comparison of tools

License

Either MIT or Apache-2.0 at your choice

Name		Name	Last commit message	Last commit date
Latest commit History 352 Commits
.github/workflows		.github/workflows
js		js
profile		profile
scripts		scripts
src		src
var_provider		var_provider
wix		wix
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
Cargo.toml		Cargo.toml
LICENSE-APACHE		LICENSE-APACHE
LICENSE-MIT		LICENSE-MIT
README.md		README.md
build.rs		build.rs
dist-workspace.toml		dist-workspace.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Licenses found

Repository files navigation

Documentation

Downloads

Feature overview

File formats

Highly versatile thanks to variables/functions

Expressions

Header attributes for metadata storage

Fast pattern search

Integration of external metadata

Commands

Basic conversion/editing

Information about sequences

Subset/reorganize/compare

Search and replace

Modifying commands

Comparison with other tools

License

About

Licenses found

Uh oh!

Releases 10

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

License

Licenses found

markschl/seqtool

Folders and files

Latest commit

History

Repository files navigation

Documentation

Downloads

Feature overview

File formats

Highly versatile thanks to variables/functions

Expressions

Header attributes for metadata storage

Fast pattern search

Integration of external metadata

Commands

Basic conversion/editing

Information about sequences

Subset/reorganize/compare

Search and replace

Modifying commands

Comparison with other tools

License

About

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Releases 10

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages