Add Flex GEX pipeline support and release helper #189
Merged
Foundation for 10x Flex GEX support in simpleaf:
New types in chem_utils.rs:
- Organism enum: Human, Mouse, Other(String) — with FromStr,
Display, clap::ValueEnum, serde support
- ProtocolType enum: StandardRna, FlexGex, Atac — parsed from
meta.protocol_type in chemistry JSON
- SampleBcListInfo: plist_name + remote_url for probe barcode files
- ProbeSetInfo: name + plist_name + remote_url for organism-specific
probe sets
Extended CustomChemistry struct:
- sample_bc_list: Option<SampleBcListInfo> — for Flex probe barcode
rotation files
- probe_sets: Option<HashMap<String, ProbeSetInfo>> — keyed by
organism name ("human", "mouse")
- protocol_type() helper method — reads from meta.protocol_type
- is_flex_gex() convenience method
All 49 existing tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
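The types described in the commit above can be pictured with a minimal, std-only sketch. It is not the code in chem_utils.rs: the real types also carry clap::ValueEnum and serde support, and protocol_type() is a method on CustomChemistry. Here `meta` is modeled as a plain string map, and the string values "flex_gex" and "atac", as well as the fallback to StandardRna, are assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::fmt;
use std::str::FromStr;

/// Organism used to pick a probe set; unrecognized names are kept verbatim.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Organism {
    Human,
    Mouse,
    Other(String),
}

impl FromStr for Organism {
    // Parsing cannot fail because unknown names map to `Other`.
    type Err = std::convert::Infallible;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        Ok(match s.to_ascii_lowercase().as_str() {
            "human" => Organism::Human,
            "mouse" => Organism::Mouse,
            _ => Organism::Other(s.to_string()),
        })
    }
}

impl fmt::Display for Organism {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Organism::Human => write!(f, "human"),
            Organism::Mouse => write!(f, "mouse"),
            Organism::Other(name) => write!(f, "{name}"),
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProtocolType {
    StandardRna,
    FlexGex,
    Atac,
}

/// Reads the protocol type from a stand-in for the chemistry `meta` object.
/// The accepted strings and the StandardRna fallback are assumptions.
pub fn protocol_type(meta: &HashMap<String, String>) -> ProtocolType {
    match meta.get("protocol_type").map(String::as_str) {
        Some("flex_gex") => ProtocolType::FlexGex,
        Some("atac") => ProtocolType::Atac,
        _ => ProtocolType::StandardRna,
    }
}

/// Convenience check mirroring the is_flex_gex() helper named above.
pub fn is_flex_gex(meta: &HashMap<String, String>) -> bool {
    protocol_type(meta) == ProtocolType::FlexGex
}
```

Because Display writes the lowercase names, the output of to_string() for Human and Mouse can double as the "human"/"mouse" keys of the probe_sets map.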
New `simpleaf flex-quant` command that orchestrates the complete Flex
GEX pipeline: probe index building, mapping, multi-barcode permit list
generation, hierarchical collation, and quantification.
FlexQuantOpts CLI struct with:
--chemistry: registered Flex chemistry name
--organism: strongly-typed Organism enum (human/mouse)
--probe-set: optional explicit probe CSV or FASTA
--sample-bc-list: optional explicit probe barcode file
--index: optional pre-built probe index
--reads1/--reads2: FASTQ files
--resolution, --threads, --kmer-length, etc.
Pipeline orchestration (flex_quant.rs):
- Resource resolution: auto-fetch cell BC whitelist, probe barcode file,
and probe set from chemistry registry (with content-hash caching)
- Probe CSV → FASTA conversion with t2g map and metadata extraction
- Probe index building with piscem
- Mapping with chemistry geometry (includes s[N] tag)
- generate-permit-list with --sample-bc-list --unfiltered-pl
- Collate and quant (auto-detect multi-barcode from RAD)
- Full pipeline metadata output (simpleaf_flex_quant_info.json)
Example usage:
simpleaf flex-quant \
--chemistry 10x-flexv1-gex-3p \
--organism human \
-1 R1.fq.gz -2 R2.fq.gz \
-o output -t 8
All 49 tests pass (updated CLI snapshot for new subcommand).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add 10x-flexv1-gex-3p and 10x-flexv2-gex-3p entries with all resource
URLs and blake3 content hashes.
10x-flexv1-gex-3p:
- geometry: 1{b[16]u[12]x:}2{r[50]x[18]s[8]x:}
- cell BC: 737K-fixed-rna-profiling.txt (hash: 9fe0cb...)
- probe BC: 128 entries, 16 samples x 8 rotations, 8bp (hash: 5dc9d1...)
- probe sets: human v1.1.0 (hash: 9ccbd0...) + mouse v1.1.1 (hash: b4b811...)
10x-flexv2-gex-3p:
- geometry: 1{b[16]u[12]x[10]s[10]}2{r:}
- cell BC: 737K-flex-v2.txt (hash: dcb018...)
- probe BC: 384 entries, 10bp barcodes (hash: 5e7ef9...)
- probe sets: human v2.0.0 (hash: bfa53e...) + mouse v2.0.0 (hash: 8e0ea4...)
All resources hosted on UMD Box with stable URLs.
All 49 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
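To make the two geometry strings above concrete, the sketch below walks the fixed-length pieces of one read and reports where a given tag starts. It is a hypothetical helper written for this illustration, not simpleaf's or piscem's parser, and it handles only the `t[N]` and `t:` forms that appear in these entries. It takes the body of a single read, that is, the text between the braces.

```rust
/// Location of a fixed-length tag within one read: 0-based offset and length.
#[derive(Debug, PartialEq, Eq)]
pub struct TagSpan {
    pub offset: usize,
    pub len: usize,
}

/// Finds the first occurrence of `tag` in one read's geometry body, e.g. the
/// `r[50]x[18]s[8]x:` part of `2{r[50]x[18]s[8]x:}`.
/// Returns None if the tag is absent, is itself unbounded (`t:`), or comes
/// after an unbounded piece, since its offset is then not fixed.
pub fn locate_tag(read_body: &str, tag: char) -> Option<TagSpan> {
    let mut offset = 0usize;
    let mut chars = read_body.chars();
    while let Some(t) = chars.next() {
        match chars.next()? {
            // Unbounded piece: nothing from here on has a fixed position.
            ':' => return None,
            '[' => {
                // Collect the digits up to (and consuming) the closing ']'.
                let digits: String = chars.by_ref().take_while(|c| *c != ']').collect();
                let len: usize = digits.parse().ok()?;
                if t == tag {
                    return Some(TagSpan { offset, len });
                }
                offset += len;
            }
            _ => return None,
        }
    }
    None
}
```

For the v1 read 2 body `r[50]x[18]s[8]x:` this places the sample barcode at offset 68 with length 8, and for the v2 read 1 body `b[16]u[12]x[10]s[10]` at offset 38 with length 10, which agrees with the 8bp and 10bp probe barcode lengths listed for the two entries.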
Three fixes discovered during end-to-end testing:
1. Use 'map-sc' not 'map-scrna' as the piscem subcommand name
2. Add #[serde(rename = "remote_url")] to CustomChemistry.remote_pl_url
so the JSON key 'remote_url' deserializes correctly (was always None)
3. Remove the manual empty unmapped_bc_count_collated.bin workaround
since alevin-fry collate now produces it properly
Successfully tested on 4-plex human colorectal/kidney Flex v1 dataset:
- 5.6M cells across 16 sample channels (4 real + 12 noise)
- ~97.7% mapping rate
- Full auto-fetch pipeline: probe CSV download, FASTA conversion,
index build, mapping, GPL, collate, quant
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Geometries containing piscem-only tags (s[N] for sample barcodes,
b<N>[L] for numbered barcodes) cannot be parsed by seq_geom_parser,
which only understands the standard b/u/r/x/f tags. These extended
geometries are validated by piscem at mapping time instead.
This fixes simpleaf inspect failing with a parse error on Flex
chemistry entries like 1{b[16]u[12]x:}2{r[50]x[18]s[8]x:}.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
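One way to decide whether a geometry should bypass seq_geom_parser is a small pre-check for the piscem-only tags. The function below is a sketch under that assumption. The name has_extended_tags is hypothetical and the actual check in af_utils.rs may be written differently.

```rust
/// Returns true if the geometry uses tags outside the standard b/u/r/x/f set:
/// `s[N]` (sample barcode) or a numbered barcode such as `b0[16]`.
pub fn has_extended_tags(geometry: &str) -> bool {
    let bytes = geometry.as_bytes();
    for (i, &c) in bytes.iter().enumerate() {
        let next = bytes.get(i + 1).copied();
        match (c, next) {
            // `s` followed by a length spec or an unbounded marker.
            (b's', Some(b'[')) | (b's', Some(b':')) => return true,
            // `b` followed directly by a digit, as in `b0[16]`.
            (b'b', Some(d)) if d.is_ascii_digit() => return true,
            _ => {}
        }
    }
    false
}
```

A caller would run the standard parser only when this returns false and otherwise pass the string through unchanged, leaving validation to piscem at mapping time.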
Two fixes for the end-to-end flex-quant pipeline:
1. Probe CSV conversion now includes ALL probes (included + excluded)
in both the FASTA and t2g map. The index contains all probes, so
quant needs a t2g entry for every reference. Previously only
"included" probes were emitted, causing a mismatch (53,459 t2g
entries vs 54,580 index references).
2. Use piscem-rs "map-scrna" subcommand (not "map-sc" which is the
C++ piscem name).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
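The first fix can be illustrated with a simplified conversion routine. This is a sketch, not the code in flex_quant.rs. It assumes the probe CSV has a header row with gene_id, probe_seq, and probe_id columns, that lines starting with '#' are metadata, and that no field contains a quoted comma. The function name is hypothetical.

```rust
/// Converts probe-set CSV text into (FASTA text, t2g TSV text).
/// Assumed columns, located by header name: gene_id, probe_seq, probe_id.
/// Lines starting with '#' are treated as metadata and skipped.
/// Every probe row is emitted, whatever its `included` value, so the t2g map
/// gets one entry per reference sequence written to the FASTA.
pub fn probes_to_fasta_and_t2g(csv: &str) -> Result<(String, String), String> {
    let mut rows = csv
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with('#'));

    let header: Vec<&str> = rows.next().ok_or("missing header row")?.split(',').collect();
    let col = |name: &str| {
        header
            .iter()
            .position(|h| *h == name)
            .ok_or_else(|| format!("missing column: {name}"))
    };
    let (gene_col, seq_col, id_col) = (col("gene_id")?, col("probe_seq")?, col("probe_id")?);

    let mut fasta = String::new();
    let mut t2g = String::new();
    for row in rows {
        let fields: Vec<&str> = row.split(',').collect();
        let get = |i: usize| {
            fields
                .get(i)
                .copied()
                .ok_or_else(|| format!("short row: {row}"))
        };
        let (gene, seq, id) = (get(gene_col)?, get(seq_col)?, get(id_col)?);
        // One FASTA record and one t2g line per probe, with no filtering.
        fasta.push_str(&format!(">{id}\n{seq}\n"));
        t2g.push_str(&format!("{id}\t{gene}\n"));
    }
    Ok((fasta, t2g))
}
```

Because both outputs are produced in the same loop, the number of t2g lines always equals the number of FASTA records, which is the invariant the mismatch described above violated.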
Structural constraints are not appropriate for probe-based Flex mapping
where references are short (~50bp) probe sequences. The flag has been
removed from the flex-quant CLI entirely.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This reverts commit a81dc74.
Summary
This branch adds end-to-end support for 10x Flex GEX quantification and the supporting chemistry schema needed to describe Flex protocols in the registry. It also adds a release helper script and cleans up the resulting compile/clippy issues.
Changes compared to dev
- Add the flex-quant CLI subcommand and wire it into command dispatch.
- src/simpleaf_commands/flex_quant.rs, including:
  - probe_t2g.tsv generation
  - piscem build
  - piscem map-scrna mapping
  - alevin-fry generate-permit-list
  - alevin-fry collate
- src/utils/chem_utils.rs with:
  - Organism enum for organism-specific probe set selection
  - ProtocolType enum to distinguish standard RNA, Flex GEX, and ATAC protocols
  - SampleBcListInfo and ProbeSetInfo structures
  - sample_bc_list and probe_sets fields on CustomChemistry
  - protocol_type() / is_flex_gex() helpers
  - remote_pl_url serde mapping to the remote_url field used in the registry JSON
- resources/chemistries.json:
  - 10x-flexv1-gex-3p
  - 10x-flexv2-gex-3p
- None.
- src/utils/af_utils.rs so geometries containing tags like s[N] or b0[N] skip seq_geom_parser validation and defer validation to piscem.
- piscem map-scrna
- Add bump_and_publish.sh as a release helper script that:
  - Cargo.toml and the simpleaf entry in Cargo.lock
  - cargo check and cargo package
  - --dry-run and --publish
- flex-quant command.
- Default for ProtocolType
- strip_prefix
- Chemistry::Custom enum variant
Verification
- cargo build
- cargo clippy --all-targets --all-features -- -D warnings
Notes