
Add Flex GEX pipeline support and release helper#189

Merged
rob-p merged 9 commits into dev from multi-sample-rad
Mar 19, 2026


@rob-p rob-p commented Mar 19, 2026

Summary

This branch adds end-to-end support for 10x Flex GEX quantification and the supporting chemistry schema needed to describe Flex protocols in the registry. It also adds a release helper script and cleans up the resulting compile/clippy issues.

Changes compared to dev

  • Add a new flex-quant CLI subcommand and wire it into command dispatch.
  • Implement the Flex GEX pipeline in src/simpleaf_commands/flex_quant.rs, including:
    • chemistry lookup from the registry
    • automatic probe-set selection by organism
    • probe CSV to FASTA conversion plus probe_t2g.tsv generation
    • cached or on-demand probe index creation with piscem build
    • cell barcode whitelist resolution
    • sample barcode list resolution
    • piscem map-scrna mapping
    • multi-barcode alevin-fry generate-permit-list
    • alevin-fry collate
    • quantification plus pipeline metadata output
  • Extend chemistry schema support in src/utils/chem_utils.rs with:
    • Organism enum for organism-specific probe set selection
    • ProtocolType enum to distinguish standard RNA, Flex GEX, and ATAC protocols
    • SampleBcListInfo and ProbeSetInfo structures
    • new sample_bc_list and probe_sets fields on CustomChemistry
    • protocol_type() / is_flex_gex() helpers
    • remote_pl_url serde mapping to the remote_url field used in the registry JSON
  • Register Flex chemistries in resources/chemistries.json:
    • 10x-flexv1-gex-3p
    • 10x-flexv2-gex-3p
    • each entry includes its geometry string, protocol metadata, cell barcode whitelist metadata, sample barcode list metadata, and organism-specific probe-set entries for human and mouse
  • Update chemistry creation code so newly added custom chemistries initialize the new optional Flex fields to None.
  • Relax geometry validation for piscem-only geometry extensions in src/utils/af_utils.rs so geometries containing tags like s[N] or b0[N] skip seq_geom_parser validation and defer validation to piscem.
  • Fix Flex-specific pipeline details that differed from the initial implementation:
    • use piscem map-scrna
    • preserve all probes, including excluded probes, in the generated FASTA/t2g mapping so quant has a complete reference-to-gene mapping
    • keep structural constraints enabled only when explicitly requested
  • Add bump_and_publish.sh as a release helper script that:
    • validates a requested SemVer version
    • requires the new version to be greater than the current crate version
    • updates both Cargo.toml and the simpleaf entry in Cargo.lock
    • runs preflight/post-bump cargo check and cargo package
    • commits, tags, and pushes the release bump
    • separates dry-run behavior from actual crates.io publishing via --dry-run and --publish
  • Update CLI help snapshot(s) to include the new flex-quant command.
  • Resolve current compile/clippy issues on this branch:
    • derive Default for ProtocolType
    • replace manual prefix stripping with strip_prefix
    • box the large Chemistry::Custom enum variant
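
The two version checks listed for bump_and_publish.sh (valid SemVer, strictly greater than the current crate version) can be sketched as follows. This is only an illustration in Rust, not the shell script itself: the function names `parse_version` and `check_bump` are hypothetical, and the sketch assumes plain `MAJOR.MINOR.PATCH` versions with no pre-release or build metadata.

```rust
/// Parses a strict `MAJOR.MINOR.PATCH` version string.
/// Assumption: pre-release and build metadata suffixes are not accepted.
pub fn parse_version(s: &str) -> Option<(u64, u64, u64)> {
    let mut parts = s.split('.');
    let mut next = || -> Option<u64> {
        let p = parts.next()?;
        // Reject empty parts, non-digit characters, and leading zeros such as "01".
        let all_digits = !p.is_empty() && p.bytes().all(|b| b.is_ascii_digit());
        if !all_digits || (p.len() > 1 && p.starts_with('0')) {
            return None;
        }
        p.parse::<u64>().ok()
    };
    let version = (next()?, next()?, next()?);
    // A fourth component makes the string invalid.
    if parts.next().is_some() {
        return None;
    }
    Some(version)
}

/// Accepts the bump only when `requested` is strictly greater than `current`.
pub fn check_bump(current: &str, requested: &str) -> Result<(), String> {
    let cur = parse_version(current)
        .ok_or_else(|| format!("invalid current version: {current}"))?;
    let req = parse_version(requested)
        .ok_or_else(|| format!("invalid requested version: {requested}"))?;
    // Tuples compare lexicographically, which matches SemVer precedence
    // for versions without pre-release tags.
    if req > cur {
        Ok(())
    } else {
        Err(format!("{requested} is not greater than {current}"))
    }
}
```

Comparing `(major, minor, patch)` tuples avoids the string-comparison trap where `0.9.9` would sort after `0.19.5`.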

Verification

  • cargo build
  • cargo clippy --all-targets --all-features -- -D warnings
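
Running clippy with `-D warnings` turns lints such as `large_enum_variant` into hard errors, which is the usual reason for boxing a large variant like `Chemistry::Custom`. The sketch below shows the effect on enum size; `BigPayload`, `Inline`, and `Boxed` are hypothetical stand-ins, not the types in simpleaf.

```rust
use std::mem::size_of;

/// Stand-in for a large payload (hypothetical layout, 256 bytes).
pub struct BigPayload {
    pub bytes: [u8; 256],
}

/// Inline variant: every value of the enum is at least as large as `BigPayload`.
pub enum Inline {
    Small(u8),
    Large(BigPayload),
}

/// Boxed variant: the payload lives on the heap and the enum stores a pointer.
pub enum Boxed {
    Small(u8),
    Large(Box<BigPayload>),
}

/// Returns `(size of Inline, size of Boxed)` in bytes.
pub fn sizes() -> (usize, usize) {
    (size_of::<Inline>(), size_of::<Boxed>())
}
```

The trade-off is one extra heap allocation per boxed value in exchange for a much smaller enum when the common variants are small.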

Notes

  • Local untracked data directories/files in the working tree were not included in this PR.

rob-p and others added 9 commits March 17, 2026 12:06
Foundation for 10x Flex GEX support in simpleaf:

New types in chem_utils.rs:
- Organism enum: Human, Mouse, Other(String) — with FromStr,
  Display, clap::ValueEnum, serde support
- ProtocolType enum: StandardRna, FlexGex, Atac — parsed from
  meta.protocol_type in chemistry JSON
- SampleBcListInfo: plist_name + remote_url for probe barcode files
- ProbeSetInfo: name + plist_name + remote_url for organism-specific
  probe sets

Extended CustomChemistry struct:
- sample_bc_list: Option<SampleBcListInfo> — for Flex probe barcode
  rotation files
- probe_sets: Option<HashMap<String, ProbeSetInfo>> — keyed by
  organism name ("human", "mouse")
- protocol_type() helper method — reads from meta.protocol_type
- is_flex_gex() convenience method

All 49 existing tests pass.
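
A minimal, std-only sketch of the two enums described above. It leaves out the serde and clap derives, and several details are assumptions rather than the code in chem_utils.rs: case-insensitive organism parsing, `StandardRna` as the default variant, the `from_meta` helper, and the `"flex_gex"` / `"atac"` spellings of `meta.protocol_type`.

```rust
use std::fmt;
use std::str::FromStr;

/// Organism used to pick a probe set; variants follow the commit message.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Organism {
    Human,
    Mouse,
    Other(String),
}

impl FromStr for Organism {
    type Err = std::convert::Infallible;

    // Assumption: matching is case-insensitive and unknown names become `Other`.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        Ok(match s.to_ascii_lowercase().as_str() {
            "human" => Organism::Human,
            "mouse" => Organism::Mouse,
            _ => Organism::Other(s.to_string()),
        })
    }
}

impl fmt::Display for Organism {
    // Lowercase names line up with the "human" / "mouse" keys of `probe_sets`.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Organism::Human => f.write_str("human"),
            Organism::Mouse => f.write_str("mouse"),
            Organism::Other(name) => f.write_str(name),
        }
    }
}

/// Protocol family; choosing `StandardRna` as the default is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum ProtocolType {
    #[default]
    StandardRna,
    FlexGex,
    Atac,
}

impl ProtocolType {
    /// Hypothetical parser for `meta.protocol_type`; the string spellings are illustrative.
    pub fn from_meta(value: Option<&str>) -> Self {
        match value {
            Some("flex_gex") => ProtocolType::FlexGex,
            Some("atac") => ProtocolType::Atac,
            _ => ProtocolType::StandardRna,
        }
    }

    pub fn is_flex_gex(self) -> bool {
        self == ProtocolType::FlexGex
    }
}
```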

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

New `simpleaf flex-quant` command that orchestrates the complete Flex
GEX pipeline: probe index building, mapping, multi-barcode permit list
generation, hierarchical collation, and quantification.

FlexQuantOpts CLI struct with:
  --chemistry: registered Flex chemistry name
  --organism: strongly-typed Organism enum (human/mouse)
  --probe-set: optional explicit probe CSV or FASTA
  --sample-bc-list: optional explicit probe barcode file
  --index: optional pre-built probe index
  --reads1/--reads2: FASTQ files
  --resolution, --threads, --kmer-length, etc.

Pipeline orchestration (flex_quant.rs):
- Resource resolution: auto-fetch cell BC whitelist, probe barcode file,
  and probe set from chemistry registry (with content-hash caching)
- Probe CSV → FASTA conversion with t2g map and metadata extraction
- Probe index building with piscem
- Mapping with chemistry geometry (includes s[N] tag)
- generate-permit-list with --sample-bc-list --unfiltered-pl
- Collate and quant (auto-detect multi-barcode from RAD)
- Full pipeline metadata output (simpleaf_flex_quant_info.json)

Example usage:
  simpleaf flex-quant \
    --chemistry 10x-flexv1-gex-3p \
    --organism human \
    -1 R1.fq.gz -2 R2.fq.gz \
    -o output -t 8

All 49 tests pass (updated CLI snapshot for new subcommand).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add 10x-flexv1-gex-3p and 10x-flexv2-gex-3p entries with all resource
URLs and blake3 content hashes.

10x-flexv1-gex-3p:
  - geometry: 1{b[16]u[12]x:}2{r[50]x[18]s[8]x:}
  - cell BC: 737K-fixed-rna-profiling.txt (hash: 9fe0cb...)
  - probe BC: 128 entries, 16 samples x 8 rotations, 8bp (hash: 5dc9d1...)
  - probe sets: human v1.1.0 (hash: 9ccbd0...) + mouse v1.1.1 (hash: b4b811...)

10x-flexv2-gex-3p:
  - geometry: 1{b[16]u[12]x[10]s[10]}2{r:}
  - cell BC: 737K-flex-v2.txt (hash: dcb018...)
  - probe BC: 384 entries, 10bp barcodes (hash: 5e7ef9...)
  - probe sets: human v2.0.0 (hash: bfa53e...) + mouse v2.0.0 (hash: 8e0ea4...)

All resources hosted on UMD Box with stable URLs.
All 49 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Three fixes discovered during end-to-end testing:

1. Use 'map-sc' not 'map-scrna' as the piscem subcommand name
2. Add #[serde(rename = "remote_url")] to CustomChemistry.remote_pl_url
   so the JSON key 'remote_url' deserializes correctly (was always None)
3. Remove the manual empty unmapped_bc_count_collated.bin workaround
   since alevin-fry collate now produces it properly

Successfully tested on 4-plex human colorectal/kidney Flex v1 dataset:
- 5.6M cells across 16 sample channels (4 real + 12 noise)
- ~97.7% mapping rate
- Full auto-fetch pipeline: probe CSV download, FASTA conversion,
  index build, mapping, GPL, collate, quant

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Geometries containing piscem-only tags (s[N] for sample barcodes,
b<N>[L] for numbered barcodes) cannot be parsed by seq_geom_parser,
which only understands the standard b/u/r/x/f tags. These extended
geometries are validated by piscem at mapping time instead.

This fixes simpleaf inspect failing with a parse error on Flex
chemistry entries like 1{b[16]u[12]x:}2{r[50]x[18]s[8]x:}.
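
One way to detect the piscem-only tags described above is a small scan over the geometry string. This is a sketch under the assumption that the check only needs to spot `s[` or `b<digits>[`; the function name `has_piscem_only_tags` is hypothetical and the actual check in af_utils.rs may differ.

```rust
/// Returns true when `geometry` contains a tag that, per the commit message,
/// only piscem understands: `s[...]` or a numbered barcode such as `b0[...]`.
pub fn has_piscem_only_tags(geometry: &str) -> bool {
    let bytes = geometry.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            // Sample-barcode tag: `s` immediately followed by `[`.
            b's' if bytes.get(i + 1) == Some(&b'[') => return true,
            // Numbered barcode: `b`, one or more digits, then `[`.
            b'b' => {
                let mut j = i + 1;
                while j < bytes.len() && bytes[j].is_ascii_digit() {
                    j += 1;
                }
                if j > i + 1 && bytes.get(j) == Some(&b'[') {
                    return true;
                }
            }
            _ => {}
        }
        i += 1;
    }
    false
}
```

A caller would skip seq_geom_parser when this returns true and leave validation to piscem; a plain `b[16]` barcode is not flagged because no digits sit between `b` and `[`.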

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Two fixes for the end-to-end flex-quant pipeline:

1. Probe CSV conversion now includes ALL probes (included + excluded)
   in both the FASTA and t2g map. The index contains all probes, so
   quant needs a t2g entry for every reference. Previously only
   "included" probes were emitted, causing a mismatch (53,459 t2g
   entries vs 54,580 index references).

2. Use piscem-rs "map-scrna" subcommand (not "map-sc" which is the
   C++ piscem name).
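
The first fix can be illustrated with a small conversion routine that emits one FASTA record and one t2g line per probe, without filtering on the `included` flag. This is a sketch, not the code in flex_quant.rs: the function name is hypothetical, and it assumes the probe CSV has `#`-prefixed metadata lines followed by a header row naming the columns `gene_id`, `probe_seq`, and `probe_id`.

```rust
/// Converts probe-set CSV text into `(FASTA, t2g)` strings.
/// Assumed columns (located via the header row): gene_id, probe_seq, probe_id.
/// Every row is kept regardless of its `included` value, so the t2g map
/// covers every reference that ends up in the index.
pub fn probes_to_fasta_and_t2g(csv: &str) -> Result<(String, String), String> {
    // Skip blank lines and `#` metadata lines.
    let mut rows = csv
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with('#'));

    let header: Vec<&str> = rows.next().ok_or("missing header row")?.split(',').collect();
    let col = |name: &str| {
        header
            .iter()
            .position(|h| *h == name)
            .ok_or_else(|| format!("missing column `{name}`"))
    };
    let (gene_col, seq_col, id_col) = (col("gene_id")?, col("probe_seq")?, col("probe_id")?);

    let (mut fasta, mut t2g) = (String::new(), String::new());
    for row in rows {
        // Naive split: assumes no quoted fields containing commas.
        let fields: Vec<&str> = row.split(',').collect();
        let get = |i: usize| {
            fields
                .get(i)
                .copied()
                .ok_or_else(|| format!("short row: {row}"))
        };
        let (gene, seq, id) = (get(gene_col)?, get(seq_col)?, get(id_col)?);
        fasta.push_str(&format!(">{id}\n{seq}\n"));
        t2g.push_str(&format!("{id}\t{gene}\n"));
    }
    Ok((fasta, t2g))
}
```

Because both outputs are written in the same loop, the number of t2g entries always equals the number of FASTA records, which avoids the 53,459 vs 54,580 mismatch described above.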

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Structural constraints are not appropriate for probe-based Flex mapping
where references are short (~50bp) probe sequences. The flag has been
removed from the flex-quant CLI entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rob-p rob-p merged commit e03e7f5 into dev Mar 19, 2026
5 checks passed