Skip to content

feat(sort): add CB (cellular barcode) to template-coordinate sort key#160

Merged
nh13 merged 1 commit intomainfrom
feat/nh/add-cb-to-template-sort
Mar 6, 2026
Merged

feat(sort): add CB (cellular barcode) to template-coordinate sort key#160
nh13 merged 1 commit intomainfrom
feat/nh/add-cb-to-template-sort

Conversation

@nh13
Copy link
Member

@nh13 nh13 commented Mar 4, 2026

Summary

Ports fgbio PR #1142 to include the CB (cellular barcode) tag in the template-coordinate sort order for single-cell data. Without CB in the sort, templates from different cells at the same locus are interleaved, breaking grouper adjacency assumptions.

The new comparison order is:
tid1 → tid2 → pos1 → pos2 → neg1 → neg2 → **CB** → MI → name → library → is_upper

Changes

  • inline_buffer.rs: Add cb_hash: u64 to TemplateKey (5th packed field), expanding serialized key from 32→40 bytes and inline header from 40→48 bytes. Updated radix sort, Ord, core_cmp, and serialization.
  • keys.rs: Add cid: String to TemplateCoordinateKey with length-first then lexicographic comparison (matching fgbio). Inserted after neg2, before MI in Ord impl.
  • raw.rs: Add cell_tag: Option<[u8; 2]> to RawExternalSorter with builder method. extract_template_key_inline hashes CB tag with ahash (distinct fixed seeds for determinism).
  • sort.rs (fgumi-raw-bam crate): Add cell_tag parameter to compare_template_coordinate_raw() with CB comparison after neg2, before MI. None skips CB (backwards compatible).
  • sort.rs (command): Add --cell-tag / -c CLI arg (default "CB"), threaded to sorter and verify mode.
  • group.rs / dedup.rs: Updated help text and error messages to mention --cell-tag CB in sort suggestions.

Tests

  • 3 new raw comparison tests (CB tiebreak, mixed presence, length-first ordering)
  • CB-specific assertions added to TemplateCoordinateKey regression test (length-first, lexicographic, empty vs non-empty)
  • All existing TemplateKey tests updated for new cb_hash parameter

Test plan

  • cargo ci-test — all 1852 tests pass
  • cargo ci-fmt — clean
  • cargo ci-lint — clean
  • End-to-end sort with CB-tagged single-cell BAM to verify output order

@nh13 nh13 temporarily deployed to github-actions March 4, 2026 01:03 — with GitHub Actions Inactive
@nh13 nh13 marked this pull request as ready for review March 4, 2026 01:05
@coderabbitai
Copy link

coderabbitai bot commented Mar 4, 2026

Warning

Rate limit exceeded

@nh13 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 41 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a9bca7e7-ab44-445d-838a-0ea2c90e388a

📥 Commits

Reviewing files that changed from the base of the PR and between f8fa67b and ea9adec.

📒 Files selected for processing (5)
  • src/commands/dedup.rs
  • src/commands/group.rs
  • src/commands/sort.rs
  • src/lib/sort/inline_buffer.rs
  • src/lib/sort/raw.rs
📝 Walkthrough

Walkthrough

Adds cellular-barcode (CB) support across template-coordinate sorting. compare_template_coordinate_raw now accepts cell_tag: Option<&[u8;2]> and can tiebreak by CB (length then lexicographic); import set includes find_string_tag_in_record. CLI Sort gains a cell_tag field and parsing, and RawExternalSorter gains a cell_tag builder. TemplateKey gains cb_hash: u64, expanding serialized key from 32→40 bytes and TemplateInlineHeader size accordingly. TemplateCoordinateKey now carries cid. extract_template_key_inline signature updated to accept the optional cell tag. Tests and messages updated.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed Title clearly and concisely describes the main change: adding CB (cellular barcode) to template-coordinate sort key.
Description check ✅ Passed Description is comprehensive and directly related to the changeset, detailing motivation, new sort order, implementation changes, and test coverage.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/nh/add-cb-to-template-sort

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/sort/inline_buffer.rs (1)

384-390: ⚠️ Potential issue | 🟠 Major

Unmapped template keys now discard CB in CB-aware mode.

Line 388 hardcodes cb_hash: 0 in TemplateKey::unmapped, so fully unmapped records are no longer cell-separated when --cell-tag is active.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/sort/inline_buffer.rs` around lines 384 - 390, The unmapped
constructor currently zeroes cb_hash which causes unmapped records to lose
cell-boundary tagging in CB-aware mode; update TemplateKey::unmapped to set
cb_hash to the unmapped sentinel (e.g., u64::MAX) instead of 0 so cell-tag logic
can detect an unmapped CB. Modify the unmapped function (the constructor that
sets primary, secondary, cb_hash, tertiary, name_hash_upper) to assign cb_hash =
u64::MAX.
🧹 Nitpick comments (3)
src/lib/sort/keys.rs (1)

1214-1227: Parameterize these CB comparison cases with rstest.

This block is a clear case matrix and is easier to maintain as parameterized tests.

As per coding guidelines "Use rstest for parameterized tests".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/sort/keys.rs` around lines 1214 - 1227, Replace the three individual
CB comparison assertions with a single parameterized rstest that supplies the CB
strings and expected ordering; create a test function (e.g., test_cbd_ordering
or test_cid_comparisons) annotated with #[rstest] and include cases for
("A","AA", Ordering::Less), ("A","B", Ordering::Less), and ("","AAAA",
Ordering::Less), constructing keys via key(...) inside the test and asserting
the comparison matches the expected result; ensure you import rstest and
std::cmp::Ordering and keep the original key(...) argument pattern to build
operands.
src/commands/sort.rs (1)

298-304: Extract shared cell-tag parsing into a helper.

The same parse/convert block appears twice; a small helper would reduce drift.

Also applies to: 376-382

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/commands/sort.rs` around lines 298 - 304, Extract the repeated
parse/convert block into a small helper like parse_cell_tag_option(&self) ->
Result<Option<[u8;2]>, Error> that checks if self.order ==
SortOrderArg::TemplateCoordinate, calls string_to_tag(&self.cell_tag,
"cell-tag")?, and returns Some([tag.as_ref()[0], tag.as_ref()[1]]) or None
otherwise; replace the duplicate code blocks that currently compute cell_tag
(the snippet using string_to_tag and creating Some([..]) and the similar block
at the other location) with calls to this helper so both sites reuse the same
logic and error handling.
crates/fgumi-raw-bam/src/sort.rs (1)

493-557: Use rstest for these CB permutation tests.

These cases are table-style variants of one scenario and are better expressed as parameterized tests.

As per coding guidelines "Use rstest for parameterized tests".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/fgumi-raw-bam/src/sort.rs` around lines 493 - 557, Replace the three
near-duplicate unit tests with a single parameterized rstest that iterates the
CB permutations and expected Ordering values; locate the existing tests
test_compare_template_coordinate_raw_cb_tag_tiebreak,
test_compare_template_coordinate_raw_cb_mixed_presence, and
test_compare_template_coordinate_raw_cb_length_first and consolidate them into
one #[rstest] function that takes parameters (aux_a: Vec<u8>, aux_b: Vec<u8>,
expected: Ordering) and inside builds a and b using make_bam_bytes and asserts
compare_template_coordinate_raw(&a, &b, Some(b"CB")) == expected (also include
the case where cell_tag is None for the tiebreak test as a separate param or
separate assertion); add use rstest::rstest at top and ensure Ordering is in
scope.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/commands/dedup.rs`:
- Line 1042: The guidance strings in src/commands/dedup.rs hardcode the cell tag
("CB") instead of using the configured --cell-tag value; update the user-facing
messages to interpolate the actual cell-tag variable (e.g., cell_tag or
opts.cell_tag) when building the help/guidance strings so the example command
uses the real flag value; apply the same change to all occurrences referenced
(around the strings at lines ~1042, ~1191-1192, ~1402-1403) so messages
consistently show the configured --cell-tag rather than "CB".

In `@src/commands/group.rs`:
- Around line 928-929: The remediation message hardcodes "--cell-tag CB"; update
the string construction in src/commands/group.rs (where the grouping error/help
message is built, e.g., inside the group command handler function) to
interpolate the actual cell-tag value variable (commonly named cell_tag,
args.cell_tag, or obtained via matches.value_of("cell-tag")) instead of the
literal "CB", so the suggested fgumi command uses the user-provided cell tag;
ensure the string formatting/escaping is correct when inserting that variable.

In `@src/lib/sort/inline_buffer.rs`:
- Around line 305-307: The template-key packing in inline_buffer.rs produces a
different field order than TemplateCoordinateKey (causing inconsistent sorting):
adjust the packing in the tertiary key construction (the block that builds
`tertiary`) so its field order matches TemplateCoordinateKey’s ordering (cid ->
mid -> name -> library_idx) instead of placing `library` before `MI`; update the
same ordering logic in the other affected sections (around the tertiary
construction and the similar packs at lines referenced 369-374 and 452-460) so
all sort paths encode keys identically.

In `@src/lib/sort/raw.rs`:
- Around line 1779-1788: The code reduces the cell barcode (CB) to a 64-bit hash
(cb_hash) using ahash::RandomState::with_seeds and state.hash_one (triggered
from cell_tag and find_string_tag_in_record), which breaks the required ordering
(should be by CB length then lexicographic) and can create collisions; change
the implementation so the sort key retains the CB length and raw bytes instead
of a numeric hash (e.g., produce a tuple key (cb_len, cb_bytes) or otherwise
emit length then lexicographic value) and update any comparisons that use
cb_hash to compare first by length then by lexicographic byte order; remove use
of ahash::RandomState::with_seeds/state.hash_one for CB ordering to preserve
correct template-coordinate ordering semantics.

---

Outside diff comments:
In `@src/lib/sort/inline_buffer.rs`:
- Around line 384-390: The unmapped constructor currently zeroes cb_hash which
causes unmapped records to lose cell-boundary tagging in CB-aware mode; update
TemplateKey::unmapped to set cb_hash to the unmapped sentinel (e.g., u64::MAX)
instead of 0 so cell-tag logic can detect an unmapped CB. Modify the unmapped
function (the constructor that sets primary, secondary, cb_hash, tertiary,
name_hash_upper) to assign cb_hash = u64::MAX.

---

Nitpick comments:
In `@crates/fgumi-raw-bam/src/sort.rs`:
- Around line 493-557: Replace the three near-duplicate unit tests with a single
parameterized rstest that iterates the CB permutations and expected Ordering
values; locate the existing tests
test_compare_template_coordinate_raw_cb_tag_tiebreak,
test_compare_template_coordinate_raw_cb_mixed_presence, and
test_compare_template_coordinate_raw_cb_length_first and consolidate them into
one #[rstest] function that takes parameters (aux_a: Vec<u8>, aux_b: Vec<u8>,
expected: Ordering) and inside builds a and b using make_bam_bytes and asserts
compare_template_coordinate_raw(&a, &b, Some(b"CB")) == expected (also include
the case where cell_tag is None for the tiebreak test as a separate param or
separate assertion); add use rstest::rstest at top and ensure Ordering is in
scope.

In `@src/commands/sort.rs`:
- Around line 298-304: Extract the repeated parse/convert block into a small
helper like parse_cell_tag_option(&self) -> Result<Option<[u8;2]>, Error> that
checks if self.order == SortOrderArg::TemplateCoordinate, calls
string_to_tag(&self.cell_tag, "cell-tag")?, and returns Some([tag.as_ref()[0],
tag.as_ref()[1]]) or None otherwise; replace the duplicate code blocks that
currently compute cell_tag (the snippet using string_to_tag and creating
Some([..]) and the similar block at the other location) with calls to this
helper so both sites reuse the same logic and error handling.

In `@src/lib/sort/keys.rs`:
- Around line 1214-1227: Replace the three individual CB comparison assertions
with a single parameterized rstest that supplies the CB strings and expected
ordering; create a test function (e.g., test_cbd_ordering or
test_cid_comparisons) annotated with #[rstest] and include cases for ("A","AA",
Ordering::Less), ("A","B", Ordering::Less), and ("","AAAA", Ordering::Less),
constructing keys via key(...) inside the test and asserting the comparison
matches the expected result; ensure you import rstest and std::cmp::Ordering and
keep the original key(...) argument pattern to build operands.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bf4b24f and 0c14864.

📒 Files selected for processing (7)
  • crates/fgumi-raw-bam/src/sort.rs
  • src/commands/dedup.rs
  • src/commands/group.rs
  • src/commands/sort.rs
  • src/lib/sort/inline_buffer.rs
  • src/lib/sort/keys.rs
  • src/lib/sort/raw.rs

@codecov
Copy link

codecov bot commented Mar 4, 2026

Codecov Report

❌ Patch coverage is 90.37657% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.28%. Comparing base (3d3d194) to head (ea9adec).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/commands/sort.rs 71.42% 12 Missing ⚠️
src/lib/sort/inline_buffer.rs 90.80% 8 Missing ⚠️
src/lib/sort/raw.rs 97.27% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #160      +/-   ##
==========================================
+ Coverage   84.09%   84.28%   +0.19%     
==========================================
  Files         128      128              
  Lines       51771    51978     +207     
==========================================
+ Hits        43538    43812     +274     
+ Misses       8233     8166      -67     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@nh13 nh13 force-pushed the feat/nh/add-cb-to-template-sort branch from 0c14864 to 79dc8b2 Compare March 4, 2026 20:42
@nh13 nh13 temporarily deployed to github-actions March 4, 2026 20:42 — with GitHub Actions Inactive
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
src/lib/sort/inline_buffer.rs (1)

369-373: ⚠️ Potential issue | 🔴 Critical

TemplateKey still orders library before MI/name, violating the declared template-coordinate order.

Because tertiary packs library in higher bits and is compared before name_hash_upper, the effective order is ... -> CB -> library -> MI -> name. That disagrees with the intended ... -> CB -> MI -> name -> library -> is_upper.

Also applies to: 452-460

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/sort/inline_buffer.rs` around lines 369 - 373, The tertiary packing
in inline_buffer.rs currently places library in the highest bits (p3), causing
library to be compared before MI/name; fix the pack order so MI and
name_hash_upper occupy the higher bits and library comes after them (with
is_upper as the lowest bit). Specifically update the p3 construction (and the
analogous pack at the other occurrence around the 452-460 region) to shift and
combine mi.0 and the name hash upper bits into the top bits, then include
library next, and finally the is_upper bit (from !mi.1) as the least significant
bit so the effective comparison order becomes CB -> MI -> name -> library ->
is_upper.
src/lib/sort/raw.rs (1)

1779-1788: ⚠️ Potential issue | 🔴 Critical

CB hashing breaks required template-coordinate ordering semantics.

cb_hash comparison is not equivalent to CB length-first + lexicographic ordering and can collide. It also makes missing CB (0) differ from empty CB (hash("")), which is inconsistent with the raw comparator semantics.

Please carry a collision-free CB ordering representation (length + bytes, or equivalent) through TemplateKey instead of a hash-only key.

🧹 Nitpick comments (1)
src/commands/sort.rs (1)

276-285: Add focused tests for parse_cell_tag.

This new branch/parsing logic should have direct coverage (template-coordinate valid/invalid tag, non-template returns None).

Proposed test additions
 #[cfg(test)]
 mod tests {
     use super::*;
+    use rstest::rstest;
+
+    #[rstest]
+    #[case(SortOrderArg::TemplateCoordinate, "CB", Some(*b"CB"), true)]
+    #[case(SortOrderArg::TemplateCoordinate, "CR", Some(*b"CR"), true)]
+    #[case(SortOrderArg::Coordinate, "CB", None, true)]
+    #[case(SortOrderArg::Queryname, "CB", None, true)]
+    #[case(SortOrderArg::TemplateCoordinate, "C", None, false)]
+    fn test_parse_cell_tag(
+        #[case] order: SortOrderArg,
+        #[case] cell_tag: &str,
+        #[case] expected: Option<[u8; 2]>,
+        #[case] ok: bool,
+    ) {
+        let cmd = Sort {
+            input: PathBuf::from("in.bam"),
+            output: Some(PathBuf::from("out.bam")),
+            verify: false,
+            order,
+            max_memory: 768 * 1024 * 1024,
+            memory_per_thread: true,
+            tmp_dir: None,
+            threads: 1,
+            compression: CompressionOptions::default(),
+            temp_compression: 1,
+            write_index: false,
+            cell_tag: cell_tag.to_string(),
+        };
+
+        let got = cmd.parse_cell_tag();
+        if ok {
+            assert_eq!(got.unwrap(), expected);
+        } else {
+            assert!(got.is_err());
+        }
+    }

As per coding guidelines, "Use rstest for parameterized tests".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/commands/sort.rs` around lines 276 - 285, Add rstest-based unit tests
that directly exercise the parse_cell_tag method: (1) when self.order ==
SortOrderArg::TemplateCoordinate and self.cell_tag is a valid two-char tag (use
string_to_tag expectations) assert Ok(Some([u8;2])) with correct bytes; (2) when
self.order == SortOrderArg::TemplateCoordinate and self.cell_tag is invalid
assert Err from string_to_tag; and (3) when self.order !=
SortOrderArg::TemplateCoordinate assert Ok(None). Reference the parse_cell_tag
method, SortOrderArg::TemplateCoordinate variant, the cell_tag field, and the
string_to_tag helper when writing test cases and use rstest parameterization to
cover valid/invalid/non-template cases.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/lib/sort/inline_buffer.rs`:
- Around line 369-373: The tertiary packing in inline_buffer.rs currently places
library in the highest bits (p3), causing library to be compared before MI/name;
fix the pack order so MI and name_hash_upper occupy the higher bits and library
comes after them (with is_upper as the lowest bit). Specifically update the p3
construction (and the analogous pack at the other occurrence around the 452-460
region) to shift and combine mi.0 and the name hash upper bits into the top
bits, then include library next, and finally the is_upper bit (from !mi.1) as
the least significant bit so the effective comparison order becomes CB -> MI ->
name -> library -> is_upper.

---

Nitpick comments:
In `@src/commands/sort.rs`:
- Around line 276-285: Add rstest-based unit tests that directly exercise the
parse_cell_tag method: (1) when self.order == SortOrderArg::TemplateCoordinate
and self.cell_tag is a valid two-char tag (use string_to_tag expectations)
assert Ok(Some([u8;2])) with correct bytes; (2) when self.order ==
SortOrderArg::TemplateCoordinate and self.cell_tag is invalid assert Err from
string_to_tag; and (3) when self.order != SortOrderArg::TemplateCoordinate
assert Ok(None). Reference the parse_cell_tag method,
SortOrderArg::TemplateCoordinate variant, the cell_tag field, and the
string_to_tag helper when writing test cases and use rstest parameterization to
cover valid/invalid/non-template cases.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7a9fd425-6a9f-4730-bdcc-f18e22754fc8

📥 Commits

Reviewing files that changed from the base of the PR and between 0c14864 and 79dc8b2.

📒 Files selected for processing (7)
  • crates/fgumi-raw-bam/src/sort.rs
  • src/commands/dedup.rs
  • src/commands/group.rs
  • src/commands/sort.rs
  • src/lib/sort/inline_buffer.rs
  • src/lib/sort/keys.rs
  • src/lib/sort/raw.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/commands/dedup.rs

@nh13 nh13 force-pushed the feat/nh/add-cb-to-template-sort branch from 79dc8b2 to f8fa67b Compare March 4, 2026 23:08
@nh13 nh13 temporarily deployed to github-actions March 4, 2026 23:08 — with GitHub Actions Inactive
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (3)
src/commands/dedup.rs (1)

1042-1042: ⚠️ Potential issue | 🟠 Major

Replace hardcoded CB in help text with a generic/matching cell-tag instruction.

Line 1042 still hardcodes --cell-tag CB, which conflicts with configurable --cell-tag and can produce incorrect operator guidance.

Suggested fix
-- Must be sorted with `fgumi sort --order template-coordinate --cell-tag CB`
+- Must be sorted with `fgumi sort --order template-coordinate` using the matching `--cell-tag`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/commands/dedup.rs` at line 1042, The help text currently hardcodes "CB"
in the string "Must be sorted with `fgumi sort --order template-coordinate
--cell-tag CB`"; update the help/usage string in dedup.rs (where the
`--cell-tag` argument/flag help is defined) to avoid a hardcoded tag by either
showing a placeholder like `--cell-tag <TAG>`/`--cell-tag <CELL_TAG>` or by
interpolating the actual `--cell-tag` value used by the command; locate the
literal help string and replace "CB" with the placeholder or dynamic reference
so the message matches the configurable `--cell-tag` flag.
src/lib/sort/raw.rs (1)

1758-1760: ⚠️ Potential issue | 🔴 Critical

CB hashing here breaks the required CB sort semantics.

The required order is CB length-first then lexicographic. Hashing at Line 1779 loses that order and introduces collision risk, so template-coordinate output can diverge from spec and from other CB-aware comparison paths.

Please store a sortable CB key (length + bytes) rather than a hash in the template key path.

Also applies to: 1778-1789

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/sort/raw.rs` around lines 1758 - 1760, The template key currently
hashes the CB value when cell_tag is Some (the CB hashing between neg2 and MI),
which breaks required length-first then lexicographic CB ordering and risks
collisions; instead, change the template key construction to append a sortable
CB key: first the CB length (as a fixed-width byte or varint consistent with
other keys) followed by the raw CB bytes (no hashing). Update the code paths
that build the template key where cell_tag is used (the block inserting the
hashed CB between neg2 and MI) to write length+bytes into the same buffer/field
so downstream comparisons use length-first then lexicographic order and remove
the hash usage.
src/lib/sort/inline_buffer.rs (1)

305-307: ⚠️ Potential issue | 🟠 Major

TemplateKey ordering still diverges from the intended template-coordinate field order.

At Line 371, library is packed ahead of MI, and tertiary is compared before name_hash_upper (Line 452+), producing CB -> library -> MI -> name. The intended order is CB -> MI -> name -> library. This can make sort output disagree across key paths.

Also applies to: 369-373, 452-460, 465-476

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/sort/inline_buffer.rs` around lines 305 - 307, The TemplateKey
packing/comparison in inline_buffer.rs currently places `library` before `MI`
and compares `tertiary` (tertiary/`is_upper`/name_hash_upper order) before
`name_hash_upper`, resulting in CB -> library -> MI -> name; update the packing
and comparison logic for the TemplateKey so the fields are emitted and compared
in the intended order: CB -> MI -> name -> library. Specifically, in the code
that constructs/packs the u64 key words for TemplateKey (the packing block that
builds the five u64 values) reorder the fields so `MI` is packed prior to
`library` and the `name_hash_upper` components are placed/compared before the
`tertiary`/`is_upper` component; likewise update the comparison path where those
u64 words are compared (the TemplateKey comparison routine) to follow CB, MI,
name, library sequence so sorting is consistent across key paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/commands/dedup.rs`:
- Line 1042: The help text currently hardcodes "CB" in the string "Must be
sorted with `fgumi sort --order template-coordinate --cell-tag CB`"; update the
help/usage string in dedup.rs (where the `--cell-tag` argument/flag help is
defined) to avoid a hardcoded tag by either showing a placeholder like
`--cell-tag <TAG>`/`--cell-tag <CELL_TAG>` or by interpolating the actual
`--cell-tag` value used by the command; locate the literal help string and
replace "CB" with the placeholder or dynamic reference so the message matches
the configurable `--cell-tag` flag.

In `@src/lib/sort/inline_buffer.rs`:
- Around line 305-307: The TemplateKey packing/comparison in inline_buffer.rs
currently places `library` before `MI` and compares `tertiary`
(tertiary/`is_upper`/name_hash_upper order) before `name_hash_upper`, resulting
in CB -> library -> MI -> name; update the packing and comparison logic for the
TemplateKey so the fields are emitted and compared in the intended order: CB ->
MI -> name -> library. Specifically, in the code that constructs/packs the u64
key words for TemplateKey (the packing block that builds the five u64 values)
reorder the fields so `MI` is packed prior to `library` and the
`name_hash_upper` components are placed/compared before the
`tertiary`/`is_upper` component; likewise update the comparison path where those
u64 words are compared (the TemplateKey comparison routine) to follow CB, MI,
name, library sequence so sorting is consistent across key paths.

In `@src/lib/sort/raw.rs`:
- Around line 1758-1760: The template key currently hashes the CB value when
cell_tag is Some (the CB hashing between neg2 and MI), which breaks required
length-first then lexicographic CB ordering and risks collisions; instead,
change the template key construction to append a sortable CB key: first the CB
length (as a fixed-width byte or varint consistent with other keys) followed by
the raw CB bytes (no hashing). Update the code paths that build the template key
where cell_tag is used (the block inserting the hashed CB between neg2 and MI)
to write length+bytes into the same buffer/field so downstream comparisons use
length-first then lexicographic order and remove the hash usage.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 27d67f7d-93ef-4450-a311-fd3bd1f248ca

📥 Commits

Reviewing files that changed from the base of the PR and between 79dc8b2 and f8fa67b.

📒 Files selected for processing (7)
  • crates/fgumi-raw-bam/src/sort.rs
  • src/commands/dedup.rs
  • src/commands/group.rs
  • src/commands/sort.rs
  • src/lib/sort/inline_buffer.rs
  • src/lib/sort/keys.rs
  • src/lib/sort/raw.rs

Port fgbio PR #1142 to include the CB tag in the template-coordinate
sort order for single-cell data. Without CB in the sort, templates from
different cells at the same locus are interleaved, breaking grouper
adjacency assumptions.

Comparison order is now:
  tid1 → tid2 → pos1 → pos2 → neg1 → neg2 → CB → MI → name → library → is_upper

Changes:
- Add cb_hash (u64) to TemplateKey between secondary and tertiary,
  expanding packed key from 32 to 40 bytes and inline header from 40
  to 48 bytes
- Add cid (String) to TemplateCoordinateKey with length-first then
  lexicographic comparison matching fgbio
- Add cell_tag parameter to compare_template_coordinate_raw() for raw
  byte comparison path
- Add cell_tag field and builder method to RawExternalSorter
- Add --cell-tag / -c CLI argument to sort command (default "CB")
- Update group and dedup help text to mention --cell-tag in sort
  suggestions
@nh13 nh13 force-pushed the feat/nh/add-cb-to-template-sort branch from f8fa67b to ea9adec Compare March 6, 2026 06:09
@nh13 nh13 temporarily deployed to github-actions March 6, 2026 06:09 — with GitHub Actions Inactive
@nh13 nh13 merged commit ec8c1ab into main Mar 6, 2026
7 checks passed
@nh13 nh13 deleted the feat/nh/add-cb-to-template-sort branch March 6, 2026 17:53
@nh13 nh13 mentioned this pull request Mar 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant