Add alignment position info and read length to RAD format #9

Open
ygao61 wants to merge 27 commits into dev from dev-pos

@ygao61 ygao61 commented Dec 10, 2025

This PR adds alignment-level position fields and a file-level read length to the RAD format.

@rob-p rob-p changed the base branch from main to dev December 11, 2025 02:15
rob-p and others added 26 commits December 11, 2025 00:21
Co-authored-by: rob-p <361470+rob-p@users.noreply.github.com>
Co-authored-by: rob-p <361470+rob-p@users.noreply.github.com>
…ap in mapping hot path

Co-authored-by: rob-p <361470+rob-p@users.noreply.github.com>
…lementation

Co-authored-by: rob-p <361470+rob-p@users.noreply.github.com>
Co-authored-by: rob-p <361470+rob-p@users.noreply.github.com>
…ation

Replace phmap::flat_hash_map with ankerl::unordered_dense in mapping hot path
Upgrade sshash (ee8513b → afff26d) for improved lookup performance
(~10-15% faster mapping). Adapt piscem to the new sshash API:

- dictionary<Kmer> → dictionary<Kmer, Offsets>
- lookup_advanced() → lookup()
- lookup_result field renames: contig_id → string_id,
  kmer_id_in_contig → kmer_id_in_string, contig_size replaced by
  string_begin/string_end, new kmer_offset field
- MurmurHash2_64 (removed upstream) → XXH64
- build.cpp: dict.size() → dict.num_kmers(), iterator check updated
  for dna_uint_kmer_t return type, perf_test API updated
- build_contig_table.cpp: fix "short refs" → "short seqs" JSON key

Verified: index build --check passes all 96M k-mers, end-to-end
position check passes on 245k gencode v49 transcripts, and bulk
mapping of 26M reads produces identical results to the pre-upgrade
baseline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…benchmark

The formula for computing remaining extendable positions was subtracting
k twice (once converting string length to num_kmers, once offsetting by
kmer_id + k), underestimating by k-1 positions.  This caused ~80% more
full dictionary lookups than necessary — fixing it yields a 16% speedup
in pure streaming query time and brings the extension rate from 83% to 97%.
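
One reading of the off-by-(k-1) error described above, as a small sketch. The function and parameter names are hypothetical and are not taken from piscem; the sketch only assumes forward extension within a single indexed string of length `string_len`, with `kmer_id` the 0-based index of the current k-mer in that string.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration only.
//   string_len: length in bases of the indexed string
//   kmer_id:    0-based index of the current k-mer within that string
//   k:          k-mer length

// Buggy form: k is subtracted twice.
inline int64_t remaining_buggy(int64_t string_len, int64_t kmer_id, int64_t k) {
  int64_t num_kmers = string_len - k + 1;  // first subtraction of k
  return num_kmers - (kmer_id + k);        // second subtraction of k
}

// Corrected form: number of k-mers that follow the current one in the string.
inline int64_t remaining_fixed(int64_t string_len, int64_t kmer_id, int64_t k) {
  assert(k >= 1 && kmer_id >= 0 && kmer_id + k <= string_len);
  return string_len - (kmer_id + k);  // equals num_kmers - 1 - kmer_id
}
```

Under these assumptions the two forms always differ by exactly `k - 1`, which matches the underestimate reported in the commit message.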

Also adds a standalone streaming_lookup_bench executable for measuring
single-threaded query throughput independently of mapping I/O.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
lean_read_iterator maintains its own fw/rc 2-bit k-mer words via O(1)
shift+OR operations, independent of the sshash streaming engine. This
enables O(1) is_equivalent() comparison against reference SPSS bits
without constructing full Kmer objects or invoking the engine.

The sshash engine is called lazily (only on actual dictionary lookups)
and stays dormant during contig-walking hot paths where only bitwise
comparison is needed. advance(n) rebuilds from scratch when n > k
(O(k) vs O(n) rolls).
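
A minimal sketch of the rolling fw/rc idea, not the actual `lean_read_iterator`. It assumes a 2-bit encoding of A=0, C=1, G=2, T=3 with the newest base in the low bits of the forward word; the real index may use a different encoding and bit order. The names `rolling_kmer`, `encode_base`, and `matches` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed encoding for illustration: A=0, C=1, G=2, T=3, so complement = 3 - code.
inline uint64_t encode_base(char c) {
  switch (c) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    default: return 3;  // treats everything else as T; a real iterator must handle N
  }
}

struct rolling_kmer {
  explicit rolling_kmer(unsigned k_)
      : k(k_),
        mask(k_ == 32 ? ~0ULL : ((1ULL << (2 * k_)) - 1)),
        rc_shift(2 * (k_ - 1)) {
    assert(k_ >= 1 && k_ <= 32);
  }

  // O(1) update of both strands: one shift and one OR per word.
  void push(char c) {
    uint64_t code = encode_base(c);
    fw = ((fw << 2) | code) & mask;             // append base on the right
    rc = (rc >> 2) | ((3 - code) << rc_shift);  // prepend complement on the left
    if (filled < k) ++filled;
  }

  bool valid() const { return filled == k; }

  // O(1) comparison against a reference k-mer word in the same encoding.
  bool matches(uint64_t ref_word) const { return fw == ref_word || rc == ref_word; }

  unsigned k;
  uint64_t mask;
  unsigned rc_shift;
  uint64_t fw = 0;
  uint64_t rc = 0;
  unsigned filled = 0;
};
```

Because both words are plain integers, comparing against reference bits needs no k-mer object and no call into the lookup engine, which is the property the commit message relies on.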

Also extends the streaming_lookup_bench with --lean, --sshash-native,
and --validate modes for comparative benchmarking.

Mirrors the Rust ReadKmerIter design in piscem-rs (commit 9603b3e).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Template lean_read_iterator on `canonical` so it works with both
  canonical and non-canonical sshash indices. Previously hardcoded
  `sshash::streaming_query<dict_t, false>` which crashed on canonical
  dictionaries. Dispatch on `dict.canonical()` at the call site.
- Remove redundant hit_map.clear() and hit_map.reserve() from
  mapping_cache_info::clear() — the lambda already clears before use.
- Replace full map_cache_out.clear() in merge_se_mappings with a
  lightweight field reset (callers already clear before calling).
- Add avalanching_u32_hash for the hit_map (fibonacci hash marked
  is_avalanching) to skip ankerl's wyhash 128-bit multiply on every
  map access.
- Replace per-read atomic increments with local counters, flushed to
  globals at chunk boundaries (~5000 reads). Reduces 3 atomic ops per
  read to near zero.
- Add --point flag to streaming_lookup_bench for independent per-kmer
  point lookup benchmarking.
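
The first item above, dispatching on `dict.canonical()` at the call site, can be sketched as a runtime-to-compile-time switch. The types here (`toy_dict`, `toy_iterator`, `map_reads`) are hypothetical stand-ins, not the real sshash or piscem types; only the `canonical()` accessor name comes from the commit message.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a dictionary that knows whether it is canonical.
struct toy_dict {
  bool is_canonical;
  bool canonical() const { return is_canonical; }
};

// Iterator templated on the canonical flag, so each instantiation
// is compiled for exactly one kind of index.
template <bool canonical>
struct toy_iterator {
  explicit toy_iterator(const toy_dict& d) {
    // A mismatch here is the kind of error a hardcoded `false` would cause.
    assert(d.canonical() == canonical);
  }
  std::string mode() const { return canonical ? "canonical" : "non-canonical"; }
};

template <bool canonical>
std::string map_reads(const toy_dict& d) {
  toy_iterator<canonical> it(d);
  return it.mode();
}

// Call-site dispatch: one runtime branch selects the template instantiation.
inline std::string map_reads_dispatch(const toy_dict& d) {
  return d.canonical() ? map_reads<true>(d) : map_reads<false>(d);
}
```

The branch is taken once per call rather than once per k-mer, so the hot loop inside each instantiation stays free of the runtime check.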

Canonical SE mapping: 159s (was 187s non-canonical), 2453B instructions
(was 2731B). Non-canonical still works and produces identical output.
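
The local-counter change listed above can be sketched as follows. This is an illustration under assumptions, not the PR's code: the counter names, the `process_chunk` function, and the choice of relaxed memory ordering are all hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Plain per-thread counters: incrementing these costs no atomic operation.
struct local_stats {
  uint64_t reads = 0;
  uint64_t mapped = 0;
};

// Shared totals, touched only when a chunk is finished.
struct global_stats {
  std::atomic<uint64_t> reads{0};
  std::atomic<uint64_t> mapped{0};
};

// Add the local counts to the shared totals, then reset the local counts.
inline void flush(local_stats& l, global_stats& g) {
  g.reads.fetch_add(l.reads, std::memory_order_relaxed);
  g.mapped.fetch_add(l.mapped, std::memory_order_relaxed);
  l = local_stats{};
}

// Process one chunk of reads; each entry is nonzero if that read mapped.
inline void process_chunk(const std::vector<int>& chunk, global_stats& g) {
  local_stats l;
  for (int was_mapped : chunk) {
    ++l.reads;
    if (was_mapped) ++l.mapped;
  }
  flush(l, g);  // one atomic add per counter per chunk, not per read
}
```

With chunks of roughly 5000 reads, the number of atomic operations per read drops by about that factor, which is consistent with the "near zero" description in the list.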

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

3 participants