Add alignment position info and read length to RAD format by ygao61 · Pull Request #9 · COMBINE-lab/piscem-cpp

ygao61 · 2025-12-10T21:49:32Z

This PR adds alignment-level position fields and a file-level read length to the RAD format.

Upgrade sshash (ee8513b → afff26d) for improved lookup performance (~10-15% faster mapping).

Adapt piscem to the new sshash API:
- dictionary<Kmer> → dictionary<Kmer, Offsets>
- lookup_advanced() → lookup()
- lookup_result field renames: contig_id → string_id, kmer_id_in_contig → kmer_id_in_string, contig_size replaced by string_begin/string_end, new kmer_offset field
- MurmurHash2_64 (removed upstream) → XXH64
- build.cpp: dict.size() → dict.num_kmers(), iterator check updated for dna_uint_kmer_t return type, perf_test API updated
- build_contig_table.cpp: fix "short refs" → "short seqs" JSON key

Verified: index build --check passes all 96M k-mers, end-to-end position check passes on 245k gencode v49 transcripts, and bulk mapping of 26M reads produces identical results to the pre-upgrade baseline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…benchmark

The formula for computing remaining extendable positions was subtracting k twice (once converting string length to num_kmers, once offsetting by kmer_id + k), underestimating by k-1 positions. This caused ~80% more full dictionary lookups than necessary; fixing it yields a 16% speedup in pure streaming query time and brings the extension rate from 83% to 97%.

Also adds a standalone streaming_lookup_bench executable for measuring single-threaded query throughput independently of mapping I/O.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
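The off-by-(k-1) error described above can be illustrated with a minimal sketch. The exact expressions used in piscem are not shown in this commit message, so the function names and the precise form of the buggy formula below are assumptions; they are simply one reading that reproduces the stated "k-1 positions" underestimate.

```cpp
#include <cassert>
#include <cstdint>

// Number of k-mers in a string (e.g. a unitig) of `len` bases.
inline int64_t num_kmers(int64_t len, int64_t k) { return len - k + 1; }

// Hypothetical corrected formula: how many k-mers follow the k-mer with
// index `kmer_id` inside the same string, i.e. how many times a streaming
// query could extend forward without a full dictionary lookup.
inline int64_t remaining_extensions(int64_t len, int64_t k, int64_t kmer_id) {
    assert(kmer_id >= 0 && kmer_id < num_kmers(len, k));
    return num_kmers(len, k) - 1 - kmer_id;
}

// Hypothetical buggy formula: k is subtracted once when converting the
// length to a k-mer count, and again by offsetting with kmer_id + k.
inline int64_t remaining_extensions_buggy(int64_t len, int64_t k, int64_t kmer_id) {
    return num_kmers(len, k) - (kmer_id + k);
}
```

For a string of 100 bases with k = 31, the first k-mer has 69 successors, while the doubly-subtracted form reports 39, a shortfall of k - 1 = 30. Under this reading, the query would give up on cheap extension 30 positions too early on every string.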

lean_read_iterator maintains its own fw/rc 2-bit k-mer words via O(1) shift+OR operations, independent of the sshash streaming engine. This enables O(1) is_equivalent() comparison against reference SPSS bits without constructing full Kmer objects or invoking the engine. The sshash engine is called lazily (only on actual dictionary lookups) and stays dormant during contig-walking hot paths where only bitwise comparison is needed. advance(n) rebuilds from scratch when n > k (O(k) vs O(n) rolls).

Also extends the streaming_lookup_bench with --lean, --sshash-native, and --validate modes for comparative benchmarking.

Mirrors the Rust ReadKmerIter design in piscem-rs (commit 9603b3e).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
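The rolling fw/rc scheme described above can be sketched as follows. This is not the lean_read_iterator from the PR: the struct name `rolling_kmer`, the helper `advance_window`, and the 2-bit encoding (A=0, C=1, G=2, T=3) are assumptions made for illustration, and the encoding used by sshash may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical 2-bit encoding; N and lowercase bases are not handled here.
inline uint64_t encode_base(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // treated as 'T'
    }
}

// Illustrative stand-in for a lean read k-mer iterator: keeps the forward
// and reverse-complement words of the current k-mer, each updated in O(1).
struct rolling_kmer {
    uint64_t fw = 0, rc = 0, mask = 0;
    unsigned k = 0, rc_shift = 0;

    explicit rolling_kmer(unsigned k_) : k(k_) {
        assert(k >= 1 && k <= 31);  // keeps every shift below 64 bits
        mask = (uint64_t(1) << (2 * k)) - 1;
        rc_shift = 2 * (k - 1);
    }

    // Slide the window one base to the right.
    void push(char c) {
        uint64_t b = encode_base(c);
        fw = ((fw << 2) | b) & mask;             // append base at the low end
        rc = (rc >> 2) | ((3 - b) << rc_shift);  // prepend complement at the high end
    }

    // Rebuild both words from scratch for the k-mer starting at pos: O(k).
    void load(const std::string& read, size_t pos) {
        fw = rc = 0;
        for (unsigned i = 0; i < k; ++i) push(read[pos + i]);
    }

    // O(1) strand-agnostic comparison against a reference k-mer word.
    bool is_equivalent(uint64_t ref_word) const {
        return ref_word == fw || ref_word == rc;
    }
};

// Move the window from pos to pos + n and return the new position.
// Rolls n bases when n <= k; otherwise rebuilds, which costs O(k), not O(n).
// Precondition: pos + n + k <= read.size().
inline size_t advance_window(rolling_kmer& it, const std::string& read,
                             size_t pos, size_t n) {
    if (n > it.k) {
        it.load(read, pos + n);
    } else {
        for (size_t i = 0; i < n; ++i) it.push(read[pos + it.k + i]);
    }
    return pos + n;
}
```

Because both words are plain integers, comparing the read's current k-mer against bits taken from the reference needs no k-mer object and no call into the query engine, which is the property the commit message relies on for the contig-walking path.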

- Template lean_read_iterator on `canonical` so it works with both canonical and non-canonical sshash indices. Previously hardcoded `sshash::streaming_query<dict_t, false>` which crashed on canonical dictionaries. Dispatch on `dict.canonical()` at the call site.
- Remove redundant hit_map.clear() and hit_map.reserve() from mapping_cache_info::clear(); the lambda already clears before use.
- Replace full map_cache_out.clear() in merge_se_mappings with a lightweight field reset (callers already clear before calling).
- Add avalanching_u32_hash for the hit_map (fibonacci hash marked is_avalanching) to skip ankerl's wyhash 128-bit multiply on every map access.
- Replace per-read atomic increments with local counters, flushed to globals at chunk boundaries (~5000 reads). Reduces 3 atomic ops per read to near zero.
- Add --point flag to streaming_lookup_bench for independent per-kmer point lookup benchmarking.

Canonical SE mapping: 159s (was 187s non-canonical), 2453B instructions (was 2731B). Non-canonical still works and produces identical output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
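The "local counters, flushed at chunk boundaries" item in the list above can be sketched as follows. The struct names, counter names, and the placeholder mapping outcome are hypothetical and not taken from piscem; only the pattern (plain per-thread counters, one relaxed fetch_add per counter per chunk) is what the commit message describes.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Shared totals, updated by every worker thread (hypothetical names).
struct global_stats {
    std::atomic<uint64_t> num_reads{0};
    std::atomic<uint64_t> num_mapped{0};
};

// Per-thread counters: plain integers, no synchronization on increment.
struct local_stats {
    uint64_t num_reads = 0;
    uint64_t num_mapped = 0;

    // Publish the accumulated counts with one atomic add per counter,
    // then reset so the next chunk starts from zero.
    void flush_to(global_stats& g) {
        g.num_reads.fetch_add(num_reads, std::memory_order_relaxed);
        g.num_mapped.fetch_add(num_mapped, std::memory_order_relaxed);
        num_reads = 0;
        num_mapped = 0;
    }
};

// Worker loop sketch: count locally and flush once per chunk_size reads.
inline void process_reads(uint64_t n_reads, uint64_t chunk_size, global_stats& g) {
    assert(chunk_size > 0);
    local_stats local;
    for (uint64_t i = 0; i < n_reads; ++i) {
        ++local.num_reads;
        if (i % 2 == 0) ++local.num_mapped;  // placeholder for the real mapping result
        if (local.num_reads == chunk_size) local.flush_to(g);
    }
    local.flush_to(g);  // publish the final, possibly partial, chunk
}
```

With a chunk size near 5000, each thread performs a handful of atomic operations per chunk instead of several per read, so contention on the shared cache lines becomes negligible. Relaxed ordering is used here on the assumption that the totals are only read after the workers have been joined.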

Add alignment-level position info to RAD output and record read length at file level

f8080d1

rob-p changed the base branch from main to dev December 11, 2025 02:15

rob-p and others added 26 commits December 11, 2025 00:21

add changes to make recording positions optional; not yet tested

3c34ea9

optional positional info

586fe29

specific atacseq bug on program terminate

fdc4ecf

update to fixed libradicl

17f1992

don't fill strings with bed info when we aren't writing bed output

02b385d

switch from zlib-cloudflare to zlib-ng

73e478c

Initial plan

95b25be

Add memory pre-allocation optimizations to mapping_cache_info

86120ff

Co-authored-by: rob-p <361470+rob-p@users.noreply.github.com>

Improve code maintainability by using member variables for reserve sizes

248a44e

Co-authored-by: rob-p <361470+rob-p@users.noreply.github.com>

FQFeeder updates

42abde7

Merge branch 'dev-pos' into copilot/optimize-memory-allocation

668e8d8

Phase 1: Replace phmap::flat_hash_map with ankerl::unordered_dense::map in mapping hot path

133aff6

Co-authored-by: rob-p <361470+rob-p@users.noreply.github.com>

Add comprehensive summary of hash map replacement exploration and implementation

c322113

Co-authored-by: rob-p <361470+rob-p@users.noreply.github.com>

Add detailed code changes visualization and finalize documentation

da16ef6

Co-authored-by: rob-p <361470+rob-p@users.noreply.github.com>

Merge pull request #10 from COMBINE-lab/copilot/optimize-memory-allocation

4801ae8

Replace phmap::flat_hash_map with ankerl::unordered_dense in mapping hot path

notes

d5e6219

update unordered dense

de344d0

tweak params

9c7532a

fix missing variable in condition

a66c232

set chunk size in single-read case as well

6344251

use system dependent lib path

6ec5150

fix subtle shifting bug in _has_homopolymer_prefix()

be502a2

ygao61 wants to merge 27 commits into dev from dev-pos

3 participants
