Evaluation of memchunk for high-speed chunking #26

@zzstoatzz

Description

I've been evaluating memchunk as a potential backend for high-speed chunking in RAG pipelines and wanted to share findings relevant to raggy.

The Promise:
memchunk claims up to 1 TB/s throughput using SIMD-optimized Rust, significantly faster than typical Python-based splitters.

The Limitations for raggy:

  1. Byte-based vs. token-based: memchunk strictly limits chunks by byte size, whereas raggy (via tiktoken) chunks by token count. That makes it a non-starter for precise context-window management unless users accept approximate token counts derived from byte size.
  2. Markdown awareness: it splits on delimiters (newlines, periods) but lacks the semantic, structural awareness of MarkdownSplitter (headers, code blocks), which is often critical for documentation RAG.
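To make point 1 concrete, here's a minimal sketch of why a byte budget is only a rough proxy for a token budget. `max_bytes_for_tokens` and the ~4 bytes/token heuristic are assumptions for illustration, not part of the memchunk or tiktoken API:

```python
# Hypothetical helper illustrating the byte-vs-token mismatch.
# A byte-limited chunker (like memchunk) and a token-limited one
# (like tiktoken-based splitters) only agree up to a heuristic ratio.

def max_bytes_for_tokens(token_limit: int, bytes_per_token: float = 4.0) -> int:
    """Approximate a byte budget for a desired token limit (assumed ratio)."""
    return int(token_limit * bytes_per_token)

# The ratio varies wildly by script: plain English vs. CJK text.
ascii_text = "word " * 100   # 500 bytes of ASCII, roughly 100 tokens
cjk_text = "\u6f22" * 100    # 300 bytes of UTF-8, but often ~1 token per char

print(len(ascii_text.encode("utf-8")))  # 500
print(len(cjk_text.encode("utf-8")))    # 300
```

So the same byte budget can map to very different token counts depending on the input, which is exactly why a byte cap can't guarantee a context-window fit.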

Potential Use Case:
It might be valuable as an optional "fast mode" backend for massive datasets where speed matters more than precision, but it doesn't appear to be a drop-in replacement for the default precision-focused splitters.
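A rough sketch of what that opt-in backend switch could look like. Everything here is hypothetical: `fast_chunk` is a stand-in for a memchunk-style byte splitter, and `get_splitter` is not a real raggy API:

```python
# Hypothetical "fast mode" dispatch; fast_chunk is a crude stand-in for a
# memchunk-style byte-budget splitter, not memchunk's actual behavior.
from typing import Callable, List


def fast_chunk(text: str, max_bytes: int) -> List[str]:
    """Split text into chunks of at most max_bytes UTF-8 bytes."""
    data = text.encode("utf-8")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + max_bytes, len(data))
        # Back off so we never cut inside a multi-byte UTF-8 sequence
        # (continuation bytes have the bit pattern 10xxxxxx).
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(data[start:end].decode("utf-8"))
        start = end
    return chunks


def get_splitter(mode: str) -> Callable[[str, int], List[str]]:
    """Select a chunking backend; only the hypothetical 'fast' mode is shown."""
    if mode == "fast":
        return fast_chunk
    raise ValueError(f"unknown mode: {mode}")


print(get_splitter("fast")("abcdef", 4))  # ['abcd', 'ef']
```

Note the UTF-8 boundary back-off: a naive byte slice can split a multi-byte character in half and produce invalid text, which is one of the details a real byte-based backend has to get right.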
