Evaluation of memchunk for high-speed chunking #26

@zzstoatzz

Description

I've been evaluating memchunk as a potential backend for high-speed chunking in RAG pipelines and wanted to share findings relevant to raggy.

The Promise:
memchunk claims up to 1 TB/s throughput using SIMD-optimized Rust, significantly faster than typical Python-based splitters.

The Limitations for raggy:

  1. Byte-based vs. token-based: memchunk strictly limits chunks by byte size, whereas raggy (via tiktoken) chunks by token count. That makes it a non-starter for precise context-window management unless users accept approximate token counts derived from byte size.
  2. Markdown awareness: it splits on delimiters (newlines, periods) but lacks the semantic, structural awareness of MarkdownSplitter (headers, code blocks), which is often critical for documentation RAG.
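To make point 1 concrete, here's a minimal sketch of why a byte budget is only a rough proxy for a token budget. `max_bytes_for_tokens` and the ~4 bytes/token heuristic are assumptions for illustration, not part of the memchunk or tiktoken API:

```python
# Hypothetical helper illustrating the byte-vs-token mismatch.
# A byte-limited chunker (like memchunk) and a token-limited one
# (like tiktoken-based splitters) only agree up to a heuristic ratio.

def max_bytes_for_tokens(token_limit: int, bytes_per_token: float = 4.0) -> int:
    """Approximate a byte budget for a desired token limit (assumed ratio)."""
    return int(token_limit * bytes_per_token)

# The ratio varies wildly by script: plain English vs. CJK text.
ascii_text = "word " * 100   # 500 bytes of ASCII, roughly 100 tokens
cjk_text = "\u6f22" * 100    # 300 bytes of UTF-8, but often ~1 token per char

print(len(ascii_text.encode("utf-8")))  # 500
print(len(cjk_text.encode("utf-8")))    # 300
```

So the same byte budget can map to very different token counts depending on the input, which is exactly why a byte cap can't guarantee a context-window fit.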

Potential Use Case:
It might be valuable as an optional "fast mode" backend for massive datasets where speed matters more than precision, but it doesn't appear to be a drop-in replacement for the default precision-focused splitters.
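A rough sketch of what that opt-in backend switch could look like. Everything here is hypothetical: `fast_chunk` is a stand-in for a memchunk-style byte splitter, and `get_splitter` is not a real raggy API:

```python
# Hypothetical "fast mode" dispatch; fast_chunk is a crude stand-in for a
# memchunk-style byte-budget splitter, not memchunk's actual behavior.
from typing import Callable, List


def fast_chunk(text: str, max_bytes: int) -> List[str]:
    """Split text into chunks of at most max_bytes UTF-8 bytes."""
    data = text.encode("utf-8")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + max_bytes, len(data))
        # Back off so we never cut inside a multi-byte UTF-8 sequence
        # (continuation bytes have the bit pattern 10xxxxxx).
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(data[start:end].decode("utf-8"))
        start = end
    return chunks


def get_splitter(mode: str) -> Callable[[str, int], List[str]]:
    """Select a chunking backend; only the hypothetical 'fast' mode is shown."""
    if mode == "fast":
        return fast_chunk
    raise ValueError(f"unknown mode: {mode}")


print(get_splitter("fast")("abcdef", 4))  # ['abcd', 'ef']
```

Note the UTF-8 boundary back-off: a naive byte slice can split a multi-byte character in half and produce invalid text, which is one of the details a real byte-based backend has to get right.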
