perf: move CPU-bound ingest work off the event loop by bd-mkt · Pull Request #449 · ggozad/haiku.rag

bd-mkt · 2026-06-18T21:36:40Z

perf: move CPU-bound ingest work off the event loop

Summary

The ingester runs worker_count async workers that share a single event loop.
Several steps in the convert → chunk → embed → store pipeline performed
CPU-bound work synchronously on the event-loop thread. Whenever one worker
hit such a call, every other worker's coroutine was frozen for its entire
duration. Under a batch run, workers pile up behind each other's synchronous
bursts, so a store step that should take a few seconds stretched into stalls
of hundreds to over a thousand seconds.

This PR dispatches those calls through asyncio.to_thread, so the work runs in
the thread pool and the event loop stays free to drive the other workers. It
follows the same pattern already used for PDF slicing and the embedded-PDF
attachment scan (the related fix earlier on this branch).
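
As a rough illustration of the dispatch pattern described above, the sketch below runs a stand-in CPU-bound function through `asyncio.to_thread` from several concurrent coroutines and records whether the work ran on a thread other than the event loop's. The names `heavy_digest`, `ingest_one`, and `run_batch` are hypothetical and do not come from the PR.

```python
import asyncio
import hashlib
import threading


def heavy_digest(payload: bytes, rounds: int = 200) -> str:
    """Hypothetical stand-in for CPU-bound work (not code from the PR)."""
    digest = payload
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


async def ingest_one(payload: bytes) -> tuple[str, bool]:
    """Run the heavy call in the default thread pool and report where it ran."""
    loop_thread = threading.get_ident()
    ran_on: list[int] = []

    def work() -> str:
        # Record the thread that actually executes the heavy call.
        ran_on.append(threading.get_ident())
        return heavy_digest(payload)

    # The event loop is free to drive other coroutines while this awaits.
    result = await asyncio.to_thread(work)
    return result, ran_on[0] != loop_thread


async def run_batch(worker_count: int = 4) -> list[tuple[str, bool]]:
    """Simulate several workers sharing one event loop."""
    payloads = [bytes([i]) * 1024 for i in range(worker_count)]
    return await asyncio.gather(*(ingest_one(p) for p in payloads))


if __name__ == "__main__":
    for digest, off_loop in asyncio.run(run_batch()):
        print(digest[:12], "off-loop" if off_loop else "on-loop")
```

Pure-Python work in a worker thread still competes for the GIL, so this pattern is mainly about keeping the loop responsive rather than about parallel speedup.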

Why these calls in particular

This was scoped against a production batch configuration: filesystem source,
docling-serve for both conversion and chunking, split_pages: 20,
generate_page_images on (default) at images_scale: 1.0. That config inlines
full-resolution page rasters as base64 into the DoclingDocument, which makes
the serialization and parse steps below genuinely heavy — proportional to total
document size, not just structure.

Changes

| File | Call | Fix |
| --- | --- | --- |
| `chunkers/docling_serve.py` | `document.model_dump_json()` of an image-laden document before the chunk request | `await asyncio.to_thread(...)` |
| `converters/docling_serve.py` | `_parse_zip_to_docling`: zip decompress, per-image base64 re-encode, `DoclingDocument.model_validate` | `await asyncio.to_thread(...)` |
| `converters/pdf_split.py` | `DoclingDocument.concatenate(slices)` merging inlined-image slice docs | `await asyncio.to_thread(...)` |
| `ingester/sources/fs.py` | `path.read_bytes()` + `hashlib.md5` over the whole file | extracted `_read_body` (stat + size-check + read + hash), called via `await asyncio.to_thread(...)` |

The docling-serve HTTP submit/poll was already async (httpx.AsyncClient), and
OpenAI embeddings are network-bound, so those paths were left unchanged.
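
The `ingester/sources/fs.py` change groups stat, size check, read, and hash into one helper so that a single thread hop covers all the blocking work. A minimal sketch of that shape follows; the real `_read_body` signature is not shown in this PR description, so `read_body_sketch`, `fetch_sketch`, and the `max_bytes` parameter are assumptions made for illustration.

```python
import asyncio
import hashlib
import tempfile
from pathlib import Path


def read_body_sketch(path: Path, max_bytes: int) -> tuple[bytes, str]:
    """Hypothetical stand-in for _read_body: stat, size-check, read, hash."""
    size = path.stat().st_size
    if size > max_bytes:
        raise ValueError(f"{path} is {size} bytes, over the {max_bytes}-byte limit")
    body = path.read_bytes()
    return body, hashlib.md5(body).hexdigest()


async def fetch_sketch(path: Path, max_bytes: int = 10_000_000) -> tuple[bytes, str]:
    # One worker-thread hop covers every blocking filesystem and hashing call.
    return await asyncio.to_thread(read_body_sketch, path, max_bytes)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.txt"
        sample.write_bytes(b"hello")
        body, digest = asyncio.run(fetch_sketch(sample))
        print(len(body), digest)
```

Bundling the steps into one synchronous function avoids paying the thread-dispatch overhead once per step.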

Tests

Each fix has a thread-identity test that captures threading.current_thread()
where the work runs and asserts it is not the main thread — so a regression
back to a synchronous call fails with a clear message, independent of timing or
document size:

tests/test_chunker.py::TestDoclingServeChunker::test_chunk_serializes_document_off_event_loop_thread
tests/test_converters.py::test_parse_zip_runs_off_event_loop_thread
tests/test_pdf_split.py::test_concatenate_runs_off_event_loop_thread
tests/ingester/test_fs_source.py::test_fs_source_fetch_reads_off_event_loop_thread
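
The idea behind these tests can be sketched without the project's fixtures. The example below uses a hypothetical `FakeDocument` whose `model_dump_json` records the calling thread, and compares it with the thread running the event loop (the comparison named in the later commit "compare off-loop work against actual event-loop thread"). The helper names are assumptions, not the test code in this PR.

```python
import asyncio
import threading


class FakeDocument:
    """Hypothetical document whose serializer records the calling thread."""

    def __init__(self) -> None:
        self.serialized_on = None

    def model_dump_json(self) -> str:
        self.serialized_on = threading.get_ident()
        return "{}"


async def chunk_off_loop(document: FakeDocument) -> str:
    # Fixed shape: serialization is dispatched to a worker thread.
    return await asyncio.to_thread(document.model_dump_json)


async def chunk_on_loop(document: FakeDocument) -> str:
    # Regressed shape: serialization runs on the event-loop thread.
    return document.model_dump_json()


async def serialized_off_loop(chunk_fn) -> bool:
    """Return True if chunk_fn serialized on a thread other than the loop's."""
    loop_thread = threading.get_ident()
    document = FakeDocument()
    await chunk_fn(document)
    return document.serialized_on != loop_thread


if __name__ == "__main__":
    print(asyncio.run(serialized_off_loop(chunk_off_loop)))
    print(asyncio.run(serialized_off_loop(chunk_on_loop)))
```

Because the check is on thread identity rather than elapsed time, it does not depend on document size or machine speed.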

The first three pass in CI. The FS test passes wherever FSSource._resolve_within_root
resolves file:// URIs correctly (Linux/CI); on Windows it is blocked by a
pre-existing path-resolution issue that fails every fetch-path FS test on
main, unrelated to this change. The FS code change was verified directly:
_read_body runs on a worker thread with correct body and md5.

ruff check and ruff format --check pass on all changed files.

Effect

For the target config, the dominant blocker — serializing full-resolution
base64 page rasters for the whole document on every document — and the per-slice
zip parse now run off the loop. A single large PDF no longer freezes the other
workers during convert/chunk. Combined with the earlier attachment-scan fix, the
convert → chunk path no longer blocks the event loop.

Out of scope

The store step's _write_lock (a single asyncio.Lock held across the
sequential table writes in client/documents.py and client/__init__.py)
still serializes all workers at commit time. That is a separate
lock-granularity issue, not addressed here.
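
To make the remaining serialization concrete, here is a small sketch (not the project's code) of several workers sharing one `asyncio.Lock` that is held across a sequence of awaited writes. The `store` and `run_workers` names and the three-step write loop are hypothetical.

```python
import asyncio


async def store(worker: int, write_lock: asyncio.Lock, log: list[str]) -> None:
    async with write_lock:
        log.append(f"start {worker}")
        for _ in range(3):
            # Hypothetical sequential table write; yields to the loop each time.
            await asyncio.sleep(0)
        log.append(f"end {worker}")


async def run_workers(count: int = 3) -> list[str]:
    write_lock = asyncio.Lock()
    log: list[str] = []
    await asyncio.gather(*(store(i, write_lock, log) for i in range(count)))
    return log


if __name__ == "__main__":
    for entry in asyncio.run(run_workers()):
        print(entry)
```

Even though each write yields to the loop, no second worker can start its writes until the first releases the lock, so the start/end entries never interleave.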

codecov · 2026-06-22T07:25:16Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.96%. Comparing base (faa97f8) to head (ad3b111).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #449   +/-   ##
=======================================
  Coverage   95.96%   95.96%           
=======================================
  Files         124      124           
  Lines        7132     7140    +8     
=======================================
+ Hits         6844     6852    +8     
  Misses        288      288


ggozad · 2026-06-22T07:30:52Z

Moved stored-document preparation off the event loop. The async client paths now use a single helper
that runs markdown export plus Document.set_docling() in a worker thread, covering the remaining size-proportional Docling serialization/compression step for image-heavy documents.

ggozad · 2026-06-22T07:38:59Z

Moved fetched-body temp-file writes off the event loop for non-FS sources.

bd-mkt and others added 2 commits June 18, 2026 16:35

move cpu bound actions off of main loop

fe954a0

compare off-loop work against actual event-loop thread, fix ty

c6514c9

prepare stored Docling blobs off the event loop

183d595

write fetched bodies off the event loop

d3a1011

ggozad force-pushed the bd_concurrency2 branch from d3a1011 to b70a9cd Compare June 22, 2026 07:47

test: clarify off-loop thread assertions

ad3b111

ggozad force-pushed the bd_concurrency2 branch from b70a9cd to ad3b111 Compare June 22, 2026 07:52

ggozad merged commit 23f5b3c into ggozad:main Jun 22, 2026
5 checks passed

ggozad added a commit that referenced this pull request Jun 22, 2026

cl for #449

6f2ab09

ggozad merged 5 commits into ggozad:main from bd-mkt:bd_concurrency2
