
perf: move CPU-bound ingest work off the event loop#449

Merged
ggozad merged 5 commits into
ggozad:mainfrom
bd-mkt:bd_concurrency2
Jun 22, 2026

@bd-mkt bd-mkt commented Jun 18, 2026

Contributor

perf: move CPU-bound ingest work off the event loop

Summary

The ingester runs worker_count async workers that share a single event loop.
Several steps in the convert → chunk → embed → store pipeline performed
CPU-bound work synchronously on the event-loop thread. Whenever one worker
hit such a call, every other worker's coroutine was frozen for its entire
duration. Under a batch run the workers pile up behind each other's
synchronous bursts, so a store step that should take a few seconds stretched
into stalls of hundreds to over a thousand seconds.

This PR dispatches those calls through asyncio.to_thread, so the work runs in
the thread pool and the event loop stays free to drive the other workers. It
follows the same pattern already used for PDF slicing and the embedded-PDF
attachment scan (the related fix earlier on this branch).
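
A minimal sketch of the effect, not code from this PR: time.sleep stands in for a synchronous step, and cpu_bound_step, max_heartbeat_gap, and measure are hypothetical names. A heartbeat coroutine records the longest gap between its ticks. Calling the step directly stalls the heartbeat for the whole duration, while asyncio.to_thread lets it keep ticking.

```python
import asyncio
import time


def cpu_bound_step(duration: float = 0.2) -> str:
    """Hypothetical stand-in for a synchronous, heavy call."""
    time.sleep(duration)  # blocks whichever thread calls it
    return "done"


async def max_heartbeat_gap(stop: asyncio.Event, interval: float = 0.01) -> float:
    """Tick every `interval` seconds and return the longest gap between ticks."""
    worst = 0.0
    last = time.perf_counter()
    while not stop.is_set():
        await asyncio.sleep(interval)
        now = time.perf_counter()
        worst = max(worst, now - last)
        last = now
    return worst


async def measure(offload: bool) -> float:
    """Run the step once and report how long the heartbeat was starved."""
    stop = asyncio.Event()
    heartbeat = asyncio.create_task(max_heartbeat_gap(stop))
    await asyncio.sleep(0.05)  # give the heartbeat time to start ticking
    if offload:
        await asyncio.to_thread(cpu_bound_step)  # runs in the default thread pool
    else:
        cpu_bound_step()  # runs on the event-loop thread and freezes it
    stop.set()
    return await heartbeat
```

asyncio.run(measure(offload=False)) reports a gap of at least the step duration, and with offload=True the gap should stay close to the tick interval. How much real CPU-bound code gains depends on how much of it releases the GIL, which this sketch does not model.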

Why these calls in particular

This was scoped against a production batch configuration: filesystem source,
docling-serve for both conversion and chunking, split_pages: 20,
generate_page_images on (default) at images_scale: 1.0. That config inlines
full-resolution page rasters as base64 into the DoclingDocument, which makes
the serialization and parse steps below genuinely heavy — proportional to total
document size, not just structure.
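
To illustrate why the cost tracks total size, here is a rough sketch under stated assumptions: a plain dict stands in for the DoclingDocument, and inlined_payload_size is a hypothetical helper. It only shows that base64 inlining makes the serialized payload grow with image bytes times page count.

```python
import base64
import json


def inlined_payload_size(raw_image_bytes: int, pages: int) -> int:
    """Length of the JSON for a document-like dict with one base64 image per page.

    The dict layout is an assumption for illustration, not the real schema.
    """
    # base64 emits 4 output characters for every 3 input bytes (padded up).
    page_image = base64.b64encode(bytes(raw_image_bytes)).decode("ascii")
    doc = {"pages": [{"image": page_image} for _ in range(pages)]}
    return len(json.dumps(doc))
```

Doubling the page count roughly doubles the string that has to be built, regardless of how simple the document structure is.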

Changes

| File | Call | Fix |
|---|---|---|
| chunkers/docling_serve.py | document.model_dump_json() of an image-laden document before the chunk request | await asyncio.to_thread(...) |
| converters/docling_serve.py | _parse_zip_to_docling: zip decompress, per-image base64 re-encode, DoclingDocument.model_validate | await asyncio.to_thread(...) |
| converters/pdf_split.py | DoclingDocument.concatenate(slices) merging inlined-image slice docs | await asyncio.to_thread(...) |
| ingester/sources/fs.py | path.read_bytes() + hashlib.md5 over the whole file | extracted _read_body (stat + size-check + read + hash), called via await asyncio.to_thread(...) |
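
The shape of the FS change can be sketched as follows. This is an illustration of the extract-then-dispatch pattern, not the actual _read_body: the names read_body and fetch, the signature, the size limit, and the error type are all assumptions.

```python
import asyncio
import hashlib
from pathlib import Path

MAX_BYTES = 50 * 1024 * 1024  # hypothetical limit, not taken from the project


def read_body(path: Path, max_bytes: int = MAX_BYTES) -> tuple[bytes, str]:
    """Stat, size-check, read, and hash a file. Entirely synchronous."""
    size = path.stat().st_size
    if size > max_bytes:
        raise ValueError(f"{path} is {size} bytes, over the {max_bytes}-byte limit")
    body = path.read_bytes()
    digest = hashlib.md5(body, usedforsecurity=False).hexdigest()
    return body, digest


async def fetch(path: Path) -> tuple[bytes, str]:
    """Async entry point: the whole blocking sequence runs in one worker thread."""
    return await asyncio.to_thread(read_body, path)
```

Grouping the four steps into one synchronous function means a single hop to the thread pool per file instead of one per step.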

The docling-serve HTTP submit/poll was already async (httpx.AsyncClient), and
OpenAI embeddings are network-bound, so those paths were left unchanged.

Tests

Each fix has a thread-identity test that captures threading.current_thread()
where the work runs and asserts it is not the main thread — so a regression
back to a synchronous call fails with a clear message, independent of timing or
document size:

  • tests/test_chunker.py::TestDoclingServeChunker::test_chunk_serializes_document_off_event_loop_thread
  • tests/test_converters.py::test_parse_zip_runs_off_event_loop_thread
  • tests/test_pdf_split.py::test_concatenate_runs_off_event_loop_thread
  • tests/ingester/test_fs_source.py::test_fs_source_fetch_reads_off_event_loop_thread
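
A thread-identity test of this kind might look like the sketch below. It is not one of the tests listed above: serialize, chunk, and the injectable serializer parameter are hypothetical, chosen so the example is self-contained.

```python
import asyncio
import threading


def serialize(payload: dict) -> str:
    """Hypothetical synchronous step that should run off the event loop."""
    return repr(payload)


async def chunk(payload: dict, serializer=serialize) -> str:
    """Hypothetical code under test: dispatches the sync step to a thread."""
    return await asyncio.to_thread(serializer, payload)


def test_serializer_runs_off_event_loop_thread() -> None:
    seen: list[threading.Thread] = []

    def recording_serializer(payload: dict) -> str:
        # Record which thread actually executes the synchronous work.
        seen.append(threading.current_thread())
        return serialize(payload)

    result = asyncio.run(chunk({"a": 1}, serializer=recording_serializer))

    assert result == "{'a': 1}"
    assert seen, "serializer was never called"
    assert seen[0] is not threading.main_thread(), (
        "serializer ran on the main thread; expected asyncio.to_thread dispatch"
    )
```

If chunk were changed to call serializer(payload) directly, the last assertion would fail immediately, with no dependence on timing.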

The first three pass in CI. The FS test passes wherever FSSource._resolve_within_root
resolves file:// URIs correctly (Linux/CI); on Windows it is blocked by a
pre-existing path-resolution issue that fails every fetch-path FS test on
main, unrelated to this change. The FS code change was verified directly:
_read_body runs on a worker thread with correct body and md5.

ruff check and ruff format --check pass on all changed files.

Effect

For the target config, the dominant blocker (serializing the whole document,
full-resolution base64 page rasters included, once per document) and the
per-slice zip parse now run off the loop. A single large PDF no longer freezes
the other workers during convert/chunk. Combined with the earlier
attachment-scan fix, the convert → chunk path no longer blocks the event loop.

Out of scope

The store step's _write_lock (a single asyncio.Lock held across the
sequential table writes in client/documents.py and client/__init__.py)
still serializes all workers at commit time. That is a separate
lock-granularity issue, not addressed here.
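
For context, a small sketch of why a single lock held across sequential writes serializes workers. This is not the project's code: store and run_workers are hypothetical, and asyncio.sleep stands in for each table write.

```python
import asyncio
import time


async def store(lock: asyncio.Lock, write_seconds: float = 0.05) -> None:
    """Hold one lock across two sequential (simulated) table writes."""
    async with lock:
        await asyncio.sleep(write_seconds)  # first table write
        await asyncio.sleep(write_seconds)  # second table write


async def run_workers(worker_count: int = 4) -> float:
    """Return the wall time for all workers to finish their store step."""
    lock = asyncio.Lock()
    start = time.perf_counter()
    await asyncio.gather(*(store(lock) for _ in range(worker_count)))
    return time.perf_counter() - start
```

With four workers and 0.1 s of writes each, the total comes out near 0.4 s rather than 0.1 s, because only one worker can be inside the lock at a time even though every write is awaited.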

@codecov

codecov Bot commented Jun 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.96%. Comparing base (faa97f8) to head (ad3b111).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #449   +/-   ##
=======================================
  Coverage   95.96%   95.96%           
=======================================
  Files         124      124           
  Lines        7132     7140    +8     
=======================================
+ Hits         6844     6852    +8     
  Misses        288      288           

ggozad commented Jun 22, 2026

Owner

Moved stored-document preparation off the event loop. The async client paths now use a single helper
that runs markdown export plus Document.set_docling() in a worker thread, covering the remaining size-proportional Docling serialization/compression step for image-heavy documents.

ggozad commented Jun 22, 2026

Owner

Moved fetched-body temp-file writes off the event loop for non-FS sources.
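
A sketch of what dispatching a temp-file write to a thread can look like. The helper names write_temp_file and stage_body are hypothetical and the real implementation may differ.

```python
import asyncio
import os
import tempfile
from pathlib import Path


def write_temp_file(body: bytes, suffix: str = "") -> Path:
    """Synchronously write `body` to a new temp file and return its path."""
    fd, name = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as handle:  # closes the descriptor on exit
        handle.write(body)
    return Path(name)


async def stage_body(body: bytes, suffix: str = "") -> Path:
    """Async wrapper: the create-and-write sequence runs in a worker thread."""
    return await asyncio.to_thread(write_temp_file, body, suffix)
```

The caller remains responsible for deleting the file once conversion is done.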

@ggozad ggozad force-pushed the bd_concurrency2 branch from d3a1011 to b70a9cd Compare June 22, 2026 07:47
@ggozad ggozad force-pushed the bd_concurrency2 branch from b70a9cd to ad3b111 Compare June 22, 2026 07:52
@ggozad ggozad merged commit 23f5b3c into ggozad:main Jun 22, 2026
5 checks passed
ggozad added a commit that referenced this pull request Jun 22, 2026

2 participants