perf: cache PDFium document across page operations #89

Open
AdemBoukhris457 wants to merge 5 commits into run-llama:main from AdemBoukhris457:perf/pdfium-document-caching

@AdemBoukhris457
Contributor

Summary

  • Add loadDocument()/closeDocument() lifecycle to PdfiumRenderer so the PDF is loaded once and reused across all per-page calls instead of re-parsing on every invocation.
  • Wire up caching in PdfJsEngine (extractPage + renderPageImage) and LiteParse.screenshot().
  • Maintain backward compatibility: methods fall back to per-call load/destroy when no document is pre-loaded.
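The lifecycle described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `CachingRenderer`, `PdfDocument`, `Loader`, and `pageLabel()` are hypothetical stand-ins for PdfiumRenderer, the PDFium document handle, the file-read-plus-parse step, and a per-page method such as renderPageToBuffer(). Only the names loadDocument(), closeDocument(), and cachedDocument come from the description.

```typescript
// Hypothetical stand-in for a PDFium document handle.
interface PdfDocument {
  pageCount: number;
  destroy(): void;
}

// Hypothetical stand-in for "read the file and parse it".
type Loader = (path: string) => Promise<PdfDocument>;

class CachingRenderer {
  private cachedDocument: PdfDocument | null = null;
  private readonly load: Loader;

  constructor(load: Loader) {
    this.load = load;
  }

  // Load once up front; later per-page calls reuse this handle.
  async loadDocument(path: string): Promise<void> {
    this.closeDocument(); // drop any previously cached handle first
    this.cachedDocument = await this.load(path);
  }

  // Release the cached handle, if there is one.
  closeDocument(): void {
    this.cachedDocument?.destroy();
    this.cachedDocument = null;
  }

  // Per-page operation. Uses the cached document when present (this sketch
  // assumes it was loaded from the same path); otherwise falls back to
  // load + destroy for this single call.
  async pageLabel(path: string, pageNumber: number): Promise<string> {
    const cached = this.cachedDocument;
    const doc = cached ?? (await this.load(path));
    try {
      return `page ${pageNumber} of ${doc.pageCount}`;
    } finally {
      if (!cached) doc.destroy(); // only destroy what this call loaded
    }
  }
}
```

A caller that processes many pages would call loadDocument() once before its loop and closeDocument() afterwards, ideally in a `finally` block so the handle is released even if a page fails. Callers that never call loadDocument() keep the old per-call behavior.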

Problem

Both renderPageToBuffer() and extractImageBounds() in PdfiumRenderer perform fs.readFile() + loadDocument() + destroy() on every call. For a 100-page document this means 100 to 200 full load/parse/destroy cycles (one per page for each of the two methods that runs), with redundant disk I/O, CPU usage, and memory churn.
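The arithmetic behind those figures can be checked with a small counting model. This is only an illustration: `countLoads` and its `opsPerPage` parameter are hypothetical, with `opsPerPage` assumed to be 1 when only one of the two methods runs per page and 2 when both do.

```typescript
// Counts how many times the document is loaded while processing `pages` pages.
// cached = false models the per-call pattern (every operation loads the file);
// cached = true models a single load up front that all operations reuse.
function countLoads(pages: number, opsPerPage: number, cached: boolean): number {
  let loads = cached ? 1 : 0;
  for (let page = 1; page <= pages; page++) {
    for (let op = 0; op < opsPerPage; op++) {
      if (!cached) loads++; // per-call pattern: read + parse + destroy each time
    }
  }
  return loads;
}

console.log(countLoads(100, 1, false)); // 100
console.log(countLoads(100, 2, false)); // 200
console.log(countLoads(100, 2, true));  // 1
```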

Changes

  • src/engines/pdf/pdfium-renderer.ts: Add a cachedDocument field, loadDocument(), closeDocument(), and a private getOrLoadDocument() helper. Refactor renderPageToBuffer() and extractImageBounds() to reuse the cached document. Update close() to clean up the cached document.
  • src/engines/pdf/pdfjs.ts: Call loadDocument() on the PdfiumRenderer after creation in extractPage() and renderPageImage().
  • src/core/parser.ts: Call renderer.loadDocument() in screenshot() before the page loop.
  • src/engines/pdf/pdfium-renderer.test.ts: Rewrite mocks for repeated calls, add 4 tests for caching behavior.
  • src/engines/pdf/pdfjs.test.ts, src/core/parser.test.ts: Add loadDocument/closeDocument to PdfiumRenderer mocks.

Closes #88

Add loadDocument()/closeDocument() to PdfiumRenderer so the PDF is loaded once and reused for all page renders and image-bound queries, instead of re-reading and re-parsing on every call. Wire up caching in PdfJsEngine and LiteParse.screenshot(). Methods fall back to per-call loading when no document is pre-loaded.
Development

Successfully merging this pull request may close these issues.

[Bug] PDFium re-loads entire PDF from disk for every single page operation