feat: page-memory native hierarchy — concurrent scope processing, asset type-safety, and cross-page table merge#183

Merged
EricNGOntos merged 16 commits into main from feat/wuchengke/page-memory-native-hierarchy on Jun 29, 2026

@EricNGOntos

Contributor

Summary

This PR merges the Page Memory Native Hierarchy feature branch into main. It introduces a production-grade, page-level document understanding pipeline for the Knowhere worker, built from the interconnected capabilities listed below.


Key Changes

1. Page-Based Document Parsing & Hierarchy

  • Implemented a page-memory pipeline that builds a native heading hierarchy directly from page-level signals (VLM tags, skeleton extraction, font clustering) rather than relying solely on MinerU/Markdown heading detection.
  • Added support for the page_memory track in the PDF and PPTX parsing paths, controlled via the PARSE_TRACK environment variable.
  • Implemented intermediate node chain collapse (collapse_single_child_chains) to clean up degenerate single-child heading chains in the final hierarchy.
  • Added page-count limits to prevent runaway processing on oversized documents.
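
A minimal sketch of the chain-collapse idea, for illustration only. The `Node` class, its `has_body` flag, and the rule "splice out a heading that has no body text and exactly one child" are assumptions; the actual `collapse_single_child_chains` in this PR may use different data structures and criteria.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical heading node; not the PR's real hierarchy type."""
    title: str
    has_body: bool = False
    children: list[Node] = field(default_factory=list)


def collapse_single_child_chains(node: Node) -> Node:
    """Replace body-less, single-child descendants with their only child.

    The node passed in (typically the root) is always kept.
    """
    collapsed: list[Node] = []
    for child in node.children:
        # Collapse the subtree first, then decide whether the child itself
        # is a degenerate link in a chain.
        child = collapse_single_child_chains(child)
        while not child.has_body and len(child.children) == 1:
            child = child.children[0]
        collapsed.append(child)
    node.children = collapsed
    return node
```

In this sketch a heading that carries its own body text is kept even when it has a single child, so no content is dropped by the collapse.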

2. Concurrent Scope Processing (memory_service.py)

  • Added GeventPool-based concurrent scope processing controlled by PAGE_MEMORY_SCOPE_CONCURRENCY env var (default: 4 parallel scopes).
  • Asset page budget is allocated proportionally across concurrent scopes via _allocate_asset_pages().
  • Fixed a Pyright type error: greenlet .value list now filters None entries (if g.value is not None) to satisfy the list[_ScopeRunResult] declared type, while raise_error=True ensures no silent failures at runtime.
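
One way a proportional page budget could be split across scopes is the largest-remainder method, sketched below. This is an assumption about the approach, not the body of `_allocate_asset_pages()`; the function name, parameters, and the cap at each scope's page count are all hypothetical.

```python
def allocate_asset_pages(total_budget: int, scope_page_counts: list[int]) -> list[int]:
    """Split an integer page budget across scopes in proportion to their size.

    Hypothetical sketch: uses the largest-remainder method so the shares
    always sum to the (capped) budget and no scope gets more pages than it has.
    """
    total_pages = sum(scope_page_counts)
    if total_budget <= 0 or total_pages == 0:
        return [0] * len(scope_page_counts)

    # Never hand out more pages than exist across all scopes.
    budget = min(total_budget, total_pages)

    # Integer floor share and remainder for each scope.
    shares = [budget * n // total_pages for n in scope_page_counts]
    remainders = [budget * n % total_pages for n in scope_page_counts]

    # Give the leftover pages to the scopes with the largest remainders.
    leftover = budget - sum(shares)
    order = sorted(range(len(shares)), key=lambda i: -remainders[i])
    for i in order[:leftover]:
        shares[i] += 1
    return shares
```

Integer arithmetic avoids floating-point rounding surprises, and ties are broken in favor of earlier scopes because `sorted` is stable.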

3. Cross-Page Table Continuity (page_assets.py)

  • Implemented LLM-based cross-page table merge: _llm_judge_table_continuity() uses the model to decide if the tail of one page's table and the head of the next page's table form a continuation.
  • _merge_table_html_files() appends head-page rows into the tail-page HTML file and removes the now-merged head asset.
  • Fixed Pyright type errors: soup.find() returns PageElement | NavigableString | None; added isinstance(x, Tag) guards before calling .find_all() and .append() to satisfy the type checker without changing runtime behavior.
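
The `isinstance(x, Tag)` narrowing described above can be sketched as follows. This is an illustrative stand-in that works on HTML strings rather than files; `merge_table_rows` and its parameters are hypothetical and are not the PR's `_merge_table_html_files()`.

```python
from bs4 import BeautifulSoup, Tag


def merge_table_rows(tail_html: str, head_html: str) -> str:
    """Append the rows of the head-page table to the tail-page table.

    Hypothetical helper: returns tail_html unchanged if either input
    has no <table> element.
    """
    tail_soup = BeautifulSoup(tail_html, "html.parser")
    head_soup = BeautifulSoup(head_html, "html.parser")

    tail_table = tail_soup.find("table")
    head_table = head_soup.find("table")

    # find() may return None or a non-Tag element, so narrow to Tag
    # before calling Tag-only methods such as find_all() and append().
    if not isinstance(tail_table, Tag) or not isinstance(head_table, Tag):
        return tail_html

    # Prefer <tbody> as the append target when the tail table has one.
    target = tail_table.find("tbody")
    if not isinstance(target, Tag):
        target = tail_table

    # Appending a Tag that lives in another tree moves it into this one.
    for row in head_table.find_all("tr"):
        target.append(row)

    return str(tail_soup)
```

The guards only narrow the static type; when both inputs contain a table, the runtime path is the same as it would be without them.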

4. Enhanced Asset Metadata

  • Enriched image and table asset metadata with LLM-extracted entities and titles via concurrent page tagging.
  • Refactored section summary publication to run at initialization time instead of being deferred.

5. Retrieval Integration

  • New chunk_types set replaces the legacy data_type integer field across retrieval services for more flexible multi-type filtering.
  • effective_parse_track handling added to document ingestion to enable proper routing of page-memory documents through the retrieval pipeline.
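
To illustrate why a set of chunk types is more flexible than a single integer code, here is a hedged sketch. The two integer codes come from a commit note in this PR (page = 7, text+image+table = 8); the string names, the mapping, and both helper functions are hypothetical and do not reflect the actual retrieval service API.

```python
from __future__ import annotations

# Hypothetical mapping from legacy integer codes to chunk-type sets.
# Only codes 7 and 8 are mentioned in this PR; the string names are assumed.
LEGACY_DATA_TYPE_TO_CHUNK_TYPES: dict[int, frozenset[str]] = {
    7: frozenset({"page"}),
    8: frozenset({"text", "image", "table"}),
}


def chunk_types_from_legacy(data_type: int) -> frozenset[str]:
    """Translate a legacy data_type code into a set of chunk types."""
    try:
        return LEGACY_DATA_TYPE_TO_CHUNK_TYPES[data_type]
    except KeyError:
        raise ValueError(f"unknown legacy data_type: {data_type}") from None


def filter_chunks(chunks: list[dict], chunk_types: frozenset[str] | None) -> list[dict]:
    """Keep chunks whose 'type' is in chunk_types; None means no filtering."""
    if chunk_types is None:
        return list(chunks)
    return [c for c in chunks if c.get("type") in chunk_types]
```

With a set, a new combination such as image plus table needs no new integer code; the caller passes the set it wants.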

6. CI/CD Fixes (this commit)

  • Removed 3 unused imports (ruff F401): pandas in docx/parser.py, collapse_page_ranges in fine_hierarchy.py, tempfile in test_page_memory_cross_page_table_merge.py.
  • Renamed the unused variable ctx to _ctx in debug_page_memory.py (ruff F841).
  • All Pyright type errors resolved (0 errors, 0 warnings after make check).

Testing

  • make check passes: ruff lint ✅ + pyright typecheck ✅ (0 errors, 0 warnings)
  • Contract test test_page_memory_cross_page_table_merge covers the table merge logic.
  • Debug script debug_page_memory.py provides end-to-end single-document tracing for manual QA.

Breaking Changes

None. The page-memory pipeline is opt-in via PARSE_TRACK=page_memory and does not affect the default MinerU pipeline.

…hared page content and VLM-based node summarization
…p shard plan option

- Added `skip_shard_plan` parameter to `run_lightweight_anatomy` and `profile_document` to allow bypassing LLM shard decisions during profiling.
- Updated `memory_service` to utilize `skip_shard_plan` for page memory processing.
- Introduced asset annotation functionality for visualizing extracted assets on pages.
- Refactored page section mapping and node assembly to support internal section body pages.
- Improved environment configuration for Java dependencies in Docker setup.
…nd effective parse track handling

- Updated `data_type` in `RetrievalQueryRequest` to support new chunk types: page (7) and text+image+table (8).
- Refactored document ingestion service to apply effective parse track based on file extension and user-defined settings.
- Introduced new validation for parse track handling in oversized PDF processing.
- Enhanced tests to cover new chunk types and parse track logic, ensuring robust validation and functionality.
…t handling

- Updated document parser to utilize the new summary engine for images, tables, and text, improving the extraction of titles, summaries, and entities.
- Refactored image and table handling to streamline the summarization process, replacing legacy methods with unified calls to the summary engine.
- Enhanced markdown parsing to support new entity and asset title fields, ensuring comprehensive data capture during document processing.
- Introduced serialization for typed entities to maintain structured data integrity across parsed documents.
…set metadata with LLM-extracted entities and titles
…t limits, and refactor section summary publication to initialization.
…implement a text-only tagging mode for page memory processing.
- Remove unused pandas import in docx/parser.py (F401)
- Remove unused collapse_page_ranges import in fine_hierarchy.py (F401)
- Rename unused ctx variable to _ctx in debug_page_memory.py (F841)
- Remove unused tempfile import in test_page_memory_cross_page_table_merge.py (F401)
- Fix greenlet value type: filter None from scope_results list in memory_service.py (reportAssignmentType)
- Fix BeautifulSoup Tag narrowing in page_assets.py: use isinstance(x, Tag) guards for find_all/append calls (reportAttributeAccessIssue, reportOptionalMemberAccess)
@EricNGOntos EricNGOntos added the page-memory Page memory and page-based parsing features label Jun 29, 2026
@EricNGOntos EricNGOntos self-assigned this Jun 29, 2026
Comment on lines +462 to +463
"- Judge VLM by: does the red box tightly enclose the table/chart "
"(incl. caption, excl. body text)? Compare against the green reference.",
try:
if path.exists():
shutil.rmtree(path)
except Exception:
else:
try:
(root / "assets.json").unlink()
except FileNotFoundError:
if head_asset.image_path:
try:
Path(head_asset.image_path).unlink(missing_ok=True)
except Exception:
prominence = None
try:
prominence = float(item.get("prominence", 0.5))
except (TypeError, ValueError):
logger.warning("[summary] LLM call failed for {}: {}", usage_task, exc)
if budget is not None:
budget.refund(budget_pool, est=est, stage=budget_stage)
budget = None
asset_title_hint: str = ...,
prompt_task: str | None = ...,
prompt_paras: dict[str, Any] | None = ...,
) -> AssetSummary: ...
asset_title_hint: str = ...,
prompt_task: str | None = ...,
prompt_paras: dict[str, Any] | None = ...,
) -> BodySummary: ...
…lure

The feature branch introduced a migration with revision ID 'a1b2c3d4e5f6'
(replace_data_type_with_chunk_types), which collided with an existing main
branch migration sharing the same ID (add_qstash_tracking_columns). Alembic
detected this as a cycle and blocked all migration tests.

Fix: Assigned a unique revision ID 'f0d85d209e68' to the chunk_types
migration and rebased its down_revision onto 'f9d0e1f2a3b4'
(add_document_metadata_to_documents), which is the current main head.
The Alembic graph now has a single clean head with no cycles.
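
A collision like this can be caught before merge with a small check over the migration sources. The sketch below is an assumption about how such a check might look, not part of this PR; `find_duplicate_revisions` is a hypothetical helper that takes a mapping of file name to file text.

```python
import re
from collections import defaultdict

# Matches `revision = "abc"` and `revision: str = "abc"` at the start of a
# line, so `down_revision = ...` is not picked up.
_REVISION_RE = re.compile(
    r"""^revision(?:\s*:\s*[^=\n]+?)?\s*=\s*['"]([0-9A-Za-z_]+)['"]""",
    re.MULTILINE,
)


def find_duplicate_revisions(sources: dict[str, str]) -> dict[str, list[str]]:
    """Return {revision_id: [file names]} for IDs declared by more than one file."""
    files_by_revision: dict[str, list[str]] = defaultdict(list)
    for filename, text in sources.items():
        match = _REVISION_RE.search(text)
        if match:
            files_by_revision[match.group(1)].append(filename)
    return {
        revision: sorted(files)
        for revision, files in files_by_revision.items()
        if len(files) > 1
    }
```

In practice the mapping could be built by reading every `*.py` file in the migrations directory, and a non-empty result could fail a CI step.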
- test_agentic_evidence_renderer_contract.py: Update expected string format to match 'Pages 225-226' instead of 'Pages: 225, 226'.
- test_page_memory_cross_page_table_merge.py: Remove unused 'budget' kwarg from merge_cross_page_tables() test calls.
- test_doc_profile_anatomy_contract.py: Mock 'PDF_PROFILE_TOC_ENABLED' = True, since the feature branch relies on it to exercise the TOC profile attempt logic.
@EricNGOntos EricNGOntos merged commit 43a54ef into main Jun 29, 2026
6 checks passed
@EricNGOntos EricNGOntos deleted the feat/wuchengke/page-memory-native-hierarchy branch June 29, 2026 18:26
