feat: page-memory native hierarchy — concurrent scope processing, asset type-safety, and cross-page table merge by EricNGOntos · Pull Request #183 · Ontos-AI/knowhere

EricNGOntos · 2026-06-29T18:06:12Z

Summary

This PR merges the Page Memory Native Hierarchy feature branch into main. It adds a page-level document understanding pipeline to the Knowhere worker, covering native heading hierarchy construction, concurrent scope processing, cross-page table merging, asset metadata enrichment, and retrieval integration.

Key Changes

1. Page-Based Document Parsing & Hierarchy

Implemented a page-memory pipeline that builds a native heading hierarchy directly from page-level signals (VLM tags, skeleton extraction, font clustering) rather than relying solely on MinerU/Markdown heading detection.
Added support for page_memory track in PDF and PPTX parsing paths, controlled via the PARSE_TRACK environment variable.
Implemented intermediate node chain collapse (collapse_single_child_chains) to clean up degenerate single-child heading chains in the final hierarchy.
Added page-count limits to prevent runaway processing on oversized documents.
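
The chain collapse described above can be illustrated with a minimal sketch. The `Node` shape, the `collapse_chains_sketch` name, and the rule "skip an intermediate node that owns no pages and has exactly one child" are assumptions made for illustration; the actual `collapse_single_child_chains` in this PR may use different criteria.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical heading node; not the PR's real data structure."""
    title: str
    pages: list[int] = field(default_factory=list)  # pages whose body belongs to this node
    children: list[Node] = field(default_factory=list)


def _collapse_child(node: Node) -> Node:
    # Assumed rule: an intermediate node with no pages of its own and exactly
    # one child adds no information, so we skip straight to that child.
    while len(node.children) == 1 and not node.pages:
        node = node.children[0]
    node.children = [_collapse_child(child) for child in node.children]
    return node


def collapse_chains_sketch(root: Node) -> Node:
    """Collapse degenerate single-child chains below the root, in place."""
    root.children = [_collapse_child(child) for child in root.children]
    return root


root = Node("root", children=[
    Node("A", children=[Node("B", children=[Node("C", pages=[3])])]),
    Node("D", pages=[1], children=[Node("E", pages=[2])]),
])
collapsed = collapse_chains_sketch(root)
print([child.title for child in collapsed.children])
```

In this sketch the root is always kept, and a single-child node that owns pages (such as `D`) is preserved because dropping it would lose content.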

2. Concurrent Scope Processing (`memory_service.py`)

Added GeventPool-based concurrent scope processing controlled by PAGE_MEMORY_SCOPE_CONCURRENCY env var (default: 4 parallel scopes).
Asset page budget is allocated proportionally across concurrent scopes via _allocate_asset_pages().
Fixed a Pyright type error: greenlet .value list now filters None entries (if g.value is not None) to satisfy the list[_ScopeRunResult] declared type, while raise_error=True ensures no silent failures at runtime.
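
As a rough illustration of proportional budget allocation, the sketch below splits a page budget across scopes by page count using the largest-remainder method. The function name `allocate_pages_sketch`, its signature, and the tie-breaking rule are assumptions; the PR's `_allocate_asset_pages()` is not shown in this page and may behave differently.

```python
def allocate_pages_sketch(total_budget: int, scope_page_counts: list[int]) -> list[int]:
    """Split total_budget across scopes in proportion to their page counts.

    Hypothetical helper: uses integer largest-remainder rounding so the
    shares always sum to total_budget when there is anything to allocate.
    """
    total_pages = sum(scope_page_counts)
    if total_budget <= 0 or total_pages == 0:
        return [0] * len(scope_page_counts)

    # Integer arithmetic avoids floating-point rounding surprises.
    products = [total_budget * count for count in scope_page_counts]
    shares = [p // total_pages for p in products]
    remainders = [p % total_pages for p in products]

    # Hand the pages lost to rounding down to the scopes with the largest
    # remainders; ties go to the earlier scope because sorting is stable.
    leftover = total_budget - sum(shares)
    order = sorted(range(len(shares)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


print(allocate_pages_sketch(8, [30, 10]))
print(allocate_pages_sketch(10, [1, 1, 1]))
```

The sketch does not cap a scope's share at its own page count; whether the real allocator does so is not stated in the description.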

3. Cross-Page Table Continuity (`page_assets.py`)

Implemented LLM-based cross-page table merge: _llm_judge_table_continuity() uses the model to decide if the tail of one page's table and the head of the next page's table form a continuation.
_merge_table_html_files() appends head-page rows into the tail-page HTML file and removes the now-merged head asset.
Fixed Pyright type errors: soup.find() returns PageElement | NavigableString | None; added isinstance(x, Tag) guards before calling .find_all() and .append() to satisfy the type checker without changing runtime behavior.
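
The row-append step and the "narrow before use" guard can be sketched as follows. To keep the example runnable with only the standard library, it uses `xml.etree.ElementTree` in place of BeautifulSoup, so the guard is an explicit `is None` check rather than `isinstance(x, Tag)`. The `merge_table_rows_sketch` name is hypothetical, and the sketch only handles well-formed table fragments; it is not the PR's `_merge_table_html_files()`.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def _find_table(root: ET.Element) -> Optional[ET.Element]:
    """Return the first <table> element, or None if the fragment has none."""
    return root if root.tag == "table" else root.find(".//table")


def merge_table_rows_sketch(tail_html: str, head_html: str) -> str:
    """Append rows of the next page's table (head) to the previous page's table (tail)."""
    tail_root = ET.fromstring(tail_html)
    head_root = ET.fromstring(head_html)

    tail_table = _find_table(tail_root)
    head_table = _find_table(head_root)
    # find() may return None, so narrow before calling methods on the result.
    if tail_table is None or head_table is None:
        return tail_html

    tbody = tail_table.find("tbody")
    target = tbody if tbody is not None else tail_table
    for row in list(head_table.iter("tr")):
        target.append(row)
    return ET.tostring(tail_root, encoding="unicode")


merged = merge_table_rows_sketch(
    "<table><tr><td>a</td></tr></table>",
    "<table><tr><td>b</td></tr></table>",
)
print(merged)
```

Returning the tail fragment unchanged when either side has no table keeps the merge a no-op on unexpected input, which mirrors the intent of adding guards without changing runtime behavior.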

4. Enhanced Asset Metadata

Enriched image and table asset metadata with LLM-extracted entities and titles via concurrent page tagging.
Refactored section summary publication to run at initialization time instead of being deferred.

5. Retrieval Integration

New chunk_types set replaces the legacy data_type integer field across retrieval services for more flexible multi-type filtering.
effective_parse_track handling added to document ingestion to enable proper routing of page-memory documents through the retrieval pipeline.
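
To show why a set of chunk types is more flexible than a single integer code, here is a minimal filtering sketch. The `Chunk` class, the `filter_chunks` function, and the type labels ("text", "image", "table", "page") are assumptions for illustration; the real retrieval request schema is not reproduced here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Chunk:
    """Hypothetical retrieval chunk."""
    chunk_id: str
    chunk_type: str  # assumed labels: "text", "image", "table", "page"


def filter_chunks(chunks: Iterable[Chunk], chunk_types: Optional[Set[str]]) -> List[Chunk]:
    """Keep chunks whose type is in chunk_types; an empty or missing set means no filter."""
    if not chunk_types:
        return list(chunks)
    return [chunk for chunk in chunks if chunk.chunk_type in chunk_types]


corpus = [
    Chunk("c1", "text"),
    Chunk("c2", "image"),
    Chunk("c3", "table"),
    Chunk("c4", "page"),
]
# Any combination of types is expressible without defining a new integer code.
print([c.chunk_id for c in filter_chunks(corpus, {"text", "table"})])
```

With an integer field, each new combination (for example text plus image plus table) needs its own code; with a set, the caller simply lists the types it wants.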

6. CI/CD Fixes (this commit)

Removed 3 unused imports (ruff F401): pandas in docx/parser.py, collapse_page_ranges in fine_hierarchy.py, and tempfile in the contract test test_page_memory_cross_page_table_merge.py.
Renamed unused ctx → _ctx in debug_page_memory.py (ruff F841).
All Pyright type errors resolved (0 errors, 0 warnings after make check).

Testing

make check passes: ruff lint ✅ + pyright typecheck ✅ (0 errors, 0 warnings)
Contract test test_page_memory_cross_page_table_merge covers the table merge logic.
Debug script debug_page_memory.py provides end-to-end single-document tracing for manual QA.

Breaking Changes

None. The page-memory pipeline is opt-in via PARSE_TRACK=page_memory and does not affect the default MinerU pipeline.
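
A minimal sketch of the opt-in switch, assuming the worker reads both settings from environment variables as described above. The helper names are hypothetical; only the variable names `PARSE_TRACK` and `PAGE_MEMORY_SCOPE_CONCURRENCY`, the value `page_memory`, and the default of 4 come from this description.

```python
import os
from typing import Mapping, Optional


def page_memory_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """True only when PARSE_TRACK is explicitly set to page_memory."""
    env = os.environ if env is None else env
    return env.get("PARSE_TRACK", "").strip() == "page_memory"


def scope_concurrency(env: Optional[Mapping[str, str]] = None, default: int = 4) -> int:
    """Read PAGE_MEMORY_SCOPE_CONCURRENCY, falling back to the default on bad input."""
    env = os.environ if env is None else env
    raw = env.get("PAGE_MEMORY_SCOPE_CONCURRENCY", "").strip()
    try:
        value = int(raw)
    except ValueError:
        return default
    # Assumption: non-positive values are treated as invalid.
    return value if value > 0 else default


print(page_memory_enabled({"PARSE_TRACK": "page_memory"}))
print(scope_concurrency({}))
```

Because the check is an exact match, leaving `PARSE_TRACK` unset keeps documents on the default pipeline.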


…p shard plan option
- Added `skip_shard_plan` parameter to `run_lightweight_anatomy` and `profile_document` to allow bypassing LLM shard decisions during profiling.
- Updated `memory_service` to utilize `skip_shard_plan` for page memory processing.
- Introduced asset annotation functionality for visualizing extracted assets on pages.
- Refactored page section mapping and node assembly to support internal section body pages.
- Improved environment configuration for Java dependencies in Docker setup.

…nd effective parse track handling
- Updated `data_type` in `RetrievalQueryRequest` to support new chunk types: page (7) and text+image+table (8).
- Refactored document ingestion service to apply effective parse track based on file extension and user-defined settings.
- Introduced new validation for parse track handling in oversized PDF processing.
- Enhanced tests to cover new chunk types and parse track logic, ensuring robust validation and functionality.


…t handling
- Updated document parser to utilize the new summary engine for images, tables, and text, improving the extraction of titles, summaries, and entities.
- Refactored image and table handling to streamline the summarization process, replacing legacy methods with unified calls to the summary engine.
- Enhanced markdown parsing to support new entity and asset title fields, ensuring comprehensive data capture during document processing.
- Introduced serialization for typed entities to maintain structured data integrity across parsed documents.


- Remove unused pandas import in docx/parser.py (F401)
- Remove unused collapse_page_ranges import in fine_hierarchy.py (F401)
- Rename unused ctx variable to _ctx in debug_page_memory.py (F841)
- Remove unused tempfile import in test_page_memory_cross_page_table_merge.py (F401)
- Fix greenlet value type: filter None from scope_results list in memory_service.py (reportAssignmentType)
- Fix BeautifulSoup Tag narrowing in page_assets.py: use isinstance(x, Tag) guards for find_all/append calls (reportAttributeAccessIssue, reportOptionalMemberAccess)

+        "- Judge VLM by: does the red box tightly enclose the table/chart "
+        "(incl. caption, excl. body text)? Compare against the green reference.",


+        try:
+            if path.exists():
+                shutil.rmtree(path)
+        except Exception:


+    else:
+        try:
+            (root / "assets.json").unlink()
+        except FileNotFoundError:


+    if head_asset.image_path:
+        try:
+            Path(head_asset.image_path).unlink(missing_ok=True)
+        except Exception:


+                        prominence = None
+                        try:
+                            prominence = float(item.get("prominence", 0.5))
+                        except (TypeError, ValueError):


+            logger.warning("[summary] LLM call failed for {}: {}", usage_task, exc)
+            if budget is not None:
+                budget.refund(budget_pool, est=est, stage=budget_stage)
+                budget = None


+    asset_title_hint: str = ...,
+    prompt_task: str | None = ...,
+    prompt_paras: dict[str, Any] | None = ...,
+) -> AssetSummary: ...


+    asset_title_hint: str = ...,
+    prompt_task: str | None = ...,
+    prompt_paras: dict[str, Any] | None = ...,
+) -> BodySummary: ...


…lure

The feature branch introduced a migration with revision ID 'a1b2c3d4e5f6' (replace_data_type_with_chunk_types), which collided with an existing main branch migration sharing the same ID (add_qstash_tracking_columns). Alembic detected this as a cycle and blocked all migration tests.

Fix: Assigned a unique revision ID 'f0d85d209e68' to the chunk_types migration and rebased its down_revision onto 'f9d0e1f2a3b4' (add_document_metadata_to_documents), which is the current main head. The Alembic graph now has a single clean head with no cycles.
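
The collision described above can be caught with a small pre-merge check. The sketch below is plain Python over hand-written tuples and does not call Alembic; the helper names are hypothetical, while the revision IDs and migration names are the ones quoted in the commit message. Parent revisions that the message does not mention are left as `None` rather than guessed.

```python
from __future__ import annotations

from collections import defaultdict


def find_duplicate_revisions(migrations: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Map each revision ID used by more than one migration to the migration names."""
    names_by_revision: dict[str, list[str]] = defaultdict(list)
    for revision, name in migrations:
        names_by_revision[revision].append(name)
    return {rev: names for rev, names in names_by_revision.items() if len(names) > 1}


def find_heads(down_revisions: dict[str, str | None]) -> set[str]:
    """Heads are revisions that no other revision names as its down_revision."""
    parents = {parent for parent in down_revisions.values() if parent is not None}
    return set(down_revisions) - parents


before = [
    ("a1b2c3d4e5f6", "add_qstash_tracking_columns"),
    ("a1b2c3d4e5f6", "replace_data_type_with_chunk_types"),
]
after = [
    ("a1b2c3d4e5f6", "add_qstash_tracking_columns"),
    ("f0d85d209e68", "replace_data_type_with_chunk_types"),
]
# Only the edge stated in the commit message is modeled here.
graph = {"f9d0e1f2a3b4": None, "f0d85d209e68": "f9d0e1f2a3b4"}

print(find_duplicate_revisions(before))
print(find_duplicate_revisions(after))
print(find_heads(graph))
```

A check like this could run in CI before the migration tests, so that a reused revision ID is reported directly instead of surfacing as a cycle error.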

- test_agentic_evidence_renderer_contract.py: Update expected string format to match 'Pages 225-226' instead of 'Pages: 225, 226'.
- test_page_memory_cross_page_table_merge.py: Remove unused 'budget' kwarg from merge_cross_page_tables() test calls.
- test_doc_profile_anatomy_contract.py: Mock 'PDF_PROFILE_TOC_ENABLED' = True since the feature branch relies on it to test TOC profile attempting logic.

EricNGOntos added 14 commits June 30, 2026 02:05

feat: implement node assembler for page memory hierarchy to support shared page content and VLM-based node summarization

6773fe3

feat: implement page asset integration into node assembly and chunk connections with support for custom probes

0035b4a

refactor: decouple page memory from image URIs and simplify node metadata structure

c8f88ce

feat: implement concurrent page tagging and enrich image and table asset metadata with LLM-extracted entities and titles

eea74df

feat: standardize page_memory track for PDF/PPTX, implement page-count limits, and refactor section summary publication to initialization.

dfc10f1

chore: align parser env defaults

5dbd084

refactor: replace legacy data_type integer with flexible chunk_types set across retrieval services

3531581

feat: implement intermediate node chain collapse and add utilities for cross-page table continuity analysis

10e6ba8

Add thread-safe trace logging, remove unused heading dataframes, and implement a text-only tagging mode for page memory processing.

e11a41f

refactor: remove budget tracking dependency from page memory services

ab73cc7

EricNGOntos added the page-memory label (Page memory and page-based parsing features) Jun 29, 2026

EricNGOntos self-assigned this Jun 29, 2026

github-advanced-security AI found potential problems Jun 29, 2026


EricNGOntos added 2 commits June 30, 2026 02:13

EricNGOntos merged commit 43a54ef into main Jun 29, 2026
6 checks passed

EricNGOntos deleted the feat/wuchengke/page-memory-native-hierarchy branch June 29, 2026 18:26

EricNGOntos mentioned this pull request Jul 1, 2026

fix: retrieval hydration and lexical search refinements for page memory #193

Merged

3 tasks

EricNGOntos merged 16 commits into main from feat/wuchengke/page-memory-native-hierarchy
