feat: page-memory native hierarchy - concurrent scope processing, asset type-safety, and cross-page table merge #183
Merged
…hared page content and VLM-based node summarization
…onnections with support for custom probes
…p shard plan option
- Added `skip_shard_plan` parameter to `run_lightweight_anatomy` and `profile_document` to allow bypassing LLM shard decisions during profiling.
- Updated `memory_service` to utilize `skip_shard_plan` for page memory processing.
- Introduced asset annotation functionality for visualizing extracted assets on pages.
- Refactored page section mapping and node assembly to support internal section body pages.
- Improved environment configuration for Java dependencies in Docker setup.
…nd effective parse track handling
- Updated `data_type` in `RetrievalQueryRequest` to support new chunk types: page (7) and text+image+table (8).
- Refactored document ingestion service to apply effective parse track based on file extension and user-defined settings.
- Introduced new validation for parse track handling in oversized PDF processing.
- Enhanced tests to cover new chunk types and parse track logic, ensuring robust validation and functionality.
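A minimal sketch of how an effective parse track could be resolved from the file extension and the requested setting. It is not the ingestion service's actual code: the function name, the `DEFAULT_TRACK` placeholder, and the extension set (taken from the PR summary's mention of PDF and PPTX) are assumptions for illustration.

```python
from pathlib import Path
from typing import Optional

# Assumption: only PDF and PPTX have a page-memory parsing path.
PAGE_MEMORY_EXTENSIONS = {".pdf", ".pptx"}
# Placeholder name for the default pipeline; the real identifier is not shown in this PR.
DEFAULT_TRACK = "default"


def effective_parse_track(filename: str, requested: Optional[str]) -> str:
    """Return the track that would actually run for this file (hypothetical helper)."""
    extension = Path(filename).suffix.lower()
    if requested == "page_memory" and extension in PAGE_MEMORY_EXTENSIONS:
        return "page_memory"
    # Unsupported extensions and unset requests fall back to the default track.
    return DEFAULT_TRACK
```

Under these assumptions, `effective_parse_track("report.PDF", "page_memory")` resolves to `page_memory`, while a `.docx` file falls back to the default track even when page memory is requested.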
…t handling
- Updated document parser to utilize the new summary engine for images, tables, and text, improving the extraction of titles, summaries, and entities.
- Refactored image and table handling to streamline the summarization process, replacing legacy methods with unified calls to the summary engine.
- Enhanced markdown parsing to support new entity and asset title fields, ensuring comprehensive data capture during document processing.
- Introduced serialization for typed entities to maintain structured data integrity across parsed documents.
…set metadata with LLM-extracted entities and titles
…t limits, and refactor section summary publication to initialization.
…set across retrieval services
…r cross-page table continuity analysis
…implement a text-only tagging mode for page memory processing.
- Remove unused pandas import in docx/parser.py (F401)
- Remove unused collapse_page_ranges import in fine_hierarchy.py (F401)
- Rename unused ctx variable to _ctx in debug_page_memory.py (F841)
- Remove unused tempfile import in test_page_memory_cross_page_table_merge.py (F401)
- Fix greenlet value type: filter None from scope_results list in memory_service.py (reportAssignmentType)
- Fix BeautifulSoup Tag narrowing in page_assets.py: use isinstance(x, Tag) guards for find_all/append calls (reportAttributeAccessIssue, reportOptionalMemberAccess)
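The `None` filter on scope results can be sketched as follows. This is an illustration only: it swaps the worker's greenlets for `concurrent.futures` threads so it runs with the standard library, and `ScopeRunResult`, `run_scope`, and `run_scopes` are hypothetical stand-ins. Only the `PAGE_MEMORY_SCOPE_CONCURRENCY` variable and its default of 4 come from the PR description.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScopeRunResult:
    """Hypothetical stand-in for the worker's _ScopeRunResult."""
    scope_id: int


def run_scope(scope_id: int) -> Optional[ScopeRunResult]:
    # Hypothetical worker: odd scopes produce nothing, to exercise the filter.
    return ScopeRunResult(scope_id) if scope_id % 2 == 0 else None


def run_scopes(scope_ids: list[int]) -> list[ScopeRunResult]:
    workers = int(os.environ.get("PAGE_MEMORY_SCOPE_CONCURRENCY", "4"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_scope, scope_id) for scope_id in scope_ids]
        # result() re-raises any worker exception, so failures are not silent.
        values = [future.result() for future in futures]
    # Dropping None narrows list[Optional[ScopeRunResult]] to list[ScopeRunResult].
    return [value for value in values if value is not None]
```

The comprehension is what lets a type checker accept the declared return type; the error propagation through `result()` plays the role that `raise_error=True` plays for greenlets.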
Comment on lines +462 to +463

    "- Judge VLM by: does the red box tightly enclose the table/chart "
    "(incl. caption, excl. body text)? Compare against the green reference.",

    try:
        if path.exists():
            shutil.rmtree(path)
    except Exception:

    else:
        try:
            (root / "assets.json").unlink()
        except FileNotFoundError:

    if head_asset.image_path:
        try:
            Path(head_asset.image_path).unlink(missing_ok=True)
        except Exception:

    prominence = None
    try:
        prominence = float(item.get("prominence", 0.5))
    except (TypeError, ValueError):

    logger.warning("[summary] LLM call failed for {}: {}", usage_task, exc)
    if budget is not None:
        budget.refund(budget_pool, est=est, stage=budget_stage)
        budget = None

        asset_title_hint: str = ...,
        prompt_task: str | None = ...,
        prompt_paras: dict[str, Any] | None = ...,
    ) -> AssetSummary: ...

        asset_title_hint: str = ...,
        prompt_task: str | None = ...,
        prompt_paras: dict[str, Any] | None = ...,
    ) -> BodySummary: ...
…lure
The feature branch introduced a migration with revision ID 'a1b2c3d4e5f6' (replace_data_type_with_chunk_types), which collided with an existing main branch migration sharing the same ID (add_qstash_tracking_columns). Alembic detected this as a cycle and blocked all migration tests.
Fix: Assigned a unique revision ID 'f0d85d209e68' to the chunk_types migration and rebased its down_revision onto 'f9d0e1f2a3b4' (add_document_metadata_to_documents), which is the current main head. The Alembic graph now has a single clean head with no cycles.
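A collision like this can be caught before Alembic runs by scanning migration sources for repeated `revision` values. The helper below is a hypothetical sketch, not part of this PR; it assumes each migration file declares its ID as `revision = '...'` or `revision: str = '...'` at the start of a line, as Alembic's generated templates do.

```python
import re
from collections import defaultdict

# Matches "revision = 'abc'" and "revision: str = 'abc'" but not "down_revision = ...".
_REVISION_RE = re.compile(
    r"^revision(?:\s*:\s*str)?\s*=\s*['\"]([^'\"]+)['\"]", re.MULTILINE
)


def find_duplicate_revisions(sources: dict[str, str]) -> dict[str, list[str]]:
    """Map each revision ID used by more than one file to the files that use it."""
    by_revision: dict[str, list[str]] = defaultdict(list)
    for name, text in sources.items():
        match = _REVISION_RE.search(text)
        if match:
            by_revision[match.group(1)].append(name)
    return {
        revision: sorted(names)
        for revision, names in by_revision.items()
        if len(names) > 1
    }
```

Fed the two migrations named above with their original IDs, the helper reports `a1b2c3d4e5f6` as shared by both files; after the chunk_types migration is given `f0d85d209e68`, it returns an empty mapping.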
- test_agentic_evidence_renderer_contract.py: Update expected string format to match 'Pages 225-226' instead of 'Pages: 225, 226'.
- test_page_memory_cross_page_table_merge.py: Remove unused 'budget' kwarg from merge_cross_page_tables() test calls.
- test_doc_profile_anatomy_contract.py: Mock 'PDF_PROFILE_TOC_ENABLED' = True since the feature branch relies on it to test TOC profile attempting logic.
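The new expected string suggests that consecutive pages are collapsed into ranges before rendering. A rough sketch of that formatting follows; `collapse_ranges` and `format_pages` are hypothetical names, and the comma-separated output for non-consecutive pages is an assumption, since the test only confirms the 'Pages 225-226' case.

```python
def collapse_ranges(pages: list[int]) -> list[tuple[int, int]]:
    """Collapse page numbers into inclusive (start, end) runs of consecutive pages."""
    ranges: list[tuple[int, int]] = []
    for page in sorted(set(pages)):
        if ranges and page == ranges[-1][1] + 1:
            # Extend the current run by one page.
            ranges[-1] = (ranges[-1][0], page)
        else:
            ranges.append((page, page))
    return ranges


def format_pages(pages: list[int]) -> str:
    """Render collapsed runs, e.g. [225, 226] as 'Pages 225-226'."""
    parts = [
        str(start) if start == end else f"{start}-{end}"
        for start, end in collapse_ranges(pages)
    ]
    return "Pages " + ", ".join(parts)
```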
Summary

This PR lands the Page Memory Native Hierarchy feature branch onto `main`. It introduces a production-grade, page-level document understanding pipeline for the Knowhere worker, encompassing several interconnected capabilities developed over the feature branch lifecycle.

Key Changes

1. Page-Based Document Parsing & Hierarchy
- … `page_memory` track in PDF and PPTX parsing paths, controlled via the `PARSE_TRACK` environment variable.
- … (`collapse_single_child_chains`) to clean up degenerate single-child heading chains in the final hierarchy.

2. Concurrent Scope Processing (`memory_service.py`)
- … `PAGE_MEMORY_SCOPE_CONCURRENCY` env var (default: 4 parallel scopes).
- … `_allocate_asset_pages()`.
- … `.value` list now filters `None` entries (`if g.value is not None`) to satisfy the `list[_ScopeRunResult]` declared type, while `raise_error=True` ensures no silent failures at runtime.

3. Cross-Page Table Continuity (`page_assets.py`)
- `_llm_judge_table_continuity()` uses the model to decide if the tail of one page's table and the head of the next page's table form a continuation.
- `_merge_table_html_files()` appends head-page rows into the tail-page HTML file and removes the now-merged head asset.
- `soup.find()` returns `PageElement | NavigableString | None`; added `isinstance(x, Tag)` guards before calling `.find_all()` and `.append()` to satisfy the type checker without changing runtime behavior.

4. Enhanced Asset Metadata

5. Retrieval Integration
- `chunk_types` set replaces the legacy `data_type` integer field across retrieval services for more flexible multi-type filtering.
- `effective_parse_track` handling added to document ingestion to enable proper routing of page-memory documents through the retrieval pipeline.

6. CI/CD Fixes (this commit)
- … `pandas` in `docx/parser.py`, `collapse_page_ranges` in `fine_hierarchy.py`, `tempfile` in test contract.
- … `ctx` → `_ctx` in `debug_page_memory.py` (ruff F841).
- … `make check`).

Testing
- `make check` passes: ruff lint ✅ + pyright typecheck ✅ (0 errors, 0 warnings)
- `test_page_memory_cross_page_table_merge` covers the table merge logic.
- `debug_page_memory.py` provides end-to-end single-document tracing for manual QA.

Breaking Changes

None. The page-memory pipeline is opt-in via `PARSE_TRACK=page_memory` and does not affect the default MinerU pipeline.
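
For readers unfamiliar with the row-append step in the cross-page table merge, here is a simplified sketch. It is not `_merge_table_html_files()`: it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, works on strings rather than files, and assumes both fragments are well-formed markup. The function name `merge_table_rows` is hypothetical.

```python
import xml.etree.ElementTree as ET


def merge_table_rows(tail_html: str, head_html: str) -> str:
    """Append every <tr> of the head-page table to the tail-page table."""
    tail = ET.fromstring(tail_html)
    head = ET.fromstring(head_html)

    # find() returns Element | None, so narrow explicitly before using it.
    tail_table = tail if tail.tag == "table" else tail.find(".//table")
    if tail_table is None:
        raise ValueError("tail fragment has no <table>")

    # Prefer <tbody> as the append target when the tail table has one.
    tbody = tail_table.find("tbody")
    target = tbody if tbody is not None else tail_table

    for row in list(head.iter("tr")):
        target.append(row)
    return ET.tostring(tail, encoding="unicode")
```

The explicit `is None` checks mirror the intent of the `isinstance(x, Tag)` guards: the lookup result is narrowed before any method is called on it, so the type checker and the runtime agree on what is being appended to.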