Add XML parsing adapter by xcosmosbox · Pull Request #182 · Ontos-AI/knowhere

xcosmosbox · 2026-06-29T16:23:12Z

Summary

Add support for parsing .xml document files through the existing markdown pipeline
XML elements are walked recursively with Python's built-in xml.etree.ElementTree: container-like tag names are mapped to markdown headings and leaf element text is extracted as content lines. The resulting lines are then routed through parse_md() for hierarchy reconstruction and LLM enrichment
Malformed XML is handled gracefully via fallback to raw text lines (no crash on invalid input)
No API, worker deployment, or migration impact — this is a new format adapter that extends the existing DocumentFormat enum and routing logic
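
For readers without the diff at hand, the traversal described in this summary could look roughly like the sketch below. It is not the code in this PR: CONTAINER_TAGS, local_name() and xml_to_markdown_lines() are hypothetical names, the set of container tags is an assumption, and the hand-off to parse_md() is left out.

```python
import xml.etree.ElementTree as ET

# Assumption: which tag names count as "container-like" is not stated in the PR.
CONTAINER_TAGS = {"document", "chapter", "section"}


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def xml_to_markdown_lines(text: str) -> list[str]:
    """Turn XML into markdown-like lines; fall back to raw lines if malformed."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Malformed XML: keep the non-empty raw lines instead of raising.
        return [line for line in text.splitlines() if line.strip()]

    lines: list[str] = []

    def walk(elem: ET.Element, depth: int) -> None:
        name = local_name(elem.tag)
        if name in CONTAINER_TAGS:
            # Markdown has six heading levels, so cap the depth there.
            lines.append("#" * min(depth, 6) + " " + name)
        if elem.text and elem.text.strip():
            lines.append(elem.text.strip())
        for child in elem:
            walk(child, depth + 1)
            # Text after a child's closing tag lives on child.tail.
            if child.tail and child.tail.strip():
                lines.append(child.tail.strip())

    walk(root, 1)
    return lines
```

Calling xml_to_markdown_lines() on a namespaced document yields heading lines for the container tags followed by the leaf text, and the same call on unparseable input returns the raw lines unchanged.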

Verification

pytest apps/worker/tests/contract/test_xml_parser_contract.py -v — all 5 tests pass (2 contract tests + 3 unit tests)
pytest apps/worker/tests/contract/ -v — all 54 contract tests pass with zero regressions
Unit tests verify: text hierarchy extraction, malformed XML fallback, namespace stripping via _local_name()
Contract tests verify: full pipeline integration via checkerboard_parse_output(), DataFrame column contract, content extraction correctness

Deployment Notes

No new environment variables
No database migrations, queue changes, or storage changes
No new dependencies — uses only xml.etree.ElementTree from the Python standard library
Fully backwards compatible; no rollback concerns

Checklist

Tests were added or updated when behavior changed
Public docs, examples, or OpenAPI contracts were updated when needed
Database migrations are idempotent and safe to deploy
Logs, errors, and validation paths avoid leaking secrets or user data
The pull request description explains any breaking or user-visible change

xcosmosbox · 2026-06-30T03:42:07Z

DONE @suguanYang

suguanYang · 2026-06-30T12:57:12Z

Thanks for adding the XML adapter. The worker-side routing and contract tests are a good start, but I think this still needs a few fixes before we call .xml supported.

The XML text traversal currently drops common mixed-content text. For example:

 <section>before <bold>middle</bold> after</section>

currently extracts "before" and "middle" but loses "after", because the leaf-element path returns before the child's tail text is handled. XML mixed content is common enough that this should be covered by a unit test.
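
To make the text/tail split concrete, here is a minimal sketch of collecting mixed content in document order. iter_text_chunks() is a hypothetical helper, not a function from this PR; the point is only that ElementTree stores the text following a child's closing tag on the child's tail attribute, so a traversal that reads just text will miss it.

```python
import xml.etree.ElementTree as ET


def iter_text_chunks(elem: ET.Element):
    """Yield the stripped text fragments of elem's subtree in document order."""
    if elem.text and elem.text.strip():
        yield elem.text.strip()
    for child in elem:
        yield from iter_text_chunks(child)
        # The text after </child> belongs to the parent's content,
        # but ElementTree attaches it to the child as .tail.
        if child.tail and child.tail.strip():
            yield child.tail.strip()


section = ET.fromstring("<section>before <bold>middle</bold> after</section>")
chunks = list(iter_text_chunks(section))
```

A unit test along these lines would assert that all three fragments survive, in order.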

parse_xml() decodes every file as UTF-8 before XML parsing. XML files can legally declare another encoding, such as ISO-8859-1, so the parser should either let the XML parser consume bytes and honor the declaration, or handle declared encodings explicitly. Right now those files fail with UnicodeDecodeError.
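
A minimal sketch of the first option, assuming the adapter can get at the raw bytes of the upload. parse_xml_bytes() is a hypothetical name; the only library behavior relied on is that ElementTree.fromstring() accepts bytes and lets the underlying parser honor the encoding declaration. It does not address the hardening concern and only illustrates the encoding handling.

```python
import xml.etree.ElementTree as ET


def parse_xml_bytes(data: bytes) -> ET.Element:
    """Parse raw bytes so the XML declaration decides the encoding."""
    # No data.decode("utf-8") here: the parser reads the declared encoding itself.
    return ET.fromstring(data)


# A Latin-1 document containing a non-ASCII character (U+00E9).
raw = '<?xml version="1.0" encoding="ISO-8859-1"?><note>caf\u00e9</note>'.encode(
    "iso-8859-1"
)
root = parse_xml_bytes(raw)
```

Decoding the same bytes as UTF-8 first raises UnicodeDecodeError, which is the failure described above.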

One more security point: uploaded XML is untrusted input, so I would avoid direct stdlib ElementTree.fromstring() unless DTD/entities are explicitly rejected. Prefer defusedxml.ElementTree or an equivalent hardened parser path.
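
If adding defusedxml as a dependency is not wanted, one possible "equivalent hardened parser path" is to reject any DOCTYPE before building the tree. The sketch below is an assumption about how that could be done with the standard library alone; parse_untrusted_xml() and DoctypeForbidden are hypothetical names, and it is not a full substitute for defusedxml.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class DoctypeForbidden(ValueError):
    """Raised when untrusted XML contains a DOCTYPE declaration."""


def _reject_doctype(name, system_id, public_id, has_internal_subset):
    # Entity declarations can only appear inside a DTD, so refusing the
    # DOCTYPE also refuses custom and external entities.
    raise DoctypeForbidden(f"DOCTYPE {name!r} is not allowed in uploads")


def parse_untrusted_xml(data: bytes) -> ET.Element:
    # First pass: a bare expat parser whose only job is to abort on a DOCTYPE.
    # Malformed input raises expat.ExpatError here.
    scanner = expat.ParserCreate()
    scanner.StartDoctypeDeclHandler = _reject_doctype
    scanner.Parse(data, True)
    # Second pass: build the tree once the document is known to be DTD-free.
    return ET.fromstring(data)
```

The two-pass design costs an extra scan of the input but keeps the check independent of ElementTree internals; a document such as `<!DOCTYPE a [<!ENTITY x "y">]><a>&x;</a>` is refused before any entity is expanded.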

feat(parser): add support for XML file parsing

a594e44

xcosmosbox wants to merge 1 commit into Ontos-AI:main from xcosmosbox:worktree-feat-xml
