Add XML parsing adapter #182

Open
xcosmosbox wants to merge 1 commit into Ontos-AI:main from xcosmosbox:worktree-feat-xml

@xcosmosbox

Contributor

Summary

  • Add support for parsing .xml document files through the existing markdown pipeline
  • XML elements are walked recursively with Python's built-in xml.etree.ElementTree. Container-like tag names are mapped to markdown headings and leaf element text is extracted as content lines; the result is then routed through parse_md() for hierarchy reconstruction and LLM enrichment
  • Malformed XML is handled gracefully via fallback to raw text lines (no crash on invalid input)
  • No API, worker deployment, or migration impact — this is a new format adapter that extends the existing DocumentFormat enum and routing logic
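A minimal sketch of the element walk described above, for readers who want to see the shape of the mapping. The function name `xml_to_markdown_lines` is hypothetical, and treating "has child elements" as the test for a container-like tag is an assumption; the PR's actual tag classification and its hand-off to parse_md() are not shown here.

```python
import xml.etree.ElementTree as ET


def xml_to_markdown_lines(text: str) -> list[str]:
    """Flatten XML into markdown-style lines (hypothetical helper, not the PR's code)."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Malformed XML: fall back to the raw, non-empty text lines.
        return [line for line in text.splitlines() if line.strip()]

    lines: list[str] = []

    def walk(elem: ET.Element, depth: int) -> None:
        # Assumption: an element with children is "container-like" and
        # becomes a heading; markdown heading depth is capped at 6.
        if len(elem):
            lines.append(f"{'#' * min(depth, 6)} {elem.tag}")
        if elem.text and elem.text.strip():
            lines.append(elem.text.strip())
        for child in elem:
            walk(child, depth + 1)
            # Text that follows a child's closing tag lives in child.tail.
            if child.tail and child.tail.strip():
                lines.append(child.tail.strip())

    walk(root, 1)
    return lines


sample = "<doc><chapter><title>Intro</title><para>Hello</para></chapter></doc>"
print(xml_to_markdown_lines(sample))
```

The resulting lines would then be joined and passed to the markdown pipeline; how that hand-off happens in the worker is outside this sketch.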

Verification

  • pytest apps/worker/tests/contract/test_xml_parser_contract.py -v — all 5 tests pass (2 contract tests + 3 unit tests)
  • pytest apps/worker/tests/contract/ -v — all 54 contract tests pass with zero regressions
  • Unit tests verify: text hierarchy extraction, malformed XML fallback, namespace stripping via _local_name()
  • Contract tests verify: full pipeline integration via checkerboard_parse_output(), DataFrame column contract, content extraction correctness
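For context on the namespace-stripping test: ElementTree reports namespaced tags in the form `{uri}local`, so a helper has to drop the braced prefix. The sketch below uses a hypothetical `strip_namespace` function to illustrate the idea; the PR's own `_local_name()` is not shown in this conversation and may differ.

```python
import xml.etree.ElementTree as ET


def strip_namespace(tag: str) -> str:
    """Return the local part of an ElementTree tag such as '{urn:x}doc'."""
    # Everything up to and including the closing brace is the namespace URI.
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring('<a:doc xmlns:a="urn:example"><a:title>Hi</a:title></a:doc>')
print(root.tag)                   # namespaced form, as ElementTree stores it
print(strip_namespace(root.tag))  # local name only
```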

Deployment Notes

  • No new environment variables
  • No database migrations, queue changes, or storage changes
  • No new dependencies — uses only xml.etree.ElementTree from the Python standard library
  • Fully backwards compatible; no rollback concerns

Checklist

  • Tests were added or updated when behavior changed
  • Public docs, examples, or OpenAPI contracts were updated when needed
  • Database migrations are idempotent and safe to deploy
  • Logs, errors, and validation paths avoid leaking secrets or user data
  • The pull request description explains any breaking or user-visible change

@xcosmosbox

Contributor Author

DONE @suguanYang

@suguanYang

Contributor

Thanks for adding the XML adapter. The worker-side routing and contract tests are a good start, but I think this still needs a few fixes before we call .xml supported.

The XML text traversal currently drops common mixed-content text. For example:

 <section>before <bold>middle</bold> after</section>

currently extracts before and middle, but loses after, because the leaf element path returns before handling the child tail text. XML mixed content is common enough that this should be covered by a unit test.
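One possible shape for a traversal that keeps tail text, as a sketch only. `iter_text` is a hypothetical name, not a function from this PR; the point is that ElementTree stores text after a child's closing tag in `child.tail`, so the parent loop has to emit it after recursing into each child.

```python
import xml.etree.ElementTree as ET
from typing import Iterator


def iter_text(elem: ET.Element) -> Iterator[str]:
    """Yield stripped text fragments in document order, including tails."""
    if elem.text and elem.text.strip():
        yield elem.text.strip()
    for child in elem:
        yield from iter_text(child)
        # The tail belongs to the child object but sits inside the parent.
        if child.tail and child.tail.strip():
            yield child.tail.strip()


root = ET.fromstring("<section>before <bold>middle</bold> after</section>")
print(list(iter_text(root)))  # ['before', 'middle', 'after']
```

The standard library's `Element.itertext()` walks text and tails in the same order, so `"".join(root.itertext())` may be enough if per-fragment handling is not needed. The example above could double as the missing unit test.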

parse_xml() decodes every file as UTF-8 before XML parsing. XML files can legally declare another encoding, such as ISO-8859-1, so the parser should either let the XML parser consume bytes and honor the declaration, or handle declared encodings explicitly. Right now those files fail with UnicodeDecodeError.
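A small sketch of the first option, passing raw bytes to the parser so that the XML declaration decides the encoding. `parse_xml_bytes` is a hypothetical name used for illustration; how parse_xml() actually reads the file is not shown in this thread.

```python
import xml.etree.ElementTree as ET


def parse_xml_bytes(raw: bytes) -> ET.Element:
    """Parse XML from bytes; the parser honors the declared encoding."""
    # No .decode("utf-8") here: decoding first would fail on Latin-1 input.
    return ET.fromstring(raw)


latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><note>café</note>'
).encode("iso-8859-1")

root = parse_xml_bytes(latin1_doc)
print(root.text)
```

Decoding `latin1_doc` as UTF-8 before parsing raises UnicodeDecodeError, which matches the failure described above.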

One more security point: uploaded XML is untrusted input, so I would avoid direct stdlib ElementTree.fromstring() unless DTD/entities are explicitly rejected. Prefer defusedxml.ElementTree or an equivalent hardened parser path.
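If adding defusedxml as a dependency is not wanted, one stdlib-only approximation of "reject DTD/entities" is to pre-scan the input with expat and refuse any DOCTYPE declaration before handing the bytes to ElementTree. This is a sketch under that assumption; `parse_untrusted_xml` and `ForbiddenXMLError` are hypothetical names, and defusedxml remains the more thoroughly hardened choice.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class ForbiddenXMLError(ValueError):
    """Raised when untrusted XML contains a DOCTYPE declaration."""


def _forbid_doctype(name, system_id, public_id, has_internal_subset):
    # Entities can only be declared inside a DTD, so refusing the
    # DOCTYPE also refuses custom entity declarations.
    raise ForbiddenXMLError(f"DOCTYPE {name!r} is not allowed in uploaded XML")


def parse_untrusted_xml(data: bytes) -> ET.Element:
    """Parse XML only if it carries no DOCTYPE (costs one extra parse pass)."""
    scanner = expat.ParserCreate()
    scanner.StartDoctypeDeclHandler = _forbid_doctype
    # Malformed input raises expat.ExpatError here; callers that want the
    # raw-text fallback need to catch it alongside ET.ParseError.
    scanner.Parse(data, True)
    return ET.fromstring(data)


print(parse_untrusted_xml(b"<a><b>ok</b></a>").tag)
```

The trade-off is that the document is parsed twice, and legitimate files that carry a DOCTYPE are rejected outright.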


2 participants