Add XML parsing adapter #182
DONE @suguanYang
Thanks for adding the XML adapter. The worker-side routing and contract tests are a good start, but I think this still needs a few fixes.

<section>before <bold>middle</bold> after</section> currently extracts "before" and "middle" but loses "after", because the leaf element path returns before handling the child's tail text. XML mixed content is common enough that this should be covered by a unit test.

parse_xml() decodes every file as UTF-8 before XML parsing. XML files can legally declare another encoding, such as ISO-8859-1, so the code should either pass the raw bytes to the XML parser and let it honor the declaration, or handle declared encodings explicitly. Right now such files fail with UnicodeDecodeError whenever they contain bytes that are not valid UTF-8.

One more security point: uploaded XML is untrusted input, so I would avoid calling the stdlib ElementTree.fromstring() directly unless DTDs and entities are explicitly rejected. Prefer defusedxml.ElementTree or an equivalent hardened parser path.
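To make the first two points concrete, here is a minimal sketch of what the fix could look like. It is not the adapter's actual code: the helper names extract_text_lines() and parse_xml_bytes() are hypothetical, and it uses the stdlib parser only so the example runs without extra dependencies. For untrusted uploads the fromstring() call would be swapped for the one in defusedxml.ElementTree, as suggested above.

```python
import xml.etree.ElementTree as ET


def extract_text_lines(element):
    """Collect text in document order, including tail text after child elements.

    Hypothetical helper for illustration; not the adapter's real traversal.
    """
    lines = []
    # Text that appears before the first child element.
    if element.text and element.text.strip():
        lines.append(element.text.strip())
    for child in element:
        lines.extend(extract_text_lines(child))
        # child.tail is the text that follows the child's closing tag.
        # Handling it here, in the parent's loop, keeps "after" from being lost.
        if child.tail and child.tail.strip():
            lines.append(child.tail.strip())
    return lines


def parse_xml_bytes(data: bytes):
    """Parse raw bytes so the parser can honor the declared encoding.

    Hypothetical helper; assumes the caller reads the file in binary mode.
    """
    root = ET.fromstring(data)
    return extract_text_lines(root)


mixed = b"<section>before <bold>middle</bold> after</section>"
print(parse_xml_bytes(mixed))  # ['before', 'middle', 'after']

latin1 = '<?xml version="1.0" encoding="ISO-8859-1"?><note>caf\u00e9</note>'.encode(
    "iso-8859-1"
)
print(parse_xml_bytes(latin1))  # ['café']
```

The two sample inputs at the bottom could be turned into the unit tests requested above: one for mixed content and one for a non-UTF-8 declared encoding.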
Summary
- .xml document files through the existing markdown pipeline
- xml.etree.ElementTree, with container-like tag names mapped to markdown headings and leaf element text extracted as content lines, then routed through parse_md() for hierarchy reconstruction and LLM enrichment
- DocumentFormat enum and routing logic

Verification
- pytest apps/worker/tests/contract/test_xml_parser_contract.py -v: all 5 tests pass (2 contract tests + 3 unit tests)
- pytest apps/worker/tests/contract/ -v: all 54 contract tests pass with zero regressions
- _local_name(), checkerboard_parse_output(), DataFrame column contract, content extraction correctness

Deployment Notes
- xml.etree.ElementTree from the Python standard library

Checklist