Add HTML stripping for NodeBB integration and CI workflow #2
Open
anthony-tom1 wants to merge 3 commits into f25 from
NodeBB wraps post content in HTML tags (e.g. <p>). Strip them before sending to Ollama so the LLM receives clean text.
Normalize model output lines (markdown, list prefixes, case) so LANGUAGE/TRANSLATION are detected reliably and NodeBB no longer mislabels non-English as English when the model formats replies loosely.

- Add _normalize_response_line and extend _parse_model_content to return detected language
- Return optional language from translate_content/query_llm_robust; include in TranslateResponse
- Expand unit tests for **LANGUAGE:**, numbered lists, and Language:/Translation: casing
3a5f415 to b60a502
Context
The Ollama-backed translator works end-to-end when the model is running, but NodeBB stores post bodies as HTML (e.g. <p>Bonjour</p>). Sending that markup in the prompt adds noise and can hurt language detection and translation. Separately, small chat models often reply with slightly off-format lines (markdown like **LANGUAGE:**, numbered lists, or Language: instead of LANGUAGE:). The previous parser only recognized lines that started with exactly LANGUAGE: or TRANSLATION:, so a valid answer could be misread as “no language line” and collapsed to English with the original text. The repo also lacked a CI workflow to run tests on every push/PR.

Description
translate_content strips HTML tags before calling query_llm_robust, so the model sees plain text. Response parsing is more tolerant of common formatting variants, and the HTTP API now includes a language field (string or JSON null) parsed from the model when possible, so clients like NodeBB can show a real detected language instead of only English vs Unknown. A GitHub Actions workflow runs pytest on pushes and pull requests targeting f25 so regressions are caught automatically.

Changes in the codebase
- src/translator.py: Added _strip_html() (regex tag removal) and used it in translate_content(). Added _normalize_response_line() and updated _parse_model_content() to accept markdown/list prefixes and case-insensitive language: / translation: labels; the return type now includes the detected language where available. Error paths return (is_english, text, language=None) consistently.
- src/api.py: TranslateResponse includes an optional language field; GET / passes it through from translate_content.
- test/unit/test_translator.py: Added test_translate_content_strips_html (HTML stripped before the LLM call). New parser cases for **LANGUAGE:**, 1. LANGUAGE:, and Language: / Translation: casing. Assertions updated for the 3-tuple return and for language on successful parses.
- .github/workflows/ci.yml (new): On push or PR to f25, runs checkout, Python 3.12 setup, pip install -r requirements.txt, and pytest test/ -v.

How this was tested
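To make the parser scenarios covered by the new unit tests concrete, here is a minimal sketch of tolerant line normalization. It is not the code in src/translator.py: the names normalize_line and parse_reply, the regexes, and the (language, translation) return shape are all assumptions for illustration.

```python
import re
from typing import Optional, Tuple

# Assumed patterns: a leading list marker ("1. ", "2) ", "- ", "* ")
# and a case-insensitive "language:" / "translation:" label.
_LIST_PREFIX = re.compile(r"^\s*(?:\d+[.)]|[-*+])\s+")
_LABEL = re.compile(r"^(language|translation)\s*:\s*(.*)$", re.IGNORECASE)


def normalize_line(line: str) -> str:
    """Drop a leading list marker and markdown bold markers from one line."""
    line = _LIST_PREFIX.sub("", line)
    return line.replace("**", "").strip()


def parse_reply(content: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (language, translation); either is None if its label is missing."""
    language: Optional[str] = None
    translation: Optional[str] = None
    for raw in content.splitlines():
        match = _LABEL.match(normalize_line(raw))
        if not match:
            continue
        key = match.group(1).lower()
        value = match.group(2).strip() or None
        # Keep the first occurrence of each label and ignore repeats.
        if key == "language" and language is None:
            language = value
        elif key == "translation" and translation is None:
            translation = value
    return language, translation
```

Under these assumptions, the three reply styles named in the PR (bold labels, numbered lists, mixed casing) all parse to the same pair, while a reply with no recognizable label yields (None, None) so the caller can fall back.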
- pytest test/ -v locally: all tests pass (currently 17), including HTML stripping and the new parser scenarios.
- uvicorn on port 5001, then GET /?content=... with URL-encoded <p>Bonjour</p> to confirm behavior with Ollama running.

How to verify
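Before calling the live endpoint, the stripping step can be sanity-checked in isolation. The sketch below assumes a plain regex that removes anything shaped like a tag; the name strip_html and the exact pattern are illustrative and may differ from _strip_html() in src/translator.py.

```python
import re

# Assumed pattern: "<", one or more non-">" characters, then ">".
_TAG = re.compile(r"<[^>]+>")


def strip_html(text: str) -> str:
    """Remove tag-shaped substrings and trim surrounding whitespace."""
    return _TAG.sub("", text).strip()


if __name__ == "__main__":
    # A NodeBB-style post body should reach the model as plain text.
    print(strip_html("<p>Bonjour</p>"))
```

A regex like this does not decode entities such as &amp; and would also drop literal text that happens to sit between angle brackets; whether that matters for NodeBB content is not covered by this PR.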
GET /?content=<text> returns is_english, translated_content, and language. language is a string when a LANGUAGE: value is parsed from the model output, and null otherwise (e.g. parse failure, Ollama errors, or missing/invalid format). Clients that only use is_english and translated_content remain compatible.

curl "http://127.0.0.1:5001/?content=%3Cp%3EBonjour%3C%2Fp%3E"
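For a client that wants to use the new field, a rough sketch of building the request URL and interpreting the response might look like this. The helper names build_url and label_for are hypothetical, and the fallback to English or Unknown is an assumption based on the behavior described above, not NodeBB's actual code.

```python
import json
from urllib.parse import quote

BASE_URL = "http://127.0.0.1:5001/"  # host and port from the curl example above


def build_url(content: str) -> str:
    """Percent-encode the post body so markup survives the query string."""
    return f"{BASE_URL}?content={quote(content, safe='')}"


def label_for(body: str) -> str:
    """Pick a display label from a JSON response body.

    Prefers the optional language field; when it is null or absent,
    falls back to the older is_english flag.
    """
    data = json.loads(body)
    language = data.get("language")
    if language:
        return language
    return "English" if data.get("is_english") else "Unknown"
```

Because label_for only reads language when it is present and non-null, a response from a server without this change still yields a label from is_english alone.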