Add HTML stripping for NodeBB integration and CI workflow #2

Open
anthony-tom1 wants to merge 3 commits into f25 from add-html-stripping-and-ci

anthony-tom1 commented Apr 2, 2026

Context

The Ollama-backed translator works end-to-end when the model is running, but NodeBB stores post bodies as HTML (e.g. <p>Bonjour</p>). Sending that markup in the prompt adds noise and can hurt language detection and translation. Separately, small chat models often reply with slightly off-format lines (markdown like **LANGUAGE:**, numbered lists, or Language: instead of LANGUAGE:). The previous parser only recognized lines that started with exactly LANGUAGE: / TRANSLATION:, so a valid answer could be misread as “no language line” and collapsed to English with the original text. The repo also lacked a CI workflow to run tests on every push/PR.

Description

translate_content strips HTML tags before calling query_llm_robust, so the model sees plain text. Response parsing is more tolerant of common formatting variants, and the HTTP API now includes a language field (string or JSON null) parsed from the model when possible, so clients like NodeBB can show a real detected language instead of only English vs Unknown. A GitHub Actions workflow runs pytest on pushes and pull requests targeting f25 so regressions are caught automatically.
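For reviewers who want a feel for the stripping step, here is a minimal sketch of regex-based tag removal. It is not the code in this PR: the name strip_html_sketch is hypothetical, and the entity decoding and whitespace collapsing are assumptions added for illustration (the PR itself only describes regex tag removal).

```python
import html
import re

# Assumption: any "<...>" run counts as a tag. A regex like this is fine for
# NodeBB-style markup such as <p>, but it is not a full HTML parser.
_TAG_RE = re.compile(r"<[^>]+>")


def strip_html_sketch(text: str) -> str:
    """Remove tags, decode entities, and collapse whitespace (illustrative only)."""
    # Replace each tag with a space so words in adjacent blocks do not fuse.
    without_tags = _TAG_RE.sub(" ", text)
    # Assumption: decode entities such as &eacute; so the model sees real characters.
    decoded = html.unescape(without_tags)
    # Collapse the extra spaces introduced above and trim both ends.
    return " ".join(decoded.split())


if __name__ == "__main__":
    print(strip_html_sketch("<p>Bonjour</p>"))
```

One trade-off of substituting a space for each tag: inline markup in the middle of a word (for example foo<b>bar</b>) comes out as two words. For prompt input that seemed an acceptable cost in this sketch.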

Changes in the codebase

  • src/translator.py: Added _strip_html() (regex tag removal) and use it in translate_content(). Added _normalize_response_line() and updated _parse_model_content() to accept markdown/list prefixes and case-insensitive language: / translation: labels; return type now includes detected language where available. Error paths return (is_english, text, language=None) consistently.
  • src/api.py: TranslateResponse includes optional language; GET / passes it through from translate_content.
  • test/unit/test_translator.py: Added test_translate_content_strips_html (HTML stripped before the LLM call). New parser cases for **LANGUAGE:**, 1. LANGUAGE:, and Language: / Translation: casing. Assertions updated for the 3-tuple return and language on successful parses.
  • .github/workflows/ci.yml (new): On push/PR to f25: checkout, Python 3.12, pip install -r requirements.txt, pytest test/ -v.
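The tolerant parsing described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation in src/translator.py: the function names normalize_line and parse_model_content_sketch are hypothetical, the exact set of prefixes handled is a guess, and the behavior when a non-English language is found without a translation line is an assumption.

```python
import re
from typing import Optional, Tuple

# Assumption: leading noise is bullets, heading/quote markers, or "1." / "1)" numbering.
_PREFIX_RE = re.compile(r"^\s*(?:[-*+>#]+\s+|\d+[.)]\s+)*")
# Labels are matched case-insensitively, so "Language:" and "LANGUAGE:" both work.
_LABEL_RE = re.compile(r"^(language|translation)\s*:\s*(.*)$", re.IGNORECASE)


def normalize_line(line: str) -> str:
    """Drop list prefixes and bold markers so the label starts the line."""
    line = _PREFIX_RE.sub("", line)
    # Note: this also removes "**" inside the translated text itself.
    return line.replace("**", "").strip()


def parse_model_content_sketch(
    content: str, original: str
) -> Tuple[bool, str, Optional[str]]:
    """Return (is_english, text, language) from a loosely formatted model reply."""
    language: Optional[str] = None
    translation: Optional[str] = None
    for raw in content.splitlines():
        match = _LABEL_RE.match(normalize_line(raw))
        if not match:
            continue
        label = match.group(1).lower()
        value = match.group(2).strip()
        # Keep the first occurrence of each label; empty values count as missing.
        if label == "language" and language is None:
            language = value or None
        elif label == "translation" and translation is None:
            translation = value or None
    if language is None:
        # No usable language line: fall back to English with the original text.
        return True, original, None
    is_english = language.lower() == "english"
    if is_english or translation is None:
        # Assumption: without a translation line, hand back the original text.
        return is_english, original, language
    return False, translation, language
```

Normalizing each line first and matching labels second keeps the label regex simple, and new prefix styles can be added in one place.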

How this was tested

  • pytest test/ -v locally — all tests pass (currently 17), including HTML stripping and the new parser scenarios.
  • Manual (optional): run uvicorn on port 5001, GET /?content=... with URL-encoded <p>Bonjour</p> and confirm behavior with Ollama running.

How to verify

  • CI should go green on this PR once the workflow runs.
  • JSON shape: GET /?content=<text> returns is_english, translated_content, and language. language is a string when a LANGUAGE: value is parsed from the model output, and null otherwise (e.g. parse failure, Ollama errors, or missing/invalid format). Clients that only use is_english and translated_content remain compatible.
  • HTML-wrapped input from NodeBB should yield cleaner prompts and more reliable detection/translation than sending raw tags.
  • Quick check (with Ollama up):

curl "http://127.0.0.1:5001/?content=%3Cp%3EBonjour%3C%2Fp%3E"
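On the client side, the JSON shape described above could be consumed along these lines. This is a hedged sketch: describe_translation is a hypothetical helper, not part of this PR or of NodeBB, and the bracketed label format is only for illustration. It assumes the three fields named above and treats a missing language key the same as null, so it also works against a server that predates this change.

```python
import json
from typing import Any, Dict


def describe_translation(body: str) -> str:
    """Turn a response body into a display string (hypothetical helper)."""
    payload: Dict[str, Any] = json.loads(body)
    # .get() yields None both for JSON null and for an absent "language" key.
    language = payload.get("language")
    text = payload["translated_content"]
    if payload["is_english"]:
        return f"[{language or 'English'}] {text}"
    # Fall back to the old English vs Unknown labeling when no language was parsed.
    return f"[{language or 'Unknown'}] {text}"


if __name__ == "__main__":
    sample = '{"is_english": false, "translated_content": "Hello", "language": "French"}'
    print(describe_translation(sample))
```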

NodeBB wraps post content in HTML tags (e.g. <p>). Strip them before
sending to Ollama so the LLM receives clean text.
Normalize model output lines (markdown, list prefixes, case) so LANGUAGE/TRANSLATION are detected reliably and NodeBB no longer mislabels non-English as English when the model formats replies loosely.

- Add _normalize_response_line and extend _parse_model_content to return detected language
- Return optional language from translate_content/query_llm_robust; include in TranslateResponse
- Expand unit tests for **LANGUAGE:**, numbered lists, and Language:/Translation: casing

anthony-tom1 force-pushed the add-html-stripping-and-ci branch from 3a5f415 to b60a502 on April 3, 2026 05:24