Add HTML stripping for NodeBB integration and CI workflow by anthony-tom1 · Pull Request #2 · CMU-313/llm-experiment-microservice-kernel-panic

anthony-tom1 · 2026-04-02T00:36:03Z

Context

The Ollama-backed translator works end-to-end when the model is running, but NodeBB stores post bodies as HTML (e.g. <p>Bonjour</p>). Sending that markup in the prompt adds noise and can hurt language detection and translation. Separately, small chat models often reply with slightly off-format lines (markdown like **LANGUAGE:**, numbered lists, or Language: instead of LANGUAGE:). The previous parser only recognized lines that started with exactly LANGUAGE: / TRANSLATION:, so a valid answer could be misread as “no language line” and collapsed to English with the original text. The repo also lacked a CI workflow to run tests on every push/PR.

Description

translate_content strips HTML tags before calling query_llm_robust, so the model sees plain text. Response parsing is more tolerant of common formatting variants, and the HTTP API now includes a language field (string or JSON null) parsed from the model when possible, so clients like NodeBB can show a real detected language instead of only English vs Unknown. A GitHub Actions workflow runs pytest on pushes and pull requests targeting f25 so regressions are caught automatically.
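
As a rough illustration of the regex tag removal described above, here is a minimal sketch. The function name strip_html_sketch, the tag pattern, and the entity-decoding step are assumptions made for this example; the PR only states that _strip_html() removes tags with a regex, so the actual implementation may differ.

```python
import html
import re

# Hypothetical pattern: anything that looks like an HTML tag.
_TAG_RE = re.compile(r"<[^>]+>")


def strip_html_sketch(text: str) -> str:
    """Remove HTML tags and collapse leftover whitespace (illustrative only)."""
    # Replace each tag with a space so adjacent words do not get glued together.
    without_tags = _TAG_RE.sub(" ", text)
    # Assumption: decoding entities such as &amp; is useful; the PR does not confirm it.
    decoded = html.unescape(without_tags)
    # Collapse runs of whitespace introduced by the tag removal.
    return " ".join(decoded.split())


print(strip_html_sketch("<p>Bonjour</p>"))
```

A regex like this is only suitable for simple markup such as NodeBB post bodies; text containing literal angle brackets (for example "a < b and c > d") would lose content, which a real HTML parser would avoid.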

Changes in the codebase

src/translator.py: Added _strip_html() (regex tag removal) and used it in translate_content(). Added _normalize_response_line() and updated _parse_model_content() to accept markdown/list prefixes and case-insensitive language: / translation: labels; the return value now includes the detected language where available. Error paths consistently return (is_english, text, language=None).
src/api.py: TranslateResponse includes an optional language field; GET / passes it through from translate_content.
test/unit/test_translator.py: Added test_translate_content_strips_html (HTML is stripped before the LLM call) and new parser cases for **LANGUAGE:**, 1. LANGUAGE:, and Language: / Translation: casing. Assertions updated for the 3-tuple return and for language on successful parses.
.github/workflows/ci.yml (new): on push/PR to f25, checks out the repo, sets up Python 3.12, runs pip install -r requirements.txt, then pytest test/ -v.
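
The tolerant parsing described for src/translator.py could look roughly like the sketch below. All names ending in _sketch, both regular expressions, and the rule that is_english is derived by comparing the language to "English" are assumptions for illustration; only the (is_english, text, language) tuple shape and the fallback to English with the original text come from this PR.

```python
import re
from typing import Optional, Tuple

# Hypothetical: optional bullet ("- ", "* ") or list number ("1.", "2)") at line start.
_PREFIX_RE = re.compile(r"^\s*(?:[-*+]\s+|\d+[.)]\s*)?")
# Hypothetical: label with optional markdown emphasis around it, any casing.
_LABEL_RE = re.compile(
    r"^[*_]*(language|translation)[*_]*\s*:[*_]*\s*(.*)$", re.IGNORECASE
)


def normalize_line_sketch(line: str) -> str:
    """Drop a leading list prefix and surrounding whitespace from one line."""
    return _PREFIX_RE.sub("", line, count=1).strip()


def parse_model_content_sketch(
    content: str, original: str
) -> Tuple[bool, str, Optional[str]]:
    """Return (is_english, text, language) from a loosely formatted reply."""
    language: Optional[str] = None
    translation: Optional[str] = None
    for raw in content.splitlines():
        match = _LABEL_RE.match(normalize_line_sketch(raw))
        if not match:
            continue
        label, value = match.group(1).lower(), match.group(2).strip()
        if label == "language" and value:
            language = value.strip("*_ ")
        elif label == "translation" and value:
            translation = value
    if language is None:
        # Fallback described in the PR: treat as English, keep the original text.
        return True, original, None
    is_english = language.casefold() == "english"  # assumption, not from the PR
    text = original if is_english or translation is None else translation
    return is_english, text, language


print(parse_model_content_sketch("**LANGUAGE:** French\n**TRANSLATION:** Hello", "Bonjour"))
print(parse_model_content_sketch("1. Language: French\n2. Translation: Hello", "Bonjour"))
```

Matching emphasis markers only around the label, rather than deleting them everywhere, keeps any asterisks or underscores inside the translated text intact.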

How this was tested

pytest test/ -v locally — all tests pass (currently 17), including HTML stripping and the new parser scenarios.
Manual (optional): run uvicorn on port 5001, GET /?content=... with URL-encoded <p>Bonjour</p> and confirm behavior with Ollama running.

How to verify

CI should go green on this PR once the workflow runs.
JSON shape: GET /?content=<text> returns is_english, translated_content, and language. language is a string when a LANGUAGE: value is parsed from the model output, and null otherwise (e.g. parse failure, Ollama errors, or missing/invalid format). Clients that only use is_english and translated_content remain compatible.
HTML-wrapped input from NodeBB should yield cleaner prompts and more reliable detection/translation than sending raw tags.
Quick check (with Ollama up):

curl "http://127.0.0.1:5001/?content=%3Cp%3EBonjour%3C%2Fp%3E"
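
For clients that consume the JSON shape above, a small sketch of building the same request URL and handling a null language may help. The helper names, the sample response body, and the fallback labels are hypothetical; only the three field names, the port, and the English vs Unknown fallback are taken from this PR.

```python
import json
from typing import Optional
from urllib.parse import quote

BASE_URL = "http://127.0.0.1:5001/"  # same host and port as the curl check


def build_translate_url(content: str) -> str:
    """Percent-encode the post body so HTML characters survive the query string."""
    return f"{BASE_URL}?content={quote(content, safe='')}"


def describe_language(payload: dict) -> str:
    """Prefer the parsed language; fall back to the older English/Unknown labels."""
    language: Optional[str] = payload.get("language")
    if language is not None:
        return language
    return "English" if payload.get("is_english") else "Unknown"


print(build_translate_url("<p>Bonjour</p>"))

# Hypothetical response body, shaped like the fields listed in this PR.
sample = json.loads(
    '{"is_english": false, "translated_content": "Hello", "language": "French"}'
)
print(describe_language(sample))
```

Because language is optional, a client that ignores it and reads only is_english and translated_content keeps working unchanged.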

Normalize model output lines (markdown, list prefixes, case) so LANGUAGE/TRANSLATION are detected reliably and NodeBB no longer mislabels non-English as English when the model formats replies loosely.

- Add _normalize_response_line and extend _parse_model_content to return detected language
- Return optional language from translate_content/query_llm_robust; include in TranslateResponse
- Expand unit tests for **LANGUAGE:**, numbered lists, and Language:/Translation: casing

anthony-tom1 added 3 commits April 1, 2026 20:28

Strip HTML tags from post content before translation

0f43e4d

NodeBB wraps post content in HTML tags (e.g. <p>). Strip them before sending to Ollama so the LLM receives clean text.

Add CI workflow to run pytest on push and pull requests

9d5e520

anthony-tom1 force-pushed the add-html-stripping-and-ci branch from 3a5f415 to b60a502 Compare April 3, 2026 05:24

anthony-tom1 wants to merge 3 commits into f25 from add-html-stripping-and-ci
