fix(security): resolve CodeQL alerts #4, #5, #9, #10 (#429)
- fix(redos) #10: replace capturing group with non-capturing group in naming_conventions.py to eliminate exponential backtracking (py/redos)
- fix(html-filter) #4: update script/iframe end-tag regex to match tags with trailing attributes, e.g. </script foo="bar"> (py/bad-tag-filter)
- fix(regex-range) #9: replace overly broad [$-_] character range with explicit safe-char list in email_ingestor.py URL pattern (py/overly-large-range)
- fix(info-exposure) #5: replace str(exc) with a generic error message and log the full stack trace server-side in export_import.py (py/stack-trace-exposure)

Closes #4, Closes #5, Closes #9, Closes #10

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
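The info-exposure item in this commit message (return a generic message, log the details server-side) can be illustrated with a minimal, framework-free sketch. The names `run_import`, `importer`, `GENERIC_IMPORT_ERROR`, and the logger name are hypothetical; the actual route code in export_import.py is not shown in this PR page.

```python
import logging

# Hypothetical logger name; the real module presumably has its own logger.
logger = logging.getLogger("export_import_sketch")

# Hypothetical wording for the client-facing message.
GENERIC_IMPORT_ERROR = "Import failed. See server logs for details."


def run_import(payload, importer):
    """Call importer(payload) and return a (status_code, body) pair.

    `importer` is a stand-in for whatever performs the actual import.
    """
    try:
        result = importer(payload)
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback,
        # so the details stay on the server.
        logger.exception("Import failed")
        # The client receives a fixed message instead of str(exc).
        return 500, {"detail": GENERIC_IMPORT_ERROR}
    return 200, {"result": result}
```

With this shape, an exception message that contains a file path or SQL fragment never reaches the response body, while the traceback remains available in the server log.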
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f7170cd6df
semantica/ingest/email_ingestor.py (outdated)
Context: `import re`
Removed: `url_pattern = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"`
Added: `url_pattern = r"https?://(?:[a-zA-Z0-9]|[$\-_.&+!*(),]|(?:%[0-9a-fA-F]{2}))+"`
Preserve path/query chars in URL extraction regex
The new url_pattern in EmailParser.extract_links no longer permits /, ?, =, :, or @, so plain-text URLs are truncated to just the scheme+host (for example, both https://example.com/path1 and https://example.com/path2 become https://example.com). Because this method deduplicates with set, distinct links from the same domain can collapse into one, causing silent data loss during email ingestion.
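The truncation described in this comment can be reproduced with a short sketch. `NARROW` is the new pattern quoted in the diff; `WIDER` is a hypothetical alternative built from the RFC 3986 unreserved and reserved characters plus `%`, and is not the pattern the PR eventually adopted. `extract_links` here is a simplified stand-in for the real method.

```python
import re

# New pattern from the reviewed commit, as quoted in the diff.
NARROW = r"https?://(?:[a-zA-Z0-9]|[$\-_.&+!*(),]|(?:%[0-9a-fA-F]{2}))+"

# Hypothetical wider pattern (assumption): RFC 3986 unreserved + reserved + "%".
WIDER = r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+"

TEXT = "See https://example.com/path1 and https://example.com/path2 for details."


def extract_links(pattern, text):
    """Return the set of unique matches, mirroring the set-based dedup."""
    return set(re.findall(pattern, text))


print(extract_links(NARROW, TEXT))  # {'https://example.com'}
print(sorted(extract_links(WIDER, TEXT)))
# ['https://example.com/path1', 'https://example.com/path2']
```

Because `/` is not in the narrow class, both matches stop at the host and the set collapses them into one entry. A class as wide as `WIDER` also absorbs trailing punctuation such as a sentence-ending period, so a real implementation would likely need to trim that separately.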
- fix(redos) #10: replace regex with string method check to fully eliminate backtracking: name[0].isupper() + simple ^[A-Za-z0-9]+$ removes all nested repetition that caused exponential backtracking
- fix(url-pattern) #9: restore /, ?, =, :, @, # and other RFC 3986 chars to URL regex; previous fix truncated URLs to hostname only

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
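The string-method approach named in this commit message might look roughly like the sketch below. The helper name `starts_upper_alnum` is hypothetical, and the real `_is_noun_phrase()` may perform additional checks that are not visible here.

```python
import re

# One character class with one quantifier: no nested repetition, so the
# engine has a single way to consume the input. fullmatch anchors at both ends.
_ALNUM = re.compile(r"[A-Za-z0-9]+")


def starts_upper_alnum(name):
    """Hypothetical helper: uppercase first character, then only ASCII alphanumerics."""
    if not name or not name[0].isupper():
        return False
    return _ALNUM.fullmatch(name) is not None


print(starts_upper_alnum("PersonName"))       # True
print(starts_upper_alnum("personName"))       # False
print(starts_upper_alnum("A" * 50000 + "!"))  # False, rejected in linear time
```

Nested quantifiers such as `(...*)*` stay ambiguous whether or not the group captures, which is presumably why this follow-up commit drops the nested pattern entirely rather than only changing the group type.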
Summary
This PR resolves 4 open CodeQL security alerts on the `main` branch by fixing vulnerable regular expressions and an information exposure issue.

Closes #4 | Closes #5 | Closes #9 | Closes #10
Security Fixes
#10: Inefficient Regular Expression (ReDoS) · `error` · `py/redos`

File: `semantica/ontology/naming_conventions.py:353`

The capturing group `([A-Z][a-zA-Z0-9]*)*` inside `_is_noun_phrase()` allowed exponential backtracking on crafted inputs starting with `AA` followed by many repetitions of `A`, enabling a potential Denial-of-Service attack.

Fix: Replaced the capturing group with a non-capturing group `(?:...)` to eliminate the ambiguous backtracking path.

#4: Bad HTML Filtering Regexp · `warning` · `py/bad-tag-filter`

File: `semantica/normalize/text_cleaner.py:305`

The `</script>` and `</iframe>` end-tag patterns did not match browser-accepted variants with trailing attributes (e.g. `</script foo="bar">`), allowing XSS payloads like `<script>alert(1)</script foo="bar">` to bypass sanitization.

Fix: Updated end-tag patterns to optionally match trailing attributes.
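The end-tag bypass for alert #4 can be shown with a small sketch. Both patterns below are illustrative assumptions; the actual regexes in text_cleaner.py do not appear in this PR description.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# Assumed shape of the old pattern: the end tag must be exactly "</script>".
STRICT = re.compile(r"<script\b[^>]*>.*?</script>", FLAGS)

# Assumed shape of the fix: allow anything up to ">" after "</script".
LENIENT = re.compile(r"<script\b[^>]*>.*?</script\b[^>]*>", FLAGS)

payload = '<p>hi</p><script>alert(1)</script foo="bar"><p>bye</p>'

print(STRICT.sub("", payload) == payload)  # True: the payload survives
print(LENIENT.sub("", payload))            # <p>hi</p><p>bye</p>
```

The strict pattern never finds a literal `</script>` in the payload, so nothing is removed, while the lenient one strips the whole block. Regex-based filtering remains fragile against other malformed markup, so this is a sketch of the specific fix rather than a complete sanitizer.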
#9: Overly Permissive Regular Expression Range · `warning` · `py/overly-large-range`

File: `semantica/ingest/email_ingestor.py:395`

The URL extraction pattern used `[$-_@.&+]`, in which `$-_` is a character range from `$` (ASCII 36) to `_` (ASCII 95). It unintentionally matched 60 characters, including uppercase letters, digits, and symbols far beyond the intended set.

Fix: Replaced the ambiguous range with an explicit list of safe URL characters.
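The size of the accidental range for alert #9 can be checked directly by enumerating ASCII. The escaped-hyphen class below is only one way to write an explicit list and is an assumption, not the exact replacement used in the PR.

```python
import re


def ascii_matches(char_class):
    """Return every ASCII character accepted by the given character class."""
    pattern = re.compile(char_class)
    return [chr(c) for c in range(128) if pattern.fullmatch(chr(c))]


ranged = ascii_matches(r"[$-_@.&+]")     # "$-_" is parsed as a range
explicit = ascii_matches(r"[$\-_@.&+]")  # escaped hyphen: literal characters only

print(len(ranged))        # 60
print("".join(explicit))  # $&+-.@_
print("/" in ranged, ":" in ranged, "?" in ranged)  # True True True
```

The last line shows that the old range also happened to admit `/`, `:`, and `?`, so narrowing it without adding those characters back changes which URLs the pattern can match.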
#5: Information Exposure Through an Exception · `error` · `py/stack-trace-exposure`

File: `semantica/explorer/routes/export_import.py:232`

`str(exc)` was returned directly to the client in the import error response, potentially leaking internal file paths, module names, SQL fragments, or other sensitive implementation details from the stack trace.

Fix: Return a generic error message to the client; log the full exception server-side using `logger.exception()`.

Test Plan

test_performance_under_load)

References