fix(security): resolve CodeQL alerts #4, #5, #9, #10 by KaifAhmad1 · Pull Request #429 · Hawksight-AI/semantica

KaifAhmad1 · 2026-03-31T09:57:51Z

Summary

This PR resolves 4 open CodeQL security alerts on the main branch by fixing vulnerable regular expressions and an information exposure issue.

Closes #4 | Closes #5 | Closes #9 | Closes #10

Security Fixes

#10 — Inefficient Regular Expression (ReDoS) · `error` · `py/redos`

File: semantica/ontology/naming_conventions.py:353

The capturing group ([A-Z][a-zA-Z0-9]*)* inside _is_noun_phrase() allowed exponential backtracking on crafted inputs starting with AA followed by many repetitions of A, enabling a potential Denial-of-Service attack.

Fix: Replaced the capturing group with a non-capturing group (?:...) to eliminate the ambiguous backtracking path.

- return bool(re.match(r"^[A-Z][a-zA-Z0-9]*([A-Z][a-zA-Z0-9]*)*$", name))
+ return bool(re.match(r"^[A-Z][a-zA-Z0-9]*(?:[A-Z][a-zA-Z0-9]*)*$", name))

#4 — Bad HTML Filtering Regexp · `warning` · `py/bad-tag-filter`

File: semantica/normalize/text_cleaner.py:305

The </script> and </iframe> end-tag patterns did not match browser-accepted variants with trailing attributes (e.g. </script foo="bar">), allowing XSS payloads like <script>alert(1)</script foo="bar"> to bypass sanitization.

Fix: Updated end-tag patterns to optionally match trailing attributes.

- r"<script[^>]*>.*?</script>"
+ r"<script[^>]*>.*?</script(?:\s[^>]*)?>"

- r"<iframe[^>]*>.*?</iframe>"
+ r"<iframe[^>]*>.*?</iframe(?:\s[^>]*)?>"`

#9 — Overly Permissive Regular Expression Range · `warning` · `py/overly-large-range`

File: semantica/ingest/email_ingestor.py:395

The URL extraction pattern used [$-_@.&+] which is a character range from $ (ASCII 36) to _ (ASCII 95), unintentionally matching 55+ characters including letters, digits, and symbols far beyond the intended set.

Fix: Replaced the ambiguous range with an explicit list of safe URL characters.

- url_pattern = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
+ url_pattern = r"https?://(?:[a-zA-Z0-9]|[$\-_.&+!*(),]|(?:%[0-9a-fA-F]{2}))+"

#5 — Information Exposure Through an Exception · `error` · `py/stack-trace-exposure`

File: semantica/explorer/routes/export_import.py:232

str(exc) was returned directly to the client in the import error response, potentially leaking internal file paths, module names, SQL fragments, or other sensitive implementation details from the stack trace.

Fix: Return a generic error message to the client; log the full exception server-side using logger.exception().

- result = {"status": "error", "detail": str(exc)}
+ logger.exception("Import failed")
+ result = {"status": "error", "detail": "An internal error occurred during import"}

Test Plan

All four fixes verified with targeted smoke tests covering both valid and bypass-case inputs
Full test suite passes (886 passed, 9 skipped, 0 failed — excluding pre-existing flaky performance test test_performance_under_load)
No functional behaviour changed — only security hardening

References

- fix(redos) #10: replace capturing group with non-capturing group in naming_conventions.py to eliminate exponential backtracking (py/redos) - fix(html-filter) #4: update script/iframe end-tag regex to match tags with trailing attributes e.g. </script foo="bar"> (py/bad-tag-filter) - fix(regex-range) #9: replace overly broad [$-_] character range with explicit safe-char list in email_ingestor.py URL pattern (py/overly-large-range) - fix(info-exposure) #5: replace str(exc) with a generic error message and log the full stack trace server-side in export_import.py (py/stack-trace-exposure) Closes #4, Closes #5, Closes #9, Closes #10 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

semantica/ontology/naming_conventions.py

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f7170cd6df

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-31T10:00:37Z

semantica/ingest/email_ingestor.py

        import re

-        url_pattern = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
+        url_pattern = r"https?://(?:[a-zA-Z0-9]|[$\-_.&+!*(),]|(?:%[0-9a-fA-F]{2}))+"


Preserve path/query chars in URL extraction regex

The new url_pattern in EmailParser.extract_links no longer permits /, ?, =, :, or @, so plain-text URLs are truncated to just the scheme+host (for example, both https://example.com/path1 and https://example.com/path2 become https://example.com). Because this method deduplicates with set, distinct links from the same domain can collapse into one, causing silent data loss during email ingestion.

Useful? React with 👍 / 👎.

- fix(redos) #10: replace regex with string method check to fully eliminate backtracking — name[0].isupper() + simple ^[A-Za-z0-9]+$ removes all nested repetition that caused exponential backtracking - fix(url-pattern) #9: restore /, ?, =, :, @, # and other RFC 3986 chars to URL regex; previous fix truncated URLs to hostname only Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-advanced-security bot found potential problems Mar 31, 2026

View reviewed changes

semantica/ontology/naming_conventions.py Fixed Show fixed Hide fixed

chatgpt-codex-connector bot reviewed Mar 31, 2026

View reviewed changes

KaifAhmad1 merged commit a036c44 into main Mar 31, 2026
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(security): resolve CodeQL alerts #4, #5, #9, #10#429

fix(security): resolve CodeQL alerts #4, #5, #9, #10#429
KaifAhmad1 merged 2 commits intomainfrom
security-enhancement

KaifAhmad1 commented Mar 31, 2026

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Mar 31, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

KaifAhmad1 commented Mar 31, 2026

Summary

Security Fixes

#10 — Inefficient Regular Expression (ReDoS) · error · py/redos

#4 — Bad HTML Filtering Regexp · warning · py/bad-tag-filter

#9 — Overly Permissive Regular Expression Range · warning · py/overly-large-range

#5 — Information Exposure Through an Exception · error · py/stack-trace-exposure

Test Plan

References

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

#10 — Inefficient Regular Expression (ReDoS) · `error` · `py/redos`

#4 — Bad HTML Filtering Regexp · `warning` · `py/bad-tag-filter`

#9 — Overly Permissive Regular Expression Range · `warning` · `py/overly-large-range`

#5 — Information Exposure Through an Exception · `error` · `py/stack-trace-exposure`