Integrate Semgrep for static analysis (SAST) by connorcarpenter15 · Pull Request #51 · CMU-313/nodebb-spring-26-the-league

connorcarpenter15 · 2026-03-10T15:12:47Z

Summary

This PR integrates Semgrep into the project on a dedicated testing branch named for the tool and documents installation, run artifacts, and an assessment of the tool.

1. Evidence of successful installation (trackable file changes)

The following files were added or modified to install and configure Semgrep:

Change	File	Purpose
Added	`requirements-semgrep.txt`	Python dependency list for Semgrep; install with `pip install -r requirements-semgrep.txt`.
Added	`.semgrepignore`	Ignore patterns (e.g. `node_modules/`, `vendor/`, `build/`, `public/language/`, `*.min.js`) so the scan focuses on project code.
Modified	`package.json`	New NPM script: `"semgrep": "semgrep scan --config auto"` so the tool can be run via `npm run semgrep` after Semgrep is installed (Python/pip).

Note: Semgrep is distributed as a Python package, not an NPM package. No new NPM packages were added; the script assumes semgrep is on PATH after installing from requirements-semgrep.txt.

2. Artifacts demonstrating successful run

The following artifacts show that Semgrep was run successfully on this repository:

semgrep-output.txt — Full human-readable report (1,868 lines) from semgrep scan --config auto, including all findings with file paths, line numbers, rule IDs, and messages.
semgrep-output.json — Machine-readable JSON output of the same run (e.g. for CI or tooling).
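
For reference, a run that writes the JSON artifact could be scripted roughly as follows. This is a minimal sketch, not the exact invocation used for this PR: the helper names `build_scan_command` and `run_scan` are hypothetical, and it assumes `semgrep` is already on PATH after the pip install. `--config`, `--json`, and `--output` are existing Semgrep CLI flags.

```python
import shutil
import subprocess


def build_scan_command(output_path, config="auto"):
    """Return the argv list for a Semgrep scan that writes JSON to output_path."""
    # Hypothetical helper; the flags themselves are standard Semgrep CLI options.
    return ["semgrep", "scan", "--config", config, "--json", "--output", output_path]


def run_scan(output_path="semgrep-output.json"):
    """Run the scan and return Semgrep's exit code.

    Assumes semgrep was installed from requirements-semgrep.txt.
    """
    if shutil.which("semgrep") is None:
        raise RuntimeError("semgrep not found on PATH; install it with pip first")
    completed = subprocess.run(build_scan_command(output_path), check=False)
    return completed.returncode
```

Keeping the command construction separate from the subprocess call makes it possible to check the arguments without running a scan.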

Summary from the run:

Findings: 188 (all blocking under the default config)
Rules run: 305 (from community “auto” config)
Targets scanned: 1,549 files (tracked by git)
Skipped: 4,197 files matching .semgrepignore (e.g. node_modules/, public/language/)
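
The headline numbers above can also be recomputed from the JSON artifact. The sketch below assumes the report has the usual top-level `results` list and a `paths.scanned` list; `summarize` and `load_report` are hypothetical helper names, not part of Semgrep.

```python
import json


def summarize(report):
    """Return headline counts from a parsed Semgrep JSON report (a dict)."""
    results = report.get("results", [])
    # Assumption: scanned targets are listed under report["paths"]["scanned"].
    scanned = report.get("paths", {}).get("scanned", [])
    return {
        "findings": len(results),
        "files_with_findings": len({r["path"] for r in results}),
        "targets_scanned": len(scanned),
    }


def load_report(path="semgrep-output.json"):
    """Load the JSON artifact produced by the scan."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A call such as `summarize(load_report())` should then agree with the findings and target counts that Semgrep prints at the end of the run.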

3. Assessment: pros, cons, and customization

3.1 Customization

A priori (before use):

Installation: One-time install via pip install -r requirements-semgrep.txt. No NPM packages; the only project change is the semgrep script in package.json and the config/ignore files above.
Config: We use --config auto, which pulls the community rule set. No custom rules were written for this test.
Scope: .semgrepignore was added to exclude node_modules/, vendor/, build/, public/language/, coverage/, .nyc_output/, and *.min.js, reducing noise and scan time.

Over time:

Tuning: Rules can be disabled or customized in a local config (e.g. .semgrep.yml) or by choosing a subset of rule IDs. Severity and blocking behavior can be adjusted.
CI: The same command can be run in CI. Note that semgrep scan exits 0 by default even when it reports findings; adding --error (or running semgrep ci) makes the exit code non-zero on findings, so the pipeline can fail on new issues.
Custom rules: Semgrep supports custom rules (e.g. for project-specific patterns); we did not add any for this evaluation.

3.2 Strengths (with evidence)

Multi-language and broad coverage: The run applied 305 rules across JS, YAML, JSON, Dockerfile, bash, HTML, etc. (e.g. 864 JS files, 309 YAML). One tool covers many surfaces.
Actionable, linked guidance: Each finding includes a short message and a “Details” link (e.g. https://sg.run/...) to documentation and remediation.
Structured output: JSON output includes CWE/OWASP metadata, severity, and line/column ranges, which supports automation and reporting.
Low integration cost: No new NPM deps; a single script and an ignore file are enough to run it locally or in CI.
Useful infra/security checks: It flagged real configuration issues (e.g. Docker Compose services without no-new-privileges or with writable root filesystem) and common app-security patterns (e.g. path traversal, regex injection, prototype pollution, session/cookie settings).

Quantitative:

188 findings in one run; many cluster in a small set of files and rule families (e.g. path-traversal and regex rules account for a large share), which helps prioritize.
Severity mix in this run: 174 WARNING, 12 ERROR, 2 INFO (from JSON summary).
Top rule by count: path-join-resolve-traversal (113); others include detect-non-literal-regexp (17), express-path-join-resolve-traversal (8), prototype-pollution-loop (7).
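
The severity mix and per-rule counts can be derived from the JSON output with a few lines of Python. This is a sketch under the assumption that each entry in `results` carries a `check_id` and an `extra.severity` field; the function names are hypothetical.

```python
from collections import Counter


def severity_mix(results):
    """Count findings per severity (e.g. WARNING, ERROR, INFO)."""
    return Counter(r["extra"]["severity"] for r in results)


def top_rules(results, n=5):
    """Return the n most frequent rules as (short_rule_name, count) pairs."""
    # check_id is a dotted path; keep only the last segment as the short name.
    names = (r["check_id"].rsplit(".", 1)[-1] for r in results)
    return Counter(names).most_common(n)
```

Run against `report["results"]` from semgrep-output.json, this should reproduce the kind of breakdown quoted above.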

3.3 Weaknesses (with evidence)

False positives and noise: Several rules are conservative. For example, many path.join/path.resolve usages are flagged as possible path traversal even when inputs are constrained (e.g. internal config or fixed segments). Tuning or disabling rules per file/pattern will be needed for a clean baseline.
No NPM integration: Semgrep is Python-based, so teams that only use Node/npm must install Python and run pip install -r requirements-semgrep.txt. There is no npm install-only story.
Blocking-by-default: With --config auto, 188 blocking findings may be too strict for an initial rollout; teams may want to start with a smaller rule set or non-blocking severity and then tighten over time.
Limited project-specific context: Rules are generic (e.g. “ensure CSRF middleware”); they don’t know this app’s actual CSRF setup. Some findings (e.g. in install/web.js) may not apply to production code paths.

Quantitative:

113 of 188 findings (about 60%) come from a single rule (path-join-resolve-traversal), which can dominate the report and require bulk review or rule customization.
4,197 files were skipped via .semgrepignore; without that, the run would be slower and noisier.
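
One way to keep the existing findings from blocking work while still catching regressions is to compare each run against a stored baseline report. The sketch below is an illustration only: it identifies a finding by rule ID, file path, and start line, which is an assumption that breaks when code above a finding shifts, and the function names are hypothetical.

```python
def finding_key(result):
    """Identify a finding by (rule ID, file path, start line).

    Assumption: line numbers are stable enough between runs for this
    comparison; a real setup may need a more robust identity.
    """
    return (result["check_id"], result["path"], result["start"]["line"])


def new_findings(baseline_results, current_results):
    """Return findings in the current run that are absent from the baseline."""
    known = {finding_key(r) for r in baseline_results}
    return [r for r in current_results if finding_key(r) not in known]
```

With the results from semgrep-output.json stored as the baseline, only findings introduced after this PR would be returned.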

3.4 Conclusion

Semgrep is a strong fit for broad, multi-language static analysis with minimal setup. For this repo, it quickly surfaced infrastructure and application-security issues and produced both human- and machine-readable artifacts. The main follow-up work is to reduce noise (e.g. by adjusting or disabling rules and refining .semgrepignore) and to integrate the command into CI with a policy for which findings are blocking.

Add Semgrep, an AI-assisted SAST tool, to provide static analysis for the repository. Run Semgrep on the codebase and save the output to both a JSON file and a text file.

connorcarpenter15 · 2026-03-10T15:23:44Z

Tool Overview & Analysis Type

Tool: Semgrep (GitHub Source)
Type: Static analysis. It evaluates our source code and configuration files (JS, YAML, Dockerfiles, HTML, bash) without needing to execute the application.
Description: A fast, multi-language static analysis tool that uses customizable, pattern-based rules to find bugs, detect security vulnerabilities, and enforce coding standards.

Types of Problems Caught

This run flagged several highly relevant issues across different layers of the stack:

Application Security: Path traversal risks, regular expression injection, prototype pollution loops, and insecure session/cookie settings.
Infrastructure/Config: Misconfigured Docker Compose services (e.g., missing no-new-privileges flags or running with writable root filesystems).

Customization

Necessary: The .semgrepignore file included in this PR is strictly required to skip node_modules/, compiled assets, and vendor files to keep the scan fast. We will also need to tune the rules immediately to silence the high volume of default blocking findings.
Possible: We can create a local configuration file (e.g., .semgrep.yml) to cherry-pick specific rule IDs, write custom project-specific rules, and adjust severity levels (like downgrading an ERROR to a WARNING).

Development Process Integration

Semgrep is lightweight enough to run locally and in CI, though it does require a Python environment:

Local Development: Handled via pip install -r requirements-semgrep.txt and executed using the new npm run semgrep wrapper script.
CI/CD Pipeline: By default, semgrep scan exits 0 even when it reports findings; with --error (or via semgrep ci) it returns a non-zero exit code when findings are detected. We can drop this into our CI pipeline to fail builds automatically, but we should define a policy for "blocking" rules first so we don't break the build on false positives.
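
A "blocking" policy could be expressed as a small gate over the JSON report. This is a sketch under stated assumptions, not an agreed policy: the choice to block only on ERROR severity, the empty allow-list, and the names `BLOCKING_SEVERITIES`, `ALLOWED_RULES`, `blocking_findings`, and `gate_exit_code` are all hypothetical.

```python
# Hypothetical policy: only ERROR findings fail the build.
BLOCKING_SEVERITIES = frozenset({"ERROR"})
# Hypothetical allow-list of rule IDs the team has accepted as noise.
ALLOWED_RULES = frozenset()


def blocking_findings(results, severities=BLOCKING_SEVERITIES, allowed_rules=ALLOWED_RULES):
    """Return findings that should fail the build under the given policy."""
    return [
        r
        for r in results
        if r["extra"]["severity"] in severities and r["check_id"] not in allowed_rules
    ]


def gate_exit_code(results, severities=BLOCKING_SEVERITIES, allowed_rules=ALLOWED_RULES):
    """Return 1 if any finding is blocking under the policy, otherwise 0."""
    return 1 if blocking_findings(results, severities, allowed_rules) else 0
```

In CI, the script would end with something like `sys.exit(gate_exit_code(report["results"]))`. Under this example policy, the run reported in this PR would block on its 12 ERROR findings rather than on all 188.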

Finding Accuracy & Noise

Category	Observations
False Positives	The default community rules are highly conservative. Notably, `path.join` and `path.resolve` usages are flagged as traversal risks even when inputs are constrained or hardcoded (this rule alone accounts for 113 of 188 findings).
True Positives (Low Priority)	Generic rules flag issues in contexts where they don't apply, such as requiring CSRF middleware in installation/setup scripts rather than actual production code paths.
False Negatives	Because the scan used the generic "auto" configuration without custom project context, highly specific business-logic flaws unique to our app might be missed.

feat(tools): add semgrep

3554bd4


connorcarpenter15 wants to merge 1 commit into main from feature/semgrep
