This project currently orchestrates OWASP ZAP (DAST) and Nuclei (template scanning) behind AWS Cognito authentication, producing unified HTML/Markdown reports. As described in Friday Afternoon Pen Test, this delivers ~60-70% of a commercial pen test's coverage for known vulnerability patterns.
The gap lies in what the blog calls "creative business logic testing" — understanding how the application works to find authorization bypasses, data leakage through business logic, and vulnerability chains that no pattern-matcher catches. Claude Code Security fills exactly this gap: it reads and reasons about source code the way a human security researcher would, tracing data flows across files and identifying complex multi-component vulnerabilities.
This plan extends the project to combine source-code-level AI reasoning (SAST) with runtime dynamic testing (DAST) — mirroring how a human pen tester works: form hypotheses from code, then confirm exploitability at runtime.
┌─────────────────────────┐
│ config.yaml │
│ + source_dir path │
│ + claude API key │
└───────────┬─────────────┘
│
┌───────────▼─────────────┐
│ Orchestrator │
│ (6-phase pipeline) │
└───────────┬─────────────┘
│
┌──────────────────────────┼──────────────────────────┐
│ │ │
┌────────▼────────┐ ┌───────────▼──────────┐ ┌─────────▼─────────┐
│ Phase 1: Auth │ │ Phase 2: Source │ │ Phase 3: ZAP │
│ (Cognito) │ │ Analysis (Claude) │ │ (DAST) │
└────────┬────────┘ └───────────┬──────────┘ └─────────┬─────────┘
│ │ │
│ ┌──────────▼──────────┐ │
│ │ Hypotheses: │ │
│ │ - Auth bypasses │ │
│ │ - Injection points │ │
│ │ - Logic flaws │ │
│ │ - Data flow issues │ │
│ └──────────┬──────────┘ │
│ │ │
│ ┌──────────▼──────────┐ ┌─────────▼─────────┐
│ │ Phase 4: Nuclei │ │ Phase 5: Dedup │
│ │ (+ dynamic templates│ │ + Correlation │
│ │ from hypotheses) │ │ │
│ └──────────┬──────────┘ └─────────┬─────────┘
│ │ │
│ └──────────┬───────────────┘
│ │
│ ┌──────────▼──────────┐
│ │ Phase 6: Report │
│ │ (unified + source │
│ │ context) │
│ └─────────────────────┘
Cognito auth continues as-is. Tokens are shared with all scanners.
New module: src/source_analyzer.py
This is the core addition. It uses the Claude API (via the Anthropic SDK) to perform reasoning-based static analysis on the target application's source code.
- Accept a
source_dirpath in config (or--source-dirCLI flag) - Walk the directory, collecting files by type (
.py,.ts,.js,.tsx,.jsx,.yaml,.env*, etc.) - Respect
.gitignorepatterns and a configurableexclude_patternslist - Build a file manifest with metadata (path, size, language, last modified)
- For large codebases, prioritize security-critical files:
- Route definitions / API handlers
- Authentication / authorization middleware
- Database query builders / ORM models
- Input validation / sanitization
- Configuration files
- Environment variable usage
Use the Anthropic Python SDK (anthropic) to send source code to Claude with a structured security audit prompt. This is not the /security-review slash command (which is interactive) — it's direct API calls for programmatic integration.
Approach:
-
Architecture mapping pass: Send route definitions, middleware chains, and config files. Ask Claude to map the application's attack surface: endpoints, auth requirements, data flows, trust boundaries.
-
Targeted deep-dive passes: For each security-critical area identified, send the relevant files and ask Claude to analyze for:
- Authentication bypasses (missing middleware, JWT validation gaps, role confusion)
- Authorization flaws (IDOR, privilege escalation, broken object-level access)
- Injection points (SQL, NoSQL, command, template injection)
- Business logic flaws (race conditions, state manipulation, workflow bypasses)
- Data exposure (sensitive data in logs, error messages, API responses)
- Cryptographic weaknesses (weak algorithms, hardcoded keys, insecure random)
- SSRF / open redirect patterns
-
Self-verification pass: Re-submit findings back to Claude with an adversarial prompt: "Attempt to disprove each finding. Rate confidence 1-5. Remove false positives." This mirrors Claude Code Security's multi-stage verification.
Output: A list of SourceFinding objects:
@dataclass
class SourceFinding:
source: str = "claude-sast"
name: str # e.g. "Missing authorization check on DELETE /api/interviews/{id}"
description: str # Detailed explanation of the vulnerability
severity: str # critical, high, medium, low, info
confidence: int # 1-5 confidence score
file_path: str # Source file where the issue exists
line_range: tuple[int, int] # Start and end line numbers
code_snippet: str # Relevant code excerpt
category: str # OWASP category (e.g., "Broken Access Control")
cwe_id: str # CWE identifier
suggested_fix: str # Remediation code/guidance
testable: bool # Can this be confirmed via DAST?
test_hypothesis: str | None # How to confirm dynamically (for Phase 4)
related_files: list[str] # Other files involved in the vulnerability chainThe key insight from the blog: an agent reads code to understand business logic, then configures dynamic tools to confirm exploitability.
For each finding where testable=True, generate:
-
Dynamic Nuclei templates: Auto-generate
.yamltemplates targeting specific endpoints with specific payloads derived from source code understanding. For example, if the source analysis finds thatDELETE /api/interviews/:idlacks authorization middleware, generate a Nuclei template that authenticates as User A and attempts to delete User B's resource. -
ZAP active scan policies: Configure ZAP to focus active scanning on specific endpoints identified as high-risk from source analysis, rather than scanning everything equally.
-
Custom HTTP requests: For business logic tests that don't fit scanner templates (e.g., testing multi-step workflows), generate raw HTTP request sequences.
ZAP scanning continues as-is, with two enhancements:
- Priority targeting: If source analysis identified high-risk endpoints, ZAP's active scan focuses there first (configure scan policy per-URL).
- Context from source: If source analysis identifies hidden endpoints not discoverable by spidering (e.g., admin routes, internal APIs), add them to ZAP's scan tree explicitly.
Nuclei runs as-is with community + custom templates, plus:
- Auto-generated templates: Templates created in Phase 2c are written to a temporary directory and included in the Nuclei run.
- Hypothesis confirmation: Results from auto-generated templates are tagged with the source finding they're confirming, enabling correlation in the report.
Current dedup uses (name, url) key matching. Enhance with:
-
Cross-scanner correlation: When a source finding (SAST) and a runtime finding (DAST) describe the same vulnerability, merge them into a single correlated finding with evidence from both. A correlated finding (confirmed by code review AND runtime test) gets elevated confidence.
-
Severity adjustment: Source-only findings that couldn't be confirmed dynamically get a note: "Identified in source code; dynamic confirmation pending." Dynamically confirmed findings get a confidence boost.
-
Chain detection: If source analysis identifies that findings A (medium) + B (low) chain into a critical impact, create a synthetic "vulnerability chain" finding at the elevated severity.
Extend the report with new sections:
-
Source Analysis Summary
- Files analyzed, lines of code scanned
- Architecture map (endpoints discovered, auth model, data flows)
- Time taken, model used, token usage
-
Correlated Findings
- Findings confirmed by both SAST and DAST get a "Confirmed" badge
- Show source code snippet alongside runtime evidence
- Higher visual priority in the report
-
Source-Only Findings
- Separate section for findings from code analysis that couldn't be dynamically tested
- Include code snippets, file paths, line numbers
- Suggested fix with diff-style remediation code
-
Vulnerability Chains
- Narrative explanation of how multiple findings combine
- Step-by-step exploitation path
- Combined severity assessment
-
Coverage Matrix
- Table showing which OWASP Top 10 categories were tested by which scanner
- Highlights gaps in coverage
# New section in config.yaml
source_analysis:
enabled: true
source_dir: "../InterviewPlatform" # Path to target source code
exclude_patterns: # Patterns to skip
- "node_modules/**"
- ".next/**"
- "*.test.*"
- "__pycache__/**"
- ".venv/**"
priority_patterns: # Files to analyze first
- "**/routes/**"
- "**/middleware/**"
- "**/auth/**"
- "**/api/**"
- "**/models/**"
- "**/*.env*"
max_file_size_kb: 100 # Skip files larger than this
model: "claude-opus-4-6" # Claude model for analysis
anthropic_api_key: "" # Or use ANTHROPIC_API_KEY env var
generate_nuclei_templates: true # Auto-generate templates from findings
self_verify: true # Run adversarial verification pass
max_files: 200 # Cap on total files analyzedCLI additions:
--source-dir, -s Path to source code directory (overrides config)
--no-source Skip source analysis phase
| File | Purpose |
|---|---|
src/source_analyzer.py |
Core source analysis module — ingests code, calls Claude API, returns findings |
src/hypothesis_generator.py |
Converts source findings into dynamic Nuclei templates and ZAP directives |
src/correlator.py |
Cross-scanner finding correlation and chain detection |
templates/prompts/architecture_map.txt |
Prompt template for initial architecture mapping pass |
templates/prompts/security_audit.txt |
Prompt template for deep-dive security analysis |
templates/prompts/self_verify.txt |
Prompt template for adversarial self-verification |
templates/prompts/hypothesis.txt |
Prompt template for generating test hypotheses |
| File | Changes |
|---|---|
pentest.py |
Add --source-dir and --no-source CLI options |
src/orchestrator.py |
Insert Phase 2 (source analysis) and Phase 2c (hypothesis gen); enhance Phase 4 and 5 |
src/reporter.py |
Add correlated findings, source-only findings, vulnerability chains, coverage matrix sections |
src/nuclei_scanner.py |
Accept additional template directory from hypothesis generator |
src/zap_scanner.py |
Accept priority endpoint list from source analysis for targeted scanning |
config.example.yaml |
Add source_analysis section |
requirements.txt |
Add anthropic>=0.52.0 |
.github/workflows/pentest.yml |
Add ANTHROPIC_API_KEY secret, source checkout step |
- Build the file walker with gitignore/exclude support
- Define
SourceFindingdataclass - Add config schema for
source_analysissection - Add CLI flags
- Implement architecture mapping pass
- Implement targeted deep-dive passes (chunked by security domain)
- Implement self-verification pass
- Handle token limits, rate limits, and cost tracking
- Convert
SourceFindingobjects withtestable=Trueinto Nuclei YAML templates - Generate ZAP priority endpoint lists
- Write templates to temp directory for Nuclei consumption
- Wire Phase 2 into the pipeline between auth and ZAP
- Pass hypotheses into Nuclei and ZAP phases
- Update progress display
- Implement cross-scanner matching (source finding ↔ runtime finding)
- Implement chain detection
- Implement severity adjustment logic
- Add new report sections (correlated findings, source-only, chains, coverage matrix)
- Source code snippets with syntax highlighting
- Confirmed/unconfirmed badges
- Update GitHub Actions workflow
- Update config example
- Update README
Claude API usage for source analysis will incur costs. Mitigations:
- File prioritization: Analyze security-critical files first, skip test files and generated code
- Caching: Hash file contents; skip re-analysis on unchanged files between runs
- Model selection: Use
claude-sonnet-4-6for initial architecture mapping (cheaper, faster),claude-opus-4-6for deep-dive security analysis (more thorough) - Token tracking: Log input/output token counts per phase; display cost estimate in report
- Budget cap: Optional
max_api_costconfig to abort if estimated cost exceeds threshold
The combination of reasoning-based SAST + traditional DAST delivers what neither can alone:
| Capability | ZAP/Nuclei Only | Claude SAST Only | Combined |
|---|---|---|---|
| Known CVE detection | Yes | No | Yes |
| OWASP Top 10 pattern matching | Yes | Yes | Yes |
| Business logic flaws | No | Yes | Yes + confirmed |
| Authorization bypasses | Limited | Yes | Yes + confirmed |
| Vulnerability chaining | No | Partial | Yes |
| Zero-day patterns | No | Yes | Yes + confirmed |
| False positive rate | High | Medium | Low (cross-validated) |
| Runtime confirmation | Yes | No | Yes |
| Source-level remediation | No | Yes | Yes |
| Code snippet in report | No | Yes | Yes |
This moves coverage from ~60-70% of a commercial pen test to something meaningfully higher, with the remaining gap being truly creative lateral thinking, physical security, and social engineering — things that require a human in the loop.
This section captures known gaps in the plan that need resolution before implementation.
The plan says "chunked by security domain" but has no concrete strategy for handling a real codebase. InterviewPlatform is a full Next.js + Python backend — even 200 files at 100KB each far exceeds any single context window. The "architecture mapping pass" alone could blow limits.
Needs: A concrete chunking/batching strategy — file batching with summarization chains, a RAG-like retrieval approach, or progressive summarization. This is a hard problem that deserves its own design section.
Nuclei YAML templates have strict schema requirements (matchers, extractors, conditions, specific syntax). Claude generating valid templates from prose hypotheses is ambitious. A malformed template silently fails or produces false results. There is no validation step in the current plan.
Needs: Template validation — at minimum a schema check or nuclei -validate dry-run before execution. Consider a template-generation library or constrained output format rather than free-form YAML generation.
Asking the same model to "disprove each finding" is asking it to argue with itself. In practice this mostly confirms the original analysis with slightly different phrasing. Real verification means testing the hypothesis — which is what Phase 4 already does.
Needs: Either drop this pass (save tokens, Phase 4 provides real verification) or redesign it with a genuinely different approach (different model, structured counterargument prompts that force specific failure modes, or a checklist-based review rather than open-ended adversarial prompting).
Opus across 200 files over 3 passes (architecture + deep-dive + self-verify) could easily cost $50-100+ per run. The plan lists max_api_cost as optional. A misconfigured run shouldn't silently rack up a large bill.
Needs: Mandatory budget cap with a sensible default (e.g., $20). Token counting should happen before each API call with an estimate, not just after. Abort with a clear message when approaching the cap.
The diagram shows Phase 2 (Source Analysis) and Phase 3 (ZAP) as parallel branches. But the text says Phase 2's output feeds into Phase 3 (ZAP priority targeting) and Phase 4 (Nuclei auto-generated templates). These can't both be true.
Needs: Decide the actual execution model. Best option: run ZAP spidering + passive scan in parallel with source analysis, then feed source hypotheses into ZAP active scanning and Nuclei (which come later). Update the diagram to match.
Cross-scanner matching is mentioned with zero detail on the algorithm. How do you match a source finding like "Missing auth check on DELETE /api/interviews/:id" to a ZAP alert like "Broken Access Control at https://interviews-api.inmydata.ai/api/interviews/123"?
Needs: Design the matching algorithm. Options include: endpoint URL normalization + fuzzy matching, CWE-based grouping, or explicit linking via the test_hypothesis field (source finding generates a Nuclei template, template result links back by ID). The explicit linking approach is most reliable.
The existing tool works without source analysis. If the Claude API is down, rate-limited, or returns unparseable output, the whole pipeline shouldn't fail.
Needs: Explicit design for Phase 2 failures being non-fatal. Fall back to ZAP + Nuclei only, log a warning, and note in the report that source analysis was unavailable. Same for partial failures (e.g., analysis worked but template generation failed).
Four prompt files are listed in "New Files" with zero specification. These prompts determine whether findings are useful or garbage — they're the most critical component.
Needs: At minimum, outline-level specifications for each prompt: what context is provided, what structure is expected in the response, what examples are included, and what guardrails prevent hallucinated findings. Consider using structured output (JSON mode) for parseable results.
How do you know if findings are accurate? What's the false positive rate? There's no benchmark, test suite, or review process.
Needs: (a) A known-vulnerable test application to validate against (e.g., OWASP Juice Shop or a purpose-built test harness). (b) Precision/recall tracking across runs. (c) A manual review process for at least the first several runs before trusting automated output.
Source analysis on a large codebase could add 10-30 minutes. ZAP spidering is also slow. These could run concurrently — ZAP starts immediately while Claude analyzes source code. Hypotheses would only need to be ready before active scanning and Nuclei, which come later in the pipeline.
Needs: Design a parallel execution model. Source analysis and ZAP spidering/passive scanning have no dependencies on each other. Only the active scan phase and Nuclei need source analysis results.
Nothing prevents Claude from generating a Nuclei template that performs destructive actions (e.g., DELETE requests against production data, mass POST requests, or payloads that modify state).
Needs: A whitelist of allowed HTTP methods in generated templates (GET-only by default), a review/approval gate before executing auto-generated templates, or at minimum a --confirm-generated-templates flag.
"Hash file contents; skip re-analysis on unchanged files" is mentioned under cost but has no design. Where are hashes stored? What's the cache key — file hash alone, or file hash + prompt version? How do you invalidate when prompts change (same code, different analysis)?
Needs: Specify the caching layer. Cache key should include file content hash + prompt template hash + model version. Storage could be a local JSON/SQLite file in the project directory. Define a --no-cache flag to force re-analysis.
Summary of changes to make before starting implementation:
- Add a chunking/batching design section for large codebase handling
- Add Nuclei template validation (dry-run or schema check) before execution
- Drop or redesign the self-verification pass — Phase 4 provides real verification
- Make budget cap mandatory with a sensible default
- Fix the architecture diagram to show actual execution order (parallel where possible)
- Spec out the correlation matching algorithm — prefer explicit ID-based linking
- Add graceful degradation for API failures (non-fatal Phase 2)
- Draft outline-level prompt specifications for each prompt template
- Add a testing/validation section with a benchmark strategy
- Design for parallel execution of ZAP spidering + source analysis
- Add safety checks on auto-generated Nuclei templates
- Specify the caching layer with proper invalidation