Plan: Source-Aware Penetration Testing with Claude Code Security

Context

This project currently orchestrates OWASP ZAP (DAST) and Nuclei (template scanning) behind AWS Cognito authentication, producing unified HTML/Markdown reports. As described in Friday Afternoon Pen Test, this delivers ~60-70% of a commercial pen test's coverage for known vulnerability patterns.

The gap lies in what the blog calls "creative business logic testing" — understanding how the application works to find authorization bypasses, data leakage through business logic, and vulnerability chains that no pattern-matcher catches. Claude Code Security fills exactly this gap: it reads and reasons about source code the way a human security researcher would, tracing data flows across files and identifying complex multi-component vulnerabilities.

This plan extends the project to combine source-code-level AI reasoning (SAST) with runtime dynamic testing (DAST) — mirroring how a human pen tester works: form hypotheses from code, then confirm exploitability at runtime.

Architecture Overview

                          ┌─────────────────────────┐
                          │      config.yaml        │
                          │  + source_dir path      │
                          │  + claude API key        │
                          └───────────┬─────────────┘
                                      │
                          ┌───────────▼─────────────┐
                          │     Orchestrator         │
                          │   (6-phase pipeline)     │
                          └───────────┬─────────────┘
                                      │
           ┌──────────────────────────┼──────────────────────────┐
           │                          │                          │
  ┌────────▼────────┐    ┌───────────▼──────────┐    ┌─────────▼─────────┐
  │  Phase 1: Auth  │    │  Phase 2: Source      │    │  Phase 3: ZAP     │
  │  (Cognito)      │    │  Analysis (Claude)    │    │  (DAST)           │
  └────────┬────────┘    └───────────┬──────────┘    └─────────┬─────────┘
           │                         │                          │
           │              ┌──────────▼──────────┐              │
           │              │ Hypotheses:          │              │
           │              │ - Auth bypasses      │              │
           │              │ - Injection points   │              │
           │              │ - Logic flaws        │              │
           │              │ - Data flow issues   │              │
           │              └──────────┬──────────┘              │
           │                         │                          │
           │              ┌──────────▼──────────┐    ┌─────────▼─────────┐
           │              │  Phase 4: Nuclei     │    │  Phase 5: Dedup   │
           │              │  (+ dynamic templates│    │  + Correlation     │
           │              │   from hypotheses)   │    │                   │
           │              └──────────┬──────────┘    └─────────┬─────────┘
           │                         │                          │
           │                         └──────────┬───────────────┘
           │                                    │
           │                         ┌──────────▼──────────┐
           │                         │  Phase 6: Report     │
           │                         │  (unified + source   │
           │                         │   context)           │
           │                         └─────────────────────┘

Phase Breakdown

Phase 1: Authentication (existing — no change)

Cognito auth continues as-is. Tokens are shared with all scanners.

Phase 2: Source Code Analysis (NEW)

New module: src/source_analyzer.py

This is the core addition. It uses the Claude API (via the Anthropic SDK) to perform reasoning-based static analysis on the target application's source code.

2a. Source Directory Ingestion

Accept a source_dir path in config (or --source-dir CLI flag)
Walk the directory, collecting files by type (.py, .ts, .js, .tsx, .jsx, .yaml, .env*, etc.)
Respect .gitignore patterns and a configurable exclude_patterns list
Build a file manifest with metadata (path, size, language, last modified)
For large codebases, prioritize security-critical files:
- Route definitions / API handlers
- Authentication / authorization middleware
- Database query builders / ORM models
- Input validation / sanitization
- Configuration files
- Environment variable usage

2b. Claude API Security Analysis

Use the Anthropic Python SDK (anthropic) to send source code to Claude with a structured security audit prompt. This is not the /security-review slash command (which is interactive) — it's direct API calls for programmatic integration.

Approach:

Architecture mapping pass: Send route definitions, middleware chains, and config files. Ask Claude to map the application's attack surface: endpoints, auth requirements, data flows, trust boundaries.
Targeted deep-dive passes: For each security-critical area identified, send the relevant files and ask Claude to analyze for:
- Authentication bypasses (missing middleware, JWT validation gaps, role confusion)
- Authorization flaws (IDOR, privilege escalation, broken object-level access)
- Injection points (SQL, NoSQL, command, template injection)
- Business logic flaws (race conditions, state manipulation, workflow bypasses)
- Data exposure (sensitive data in logs, error messages, API responses)
- Cryptographic weaknesses (weak algorithms, hardcoded keys, insecure random)
- SSRF / open redirect patterns
Self-verification pass: Re-submit findings back to Claude with an adversarial prompt: "Attempt to disprove each finding. Rate confidence 1-5. Remove false positives." This mirrors Claude Code Security's multi-stage verification.

Output: A list of SourceFinding objects:

@dataclass
class SourceFinding:
    source: str = "claude-sast"
    name: str                      # e.g. "Missing authorization check on DELETE /api/interviews/{id}"
    description: str               # Detailed explanation of the vulnerability
    severity: str                  # critical, high, medium, low, info
    confidence: int                # 1-5 confidence score
    file_path: str                 # Source file where the issue exists
    line_range: tuple[int, int]    # Start and end line numbers
    code_snippet: str              # Relevant code excerpt
    category: str                  # OWASP category (e.g., "Broken Access Control")
    cwe_id: str                    # CWE identifier
    suggested_fix: str             # Remediation code/guidance
    testable: bool                 # Can this be confirmed via DAST?
    test_hypothesis: str | None    # How to confirm dynamically (for Phase 4)
    related_files: list[str]       # Other files involved in the vulnerability chain

2c. Hypothesis Generation for DAST Confirmation

The key insight from the blog: an agent reads code to understand business logic, then configures dynamic tools to confirm exploitability.

For each finding where testable=True, generate:

Dynamic Nuclei templates: Auto-generate .yaml templates targeting specific endpoints with specific payloads derived from source code understanding. For example, if the source analysis finds that DELETE /api/interviews/:id lacks authorization middleware, generate a Nuclei template that authenticates as User A and attempts to delete User B's resource.
ZAP active scan policies: Configure ZAP to focus active scanning on specific endpoints identified as high-risk from source analysis, rather than scanning everything equally.
Custom HTTP requests: For business logic tests that don't fit scanner templates (e.g., testing multi-step workflows), generate raw HTTP request sequences.

Phase 3: OWASP ZAP (existing — enhanced)

ZAP scanning continues as-is, with two enhancements:

Priority targeting: If source analysis identified high-risk endpoints, ZAP's active scan focuses there first (configure scan policy per-URL).
Context from source: If source analysis identifies hidden endpoints not discoverable by spidering (e.g., admin routes, internal APIs), add them to ZAP's scan tree explicitly.

Phase 4: Nuclei Scanning (existing — enhanced)

Nuclei runs as-is with community + custom templates, plus:

Auto-generated templates: Templates created in Phase 2c are written to a temporary directory and included in the Nuclei run.
Hypothesis confirmation: Results from auto-generated templates are tagged with the source finding they're confirming, enabling correlation in the report.

Phase 5: Deduplication + Correlation (enhanced)

Current dedup uses (name, url) key matching. Enhance with:

Cross-scanner correlation: When a source finding (SAST) and a runtime finding (DAST) describe the same vulnerability, merge them into a single correlated finding with evidence from both. A correlated finding (confirmed by code review AND runtime test) gets elevated confidence.
Severity adjustment: Source-only findings that couldn't be confirmed dynamically get a note: "Identified in source code; dynamic confirmation pending." Dynamically confirmed findings get a confidence boost.
Chain detection: If source analysis identifies that findings A (medium) + B (low) chain into a critical impact, create a synthetic "vulnerability chain" finding at the elevated severity.

Phase 6: Report Generation (enhanced)

Extend the report with new sections:

Source Analysis Summary
- Files analyzed, lines of code scanned
- Architecture map (endpoints discovered, auth model, data flows)
- Time taken, model used, token usage
Correlated Findings
- Findings confirmed by both SAST and DAST get a "Confirmed" badge
- Show source code snippet alongside runtime evidence
- Higher visual priority in the report
Source-Only Findings
- Separate section for findings from code analysis that couldn't be dynamically tested
- Include code snippets, file paths, line numbers
- Suggested fix with diff-style remediation code
Vulnerability Chains
- Narrative explanation of how multiple findings combine
- Step-by-step exploitation path
- Combined severity assessment
Coverage Matrix
- Table showing which OWASP Top 10 categories were tested by which scanner
- Highlights gaps in coverage

Configuration Changes

# New section in config.yaml
source_analysis:
  enabled: true
  source_dir: "../InterviewPlatform"      # Path to target source code
  exclude_patterns:                        # Patterns to skip
    - "node_modules/**"
    - ".next/**"
    - "*.test.*"
    - "__pycache__/**"
    - ".venv/**"
  priority_patterns:                       # Files to analyze first
    - "**/routes/**"
    - "**/middleware/**"
    - "**/auth/**"
    - "**/api/**"
    - "**/models/**"
    - "**/*.env*"
  max_file_size_kb: 100                    # Skip files larger than this
  model: "claude-opus-4-6"                 # Claude model for analysis
  anthropic_api_key: ""                    # Or use ANTHROPIC_API_KEY env var
  generate_nuclei_templates: true          # Auto-generate templates from findings
  self_verify: true                        # Run adversarial verification pass
  max_files: 200                           # Cap on total files analyzed

CLI additions:

--source-dir, -s    Path to source code directory (overrides config)
--no-source         Skip source analysis phase

New Files

File	Purpose
`src/source_analyzer.py`	Core source analysis module — ingests code, calls Claude API, returns findings
`src/hypothesis_generator.py`	Converts source findings into dynamic Nuclei templates and ZAP directives
`src/correlator.py`	Cross-scanner finding correlation and chain detection
`templates/prompts/architecture_map.txt`	Prompt template for initial architecture mapping pass
`templates/prompts/security_audit.txt`	Prompt template for deep-dive security analysis
`templates/prompts/self_verify.txt`	Prompt template for adversarial self-verification
`templates/prompts/hypothesis.txt`	Prompt template for generating test hypotheses

Modified Files

File	Changes
`pentest.py`	Add `--source-dir` and `--no-source` CLI options
`src/orchestrator.py`	Insert Phase 2 (source analysis) and Phase 2c (hypothesis gen); enhance Phase 4 and 5
`src/reporter.py`	Add correlated findings, source-only findings, vulnerability chains, coverage matrix sections
`src/nuclei_scanner.py`	Accept additional template directory from hypothesis generator
`src/zap_scanner.py`	Accept priority endpoint list from source analysis for targeted scanning
`config.example.yaml`	Add `source_analysis` section
`requirements.txt`	Add `anthropic>=0.52.0`
`.github/workflows/pentest.yml`	Add `ANTHROPIC_API_KEY` secret, source checkout step

Implementation Order

Step 1: Source Ingestion + `SourceFinding` Data Model

Build the file walker with gitignore/exclude support
Define SourceFinding dataclass
Add config schema for source_analysis section
Add CLI flags

Step 2: Claude API Integration

Implement architecture mapping pass
Implement targeted deep-dive passes (chunked by security domain)
Implement self-verification pass
Handle token limits, rate limits, and cost tracking

Step 3: Hypothesis Generator

Convert SourceFinding objects with testable=True into Nuclei YAML templates
Generate ZAP priority endpoint lists
Write templates to temp directory for Nuclei consumption

Step 4: Orchestrator Integration

Wire Phase 2 into the pipeline between auth and ZAP
Pass hypotheses into Nuclei and ZAP phases
Update progress display

Step 5: Correlator

Implement cross-scanner matching (source finding ↔ runtime finding)
Implement chain detection
Implement severity adjustment logic

Step 6: Report Enhancement

Add new report sections (correlated findings, source-only, chains, coverage matrix)
Source code snippets with syntax highlighting
Confirmed/unconfirmed badges

Step 7: CI/CD + Config

Update GitHub Actions workflow
Update config example
Update README

Cost Considerations

Claude API usage for source analysis will incur costs. Mitigations:

File prioritization: Analyze security-critical files first, skip test files and generated code
Caching: Hash file contents; skip re-analysis on unchanged files between runs
Model selection: Use claude-sonnet-4-6 for initial architecture mapping (cheaper, faster), claude-opus-4-6 for deep-dive security analysis (more thorough)
Token tracking: Log input/output token counts per phase; display cost estimate in report
Budget cap: Optional max_api_cost config to abort if estimated cost exceeds threshold

What This Enables

The combination of reasoning-based SAST + traditional DAST delivers what neither can alone:

Capability	ZAP/Nuclei Only	Claude SAST Only	Combined
Known CVE detection	Yes	No	Yes
OWASP Top 10 pattern matching	Yes	Yes	Yes
Business logic flaws	No	Yes	Yes + confirmed
Authorization bypasses	Limited	Yes	Yes + confirmed
Vulnerability chaining	No	Partial	Yes
Zero-day patterns	No	Yes	Yes + confirmed
False positive rate	High	Medium	Low (cross-validated)
Runtime confirmation	Yes	No	Yes
Source-level remediation	No	Yes	Yes
Code snippet in report	No	Yes	Yes

This moves coverage from ~60-70% of a commercial pen test to something meaningfully higher, with the remaining gap being truly creative lateral thinking, physical security, and social engineering — things that require a human in the loop.

Open Problems & Shortcomings

This section captures known gaps in the plan that need resolution before implementation.

1. Context Window vs. Codebase Size

The plan says "chunked by security domain" but has no concrete strategy for handling a real codebase. InterviewPlatform is a full Next.js + Python backend — even 200 files at 100KB each far exceeds any single context window. The "architecture mapping pass" alone could blow limits.

Needs: A concrete chunking/batching strategy — file batching with summarization chains, a RAG-like retrieval approach, or progressive summarization. This is a hard problem that deserves its own design section.

2. Auto-Generated Nuclei Templates Are Fragile

Nuclei YAML templates have strict schema requirements (matchers, extractors, conditions, specific syntax). Claude generating valid templates from prose hypotheses is ambitious. A malformed template silently fails or produces false results. There is no validation step in the current plan.

Needs: Template validation — at minimum a schema check or nuclei -validate dry-run before execution. Consider a template-generation library or constrained output format rather than free-form YAML generation.

3. Self-Verification Pass Is Questionable

Asking the same model to "disprove each finding" is asking it to argue with itself. In practice this mostly confirms the original analysis with slightly different phrasing. Real verification means testing the hypothesis — which is what Phase 4 already does.

Needs: Either drop this pass (save tokens, Phase 4 provides real verification) or redesign it with a genuinely different approach (different model, structured counterargument prompts that force specific failure modes, or a checklist-based review rather than open-ended adversarial prompting).

4. Cost — Budget Cap Must Be Mandatory

Opus across 200 files over 3 passes (architecture + deep-dive + self-verify) could easily cost $50-100+ per run. The plan lists max_api_cost as optional. A misconfigured run shouldn't silently rack up a large bill.

Needs: Mandatory budget cap with a sensible default (e.g., $20). Token counting should happen before each API call with an estimate, not just after. Abort with a clear message when approaching the cap.

5. Architecture Diagram Contradicts the Text

The diagram shows Phase 2 (Source Analysis) and Phase 3 (ZAP) as parallel branches. But the text says Phase 2's output feeds into Phase 3 (ZAP priority targeting) and Phase 4 (Nuclei auto-generated templates). These can't both be true.

Needs: Decide the actual execution model. Best option: run ZAP spidering + passive scan in parallel with source analysis, then feed source hypotheses into ZAP active scanning and Nuclei (which come later). Update the diagram to match.

6. Correlation Logic Is Underspecified

Cross-scanner matching is mentioned with zero detail on the algorithm. How do you match a source finding like "Missing auth check on DELETE /api/interviews/:id" to a ZAP alert like "Broken Access Control at https://interviews-api.inmydata.ai/api/interviews/123"?

Needs: Design the matching algorithm. Options include: endpoint URL normalization + fuzzy matching, CWE-based grouping, or explicit linking via the test_hypothesis field (source finding generates a Nuclei template, template result links back by ID). The explicit linking approach is most reliable.

7. No Graceful Degradation

The existing tool works without source analysis. If the Claude API is down, rate-limited, or returns unparseable output, the whole pipeline shouldn't fail.

Needs: Explicit design for Phase 2 failures being non-fatal. Fall back to ZAP + Nuclei only, log a warning, and note in the report that source analysis was unavailable. Same for partial failures (e.g., analysis worked but template generation failed).

8. Prompt Templates Are a Black Box

Four prompt files are listed in "New Files" with zero specification. These prompts determine whether findings are useful or garbage — they're the most critical component.

Needs: At minimum, outline-level specifications for each prompt: what context is provided, what structure is expected in the response, what examples are included, and what guardrails prevent hallucinated findings. Consider using structured output (JSON mode) for parseable results.

9. No Testing or Validation Strategy

How do you know if findings are accurate? What's the false positive rate? There's no benchmark, test suite, or review process.

Needs: (a) A known-vulnerable test application to validate against (e.g., OWASP Juice Shop or a purpose-built test harness). (b) Precision/recall tracking across runs. (c) A manual review process for at least the first several runs before trusting automated output.

10. Parallel Execution Would Save Significant Time

Source analysis on a large codebase could add 10-30 minutes. ZAP spidering is also slow. These could run concurrently — ZAP starts immediately while Claude analyzes source code. Hypotheses would only need to be ready before active scanning and Nuclei, which come later in the pipeline.

Needs: Design a parallel execution model. Source analysis and ZAP spidering/passive scanning have no dependencies on each other. Only the active scan phase and Nuclei need source analysis results.

11. No Safety Checks on Auto-Generated Templates

Nothing prevents Claude from generating a Nuclei template that performs destructive actions (e.g., DELETE requests against production data, mass POST requests, or payloads that modify state).

Needs: A whitelist of allowed HTTP methods in generated templates (GET-only by default), a review/approval gate before executing auto-generated templates, or at minimum a --confirm-generated-templates flag.

12. Caching Design Is Unspecified

"Hash file contents; skip re-analysis on unchanged files" is mentioned under cost but has no design. Where are hashes stored? What's the cache key — file hash alone, or file hash + prompt version? How do you invalidate when prompts change (same code, different analysis)?

Needs: Specify the caching layer. Cache key should include file content hash + prompt template hash + model version. Storage could be a local JSON/SQLite file in the project directory. Define a --no-cache flag to force re-analysis.

Recommended Implementation Adjustments

Summary of changes to make before starting implementation:

Add a chunking/batching design section for large codebase handling
Add Nuclei template validation (dry-run or schema check) before execution
Drop or redesign the self-verification pass — Phase 4 provides real verification
Make budget cap mandatory with a sensible default
Fix the architecture diagram to show actual execution order (parallel where possible)
Spec out the correlation matching algorithm — prefer explicit ID-based linking
Add graceful degradation for API failures (non-fatal Phase 2)
Draft outline-level prompt specifications for each prompt template
Add a testing/validation section with a benchmark strategy
Design for parallel execution of ZAP spidering + source analysis
Add safety checks on auto-generated Nuclei templates
Specify the caching layer with proper invalidation

FilesExpand file tree

PLAN.md

Latest commit

History