Skip to content

Latest commit

 

History

History
550 lines (432 loc) · 23.4 KB

File metadata and controls

550 lines (432 loc) · 23.4 KB

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Repository Overview

HolmesGPT is an AI-powered troubleshooting agent that connects to observability platforms (Kubernetes, Prometheus, Grafana, etc.) to automatically diagnose and analyze infrastructure and application issues. It uses an agentic loop to investigate problems by calling tools to gather data from multiple sources.

Development Commands

Environment Setup

# Install dependencies with Poetry
poetry install

Testing

# Install test dependencies with Poetry
poetry install --with dev
# Run all non-LLM tests (unit and integration tests)
make test-without-llm
poetry run pytest tests -m "not llm"

# Run LLM evaluation tests (requires API keys)
make test-llm-ask-holmes          # Test single-question interactions
make test-llm-investigate         # Test AlertManager investigations
poetry run pytest tests/llm/ -n 6 -vv  # Run all LLM tests in parallel

# Run pre-commit checks (includes ruff, mypy, poetry validation)
# NOTE: Only run these when the user explicitly asks. They run in CI automatically.
make check
poetry run pre-commit run -a

Code Quality (only run when explicitly asked)

# Format code with ruff
poetry run ruff format

# Check code with ruff (auto-fix issues)
poetry run ruff check --fix

# Type checking with mypy
poetry run mypy

Architecture Overview

Core Components

CLI Entry Point (holmes/main.py):

  • Typer-based CLI with subcommands for ask, investigate, toolset
  • Handles configuration loading, logging setup, and command routing

** Interactive mode for CLI** (holmes/interactive.py):

  • Handles interactive mode for ask subcommand
  • Implements slash commands

Configuration System (holmes/config.py):

  • Loads settings from ~/.holmes/config.yaml or via CLI options
  • Manages API keys, model selection, and toolset configurations
  • Factory methods for creating sources (AlertManager, Jira, PagerDuty, etc.)

Core Investigation Engine (holmes/core/):

  • tool_calling_llm.py: Main LLM interaction with tool calling capabilities
  • investigation.py: Orchestrates multi-step investigations with runbooks
  • toolset_manager.py: Manages available tools and their configurations
  • tools.py: Tool definitions and execution logic

Plugin System (holmes/plugins/):

  • Sources: AlertManager, Jira, PagerDuty, OpsGenie integrations
  • Toolsets: Kubernetes, Prometheus, Grafana, AWS, Docker, etc.
  • Prompts: Jinja2 templates for different investigation scenarios
  • Destinations: Slack integration for sending results

Key Patterns

Toolset Architecture:

  • Each toolset is a YAML file defining available tools and their parameters
  • Tools can be Python functions or bash commands with safety validation
  • Toolsets are loaded dynamically and can be customized via config files
  • Important: All toolsets MUST return detailed error messages from underlying APIs to enable LLM self-correction
    • Include the exact query/command that was executed
    • Include time ranges, parameters, and filters used
    • Include the full API error response (status code and message)
    • For "no data" responses, specify what was searched and where

Thin API Wrapper Pattern for Python Toolsets:

  • Reference implementation: servicenow_tables/servicenow_tables.py
  • Use requests library for HTTP calls (not specialized client libraries like opensearchpy)
  • Simple config class with Pydantic validation
  • Health check in prerequisites_callable() method
  • Each tool is a thin wrapper around a single API endpoint

Server-Side Filtering is Critical:

  • Never return unbounded data from APIs - this causes token overflow
  • Always include filter parameters on tools that query collections (e.g., index parameter for Elasticsearch _cat APIs)
  • Example problem: opensearch_list_shards returned ALL shards → 25K+ tokens on large clusters
  • Example fix: elasticsearch_cat tool requires index parameter for shards/segments endpoints
  • When server-side filtering is not possible, use JsonFilterMixin (see json_filter_mixin.py) to add max_depth and jq parameters for client-side filtering

Toolset Config Backwards Compatibility: When renaming config fields in a toolset, maintain backwards compatibility using Pydantic's extra="allow":

# ✅ DO: Use extra="allow" to accept deprecated fields without polluting schema
class MyToolsetConfig(BaseModel):
    model_config = ConfigDict(extra="allow")

    # Only define current field names in schema
    new_field_name: int = 10

    @model_validator(mode="after")
    def handle_deprecated_fields(self):
        extra = self.model_extra or {}
        deprecated = []

        # Map old names to new names
        if "old_field_name" in extra:
            self.new_field_name = extra["old_field_name"]
            deprecated.append("old_field_name -> new_field_name")

        if deprecated:
            logging.warning(f"Deprecated config names: {', '.join(deprecated)}")
        return self

# ❌ DON'T: Define deprecated fields in schema with Optional[None]
class BadConfig(BaseModel):
    new_field_name: int = 10
    old_field_name: Optional[int] = None  # Pollutes schema, shows in model_dump()

Benefits of extra="allow" approach:

  • Schema only shows current field names
  • model_dump() returns clean output without deprecated fields
  • Old configs still work (backwards compatible)
  • Deprecation warnings guide users to update

See prometheus/prometheus.py PrometheusConfig for a complete example.

LLM Integration:

  • Uses LiteLLM for multi-provider support (OpenAI, Anthropic, Azure, etc.)
  • Structured tool calling with automatic retry and error handling
  • Context-aware prompting with system instructions and examples

Investigation Flow:

  1. Load user question/alert
  2. Select relevant toolsets based on context
  3. Execute LLM with available tools
  4. LLM calls tools to gather data
  5. LLM analyzes results and provides conclusions
  6. Optionally write results back to source system

Testing Framework

Three-tier testing approach:

  1. Unit Tests (tests/): Standard pytest tests for individual components
  2. Integration Tests: Test toolset integrations
  3. LLM Evaluation Tests (tests/llm/): End-to-end tests using fixtures

Running regular (non-LLM) tests:

poetry run pytest tests -m "not llm"
make test-without-llm

Running LLM eval tests:

# Run specific eval - IMPORTANT: Use -k flag, NOT full test path with brackets
poetry run pytest -k "09_crashpod" --no-cov

# Run all evals in parallel
poetry run pytest tests/llm/ -n 6 --no-cov

# Regression evals
poetry run pytest -m 'llm and easy' --no-cov

For the complete eval CLI reference (flags, env vars, model comparison, debugging), see the /create-eval skill which contains full documentation in its reference files.

Configuration

Config File Location: ~/.holmes/config.yaml

Key Configuration Sections:

  • model: LLM model to use (default: gpt-4.1)
  • api_key: LLM API key (or use environment variables)
  • custom_toolsets: Override or add toolsets
  • custom_runbooks: Add investigation runbooks
  • Platform-specific settings (alertmanager_url, jira_url, etc.)

Environment Variables:

  • OPENAI_API_KEY, ANTHROPIC_API_KEY: LLM API keys
  • OPENROUTER_API_KEY: Alternative LLM provider via OpenRouter (domain: api.openrouter.ai)
  • MODEL: Override default model(s) - supports comma-separated list
  • RUN_LIVE: Enable live execution of tools in tests (default: true)
  • BRAINTRUST_API_KEY: For test result tracking and CI/CD report generation
  • BRAINTRUST_ORG: Braintrust organization name (default: "robustadev")
  • ELASTICSEARCH_URL, ELASTICSEARCH_API_KEY: For Elasticsearch/OpenSearch cloud testing

Development Guidelines

Code Quality:

  • Use Ruff for formatting and linting (configured in pyproject.toml)
  • Type hints required (mypy configuration in pyproject.toml)
  • Pre-commit hooks enforce quality checks in CI
  • ALWAYS place Python imports at the top of the file, not inside functions or methods
  • NEVER run pre-commit, ruff, or mypy unless the user explicitly asks you to. These tools are triggered by commit hooks which are not installed on all machines, and running them causes widespread formatting/type changes to files unrelated to your task. Only lint/format files you are actively editing, and only if asked.

Documentation Examples:

  • ALWAYS use Anthropic Claude models in code examples and documentation:
    • Recommended: anthropic/claude-sonnet-4-5-20250929 or anthropic/claude-opus-4-5-20251101
    • Use the latest Claude 4.5 family models (Sonnet or Opus)
  • Avoid using deprecated or older model versions like claude-3.5-sonnet, gpt-4-vision-preview
  • Do NOT use GPT-4o or Gemini models in documentation examples

Testing Requirements:

  • All new features require unit tests
  • New toolsets require integration tests
  • Complex investigations should have LLM evaluation tests
  • Maintain 40% minimum test coverage
  • Live execution is now enabled by default to ensure tests match real-world behavior

Pull Request Process:

  • PRs require maintainer approval
  • Pre-commit hooks are checked in CI (do NOT run them locally unless asked)
  • LLM evaluation tests run automatically in CI
  • Keep PRs focused and include tests
  • ALWAYS use git commit -s to sign off commits (required for DCO)
  • When committing, use git commit -s --no-verify to skip local pre-commit hooks (they are not installed consistently and will cause unrelated changes)

Git Workflow Guidelines:

  • ALWAYS create commits, NEVER amend
  • ALWAYS merge, NEVER rebase
  • ALWAYS push, NEVER force push
  • Maintain a history of your work to allow the user to revert back to a previous iteration

File Structure Conventions:

  • Toolsets: holmes/plugins/toolsets/{name}.yaml or {name}/
  • Prompts: holmes/plugins/prompts/{name}.jinja2
  • Tests: Match source structure under tests/

Security Notes

  • All tools have read-only access by design
  • Bash toolset validates commands for safety
  • No secrets should be committed to repository
  • Use environment variables or config files for API keys
  • RBAC permissions are respected for Kubernetes access

Eval Tests (LLM Evaluations)

For creating, running, and debugging LLM eval tests, use the /create-eval skill. It contains the complete workflow, test_case.yaml field reference, anti-hallucination patterns, infrastructure setup guides, and CLI reference.

Test Structure:

  • Use sequential test numbers: check existing tests for next available number
  • Required files: test_case.yaml, infrastructure manifests, toolsets.yaml (if needed)
  • Use dedicated namespace per test: app-<testid> (e.g., app-177)
  • All resource names must be unique across tests to prevent conflicts

Tags:

  • CRITICAL: Only use valid tags from pyproject.toml - invalid tags cause test collection failures
  • Check existing tags before adding new ones, ask user permission for new tags

Cloud Service Evals (No Kubernetes Required):

  • Evals can test against cloud services (Elasticsearch, external APIs) directly via environment variables
  • Faster setup (<30 seconds vs minutes for K8s infrastructure)
  • before_test creates test data in the cloud service, after_test cleans up
  • Use toolsets.yaml to configure the toolset with env var references: api_url: "{{ env.ELASTICSEARCH_URL }}"
  • CI/CD secrets: When adding evals for a new integration, you must add the required environment variables to .github/workflows/eval-regression.yaml in the "Run tests" step. Tell the user which secrets they need to add to their GitHub repository settings (e.g., ELASTICSEARCH_URL, ELASTICSEARCH_API_KEY).
  • HTTP request passthrough: The root conftest.py has a responses fixture with autouse=True that mocks ALL HTTP requests by default. When adding a new cloud integration, you MUST add the service's URL pattern to the passthrough list in conftest.py (search for rsps.add_passthru). Use re.compile() for pattern matching (e.g., rsps.add_passthru(re.compile(r"https://.*\.cloud\.es\.io"))).

User Prompts & Expected Outputs:

  • Be specific: Test exact values like "The dashboard title is 'Home'" not generic "Holmes retrieves dashboard"
  • Match prompt to test: User prompt must explicitly request what you're testing
    • BAD: "Get the dashboard"
    • GOOD: "Get the dashboard and tell me the title, panels, and time range"
  • Anti-cheat prompts: Don't use technical terms that give away solutions
    • BAD: "Find node_exporter metrics"
    • GOOD: "Find CPU pressure monitoring queries"
  • Test discovery, not recognition: Holmes should search/analyze, not guess from context
  • Ruling out hallucinations is paramount: When choosing between test approaches, prefer the one that rules out hallucinations:
    • Best: Check specific values that can only be discovered by querying (e.g., unique IDs, injected error codes, exact counts)
    • Acceptable: Use include_tool_calls: true to verify the tool was called when output values are too generic to rule out hallucinations
    • Bad: Check generic output patterns that an LLM could plausibly guess (e.g., "cluster status is green/yellow/red", "has N nodes")
  • expected_output is invisible to LLM: The expected_output field is only used by the evaluator - the LLM never sees it. This means:
    • You can safely put secrets/verification codes in expected_output that the LLM must discover
    • before_test can inject a unique verification code into test data, and expected_output can check for it
    • This is a powerful pattern for cloud service tests: create data with a unique code in before_test, ask LLM to find it, verify with expected_output
    # Example: before_test creates a page with verification code "HOLMES-EVAL-7x9k2m4p"
    # The LLM must discover this code by querying the service
    expected_output:
      - "Must report the verification code: HOLMES-EVAL-7x9k2m4p"
  • include_tool_calls: true: Use when expected output is too generic to be hallucination-proof. Prefer specific answer checking when possible, but verifying tool calls is better than a test that can't rule out hallucinations.
    # Use when values are generic (cluster health could be guessed)
    include_tool_calls: true
    expected_output:
      - "Must call elasticsearch_cluster_health tool"
      - "Must report cluster status"

Infrastructure Setup:

  • Don't just test pod readiness - verify actual service functionality
  • Poll real API endpoints and check for expected content (e.g., "title":"Home", "type":"welcome")
  • CRITICAL: Use exit 1 when setup verification fails to fail the test early
  • Never use :latest container tags - use specific versions like grafana/grafana:12.3.1

Running and Testing Evals

🚨 CRITICAL: Always Test Your Changes

NEVER submit test changes without verification:

Required Testing Workflow:

  1. Setup Phase: poetry run pytest -k "test_name" --only-setup --no-cov
  2. Full Test: poetry run pytest -k "test_name" --no-cov
  3. Verify Results: Ensure 100% pass rate and expected behavior

When to Test:

  • ✅ After creating new tests
  • ✅ After modifying existing tests
  • ✅ After refactoring shared infrastructure
  • ✅ After performance optimizations
  • ✅ After adding/changing tags

Red Flags - Never Skip Testing:

  • ❌ "The changes look good" without running
  • ❌ "It's just a small change"
  • ❌ "I'll test it later"

Testing is Part of Development: Testing is not optional - it's an integral part of the development process. Untested code is broken code.

Testing Methodology:

  • Phase 1: Test setup with --only-setup flag first
  • Phase 2: Run full test after confirming setup works
  • Use background execution for long tests: nohup ... > logfile.log 2>&1 &
  • Handle port conflicts: clean up previous test port forwards before running

Common Flags:

  • --skip-cleanup: Keep resources after test (useful for debugging setup)
  • --skip-setup: Skip before_test commands (useful for iterative testing)

Shared Infrastructure Pattern

When to use shared infrastructure:

  • Multiple tests use the same service (Grafana, Loki, Prometheus)
  • Service configuration is standardized across tests

Implementation:

# Create shared manifest in tests/llm/fixtures/shared/servicename.yaml
# Use in tests:
kubectl apply -f ../../shared/servicename.yaml -n app-<testid>

Benefits:

  • Single place for version updates
  • Consistent configuration across tests
  • Reduced maintenance overhead
  • Follows established pattern (Loki, Prometheus, Grafana)

Setup Verification Best Practices

Prefer kubectl exec over port forwarding for setup verification:

# GOOD - kubectl exec pattern (no port conflicts)
kubectl exec -n namespace deployment/service -- wget -q -O- http://localhost:port/health

# AVOID - port forward for setup verification (causes conflicts)
kubectl port-forward svc/service port:port &
curl localhost:port/health
kill $PORTFWD_PID

Performance optimization guidelines:

  • Use sleep 1 instead of sleep 5 for most retry loops
  • Remove sleeps after straightforward operations (port forward start)
  • Reduce timeout values: 60s for pod readiness, 30s for API verification
  • Question every sleep - many are unnecessary

Race Condition Handling: Never use bare kubectl wait immediately after resource creation. Use retry loops:

# WRONG - fails if pod not scheduled yet
kubectl apply -f deployment.yaml
kubectl wait --for=condition=ready pod -l app=myapp --timeout=300s

# CORRECT - retry loop handles race condition
kubectl apply -f deployment.yaml
POD_READY=false
for i in {1..60}; do
  if kubectl wait --for=condition=ready pod -l app=myapp --timeout=5s 2>/dev/null; then
    echo "✅ Pod is ready!"
    POD_READY=true
    break
  fi
  sleep 1
done
if [ "$POD_READY" = false ]; then
  echo "❌ Pod failed to become ready after 60 seconds"
  kubectl logs -l app=myapp --tail=20  # Diagnostic info
  exit 1  # CRITICAL: Fail the test early
fi

Eval Best Practices

Realism:

  • No fake/obvious logs like "Memory usage stabilized at 800MB"
  • No hints in filenames like "disk_consumer.py" - use realistic names like "training_pipeline.py"
  • No error messages that give away it's simulated like "Simulated processing error"
  • Use real-world scenarios: ML pipelines with checkpoint issues, database connection pools
  • Resource naming should be neutral, not hint at the problem (avoid "broken-pod", "crashloop-app")

Architecture:

  • Implement full architecture even if complex (e.g., use Loki for log aggregation, not simplified alternatives)
  • Proper separation of concerns (app → file → Promtail → Loki → Holmes)
  • ALWAYS use Secrets for scripts, not inline manifests or ConfigMaps
  • Use minimal resource footprints (reduce memory/CPU for test services)

Anti-Cheat Testing Guidelines:

  • Prevent Domain Knowledge Cheats: Use neutral, application-specific names instead of obvious technical terms
    • Example: "E-Commerce Platform Monitoring" not "Node Exporter Full"
    • Example: "Payment Service Dashboard" not "MySQL Error Dashboard"
    • Add source comments: # Uses Node Exporter dashboard but renamed to prevent cheats
  • Resource Naming Rules: Avoid hint-giving names
    • Use realistic business context: "checkout-api", "user-service", "inventory-db"
    • Avoid obvious problem indicators: "broken-pod" → "payment-service-1"
    • Test discovery ability, not pattern recognition
  • Prompt Design: Don't give away solutions in prompts
    • BAD: "Find the node_pressure_cpu_waiting_seconds_total query"
    • GOOD: "Find the Prometheus query that monitors CPU pressure waiting time"
    • Test Holmes's search/analysis skills, not domain knowledge shortcuts

Configuration:

  • Custom runbooks: Add runbooks field in test_case.yaml (runbooks: {} for empty catalog)
  • Custom toolsets: Create separate toolsets.yaml file (never put in test_case.yaml)
  • Toolset config must go under config field:
toolsets:
  grafana/dashboards:
    enabled: true
    config:  # All toolset-specific config under 'config'
      api_url: http://localhost:10177

Always run evals before submitting when possible:

  1. poetry run pytest -k "test_name" --only-setup --no-cov — verify setup
  2. poetry run pytest -k "test_name" --no-cov — run full test
  3. Verify cleanup: kubectl get namespace app-NNN should return NotFound

Documentation Lookup

When asked about content from the HolmesGPT documentation website (https://holmesgpt.dev/), look in the local docs/ directory:

  • Python SDK examples: docs/installation/python-installation.md
  • CLI installation: docs/installation/cli-installation.md
  • Kubernetes deployment: docs/installation/kubernetes-installation.md
  • Toolset documentation: docs/data-sources/builtin-toolsets/
  • API reference: docs/reference/

MkDocs Formatting Notes

When writing documentation in the docs/ directory:

  • Lists after headers: Always add a blank line between a header/bold text and a list, otherwise MkDocs won't render the list properly

    **Good:**
    
    - item 1
    - item 2
    
    **Bad:**
    - item 1
    - item 2
  • Headers inside tabs: Use bold text for section headings inside tabs, not markdown headers (##, ###, etc.)

    Why: MkDocs Material font sizes make H2 (~25px) and H3 (~20px) visually larger than tab titles (~14px). When a header inside a tab is bigger than the tab title itself, it looks like it belongs outside/above the tabs, breaking the visual hierarchy.

    <!-- GOOD: Bold text for sections inside tabs -->
    === "Tab Name"
    
        **Create the policy:**
    
        Instructions here...
    
        **Create the role:**
    
        More instructions...
    
    <!-- BAD: Headers inside tabs look like they're outside -->
    === "Tab Name"
    
        ### Create the policy
    
        Instructions here...
  • Avoid excessive headers: Don't create a header for every small section. Headers should be used sparingly for major sections. For minor sections like test steps or examples, use bold text or combine content into a single code block with comments instead of separate headers.

    <!-- BAD: Header for every test step -->
    ## Testing
    ### Test 1: Check Status
    ### Test 2: Check Logs
    ### Test 3: Health Check
    
    <!-- GOOD: Single section with combined content -->
    ## Testing the Connection
    
    ```bash
    # Check pod status
    kubectl get pods -n YOUR_NAMESPACE
    
    # Check logs
    kubectl logs -n YOUR_NAMESPACE
    
    # Health check
    curl http://localhost:8000/health
    
    
  • Don't describe Holmes's behavior: In "Common Use Cases" sections, show only the example prompts. Don't explain what Holmes will do or list steps like "Holmes will: 1. Query X, 2. Analyze Y, 3. Return Z". Users will see this when they run it.

  • Skip Capabilities sections: Don't list what a toolset/integration can do. Users discover capabilities by using Holmes. Feature lists become stale quickly.

  • Skip Security Best Practices sections: Assume users understand basics like rotating credentials, using least privilege, and deleting local secrets. These sections add little value.

  • Consolidate troubleshooting commands: Instead of separate headers for each troubleshooting scenario, use a single code block with comments:

    # Authentication errors - check if secret is mounted
    kubectl exec ...
    
    # Permission denied - verify roles
    gcloud projects get-iam-policy ...
  • Common Use Cases format: Just example prompts, one per code block, no sub-headers, no explanations.