This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
HolmesGPT is an AI-powered troubleshooting agent that connects to observability platforms (Kubernetes, Prometheus, Grafana, etc.) to automatically diagnose and analyze infrastructure and application issues. It uses an agentic loop to investigate problems by calling tools to gather data from multiple sources.
```bash
# Install dependencies with Poetry
poetry install

# Install test dependencies with Poetry
poetry install --with dev

# Run all non-LLM tests (unit and integration tests)
make test-without-llm
poetry run pytest tests -m "not llm"

# Run LLM evaluation tests (requires API keys)
make test-llm-ask-holmes    # Test single-question interactions
make test-llm-investigate   # Test AlertManager investigations
poetry run pytest tests/llm/ -n 6 -vv  # Run all LLM tests in parallel
```
```bash
# Run pre-commit checks (includes ruff, mypy, poetry validation)
# NOTE: Only run these when the user explicitly asks. They run in CI automatically.
make check
poetry run pre-commit run -a

# Format code with ruff
poetry run ruff format

# Check code with ruff (auto-fix issues)
poetry run ruff check --fix

# Type checking with mypy
poetry run mypy
```

**CLI Entry Point** (`holmes/main.py`):
- Typer-based CLI with subcommands for `ask`, `investigate`, `toolset`
- Handles configuration loading, logging setup, and command routing
**Interactive Mode for CLI** (`holmes/interactive.py`):
- Handles interactive mode for the `ask` subcommand
- Implements slash commands
**Configuration System** (`holmes/config.py`):
- Loads settings from `~/.holmes/config.yaml` or via CLI options
- Manages API keys, model selection, and toolset configurations
- Factory methods for creating sources (AlertManager, Jira, PagerDuty, etc.)
**Core Investigation Engine** (`holmes/core/`):
- `tool_calling_llm.py`: Main LLM interaction with tool calling capabilities
- `investigation.py`: Orchestrates multi-step investigations with runbooks
- `toolset_manager.py`: Manages available tools and their configurations
- `tools.py`: Tool definitions and execution logic
**Plugin System** (`holmes/plugins/`):
- Sources: AlertManager, Jira, PagerDuty, OpsGenie integrations
- Toolsets: Kubernetes, Prometheus, Grafana, AWS, Docker, etc.
- Prompts: Jinja2 templates for different investigation scenarios
- Destinations: Slack integration for sending results
Toolset Architecture:
- Each toolset is a YAML file defining available tools and their parameters
- Tools can be Python functions or bash commands with safety validation
- Toolsets are loaded dynamically and can be customized via config files
- Important: All toolsets MUST return detailed error messages from underlying APIs to enable LLM self-correction
- Include the exact query/command that was executed
- Include time ranges, parameters, and filters used
- Include the full API error response (status code and message)
- For "no data" responses, specify what was searched and where
Thin API Wrapper Pattern for Python Toolsets:
- Reference implementation: `servicenow_tables/servicenow_tables.py`
- Use the `requests` library for HTTP calls (not specialized client libraries like `opensearchpy`)
- Simple config class with Pydantic validation
- Health check in the `prerequisites_callable()` method
- Each tool is a thin wrapper around a single API endpoint (see the sketch below)
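A hedged sketch of the config-plus-health-check part of the pattern. The class, endpoint, and function names are hypothetical; `servicenow_tables/servicenow_tables.py` remains the authoritative reference:

```python
import requests
from pydantic import BaseModel


class ExampleAPIConfig(BaseModel):
    # Simple Pydantic config: endpoint plus credentials, nothing clever
    api_url: str
    api_key: str
    timeout_seconds: int = 30


def check_prerequisites(config: ExampleAPIConfig) -> tuple[bool, str]:
    """The kind of cheap authenticated call a prerequisites_callable()-style health check makes."""
    try:
        resp = requests.get(
            f"{config.api_url}/api/v1/status",  # hypothetical status endpoint
            headers={"Authorization": f"Bearer {config.api_key}"},
            timeout=config.timeout_seconds,
        )
        return resp.status_code == 200, f"HTTP {resp.status_code}: {resp.text[:200]}"
    except requests.RequestException as exc:
        return False, f"Health check against {config.api_url} failed: {exc}"
```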
Server-Side Filtering is Critical:
- Never return unbounded data from APIs - this causes token overflow
- Always include filter parameters on tools that query collections (e.g., an `index` parameter for Elasticsearch `_cat` APIs; see the sketch after this list)
- Example problem: `opensearch_list_shards` returned ALL shards → 25K+ tokens on large clusters
- Example fix: the `elasticsearch_cat` tool requires an `index` parameter for shards/segments endpoints
- When server-side filtering is not possible, use `JsonFilterMixin` (see `json_filter_mixin.py`) to add `max_depth` and `jq` parameters for client-side filtering
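An illustrative sketch of enforcing a required filter parameter; the function is hypothetical, and the real behavior lives in the Elasticsearch/OpenSearch toolsets:

```python
import requests


def cat_shards(es_url: str, api_key: str, index: str) -> str:
    """Require an index pattern so _cat/shards never returns every shard in the cluster."""
    if not index:
        # Refuse unbounded listings instead of flooding the context window
        return "Error: the 'index' parameter is required; pass an index name or pattern (e.g. 'logs-*')."
    resp = requests.get(
        f"{es_url}/_cat/shards/{index}",
        params={"format": "json"},
        headers={"Authorization": f"ApiKey {api_key}"},
        timeout=30,
    )
    if resp.status_code != 200:
        return f"GET {resp.url} returned HTTP {resp.status_code}: {resp.text}"
    return resp.text
```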
Toolset Config Backwards Compatibility:
When renaming config fields in a toolset, maintain backwards compatibility using Pydantic's extra="allow":
```python
# ✅ DO: Use extra="allow" to accept deprecated fields without polluting schema
import logging

from pydantic import BaseModel, ConfigDict, model_validator


class MyToolsetConfig(BaseModel):
    model_config = ConfigDict(extra="allow")

    # Only define current field names in schema
    new_field_name: int = 10

    @model_validator(mode="after")
    def handle_deprecated_fields(self):
        extra = self.model_extra or {}
        deprecated = []
        # Map old names to new names
        if "old_field_name" in extra:
            self.new_field_name = extra["old_field_name"]
            deprecated.append("old_field_name -> new_field_name")
        if deprecated:
            logging.warning(f"Deprecated config names: {', '.join(deprecated)}")
        return self
```
```python
# ❌ DON'T: Define deprecated fields in schema with Optional[None]
from typing import Optional

from pydantic import BaseModel


class BadConfig(BaseModel):
    new_field_name: int = 10
    old_field_name: Optional[int] = None  # Pollutes schema, shows in model_dump()
```

Benefits of the `extra="allow"` approach:
- Schema only shows current field names
- `model_dump()` returns clean output without deprecated fields
- Old configs still work (backwards compatible)
- Deprecation warnings guide users to update
See `PrometheusConfig` in `prometheus/prometheus.py` for a complete example.
LLM Integration:
- Uses LiteLLM for multi-provider support (OpenAI, Anthropic, Azure, etc.)
- Structured tool calling with automatic retry and error handling
- Context-aware prompting with system instructions and examples
Investigation Flow (sketched below):
- Load user question/alert
- Select relevant toolsets based on context
- Execute LLM with available tools
- LLM calls tools to gather data
- LLM analyzes results and provides conclusions
- Optionally write results back to source system
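A highly simplified sketch of this agentic loop, using LiteLLM's OpenAI-style tool-calling interface. It is illustrative only; the actual logic lives in `holmes/core/tool_calling_llm.py`, and the `tools`/`execute_tool` arguments are stand-ins for the toolset machinery:

```python
import json

import litellm


def investigate(question: str, tools: list[dict], execute_tool) -> str:
    """Minimal agentic loop: call the LLM, run any requested tools, feed results back, repeat."""
    messages = [
        {"role": "system", "content": "You are a troubleshooting assistant."},
        {"role": "user", "content": question},
    ]
    for _ in range(10):  # hard cap on iterations
        response = litellm.completion(
            model="anthropic/claude-sonnet-4-5-20250929",
            messages=messages,
            tools=tools,
        )
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content  # the LLM is done and has stated its conclusions
        messages.append(message.model_dump())  # record the assistant turn with its tool calls
        for tool_call in message.tool_calls:
            result = execute_tool(tool_call.function.name, json.loads(tool_call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": result})
    return "Investigation did not converge within the iteration limit."
```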
Three-tier testing approach:
- Unit Tests (`tests/`): Standard pytest tests for individual components
- Integration Tests: Test toolset integrations
- LLM Evaluation Tests (`tests/llm/`): End-to-end tests using fixtures
Running regular (non-LLM) tests:
```bash
poetry run pytest tests -m "not llm"
make test-without-llm
```

Running LLM eval tests:

```bash
# Run specific eval - IMPORTANT: Use -k flag, NOT full test path with brackets
poetry run pytest -k "09_crashpod" --no-cov

# Run all evals in parallel
poetry run pytest tests/llm/ -n 6 --no-cov

# Regression evals
poetry run pytest -m 'llm and easy' --no-cov
```

For the complete eval CLI reference (flags, env vars, model comparison, debugging), see the /create-eval skill, which contains full documentation in its reference files.
Config File Location: `~/.holmes/config.yaml`
Key Configuration Sections:
- `model`: LLM model to use (default: gpt-4.1)
- `api_key`: LLM API key (or use environment variables)
- `custom_toolsets`: Override or add toolsets
- `custom_runbooks`: Add investigation runbooks
- Platform-specific settings (`alertmanager_url`, `jira_url`, etc.)
Environment Variables:
- `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`: LLM API keys
- `OPENROUTER_API_KEY`: Alternative LLM provider via OpenRouter (domain: api.openrouter.ai)
- `MODEL`: Override default model(s) - supports a comma-separated list
- `RUN_LIVE`: Enable live execution of tools in tests (default: true)
- `BRAINTRUST_API_KEY`: For test result tracking and CI/CD report generation
- `BRAINTRUST_ORG`: Braintrust organization name (default: "robustadev")
- `ELASTICSEARCH_URL`, `ELASTICSEARCH_API_KEY`: For Elasticsearch/OpenSearch cloud testing
Code Quality:
- Use Ruff for formatting and linting (configured in pyproject.toml)
- Type hints required (mypy configuration in pyproject.toml)
- Pre-commit hooks enforce quality checks in CI
- ALWAYS place Python imports at the top of the file, not inside functions or methods
- NEVER run `pre-commit`, `ruff`, or `mypy` unless the user explicitly asks you to. These tools are triggered by commit hooks which are not installed on all machines, and running them causes widespread formatting/type changes to files unrelated to your task. Only lint/format files you are actively editing, and only if asked.
Documentation Examples:
- ALWAYS use Anthropic Claude models in code examples and documentation:
  - Recommended: `anthropic/claude-sonnet-4-5-20250929` or `anthropic/claude-opus-4-5-20251101`
  - Use the latest Claude 4.5 family models (Sonnet or Opus)
- Avoid using deprecated or older model versions like `claude-3.5-sonnet`, `gpt-4-vision-preview`
- Do NOT use GPT-4o or Gemini models in documentation examples
Testing Requirements:
- All new features require unit tests
- New toolsets require integration tests
- Complex investigations should have LLM evaluation tests
- Maintain 40% minimum test coverage
- Live execution is now enabled by default to ensure tests match real-world behavior
Pull Request Process:
- PRs require maintainer approval
- Pre-commit hooks are checked in CI (do NOT run them locally unless asked)
- LLM evaluation tests run automatically in CI
- Keep PRs focused and include tests
- ALWAYS use `git commit -s` to sign off commits (required for DCO)
- When committing, use `git commit -s --no-verify` to skip local pre-commit hooks (they are not installed consistently and will cause unrelated changes)
Git Workflow Guidelines:
- ALWAYS create commits, NEVER amend
- ALWAYS merge, NEVER rebase
- ALWAYS push, NEVER force push
- Maintain a history of your work to allow the user to revert to a previous iteration
File Structure Conventions:
- Toolsets: `holmes/plugins/toolsets/{name}.yaml` or `{name}/`
- Prompts: `holmes/plugins/prompts/{name}.jinja2`
- Tests: Match source structure under `tests/`
- All tools have read-only access by design
- Bash toolset validates commands for safety
- No secrets should be committed to repository
- Use environment variables or config files for API keys
- RBAC permissions are respected for Kubernetes access
For creating, running, and debugging LLM eval tests, use the /create-eval skill. It contains the complete workflow, test_case.yaml field reference, anti-hallucination patterns, infrastructure setup guides, and CLI reference.
Test Structure:
- Use sequential test numbers: check existing tests for next available number
- Required files: `test_case.yaml`, infrastructure manifests, `toolsets.yaml` (if needed)
- Use a dedicated namespace per test: `app-<testid>` (e.g., `app-177`)
- All resource names must be unique across tests to prevent conflicts
Tags:
- CRITICAL: Only use valid tags from `pyproject.toml` - invalid tags cause test collection failures
- Check existing tags before adding new ones; ask user permission for new tags
Cloud Service Evals (No Kubernetes Required):
- Evals can test against cloud services (Elasticsearch, external APIs) directly via environment variables
- Faster setup (<30 seconds vs minutes for K8s infrastructure)
- `before_test` creates test data in the cloud service, `after_test` cleans up
- Use `toolsets.yaml` to configure the toolset with env var references: `api_url: "{{ env.ELASTICSEARCH_URL }}"`
- CI/CD secrets: When adding evals for a new integration, you must add the required environment variables to `.github/workflows/eval-regression.yaml` in the "Run tests" step. Tell the user which secrets they need to add to their GitHub repository settings (e.g., `ELASTICSEARCH_URL`, `ELASTICSEARCH_API_KEY`).
- HTTP request passthrough: The root `conftest.py` has a `responses` fixture with `autouse=True` that mocks ALL HTTP requests by default. When adding a new cloud integration, you MUST add the service's URL pattern to the passthrough list in `conftest.py` (search for `rsps.add_passthru`). Use `re.compile()` for pattern matching, e.g., `rsps.add_passthru(re.compile(r"https://.*\.cloud\.es\.io"))` (a hedged sketch follows below).
User Prompts & Expected Outputs:
- Be specific: Test exact values like `"The dashboard title is 'Home'"`, not generic `"Holmes retrieves dashboard"`
- Match prompt to test: User prompt must explicitly request what you're testing
  - BAD: `"Get the dashboard"`
  - GOOD: `"Get the dashboard and tell me the title, panels, and time range"`
- Anti-cheat prompts: Don't use technical terms that give away solutions
  - BAD: `"Find node_exporter metrics"`
  - GOOD: `"Find CPU pressure monitoring queries"`
- Test discovery, not recognition: Holmes should search/analyze, not guess from context
- Ruling out hallucinations is paramount: When choosing between test approaches, prefer the one that rules out hallucinations:
  - Best: Check specific values that can only be discovered by querying (e.g., unique IDs, injected error codes, exact counts)
  - Acceptable: Use `include_tool_calls: true` to verify the tool was called when output values are too generic to rule out hallucinations
  - Bad: Check generic output patterns that an LLM could plausibly guess (e.g., "cluster status is green/yellow/red", "has N nodes")
- `expected_output` is invisible to the LLM: The `expected_output` field is only used by the evaluator - the LLM never sees it. This means:
  - You can safely put secrets/verification codes in `expected_output` that the LLM must discover
  - `before_test` can inject a unique verification code into test data, and `expected_output` can check for it
  - This is a powerful pattern for cloud service tests: create data with a unique code in `before_test`, ask the LLM to find it, verify with `expected_output`

  ```yaml
  # Example: before_test creates a page with verification code "HOLMES-EVAL-7x9k2m4p"
  # The LLM must discover this code by querying the service
  expected_output:
    - "Must report the verification code: HOLMES-EVAL-7x9k2m4p"
  ```

- `include_tool_calls: true`: Use when expected output is too generic to be hallucination-proof. Prefer specific answer checking when possible, but verifying tool calls is better than a test that can't rule out hallucinations.

  ```yaml
  # Use when values are generic (cluster health could be guessed)
  include_tool_calls: true
  expected_output:
    - "Must call elasticsearch_cluster_health tool"
    - "Must report cluster status"
  ```
Infrastructure Setup:
- Don't just test pod readiness - verify actual service functionality
- Poll real API endpoints and check for expected content (e.g., `"title":"Home"`, `"type":"welcome"`)
- CRITICAL: Use `exit 1` when setup verification fails to fail the test early
- Never use `:latest` container tags - use specific versions like `grafana/grafana:12.3.1`
NEVER submit test changes without verification:
- Setup Phase: `poetry run pytest -k "test_name" --only-setup --no-cov`
- Full Test: `poetry run pytest -k "test_name" --no-cov`
- Verify Results: Ensure 100% pass rate and expected behavior
- ✅ After creating new tests
- ✅ After modifying existing tests
- ✅ After refactoring shared infrastructure
- ✅ After performance optimizations
- ✅ After adding/changing tags
- ❌ "The changes look good" without running
- ❌ "It's just a small change"
- ❌ "I'll test it later"
Testing is Part of Development: Testing is not optional - it's an integral part of the development process. Untested code is broken code.
Testing Methodology:
- Phase 1: Test setup with the `--only-setup` flag first
- Phase 2: Run the full test after confirming setup works
- Use background execution for long tests: `nohup ... > logfile.log 2>&1 &`
- Handle port conflicts: clean up previous test port forwards before running
Common Flags:
- `--skip-cleanup`: Keep resources after test (useful for debugging setup)
- `--skip-setup`: Skip before_test commands (useful for iterative testing)
When to use shared infrastructure:
- Multiple tests use the same service (Grafana, Loki, Prometheus)
- Service configuration is standardized across tests
Implementation:
```bash
# Create shared manifest in tests/llm/fixtures/shared/servicename.yaml
# Use in tests:
kubectl apply -f ../../shared/servicename.yaml -n app-<testid>
```

Benefits:
- Single place for version updates
- Consistent configuration across tests
- Reduced maintenance overhead
- Follows established pattern (Loki, Prometheus, Grafana)
Prefer kubectl exec over port forwarding for setup verification:
```bash
# GOOD - kubectl exec pattern (no port conflicts)
kubectl exec -n namespace deployment/service -- wget -q -O- http://localhost:port/health

# AVOID - port forward for setup verification (causes conflicts)
kubectl port-forward svc/service port:port &
curl localhost:port/health
kill $PORTFWD_PID
```

Performance optimization guidelines:
- Use `sleep 1` instead of `sleep 5` for most retry loops
- Remove sleeps after straightforward operations (port forward start)
- Reduce timeout values: 60s for pod readiness, 30s for API verification
- Question every sleep - many are unnecessary
Race Condition Handling:
Never use a bare `kubectl wait` immediately after resource creation. Use retry loops:
```bash
# WRONG - fails if pod not scheduled yet
kubectl apply -f deployment.yaml
kubectl wait --for=condition=ready pod -l app=myapp --timeout=300s

# CORRECT - retry loop handles race condition
kubectl apply -f deployment.yaml
POD_READY=false
for i in {1..60}; do
  if kubectl wait --for=condition=ready pod -l app=myapp --timeout=5s 2>/dev/null; then
    echo "✅ Pod is ready!"
    POD_READY=true
    break
  fi
  sleep 1
done
if [ "$POD_READY" = false ]; then
  echo "❌ Pod failed to become ready after 60 seconds"
  kubectl logs -l app=myapp --tail=20  # Diagnostic info
  exit 1  # CRITICAL: Fail the test early
fi
```

Realism:
- No fake/obvious logs like "Memory usage stabilized at 800MB"
- No hints in filenames like "disk_consumer.py" - use realistic names like "training_pipeline.py"
- No error messages that give away it's simulated like "Simulated processing error"
- Use real-world scenarios: ML pipelines with checkpoint issues, database connection pools
- Resource naming should be neutral, not hint at the problem (avoid "broken-pod", "crashloop-app")
Architecture:
- Implement full architecture even if complex (e.g., use Loki for log aggregation, not simplified alternatives)
- Proper separation of concerns (app → file → Promtail → Loki → Holmes)
- ALWAYS use Secrets for scripts, not inline manifests or ConfigMaps
- Use minimal resource footprints (reduce memory/CPU for test services)
Anti-Cheat Testing Guidelines:
- Prevent Domain Knowledge Cheats: Use neutral, application-specific names instead of obvious technical terms
- Example: "E-Commerce Platform Monitoring" not "Node Exporter Full"
- Example: "Payment Service Dashboard" not "MySQL Error Dashboard"
- Add source comments, e.g., `# Uses Node Exporter dashboard but renamed to prevent cheats`
- Resource Naming Rules: Avoid hint-giving names
- Use realistic business context: "checkout-api", "user-service", "inventory-db"
- Avoid obvious problem indicators: "broken-pod" → "payment-service-1"
- Test discovery ability, not pattern recognition
- Prompt Design: Don't give away solutions in prompts
- BAD: "Find the node_pressure_cpu_waiting_seconds_total query"
- GOOD: "Find the Prometheus query that monitors CPU pressure waiting time"
- Test Holmes's search/analysis skills, not domain knowledge shortcuts
Configuration:
- Custom runbooks: Add a `runbooks` field in test_case.yaml (`runbooks: {}` for an empty catalog)
- Custom toolsets: Create a separate `toolsets.yaml` file (never put it in test_case.yaml)
- Toolset config must go under the `config` field:

```yaml
toolsets:
  grafana/dashboards:
    enabled: true
    config:  # All toolset-specific config under 'config'
      api_url: http://localhost:10177
```

Always run evals before submitting when possible:
- `poetry run pytest -k "test_name" --only-setup --no-cov` — verify setup
- `poetry run pytest -k "test_name" --no-cov` — run the full test
- Verify cleanup: `kubectl get namespace app-NNN` should return NotFound
When asked about content from the HolmesGPT documentation website (https://holmesgpt.dev/), look in the local docs/ directory:
- Python SDK examples: `docs/installation/python-installation.md`
- CLI installation: `docs/installation/cli-installation.md`
- Kubernetes deployment: `docs/installation/kubernetes-installation.md`
- Toolset documentation: `docs/data-sources/builtin-toolsets/`
- API reference: `docs/reference/`
When writing documentation in the docs/ directory:
- Lists after headers: Always add a blank line between a header/bold text and a list, otherwise MkDocs won't render the list properly

  ```markdown
  **Good:**

  - item 1
  - item 2

  **Bad:**
  - item 1
  - item 2
  ```
- Headers inside tabs: Use bold text for section headings inside tabs, not markdown headers (`##`, `###`, etc.)

  Why: MkDocs Material font sizes make H2 (~25px) and H3 (~20px) visually larger than tab titles (~14px). When a header inside a tab is bigger than the tab title itself, it looks like it belongs outside/above the tabs, breaking the visual hierarchy.

  ```markdown
  <!-- GOOD: Bold text for sections inside tabs -->
  === "Tab Name"

      **Create the policy:**

      Instructions here...

      **Create the role:**

      More instructions...

  <!-- BAD: Headers inside tabs look like they're outside -->
  === "Tab Name"

      ### Create the policy

      Instructions here...
  ```
- Avoid excessive headers: Don't create a header for every small section. Headers should be used sparingly for major sections. For minor sections like test steps or examples, use bold text or combine content into a single code block with comments instead of separate headers.

  ````markdown
  <!-- BAD: Header for every test step -->
  ## Testing
  ### Test 1: Check Status
  ### Test 2: Check Logs
  ### Test 3: Health Check

  <!-- GOOD: Single section with combined content -->
  ## Testing the Connection

  ```bash
  # Check pod status
  kubectl get pods -n YOUR_NAMESPACE
  # Check logs
  kubectl logs -n YOUR_NAMESPACE
  # Health check
  curl http://localhost:8000/health
  ```
  ````
- Don't describe Holmes's behavior: In "Common Use Cases" sections, show only the example prompts. Don't explain what Holmes will do or list steps like "Holmes will: 1. Query X, 2. Analyze Y, 3. Return Z". Users will see this when they run it.
- Skip Capabilities sections: Don't list what a toolset/integration can do. Users discover capabilities by using Holmes. Feature lists become stale quickly.
- Skip Security Best Practices sections: Assume users understand basics like rotating credentials, using least privilege, and deleting local secrets. These sections add little value.
- Consolidate troubleshooting commands: Instead of separate headers for each troubleshooting scenario, use a single code block with comments:

  ```bash
  # Authentication errors - check if secret is mounted
  kubectl exec ...

  # Permission denied - verify roles
  gcloud projects get-iam-policy ...
  ```
- Common Use Cases format: Just example prompts, one per code block, no sub-headers, no explanations.