
redhat-et/rhdp-rca-plugin


RHDP RCA Plugin


Claude Code plugin for AI-assisted root-cause analysis of infrastructure failures and operational incidents.

How It Works • Quick Start • Available Skills • Contributing


What is RHDP RCA Plugin?

RHDP RCA Plugin is a Claude Code marketplace containing specialized skills designed for Red Hat Demo Platform (RHDP) root cause analysis. This plugin suite enables AI-powered investigation of infrastructure failures, log analysis, and root cause diagnosis. These skills provide Claude with the tools to:

  • Fetch and analyze logs from remote servers
  • Correlate multiple data sources (Ansible, Splunk, GitHub)
  • Perform automated root cause analysis
  • Capture and organize user feedback

Quick Start

Prerequisites

  • Claude Code installed
  • SSH access to remote servers (for optional log auto-fetch)
  • Splunk credentials (for log correlation)

1) Install the plugin

  • Open Claude Code and run /plugin
  • Add marketplace: redhat-et/rhdp-rca-plugin
  • Install the plugin and restart Claude Code

2) Run root cause analysis

Start with a normal RCA request:

/aiops-plugin:root-cause-analysis job 123456

The root-cause-analysis workflow first runs preflight checks and setup guidance, then runs Steps 1-4 automatically (log parsing, Splunk query, timeline correlation, GitHub config fetch), followed by the Step 5 analysis.

3) Manual fallback (only if needed)

If preflight setup does not complete in your environment:

  1. Copy and update .claude/settings.example.json.
  2. Apply it to your local .claude/settings.local.json (project-level), including env vars and hooks.

Note: These hooks are required for MLflow tracing. The Stop hook flushes traces and the SessionStart hook captures the session ID.

For tracing:


Available Skills

| Skill | Description | Key Features |
|-------|-------------|--------------|
| template-skill | Template for creating new skills | Starter template, best practices |
| logs-fetcher | Fetch Ansible/AAP logs via SSH | Time-based filtering, job number lookup |
| root-cause-analysis | Automated RCA for failed jobs | Log correlation, Splunk + GitHub integration |
| context-fetcher | Fetch job configs and docs | GitHub and Confluence integration |
| feedback-capture | Capture user feedback | Structured storage, categorization |

Skill Details

🔍 logs-fetcher

Fetch Ansible/AAP logs from remote servers with flexible filtering

# Fetch logs from a specific time range
python -m scripts.fetch_logs_ssh \
  --start-time "2025-12-09 08:00:00" \
  --end-time "2025-12-10 17:00:00" \
  --mode processed

# Fetch logs by job number
python -m scripts.fetch_logs_by_job 1234567 1234568 1234569

Use cases:

  • Fetch logs from specific time windows (minute/second precision)
  • Retrieve logs for specific job numbers
  • Download recent processed or ignored job logs
  • Investigate incidents within a known timeframe
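
As a rough illustration of time-window filtering, the sketch below selects job numbers whose timestamp falls inside a start/end range written in the same format as the `--start-time` and `--end-time` values above. The `filter_jobs` helper and the job-to-timestamp mapping are hypothetical and are not taken from `scripts.fetch_logs_ssh`.

```python
from datetime import datetime

# Same shape as the --start-time/--end-time values shown above.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def in_window(timestamp: str, start: str, end: str) -> bool:
    """Return True if timestamp lies within [start, end], inclusive."""
    ts = datetime.strptime(timestamp, TIME_FORMAT)
    return datetime.strptime(start, TIME_FORMAT) <= ts <= datetime.strptime(end, TIME_FORMAT)


def filter_jobs(jobs: dict[str, str], start: str, end: str) -> list[str]:
    """Pick job numbers inside the window.

    `jobs` maps a job number to a timestamp string (a hypothetical
    structure used only for this example).
    """
    return sorted(job for job, stamp in jobs.items() if in_window(stamp, start, end))
```

Comparing parsed `datetime` values rather than raw strings keeps the check correct even if the timestamp format changes later.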

View detailed documentation →


🔎 root-cause-analysis

Investigate failed jobs by correlating Ansible/AAP logs with Splunk OCP pod logs and GitHub configuration

Step 1   [Python]  Parse local job log (extract GUID, namespace, failed tasks)
Step 2   [Python]  Query Splunk for correlated pod logs
Step 3   [Python]  Build correlation timeline
Step 4   [Python]  Fetch GitHub configs (AgnosticD/AgnosticV)
Step 5   [Claude]  Analyze and summarize root cause
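
To give a feel for the failed-task part of Step 1, here is a minimal sketch that scans Ansible's default stdout format for `TASK [...]` headers and `fatal:` lines. It assumes the job log uses that default layout with no per-line prefix. The function name `find_failed_tasks` is hypothetical, and GUID and namespace extraction are omitted because their log format is not described here.

```python
import re

# Default Ansible stdout: "TASK [name] *****" followed by per-host result lines.
TASK_RE = re.compile(r"^TASK \[(?P<name>.+?)\] \*+\s*$")
FATAL_RE = re.compile(
    r"^fatal: \[(?P<host>[^\]]+)\]: (?:FAILED|UNREACHABLE)! => (?P<detail>.*)$"
)


def find_failed_tasks(log_text: str) -> list[dict[str, str]]:
    """Return one record per fatal line, tagged with the enclosing task name."""
    failures = []
    current_task = "<unknown>"  # used if a fatal line appears before any TASK header
    for line in log_text.splitlines():
        task = TASK_RE.match(line)
        if task:
            current_task = task.group("name")
            continue
        fatal = FATAL_RE.match(line)
        if fatal:
            failures.append(
                {
                    "task": current_task,
                    "host": fatal.group("host"),
                    "detail": fatal.group("detail"),
                }
            )
    return failures
```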

Command Usage:

# By job ID (auto-fetches log from remote if not found locally)
.venv/bin/python scripts/cli.py analyze --job-id <JOB_ID> --fetch

# By explicit path (when you already have the log file)
.venv/bin/python scripts/cli.py analyze --job-log <path-to-job-log>

Use cases:

  • Investigate job failures
  • Analyze logs for errors and patterns
  • Find root causes of infrastructure issues
  • Debug failed deployments
  • Troubleshoot Kubernetes/OpenShift problems

View detailed documentation →

context-fetcher

Fetch configuration and documentation context via MCP servers

Integrates with:

  • GitHub: Job configs, recent commits, CI workflows
  • Confluence: Runbooks, troubleshooting guides, documentation

Use cases:

  • Retrieve job configuration from repositories
  • Access relevant documentation during investigations
  • Review recent code changes related to failures

View detailed documentation →


💬 feedback-capture

Capture and store user feedback during interactions

Features:

  • Ask users for feedback interactively
  • Categorize feedback (Complexity, Clarity, Accuracy, etc.)
  • Summarize interaction context
  • Record structured feedback with timestamps

Feedback is appended to ~/feedback.txt by default with session tracking.
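
The on-disk layout of the feedback file is not documented here. As an illustration only, the sketch below appends one JSON object per line with a timestamp, session ID, and category. The `append_feedback` function and its field names are assumptions, not the skill's actual format.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_feedback(path: Path, session_id: str, category: str, text: str) -> dict:
    """Append one feedback record as a JSON line and return it.

    The record fields are hypothetical; adjust them to the skill's real schema.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "session_id": session_id,
        "category": category,
        "feedback": text,
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier sessions' feedback intact.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record
```

A call such as `append_feedback(Path.home() / "feedback.txt", "session-1", "Clarity", "Steps were easy to follow")` would add a single line to the default location mentioned above.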

Use cases:

  • Collect feedback at the end of skill invocations
  • Track user sentiment across sessions
  • Categorize and store bug reports

View detailed documentation →


How It Works

Architecture

                    ┌─────────────────────┐
                    │   Claude Code UI    │
                    │  (User Interface)   │
                    └──────────┬──────────┘
                               │
         ┌─────────────────────┴─────────────────────┐
         │         RHDP RCA Plugin Marketplace       │
         │                                           │
         │  ┌─────────────────────────────────────┐  │
         │  │  Skills (SKILL.md definitions)      │  │
         │  │                                     │  │
         │  │  • template-skill                   │  │
         │  │  • logs-fetcher ──────► SSH         │  │
         │  │  • root-cause-analysis ──► Splunk   │  │
         │  │                      └──► GitHub API│  │
         │  │  • context-fetcher ──► MCP Servers  │  │
         │  │  • feedback-capture ──► Local FS    │  │
         │  └─────────────────────────────────────┘  │
         └───────────────────────────────────────────┘
                               │
         ┌─────────────────────┴─────────────────────┐
         │                                           │
    ┌────▼────┐    ┌────────┐    ┌──────────────┐    │
    │ GitHub  │    │Confluen│    │ External     │    │
    │   MCP   │    │ce MCP  │    │ Systems      │    │
    │         │    │        │    │ (SSH/Splunk) │    │
    └─────────┘    └────────┘    └──────────────┘    │

Integration Points:

  • MCP Servers: GitHub (code search, file retrieval) and Confluence (documentation)
  • Direct APIs: Splunk REST API, GitHub API
  • SSH: Remote log server access
  • Local: File system for logs and feedback

Each skill follows the Anthropic Agent Skills Specification with SKILL.md definitions that Claude Code loads automatically.

End-to-End RCA Workflow

When investigating a failed job:

  1. User Query: "/root-cause-analysis job 1234567"
  2. Skill Selection: Claude selects root-cause-analysis
  3. Data Collection (Steps 1-4, automated):
    • Parse job log (local file)
    • Query Splunk for pod logs
    • Correlate timeline
    • Fetch GitHub configs via API
  4. AI Analysis (Step 5): Claude analyzes and identifies root cause
  5. Results: Summary with evidence and recommendations
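
The "Correlate timeline" step can be pictured as merging timestamped events from several sources into one chronological list. The sketch below does this with `heapq.merge`. The `Event` type and `build_timeline` function are hypothetical and do not reflect the plugin's internal data model.

```python
import heapq
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    timestamp: datetime
    source: str   # e.g. "ansible" or "splunk" (illustrative labels)
    message: str


def build_timeline(*streams: list[Event]) -> list[Event]:
    """Merge several event streams into one list ordered by timestamp."""
    def by_time(event: Event) -> datetime:
        return event.timestamp

    # heapq.merge expects each input to be sorted already, so sort defensively.
    ordered = (sorted(stream, key=by_time) for stream in streams)
    return list(heapq.merge(*ordered, key=by_time))
```

Reading the merged list in order shows which pod-level events occurred around the time of the failed Ansible task, which is the evidence Step 5 reasons over.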

Usage with Claude Code

Simply invoke skills by describing your task:

"Analyze job 1234567 for root cause"
"Investigate why this deployment failed"
"Fetch logs from the last 2 hours"

Claude will automatically select and invoke the appropriate skill based on your request.


Creating a New Skill

  1. Create a directory with your skill name (lowercase, hyphen-separated)
  2. Add a SKILL.md file:
---
name: my-skill
description: Brief description of what this skill does
allowed-tools:
  - Bash
  - Read
---

# My Skill

Instructions for Claude...

See template-skill for a minimal example and agent_skills_spec.md for the full specification.
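
Before submitting a new skill, a quick local check of the frontmatter can catch naming mistakes. The sketch below is a hypothetical helper, not part of this repository. It reads only top-level `key: value` pairs (it is not a full YAML parser) and assumes that skill names consist of lowercase letters, digits, and single hyphens; agent_skills_spec.md is the authority on the exact rules.

```python
import re

# Assumed naming rule: lowercase alphanumeric words joined by single hyphens.
NAME_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def read_frontmatter(skill_md: str) -> dict[str, str]:
    """Extract top-level 'key: value' pairs between the leading '---' fences."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' frontmatter fence")
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        # Skip indented lines such as the items under allowed-tools.
        if line and not line[0].isspace() and ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    raise ValueError("frontmatter is missing its closing '---' fence")


def check_skill(skill_md: str) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    fields = read_frontmatter(skill_md)
    problems = []
    if not NAME_RE.match(fields.get("name", "")):
        problems.append("name must be lowercase and hyphen-separated")
    if not fields.get("description"):
        problems.append("description is missing or empty")
    return problems
```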


Contributing

We welcome contributions! Please ensure your skill:

  • Follows the Agent Skills Spec
  • Includes clear, actionable instructions
  • Is focused on a specific AIOps domain
  • Includes appropriate documentation and examples

See CONTRIBUTING.md for detailed contribution guidelines.


License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Copyright 2025 Red Hat ACE Team

Individual skills may specify their own licenses in their frontmatter.


Support


Built by the Red Hat ACE Team
