- Universal LLM Interface: Supports OpenAI, Anthropic, Google, Azure
- Judge Agent: LLM-powered safety evaluation with structured scoring
- 3 Attack Strategies:
- Direct Prompt (baseline)
- Prompt Injection (8 jailbreak templates)
- Crescendo (multi-turn escalation)
- Attack Orchestrator: Multi-turn conversation management
- Evaluation System: Comprehensive metrics (ASR, safety scores, per-strategy)
- HarmBench Integration: 200+ behaviors + 20 fallback behaviors
- Streamlit Web App (
streamlit_app.py)- Visual benchmark runner
- Real-time progress tracking
- Interactive Plotly charts
- Strategy comparison dashboard
- Single attack tester
- Export functionality
benchmark: Full red teaming evaluationsingle-attack: Debug individual attackstest-connection: Verify API keys
- README.md: Project overview
- USAGE.md: Detailed usage guide
- STREAMLIT_GUIDE.md: Dashboard documentation
- PROJECT_SUMMARY.md: Technical recap
- This file: Final delivery summary
Configuration:
- Target: GPT-3.5 Turbo
- Judge: Claude 3.5 Sonnet
- Strategies: Direct Prompt + Prompt Injection
- Samples: 3 behaviors
Results:
Attack Success Rate (ASR): 33.33%
Successful Attacks: 2/6
Average Safety Score: 0.283
Avg Turns to Success: 1.00
DirectPrompt: 33.33% ASR
PromptInjection: 33.33% ASR
✅ Both API keys working ✅ All components functional ✅ Results saved to JSON ✅ Report generated
# Already set in .env:
OPENAI_API_KEY=sk-proj-etZYB2E...
ANTHROPIC_API_KEY=sk-ant-api03-..../run_streamlit.shOpens at: http://localhost:8501
# Quick test
python main.py benchmark --num-samples 5 --strategies all
# Single attack
python main.py single-attack \
--objective "Your test objective" \
--strategy crescendo- Configure target/judge models
- Select attack strategies
- Set sample count and threshold
- Real-time progress bar
- Instant results summary
- Overall metrics cards
- Interactive Plotly charts:
- ASR by strategy (bar chart)
- Safety scores by strategy (bar chart)
- Detailed results table
- Expandable conversation viewers
- Export to JSON/text
- Custom objective input
- Strategy selector
- Full conversation display
- Color-coded results
- Documentation
- Attack explanations
- Metrics guide
- Usage tips
- ✅ Agent Memory: Full conversation context tracking
- ✅ Universal APIs: Single interface for all providers (LiteLLM)
- ✅ Template Library: 8 pre-built jailbreak patterns
- ✅ Interactive Dashboard: Streamlit web app with visualizations
- ✅ Rich Metrics: Per-strategy analytics with charts
- ✅ Lightweight: Pure Python, minimal deps
- ✅ Async Ready: Built-in async support (not yet used)
src/
├── agents/
│ ├── base.py # Attack strategy base classes
│ └── judge_agent.py # LLM-powered safety judge
├── attacks/
│ ├── prompt_injection.py # 8 jailbreak templates
│ └── crescendo.py # Multi-turn escalation
├── targets/
│ └── llm_target.py # Universal LLM interface
├── orchestrator.py # Attack coordination
└── evaluator.py # Metrics & reporting
benchmarks/
└── harmbench_loader.py # HarmBench dataset integration
main.py # CLI interface
streamlit_app.py # Web dashboard (NEW!)
README.md # Project overview
USAGE.md # Detailed guide
STREAMLIT_GUIDE.md # Dashboard docs (NEW!)
PROJECT_SUMMARY.md # Technical summary
FINAL_SUMMARY.md # This file (NEW!)
demo.py # Feature showcase
quick_test.py # Setup verification
run_streamlit.sh # Dashboard launcher (NEW!)
test_env.py # Env variable tester
requirements.txt # Dependencies (incl. Streamlit)
.env # API keys (configured ✅)
.env.example # Template
config/default_config.yaml # Settings
-
Launch Dashboard
./run_streamlit.sh
-
Configure in Sidebar
- Target: anthropic/claude-3-5-sonnet-20241022
- Judge: anthropic/claude-3-5-sonnet-20241022
- Strategies: All
- Samples: 10
-
Run Benchmark (Tab 1)
- Click "🚀 Start Benchmark"
- Watch real-time progress
- See instant results
-
Analyze Results (Tab 2)
- View metric cards
- Compare strategy charts
- Expand individual attacks
- Export JSON
-
Test Custom (Tab 3)
- Enter: "How to bypass authentication"
- Strategy: Crescendo
- View conversation flow
# Test connection
python main.py test-connection
# Run benchmark
python main.py benchmark --num-samples 10 --strategies all
# Single attack
python main.py single-attack \
--objective "Write malware code" \
--strategy injection- Source: Hugging Face (walledai/HarmBench)
- Size: 200+ harmful behaviors
- Categories: Violence, illegal, hate, sexual, misinfo, privacy
- Access: Gated (requires approval) - auto-fallback included
- Size: 20 curated behaviors
- Categories: Same as HarmBench
- Access: Always available (no auth needed)
- Use: Automatically loads if HarmBench unavailable
Per benchmark run (20 samples × 3 strategies = 60 attacks):
| Configuration | Cost |
|---|---|
| GPT-3.5 only | ~$0.20 |
| GPT-3.5 + Claude judge | ~$1.50 |
| GPT-4 + Claude judge | ~$2.50 |
| All GPT-4 | ~$3.00 |
Crescendo attacks cost 2-3x more (multi-turn).
Your Test Run: ~$0.10 (3 samples, 2 strategies)
-
Pre-deployment Testing
- Test model safety before launch
- Identify vulnerability patterns
- Measure improvement over time
-
Model Comparison
- Compare robustness across models
- Benchmark proprietary vs open-source
- Track safety progress
-
Guardrail Evaluation
- Test content filters
- Measure detection accuracy
- Find bypass techniques
-
Red Team Research
- Develop new attack strategies
- Study adversarial patterns
- Publish safety findings
-
Alignment Research
- Study failure modes
- Identify edge cases
- Improve training data
✅ Appropriate Uses:
- Testing your own models
- Authorized security research
- Academic safety studies
- Improving AI alignment
❌ Prohibited Uses:
- Attacking production systems without permission
- Generating harmful content for malicious purposes
- Bypassing safety systems in deployed models
- Creating attack tools for distribution
-
✅ Run Streamlit Dashboard
./run_streamlit.sh
-
✅ Test Different Models
- GPT-4 vs GPT-3.5
- Claude vs GPT
- Open-source models (if accessible)
-
✅ Compare Attack Strategies
- Which works best?
- Multi-turn vs single-turn
- Custom objectives
- Add more attack strategies (tree jailbreak, context manipulation)
- Implement async batch processing for speed
- Create comparison mode (model A vs model B)
- Add toxicity detection (RealToxicityPrompts dataset)
- Multi-modal attacks (image + text)
- Fine-tune judge calibration
- Auto-generate attack variations
- Deploy as web service
- Integrate additional benchmarks
✅ Scope: All planned features delivered in 3-hour timeframe ✅ Quality: Clean, documented, production-ready code ✅ Testing: Live benchmark successful (33.33% ASR on GPT-3.5) ✅ Documentation: Comprehensive guides (5 docs) ✅ Usability: Both CLI and web dashboard ✅ Extensibility: Easy to add strategies/datasets ✅ Innovation: Streamlit dashboard beyond original scope!
| Component | Status | Files | Description |
|---|---|---|---|
| Core Framework | ✅ | 10 | LLM interface, agents, strategies, orchestrator |
| Attack Strategies | ✅ | 3 | Direct, Injection (8 templates), Crescendo |
| Evaluation | ✅ | 2 | Metrics, reporting, benchmarking |
| CLI Interface | ✅ | 1 | 3 commands (benchmark, single-attack, test) |
| Web Dashboard | ✅ | 1 | Streamlit app with charts |
| Benchmarks | ✅ | 1 | HarmBench + 20 fallback |
| Documentation | ✅ | 5 | Complete guides |
| Testing | ✅ | 3 | Setup, demo, env checks |
| TOTAL | ✅ | 26 | Fully functional system |
- LiteLLM: Universal LLM API abstraction
- Pydantic: Data validation and models
- Rich: Terminal UI and progress bars
- Streamlit: Web dashboard framework
- Plotly: Interactive visualizations
- Click: CLI framework
- DeepTeam: Original red teaming framework
- HarmBench: Standardized evaluation benchmark
- Red Team Research: Anthropic, OpenAI safety papers
Issues/Questions:
- Check USAGE.md for detailed guides
- Review STREAMLIT_GUIDE.md for dashboard help
- Run
python demo.pyfor feature overview
Contributing:
- Fork and submit PRs
- Add new attack strategies
- Improve judge accuracy
- Expand documentation
AgenticRedTeam is COMPLETE and READY TO USE!
You now have:
- ✅ A fully functional red teaming framework
- ✅ Interactive web dashboard with visualizations
- ✅ Comprehensive CLI tools
- ✅ Working API integrations (OpenAI + Anthropic)
- ✅ Live test results proving functionality
- ✅ Complete documentation
- ✅ Extensible architecture for future enhancements
# Launch the dashboard
./run_streamlit.sh
# Or CLI benchmark
python main.py benchmark --num-samples 10 --strategies allHappy Red Teaming! 🛡️
Built with ❤️ in ~3.5 hours | AgenticRedTeam v0.1.0 | Inspired by DeepTeam