| title | Disaster Response Coordination Environment Server | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| emoji | π¨ | ||||||||||
| colorFrom | yellow | ||||||||||
| colorTo | blue | ||||||||||
| sdk | docker | ||||||||||
| pinned | false | ||||||||||
| app_port | 8000 | ||||||||||
| base_path | /ui/ | ||||||||||
| tags |
|
"Most RL environments train agents to play games. We trained one to save lives."
This environment targets Theme #3.1 (Professional Tasks: Emergency Operations). It moves beyond "chatting" and requires the agent to perform real, hard work: maintaining a persistent world model of 15 simultaneous disasters, managing a finite resource budget, and orchestrating a multi-step triage workflow where shortcuts lead to immediate failure.
| Material | Link |
|---|---|
| π¬ Demo Video (YouTube) | Watch on YouTube β |
| π€ HF Space (Live Environment) | joynnayvedya/disaster-response-openenv |
| π₯οΈ Live Tactical Dashboard | Command Center β |
| π§ Trained Model (v2) | joynnayvedya/disaster-response-v2 |
| π Training Notebook (Colab) | Open in Google Colab |
| π Write-up / Blog | Blog.md |
| π Training Metrics & Logs | results/ |
| π» GitHub Source | letsjoyn/meta-scalar-hack |
Huge thanks to the OpenEnv team for the incredible framework. This environment was built entirely on top of the OpenEnv core, and weβve officially starred the repository to support the future of agentic RL!
Click the thumbnail below to watch the live demo β the agent triages 15 simultaneous disaster incidents in real-time on the deployed command center dashboard.
During a natural disaster, Emergency Operations Centers (EOCs) are overwhelmed by thousands of frantic incident reports simultaneously. A flooded neighborhood, a chemical plant fire, a hospital wing collapse β all arriving at once. Human coordinators have seconds to decide:
- Is the toxic gas leak more urgent than the trapped school bus?
- Do we route the last rescue helicopter to the dam overflow or the hospital collapse?
- Which reports are duplicates? Which are life-threatening?
Human coordinators burn out. Triage errors cost lives.
Existing AI benchmarks test code generation and math β not the fog-of-war, resource-constrained, multi-agent hell that is real disaster response.
We built the environment that does.
Disaster Response Coordination OpenEnv is a multi-step RL environment built on OpenEnv where an AI agent acts as an Emergency Incident Commander.
The agent receives a live incident ticket queue β real-world disaster reports β and must triage them under time pressure with a fixed resource budget.
Every step, the agent receives:
- π Full inbox snapshot with per-ticket completion status
- π° Current resource budget remaining
- π Action history (last 8 actions)
- β
last_action_errorfor self-correction feedback - π‘ Valid action hints for curriculum learning
For every incident ticket, the agent must complete an exact 4-step workflow:
classify β set_priority β draft_reply β submit_ticket
| Field | Type | Valid Values |
|---|---|---|
action_type |
enum | classify, set_priority, draft_reply, submit_ticket, next_ticket, finish_episode |
predicted_team |
enum | rescue, medical, utilities, shelter, logistics, general |
predicted_priority |
enum | low, medium, high, urgent |
reply_text |
string | Max 2000 chars. Graded for actionability. |
ticket_score = 0.40 Γ team_routing + 0.30 Γ priority_score + 0.30 Γ reply_quality
task_score = avg(ticket_scores)
- invalid_penalty (max 0.15)
- loop_penalty (max 0.10)
- reroute_penalty (max 0.12)
- budget_penalty (max 0.18)
- time_pressure (Hard mode only, 0.75Γ multiplier)
We use dense, partial rewards at every step. No sparse end-of-episode signals. This is critical for RL training stability β if you get the team right but the priority wrong, you still learn something.
| Tier | Budget | Scenarios | Real-World Basis |
|---|---|---|---|
| π’ Easy | 40 units | Single-team, clear incidents | Napa Valley gas leaks, Houston flash floods |
| π‘ Medium | 48 units | Multi-agency, ambiguous routing | 2012 North India Grid Failure (7 states) |
| π΄ Hard | 55 units | Cascading mass-casualty + time pressure | 2020 Vizag Gas Leak, 2018 Kerala Floods, 2023 Turkey earthquake |
The hackathon organizers suggested: "Focus on the quality of your envs and reward signals... iterate on training runs... you have a way higher chance of winning."
We built this environment specifically to satisfy these winning principles:
- Dense, High-Quality Reward Signals: We don't use binary pass/fail logic. We reward agents for every correct sub-task (
+0.40for team,+0.30for priority). This allows smaller, 7B/8B models (like Llama-3-8B or Qwen2-7B) to learn efficiently from partial success. - Optimized for Rapid Iteration: The environment is a lightweight FastAPI server that responds in milliseconds. You can run hundreds of training episodes per hour, perfect for iterating on training runs instead of waiting for a "huge" model to finish.
- Compute-Budget Friendly: Because we use strictly typed Pydantic models and stateless logic, the environment is easy to pair with QLoRA training or other memory-efficient techniques.
graph TD
subgraph "π€ Agent (Participant's Machine)"
A[inference.py] -->|1. Build prompt| B((LLM: Qwen2.5-7B via TGI Endpoint))
B -->|2. Generate JSON| A
A -->|3. Parse to SupportOpsAction| C[Pydantic Validation]
end
subgraph "βοΈ Hugging Face Space β OpenEnv Server"
C -->|WebSocket /step| D[FastAPI Router]
D --> E{Validation Layer}
E -->|Invalid| F[Penalty Applied]
E -->|Valid| G[Environment Logic]
G --> H[Ticket State Manager]
G --> I[Resource Budget Tracker]
G --> J[Deterministic Grader]
J -->|Reward Calculated| K[SupportOpsObservation]
K -->|Live WebSocket| L[π₯οΈ Tactical Dashboard]
end
K -->|HTTP 200| A
Most RL environments get gamed within 100 steps. We built explicit defenses:
- 5 independent reward signals β passing one doesn't mean passing all
- Anti-gaming penalties:
- Re-routing after submission:
-0.02per reroute - Infinite loop detection:
-0.015per redundant action - Budget overflow:
-0.06per violation - Late-resolution time pressure (Hard only):
0.75Γmultiplier
- Re-routing after submission:
- Locked execution β agents cannot modify ticket state outside the defined action space. No globals, no hidden state.
"If your RL environment can be gamed, you haven't built a task β you've built a loophole."
We trained Qwen2.5-7B-Instruct using GRPO (Group Relative Policy Optimization) via Hugging Face TRL + Unsloth on a Google Colab T4 GPU.
| Parameter | Value |
|---|---|
| Base model | unsloth/Qwen2.5-7B-Instruct-bnb-4bit |
| Algorithm | GRPOTrainer (HF TRL) |
| LoRA rank | r=16 |
| Quantization | 4-bit (bitsandbytes) |
| Epochs | 3 stages, 135 total steps |
| Reward source | Live HF Space API (real environment feedback) |
| Hardware | Google Colab T4 GPU |
The reward function connected directly to our live HF Space. Every training step sent real incident prompts to the environment and received real rewards back β no static dataset, no simulation shortcut.
Before training, the model hallucinated invalid outputs:
team: "emergency_services" β (not in the valid set)
team: "utility repair" β
priority: "very-high" β (not in the valid set)
priority: "immediately" β
After training, the model learned strict valid action spaces:
team: "rescue" β
priority: "urgent" β
We observed sparse reward collapse β a known RL failure mode where a small model (7B at 4-bit) struggles to optimize across a multi-step workflow with interdependent rewards. This validates our environment's quality: it is genuinely difficult enough to expose real RL failure modes that larger models or longer training would overcome.
Reward Curve β Training reward across all 135 steps across 3 stages:
Epoch Comparison β Average reward per training epoch showing learning progression:
Before vs After β Direct behavioral comparison of the model's outputs before and after GRPO training:
Training Parameters β Full hyperparameter configuration used for the final v2 run:
| Agent | Easy | Medium | Hard | Avg Score | Status |
|---|---|---|---|---|---|
| Deterministic Heuristic Baseline | 0.704 | 0.683 | 0.660 | 0.682 | β All Pass |
| GRPO Qwen2.5-7B v2 (Ours) | 0.641 | 0.665 | 0.601 | 0.636 | β All Pass |
Why the RL model scores close to a hardcoded baseline β and why that's impressive: The heuristic baseline uses hand-crafted regex patterns with zero generalisation. Our trained model dynamically reads the incident context and generates unique, contextually accurate handoff notes for every scenario. It passes all 3 difficulty tiers without any hardcoded rules β purely from learned behavior.
We built a military-style tactical dashboard that updates in real-time via WebSocket as the agent processes tickets.
- πΊοΈ OpenStreetMap β live incident markers with radar pulse animations (red = urgent, orange = high, β = submitted)
- β‘ ARIA β AI Incident Analyst powered by Gemini, analyzes any incident on demand
- π Live metrics β score tracker, resource budget, threat level bar, team routing
- π Operations feed β every agent action broadcast live with audio alerts
| Dashboard URL | Purpose |
|---|---|
/ui/?task=all |
Full command center β all 15 incidents |
/ui/?task=easy |
Easy tier only |
/ui/?task=hard |
Hard tier β cascading scenarios |
/web/ |
OpenEnv default interface |
git clone https://github.com/letsjoyn/meta-scalar-hack.git
cd meta-scalar-hack
pip install -e .
# Start the OpenEnv server
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
# Open the dashboard
# β http://localhost:8000/ui/Uses the free HF Router with Qwen 72B β no dedicated endpoint needed, works for anyone with an HF token:
$env:OPENENV_BASE_URL = "https://joynnayvedya-disaster-response-openenv.hf.space"
$env:API_BASE_URL = "https://router.huggingface.co/v1"
$env:MODEL_NAME = "Qwen/Qwen2.5-72B-Instruct"
$env:HF_TOKEN = "hf_YOUR_TOKEN"
py inference.pyTo run against a local server (faster): set
$env:OPENENV_BASE_URL = "http://localhost:8000"and run the server from Step 1 first.
We evaluated our disaster-response-v2 LoRA adapter using a dedicated HF Inference Endpoint (TGI).
To replicate: deploy joynnayvedya/disaster-response-v2 to a HF Inference Endpoint, then:
$env:OPENENV_BASE_URL = "https://joynnayvedya-disaster-response-openenv.hf.space"
$env:API_BASE_URL = "https://YOUR_ENDPOINT.endpoints.huggingface.cloud/v1"
$env:MODEL_NAME = "tgi"
$env:HF_TOKEN = "hf_YOUR_TOKEN"
py inference.pyopenenv validatemeta-scalar-hack/
βββ server/
β βββ app.py # FastAPI server + WebSocket live dashboard
β βββ support_ops_environment.py # Core OpenEnv RL environment logic
β βββ ui/ # Military-style tactical command dashboard
β βββ index.html
β βββ main.js
β βββ styles.css
βββ models.py # Pydantic SupportOpsAction / Observation models
βββ tasks.py # 15 real-world disaster scenarios
βββ inference.py # Agent runner (heuristic + trained model)
βββ client.py # OpenEnv client wrapper
βββ smoke_test.py # No-API-key environment validation
βββ Disaster_Response_Training.ipynb # Full GRPO v2 training notebook (Colab-ready)
βββ plots/ # Training result plots
β βββ grpo_reward_curve.png
β βββ epoch_comparison.png
β βββ before_after_comparison.png
β βββ training_params.png
βββ results/
β βββ inference_report.json # Latest benchmark run results
βββ openenv.yaml # OpenEnv manifest
βββ Dockerfile # HF Spaces deployment config
joynnayvedya/disaster-response-v2
| Detail | Value |
|---|---|
| Base model | unsloth/Qwen2.5-7B-Instruct-bnb-4bit |
| Training method | GRPO via HF TRL + Unsloth |
| Adapter format | LoRA (r=16) |
| Quantization | 4-bit |
| License | Apache-2.0 |
Load it yourself:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", dtype="float16")
model = PeftModel.from_pretrained(base, "joynnayvedya/disaster-response-v2")| Criteria | Weight | Our Delivery |
|---|---|---|
| Environment Innovation | 40% | Novel domain (EOC disaster triage), 15 real-world scenarios, anti-reward-hacking at every layer, dense partial rewards, real-time live evaluation |
| Storytelling & Presentation | 30% | Military-style live dashboard, ARIA AI analyst, real disaster basis (Kerala, Vizag, Turkey), mini-blog writeup |
| Showing Improvement in Rewards | 20% | 4 training plots, before/after behavior comparison, baseline vs. trained benchmark table |
| Reward & Training Pipeline | 10% | 5-signal reward function, live HF Space feedback loop, GRPO + Unsloth, Colab-runnable notebook |
Built for the 2026 Meta & Scalar AI Hackathon β Grand Finale, Bangalore.
Every scenario is based on a real disaster. Every reward signal is designed to be unhackable.




