This document explains the two deployment modes available in the FFmpeg RTMP Power Monitoring system and helps you choose the right one for your use case.
The system supports two distinct deployment modes:
- Local Testing Mode (Docker Compose) - For development and local testing
- Distributed Compute Mode (Master + Agents) - For production workloads
| Feature | Local Testing (Docker Compose) | Distributed Compute (Master + Agents) |
|---|---|---|
| Primary Use Case | Development, testing, demos | Production workloads, scaling |
| Deployment Complexity | Simple (one command) | Moderate (multiple nodes) |
| Resource Requirements | Single machine (4+ GB RAM) | Master + multiple compute nodes |
| Scalability | Limited to single machine | Horizontal scaling across nodes |
| Setup Time | 2-5 minutes | 10-15 minutes |
| Production Ready | No | Yes |
| Best For | Quick tests, feature development | Long-running benchmarks, multi-node workloads |
┌─────────────────────────────────────────────────────────┐
│ Single Host Machine (Development) │
│ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Docker Compose Stack │ │
│ │ │ │
│ │ ┌─────────────┐ ┌──────────────┐ ┌─────────┐ │ │
│ │ │ Victoria │ │ Grafana │ │ Nginx │ │ │
│ │ │ Metrics │ │ │ │ RTMP │ │ │
│ │ └─────────────┘ └──────────────┘ └─────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────┐ │ │
│ │ │ 12+ Exporters (CPU, GPU, Docker, etc) │ │ │
│ │ └─────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────┐ ┌──────────────┐ │ │
│ │ │ FFmpeg │ │ Alertmanager │ │ │
│ │ │ (Host) │ │ │ │ │
│ │ └─────────────┘ └──────────────┘ │ │
│ └───────────────────────────────────────────────────┘ │
│ │
│ All components run on localhost │
│ Network: Docker bridge (streaming-net) │
└─────────────────────────────────────────────────────────┘
Local Testing mode deploys all components on a single machine using Docker Compose. This mode is optimized for development and is NOT recommended for production workloads.
Use Local Testing Mode when:
- Developing new features or exporters
- Testing configuration changes
- Running quick performance tests
- Learning the system
- Creating demos or tutorials
- Debugging issues locally
Do NOT use Local Testing Mode for:
- Production workloads
- Long-running benchmarks (>1 hour)
- Multi-node scaling requirements
- High-availability deployments
- Resource-intensive transcoding jobs
- Docker 20.10+ and Docker Compose 2.0+
- Linux host with kernel 4.15+ (for RAPL power monitoring)
- 4+ GB RAM, 10+ GB free disk space
- Python 3.10+
- FFmpeg installed on host
1. Clone Repository
git clone https://github.com/psantana5/ffmpeg-rtmp.git
cd ffmpeg-rtmp2. Start Stack
# Start all services with one command
make up-build
# Or with GPU support (requires nvidia-container-toolkit)
make nvidia-up-build3. Verify Deployment
# Check all containers are running
docker compose ps
# Should see 12+ containers in "healthy" state4. Access Dashboards
- Grafana: http://localhost:3000 (admin/admin)
- VictoriaMetrics: http://localhost:8428
- Alertmanager: http://localhost:9093
5. Run Test
# Run a simple 60-second test
python3 scripts/run_tests.py single \
--name "test1" \
--bitrate 2000k \
--duration 606. View Results
- Open Grafana dashboard at http://localhost:3000
- Navigate to "Power Monitoring" dashboard
- View real-time metrics
# Stop all containers
make down
# Stop and remove volumes (clean slate)
docker compose down -v- Single machine only: Cannot distribute workloads across multiple nodes
- Resource contention: All services compete for CPU/RAM on one machine
- No high availability: If the machine fails, entire system goes down
- Limited scalability: Constrained by single machine's resources
- Not production-grade: No redundancy, no automatic failover
- Memory: ~1-2 GB (idle), ~2-4 GB (active test)
- CPU: ~5-10% (idle), ~60-420% (active test)
- Disk: ~3-6 GB initial, +50-110 MB per day
- Network: ~10-550 KB/s
┌────────────────────────────────────────────────────────────────┐
│ MASTER NODE │
│ (Coordination & Monitoring) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Master │ │ Victoria │ │ Grafana │ │
│ │ Service │ │ Metrics │ │ │ │
│ │ (Go HTTP) │ │ (TSDB) │ │ (Dashboards) │ │
│ │ │ │ │ │ │ │
│ │ Port 8080 │ │ Port 8428 │ │ Port 3000 │ │
│ └──────┬───────┘ └──────────────┘ └──────────────┘ │
│ │ │
│ │ REST API (HTTP/JSON) │
└─────────┼──────────────────────────────────────────────────────┘
│
│ Network (LAN/WAN)
│
┌─────────┴──────────────────────────────────────────────────────┐
│ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ COMPUTE NODE 1 │ │ COMPUTE NODE N │ │
│ │ │ │ │ │
│ │ ┌────────────┐ │ │ ┌────────────┐ │ │
│ │ │ Agent │ │ ... │ │ Agent │ │ │
│ │ │ (Go Binary)│ │ │ │ (Go Binary)│ │ │
│ │ └────────────┘ │ │ └────────────┘ │ │
│ │ │ │ │ │
│ │ When job runs: │ │ When job runs: │ │
│ │ ┌────────────┐ │ │ ┌────────────┐ │ │
│ │ │ FFmpeg │ │ │ │ FFmpeg │ │ │
│ │ │ Exporters │ │ │ │ Exporters │ │ │
│ │ │ Analyzer │ │ │ │ Analyzer │ │ │
│ │ └────────────┘ │ │ └────────────┘ │ │
│ └──────────────────┘ └──────────────────┘ │
│ │
│ Agents poll master for jobs (HTTP GET /jobs/next) │
│ Agents send heartbeats (HTTP POST /nodes/{id}/heartbeat) │
│ Agents submit results (HTTP POST /results) │
└────────────────────────────────────────────────────────────────┘
Distributed Compute mode separates the control plane (master) from the data plane (agents). The master node coordinates job distribution and aggregates results, while compute nodes execute workloads. This is the recommended mode for production deployments.
Use Distributed Compute Mode when:
- Running production workloads
- Scaling across multiple machines
- Executing long-running benchmarks (hours to days)
- Utilizing heterogeneous hardware (CPUs, GPUs, different specs)
- Requiring high availability (multiple agents)
- Optimizing resource utilization across fleet
- Running automated CI/CD pipelines
- Need to isolate compute from coordination
Do NOT use Distributed Compute Mode when:
- Quick local testing (use Local Testing instead)
- Single machine is sufficient
- Setup complexity is a concern
Master Node:
- Go 1.21+ (for building binaries)
- Docker 20.10+ and Docker Compose 2.0+ (for monitoring stack)
- 2+ GB RAM, 10+ GB disk
- Public IP or accessible hostname
- Open ports: 8080 (master API), 3000 (Grafana), 8428 (VictoriaMetrics)
Compute Nodes:
- Go 1.21+ (for building agent binary)
- Python 3.10+ (for analyzer scripts)
- FFmpeg with codec support
- 4+ GB RAM, 10+ GB disk
- Network connectivity to master node
- No inbound ports required (outbound only)
1. Clone and Build
git clone https://github.com/psantana5/ffmpeg-rtmp.git
cd ffmpeg-rtmp
# Build master binary
make build-master
# Creates ./bin/master2. Start Master Service
Option A: Manual Start (Development)
# Start in foreground
./bin/master --port 8080
# Or background with logging
./bin/master --port 8080 > /var/log/ffmpeg-master.log 2>&1 &Option B: Systemd Service (Production)
# Copy systemd service file
sudo cp deployment/ffmpeg-master.service /etc/systemd/system/
# Edit service file to set correct paths
sudo nano /etc/systemd/system/ffmpeg-master.service
# Enable and start service
sudo systemctl enable ffmpeg-master
sudo systemctl start ffmpeg-master
# Check status
sudo systemctl status ffmpeg-master3. Start Monitoring Stack
# Start VictoriaMetrics and Grafana
make vm-up-build4. Verify Master is Running
# Health check
curl http://localhost:8080/health
# Should return: {"status":"healthy"}
# Check registered nodes (initially empty)
curl http://localhost:8080/nodes1. Clone and Build
git clone https://github.com/psantana5/ffmpeg-rtmp.git
cd ffmpeg-rtmp
# Build agent binary
make build-agent
# Creates ./bin/agent2. Start Agent Service
Option A: Manual Start (Development)
# Replace MASTER_IP with your master node's IP or hostname
./bin/agent --register --master http://MASTER_IP:8080
# Agent will:
# - Auto-detect hardware (CPU, GPU, RAM)
# - Register with master
# - Start heartbeat loop (every 30s)
# - Start job polling loop (every 10s)Option B: Systemd Service (Production)
# Copy systemd service file
sudo cp deployment/ffmpeg-agent.service /etc/systemd/system/
# Edit service file to set correct paths and master URL
sudo nano /etc/systemd/system/ffmpeg-agent.service
# Enable and start service
sudo systemctl enable ffmpeg-agent
sudo systemctl start ffmpeg-agent
# Check status
sudo systemctl status ffmpeg-agent3. Verify Agent Registration
# On master node, check registered nodes
curl http://MASTER_IP:8080/nodes | jq
# Should see your agent with hardware info1. Submit Job to Master
curl -X POST http://MASTER_IP:8080/jobs \
-H "Content-Type: application/json" \
-d '{
"scenario": "1080p-h264",
"confidence": "auto",
"parameters": {
"duration": 300,
"bitrate": "5000k",
"resolution": "1920x1080"
}
}'
# Returns job_id2. Monitor Job Execution
# Check job status
curl http://MASTER_IP:8080/jobs | jq
# Agent will automatically pick up job and execute3. View Results
- Open Grafana: http://MASTER_IP:3000
- Check master logs for completed job results
- Query master API for job results (future feature)
Systemd provides:
- Automatic restart on failure
- Logging to journald
- Start on boot
- Resource limits
- Dependency management
See deployment/ directory for service templates.
IMPORTANT: The current implementation uses HTTP. For production:
- Deploy nginx reverse proxy with TLS termination
- Use Let's Encrypt for certificates
- Configure mTLS between master and agents (future)
Example nginx config:
server {
listen 443 ssl;
server_name master.example.com;
ssl_certificate /etc/letsencrypt/live/master.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/master.example.com/privkey.pem;
location / {
proxy_pass http://localhost:8080;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
}- Configure Alertmanager for critical alerts
- Set up alerts for:
- Master service down
- Agent heartbeat failures
- Job failures
- Resource exhaustion
- Integrate with PagerDuty, Slack, or email
Master Node:
# Backup VictoriaMetrics data
docker run --rm -v victoriametrics-data:/data -v $(pwd):/backup \
ubuntu tar czf /backup/vm-backup-$(date +%Y%m%d).tar.gz /data
# Backup Grafana dashboards
curl -u admin:admin http://localhost:3000/api/dashboards/uid/power_monitoring \
> dashboard-backup-$(date +%Y%m%d).jsonScheduled Backups:
# Add to cron
0 2 * * * /opt/ffmpeg-rtmp/scripts/backup.shSet systemd resource limits in service files:
[Service]
# Limit memory
MemoryLimit=4G
MemoryAccounting=yes
# Limit CPU
CPUQuota=400% # 4 cores max
# Restart policy
Restart=always
RestartSec=10sMaster Node:
# Allow master API, Grafana, VictoriaMetrics
sudo ufw allow 8080/tcp comment 'FFmpeg Master API'
sudo ufw allow 3000/tcp comment 'Grafana'
sudo ufw allow 8428/tcp comment 'VictoriaMetrics'
# If using SSH for management
sudo ufw allow 22/tcp
# Enable firewall
sudo ufw enableCompute Nodes:
# No inbound ports required (agents initiate connections)
# Only allow SSH if needed for management
sudo ufw allow 22/tcp
sudo ufw enableSystemd Journal:
# View master logs
sudo journalctl -u ffmpeg-master -f
# View agent logs
sudo journalctl -u ffmpeg-agent -f
# Set log retention
sudo journalctl --vacuum-time=7dCentralized Logging (Optional):
- Ship logs to ELK stack, Splunk, or Loki
- Use fluentd or filebeat for log forwarding
Adding More Agents:
- Deploy agent to new machine
- Start agent with
--registerflag - Agent automatically joins pool
- Master distributes jobs across all available agents
Horizontal Scaling Benefits:
- Increased throughput (more parallel jobs)
- Hardware diversity (mix CPUs, GPUs)
- Fault tolerance (one agent down doesn't affect others)
- Load distribution (master balances work)
Example: 10-Node Cluster
- 1 Master node (2 GB RAM)
- 10 Compute nodes (4 GB RAM each)
- Total capacity: 10 concurrent jobs
- Total cost: ~$50-100/month (cloud VMs)
- Latency: <100ms between agents and master recommended
- Bandwidth: ~1-10 KB/s per agent (heartbeats + job metadata)
- Connectivity: Agents need outbound HTTP access to master
- Firewall: No inbound ports needed on agents
Update Master:
# Pull latest code
cd /opt/ffmpeg-rtmp
git pull
# Rebuild master
make build-master
# Restart service
sudo systemctl restart ffmpeg-masterUpdate Agent:
# Pull latest code
cd /opt/ffmpeg-rtmp
git pull
# Rebuild agent
make build-agent
# Restart service
sudo systemctl restart ffmpeg-agentZero-Downtime Updates (Future):
- Update agents one at a time
- Running jobs complete before restart
- Master continues serving other agents
Master Issues:
# Check if master is running
curl http://localhost:8080/health
# Check logs
sudo journalctl -u ffmpeg-master -n 100
# Check port binding
sudo lsof -i :8080
# Restart master
sudo systemctl restart ffmpeg-masterAgent Issues:
# Check agent logs
sudo journalctl -u ffmpeg-agent -n 100
# Test connectivity to master
curl http://MASTER_IP:8080/health
# Check agent process
ps aux | grep agent
# Restart agent
sudo systemctl restart ffmpeg-agentNetwork Issues:
# Test connectivity
ping MASTER_IP
# Test port reachability
telnet MASTER_IP 8080
# Check firewall rules
sudo ufw status
# Check DNS resolution
nslookup master.example.com- You're developing features
- You need quick tests
- Single machine is enough
- Minimal setup complexity desired
- Running production workloads
- Need to scale horizontally
- Long-running benchmarks
- High availability required
- Resource optimization needed
If you've been using Local Testing mode and want to migrate to Distributed:
1. Extract Configuration
- Export Grafana dashboards
- Save custom pricing_config.json
- Backup test_results/ directory
- Document any custom modifications
2. Deploy Master Node
- Follow master setup steps above
- Import Grafana dashboards
- Copy pricing_config.json
3. Deploy Compute Nodes
- Follow agent setup steps above
- Register agents with master
- Verify registration
4. Migrate Workloads
- Convert
run_tests.pycommands to job submissions - Submit jobs via master API instead of running locally
- Monitor via master Grafana dashboards
5. Deprecate Local Stack
# Stop local Docker Compose stack
make down
# Optionally remove volumes
docker compose down -vYou can keep both modes available:
- Use Local Testing for development
- Use Distributed for production benchmarks
- Same codebase, different deployment methods
File: /etc/systemd/system/ffmpeg-master.service
[Unit]
Description=FFmpeg RTMP Master Service
After=network.target
[Service]
Type=simple
User=ffmpeg
Group=ffmpeg
WorkingDirectory=/opt/ffmpeg-rtmp
ExecStart=/opt/ffmpeg-rtmp/bin/master --port 8080
Restart=always
RestartSec=10s
# Resource limits
MemoryLimit=2G
CPUQuota=200%
# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=ffmpeg-master
[Install]
WantedBy=multi-user.targetFile: /etc/systemd/system/ffmpeg-agent.service
[Unit]
Description=FFmpeg RTMP Agent Service
After=network.target
[Service]
Type=simple
User=ffmpeg
Group=ffmpeg
WorkingDirectory=/opt/ffmpeg-rtmp
Environment="MASTER_URL=http://master.example.com:8080"
ExecStart=/opt/ffmpeg-rtmp/bin/agent --register --master ${MASTER_URL}
Restart=always
RestartSec=10s
# Resource limits
MemoryLimit=4G
CPUQuota=800%
# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=ffmpeg-agent
[Install]
WantedBy=multi-user.target# Create service user
sudo useradd -r -s /bin/false ffmpeg
# Clone repository to /opt
sudo git clone https://github.com/psantana5/ffmpeg-rtmp.git /opt/ffmpeg-rtmp
sudo chown -R ffmpeg:ffmpeg /opt/ffmpeg-rtmp
# Build binaries
cd /opt/ffmpeg-rtmp
sudo -u ffmpeg make build-master # or build-agent
# Install service file
sudo cp deployment/ffmpeg-master.service /etc/systemd/system/
# Customize service file
sudo nano /etc/systemd/system/ffmpeg-master.service
# Reload systemd
sudo systemctl daemon-reload
# Enable and start
sudo systemctl enable ffmpeg-master
sudo systemctl start ffmpeg-master
# Check status
sudo systemctl status ffmpeg-masterA: Technically yes, but not recommended. Each agent should have dedicated resources. If you need to maximize single-machine utilization, use Local Testing mode instead.
A: In v1.0, there's no automatic failover. Agents will log connection errors and continue retrying. Jobs in the queue are lost. For high availability, consider:
- Running master in a container with restart policy
- Using systemd auto-restart
- Planning for multi-master support in v1.5+
A: Yes! That's a key benefit. The master tracks each agent's hardware capabilities (CPU, GPU, RAM). Future versions will use this for intelligent job scheduling.
A: In v1.0, updates require restart. Best practice:
- Update agents one at a time
- Wait for running jobs to complete
- Restart agent service
- Verify reconnection to master
A: Not recommended. Choose one mode per environment. However, you can run distributed mode with the master using Docker Compose for its monitoring stack (VictoriaMetrics + Grafana).
A: Not formally tested. The in-memory data structures should handle hundreds of agents. Bottlenecks:
- Master CPU for job dispatch
- Master memory for node registry
- Network bandwidth for heartbeats
Optimize by increasing heartbeat interval if >100 agents.
- Internal Architecture Reference - Complete runtime model
- Distributed Architecture v1 - Detailed distributed design
- Getting Started Guide - Initial setup
- Contributing Guide - Development workflow
Version: 1.0
Last Updated: 2025-12-30
Maintainers: See CONTRIBUTING.md