Deployment Modes

This document explains the two deployment modes available in the FFmpeg RTMP Power Monitoring system and helps you choose the right one for your use case.

Overview

The system supports two distinct deployment modes:

Local Testing Mode (Docker Compose) - For development and local testing
Distributed Compute Mode (Master + Agents) - For production workloads

Quick Comparison

Feature	Local Testing (Docker Compose)	Distributed Compute (Master + Agents)
Primary Use Case	Development, testing, demos	Production workloads, scaling
Deployment Complexity	Simple (one command)	Moderate (multiple nodes)
Resource Requirements	Single machine (4+ GB RAM)	Master + multiple compute nodes
Scalability	Limited to single machine	Horizontal scaling across nodes
Setup Time	2-5 minutes	10-15 minutes
Production Ready	No	Yes
Best For	Quick tests, feature development	Long-running benchmarks, multi-node workloads

Mode 1: Local Testing (Docker Compose)

Architecture Diagram

┌─────────────────────────────────────────────────────────┐
│              Single Host Machine (Development)          │
│                                                         │
│  ┌───────────────────────────────────────────────────┐ │
│  │         Docker Compose Stack                      │ │
│  │                                                   │ │
│  │  ┌─────────────┐  ┌──────────────┐  ┌─────────┐ │ │
│  │  │ Victoria    │  │   Grafana    │  │ Nginx   │ │ │
│  │  │ Metrics     │  │              │  │ RTMP    │ │ │
│  │  └─────────────┘  └──────────────┘  └─────────┘ │ │
│  │                                                   │ │
│  │  ┌─────────────────────────────────────────────┐ │ │
│  │  │     12+ Exporters (CPU, GPU, Docker, etc)   │ │ │
│  │  └─────────────────────────────────────────────┘ │ │
│  │                                                   │ │
│  │  ┌─────────────┐  ┌──────────────┐              │ │
│  │  │ FFmpeg      │  │ Alertmanager │              │ │
│  │  │ (Host)      │  │              │              │ │
│  │  └─────────────┘  └──────────────┘              │ │
│  └───────────────────────────────────────────────────┘ │
│                                                         │
│  All components run on localhost                        │
│  Network: Docker bridge (streaming-net)                 │
└─────────────────────────────────────────────────────────┘

Description

Local Testing mode deploys all components on a single machine using Docker Compose. This mode is optimized for development and is NOT recommended for production workloads.

When to Use

Use Local Testing Mode when:

Developing new features or exporters
Testing configuration changes
Running quick performance tests
Learning the system
Creating demos or tutorials
Debugging issues locally

Do NOT use Local Testing Mode for:

Production workloads
Long-running benchmarks (>1 hour)
Multi-node scaling requirements
High-availability deployments
Resource-intensive transcoding jobs

Prerequisites

Docker 20.10+ and Docker Compose 2.0+
Linux host with kernel 4.15+ (for RAPL power monitoring)
4+ GB RAM, 10+ GB free disk space
Python 3.10+
FFmpeg installed on host

Setup Steps

1. Clone Repository

git clone https://github.com/psantana5/ffmpeg-rtmp.git
cd ffmpeg-rtmp

2. Start Stack

# Start all services with one command
make up-build

# Or with GPU support (requires nvidia-container-toolkit)
make nvidia-up-build

3. Verify Deployment

# Check all containers are running
docker compose ps

# Should see 12+ containers in "healthy" state

4. Access Dashboards

Grafana: http://localhost:3000 (admin/admin)
VictoriaMetrics: http://localhost:8428
Alertmanager: http://localhost:9093

5. Run Test

# Run a simple 60-second test
python3 scripts/run_tests.py single \
  --name "test1" \
  --bitrate 2000k \
  --duration 60

6. View Results

Open Grafana dashboard at http://localhost:3000
Navigate to "Power Monitoring" dashboard
View real-time metrics

Stopping the Stack

# Stop all containers
make down

# Stop and remove volumes (clean slate)
docker compose down -v

Limitations

Single machine only: Cannot distribute workloads across multiple nodes
Resource contention: All services compete for CPU/RAM on one machine
No high availability: If the machine fails, entire system goes down
Limited scalability: Constrained by single machine's resources
Not production-grade: No redundancy, no automatic failover

Resource Usage

Memory: ~1-2 GB (idle), ~2-4 GB (active test)
CPU: ~5-10% (idle), ~60-420% (active test)
Disk: ~3-6 GB initial, +50-110 MB per day
Network: ~10-550 KB/s

Mode 2: Distributed Compute (Master + Agents)

Architecture Diagram

┌────────────────────────────────────────────────────────────────┐
│                      MASTER NODE                               │
│                  (Coordination & Monitoring)                    │
│                                                                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │   Master     │  │ Victoria     │  │   Grafana    │       │
│  │   Service    │  │ Metrics      │  │              │       │
│  │  (Go HTTP)   │  │   (TSDB)     │  │ (Dashboards) │       │
│  │              │  │              │  │              │       │
│  │  Port 8080   │  │  Port 8428   │  │  Port 3000   │       │
│  └──────┬───────┘  └──────────────┘  └──────────────┘       │
│         │                                                      │
│         │ REST API (HTTP/JSON)                                │
└─────────┼──────────────────────────────────────────────────────┘
          │
          │ Network (LAN/WAN)
          │
┌─────────┴──────────────────────────────────────────────────────┐
│                                                                │
│  ┌──────────────────┐         ┌──────────────────┐           │
│  │  COMPUTE NODE 1  │         │  COMPUTE NODE N  │           │
│  │                  │         │                  │           │
│  │  ┌────────────┐  │         │  ┌────────────┐  │           │
│  │  │   Agent    │  │   ...   │  │   Agent    │  │           │
│  │  │ (Go Binary)│  │         │  │ (Go Binary)│  │           │
│  │  └────────────┘  │         │  └────────────┘  │           │
│  │                  │         │                  │           │
│  │  When job runs:  │         │  When job runs:  │           │
│  │  ┌────────────┐  │         │  ┌────────────┐  │           │
│  │  │  FFmpeg    │  │         │  │  FFmpeg    │  │           │
│  │  │  Exporters │  │         │  │  Exporters │  │           │
│  │  │  Analyzer  │  │         │  │  Analyzer  │  │           │
│  │  └────────────┘  │         │  └────────────┘  │           │
│  └──────────────────┘         └──────────────────┘           │
│                                                                │
│  Agents poll master for jobs (HTTP GET /jobs/next)            │
│  Agents send heartbeats (HTTP POST /nodes/{id}/heartbeat)     │
│  Agents submit results (HTTP POST /results)                   │
└────────────────────────────────────────────────────────────────┘

Description

Distributed Compute mode separates the control plane (master) from the data plane (agents). The master node coordinates job distribution and aggregates results, while compute nodes execute workloads. This is the recommended mode for production deployments.

When to Use

Use Distributed Compute Mode when:

Running production workloads
Scaling across multiple machines
Executing long-running benchmarks (hours to days)
Utilizing heterogeneous hardware (CPUs, GPUs, different specs)
Requiring high availability (multiple agents)
Optimizing resource utilization across fleet
Running automated CI/CD pipelines
Need to isolate compute from coordination

Do NOT use Distributed Compute Mode when:

Quick local testing (use Local Testing instead)
Single machine is sufficient
Setup complexity is a concern

Prerequisites

Master Node:

Go 1.21+ (for building binaries)
Docker 20.10+ and Docker Compose 2.0+ (for monitoring stack)
2+ GB RAM, 10+ GB disk
Public IP or accessible hostname
Open ports: 8080 (master API), 3000 (Grafana), 8428 (VictoriaMetrics)

Compute Nodes:

Go 1.21+ (for building agent binary)
Python 3.10+ (for analyzer scripts)
FFmpeg with codec support
4+ GB RAM, 10+ GB disk
Network connectivity to master node
No inbound ports required (outbound only)

Setup Steps

Master Node Setup

1. Clone and Build

git clone https://github.com/psantana5/ffmpeg-rtmp.git
cd ffmpeg-rtmp

# Build master binary
make build-master
# Creates ./bin/master

2. Start Master Service

Option A: Manual Start (Development)

# Start in foreground
./bin/master --port 8080

# Or background with logging
./bin/master --port 8080 > /var/log/ffmpeg-master.log 2>&1 &

Option B: Systemd Service (Production)

# Copy systemd service file
sudo cp deployment/ffmpeg-master.service /etc/systemd/system/

# Edit service file to set correct paths
sudo nano /etc/systemd/system/ffmpeg-master.service

# Enable and start service
sudo systemctl enable ffmpeg-master
sudo systemctl start ffmpeg-master

# Check status
sudo systemctl status ffmpeg-master

3. Start Monitoring Stack

# Start VictoriaMetrics and Grafana
make vm-up-build

4. Verify Master is Running

# Health check
curl http://localhost:8080/health

# Should return: {"status":"healthy"}

# Check registered nodes (initially empty)
curl http://localhost:8080/nodes

Compute Node Setup

1. Clone and Build

git clone https://github.com/psantana5/ffmpeg-rtmp.git
cd ffmpeg-rtmp

# Build agent binary
make build-agent
# Creates ./bin/agent

2. Start Agent Service

Option A: Manual Start (Development)

# Replace MASTER_IP with your master node's IP or hostname
./bin/agent --register --master http://MASTER_IP:8080

# Agent will:
# - Auto-detect hardware (CPU, GPU, RAM)
# - Register with master
# - Start heartbeat loop (every 30s)
# - Start job polling loop (every 10s)

Option B: Systemd Service (Production)

# Copy systemd service file
sudo cp deployment/ffmpeg-agent.service /etc/systemd/system/

# Edit service file to set correct paths and master URL
sudo nano /etc/systemd/system/ffmpeg-agent.service

# Enable and start service
sudo systemctl enable ffmpeg-agent
sudo systemctl start ffmpeg-agent

# Check status
sudo systemctl status ffmpeg-agent

3. Verify Agent Registration

# On master node, check registered nodes
curl http://MASTER_IP:8080/nodes | jq

# Should see your agent with hardware info

Submit and Run Jobs

1. Submit Job to Master

curl -X POST http://MASTER_IP:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "scenario": "1080p-h264",
    "confidence": "auto",
    "parameters": {
      "duration": 300,
      "bitrate": "5000k",
      "resolution": "1920x1080"
    }
  }'

# Returns job_id

2. Monitor Job Execution

# Check job status
curl http://MASTER_IP:8080/jobs | jq

# Agent will automatically pick up job and execute

3. View Results

Open Grafana: http://MASTER_IP:3000
Check master logs for completed job results
Query master API for job results (future feature)

Production Deployment Recommendations

1. Use Systemd Services

Systemd provides:

Automatic restart on failure
Logging to journald
Start on boot
Resource limits
Dependency management

See deployment/ directory for service templates.

2. Configure TLS/HTTPS

IMPORTANT: The current implementation uses HTTP. For production:

Deploy nginx reverse proxy with TLS termination
Use Let's Encrypt for certificates
Configure mTLS between master and agents (future)

Example nginx config:

server {
    listen 443 ssl;
    server_name master.example.com;
    
    ssl_certificate /etc/letsencrypt/live/master.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/master.example.com/privkey.pem;
    
    location / {
        proxy_pass http://localhost:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

3. Set Up Monitoring and Alerting

Configure Alertmanager for critical alerts
Set up alerts for:
- Master service down
- Agent heartbeat failures
- Job failures
- Resource exhaustion
Integrate with PagerDuty, Slack, or email

4. Backup and Disaster Recovery

Master Node:

# Backup VictoriaMetrics data
docker run --rm -v victoriametrics-data:/data -v $(pwd):/backup \
  ubuntu tar czf /backup/vm-backup-$(date +%Y%m%d).tar.gz /data

# Backup Grafana dashboards
curl -u admin:admin http://localhost:3000/api/dashboards/uid/power_monitoring \
  > dashboard-backup-$(date +%Y%m%d).json

Scheduled Backups:

# Add to cron
0 2 * * * /opt/ffmpeg-rtmp/scripts/backup.sh

5. Resource Limits

Set systemd resource limits in service files:

[Service]
# Limit memory
MemoryLimit=4G
MemoryAccounting=yes

# Limit CPU
CPUQuota=400%  # 4 cores max

# Restart policy
Restart=always
RestartSec=10s

6. Firewall Configuration

Master Node:

# Allow master API, Grafana, VictoriaMetrics
sudo ufw allow 8080/tcp comment 'FFmpeg Master API'
sudo ufw allow 3000/tcp comment 'Grafana'
sudo ufw allow 8428/tcp comment 'VictoriaMetrics'

# If using SSH for management
sudo ufw allow 22/tcp

# Enable firewall
sudo ufw enable

Compute Nodes:

# No inbound ports required (agents initiate connections)
# Only allow SSH if needed for management
sudo ufw allow 22/tcp
sudo ufw enable

7. Log Management

Systemd Journal:

# View master logs
sudo journalctl -u ffmpeg-master -f

# View agent logs
sudo journalctl -u ffmpeg-agent -f

# Set log retention
sudo journalctl --vacuum-time=7d

Centralized Logging (Optional):

Ship logs to ELK stack, Splunk, or Loki
Use fluentd or filebeat for log forwarding

Scaling Compute Nodes

Adding More Agents:

Deploy agent to new machine
Start agent with --register flag
Agent automatically joins pool
Master distributes jobs across all available agents

Horizontal Scaling Benefits:

Increased throughput (more parallel jobs)
Hardware diversity (mix CPUs, GPUs)
Fault tolerance (one agent down doesn't affect others)
Load distribution (master balances work)

Example: 10-Node Cluster

1 Master node (2 GB RAM)
10 Compute nodes (4 GB RAM each)
Total capacity: 10 concurrent jobs
Total cost: ~$50-100/month (cloud VMs)

Network Requirements

Latency: <100ms between agents and master recommended
Bandwidth: ~1-10 KB/s per agent (heartbeats + job metadata)
Connectivity: Agents need outbound HTTP access to master
Firewall: No inbound ports needed on agents

Maintenance Operations

Update Master:

# Pull latest code
cd /opt/ffmpeg-rtmp
git pull

# Rebuild master
make build-master

# Restart service
sudo systemctl restart ffmpeg-master

Update Agent:

# Pull latest code
cd /opt/ffmpeg-rtmp
git pull

# Rebuild agent
make build-agent

# Restart service
sudo systemctl restart ffmpeg-agent

Zero-Downtime Updates (Future):

Update agents one at a time
Running jobs complete before restart
Master continues serving other agents

Troubleshooting

Master Issues:

# Check if master is running
curl http://localhost:8080/health

# Check logs
sudo journalctl -u ffmpeg-master -n 100

# Check port binding
sudo lsof -i :8080

# Restart master
sudo systemctl restart ffmpeg-master

Agent Issues:

# Check agent logs
sudo journalctl -u ffmpeg-agent -n 100

# Test connectivity to master
curl http://MASTER_IP:8080/health

# Check agent process
ps aux | grep agent

# Restart agent
sudo systemctl restart ffmpeg-agent

Network Issues:

# Test connectivity
ping MASTER_IP

# Test port reachability
telnet MASTER_IP 8080

# Check firewall rules
sudo ufw status

# Check DNS resolution
nslookup master.example.com

Comparison Summary

Use Local Testing Mode if:

You're developing features
You need quick tests
Single machine is enough
Minimal setup complexity desired

Use Distributed Compute Mode if:

Running production workloads
Need to scale horizontally
Long-running benchmarks
High availability required
Resource optimization needed

Migration Path

From Local Testing to Distributed

If you've been using Local Testing mode and want to migrate to Distributed:

1. Extract Configuration

Export Grafana dashboards
Save custom pricing_config.json
Backup test_results/ directory
Document any custom modifications

2. Deploy Master Node

Follow master setup steps above
Import Grafana dashboards
Copy pricing_config.json

3. Deploy Compute Nodes

Follow agent setup steps above
Register agents with master
Verify registration

4. Migrate Workloads

Convert run_tests.py commands to job submissions
Submit jobs via master API instead of running locally
Monitor via master Grafana dashboards

5. Deprecate Local Stack

# Stop local Docker Compose stack
make down

# Optionally remove volumes
docker compose down -v

Maintaining Both Modes

You can keep both modes available:

Use Local Testing for development
Use Distributed for production benchmarks
Same codebase, different deployment methods

Systemd Service Templates

Master Service

File: /etc/systemd/system/ffmpeg-master.service

[Unit]
Description=FFmpeg RTMP Master Service
After=network.target

[Service]
Type=simple
User=ffmpeg
Group=ffmpeg
WorkingDirectory=/opt/ffmpeg-rtmp
ExecStart=/opt/ffmpeg-rtmp/bin/master --port 8080
Restart=always
RestartSec=10s

# Resource limits
MemoryLimit=2G
CPUQuota=200%

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=ffmpeg-master

[Install]
WantedBy=multi-user.target

Agent Service

File: /etc/systemd/system/ffmpeg-agent.service

[Unit]
Description=FFmpeg RTMP Agent Service
After=network.target

[Service]
Type=simple
User=ffmpeg
Group=ffmpeg
WorkingDirectory=/opt/ffmpeg-rtmp
Environment="MASTER_URL=http://master.example.com:8080"
ExecStart=/opt/ffmpeg-rtmp/bin/agent --register --master ${MASTER_URL}
Restart=always
RestartSec=10s

# Resource limits
MemoryLimit=4G
CPUQuota=800%

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=ffmpeg-agent

[Install]
WantedBy=multi-user.target

Setup Instructions

# Create service user
sudo useradd -r -s /bin/false ffmpeg

# Clone repository to /opt
sudo git clone https://github.com/psantana5/ffmpeg-rtmp.git /opt/ffmpeg-rtmp
sudo chown -R ffmpeg:ffmpeg /opt/ffmpeg-rtmp

# Build binaries
cd /opt/ffmpeg-rtmp
sudo -u ffmpeg make build-master  # or build-agent

# Install service file
sudo cp deployment/ffmpeg-master.service /etc/systemd/system/

# Customize service file
sudo nano /etc/systemd/system/ffmpeg-master.service

# Reload systemd
sudo systemctl daemon-reload

# Enable and start
sudo systemctl enable ffmpeg-master
sudo systemctl start ffmpeg-master

# Check status
sudo systemctl status ffmpeg-master

FAQ

Q: Can I run multiple agents on the same machine?

A: Technically yes, but not recommended. Each agent should have dedicated resources. If you need to maximize single-machine utilization, use Local Testing mode instead.

Q: What happens if the master fails?

A: In v1.0, there's no automatic failover. Agents will log connection errors and continue retrying. Jobs in the queue are lost. For high availability, consider:

Running master in a container with restart policy
Using systemd auto-restart
Planning for multi-master support in v1.5+

Q: Can agents have different hardware specs?

A: Yes! That's a key benefit. The master tracks each agent's hardware capabilities (CPU, GPU, RAM). Future versions will use this for intelligent job scheduling.

Q: How do I update agents without downtime?

A: In v1.0, updates require restart. Best practice:

Update agents one at a time
Wait for running jobs to complete
Restart agent service
Verify reconnection to master

Q: Can I mix Docker Compose and distributed mode?

A: Not recommended. Choose one mode per environment. However, you can run distributed mode with the master using Docker Compose for its monitoring stack (VictoriaMetrics + Grafana).

Q: What's the maximum number of agents supported?

A: Not formally tested. The in-memory data structures should handle hundreds of agents. Bottlenecks:

Master CPU for job dispatch
Master memory for node registry
Network bandwidth for heartbeats

Optimize by increasing heartbeat interval if >100 agents.

Document Version

Version: 1.0
Last Updated: 2025-12-30
Maintainers: See CONTRIBUTING.md

FilesExpand file tree

DEPLOYMENT_MODES.md

Latest commit

History

DEPLOYMENT_MODES.md

File metadata and controls

Deployment Modes

Overview

Quick Comparison

Mode 1: Local Testing (Docker Compose)

Architecture Diagram

Description

When to Use

Prerequisites

Setup Steps

Stopping the Stack

Limitations

Resource Usage

Mode 2: Distributed Compute (Master + Agents)

Architecture Diagram

Description

When to Use

Prerequisites

Setup Steps

Master Node Setup

Compute Node Setup

Submit and Run Jobs

Production Deployment Recommendations

1. Use Systemd Services

2. Configure TLS/HTTPS

3. Set Up Monitoring and Alerting

4. Backup and Disaster Recovery

5. Resource Limits

6. Firewall Configuration

7. Log Management

Scaling Compute Nodes

Network Requirements

Maintenance Operations

Troubleshooting

Comparison Summary

Use Local Testing Mode if:

Use Distributed Compute Mode if:

Migration Path

From Local Testing to Distributed

Maintaining Both Modes

Systemd Service Templates

Master Service

Agent Service

Setup Instructions

FAQ

Q: Can I run multiple agents on the same machine?

Q: What happens if the master fails?

Q: Can agents have different hardware specs?

Q: How do I update agents without downtime?

Q: Can I mix Docker Compose and distributed mode?

Q: What's the maximum number of agents supported?

Related Documentation

Document Version