Architecture Overview

System Diagram

┌─────────────────────────────────────────────────────────────────┐
│                    Internal Network (Private)                    │
│                                                                   │
│  ┌──────────────┐         ┌──────────────┐    ┌──────────────┐ │
│  │  Service 1   │         │  Service 2   │    │  Service N   │ │
│  │              │         │              │    │              │ │
│  │  • API       │         │  • Database  │    │  • Worker    │ │
│  │  • Web App   │         │  • Queue     │    │  • Job       │ │
│  └──────────────┘         └──────────────┘    └──────────────┘ │
│         │                        │                    │         │
│         │                        │                    │         │
│    ┌────▼────────────────────────▼────────────────────▼────┐   │
│    │         Heartbeat Clients (cron/systemd)            │   │
│    │                                                       │   │
│    │  • Lightweight scripts (Bash/Python/Node.js)        │   │
│    │  • Send POST requests every 2 minutes               │   │
│    │  • Include metadata (optional)                      │   │
│    └───────────────────────────┬───────────────────────────┘   │
│                                 │                               │
└─────────────────────────────────┼───────────────────────────────┘
                                  │
                  Outbound HTTPS (Port 443)
                  Only connection needed!
                                  │
                                  ▼
              ┌───────────────────────────────────┐
              │   Internet (Public)               │
              │                                   │
              │    Cloudflare Global Network     │
              └───────────────────────────────────┘
                                  │
                                  ▼
         ┌────────────────────────────────────────────┐
         │      Cloudflare Worker                     │
         │   (heartbeat-monitor.workers.dev)         │
         │                                            │
         │  Endpoints:                                │
         │  • POST /api/heartbeat  (receive)         │
         │  • GET  /api/status     (current)         │
         │  • GET  /api/logs       (history)         │
         │  • GET  /               (dashboard)        │
         │                                            │
         │  Scheduled Tasks:                         │
         │  • Check staleness (every 10 min)         │
         │  • Update service status                   │
         └────────────────────────────────────────────┘
                                  │
                                  ▼
              ┌───────────────────────────────────┐
              │    Cloudflare KV Storage          │
              │                                   │
              │  • monitor:latest                │
              │    (heartbeat timestamps)        │
              │  • monitor:data                  │
              │    (summary + uptime stats)      │
              │  • recent:alerts                 │
              │    (alert history)               │
              └───────────────────────────────────┘
                                  │
                                  ▼
              ┌───────────────────────────────────┐
              │         Dashboard Users           │
              │                                   │
              │  Access via browser:             │
              │  https://your-worker.workers.dev │
              └───────────────────────────────────┘

Data Flow

1. Heartbeat Push (Every 2 minutes)

Internal Service → Heartbeat Client → POST /api/heartbeat → Worker
                                                              ↓
                                                         KV Storage
                                                              ↓
                                                    Store heartbeat data
                                                    Update latest timestamp

Payload:

{
  "serviceId": "service-1",
  "status": "up",
  "metadata": { "hostname": "server-1" },
  "message": "Heartbeat from server-1"
}
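
A minimal Python sketch of a client that sends this payload, using only the standard library. The worker URL, the API key value, and the helper names `build_heartbeat_request` and `send_heartbeat` are placeholders for illustration, not names taken from the project; the `Authorization: Bearer` header follows the Authentication Flow section of this document.

```python
import json
import socket
import urllib.request

# Placeholder values (assumptions): use your own worker URL and per-service key.
WORKER_URL = "https://your-worker.workers.dev/api/heartbeat"
API_KEY = "replace-me"


def build_heartbeat_request(service_id: str, hostname: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for one heartbeat."""
    payload = {
        "serviceId": service_id,
        "status": "up",
        "metadata": {"hostname": hostname},
        "message": f"Heartbeat from {hostname}",
    }
    return urllib.request.Request(
        WORKER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def send_heartbeat(service_id: str, timeout: float = 10.0) -> int:
    """Send one heartbeat and return the HTTP status code."""
    request = build_heartbeat_request(service_id, socket.gethostname())
    # urlopen raises URLError/HTTPError on network failures and 4xx/5xx replies.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `send_heartbeat("service-1")` from a cron job or systemd timer every 2 minutes matches the schedule above. Because `urlopen` raises on failure, the scheduler's own logging captures missed heartbeats.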

2. Staleness Check (Every 10 minutes, via cron)

Cloudflare Cron Trigger → Worker scheduled() function
                             ↓
                    Read monitor:latest from KV
                    Read monitor:data from KV
                             ↓
                    Calculate time since last heartbeat
                             ↓
                    Compare with stalenessThreshold
                             ↓
                    Determine status (up/down/unknown)
                             ↓
                    Update uptime statistics
                             ↓
                    Store updated monitor:data in KV
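
The comparison step can be sketched in Python as a pure function. This is an illustration of the logic only, not the Worker's actual code: it assumes timestamps are stored as epoch milliseconds and that "unknown" means no heartbeat has been recorded yet. The function names are hypothetical.

```python
from typing import Optional


def determine_status(last_seen_ms: Optional[int], now_ms: int, threshold_s: int) -> str:
    """Return 'unknown' if never seen, 'down' if stale, otherwise 'up'."""
    if last_seen_ms is None:
        return "unknown"  # assumption: no recorded heartbeat maps to "unknown"
    age_s = (now_ms - last_seen_ms) / 1000  # assumption: epoch milliseconds
    return "up" if age_s <= threshold_s else "down"


def check_staleness(latest: dict, service_ids: list, threshold_s: int, now_ms: int) -> dict:
    """Build a status summary for every configured service."""
    return {
        sid: determine_status(latest.get(sid), now_ms, threshold_s)
        for sid in service_ids
    }
```

Here `latest` stands in for the contents of monitor:latest and the returned dictionary for the summary part of monitor:data.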

3. Dashboard View (On-demand)

User Browser → GET / → Worker
                        ↓
                   Read monitor:latest from KV
                   Read monitor:data from KV
                        ↓
                   Embed data into HTML
                        ↓
                   Return dashboard with embedded data
                        ↓
                   (Optional) JavaScript polls /api/alerts/recent
                        ↓
                   Auto-refresh configurable (default: disabled)
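
One way the "embed data into HTML" step might look, sketched in Python. The template string, the `__MONITOR_DATA__` marker, and the `window.MONITOR` variable are invented for this example; the escaping step is a general precaution when inlining JSON into a script tag, not something the source describes.

```python
import json

# Hypothetical template: the real dashboard markup is not shown in this document.
TEMPLATE = "<script>window.MONITOR = __MONITOR_DATA__;</script>"


def embed_dashboard_data(template: str, latest: dict, data: dict) -> str:
    """Inline both KV values as a JSON literal inside the page."""
    blob = json.dumps({"latest": latest, "data": data})
    # Escape "<" so a stored value containing "</script>" cannot close the tag early.
    blob = blob.replace("<", "\\u003c")
    return template.replace("__MONITOR_DATA__", blob)
```

Embedding the data this way lets the first render happen without an extra API round trip, which is why the flow above treats polling as optional.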

Key Design Decisions

Why Push-Based?

Security: No inbound connections to internal services
Simplicity: No VPN, tunnels, or complex networking
Firewall Friendly: Works through corporate firewalls (outbound HTTPS only)
Scalability: Easy to add new services

Why Cloudflare Workers?

Global Edge Network: Low latency worldwide
Serverless: No servers to manage
Free Tier: 100,000 requests/day
KV Storage: Fast, distributed key-value store
Cron Triggers: Built-in scheduling

Why KV Storage?

Fast: Edge-cached, low latency
Distributed: Global replication
Simple: Key-value interface
Cost-Effective: Free tier sufficient for most use cases
Durable: Reliable storage

Why Separate KV Keys?

The system uses two separate KV keys (monitor:latest and monitor:data) to prevent race conditions:

Problem: KV offers no atomic read-modify-write and is eventually consistent, so the last write wins. If heartbeat updates and cron checks both rewrite the same key, each can silently overwrite the other's changes.

Solution: Separate concerns:

Heartbeats ONLY update monitor:latest (timestamps)
Cron ONLY updates monitor:data (summary + uptime)
Dashboard reads both keys

Benefits:

No heartbeat/cron conflicts: Each key has only one kind of writer
Smaller heartbeat writes: Only timestamps, not full statistics
Consistent status: Cron-generated summaries are never overwritten
Better performance: Reduced payload sizes for frequent operations
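
The lost-update problem can be illustrated with a small Python simulation that uses a plain dictionary in place of KV. It is a sketch of the interleaving only: the shared key name `monitor` and the two function names are made up for the example, and real KV timing is less predictable.

```python
import copy


def demo_shared_key() -> dict:
    """Heartbeat and cron both read-modify-write ONE key: an update is lost."""
    kv = {"monitor": {"latest": {"service-1": 100}, "summary": {"service-1": "unknown"}}}
    heartbeat_view = copy.deepcopy(kv["monitor"])  # heartbeat handler reads
    cron_view = copy.deepcopy(kv["monitor"])       # cron reads the same snapshot
    heartbeat_view["latest"]["service-1"] = 220    # heartbeat records a new timestamp
    kv["monitor"] = heartbeat_view                 # heartbeat writes
    cron_view["summary"]["service-1"] = "up"       # cron computes the status
    kv["monitor"] = cron_view                      # cron writes and discards timestamp 220
    return kv


def demo_separate_keys() -> dict:
    """Same interleaving with two keys: both updates survive."""
    kv = {
        "monitor:latest": {"service-1": 100},
        "monitor:data": {"summary": {"service-1": "unknown"}},
    }
    latest = copy.deepcopy(kv["monitor:latest"])   # heartbeat handler reads its key
    data = copy.deepcopy(kv["monitor:data"])       # cron reads its key
    latest["service-1"] = 220
    kv["monitor:latest"] = latest                  # only heartbeats write this key
    data["summary"]["service-1"] = "up"
    kv["monitor:data"] = data                      # only cron writes this key
    return kv
```

In the shared-key run the stored timestamp ends up back at 100, while in the separate-key run it stays at 220. Note that if monitor:latest is itself updated by read-modify-write, two heartbeats from different services arriving at the same moment could still overwrite each other; the split only removes the heartbeat-versus-cron conflict.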

Component Details

Heartbeat Client

Purpose: Send periodic health signals to the worker

Features:

Lightweight (single HTTP request)
Customizable metadata
Error handling
Logging

Scheduling Options:

Cron (simple, traditional)
systemd timer (modern, reliable)
Docker (containerized)

Cloudflare Worker

Purpose: Receive heartbeats, check staleness, serve dashboard

Responsibilities:

Validate incoming heartbeats
Authenticate via API keys
Store heartbeat data in KV
Check for stale services (scheduled)
Serve dashboard and API endpoints

Routes:

POST /api/heartbeat - Receive heartbeat from services
GET /api/status - Get current status summary
GET /api/logs?serviceId=X - Get historical logs
GET /api/services - List configured services
GET / - Dashboard UI

KV Storage

Purpose: Persist heartbeat data and service status

Keys:

monitor:latest - Latest heartbeat timestamps for all services (JSON object: {serviceId: timestamp})
- Updated by: Heartbeat handler
- Read by: Cron checks, Dashboard
monitor:data - Service status and uptime statistics (JSON object)
- Contains: summary (current status) and uptime (daily statistics per service)
- Updated by: Cron scheduled task
- Read by: Dashboard, API endpoints
recent:alerts - Dashboard alert history (JSON array)
- Contains: External alerts and service status change notifications
- Updated by: Alert handlers, Service monitoring
- Configurable retention (default: 100 alerts, 7 days)

Data Retention:

Latest timestamps: All enabled services (live data)
Uptime statistics: Configurable (default: 120 days per service)
Alert history: Configurable (default: 100 alerts or 7 days)
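
The alert retention rule could be applied as in the following Python sketch. It assumes each alert is a dictionary with a `timestamp` field in epoch milliseconds and that both limits apply together (drop anything older than 7 days, then keep the newest 100); the source does not spell out either detail, and `prune_alerts` is a hypothetical name.

```python
DAY_MS = 24 * 60 * 60 * 1000


def prune_alerts(alerts: list, now_ms: int, max_alerts: int = 100, max_age_days: int = 7) -> list:
    """Drop alerts older than max_age_days, then keep only the newest max_alerts."""
    cutoff_ms = now_ms - max_age_days * DAY_MS
    # Assumption: every alert carries an epoch-millisecond "timestamp".
    fresh = [alert for alert in alerts if alert["timestamp"] >= cutoff_ms]
    fresh.sort(key=lambda alert: alert["timestamp"], reverse=True)  # newest first
    return fresh[:max_alerts]
```

Running this before each write to recent:alerts keeps the stored array bounded regardless of how many alerts arrive.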

Dashboard

Purpose: Visual monitoring interface

Features:

Real-time status display
Summary cards (total, up, down, unknown)
Per-service details
Optional auto-refresh (disabled by default)
Responsive design
No authentication (by default)

Timing Configuration

Recommended Settings

Heartbeat Interval:     2-5 minutes (120-300 seconds)
Staleness Threshold:    5-10 minutes (300-600 seconds)
Staleness Check:        10 minutes (cron)
Dashboard Refresh:      Manual or configurable auto-refresh
Alert Polling:          10-60 seconds (if enabled)

Why These Values?

2-5 minute heartbeats: Balance between freshness and KV operation costs
5-10 minute threshold: Spans roughly two heartbeat intervals, so a single late or dropped heartbeat does not mark a service down
10-minute staleness check: Efficient detection with minimal KV operations
Manual dashboard refresh: Every page load embeds current data, and status only changes when the staleness check runs, so frequent auto-refresh adds little
Alert polling: Only if real-time notifications are needed

Customization

You can adjust these based on your needs:

Critical services: 30s heartbeat, 2m threshold, 1m check
Standard services: 2m heartbeat, 5m threshold, 5m check
Low-priority: 10m heartbeat, 30m threshold, 15m check
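
To compare these presets, it helps to estimate how long a failure can go unnoticed. The Python sketch below uses a simple model: a service fails right after sending a heartbeat, stays "fresh" for the full threshold, and is flagged by the next staleness check up to one check interval later. It ignores KV propagation delay and cron jitter, and the class and method names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    heartbeat_s: int   # seconds between heartbeats
    threshold_s: int   # staleness threshold in seconds
    check_s: int       # seconds between staleness checks

    def worst_case_detection_s(self) -> int:
        """Longest gap between a failure and the check that flags it (simplified model)."""
        return self.threshold_s + self.check_s

    def intervals_within_threshold(self) -> int:
        """Whole heartbeat intervals that fit inside the staleness threshold."""
        return self.threshold_s // self.heartbeat_s


# The three presets listed above, converted to seconds.
PRESETS = {
    "critical": Timing(heartbeat_s=30, threshold_s=120, check_s=60),
    "standard": Timing(heartbeat_s=120, threshold_s=300, check_s=300),
    "low-priority": Timing(heartbeat_s=600, threshold_s=1800, check_s=900),
}
```

Under this model the critical preset detects a failure within about 3 minutes, the standard preset within 10 minutes, and the low-priority preset within 45 minutes.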

Security Model

Authentication Flow

Heartbeat Client
    ↓
Include Authorization: Bearer {apiKey}
    ↓
POST /api/heartbeat
    ↓
Worker validates:
  1. serviceId exists in services.json
  2. apiKey matches (if configured)
    ↓
Accept or reject request
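
The two validation steps might be expressed as follows. This Python sketch is an assumption-heavy illustration: the `SERVICES` dictionary stands in for services.json (whose real structure is not shown here), the `apiKey` field name and the HTTP status codes are guesses, and `authorize` is a hypothetical helper.

```python
import hmac
from typing import Optional, Tuple

# Hypothetical stand-in for services.json; the real schema may differ.
SERVICES = {
    "service-1": {"apiKey": "secret-1"},
    "service-2": {},  # no API key configured
}


def authorize(service_id: str, authorization_header: Optional[str]) -> Tuple[int, str]:
    """Return (http_status, reason) for an incoming heartbeat."""
    service = SERVICES.get(service_id)
    if service is None:
        return 404, "unknown serviceId"          # step 1: serviceId must exist
    expected = service.get("apiKey")
    if expected is None:
        return 200, "accepted (no key configured)"
    prefix = "Bearer "
    if not authorization_header or not authorization_header.startswith(prefix):
        return 401, "missing bearer token"
    supplied = authorization_header[len(prefix):]
    # Constant-time comparison avoids leaking key contents through timing.
    if not hmac.compare_digest(supplied.encode(), expected.encode()):
        return 401, "invalid api key"            # step 2: apiKey must match
    return 200, "accepted"
```

Checking the service first means an unknown serviceId is rejected before any key comparison takes place.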

Security Layers

API Key Authentication: Per-service keys
HTTPS Only: All communication encrypted
Cloudflare Network: DDoS protection
Minimal Credentials: Each service stores only its own heartbeat API key, which grants no access to other systems
Outbound Only: No inbound firewall rules needed

Scalability

Current Limits

Services: ~100-500 (KV write limits and processing time)
Heartbeat Frequency: 2-10 minutes recommended
Storage: Minimal (2 primary KV entries + alert history)
Requests: 100,000/day (free tier)
KV Operations: Primary constraint (1000 writes/day on free tier)
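
Because KV writes are the primary constraint, a rough write budget is worth calculating before choosing intervals. The Python sketch below assumes one KV write per accepted heartbeat (to monitor:latest) and one per staleness check (to monitor:data), and it ignores alert writes; the actual Worker may batch or skip writes, so treat the result as an estimate.

```python
SECONDS_PER_DAY = 86_400


def daily_kv_writes(services: int, heartbeat_s: int, check_s: int) -> int:
    """Estimate KV writes per day under the one-write-per-event assumption."""
    heartbeat_writes = services * (SECONDS_PER_DAY // heartbeat_s)
    cron_writes = SECONDS_PER_DAY // check_s
    return heartbeat_writes + cron_writes


def max_services(write_budget: int, heartbeat_s: int, check_s: int) -> int:
    """Largest service count whose estimated writes fit in the daily budget."""
    cron_writes = SECONDS_PER_DAY // check_s
    per_service = SECONDS_PER_DAY // heartbeat_s
    return max(0, (write_budget - cron_writes) // per_service)
```

Under these assumptions, a single service sending heartbeats every 2 minutes with a 10-minute check uses 864 writes/day, so the 1,000 writes/day free-tier budget fits one such service, or five services at 10-minute heartbeats. Larger fleets would need longer intervals, fewer writes per heartbeat, or a paid plan.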

Scaling Beyond Free Tier

If you need more:

Workers Paid: $5/month for 10M requests
KV Paid: $0.50/GB storage
Multiple Workers: Split services across workers

Optimization Tips

Reduce heartbeat frequency for non-critical services
Clean up old data periodically
Use metadata sparingly
Increase staleness thresholds where possible

Monitoring the Monitor

How to Monitor

Worker Logs: npm run tail
Cloudflare Dashboard: View request metrics
KV Usage: Check storage consumption
Dashboard Health: Monitor your own worker!

Key Metrics

Heartbeat success rate
KV read/write operations
Worker execution time
Error rates

Implemented Features

Recently added capabilities:

✅ Multi-Channel Notifications: Discord, Slack, Telegram, Email, PagerDuty, Pushover, Webhook
✅ External Alert Integration: Grafana, Alertmanager, custom webhooks
✅ Real-time Dashboard Alerts: Toast and browser notifications
✅ Alert History: Searchable history with configurable retention
✅ Uptime Statistics: Daily uptime tracking with configurable retention (120 days)
✅ CSV Export: Historical data export with custom date ranges
✅ API Endpoint Controls: Enable/disable individual endpoints
✅ Customizable Alerts: Severity filtering, polling intervals

Future Enhancements

Potential improvements:

Authentication: Add built-in dashboard login (the dashboard can currently be protected with Cloudflare Access)
Charts: Visual graphs of uptime trends
Multi-region Tracking: Identify which region/datacenter sent heartbeat
Service Dependencies: Track and visualize service dependencies
Custom Status Pages: Public-facing status page generation
Synthetic Monitoring: Active checks in addition to heartbeats
Performance Metrics: Track response times and custom metrics

Comparison with Alternatives

Feature         This Solution        Traditional Monitoring   Cloud Services
Cost            Free - $5/mo         $50-500/mo               $20-200/mo
Setup Time      10 minutes           Hours/Days               30 min - 2 hours
Exposure        None                 Inbound required         Varies
Maintenance     Minimal              High                     Low
Scalability     ~100-500 services    Unlimited                Unlimited
Customization   Full control         Limited                  Limited

Best For

This solution is ideal for:

✅ Internal services that shouldn't be exposed
✅ Small to medium deployments (< 100 services)
✅ Budget-conscious teams
✅ Simple uptime monitoring
✅ Teams comfortable with Cloudflare

Not ideal for:

❌ Complex health checks (use dedicated monitoring)
❌ Sub-second monitoring requirements
❌ 1000+ services (consider paid alternatives)
❌ Teams without Cloudflare experience

Questions?

Check the main README.md or QUICKSTART.md for more details.
