Version: 1.5.0
Last Updated: 2026-02-03
Addresses: FIND-032 - Automate Backup Restore Testing (DR)
Standards: ISO 27001 A.17.1, BSI C5 BCR-01
This document describes the automated disaster recovery (DR) testing framework for ThemisDB. Regular automated tests ensure backup systems are functional and recovery objectives (RTO/RPO) are met.
RTO (Recovery Time Objective)
Target: 1 hour
Definition: Maximum acceptable downtime
RPO (Recovery Point Objective)
Target: 15 minutes
Definition: Maximum acceptable data loss, expressed as a time window
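As a rough illustration of how these targets could be checked in a script, the sketch below compares a measured duration against a target, both in seconds. The helper name `within_target` and the sample measurements are assumptions for illustration; they are not taken from `dr-test.sh`.

```bash
#!/usr/bin/env bash
# Targets from this document, expressed in seconds.
RTO_TARGET_SECONDS=3600   # 1 hour
RPO_TARGET_SECONDS=900    # 15 minutes

# Hypothetical helper: succeeds when the measured value is at or below the target.
within_target() {
  local measured=$1 target=$2
  [ "$measured" -le "$target" ]
}

# Sample values: a 54-minute restore and an 8-minute data loss window.
if within_target $((54 * 60)) "$RTO_TARGET_SECONDS"; then echo "RTO met"; else echo "RTO missed"; fi
if within_target $((8 * 60)) "$RPO_TARGET_SECONDS"; then echo "RPO met"; else echo "RPO missed"; fi
```

A value exactly equal to the target counts as compliant in this sketch; a stricter policy could use `-lt` instead.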
- Validate backup integrity and completeness
- Verify restore procedures function correctly
- Measure actual RTO/RPO achievement
- Identify gaps in DR procedures
- Ensure compliance with business continuity requirements
1. Full Backup
   - Frequency: Daily at 02:00 UTC
   - Retention: 30 days
   - Size: ~500 GB (estimated)
   - Duration: ~2 hours
2. Incremental Backup
   - Frequency: Every 4 hours
   - Retention: 7 days
   - Size: ~50 GB average
   - Duration: ~15 minutes
3. Transaction Log Backup
   - Frequency: Every 15 minutes
   - Retention: 24 hours
   - Size: ~5 GB average
   - Duration: ~2 minutes
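The retention windows above can be turned into a simple pruning check. The following is a minimal sketch, assuming backup age is already known in hours; the function names `retention_hours` and `is_expired` and the type keywords are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: retention window in hours for each backup type listed above.
retention_hours() {
  case "$1" in
    full)            echo $((30 * 24)) ;;  # 30 days
    incremental)     echo $((7 * 24)) ;;   # 7 days
    transaction_log) echo 24 ;;            # 24 hours
    *)               return 1 ;;           # unknown backup type
  esac
}

# Succeeds when a backup of the given type and age (hours) is past its retention window.
is_expired() {
  local kind=$1 age_hours=$2 limit
  limit=$(retention_hours "$kind") || return 2
  [ "$age_hours" -gt "$limit" ]
}

is_expired incremental 200 && echo "incremental backup (200 h old) can be pruned"
is_expired full 200 || echo "full backup (200 h old) must be kept"
```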
Primary: /mnt/backup/primary/
Secondary: /mnt/backup/secondary/ (offsite)
Archive: S3-compatible object storage (long-term)
Location: scripts/operations/dr-test.sh
Usage:
# Run full DR test
./scripts/operations/dr-test.sh --full
# Test specific backup
./scripts/operations/dr-test.sh --backup /mnt/backup/full-20260203
# Test restore only (no verification)
./scripts/operations/dr-test.sh --test-restore
# Verify backup integrity
./scripts/operations/dr-test.sh --verify-backup
# Measure RTO/RPO
./scripts/operations/dr-test.sh --verify-metrics
# Generate DR test report
./scripts/operations/dr-test.sh --report
# Dry-run mode
./scripts/operations/dr-test.sh --dry-run
# Quick smoke test
./scripts/operations/dr-test.sh --smoke-test
Options:
- `--full` - Complete DR test (backup → restore → verify)
- `--backup <path>` - Test specific backup
- `--test-restore` - Restore test only
- `--verify-backup` - Verify backup integrity
- `--verify-metrics` - Check RTO/RPO compliance
- `--report` - Generate test report
- `--dry-run` - Simulation mode
- `--smoke-test` - Quick validation (15 min)
- `--target <env>` - Target environment (staging/dr-site)
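To show how flags like these might be handled, here is a minimal sketch of an argument parser. It is not the actual implementation of `dr-test.sh`; the function name `parse_args`, the variable names, and the default target of `staging` are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical parser: maps the documented flags to shell variables.
parse_args() {
  MODE="" BACKUP_PATH="" TARGET_ENV="staging" DRY_RUN=0
  while [ $# -gt 0 ]; do
    case "$1" in
      --full|--test-restore|--verify-backup|--verify-metrics|--report|--smoke-test)
        MODE=${1#--} ;;                       # strip the leading dashes
      --backup)
        [ $# -ge 2 ] || { echo "--backup requires a path" >&2; return 1; }
        MODE=backup; BACKUP_PATH=$2; shift ;;
      --target)
        [ $# -ge 2 ] || { echo "--target requires an environment" >&2; return 1; }
        TARGET_ENV=$2; shift ;;
      --dry-run)
        DRY_RUN=1 ;;
      *)
        echo "unknown option: $1" >&2; return 1 ;;
    esac
    shift
  done
}

parse_args --backup /mnt/backup/full-20260203 --target dr-site --dry-run
echo "mode=$MODE path=$BACKUP_PATH target=$TARGET_ENV dry_run=$DRY_RUN"
```

Rejecting unknown options early keeps a typo from silently falling back to a default mode, which matters for a script that clears a data directory.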
Schedule: Every Sunday at 03:00 UTC
Test Steps:
1. Pre-Test Validation (5 minutes)
# Verify backup availability
check_backup_exists()
check_backup_integrity()
check_dr_environment_ready()
2. Backup Verification (10 minutes)
# Validate backup
verify_backup_metadata()
verify_backup_checksums()
verify_backup_encryption()
verify_backup_completeness()
3. Restore Test (30 minutes)
# Restore to DR environment
stop_dr_services()
clear_dr_data_directory()
restore_full_backup()
restore_incremental_backups()
restore_transaction_logs()
4. Data Integrity Verification (10 minutes)
# Verify restored data
compare_record_counts()
verify_data_checksums()
test_query_execution()
verify_index_integrity()
5. Functional Testing (10 minutes)
# Test basic operations
test_database_connectivity()
test_read_operations()
test_write_operations()
test_query_performance()
6. RTO/RPO Measurement (5 minutes)
# Calculate metrics
calculate_restore_time() # RTO
calculate_data_loss() # RPO
compare_against_targets()
7. Cleanup (5 minutes)
# Clean up test environment
stop_dr_services()
archive_test_results()
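send_notifications()
Total Duration: ~75 minutes for the complete test run. The 1-hour RTO target applies to the measured recovery time (restore through service startup), not to the whole test including verification and cleanup.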
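The step sequence above could be driven by a small runner that times each step and stops at the first failure. This is a sketch under assumptions: `run_step` and the two placeholder step functions are hypothetical, and a real run would call the functions listed in the steps.

```bash
#!/usr/bin/env bash
# Hypothetical step runner: runs one step, reports its duration, and propagates failure.
run_step() {
  local name=$1 start end status
  shift
  start=$(date +%s)
  if "$@"; then status=0; else status=$?; fi
  end=$(date +%s)
  if [ "$status" -eq 0 ]; then
    echo "PASS  $name ($((end - start)) s)"
  else
    echo "FAIL  $name ($((end - start)) s, exit code $status)"
  fi
  return "$status"
}

# Placeholder steps standing in for the real checks.
pre_test_validation() { true; }
backup_verification() { true; }

# Chaining with && stops the sequence at the first failing step.
run_step "Pre-test validation" pre_test_validation &&
run_step "Backup verification" backup_verification &&
echo "all steps passed"
```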
Objective: Validate complete database restoration
Steps:
- Select latest full backup
- Restore to DR environment
- Apply incremental backups
- Apply transaction logs
- Verify data integrity
- Test application connectivity
Success Criteria:
- All data restored successfully
- No data corruption detected
- Record counts match source
- Application can connect and query
- Restore completed within RTO (1 hour)
- Data loss within RPO (15 minutes)
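One way to check the "record counts match source" criterion is to compare per-table counts exported from the source and from the restored database. The sketch below assumes each side produces a text file with one `<table> <count>` line per table; the file format, the function name `compare_record_counts`, and the sample table names are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical check: succeeds when both files list the same tables with the same counts.
# Line order does not matter; differing lines are printed by diff.
compare_record_counts() {
  local source_counts=$1 restored_counts=$2 a b status
  a=$(mktemp) && b=$(mktemp) || return 2
  LC_ALL=C sort "$source_counts" > "$a"
  LC_ALL=C sort "$restored_counts" > "$b"
  if diff "$a" "$b"; then status=0; else status=1; fi
  rm -f "$a" "$b"
  return "$status"
}

# Demonstration with throwaway files.
src=$(mktemp); dst=$(mktemp)
printf 'orders 1200\nusers 350\n' > "$src"
printf 'users 350\norders 1200\n' > "$dst"
compare_record_counts "$src" "$dst" && echo "record counts match"
rm -f "$src" "$dst"
```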
Objective: Restore database to specific timestamp
Steps:
- Select backup before target time
- Apply transaction logs to target time
- Verify data state at target time
- Test data consistency
Success Criteria:
- Database restored to exact timestamp
- Data matches expected state
- No data inconsistencies
- Transaction integrity maintained
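The first step, selecting a backup taken before the target time, can be sketched as follows. The helper `select_backup_before` is hypothetical, and it assumes backup times are available as ISO 8601 UTC strings in one fixed format, so that lexical order equals chronological order.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the latest backup timestamp that is not after the target.
# Fails (non-zero exit) when no backup is old enough.
select_backup_before() {
  local target=$1
  shift
  printf '%s\n' "$@" | LC_ALL=C sort |
    LC_ALL=C awk -v t="$target" '
      $0 != "" && $0 <= t { best = $0 }            # keep the newest candidate so far
      END { if (best == "") exit 1; print best }
    '
}

select_backup_before "2026-02-03T10:30:00Z" \
  "2026-02-03T02:00:00Z" "2026-02-03T06:00:00Z" "2026-02-03T10:00:00Z" "2026-02-03T14:00:00Z"
# prints 2026-02-03T10:00:00Z
```

Transaction logs would then be replayed from the selected backup up to the target time.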
Objective: Restore specific tables/collections
Steps:
- Identify target data to restore
- Extract from backup
- Restore to target environment
- Verify specific data only
Success Criteria:
- Targeted data restored
- Surrounding data unchanged
- No data corruption
- Minimal downtime
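If backups are gzip-compressed tar archives, as the file names elsewhere in this document suggest, extracting only selected members might look like the sketch below. The per-table layout inside the archive, the member names, and the helper `extract_from_backup` are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: extract only the named members from a .tar.gz backup into dest.
extract_from_backup() {
  local archive=$1 dest=$2
  shift 2
  mkdir -p "$dest" && tar -xzf "$archive" -C "$dest" "$@"
}

# Demonstration with a throwaway archive.
work=$(mktemp -d)
mkdir -p "$work/src/tables"
echo "orders data" > "$work/src/tables/orders.dat"
echo "users data"  > "$work/src/tables/users.dat"
tar -czf "$work/backup.tar.gz" -C "$work/src" tables

extract_from_backup "$work/backup.tar.gz" "$work/restore" tables/orders.dat
ls "$work/restore/tables"   # lists only orders.dat
rm -rf "$work"
```

Extracting into a separate directory first, rather than directly into the live data directory, leaves surrounding data untouched until the restored files have been verified.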
Objective: Complete failover to DR site
Steps:
- Declare disaster scenario
- Activate DR site
- Restore latest backup
- Update DNS/routing
- Redirect traffic to DR site
- Verify service availability
Success Criteria:
- DR site activated successfully
- Data fully restored
- Service accessible
- Failover completed within RTO
- All critical functions working
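The "verify service availability" step usually means polling a health check until it succeeds or a deadline passes. The following is a minimal sketch; `wait_until` and `service_is_up` are hypothetical names, and the placeholder check should be replaced with a real probe of the DR endpoint.

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or the timeout (seconds) expires.
wait_until() {
  local timeout=$1 interval=$2 deadline
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1                      # gave up: check never succeeded in time
    fi
    sleep "$interval"
  done
}

# Placeholder health check that always succeeds.
service_is_up() { true; }

if wait_until 10 1 service_is_up; then
  echo "service available"
else
  echo "service not available within timeout"
fi
```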
Checksum Verification:
# Verify backup checksums
sha256sum -c backup.sha256
# Expected output:
# database-full-20260203.tar.gz: OK
# database-incremental-20260203.tar.gz: OK
# transaction-logs-20260203.tar.gz: OK
Encryption Verification:
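# Verify the backup signature (confirms authenticity; it does not prove the data is encrypted)
gpg --verify backup.sig
# Check encryption algorithm
file database-full-20260203.tar.gz.gpg
# Expected: GPG encrypted data (AES256); exact wording depends on the file version and encryption mode
Metadata Verification: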
# Check backup metadata
cat backup-metadata.json
{
"backup_id": "FULL-20260203-020000",
"timestamp": "2026-02-03T02:00:00Z",
"type": "full",
"size_bytes": 536870912000,
"compressed_size_bytes": 107374182400,
"checksum": "sha256:abcd1234...",
"encryption": "AES256",
"retention_days": 30,
"database_version": "1.4.1",
"record_count": 150000000
}
Prometheus Metrics:
# Backup success rate
backup_success_rate{type="full"} 1.0
backup_success_rate{type="incremental"} 1.0
# Backup duration
backup_duration_seconds{type="full"} 7200
backup_duration_seconds{type="incremental"} 900
# Restore time (RTO)
dr_restore_duration_seconds{scenario="full"} 3240
dr_rto_target_seconds 3600
# Data loss window (RPO)
dr_data_loss_seconds{scenario="full"} 600
dr_rpo_target_seconds 900
# DR test results
dr_test_success_rate 1.0
dr_test_last_run_timestamp 1770087600
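Samples like these could be produced by a small shell helper that prints lines in the Prometheus text exposition format. This is a sketch, not the contents of `export-dr-metrics.sh`; the helper `emit_metric` and the extra gauge `dr_rto_compliant` are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one sample as "name{labels} value" or "name value".
emit_metric() {
  local name=$1 value=$2 labels=${3:-}
  if [ -n "$labels" ]; then
    printf '%s{%s} %s\n' "$name" "$labels" "$value"
  else
    printf '%s %s\n' "$name" "$value"
  fi
}

restore_seconds=3240
rto_target_seconds=3600

emit_metric dr_restore_duration_seconds "$restore_seconds" 'scenario="full"'
emit_metric dr_rto_target_seconds "$rto_target_seconds"

# Assumed extra gauge (not in the list above): 1 when the restore met the RTO target, else 0.
if [ "$restore_seconds" -le "$rto_target_seconds" ]; then compliant=1; else compliant=0; fi
emit_metric dr_rto_compliant "$compliant"
```

One option for collection is to redirect this output to a `.prom` file read by the node_exporter textfile collector; a Pushgateway would be another.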
Dashboard: Disaster Recovery Metrics
Panels:
- RTO Achievement
  - Current RTO vs. Target
  - RTO trend over time
  - RTO by restore scenario
- RPO Achievement
  - Current RPO vs. Target
  - RPO trend over time
  - Data loss analysis
- Backup Health
  - Backup success rate
  - Backup duration trends
  - Backup size growth
  - Failed backup alerts
- DR Test Results
  - Test success rate
  - Last successful test
  - Test duration
  - Issues detected
- Compliance Status
  - RTO/RPO compliance rate
  - Test frequency compliance
  - Backup retention compliance
File: reports/dr-test-YYYY-MM-DD.md
# DR Test Report
**Test ID:** DR-TEST-2026-02-03
**Date:** 2026-02-03 03:00 UTC
**Test Type:** Full Restore Test
**Status:** ✅ PASSED
## Executive Summary
- **RTO Achievement:** 54 minutes ✅ (Target: 60 minutes)
- **RPO Achievement:** 8 minutes ✅ (Target: 15 minutes)
- **Test Duration:** 72 minutes
- **Data Integrity:** 100% verified
- **Overall Result:** PASSED
## Test Timeline
| Step | Start Time | Duration | Status |
|------|------------|----------|--------|
| Pre-test validation | 03:00 | 5 min | ✅ PASS |
| Backup verification | 03:05 | 10 min | ✅ PASS |
| Restore execution | 03:15 | 30 min | ✅ PASS |
| Data verification | 03:45 | 10 min | ✅ PASS |
| Functional testing | 03:55 | 10 min | ✅ PASS |
| Metrics calculation | 04:05 | 5 min | ✅ PASS |
| Cleanup | 04:10 | 2 min | ✅ PASS |
## Metrics
### RTO (Recovery Time Objective)
- **Target:** 60 minutes
- **Achieved:** 54 minutes
- **Status:** ✅ WITHIN TARGET
- **Breakdown:**
- Backup restore: 30 minutes
- Transaction log replay: 15 minutes
- Service startup: 9 minutes
### RPO (Recovery Point Objective)
- **Target:** 15 minutes
- **Achieved:** 8 minutes
- **Status:** ✅ WITHIN TARGET
- **Last backup:** 02:52 UTC
- **Test start:** 03:00 UTC
- **Data loss window:** 8 minutes
## Data Integrity
- **Total records:** 150,000,000
- **Records verified:** 150,000,000 (100%)
- **Checksum mismatches:** 0
- **Data corruption:** None detected
- **Index integrity:** ✅ All indexes valid
## Functional Tests
| Test | Result | Notes |
|------|--------|-------|
| Database connectivity | ✅ PASS | Connected successfully |
| Read operations | ✅ PASS | Query execution normal |
| Write operations | ✅ PASS | Insert/update working |
| Query performance | ✅ PASS | Within baseline |
| API endpoints | ✅ PASS | All endpoints responsive |
## Issues Detected
No issues detected during this test.
## Recommendations
1. Continue weekly automated DR tests
2. No action required at this time
3. RTO/RPO well within targets
## Next Test
**Scheduled:** 2026-02-10 03:00 UTC
**Type:** Full Restore Test
File: .github/workflows/dr-testing.yml
name: DR Testing Automation
on:
  schedule:
    - cron: '0 3 * * 0' # Weekly on Sunday at 03:00 UTC
  workflow_dispatch:
    inputs:
      test_type:
        description: 'DR test type'
        required: true
        type: choice
        options:
          - full
          - smoke-test
          - point-in-time
          - partial-restore
jobs:
  dr-test:
    runs-on: ubuntu-latest
    timeout-minutes: 120
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Setup DR environment
        run: |
          ./scripts/operations/setup-dr-env.sh
      - name: Run DR test
        id: dr-test
        # Scheduled runs have no inputs, so fall back to the full test
        run: |
          ./scripts/operations/dr-test.sh --${{ inputs.test_type || 'full' }}
        # Keep going on failure so the report, metrics, and cleanup steps still run
        continue-on-error: true
      - name: Generate report
        run: |
          ./scripts/operations/dr-test.sh --report
      - name: Upload report
        uses: actions/upload-artifact@v4
        with:
          name: dr-test-report
          path: reports/dr-test-*.md
      - name: Update metrics
        run: |
          ./scripts/operations/export-dr-metrics.sh
      - name: Notify team
        # outcome reflects the step result before continue-on-error is applied
        if: steps.dr-test.outcome == 'failure'
        run: |
          ./scripts/operations/notify-dr-failure.sh
      - name: Cleanup
        if: always()
        run: |
          ./scripts/operations/cleanup-dr-env.sh
- All backups completed successfully
- Backup integrity verified (last 7 days)
- DR site connectivity verified
- DR site capacity sufficient
- Contact lists up to date
- DR procedures documented and tested
- Team members trained on DR procedures
- Communication plan established
- Disaster declared by authorized personnel
- Incident response team activated
- DR site activated
- Latest backup identified
- Restore process initiated
- Stakeholders notified (internal)
- Status updates provided (every 15 min)
- Progress tracked and documented
- Service restored and verified
- Data integrity confirmed
- All critical functions tested
- Customers notified of resolution
- DNS/routing updated (if permanent)
- Post-incident review scheduled
- Lessons learned documented
- DR procedures updated
A.17.1 - Information security continuity
- ✅ DR procedures documented
- ✅ Regular DR testing conducted
- ✅ RTO/RPO defined and measured
- ✅ Backup integrity verified
BCR-01 - Business Continuity Management
- ✅ Business continuity plan exists
- ✅ Regular testing and validation
- ✅ Recovery procedures documented
- ✅ Continuous improvement process
Issue: Backup restore fails
# Check backup integrity
./scripts/operations/dr-test.sh --verify-backup
# Try alternative backup
./scripts/operations/dr-test.sh --backup /mnt/backup/secondary/
# Check logs
tail -f logs/dr-test.log
Issue: RTO target exceeded
# Analyze restore timeline
./scripts/operations/analyze-restore-time.sh
# Check for bottlenecks
./scripts/operations/dr-performance-analysis.sh
# Optimize restore process
# Consider: parallel restore, faster storage, optimized network
Issue: Data integrity check fails
# Review specific failures
./scripts/operations/dr-test.sh --verify-data --verbose
# Compare checksums
./scripts/operations/compare-checksums.sh
# Restore from alternative backup
./scripts/operations/dr-test.sh --backup /mnt/backup/secondary/
Config File: config/dr-testing.yaml
dr_testing:
  # RTO/RPO targets
  targets:
    rto_seconds: 3600 # 1 hour
    rpo_seconds: 900 # 15 minutes
  # Test schedule
  schedule:
    full_test: "0 3 * * 0" # Weekly
    smoke_test: "0 4 * * *" # Daily
    point_in_time: "0 3 1 * *" # Monthly
  # Backup locations
  backup:
    primary_path: "/mnt/backup/primary"
    secondary_path: "/mnt/backup/secondary"
    archive_bucket: "s3://themisdb-backups"
  # DR environment
  dr_site:
    enabled: true
    endpoint: "dr.themisdb.example.com"
    capacity_check: true
  # Notifications
  notifications:
    enabled: true
    on_failure: true
    on_success: false
    recipients:
      - ops-team@example.com
      - dr-lead@example.com
Document Version: 1.5.0
Compliance: ISO 27001 A.17.1, BSI C5 BCR-01
Last Reviewed: 2026-02-03