Commit e96f4c6

Author: super
Refactor walg-env-prepare.sh for improved SSH key handling and environment validation
- Simplified SSH key preparation logic, ensuring proper permissions and ownership.
- Enhanced environment variable validation, particularly for WALG_SSH_PREFIX and SSH configurations.
- Updated the way SSH known hosts are managed, adding hosts only when necessary.
- Streamlined the creation of the .walg_env file, ensuring all relevant variables are set correctly.
- Improved error handling and logging for better debugging during environment setup.

Enhance end-to-end tests in test-walg-e2e.sh
- Added logic to derive SSH connection parameters based on ENABLE_SSH_SERVER and WALG_SSH_PREFIX.
- Improved handling of remote backup paths and SSH connectivity checks.
- Enhanced debug output for better visibility into the backup process.
- Added checks for pure WAL segment counts to ensure accurate tracking of archived segments.

Create WAL-G Environment Variables Analysis Report
- Comprehensive analysis of environment variables used in WAL-G production configurations.
- Identified used and unused variables, providing recommendations for cleanup and optimization.
- Suggested minimal required configurations for production environments.
- Highlighted key findings and actionable insights for improving configuration management.
1 parent b6295a3 commit e96f4c6

9 files changed: +513 -196 lines changed

docker-compose.yml

Lines changed: 2 additions & 2 deletions
@@ -24,7 +24,7 @@ services:
     volumes:
       - pg_data:/var/lib/postgresql/data
       # Conditionally mount SSH key (create empty file if not exists)
-      - ${SSH_KEY_PATH:-./secrets/walg_ssh_key}:/secrets/walg_ssh_key:ro
+      #- ${SSH_KEY_PATH:-./secrets/walg_ssh_key}:/secrets/walg_ssh_key:ro
     networks:
       - pg_network

@@ -40,7 +40,7 @@ services:
       # Mount postgres data for wal-g mode backups
       - pg_data:/var/lib/postgresql/data:rw
       # Mount SSH key for wal-g mode
-      - ${SSH_KEY_PATH:-./secrets/walg_ssh_key}:/secrets/walg_ssh_key:ro
+      #- ${SSH_KEY_PATH:-./secrets/walg_ssh_key}:/secrets/walg_ssh_key:ro
     depends_on:
       - postgres
     networks:
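
Note: with both bind mounts commented out, nothing is mounted at `/secrets/walg_ssh_key`, so the key presumably has to arrive through `WALG_SSH_PRIVATE_KEY`, the Base64 alternative listed in docs/ENV_VARS.md. The sketch below shows one way such a value could be written to disk. The helper name `write_ssh_key_from_env` is hypothetical, `base64 -d` assumes GNU coreutils, and none of this is taken from walg-env-prepare.sh, whose diff is not shown on this page.

```bash
#!/usr/bin/env bash
# Hypothetical helper: materialize a Base64-encoded private key on disk.
# $1 = Base64 text (for example "$WALG_SSH_PRIVATE_KEY"), $2 = destination path.
write_ssh_key_from_env() {
    local encoded="$1" dest="$2"
    [[ -n "$encoded" ]] || { echo "no key material supplied" >&2; return 1; }
    mkdir -p "$(dirname "$dest")"
    # Create the file under a restrictive umask so the secret is never world-readable.
    ( umask 077 && printf '%s' "$encoded" | base64 -d > "$dest" ) || return 1
    chmod 600 "$dest"
}
```

A call such as `write_ssh_key_from_env "$WALG_SSH_PRIVATE_KEY" /secrets/walg_ssh_key` would then recreate the file that the bind mount used to provide. The value itself could be produced on the host with `base64 -w0 < ./secrets/walg_ssh_key` (again a GNU coreutils flag).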

docs/ENV_VARS.md

Lines changed: 2 additions & 3 deletions
@@ -16,7 +16,6 @@ This document summarizes all supported environment variables. It is generated/cu
 | SQL_BACKUP_RETAIN_DAYS | sql_mode | 30 | no | sql | Days to retain SQL dumps remotely |
 | BACKUP_CRON_SCHEDULE | sql_mode | 0 2 * * * | no | sql | Cron schedule for daily SQL dump |
 | WALG_SSH_PREFIX | wal_mode | (none) | when wal | wal | SSH storage URI `ssh://user@host[:port]/abs/path` |
-| WALG_SSH_PREFIX_LOCAL | testing | ssh://walg@ssh-server/backups | no | wal/testing | Local test override for WALG_SSH_PREFIX |
 | SSH_PORT | wal_mode | 22 | no | wal | SSH port (auto-detected from prefix if present) |
 | WALG_SSH_PRIVATE_KEY | wal_mode | (none) | no | wal | Base64 encoded private key (alternative to path) |
 | WALG_SSH_PRIVATE_KEY_PATH | wal_mode | /secrets/walg_ssh_key | no | wal | Mounted path to SSH private key |
@@ -29,8 +28,8 @@ This document summarizes all supported environment variables. It is generated/cu
 | WALG_RETENTION_DAYS | wal_mode | 30 | no | wal (planned) | Optional days-based retention (not enforced yet) |
 | WALG_BASEBACKUP_CRON | wal_mode | 30 1 * * * | no | wal | Cron for base backups |
 | WALG_CLEAN_CRON | wal_mode | 15 3 * * * | no | wal | Cron for retention/cleanup |
-| ENABLE_SSH_SERVER | testing | 0 | no | wal/testing | Enable internal test SSH server profile |
-| SSH_USER | testing | walg | no | wal/testing | Username for local test SSH server |
+| ENABLE_SSH_SERVER | testing | 0 | no | wal/testing | When 1 auto-starts internal ssh-server (profile) and supplies default WALG_SSH_PREFIX/SSH_PORT=2222 |
+| SSH_USER | testing | (derived) | no | wal/testing | Username (derived from WALG_SSH_PREFIX unless set; default walg when ENABLE_SSH_SERVER=1) |
 | SKIP_SSH_KEYSCAN | testing | 0 | no | wal/testing | Skip ssh-keyscan host key fetch |
 | TELEGRAM_BOT_TOKEN | notifications | (none) | no | all | Telegram bot token for alerts |
 | TELEGRAM_CHAT_ID | notifications | (none) | no | all | Telegram target chat ID |
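
Note: the table now says that `SSH_USER` is derived from `WALG_SSH_PREFIX` and that `SSH_PORT` is auto-detected from the prefix when it carries a port. A minimal sketch of that derivation with plain Bash parameter expansion follows. The function name `parse_walg_ssh_prefix` is hypothetical, the fallback to `SSH_PORT` and then 22 is an assumption based on the documented default, and bracketed IPv6 hosts are not handled.

```bash
#!/usr/bin/env bash
# Hypothetical parser for URIs of the form ssh://user@host[:port]/abs/path.
# Prints "user host port path" on one line.
parse_walg_ssh_prefix() {
    local uri="$1" rest authority user hostport host port path
    [[ "$uri" == ssh://*@*/* ]] || { echo "unsupported prefix: $uri" >&2; return 1; }
    rest="${uri#ssh://}"        # user@host[:port]/abs/path
    authority="${rest%%/*}"     # user@host[:port]
    path="/${rest#*/}"          # /abs/path
    user="${authority%%@*}"
    hostport="${authority#*@}"
    host="${hostport%%:*}"
    # Assumption: fall back to SSH_PORT, then 22, when the URI has no port.
    port="${SSH_PORT:-22}"
    if [[ "$hostport" == *:* ]]; then
        port="${hostport##*:}"
    fi
    printf '%s %s %s %s\n' "$user" "$host" "$port" "$path"
}
```

The four fields can be consumed with `read -r user host port path < <(parse_walg_ssh_prefix "$WALG_SSH_PREFIX")`.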

docs/WALG_ENV_ANALYSIS.md

Lines changed: 168 additions & 0 deletions
@@ -0,0 +1,168 @@
+# WAL-G Environment Variables Analysis Report
+
+## Executive Summary
+
+This report analyzes how the WAL-G backup variables defined in `env_sample` are used in production. A review of the existing scripts and configuration files identifies which variables are actually consumed and which are redundant or underused.
+
+## WAL-G Environment Variables Analysis
+
+### ✅ Used Variables
+
+The following variables are actually used in production WAL backup mode:
+
+#### 1. Core WAL-G Configuration
+
+| Variable | Default | Used in | Description |
+|--------|--------|----------|------|
+| `WALG_SSH_PREFIX` | - | `walg-env-prepare.sh`, `wal-g-runner.sh`, `docker-compose.yml` | WAL-G SSH backend configuration, **required** |
+| `WALG_SSH_PRIVATE_KEY` | - | `walg-env-prepare.sh`, `docker-compose.yml` | Base64-encoded SSH private key |
+| `WALG_SSH_PRIVATE_KEY_PATH` | `/secrets/walg_ssh_key` | `walg-env-prepare.sh`, `docker-compose.yml` | Path to the SSH private key file |
+| `WALG_COMPRESSION_METHOD` | `lz4` | `walg-env-prepare.sh`, `docker-compose.yml` | Compression algorithm |
+| `WALG_DELTA_MAX_STEPS` | `7` | `walg-env-prepare.sh`, `wal-g-runner.sh`, `docker-compose.yml` | Maximum length of the delta backup chain |
+| `WALG_DELTA_ORIGIN` | `LATEST` | `walg-env-prepare.sh`, `docker-compose.yml` | Origin for delta backups |
+| `WALG_LOG_LEVEL` | `DEVEL` | `walg-env-prepare.sh`, `docker-compose.yml` | Log level |
+
+#### 2. Retention Policy Configuration
+
+| Variable | Default | Used in | Description |
+|--------|--------|----------|------|
+| `WALG_RETENTION_FULL` | `7` | `wal-g-runner.sh`, `walg-daily-backup.sh` | Number of full backups to retain |
+
+#### 3. SSH Connection Configuration
+
+| Variable | Default | Used in | Description |
+|--------|--------|----------|------|
+| `SSH_PORT` | `22` | `walg-env-prepare.sh`, `docker-compose.yml` | SSH port |
+| `SSH_KEY_PATH` | `./secrets/walg_ssh_key` | `docker-compose.yml` | Host path to the SSH key (mount source) |
+
+#### 4. Container and Network Configuration
+
+| Variable | Default | Used in | Description |
+|--------|--------|----------|------|
+| `BACKUP_MODE` | `sql` | `backup.sh`, `walg-env-prepare.sh`, `docker-entrypoint-walg.sh` | Backup mode switch |
+| `POSTGRES_USER` | `postgres` | `walg-env-prepare.sh`, `backup.sh` | PostgreSQL user name |
+| `POSTGRES_PASSWORD` | - | `walg-env-prepare.sh` | PostgreSQL password |
+
+#### 5. Notification Configuration
+
+| Variable | Default | Used in | Description |
+|--------|--------|----------|------|
+| `TELEGRAM_BOT_TOKEN` | - | `wal-g-runner.sh`, `walg-daily-backup.sh` | Telegram bot token |
+| `TELEGRAM_CHAT_ID` | - | `wal-g-runner.sh`, `walg-daily-backup.sh` | Telegram chat ID |
+| `TELEGRAM_MESSAGE_PREFIX` | `Database` | `wal-g-runner.sh`, `walg-daily-backup.sh` | Message prefix |
+
+### ❌ Unused Variables
+
+The following variables are defined in `env_sample` but are **not actually used** in production WAL backup mode:
+
+#### 1. Timing and Scheduling (test/development use)
+
+| Variable | Default | Status | Notes |
+|--------|--------|------|------|
+| `WALG_BASEBACKUP_CRON` | `"30 1 * * *"` | ❌ Unused | Base backup cron schedule; a custom script is used instead |
+| `WALG_CLEAN_CRON` | `"15 3 * * *"` | ❌ Unused | Cleanup cron schedule; folded into the daily backup script |
+| `BACKUP_CRON_SCHEDULE` | `"0 2 * * *"` | ⚠️ Partially used | Used only in SQL mode; WAL mode has its own schedule |
+
+#### 2. Retention Policy (redundant configuration)
+
+| Variable | Default | Status | Notes |
+|--------|--------|------|------|
+| `WALG_RETENTION_DAYS` | `30` | ❌ Unused | Days-based retention; the count-based policy is used in practice |
+
+#### 3. Test Environment Configuration
+
+| Variable | Default | Status | Notes |
+|--------|--------|------|------|
+| `ENABLE_SSH_SERVER` | `0` | ✅ Test path | When set to 1, automatically supplies default WALG_SSH_PREFIX/SSH_PORT=2222 |
+| `SSH_USER` | (derived) | ✅ Test path | Extracted from WALG_SSH_PREFIX; defaults to walg when ENABLE_SSH_SERVER=1 |
+| `WALG_SSH_PREFIX_LOCAL` | `ssh://walg@ssh-server/backups` | ⛔ Deprecated | Use WALG_SSH_PREFIX + ENABLE_SSH_SERVER instead |
+
+#### 4. Time Zone Configuration
+
+| Variable | Default | Status | Notes |
+|--------|--------|------|------|
+| `TZ` | `Asia/Shanghai` | ⚠️ Indirectly used | Mainly sets the container time zone; WAL-G itself does not read it directly |
+
+#### 5. Legacy/Compatibility Variables
+
+| Variable | Default | Status | Notes |
+|--------|--------|------|------|
+| `SSH_USERNAME` | - | ⚠️ Derived at runtime | Extracted automatically from `WALG_SSH_PREFIX`; no separate configuration needed |
+
+### 🔧 Configuration Recommendations
+
+#### 1. Remove Unnecessary Variables
+
+The following variables can be removed in production:
+
+```bash
+# Variables not recommended for production
+# WALG_BASEBACKUP_CRON="30 1 * * *"  # replaced by a custom script
+# WALG_CLEAN_CRON="15 3 * * *"       # folded into the daily backup
+# WALG_RETENTION_DAYS=30             # the count-based policy is used instead
+# ENABLE_SSH_SERVER=0                # test only
+# SSH_USER=walg                      # test only
+# WALG_SSH_PREFIX_LOCAL=...          # (deprecated) please delete
+```
+
+#### 2. Core Production Configuration
+
+Minimal required configuration for production:
+
+```bash
+# --- Backup mode ---
+BACKUP_MODE=wal
+
+# --- WAL-G core configuration ---
+WALG_SSH_PREFIX=ssh://walg@your-backup-host/absolute/path/to/backup/directory
+SSH_PORT=22
+WALG_SSH_PRIVATE_KEY_PATH=/secrets/walg_ssh_key
+SSH_KEY_PATH=./secrets/walg_ssh_key
+
+# --- WAL-G performance settings ---
+WALG_COMPRESSION_METHOD=lz4
+WALG_DELTA_MAX_STEPS=7
+WALG_DELTA_ORIGIN=LATEST
+WALG_LOG_LEVEL=NORMAL  # NORMAL is recommended over DEVEL in production
+
+# --- Retention policy ---
+WALG_RETENTION_FULL=7
+
+# --- Database configuration ---
+POSTGRES_USER=postgres
+POSTGRES_PASSWORD=your_very_strong_superuser_password
+
+# --- Notification configuration (optional) ---
+TELEGRAM_BOT_TOKEN=your_bot_token
+TELEGRAM_CHAT_ID=your_chat_id
+TELEGRAM_MESSAGE_PREFIX="Production DB"
+```
+
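+### 📊 Usage Statistics
+
+- **Total variables**: 23
+- **Used variables**: 15 (65.2%)
+- **Unused variables**: 8 (34.8%)
+- **Critical variables**: 7 (required configuration)
+- **Optional variables**: 8 (performance tuning and notifications)
+
+### 🚨 Key Findings
+
+1. **Required variables**: `WALG_SSH_PREFIX` is the only required variable; all others have sensible defaults
+2. **Redundant configuration**: several cron and retention settings are unused
+3. **Test configuration**: about 13% of the variables are used only in test environments
+4. **Log level**: use `NORMAL` rather than `DEVEL` in production
+5. **Automation**: the new daily backup script provides better integration and error handling
+
+### 📝 Recommended Actions
+
+1. **Clean up env_sample**: remove or flag test-only variables
+2. **Update documentation**: clearly separate production and test configuration
+3. **Configuration validation**: add more validation logic to the scripts
+4. **Monitoring**: tune log output around `WALG_LOG_LEVEL=NORMAL`
+5. **Backup strategy**: consider retention based on data volume rather than a fixed count
+
+---
+
+*Report generated: $(date -Iseconds)*
+*Scope: WAL-G production configuration optimization*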
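
Note: the report recommends adding more configuration validation to the scripts. A minimal sketch of such a pre-flight check is shown here for discussion. The function name `validate_walg_env` and the exact messages are hypothetical; the rules are taken only from the report (`WALG_SSH_PREFIX` required in WAL mode, `NORMAL` preferred over `DEVEL`, `WALG_SSH_PREFIX_LOCAL` deprecated). It assumes any `ENABLE_SSH_SERVER=1` defaults have already been applied.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check; reads variables from the current environment.
validate_walg_env() {
    local errors=0
    if [[ "${BACKUP_MODE:-sql}" != "wal" ]]; then
        return 0    # nothing WAL-specific to check in SQL mode
    fi
    if [[ -z "${WALG_SSH_PREFIX:-}" ]]; then
        echo "ERROR: WALG_SSH_PREFIX is required when BACKUP_MODE=wal" >&2
        errors=$((errors + 1))
    elif [[ "$WALG_SSH_PREFIX" != ssh://*@*/* ]]; then
        echo "ERROR: WALG_SSH_PREFIX should look like ssh://user@host[:port]/abs/path" >&2
        errors=$((errors + 1))
    fi
    if [[ "${WALG_LOG_LEVEL:-DEVEL}" == "DEVEL" ]]; then
        echo "WARN: WALG_LOG_LEVEL=DEVEL is verbose; the report suggests NORMAL" >&2
    fi
    if [[ -n "${WALG_SSH_PREFIX_LOCAL:-}" ]]; then
        echo "WARN: WALG_SSH_PREFIX_LOCAL is deprecated; use WALG_SSH_PREFIX" >&2
    fi
    (( errors == 0 ))
}
```

Warnings do not affect the return status, so a caller could run `validate_walg_env || exit 1` early in an entrypoint without breaking setups that still use `DEVEL`.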

docs/env_vars.json

Lines changed: 5 additions & 5 deletions
@@ -166,21 +166,21 @@
       "category": "testing",
       "default": 0,
       "required": false,
-      "description": "Enable internal test SSH server container (compose profile)."
+      "description": "Enable internal test SSH server container (compose profile). When 1, defaults WALG_SSH_PREFIX/SSH_PORT=2222/SSH_USER=walg if unset."
     },
     {
       "name": "SSH_USER",
       "category": "testing",
-      "default": "walg",
-      "required": false,
-      "description": "Username for local test SSH server (if enabled)."
+      "default": null,
+      "required": false,
+      "description": "Username for SSH; derived from WALG_SSH_PREFIX unless provided (defaults to walg when ENABLE_SSH_SERVER=1)."
     },
     {
       "name": "WALG_SSH_PREFIX_LOCAL",
       "category": "testing",
       "default": "ssh://walg@ssh-server/backups",
       "required": false,
-      "description": "Helper override for local test server; can substitute WALG_SSH_PREFIX in local workflows."
+      "description": "DEPRECATED: use WALG_SSH_PREFIX with ENABLE_SSH_SERVER=1 (internal test server auto defaults)."
     },
     {
       "name": "SKIP_SSH_KEYSCAN",

env_sample

Lines changed: 8 additions & 6 deletions
@@ -75,15 +75,17 @@ WALG_BASEBACKUP_CRON="30 1 * * *"
 WALG_CLEAN_CRON="15 3 * * *"

 # --- Local SSH Server for Testing (Optional) ---
-# Set to 1 to enable local SSH server for wal-g testing
+# ENABLE_SSH_SERVER=1: automatically enables the internal ssh-server (docker compose --profile ssh-testing),
+# and uses the defaults: WALG_SSH_PREFIX=ssh://walg@ssh-server/backups SSH_PORT=2222 SSH_USER=walg
+# ENABLE_SSH_SERVER=0: you must supply a reachable WALG_SSH_PREFIX yourself (for example, an external machine)
+# WALG_SSH_PREFIX may be left empty when ENABLE_SSH_SERVER=1; the default value is filled in automatically
 ENABLE_SSH_SERVER=0

-# SSH user for the local test server
-SSH_USER=walg
+# SSH user for test server (leave empty to infer it from WALG_SSH_PREFIX; set only to override)
+SSH_USER=

-# Local SSH server configuration for testing wal-g
-# When using local SSH server, this replaces WALG_SSH_PREFIX (port provided via WALG_SSH_PORT)
-WALG_SSH_PREFIX_LOCAL=ssh://walg@ssh-server/backups
+# (Deprecated) WALG_SSH_PREFIX_LOCAL is deprecated; use WALG_SSH_PREFIX everywhere.
+# WALG_SSH_PREFIX_LOCAL=ssh://walg@ssh-server/backups

 # --- pgAdmin Settings ---
 # pgAdmin (database administration tool) runs on port 8080
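
Note: the new comments describe defaults that apply only when `ENABLE_SSH_SERVER=1` and only for values the user left empty. One way to express that rule is Bash's `:=` expansion, sketched below. The function name `apply_test_ssh_defaults` is hypothetical; the three default values are the ones stated in env_sample, and the actual implementation in the repository's scripts is not visible in this diff.

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring the defaults described in env_sample.
apply_test_ssh_defaults() {
    if [[ "${ENABLE_SSH_SERVER:-0}" == "1" ]]; then
        # ":=" assigns only when the variable is unset or empty,
        # so explicit user settings are never overwritten.
        : "${WALG_SSH_PREFIX:=ssh://walg@ssh-server/backups}"
        : "${SSH_PORT:=2222}"
        : "${SSH_USER:=walg}"
        export WALG_SSH_PREFIX SSH_PORT SSH_USER
    fi
}
```

Because `SSH_USER=` in env_sample is an empty assignment rather than an absent one, the `:=` form (which treats empty like unset) matches the "leave empty to infer" wording better than `=` would.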

run-tests

Lines changed: 27 additions & 5 deletions
@@ -35,11 +35,33 @@ run_sql_mode() {

 run_wal_mode() {
     echo "Running tests in WAL (wal-g) backup mode..."
-
-    # Step 1: Setup local SSH server and configure environment
-    echo "Setting up local SSH server and environment..."
-    "$SCRIPT_DIR/scripts/setup-local-ssh.sh"
-    echo "SSH server setup completed."
+
+    # If ENABLE_SSH_SERVER=1 (explicit) then setup local ssh server, else rely on existing WALG_SSH_PREFIX
+    ENABLE_SSH_SERVER_VAL="${ENABLE_SSH_SERVER:-}"; export ENABLE_SSH_SERVER_VAL
+    if [[ -z "$ENABLE_SSH_SERVER_VAL" && -f "$SCRIPT_DIR/.env" ]]; then
+        set -o allexport; source "$SCRIPT_DIR/.env" >/dev/null 2>&1 || true; set +o allexport
+        ENABLE_SSH_SERVER_VAL="${ENABLE_SSH_SERVER:-}"; export ENABLE_SSH_SERVER_VAL
+    fi
+
+    if [[ "$ENABLE_SSH_SERVER_VAL" == "1" ]]; then
+        echo "ENABLE_SSH_SERVER=1 -> provisioning internal ssh-server for tests"
+        "$SCRIPT_DIR/scripts/setup-local-ssh.sh"
+        echo "Local SSH server environment prepared."
+    else
+        # Ensure WALG_SSH_PREFIX present unless internal server is requested later
+        if [[ -z "${WALG_SSH_PREFIX:-}" ]]; then
+            # Attempt to load from .env
+            if [[ -f "$SCRIPT_DIR/.env" ]]; then
+                set -o allexport; source "$SCRIPT_DIR/.env" >/dev/null 2>&1 || true; set +o allexport
+            fi
+        fi
+        if [[ -z "${WALG_SSH_PREFIX:-}" ]]; then
+            echo "Error: WALG_SSH_PREFIX must be set in environment or .env for WAL mode when ENABLE_SSH_SERVER!=1" >&2
+            echo "Hint: run scripts/setup-local-ssh.sh or set ENABLE_SSH_SERVER=1 for an embedded test server" >&2
+            exit 4
+        fi
+        # Auto derive SSH_PORT if absent and prefix encodes a port; else keep user config (no forced 2222)
+    fi

     # Step 2: Clean up any existing containers and volumes to avoid conflicts
     # echo "Cleaning up any existing containers and volumes..."
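
Note: the hunk sources `.env` in two separate places, each time discarding output and errors with `>/dev/null 2>&1 || true`, which would also hide a syntax error in the file. A small helper could centralize this. The sketch below is only an illustration: `load_dotenv` and `_DOTENV_LOADED` are hypothetical names, and like the original it lets values from `.env` overwrite variables already present in the environment.

```bash
#!/usr/bin/env bash
# Hypothetical helper: export every assignment in a dotenv file, at most once.
_DOTENV_LOADED=""
load_dotenv() {
    local file="$1"
    [[ -f "$file" ]] || return 0                      # a missing file is not an error
    [[ "$_DOTENV_LOADED" == "$file" ]] && return 0    # already loaded
    set -o allexport
    # shellcheck disable=SC1090
    source "$file"
    local rc=$?                                       # status of the source command
    set +o allexport
    _DOTENV_LOADED="$file"
    return "$rc"
}
```

With this in place, both branches of `run_wal_mode` could call `load_dotenv "$SCRIPT_DIR/.env"` and then read `ENABLE_SSH_SERVER` and `WALG_SSH_PREFIX` as before, with failures surfaced through the return status instead of being swallowed.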

scripts/wal-g-runner.sh

Lines changed: 18 additions & 1 deletion
@@ -86,7 +86,7 @@ run_backup() {
         unset WALG_DELTA_MAX_STEPS
     fi

-    # Execute backup
+    # Execute backup (capture output for retry logic)
     if $backup_cmd 2>&1 | tee "$log_file"; then
         local duration=$(($(date +%s) - start_time))

@@ -112,6 +112,23 @@ run_backup() {

         return 0
     else
+        # Check if failure is due to delta/base mismatch (system identifier changed)
+        if grep -qi "Current database and database of base backup are not equal" "$log_file"; then
+            log "Detected system identifier mismatch during delta backup; retrying as full backup"
+            # Force full backup by unsetting delta-related vars
+            unset WALG_DELTA_MAX_STEPS
+            local full_retry_log="${log_file%.log}_full_retry.log"
+            if wal-g backup-push "$PGDATA" 2>&1 | tee "$full_retry_log"; then
+                local duration=$(($(date +%s) - start_time))
+                log "Full backup retry succeeded (Duration: ${duration}s)"
+                ln -sf "$full_retry_log" "$LOG_DIR/latest.log"
+                echo "$(date -Iseconds) OK TYPE=FULL_RETRY Duration=${duration}s LogFile=$full_retry_log" > "$PGDATA/walg_basebackup.last"
+                return 0
+            else
+                log "ERROR: Full backup retry failed"
+                send_telegram_message "ERROR: Full backup retry failed after delta mismatch."
+            fi
+        fi
         log "ERROR: Backup failed"
         send_telegram_message "ERROR: Base backup failed. Check logs."
         return 1
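
Note: both `if ... | tee ...; then` conditions test the exit status of the whole pipeline, which in Bash is `tee`'s status unless `set -o pipefail` is enabled elsewhere in the script (not visible in this hunk). Without it, the new `else` branch would rarely be reached. Two smaller observations: when the full retry fails, control falls through to the generic error path and a second Telegram message is sent, and the regular log path is assumed to end in `.log` for `${log_file%.log}` to have any effect. A minimal sketch of a wrapper that preserves the command's own status through `PIPESTATUS` follows; `run_logged` is a hypothetical name, not part of wal-g-runner.sh.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run a command, mirror its output to a log file,
# and return the command's own exit status rather than tee's.
run_logged() {
    local log_file="$1"; shift
    "$@" 2>&1 | tee "$log_file"
    return "${PIPESTATUS[0]}"
}
```

A call such as `if run_logged "$full_retry_log" wal-g backup-push "$PGDATA"; then ...` would then branch on the result of `wal-g` itself, independent of whether `pipefail` is set.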
