Add AWS Aurora PostgreSQL Auto-Discovery Support #5267
base: v3.0
Conversation
Summary of Changes

Hello @leekihyunn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances ProxySQL's capabilities by introducing comprehensive auto-discovery and management support for AWS Aurora PostgreSQL clusters. It eliminates the need for manual configuration and updates of server hostgroups, allowing ProxySQL to dynamically adapt to changes in Aurora PostgreSQL topology, such as failovers and new replica additions. This ensures high availability and optimal performance by automatically routing traffic to the correct instances and shunning those with excessive replication lag, mirroring the functionality already present for MySQL Aurora.
Can one of the admins verify this patch?
Code Review
This pull request introduces AWS Aurora PostgreSQL auto-discovery support, achieving feature parity with the existing MySQL implementation. The changes are extensive, touching configuration, monitoring, and hostgroup management, and they closely follow the established patterns in the codebase, which is commendable for consistency.
My review has identified several critical security vulnerabilities related to the use of sprintf with inadequately sized buffers, which could lead to buffer overflows. I strongly recommend replacing these with snprintf and proper size calculation. Additionally, I've pointed out areas where modern C++ practices, such as using std::string instead of C-style char* for memory management, would improve code safety and maintainability. Other suggestions focus on improving code clarity and fixing minor logical issues.
```cpp
query = (char *)malloc(strlen(q) + strlen(_server_id) + strlen(domain_name) + 1024);
sprintf(query, q, _server_id, domain_name, aurora_port, _whid, _rhid);
```
This use of sprintf with a buffer size estimated with a magic number (1024) is a critical security vulnerability. A long _server_id or domain_name could cause a buffer overflow. You should calculate the required buffer size precisely and use snprintf. This unsafe pattern is repeated multiple times in update_aws_aurora_set_writer and update_aws_aurora_set_reader. Please fix all instances.
```cpp
int query_len = snprintf(nullptr, 0, q, _server_id, domain_name, aurora_port, _whid, _rhid);
query = (char *)malloc(query_len + 1);
snprintf(query, query_len + 1, q, _server_id, domain_name, aurora_port, _whid, _rhid);
```

```cpp
char* q2 = (char*)malloc(strlen(q1) + strlen(lut) + strlen(hse->server_id) + 32);
sprintf(q2, q1, wHG, hse->server_id, lut);
```
This is a critical buffer overflow vulnerability. The buffer size is estimated with a magic number (+ 32), which is not guaranteed to be sufficient for the length of lut and hse->server_id. This can lead to a buffer overflow. Please use snprintf with a correctly calculated buffer size to prevent this.
```cpp
int q2_len = snprintf(nullptr, 0, q1, wHG, hse->server_id, lut);
char* q2 = (char*)malloc(q2_len + 1);
snprintf(q2, q2_len + 1, q1, wHG, hse->server_id, lut);
```

```cpp
char *query = (char *)malloc(strlen(q) + strlen(o) + strlen(p) + 256);
sprintf(query, q, writer_hostgroup, reader_hostgroup, active, aurora_port, p, max_lag_ms, check_interval_ms, check_timeout_ms, writer_is_also_reader, new_reader_weight, add_lag_ms, min_lag_ms, lag_num_checks, o);
```
This is a critical buffer overflow vulnerability. The buffer for the SQL query is allocated with a fixed-size addition (+256), which may not be sufficient if the escaped comment (o) or domain_name (p) are very long. This can be exploited. Please use snprintf to safely construct the query string by first calculating the required size.
```cpp
int query_len = snprintf(nullptr, 0, q, writer_hostgroup, reader_hostgroup, active, aurora_port, p, max_lag_ms, check_interval_ms, check_timeout_ms, writer_is_also_reader, new_reader_weight, add_lag_ms, min_lag_ms, lag_num_checks, o);
char *query = (char *)malloc(query_len + 1);
snprintf(query, query_len + 1, q, writer_hostgroup, reader_hostgroup, active, aurora_port, p, max_lag_ms, check_interval_ms, check_timeout_ms, writer_is_also_reader, new_reader_weight, add_lag_ms, min_lag_ms, lag_num_checks, o);
```

```cpp
char *address = (char *)malloc(strlen(_server_id) + strlen(domain_name) + 1);
sprintf(address, "%s%s", _server_id, domain_name);
```
Using sprintf with a manually calculated buffer size is a potential security risk. If the calculation is incorrect, it can lead to a buffer overflow. It is much safer to use snprintf, which prevents buffer overflows by respecting the provided buffer size.
```cpp
size_t address_len = strlen(_server_id) + strlen(domain_name) + 1;
char *address = (char *)malloc(address_len);
snprintf(address, address_len, "%s%s", _server_id, domain_name);
```

```cpp
int check_interval_ms;
int check_timeout_ms;
bool active;
bool __active; // temporary flag for tracking during regeneration
```
```cpp
char *domain_name;
char *comment;
```
For new classes like PgSQL_AWS_Aurora_Info, consider using std::string for members like domain_name and comment instead of char*. std::string handles its own memory management, which prevents memory leaks and makes the code safer and easier to maintain. This would eliminate the need for manual strdup and free calls.
```cpp
std::string domain_name;
std::string comment;
```

```cpp
size_t stride = sizeof(pgsql_host_def_t);
memcpy(&tmp, arr + i * stride / sizeof(pgsql_host_def_t), sizeof(pgsql_host_def_t));
memcpy(arr + i * stride / sizeof(pgsql_host_def_t), arr + j * stride / sizeof(pgsql_host_def_t), sizeof(pgsql_host_def_t));
memcpy(arr + j * stride / sizeof(pgsql_host_def_t), &tmp, sizeof(pgsql_host_def_t));
```
The shuffling logic using memcpy and stride is overly complex. The expression i * stride / sizeof(pgsql_host_def_t) simplifies to just i, making the calculation confusing. Using std::swap or a simple temporary variable would be much clearer and more idiomatic C++.
```cpp
pgsql_host_def_t tmp = arr[i];
arr[i] = arr[j];
arr[j] = tmp;
```

```cpp
char* is_writer_str = PQgetvalue(res, i, 3);
float replica_lag = replica_lag_str ? atof(replica_lag_str) : 0.0f;
bool is_writer = (is_writer_str && (strcmp(is_writer_str, "t") == 0 || strcmp(is_writer_str, "true") == 0 || strcmp(is_writer_str, "1") == 0));
```
The is_writer column is a boolean type from the aurora_replica_status() function. For PostgreSQL, PQgetvalue returns 't' for true and 'f' for false. The checks for "true" and "1" are unnecessary and make the code more fragile. It's better to rely on the documented behavior of PQgetvalue for boolean types.
```cpp
bool is_writer = (is_writer_str && strcmp(is_writer_str, "t") == 0);
```

```cpp
else {
    for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin(); it != resultset->rows.end(); ++it) {
        SQLite3_row* r = *it;
        proxy_error("Incompatible entry in pgsql_aws_aurora_hostgroups will be ignored : ( %s , %s , %s , %s )\n", r->fields[0], r->fields[1], r->fields[2], r->fields[3]);
```
This error message only prints the first 4 fields of an incompatible entry in pgsql_aws_aurora_hostgroups. The table has more columns, and this partial information may not be sufficient to identify the problematic row during debugging. Consider logging all fields or at least the primary key (writer_hostgroup) to make the log message more useful.
This commit implements Aurora PostgreSQL auto-discovery functionality, mirroring the existing Aurora MySQL implementation. The feature enables automatic detection and management of Aurora PostgreSQL cluster topology.

Key features:
- Auto-discovery using the aurora_replica_status() function
- Writer detection via session_id = 'MASTER_SESSION_ID'
- Automatic failover handling with hostgroup reconfiguration
- Replication lag-based server shunning
- Dynamic server addition when new nodes join the cluster

New tables:
- pgsql_aws_aurora_hostgroups (configuration)
- runtime_pgsql_aws_aurora_hostgroups (runtime)

Configuration parameters (same as MySQL Aurora):
- writer_hostgroup, reader_hostgroup
- aurora_port (default: 5432)
- domain_name, max_lag_ms, check_interval_ms, check_timeout_ms
- writer_is_also_reader, new_reader_weight
- add_lag_ms, min_lag_ms, lag_num_checks

Files modified:
- include/ProxySQL_Admin_Tables_Definitions.h: Aurora table definitions
- include/proxysql_admin.h: incoming_aurora_hostgroups field
- lib/Admin_Bootstrap.cpp: Admin table registration
- lib/ProxySQL_Config.cpp: Config file parsing
- lib/ProxySQL_Admin.cpp: Runtime load logic
- include/PgSQL_HostGroups_Manager.h: PgSQL_AWS_Aurora_Info class
- lib/PgSQL_HostGroups_Manager.cpp: Hostgroup management implementation
- include/PgSQL_Monitor.hpp: Monitoring class definitions
- lib/PgSQL_Monitor.cpp: Monitoring thread implementation
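The writer-detection method described above can be sketched as a status query. This is illustrative only, not code from this PR: the column names (server_id, session_id, replica_lag_in_msec, last_update_timestamp) follow AWS's documentation for aurora_replica_status(), and the exact column set varies by Aurora PostgreSQL engine version.

```sql
-- Illustrative sketch of an Aurora PostgreSQL topology check.
SELECT server_id,
       session_id,            -- 'MASTER_SESSION_ID' marks the writer instance
       replica_lag_in_msec,   -- drives lag-based shunning
       last_update_timestamp
FROM aurora_replica_status();
```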
(force-pushed from aecb7cb to d1114d6)
I've reviewed the SonarCloud warnings and code review feedback. These patterns intentionally follow the existing MySQL implementation for consistency:
…tgreSQL

- Add populate_monitor_pgsql_server_aws_aurora_log() to display aurora log entries
- Add populate_monitor_pgsql_server_aws_aurora_check_status() to display check status
- Add pgsql_aws_aurora_hostgroups support in dump_table_pgsql()
- Add runtime_pgsql_aws_aurora_hostgroups dump in save_pgsql_servers_runtime_to_database()
- Remove verbose error logging to match MySQL Aurora behavior pattern
Added missing monitoring table population and runtime table dump functionality: Changes:
These additions complete the admin table query functionality for Aurora PostgreSQL monitoring.
add to whitelist
Hi @leekihyunn. The previous message "add to whitelist" is used to trigger CI runs. Please note that for Aurora MySQL we have a complex simulator (you may find it by searching for …)
Hi @renecannao and @rahim-kanji, Thank you for reviewing this PR! If there's any additional work required, please let me know and I'll address it promptly.
- Add PgSQL_Monitor_Connection_Pool class (equivalent to MySQL_Monitor_Connection_Pool)
- Add MonPgSrvC class for per-server connection management
- Implement get_connection/put_connection for connection reuse
- Add purge_some_connections to limit connections per server (max 4)
- Add purge_all_connections for cleanup on shutdown
- Wrap verbose logging with #ifdef TEST_AURORA guards
- Add print_pgsql_aws_aurora_status_entry for TEST_AURORA debugging
- Add action counting (action_yes/no, enabling/disabling) in evaluate function
…ping

Auto-discovered Aurora PostgreSQL servers were not being added to the ping monitoring target list, causing deleted/failed instances to remain in ONLINE status instead of being properly shunned.

Root cause:
- update_aws_aurora_set_writer() and update_aws_aurora_set_reader() were calling generate_pgsql_servers_table() to sync MyHostGroups to the internal pgsql_servers table
- But update_table_pgsql_servers_for_monitor() was not called afterwards, so the monitor's server list (pgsql_servers_to_monitor) was never updated with the new auto-discovered servers
- The MySQL implementation correctly calls update_table_mysql_servers_for_monitor() in these same locations

Fix: Add update_table_pgsql_servers_for_monitor(false) calls in 4 locations:
- update_aws_aurora_set_writer(): commit path (line 4979)
- update_aws_aurora_set_writer(): auto-discovery path (line 5021)
- update_aws_aurora_set_reader(): commit path (line 5107)
- update_aws_aurora_set_reader(): auto-discovery path (line 5137)

This ensures auto-discovered servers are properly ping monitored and will be shunned when they become unreachable.
MySQL does not log connection failures as errors (only SSL errors). Change PostgreSQL monitor to follow the same pattern by using proxy_debug instead of proxy_error for connection failures. This prevents excessive error logs (every 2 seconds) when a server is unreachable (e.g., deleted Aurora instance). Note: MySQL has DNS caching (DNS_Cache class, dns_lookup(), DNS resolver thread running every 60 seconds), while PostgreSQL does direct DNS lookup on every connection attempt. This causes PostgreSQL to log errors every 2 seconds (ping interval) vs MySQL's ~60 seconds.
This commit ensures Aurora PostgreSQL monitoring behaves identically
to Aurora MySQL monitoring by adding missing functionality:
1. Add runtime variable refresh in Aurora monitoring threads
- HG thread (PgSQL_monitor_AWS_Aurora_thread_HG): Now checks global
variable version on each loop iteration and calls refresh_variables()
when changes are detected, allowing runtime configuration changes
to take effect immediately without restarting the thread
- Main Aurora thread (PgSQL_monitor_aws_aurora): Added PgSQL_Thread
object initialization and variable refresh loop
2. Add monitor_enabled check to Aurora thread loops
- Both HG thread and main Aurora thread now check
pgsql_thread___monitor_enabled in their while conditions
- Setting pgsql-monitor_enabled=false now properly stops Aurora
monitoring threads (previously they would continue running)
3. Add checksum comparison to update_aws_aurora_set_writer()
- Before committing writer changes, compare checksums of current
and incoming server lists
- Only commit if there are actual changes, avoiding unnecessary
table regeneration and log noise
- Add verbose skip logging when hostgroup_manager_verbose > 1
These changes match the behavior of the MySQL Aurora implementation in:
- MySQL_Monitor.cpp: monitor_AWS_Aurora_thread_HG()
- MySQL_Monitor.cpp: MySQL_Monitor::monitor_aws_aurora()
- MySQL_HostGroups_Manager.cpp: update_aws_aurora_set_writer()
Benefits:
- Runtime variable changes (e.g., hostgroup_manager_verbose) are now
reflected immediately in Aurora monitoring threads
- pgsql-monitor_enabled=false properly stops Aurora monitoring
- Reduced unnecessary commits and log output when no actual changes occur
- Full consistency with MySQL Aurora monitoring behavior
Changes:
1. Remove noisy "Monitor connect failed" logging on every connection failure
   - Errors are recorded in the pgsql_server_connect_log table
   - Shunning logic will log when max_failures is reached (matching MySQL behavior)
2. Add Aurora health check error logging (matching MySQL behavior)
   - Connection failure: "Error on AWS Aurora PostgreSQL check for <host>:<port>..."
   - Query failure: "Error on AWS Aurora PostgreSQL check for <host>:<port>... Query failed"
   - The previous incorrect comment stated "Not logging to match MySQL", but MySQL does log these errors
Aurora PostgreSQL Monitoring - Align with MySQL behavior (5 commits)

These fixes were discovered while comparing logs between Aurora PostgreSQL and Aurora MySQL environments during various tests, including instance addition, deletion, and other scenarios.

1. Add connection pool (
retest this please
3 similar comments




Summary
This PR implements AWS Aurora PostgreSQL auto-discovery functionality, bringing feature parity with the existing MySQL Aurora implementation. ProxySQL can now automatically detect Aurora PostgreSQL cluster topology, identify Writer/Reader instances, and handle failovers seamlessly.
Issue #5249
Motivation
Aurora PostgreSQL users currently need to manually configure and update server hostgroups when cluster topology changes. This implementation enables automatic topology discovery via the aurora_replica_status() function.

Why This Matters
As PostgreSQL adoption continues to grow, especially with AWS Aurora PostgreSQL, there's an increasing need for a robust connection pooling and query routing middleware. While PgBouncer has been the traditional choice for PostgreSQL connection pooling, it lacks advanced features like:
With this Aurora PostgreSQL auto-discovery feature, ProxySQL can serve as a more capable alternative to PgBouncer for Aurora PostgreSQL users, offering:
I believe this feature will benefit many Aurora PostgreSQL users who need enterprise-grade connection management and query routing capabilities.
Changes Made
New Tables
- pgsql_aws_aurora_hostgroups - Configuration table for Aurora PostgreSQL clusters
- runtime_pgsql_aws_aurora_hostgroups - Runtime configuration table
- pgsql_server_aws_aurora_log - Aurora health check logs
- pgsql_server_aws_aurora_check_status - Check status tracking
- pgsql_server_aws_aurora_failovers - Failover event logging

Configuration Parameters
All parameters mirror the MySQL Aurora implementation:

- writer_hostgroup / reader_hostgroup - Hostgroup assignments
- max_lag_ms - Maximum replication lag threshold for shunning
- check_interval_ms - Health check interval
- check_timeout_ms - Health check timeout
- writer_is_also_reader - Place writer in reader hostgroup
- new_reader_weight - Weight for newly discovered readers
- add_lag_ms / min_lag_ms / lag_num_checks - Lag estimation parameters
- aurora_port - Default port (5432)
- domain_name - Aurora cluster domain suffix

Core Implementation
Monitoring (lib/PgSQL_Monitor.cpp)

- PgSQL_monitor_AWS_Aurora_thread_HG() - Per-hostgroup monitoring thread
- PgSQL_monitor_aws_aurora() - Main Aurora monitoring coordinator
- estimate_lag() - Replication lag estimation using multiple check samples
- evaluate_pgsql_aws_aurora_results() - Process health check results and trigger actions
- populate_monitor_pgsql_server_aws_aurora_log() - Populate aurora log table for admin queries
- populate_monitor_pgsql_server_aws_aurora_check_status() - Populate check status table for admin queries

HostGroups Management (lib/PgSQL_HostGroups_Manager.cpp)

- PgSQL_AWS_Aurora_Info class - Aurora cluster configuration storage
- update_aws_aurora_set_writer() - Set/update writer with auto-discovery
- update_aws_aurora_set_reader() - Set/update reader with auto-discovery
- aws_aurora_replication_lag_action() - Lag-based server shunning/unshunning
- generate_pgsql_aws_aurora_hostgroups_table() - Generate runtime configuration
- dump_table_pgsql() - Added support for pgsql_aws_aurora_hostgroups table dump

Admin (lib/ProxySQL_Admin.cpp)

- save_pgsql_servers_runtime_to_database() - Added runtime aurora hostgroups dump
- GenericRefreshStatistics() - Added triggers for aurora log/check_status table population

Technical Details
Aurora Status Query

PostgreSQL Aurora uses the aurora_replica_status() function instead of MySQL's information_schema.replica_host_status:

Key Differences from MySQL Implementation

| Aspect | Aurora MySQL | Aurora PostgreSQL |
| --- | --- | --- |
| Topology status source | information_schema.replica_host_status | aurora_replica_status() |
| Naming prefixes | MySQL_* / AWS_Aurora_* | PgSQL_* / PgSQL_AWS_Aurora_* |
Both MySQL and PostgreSQL Aurora use the same method:
session_id = 'MASTER_SESSION_ID'Example Configuration
Files Changed
- include/ProxySQL_Admin_Tables_Definitions.h - Aurora table definitions
- include/proxysql_admin.h - incoming_aurora_hostgroups field
- lib/Admin_Bootstrap.cpp - Admin table registration
- lib/ProxySQL_Config.cpp - pgsql_aws_aurora_hostgroups config file parsing
- lib/ProxySQL_Admin.cpp - Runtime load logic
- include/PgSQL_HostGroups_Manager.h - PgSQL_AWS_Aurora_Info class definition
- lib/PgSQL_HostGroups_Manager.cpp - Hostgroup management implementation
- include/PgSQL_Monitor.hpp - Monitoring class definitions
- lib/PgSQL_Monitor.cpp - Monitoring thread implementation

Testing
Tested with an AWS Aurora PostgreSQL cluster:

- aurora_replica_status()