PostgreSQL Backend Monitoring

PostgreSQL Monitor Module

Overview

ProxySQL aims to deliver a comprehensive monitoring solution for PostgreSQL servers. It leverages asynchronous connections and a multi-threaded architecture to efficiently perform ping and connection checks on a set of servers.

Solution Architecture:

Multi-threaded: Utilizes multiple worker threads to parallelize monitoring tasks.
Asynchronous Connections: Employs asynchronous PostgreSQL connections for efficient handling of multiple servers.
Connection Pooling: Maintains a pool of connections to reduce connection overhead.
Scheduler Thread: Manages task scheduling, configuration updates, and thread management.

Core Functionality:

Currently monitoring checks supported are:

Ping Checks: Regularly pings PostgreSQL servers to assess their responsiveness.
- Server Shunning: Automatically identifies and “shuns” unresponsive servers, preventing further connections to them.
- Latency Tracking: Tracks the average latency of successful ping operations for each server.
Connection Checks: Attempts to establish connections to servers to verify their availability and assess connection latency.

Any monitoring check also offers a common set of features:

Error Logging: Records ping and connection attempts, including success/failure status, latency, and error messages.
History Management: Maintains a configurable history of ping and connection logs, allowing for trend analysis.

If you are new to ProxySQL, refer to the User Guide section for a deeper and more guided explanation of the listed features.

Tables

The following tables contain the monitor log data and are managed by the monitor module:

radmin=# SHOW CREATE TABLE monitor.pgsql_server_connect_log;
          table           |                   Create Table
--------------------------+--------------------------------------------------
 pgsql_server_connect_log | CREATE TABLE pgsql_server_connect_log (         +
                          |     hostname VARCHAR NOT NULL,                  +
                          |     port INT NOT NULL DEFAULT 3306,             +
                          |     time_start_us INT NOT NULL DEFAULT 0,       +
                          |     connect_success_time_us INT DEFAULT 0,      +
                          |     connect_error VARCHAR,                      +
                          |     PRIMARY KEY (hostname, port, time_start_us))
(1 row)

radmin=# SHOW CREATE TABLE monitor.pgsql_server_ping_log;
         table         |                   Create Table
-----------------------+--------------------------------------------------
 pgsql_server_ping_log | CREATE TABLE pgsql_server_ping_log (            +
                       |      hostname VARCHAR NOT NULL,                 +
                       |     port INT NOT NULL DEFAULT 3306,             +
                       |     time_start_us INT NOT NULL DEFAULT 0,       +
                       |     ping_success_time_us INT DEFAULT 0,         +
                       |     ping_error VARCHAR,                         +
                       |     PRIMARY KEY (hostname, port, time_start_us))
(1 row)

Configuration Variables

The following variables can be configured to customize the monitoring behavior:

pgsql-monitor_enabled: (boolean) Enables or disables the monitoring system.
pgsql-monitor_threads: (integer) Specifies the number of worker threads to use for monitoring.
pgsql-monitor_ping_interval: (integer, milliseconds) Sets the interval between ping checks.
pgsql-monitor_ping_timeout: (integer, milliseconds) Sets the timeout for ping checks.
pgsql-monitor_ping_max_failures: (integer) Specifies the maximum number of consecutive ping failures before a server is shunned.
pgsql-monitor_connect_interval: (integer, milliseconds) Sets the interval between connection checks.
pgsql-monitor_connect_timeout: (integer, milliseconds) Sets the timeout for connection checks.
pgsql-monitor_username: (string) Specifies the username used for connecting to PostgreSQL servers.
pgsql-monitor_password: (string) Specifies the password used for connecting to PostgreSQL servers.
pgsql-monitor_history: (integer, seconds) Sets the maximum age of ping and connection logs to keep.
pgsql-monitor_read_only_interval: WIP
pgsql-monitor_read_only_timeout: WIP
pgsql-monitor_read_only_max_timeout_count: WIP

Configuration Method:

The previously specified variables can be configured via either Configuration File or via Admin Interface.

User Guide

If you are new to ProxySQL, a good starting points for getting familiar with several core concepts that are shared for MySQL and PostgreSQL servers:

Multi-Layer Configuration System: https://proxysql.com/documentation/configuring-proxysql/
Admin CLI Interface: https://proxysql.com/documentation/the-admin-schemas/
Backend Server Configuration: https://proxysql.com/documentation/backend-server-configuration/

Once familiar with these basic concepts, let’s configure ProxySQL with a minimal setup making use of it’s monitoring capabilities for PostgreSQL.

First, lets create and analyze an example configuration file:

# Location to store the database file, logs and generated data.
datadir="$PATH"

# Administration variables
admin_variables=
{
    # Credentials to logging to `ProxySQL` Admin CLI Interface - https://proxysql.com/documentation/the-admin-schemas/.
    # IMPORTANT: Remember that default 'admin:admin' credentials can only be used from localhost. These are
    # sample defaults and should be changed for your configuration.
    admin_credentials="admin:admin;radmin:radmin"
    # Port for accessing Admin interface using MySQL Protocol
    mysql_ifaces="0.0.0.0:6032"
    # Port for accessing Admin interface using `PostgreSQL` Protocol
    pgsql_ifaces="0.0.0.0:6132"
}

pgsql_variables=
{
    # Number of threads used to serve traffic
    threads=2
    # Maximum global number of client connections for `PostgreSQL` interface
    max_connections=2048
    # A defaultly imposed query delay
    default_query_delay=0
    # The default query timeout for backend queries
    default_query_timeout=36000000
    # Retries to perform when failing to create a connection for serving traffic
    connect_retries_on_failure=10
    # Minimal timeout used to detect incoming/outgoing traffic via the `poll` syscall
    poll_timeout=2000
    # Interface to be used for serving traffic
    interfaces="0.0.0.0:6133"
    # Server version that `ProxySQL` will advertise to the client
    server_version="16.1"
    # Timeout to use when creating backend server connections
    connect_timeout_server=3000
    # History to be retained for Monitoring purposes on 'monitor' schema tables
    monitor_history=600000
    # Interval used for Monitor Connect checks
    monitor_connect_interval=60000
    # Timeout used for Monitor Connect checks
    monitor_connect_timeout=600
    # Interval used for Monitor Ping checks
    monitor_ping_interval=8000
    # Timeout used for Monitor Ping checks
    ping_timeout_server=1000
    # Create stats about the received commands - https://proxysql.com/documentation/stats-statistics/#stats_mysql_commands_counters
    commands_stats=true
}

ProxySQL uses this configuration file to initialize a database file in the supplied datadir. It’s very important to remember that this file will only be used in an initial bootstrap, i.e. no database file is found, in an initial startup or in a reload startup, for more information see Configuration Life Cycle.

It’s also important to remember that the configuration file isn’t recommended for complex configurations, it should only be used as a way to bootstrap ProxySQL with minimal/basic configuration. The recommended way to supply/deploy your configuration is via the database file. See Configuration file vs Database file. For more details about the configuration file semantics please check Configuration File docs.

We store this configuration in a file (proxysql.cnf) and then we proceed to launch ProxySQL. In this example we would be launching ProxySQL as a foreground process:

proxysql --clickhouse-server --sqlite3-server --idle-threads -f -c $CONFIG_PATH/proxysql.cnf -D $DATADIR_PATH

Once launched we are going to login to ProxySQL Admin interface, and check our current configuration. Remember that we have the option of doing this from a MySQL client and from a PostgreSQL client.

From MySQL:

mysql --prompt="admin> " -uradmin -pradmin -h127.0.0.1 -P6032
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 8.0.11 (ProxySQL Admin Module)

Copyright (c) 2000, 2022, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

admin>

From PostgreSQL:

PGPASSWORD='radmin' psql -h127.0.0.1 -p6132 -Uradmin
psql (16.3, server 16.1)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
Type "help" for help.

radmin=#

Let’s follow using the psql client for the purpose of this guide. Let’s check the databases which exposes us the current ProxySQL modules and configuration:

radmin=# show databases;
 seq |       name       |        file
-----+------------------+----------------------------
 0   | main             |
 2   | disk             | /$DATADIR/proxysql.db
 3   | stats            |
 4   | monitor          |
 5   | stats_history    | /$DATADIR/proxysql_stats.db
 6   | myhgm            |
 7   | monitor_internal |
(7 rows)

For a how to on how to perform an initial servers and users configuration you can refer to the guide first time PostgreSQL configuration. We are going to configure just one server, our monitoring user and verify that monitoring is working as expected. Later, we will also configure a user for serving traffic and check that we can perform queries to our monitored server.

First, let’s insert our PostgreSQL server, for doing that, we could always check the table definition if we don’t remember the required fields:

radmin=# SHOW CREATE TABLE pgsql_servers;
     table     |                                                        Create Table
---------------+----------------------------------------------------------------------------------------------------------------------------
 pgsql_servers | CREATE TABLE pgsql_servers (                                                                                              +
               |     hostgroup_id INT CHECK (hostgroup_id>=0) NOT NULL DEFAULT 0,                                                          +
               |     hostname VARCHAR NOT NULL,                                                                                            +
               |     port INT CHECK (port >= 0 AND port <= 65535) NOT NULL DEFAULT 5432,                                                   +
               |     status VARCHAR CHECK (UPPER(status) IN ('ONLINE','SHUNNED','OFFLINE_SOFT', 'OFFLINE_HARD')) NOT NULL DEFAULT 'ONLINE',+
               |     weight INT CHECK (weight >= 0 AND weight <=10000000) NOT NULL DEFAULT 1,                                              +
               |     compression INT CHECK (compression IN(0,1)) NOT NULL DEFAULT 0,                                                       +
               |     max_connections INT CHECK (max_connections >=0) NOT NULL DEFAULT 1000,                                                +
               |     max_replication_lag INT CHECK (max_replication_lag >= 0 AND max_replication_lag <= 126144000) NOT NULL DEFAULT 0,     +
               |     use_ssl INT CHECK (use_ssl IN(0,1)) NOT NULL DEFAULT 0,                                                               +
               |     max_latency_ms INT UNSIGNED CHECK (max_latency_ms>=0) NOT NULL DEFAULT 0,                                             +
               |     comment VARCHAR NOT NULL DEFAULT '',                                                                                  +
               |     PRIMARY KEY (hostgroup_id, hostname, port) )
(1 row)

Let’s insert our first server entry:

radmin=# INSERT INTO pgsql_servers (hostgroup_id, hostname, port, max_connections, comment) VALUES (0, '127.0.0.1', 15432, 500, 'first_server');
INSERT 0 1
radmin=# SELECT * FROM pgsql_servers;
 hostgroup_id | hostname  | port  | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms |   comment
--------------+-----------+-------+--------+--------+-------------+-----------------+---------------------+---------+----------------+--------------
 0            | 127.0.0.1 | 15432 | ONLINE | 1      | 0           | 500             | 0                   | 0       | 0              | first_server
(1 row)

Now we should configure our monitoring user, but first we are going to check our current monitoring config, and to LOAD TO RUNTIME our current changes, so we can check the monitoring errors that would take place with a misconfigured monitoring user.

radmin=# SHOW VARIABLES LIKE 'pgsql-%monitor%';
          Variable_name          |  Value
---------------------------------+---------
 pgsql-monitor_enabled           | true
 pgsql-monitor_ping_max_failures | 3
 pgsql-monitor_ping_timeout      | 1000
 pgsql-monitor_username          | monitor
 pgsql-monitor_password          | monitor
 pgsql-monitor_threads           | 2
 pgsql-monitor_history           | 600000
 pgsql-monitor_connect_interval  | 60000
 pgsql-monitor_connect_timeout   | 600
 pgsql-monitor_ping_interval     | 8000
(10 rows)

Our PostgreSQL instance is unlikely to have configured this credentials for a monitoring user, let’s try to promote our changes to the servers to runtime and inspect the monitor tables for errors:

radmin=# LOAD PGSQL SERVERS TO RUNTIME;
LOAD
radmin=# SELECT * FROM runtime_pgsql_servers;
 hostgroup_id | hostname  | port  | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms |   comment
--------------+-----------+-------+--------+--------+-------------+-----------------+---------------------+---------+----------------+--------------
 0            | 127.0.0.1 | 15432 | ONLINE | 1      | 0           | 500             | 0                   | 0       | 0              | first_server
(1 row)

Immediately we can see ProxySQL error log start getting populated with errors:

2024-09-30 19:13:07 PgSQL_Monitor.cpp:463:handle_pg_event(): [ERROR] Monitor connect FAILED   error='FATAL:  password authentication failed for user "monitor"'
2024-09-30 19:13:11 PgSQL_Monitor.cpp:463:handle_pg_event(): [ERROR] Monitor connect FAILED   error='FATAL:  password authentication failed for user "monitor"'
2024-09-30 19:13:19 PgSQL_Monitor.cpp:463:handle_pg_event(): [ERROR] Monitor connect FAILED   error='FATAL:  password authentication failed for user "monitor"'
2024-09-30 19:14:15 PgSQL_Monitor.cpp:463:handle_pg_event(): [ERROR] Monitor connect FAILED   error='FATAL:  password authentication failed for user "monitor"'

And so they do the relevant tables from monitor schema:

radmin=# SELECT * FROM monitor.pgsql_server_ping_log LIMIT 4;
 hostname  | port  |  time_start_us   | ping_success_time_us |                        ping_error
-----------+-------+------------------+----------------------+-----------------------------------------------------------
 127.0.0.1 | 15432 | 1727716391082567 | 0                    | FATAL:  password authentication failed for user "monitor"
 127.0.0.1 | 15432 | 1727716399134066 | 0                    | FATAL:  password authentication failed for user "monitor"
 127.0.0.1 | 15432 | 1727716407185601 | 0                    | FATAL:  password authentication failed for user "monitor"
 127.0.0.1 | 15432 | 1727716415237181 | 0                    | FATAL:  password authentication failed for user "monitor"
(4 rows)

radmin=# SELECT * FROM monitor.pgsql_server_connect_log LIMIT 4;
 hostname  | port  |  time_start_us   | connect_success_time_us |                       connect_error
-----------+-------+------------------+-------------------------+-----------------------------------------------------------
 127.0.0.1 | 15432 | 1727716387031634 | 0                       | FATAL:  password authentication failed for user "monitor"
 127.0.0.1 | 15432 | 1727716447443701 | 0                       | FATAL:  password authentication failed for user "monitor"
 127.0.0.1 | 15432 | 1727716507856513 | 0                       | FATAL:  password authentication failed for user "monitor"
 127.0.0.1 | 15432 | 1727716568269249 | 0                       | FATAL:  password authentication failed for user "monitor"
(4 rows)

Now let’s configure a monitoring user in our PostgreSQL instance. ProxySQL doesn’t allow to set right now a custom database for the monitoring user, so for now you should create a custom database this user. This is likely to change in near the future, stay tuned for updates:

$ PGPASSWORD='postgres' psql -h127.0.0.1 -p15432 -Upostgres
psql (16.3, server 16.4 (Debian 16.4-1.pgdg120+1))
Type "help" for help.

postgres=# CREATE USER 'proxymon' WITH PASSWORD 'proxymon';
postgres=# CREATE DATABASE 'proxymon';

Let’s change the monitoring variables in ProxySQL:

radmin=# SET pgsql-monitor_username='proxymon';
UPDATE 1
radmin=# SET pgsql-monitor_password='proxymon';
UPDATE 1
radmin=# LOAD PGSQL VARIABLES TO RUNTIME;
LOAD

Errors should stop being written in ProxySQL error log and monitoring tables should show operations succeeding:

radmin=# SELECT * FROM monitor.pgsql_server_ping_log ORDER BY time_start_us DESC LIMIT 4;
 hostname  | port  |  time_start_us   | ping_success_time_us | ping_error
-----------+-------+------------------+----------------------+------------
 127.0.0.1 | 15432 | 1727717357683130 | 322                  |
 127.0.0.1 | 15432 | 1727717349580977 | 372                  |
 127.0.0.1 | 15432 | 1727717341529226 | 314                  |
 127.0.0.1 | 15432 | 1727717333477716 | 353                  |
(4 rows)

radmin=# SELECT * FROM monitor.pgsql_server_connect_log ORDER BY time_start_us DESC LIMIT 4;
 hostname  | port  |  time_start_us   | connect_success_time_us |                       connect_error
-----------+-------+------------------+-------------------------+-----------------------------------------------------------
 127.0.0.1 | 15432 | 1727717293219532 | 9882                    |
 127.0.0.1 | 15432 | 1727717232806648 | 0                       | FATAL:  password authentication failed for user "monitor"
 127.0.0.1 | 15432 | 1727717172394225 | 0                       | FATAL:  password authentication failed for user "monitor"
 127.0.0.1 | 15432 | 1727717111981411 | 0                       | FATAL:  password authentication failed for user "monitor"
(4 rows)

We can can also check the current server status from the same interface via runtime_pgsql_servers:

radmin=# SELECT * FROM runtime_pgsql_servers;
 hostgroup_id | hostname  | port  | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms |   comment
--------------+-----------+-------+--------+--------+-------------+-----------------+---------------------+---------+----------------+--------------
 0            | 127.0.0.1 | 15432 | ONLINE | 1      | 0           | 500             | 0                   | 1       | 0              | first_server
(1 row)

Let’s create some unexpected errors, and check that ProxySQL detects the issue and SHUNN our server due to monitoring errors. We can do this by simply blocking the traffic to our server IP (if using docker we could just pause our docker container):

$ sudo iptables -I OUTPUT 1 -p tcp -d 127.0.0.1 --dport 15432 -j DROP

$ PGPASSWORD='radmin' psql -h127.0.0.1 -p6132 -Uradmin
psql (16.3, server 16.1)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
Type "help" for help.

radmin=# SELECT * FROM monitor.pgsql_server_ping_log ORDER BY time_start_us DESC LIMIT 4;
 hostname  | port  |  time_start_us   | ping_success_time_us |     ping_error
-----------+-------+------------------+----------------------+---------------------
 127.0.0.1 | 15432 | 1727717953708241 | 0                    | Operation timed out
 127.0.0.1 | 15432 | 1727717945656487 | 0                    | Operation timed out
 127.0.0.1 | 15432 | 1727717937604824 | 0                    | Operation timed out
 127.0.0.1 | 15432 | 1727717929553071 | 0                    | Operation timed out
(4 rows)

As a response to this, we should see in the error log an entry notifying that the server has been shunned due to unresponsiveness:

2024-09-30 19:37:37 PgSQL_Monitor.cpp:1236:shunn_non_resp_srvs(): [ERROR] Server 127.0.0.1:15432 missed 3 heartbeats, shunning it and killing all the connections. Disabling other checks until the node comes back online.

We can also verify that the server status have changed in the runtime_pgsql_servers table:

radmin=# SELECT * FROM runtime_pgsql_servers;
 hostgroup_id | hostname  | port  | status  | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms |   comment
--------------+-----------+-------+---------+--------+-------------+-----------------+---------------------+---------+----------------+--------------
 0            | 127.0.0.1 | 15432 | SHUNNED | 1      | 0           | 500             | 0                   | 1       | 0              | first_server
(1 row)

If we revert this scenario, monitoring checks should start working again:

$ sudo iptables -D OUTPUT 1
$ PGPASSWORD='radmin' psql -h127.0.0.1 -p6132 -Uradmin
psql (16.3, server 16.1)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
Type "help" for help.

radmin=# SELECT * FROM monitor.pgsql_server_ping_log ORDER BY time_start_us DESC LIMIT 4;
 hostname  | port  |  time_start_us   | ping_success_time_us | ping_error
-----------+-------+------------------+----------------------+------------
 127.0.0.1 | 15432 | 1727718171204999 | 370                  |
 127.0.0.1 | 15432 | 1727718163153198 | 410                  |
 127.0.0.1 | 15432 | 1727718155101422 | 375                  |
 127.0.0.1 | 15432 | 1727718147049801 | 356                  |
(4 rows)

Yet if we inspect the runtime table, our server SHUNNED condition won’t have changed:

radmin=# SELECT * FROM runtime_pgsql_servers;
 hostgroup_id | hostname  | port  | status  | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms |   comment
--------------+-----------+-------+---------+--------+-------------+-----------------+---------------------+---------+----------------+--------------
 0            | 127.0.0.1 | 15432 | SHUNNED | 1      | 0           | 500             | 0                   | 1       | 0              | first_server
(1 row)

This is not because our server is still unavailable, but because ProxySQL updates this table in a lazy way. Only when traffic is routed to the server, it will be bring back ONLINE and the value will change from this table. For this, we are going to configure a user for PostgreSQL and ProxySQL that let’s us exercise traffic through the server.

First we create a user in our PostgreSQL database and a database that will be target by ProxySQL connections:

stgres=# CREATE USER 'proxyuser' WITH PASSWORD 'proxypass';

Let’s create the user for ProxySQL, first let’s see the table definition as for pgsql_servers:

radmin=# SHOW CREATE TABLE pgsql_users;
    table    |                                         Create Table
-------------+-----------------------------------------------------------------------------------------------
 pgsql_users | CREATE TABLE pgsql_users (                                                                   +
             |     username VARCHAR NOT NULL,                                                               +
             |     password VARCHAR,                                                                        +
             |     active INT CHECK (active IN (0,1)) NOT NULL DEFAULT 1,                                   +
             |     use_ssl INT CHECK (use_ssl IN (0,1)) NOT NULL DEFAULT 0,                                 +
             |     default_hostgroup INT NOT NULL DEFAULT 0,                                                +
             |     transaction_persistent INT CHECK (transaction_persistent IN (0,1)) NOT NULL DEFAULT 1,   +
             |     fast_forward INT CHECK (fast_forward IN (0,1)) NOT NULL DEFAULT 0,                       +
             |     backend INT CHECK (backend IN (0,1)) NOT NULL DEFAULT 1,                                 +
             |     frontend INT CHECK (frontend IN (0,1)) NOT NULL DEFAULT 1,                               +
             |     max_connections INT CHECK (max_connections >=0) NOT NULL DEFAULT 10000,                  +
             |     attributes VARCHAR CHECK (JSON_VALID(attributes) OR attributes = '') NOT NULL DEFAULT '',+
             |     comment VARCHAR NOT NULL DEFAULT '',                                                     +
             |     PRIMARY KEY (username, backend),                                                         +
             |     UNIQUE (username, frontend))
(1 row)

Let’s now insert the user and promote the changes to RUNTIME:

radmin=# INSERT INTO pgsql_users (username,password,default_hostgroup,max_connections,comment) VALUES ('proxyuser','proxypass',0,400,'first_user');
INSERT 0 1
radmin=# LOAD PGSQL USERS TO RUNTIME;
LOAD

Let’s perform a connection to ProxySQL and a dummy query, this should update our server status in runtime_pgsql_servers table:

$ PGPASSWORD='proxypass' psql -h127.0.0.1 -p6133 -Uproxyuser -dpostgres
psql (16.3, server 16.1)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
Type "help" for help.

postgres=> do $$ BEGIN NULL; END $$;
DO

$ PGPASSWORD='radmin' psql -h127.0.0.1 -p6132 -Uradmin
psql (16.3, server 16.1)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
Type "help" for help.

radmin=# SELECT * FROM runtime_pgsql_servers;
 hostgroup_id | hostname  | port  | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms |   comment
--------------+-----------+-------+--------+--------+-------------+-----------------+---------------------+---------+----------------+--------------
 0            | 127.0.0.1 | 15432 | ONLINE | 1      | 0           | 500             | 0                   | 0       | 0              | first_server
(1 row)

As we can see, our server state is now reflected back as ONLINE.