AWS Aurora is a MySQL compatible proprietary solution developed by AWS for the cloud

It differs from MySQL in several aspects, but for the purpose of this document we will focus only on one aspect: replication.
In Aurora, a writer/master can have multiple replicas: replicas aren’t updated using MySQL protocol replication, but by using a proprietary mechanism developed by AWS, where changes are replicated at the storage layer.
Without going into the technical details, what is important to highlight is that:

  • replication cannot be monitored using queries normally used with MySQL protocol replication like SHOW SLAVE STATUS
  • replication is normally in the order of few milliseconds, and rarely above seconds
  • Aurora exports metrics and status using new tables, where it is possible to identify the current writer/master, the replicas, and their replication lag
  • Aurora supports auto-provisioning of new replicas, for example in case of the failure of existing replicas
  • Aurora supports automatic-failover

ProxySQL introduces a new monitor algorithm to monitor the status of AWS Aurora clusters which is able to detect:

  • current topology : writer/master and replicas
  • failovers
  • replication lag of all the replicas
  • failed replicas
  • new replicas

New configuration table: mysql_aws_aurora_hostgroups

In order to monitor AWS Aurora clusters, clusters need to be configured in a new table: mysql_aws_aurora_hostgroups.

Table structure:

Multiple fields define an AWS Aurora Cluster:

  • writer_hostgroup is the hostgroup that will be assigned to the writer/master . This is also the primary key of the table, which means that every writer_hostgroup represents a single cluster
  • reader_hostgroup is the hostgroup that will be assigned to all the replicas in a given cluster . There is a UNIQUE constraint on this column to ensure that multiple clusters aren’t configured with the same reader_hostgroup
  • active allows to toggle the monitoring of a cluster on and off. For example, it is possible to configure a cluster but not monitor it (yet)
  • aurora_port is the port that Aurora uses to accept connections. All nodes in a cluster use the same port. By default this is port 3306, that is the same default port of MySQL
  • domain_name : in the AWS Aurora internal table only hostnames are listed. ProxySQL, however, needs the fully qualified domain name in mysql_servers . When a new server is added in mysql_servers, ProxySQL will add domain_name to the server hostname to obtain the FQDN . For example, if hostname is serverA and domain_name is .abcde.us-east-1.rds.amazonaws.com , in mysql_servers table the hostname serverA.abcde.us-east-1.rds.amazonaws.com will be added. Note that domain_name must start with a dot, and table mysql_aws_aurora_hostgroups enforces this constraint.
  • max_lag_ms : replicas that have a replication greater or equal than max_lag_ms milliseconds are automatically disabled from the cluster until their lag returns below the configured threshold. Also note that from empirical results (not documented by AWS) a failed node will have a replication lag of 600000ms , that is also the maximum value of max_lag_ms: the failed node will be automatically disabled
  • check_interval_ms defines how frequently ProxySQL’s monitor will check the status of the cluster
  • check_timeout_ms defines the timeout for a check
  • writer_is_also_reader : this setting defines if writer/master will also be configured as part of the reader hostgroup
  • new_reader_weight : this is the weight that will be assigned to new auto-discovered replicas
  • add_lag_ms , min_lag_ms and lag_num_checks : because replication lag is pulled at regular intervals as defined in check_interval_ms and not in real time, the last measured replication lag could not be accurate. These 3 variables perform some tweaks on the last value of replication lag:
    • if replication lag is less than min_lag_ms, assume a read of min_lag_ms
    • to any value read, add add_lag_ms
    • if lag_num_checks is greater than 1, the current replication lag is computed as the highest of the last lag_num_checks reads
  • comment is any text that an administrator can assign to an Aurora cluster, for example to note what the cluster is used for

Following the same convention of other configuration tables, the configuration currently loaded at runtime is available in table runtime_mysql_aws_aurora_hostgroups , that is the runtime configuration of mysql_aws_aurora_hostgroups .

Monitor tables

Monitor module introduces 3 new tables with regards to AWS Aurora:

Table mysql_server_aws_aurora_log

When an Aurora cluster is enabled, Monitor connects to all the hosts in the cluster (in rotation), querying one node every check_interval_ms milliseconds. Monitor will retrieve cluster information from Aurora table information_schema.replica_host_status .

ProxySQL’s table monitor.mysql_server_aws_aurora_log stores the information retrieved from information_schema.replica_host_status (SERVER_ID, SESSION_ID, LAST_UPDATE_TIMESTAMP, replica_lag_in_milliseconds) , together with additional information:

  • hostname/port: the node where the check was performed
  • time_stamp_us : when the check was performed
  • success_time_us : if the check was successful, how long did it take
  • error: if the check was not successful, what was the error message
  • estimated_lag_ms : estimated lag is based on reading replica_lag_in_milliseconds and applying configured add_lag_ms , min_lag_ms and lag_num_checks

Note that ProxySQL determines which one is the writer/master based on SESSION_ID : if the value is MASTER_SESSION_ID , the server specific in SERVER_ID is the master. All the other servers with SESSION_ID not equal to MASTER_SESSION_ID are replicas.

Table mysql_server_aws_aurora_check_status

While table mysql_server_aws_aurora_log is a log table, table mysql_server_aws_aurora_check_status provides some aggregate information.

  • writer_hostgroup + hostname + port defines a specific server
  • last_checked_at shows the last time this server was checked
  • checks_tot counts the number of checks executed against this server
  • checks_ok counts the number of successful checks
  • last_error shows the last error returned by this server (if any)

Table mysql_server_aws_aurora_failovers

This table records all the detected failovers.

Every time a failover happens, a new row is inserted into table mysql_server_aws_aurora_failovers :

  • writer_hostgroup represents the Aurora cluster
  • hostname shows the newly promoted master
  • inserted_at is the timestamp of the failover

Routing and replication lag

Since ProxySQL is able to monitor the replication lag of replicas in milliseconds, it is able to determine which replicas reads should be sent to, that is the replicas that are up to date within the configured thresholds.

Most of the configuration is performed in table mysql_aws_aurora_hostgroups, specifically on columns max_lag_ms, add_lag_ms, min_lag_ms, and lag_num_checks .

If a query id is configured to be sent to a replica, the suitable replicas are determined based on these cluster settings.

Furthermore, the client is able to specify a more restrictive threshold for max_lag_ms sending a comment with setting max_lag_ms . For example, the Aurora cluster could be configured with mysql_aws_aurora_hostgroups.max_lag_ms=1000 , but a client could send a query like:

In this case, only replicas with replication lag less or equal than 20ms will be considered to process the query.

Hold request when no suitable node is found

If no replicas are suitable to execute the queries (for example, if none has a lass less than 20ms replication lag), ProxySQL is able to hold the requests until it is safe to execute the query.

For example, if max_lag_ms specified by the client is 20, but the replica has a lag of 50ms, ProxySQL can wait 30ms before executing the query on the replica.

Writer in reader hostgroup

If mysql_aws_aurora_hostgroups.writer_is_also_reader is enabled, then the writer is also configured in the reader hostgroup.

This creates some interesting consequences if clients specify a very small value for max_lag_ms : because the writer always has a lag of 0 , ProxySQL may send traffic to it instead of waiting for replicas to catch up.

Similarly, it is possible that a user does not want to use the writer no matter the lag of the replicas if replicas exist.

ProxySQL has a new global variable that controls this behavior: mysql-aurora_max_lag_ms_only_read_from_replicas

The variable defines the minimum number of replicas that need to be present in order to ignore the writer in the reader hostgroup, if the client specifies a value of max_lag_ms in the query it sends.