At Pismo, we use the Amazon Aurora database with read replicas to enhance performance across multiple regions. Recently, we had an issue with one of our read replicas. A failure left that specific database instance hung for 15 minutes until we restarted it. Analysing this incident, we learned a few strategies to increase database resilience. In this article, we will discuss one of them and show how we validated it with chaos engineering using the Gremlin tool.
During the post-mortem of the incident, we found error logs in Amazon CloudWatch that pointed to the root cause. Here is an example:
[Warning] InnoDB: A long semaphore wait… This could be because you hit a bug. It is also possible that this binary or one of the libraries it was linked against is corrupt, improperly built, or misconfigured. This error can also be caused by malfunctioning hardware. Attempting to collect some information that could help diagnose the problem.
We use ProxySQL in some services with multiple goals. The main one is to help our APIs separate SELECT commands from write commands (such as CREATE, INSERT, and DELETE), splitting them between the read replicas and the primary database instance. This way, we consume fewer resources on the primary node.
We do this with MySQL Query Rules in ProxySQL. We have one hostgroup configured for write operations (the primary database instance) and another for read operations (the replicas). When a SELECT arrives as part of a transaction, ProxySQL redirects the query to the read hostgroup.
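As an illustration of this split, here is a minimal sketch in the ProxySQL admin interface. The hostgroup IDs, hostnames, and rules below are hypothetical, not our production values:

```sql
-- Hypothetical hostgroups: 10 = writer (primary), 20 = readers (replicas).
INSERT INTO mysql_servers (hostgroup_id, hostname, port)
VALUES (10, 'aurora-primary.example.com', 3306),
       (20, 'aurora-replica-1.example.com', 3306);

-- Default destination is the writer hostgroup...
UPDATE mysql_users SET default_hostgroup = 10;

-- ...and query rules match SELECTs and send them to the readers.
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (1, 1, '^SELECT.*FOR UPDATE', 10, 1),  -- locking reads stay on the writer
       (2, 1, '^SELECT', 20, 1);              -- plain SELECTs go to the read replicas

LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;
LOAD MYSQL USERS TO RUNTIME; SAVE MYSQL USERS TO DISK;
LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;
```

Rules are evaluated in `rule_id` order, which is why the `FOR UPDATE` rule comes before the generic SELECT rule.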
When the incident happened, our read hostgroup had only one instance. Since this server was hung, it blocked the execution of almost every SELECT command from our APIs. To prevent this from happening again, we changed some configurations so that our primary Aurora instance also executes SELECTs, but with a lighter workload. With this change, ProxySQL first tries to route a SELECT command to a read replica; after a few unsuccessful attempts, it starts routing the commands to the primary Aurora instance.
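One way to implement this kind of fallback (a sketch, with illustrative weights and hostnames rather than our production settings) is to also register the primary in the read hostgroup with a much lower weight. It then receives only a light share of SELECT traffic in normal operation, and absorbs all of it when ProxySQL marks the replica as offline:

```sql
-- Hypothetical read hostgroup 20: the replica gets most SELECTs,
-- while the primary is present with a low weight as a fallback.
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight)
VALUES (20, 'aurora-replica-1.example.com', 3306, 1000),
       (20, 'aurora-primary.example.com',  3306, 10);

LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
```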
To test our solution, we created a scenario in the Gremlin chaos engineering tool. We configured it to generate a 5-second latency in a read replica from every pod/container in our ProxySQL pipeline. Our experiment generated error messages like this on Amazon CloudWatch:
MySQL_Monitor.cpp:1140:monitor_read_only_thread(): [ERROR] Timeout on read_only check for merchants-1.co79t18at2o3.sa-east-1.rds.amazonaws.com:3306 after 1002ms. Unable to create a connection. If the server is overload, increase mysql-monitor_connect_timeout.
Error: timeout on creating new connection: Can't connect to MySQL server on 'merchants-1'
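The first message suggests raising `mysql-monitor_connect_timeout` when monitor connections time out. We are not saying this tuning is required in every case, but for reference, ProxySQL variables are changed like this (the 2000 ms value is illustrative):

```sql
-- Raise the monitor connect timeout (value is in milliseconds).
UPDATE global_variables
SET variable_value = '2000'
WHERE variable_name = 'mysql-monitor_connect_timeout';

LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;
```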
ProxySQL began to detect failures in this replica and removed it from its list of online servers. As we expected, it then started to route SELECT commands to the write hostgroup.
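This behaviour can be observed directly in the ProxySQL admin interface: a server that fails health checks appears as SHUNNED in the runtime server table (the exact output depends on your topology):

```sql
-- Inspect the live server state; a failing replica shows status = 'SHUNNED'.
SELECT hostgroup_id, hostname, status
FROM runtime_mysql_servers;
```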
The chaos test showed that the configuration change we made in ProxySQL was quite effective: it prevents the system from going down when a database replica hangs. As long as the primary server remains active, end users would probably not even notice that something is wrong.
Read more about chaos engineering:
How chaos can make your systems more resilient