How Amazon RDS Aurora MySQL Cross-Region Replication Really Works Under the Hood

RDS Aurora MySQL provides a built-in feature for creating a cross-region read replica of a database. It is easily accessible from the RDS console.

This article describes, at a high level, how the replication actually happens.

First things first:

Amazon RDS Aurora MySQL cross-region replication uses native MySQL binlog replication.

How MySQL Binlog Replication Works

The MySQL replication feature allows one server (the master) to send all changes to another server (the slave), which applies those changes to stay up to date with the master. Replication works as follows:

  1. Whenever the master's database is modified, the change is written to a file called the binary log, or binlog. This is done by the client thread that executed the query that modified the database.
  2. The master has a thread, called the dump thread, that continuously reads the master's binlog and sends it to the slave.
  3. The slave has a thread, called the IO thread, that receives the binlog sent by the master's dump thread and writes it to a file: the relay log.
  4. The slave has another thread, called the SQL thread, that continuously reads the relay log and applies the changes to the slave server.
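The four steps can be modelled as a toy, single-process Python sketch. It rests on simplifying assumptions: the binlog is an in-memory list, each "thread" is an ordinary function call, and the names Master, Replica and dump_thread are hypothetical, not part of MySQL or RDS.

```python
from collections import deque


class Master:
    """Toy master: client threads append change events to the binlog."""

    def __init__(self):
        self.binlog = []  # the binary log, modelled as an ordered list of events

    def execute(self, key, value):
        # Step 1: the client thread that ran the query writes the change to the binlog.
        self.binlog.append((key, value))


class Replica:
    """Toy replica with a relay log and a key/value store standing in for tables."""

    def __init__(self):
        self.relay_log = deque()  # events received but not yet applied
        self.received = 0         # number of binlog events the IO thread has received
        self.data = {}            # the replica's data

    def io_thread_receive(self, events):
        # Step 3: the IO thread writes what the dump thread sent into the relay log.
        self.relay_log.extend(events)
        self.received += len(events)

    def sql_thread_apply(self):
        # Step 4: the SQL thread reads the relay log and applies the changes in order.
        while self.relay_log:
            key, value = self.relay_log.popleft()
            self.data[key] = value


def dump_thread(master, replica):
    # Step 2: the dump thread sends the binlog events the replica has not received yet.
    replica.io_thread_receive(master.binlog[replica.received:])


master, replica = Master(), Replica()
master.execute("order:1", "pending")
master.execute("order:1", "shipped")
dump_thread(master, replica)
replica.sql_thread_apply()
print(replica.data)  # {'order:1': 'shipped'}
```

Because the SQL thread applies events in binlog order, the replica ends with the master's final value even though two changes touched the same key.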

What should I set the binlog retention hours to?

The binlog retention hours setting controls how long the master keeps its binlog files, and therefore how much time the slave (read replica) has to process the changes made on the master. Setting it to 24 hours, for example, allows the read replica to fall up to 24 hours behind the master without losing any binlog it still needs. A sudden surge of changes during peak workload produces more binlog entries for the slave to process, so test your own workload to find a retention period that gives the replica enough time to catch up with the master instance. On Aurora MySQL the value is set with the mysql.rds_set_configuration stored procedure.
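One way to reason about the retention value is to estimate how far behind the replica falls during a surge and check that this stays within the retention window. The Python sketch below is a rough model under stated assumptions: binlog volume is bucketed per hour, the replica applies a constant, hypothetical number of MB per hour in write order, and the workload figures are made up for illustration. The function name worst_lag_hours is hypothetical.

```python
from collections import deque


def worst_lag_hours(write_mb_per_hour, apply_mb_per_hour):
    """Upper bound on the replica's lag, in hours, for an hourly binlog workload.

    write_mb_per_hour: MB of binlog the master writes in each hour (hypothetical workload).
    apply_mb_per_hour: MB of binlog the replica can apply per hour (assumed constant).
    """
    backlog = deque()  # [hour written, MB not yet applied], oldest first
    worst = 0
    for hour, written in enumerate(write_mb_per_hour):
        backlog.append([hour, written])
        capacity = apply_mb_per_hour
        # The replica applies the oldest binlog data first, up to its hourly capacity.
        while backlog and capacity > 0:
            chunk = backlog[0]
            applied = min(chunk[1], capacity)
            chunk[1] -= applied
            capacity -= applied
            if chunk[1] == 0:
                backlog.popleft()
        if backlog:
            # Age of the oldest unapplied binlog data at the end of this hour.
            worst = max(worst, hour + 1 - backlog[0][0])
    return worst


# Hypothetical day: 100 MB/h normally, with a 4-hour surge of 800 MB/h.
workload = [100] * 8 + [800] * 4 + [100] * 12
lag = worst_lag_hours(workload, apply_mb_per_hour=300)
retention_hours = 24
print(lag, lag <= retention_hours)  # 7 True
```

If the replica's apply rate is below the master's average write rate, the backlog never drains and no retention value is large enough; in that case the apply side, not the retention setting, is what needs attention.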

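Note: Binlog replication traditionally applies changes on the replica with a single SQL thread, whereas changes on the master are made by many client threads at once, so the replica can fall behind during write-heavy periods.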

You can also monitor the AuroraBinlogReplicaLag CloudWatch metric, or run SHOW SLAVE STATUS on the read replica and check the Seconds_Behind_Master column, to see how far the read replica is lagging.
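The Python sketch below shows one way to interpret a SHOW SLAVE STATUS row. It assumes the row has already been fetched as a dict of column name to value (for example through a dictionary-style cursor); the function name describe_replica_lag is hypothetical. On MySQL 8.0.22 and later the statement is also available as SHOW REPLICA STATUS, with renamed columns (Replica_IO_Running, Replica_SQL_Running, Seconds_Behind_Source), so the sketch accepts either spelling.

```python
def describe_replica_lag(status):
    """Summarise one SHOW SLAVE STATUS / SHOW REPLICA STATUS row given as a dict."""
    io_running = status.get("Slave_IO_Running", status.get("Replica_IO_Running"))
    sql_running = status.get("Slave_SQL_Running", status.get("Replica_SQL_Running"))
    # NULL (None) when the lag cannot be computed, e.g. while the IO thread is reconnecting.
    lag = status.get("Seconds_Behind_Master", status.get("Seconds_Behind_Source"))
    if io_running != "Yes" or sql_running != "Yes":
        return f"replication not running (IO thread: {io_running}, SQL thread: {sql_running})"
    if lag is None:
        return "lag unknown"
    return f"replica is {lag} s behind the master"


# Hypothetical row; a real row contains many more columns.
row = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes", "Seconds_Behind_Master": 42}
print(describe_replica_lag(row))  # replica is 42 s behind the master
```

Checking both thread states first matters because a stopped IO or SQL thread is a replication failure, not merely lag.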

What impact might setting the binlog retention hours to a very large value have on database performance?

Setting a larger value for binlog retention hours has no impact on database performance. Its only effect is to increase cluster storage, because binlog files are kept for longer. Monitor the storage, and if it grows too large, reduce the binlog retention hours accordingly. You can check the size of the binlog files by running SHOW BINARY LOGS.
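SHOW BINARY LOGS returns one row per binlog file, with the file name in the first column and its size in bytes (File_size) in the second; MySQL 8.0 adds an Encrypted column. The Python sketch below totals those sizes. The rows, file names and storage budget are hypothetical, and the rows are assumed to have been fetched already as tuples.

```python
def total_binlog_bytes(rows):
    """Sum the File_size column (second column) of SHOW BINARY LOGS result rows."""
    # Indexing by position works for (Log_name, File_size) and (Log_name, File_size, Encrypted).
    return sum(row[1] for row in rows)


# Hypothetical result rows; file names and sizes are made up for illustration.
rows = [
    ("mysql-bin-changelog.000101", 134217728),
    ("mysql-bin-changelog.000102", 134217728),
    ("mysql-bin-changelog.000103", 52428800),
]
total = total_binlog_bytes(rows)
print(f"{total / 2**30:.2f} GiB")  # 0.30 GiB

budget_gib = 50  # hypothetical limit above which you might lower the retention hours
print(total / 2**30 > budget_gib)  # False
```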

Would high retention hours cause large replication lag?

No. The retention setting has no bearing on replication lag. Replication starts as soon as changes are written to the binary log on the master: each new change is logged as an event in the binlog file and sent to the read replica by the dump thread. Binlogs are therefore not held back and shipped in batches; replication runs continuously. The retention period only determines how long binlog files stay on the master before they are purged, so increasing it does not slow replication.
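The separation between shipping and retention can be shown with a deliberately simple Python model. It assumes one binlog file per hour that the dump thread ships immediately, and a purge step that removes files older than the retention window; simulate is a hypothetical name. Changing the retention value changes how many files stay on the master, but not the lag.

```python
def simulate(retention_hours, hours=72):
    """Toy model: one binlog file per hour, streamed to the replica as soon as it is written.

    Returns (lag in hours of the newest received file, number of files kept on the master).
    """
    kept = []      # hour each binlog file still stored on the master was written
    received = []  # hour each file reached the replica
    for now in range(hours):
        kept.append(now)      # the master writes this hour's binlog file
        received.append(now)  # the dump thread ships it right away; retention plays no part
        # Files older than the retention window are purged from the master.
        kept = [t for t in kept if now - t < retention_hours]
    lag = (hours - 1) - received[-1]
    return lag, len(kept)


for retention in (24, 168):
    print(retention, simulate(retention))
# 24 (0, 24)
# 168 (0, 72)
```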

Is binlog copy from source to target incremental?

Yes. Changes are applied to the read replica continuously, as soon as they arrive from the master. However many binlogs have accumulated on the master, only the events the replica has not yet received are copied and applied: the replica keeps track of how far it has read, and the master sends binlog data from that point onward.
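The incremental behaviour can be sketched in Python by modelling a binlog position as a (file, offset) pair and sending only events after the replica's current position. In MySQL the replica reports the equivalent information in the Master_Log_File and Read_Master_Log_Pos columns of SHOW SLAVE STATUS (GTID-based replication tracks sets of transaction IDs instead). The BinlogPosition type, events_to_send function and sample events here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class BinlogPosition:
    """Hypothetical position type: which binlog file, and the byte offset within it."""
    file_index: int  # numeric suffix of the binlog file name
    offset: int      # byte offset of the event inside that file


def events_to_send(binlog, replica_position):
    """Return only the events written after the position the replica has already reached."""
    # order=True compares (file_index, offset) lexicographically, i.e. in write order.
    return [(pos, event) for pos, event in binlog if pos > replica_position]


binlog = [
    (BinlogPosition(1, 120), "INSERT order 1"),
    (BinlogPosition(1, 400), "UPDATE order 1"),
    (BinlogPosition(2, 120), "INSERT order 2"),
    (BinlogPosition(2, 400), "DELETE order 1"),
]
replica_at = BinlogPosition(1, 400)  # the replica has already received up to here
for pos, event in events_to_send(binlog, replica_at):
    print(pos.file_index, pos.offset, event)
# 2 120 INSERT order 2
# 2 400 DELETE order 1
```

However long the list of retained binlog events grows, only the tail after the replica's position is sent.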

Crash Recovery

Enabling binary logging on Aurora directly affects the recovery time after a crash, because it forces the DB instance to perform binary log recovery.

The type of binary logging used affects the size and efficiency of logging. For the same amount of database activity, some formats log more information than others in the binary logs.

The amount of binary log data affects recovery time. If there is more data logged in the binary logs, the DB instance must process more data during recovery, which increases recovery time.
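To make the format difference concrete, the Python sketch below estimates the binlog data produced by a single UPDATE under each binlog_format value (STATEMENT, ROW, MIXED). It is a rough model, not MySQL's actual encoding: event headers are ignored, and the row-image and statement sizes are hypothetical inputs. It reflects the general rule that STATEMENT logs the SQL text once, ROW logs an image per changed row, and MIXED logs the statement unless it is unsafe for statement-based logging.

```python
def estimated_binlog_bytes(binlog_format, rows_changed, row_image_bytes, statement_bytes,
                           unsafe_for_statement=False):
    """Rough size of the binlog data for one statement (event headers ignored).

    All byte sizes are hypothetical inputs, not measured values.
    """
    row_based = rows_changed * row_image_bytes  # one row image per changed row
    if binlog_format == "ROW":
        return row_based
    if binlog_format == "STATEMENT":
        return statement_bytes                  # the SQL text is logged once
    if binlog_format == "MIXED":
        # MIXED logs the statement unless it is unsafe for statement-based logging.
        return row_based if unsafe_for_statement else statement_bytes
    raise ValueError(f"unknown binlog format: {binlog_format}")


# Hypothetical UPDATE touching 10,000 rows, ~200 bytes of row image data per row
# (before and after images combined), with a 120-byte SQL statement.
for fmt in ("STATEMENT", "MIXED", "ROW"):
    print(fmt, estimated_binlog_bytes(fmt, 10_000, 200, 120))
# STATEMENT 120
# MIXED 120
# ROW 2000000
```

Since the DB instance has to process the logged data during binary log recovery, the ROW figure in this example is the one that would add most to recovery time.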

Aurora does not need the binary logs to replicate data within a DB cluster or to perform point-in-time restore (PITR).

Source: Crash Recovery, Amazon Aurora User Guide