It’s True! A Tale of Migrating to Amazon Aurora with no Interruption of Service

Photo via Visual hunt

We recently migrated the Nulab Account’s Amazon RDS database from MySQL to Amazon Aurora. I’d like to briefly introduce how we made the migration possible.

Reasons for the Migration

Our Nulab Account handles authentication for all our services: Backlog, Cacoo, and Typetalk. If authentication stops working, none of our services can operate. The Nulab Account must therefore always be available.

We were previously using RDS for MySQL but considered switching to Amazon Aurora from the moment it was released, especially since it offers compatibility with MySQL. Much has been said about Aurora’s merits, but it was its availability and scalability that appealed to us even more than its performance advantages.

Availability

A huge merit for us was Amazon Aurora’s high tolerance of disk failure and its high-level replication, described in the “High Availability and Replication” section of the FAQ.

Amazon’s Multi-AZ RDS for MySQL relies on DNS, so it can take several minutes to go from a failure to the completion of the failover. Amazon Aurora, even as a single instance, can recover from a failure within 15 minutes; with a multi-node cluster, failover is said to complete within one minute.

This leads to high availability.

Scalability

You can expand storage on RDS for MySQL, but doing so without disrupting a running system is difficult.

As stated in the “Hardware and Scaling” section of the FAQ, you can extend Amazon Aurora’s storage without stopping the database or affecting its performance.

This is a huge benefit.

Conclusion

Aurora does have some disadvantages compared to RDS for MySQL: there are fewer instance types to choose from, the price is slightly higher, and there are several other minor points. In the end, though, we did not find any real reason to choose MySQL over Aurora, and we soon decided to migrate our Nulab Account to Amazon Aurora.

The migration process

Architecture and Availability of Nulab Account

The Nulab Account is implemented as a Java servlet application running on EC2 instances. We had been using RDS for MySQL as the database.

As previously mentioned, we must offer authentication at all times. To achieve this, the system was designed so that if the database becomes read-only, authentication continues to work even though other functions are affected.

Considering migration steps

What we considered most important when migrating the database was keeping the authentication function running.

We first migrated our staging environment from MySQL to Aurora. Through trial and error there, we worked out a migration strategy and discovered there was no need to make significant changes to the application itself.

We migrated the data from MySQL to Aurora using the replication function, following the user’s guide “Migrating Data to an Amazon Aurora DB Cluster.” To do this, we created an Aurora instance from a snapshot of MySQL, then migrated the MySQL data to Aurora by setting up replication with MySQL as the master and Aurora as the slave.

After confirming that the replication worked correctly, we switched the database the applications connect to. One thing we had to keep in mind: if new data were written between the time we migrated the data and the time we switched databases, MySQL and Aurora could become inconsistent. We avoided this by temporarily limiting application use without stopping the authentication function: we announced a maintenance window, made the MySQL database read-only during that time (keeping authentication active), and completed the switch within that window.

Step 1: Make a read replica of MySQL and stop its replication to prevent the binary log from being deleted.

Step 2: Start Aurora.

Step 3: Construct replication with MySQL as master and Aurora as slave.

Step 4: Switch MySQL to read-only, temporarily disabling some functions.

Step 5: Switch the access point to Aurora, application by application. Once the switch is made, all functions become active again.

Step 6: Stop MySQL.

We decided to migrate our database without stopping our applications, using the above steps.

For steps 4 and 5, we set a one-hour maintenance window, during which some functions would be unavailable, and planned to carry out the migration within that hour. In case of problems, we took a snapshot right after switching to read-only in step 4. From step 5, writes to Aurora began; if a major problem had forced us to fall back to MySQL, we would have had to abandon any data written to Aurora after step 5 started. We therefore decided not to roll back after step 5, even if a problem occurred.

We notified our users via our website.

Here are more details of each step. The diagram below shows the state before the migration.

Step 1: Make a read replica of MySQL and stop its replication to prevent the binary log from being deleted

For the replication that migrates data from MySQL to Aurora, we used the mysql.rds_set_external_master command to set up replication manually, outside RDS’s managed replication.

RDS for MySQL normally backs up binary logs to S3 and then deletes them, roughly every five minutes. When replication is managed by RDS, it checks that the replica is up to date before deleting a log; when replication is set up outside RDS management, binary logs are deleted without checking whether they have been replicated.

If a needed log is deleted before the slave has read it, replication breaks.

To solve this problem, as stated in “Replication Between Aurora and MySQL or Between Aurora and Another Aurora DB Cluster” in “Replication with Amazon Aurora,” if you make another read replica under RDS management and then stop that replication, the master keeps the binary log instead of deleting it. Following that procedure, we made a MySQL read replica and temporarily stopped its replication by running the mysql.rds_stop_replication command.

Because the binary logs would no longer be deleted, we had to watch the remaining storage space. Since this database does not hold a large amount of data, the growth was small.

We then needed to note the binary log position of the read replica, since this information is required to set up the replication on Aurora.

The values to record are Master_Log_File and Read_Master_Log_Pos, as shown by SHOW SLAVE STATUS on the replica.
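Putting step 1 together, here is a minimal sketch, over JDBC, of stopping replication on the read replica and then recording its binary log coordinates (the class name is our own, and the connection passed in is assumed to point at the read replica):

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class BinlogPosition {
    final String file;   // Master_Log_File
    final long position; // Read_Master_Log_Pos

    BinlogPosition(String file, long position) {
        this.file = file;
        this.position = position;
    }

    // Stops replication on the read replica (so the master keeps its binary
    // logs) and reads the coordinates we need for the Aurora side.
    static BinlogPosition stopAndRecord(Connection replica) throws SQLException {
        try (Statement st = replica.createStatement()) {
            st.execute("CALL mysql.rds_stop_replication");
            try (ResultSet rs = st.executeQuery("SHOW SLAVE STATUS")) {
                if (!rs.next()) {
                    throw new IllegalStateException("replication is not configured");
                }
                return new BinlogPosition(
                        rs.getString("Master_Log_File"),
                        rs.getLong("Read_Master_Log_Pos"));
            }
        }
    }
}
```

The recorded file name and position are the values passed to mysql.rds_set_external_master in step 3.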

Step 2: Start Aurora

Using “Migrate Latest Snapshot” from the “Instance Actions” menu of the management console, we created an Aurora instance from the MySQL read replica. So that the instance would be based on the latest data, we took a snapshot of the read replica immediately beforehand. We also set the security group and parameter group. This Aurora instance becomes the writer node. Finally, we started a reader node from the management console to form the cluster.

Step 3: Construct replication with MySQL as master and Aurora as slave

On Aurora, set MySQL as the master with the mysql.rds_set_external_master command, specifying as parameters the binary log position you recorded earlier.

Next, start the replication with the mysql.rds_start_replication command.

After a while, MySQL and Aurora will have the same data.
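The two commands in this step can be sketched as follows over JDBC (the host name, credentials, and binlog coordinates below are placeholders; the last argument of mysql.rds_set_external_master toggles SSL):

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class AuroraReplicationSetup {
    // Builds the CALL statement that points Aurora at the external MySQL
    // master, using the binary log coordinates recorded in step 1.
    static String setExternalMasterSql(String host, int port, String user,
                                       String password, String logFile, long logPos) {
        return String.format(
                "CALL mysql.rds_set_external_master('%s', %d, '%s', '%s', '%s', %d, 0)",
                host, port, user, password, logFile, logPos);
    }

    // Runs both commands on the Aurora writer: point at the master, then
    // start replicating from the recorded position.
    static void configure(Connection aurora, String setMasterSql) throws SQLException {
        try (Statement st = aurora.createStatement()) {
            st.execute(setMasterSql);
            st.execute("CALL mysql.rds_start_replication");
        }
    }
}
```

Replication progress can then be checked with SHOW SLAVE STATUS on the Aurora side.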

Step 4: Switch MySQL to read-only

When the maintenance window started, the first thing we did was change the read_only variable in the parameter group from the management console, switching MySQL into read-only mode. This stopped all writes to MySQL (and therefore to Aurora), and some application functions became unavailable.

Step 5: Switch the application access points to Aurora one by one

Application by application, we switched the database connection from MySQL to Aurora. As each application switched to the active server, all its functions became active again.

The maintenance time could now come to an end.

All the applications were now connected to Aurora.

We then canceled the replication from MySQL to Aurora, using the mysql.rds_reset_external_master command.

 

Step 6: Stop MySQL

Finally, we stopped the MySQL instance. We waited for a few days before carrying out this action.

Now the database migration was complete.

Overall, there were no significant issues and we felt the migration process went smoothly.

 

Changing the Applications

Changing behavior when the database is read-only

Previously, the application showed an error page if it attempted to write to the database while the database was read-only. To handle the planned read-only period cleanly, we changed the applications before the migration so that they display an “under maintenance” page instead of an error page.
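One way to detect this situation is to inspect the vendor error code on the SQLException: MySQL reports error 1290 (ER_OPTION_PREVENTS_STATEMENT) when a write is attempted against a read-only server. A minimal sketch, with a class and method name of our own (not from the original code):

```java
import java.sql.SQLException;

public class MaintenanceMode {
    // MySQL raises error 1290 (ER_OPTION_PREVENTS_STATEMENT) when a write
    // is attempted while the server is running with --read-only.
    static final int ER_OPTION_PREVENTS_STATEMENT = 1290;

    // Classifies an exception from a failed write so the application can
    // show a maintenance page instead of a generic error page.
    static boolean isReadOnlyError(SQLException e) {
        return e.getErrorCode() == ER_OPTION_PREVENTS_STATEMENT;
    }
}
```

When a write fails with this error code, the application can render the maintenance page rather than the error page.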

Changing JDBC driver

When connecting to Aurora from a Java application, you can achieve fast failover by using MariaDB Connector/J as the JDBC driver. We switched JDBC drivers after the database migration was complete.

MariaDB Connector/J

Conventional RDS Multi-AZ uses a hot-standby configuration. When a failover occurs, the endpoint’s DNS record is switched from the failed master to the standby. Because this mechanism is DNS-based, it is affected by TTL: as long as the application refers only to the cluster endpoint, some period during which the application cannot access the database is unavoidable during failover.

MariaDB Connector/J is a JDBC driver for MariaDB and MySQL. It has a built-in failover mechanism that works with Aurora clusters. The driver achieves fast failover by having the individual node endpoints listed in the JDBC URL, as below, rather than the cluster endpoint.
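A sketch of such a URL, using placeholder hostnames and database name (the jdbc:mariadb:aurora: scheme is the failover mode described in the Connector/J documentation):

```
jdbc:mariadb:aurora://writer-node.example.rds.amazonaws.com:3306,reader-node.example.rds.amazonaws.com:3306/nulab_account
```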

Aurora builds a cluster of one writer and several readers. The writer and readers are distinguished by the global variable innodb_read_only. When the cluster detects a failure of the writer node, it immediately changes the innodb_read_only value on one healthy reader, making that node the new writer.

If the driver detects a failure while issuing a query, it finds the current writer node by checking innodb_read_only, switches to that node, and re-issues the query there.

Even if a failover occurs, the driver switches the access point immediately, so the application continues processing without interruption.

For details, please check the documentation on mariadb.com.

Some points to consider

We used MariaDB Connector/J version 1.2.3 and found some points that need attention.

The following statement is on MariaDB website:

It’s originally based on the Drizzle JDBC code, and with a lot of additions and bug fixes.

This driver is originally based on the Drizzle ( http://www.drizzle.org/ ) JDBC driver. Its code base is different from MySQL Connector/J, so it has a different parameter system and some behavioral differences in the details.

At present, &lt;T&gt; T getObject(int i, Class&lt;T&gt; type), a ResultSet method added in JDBC 4.1, always returns null. The Nulab Account maps some Java enums to MySQL VARCHAR columns and was using this method to read those values from the ResultSet, so we had to amend that code.
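A sketch of the kind of workaround involved (the enum and helper names are ours, not from the original code): instead of relying on getObject(i, SomeEnum.class), read the column as a string and convert it explicitly.

```java
import java.sql.ResultSet;
import java.sql.SQLException;

public class EnumMapping {
    enum Status { ACTIVE, SUSPENDED }

    // getObject(int, Class<T>) returns null under Connector/J 1.2.3,
    // so convert the VARCHAR value to the enum constant ourselves.
    static <T extends Enum<T>> T toEnum(Class<T> type, String dbValue) {
        return dbValue == null ? null : Enum.valueOf(type, dbValue);
    }

    // Example: reading a Status column from a result set.
    static Status readStatus(ResultSet rs, int column) throws SQLException {
        return toEnum(Status.class, rs.getString(column));
    }
}
```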

There also seemed to be an issue with a few leftover threads when closing the application:

https://mariadb.atlassian.net/browse/CONJ-61

As a solution, the org.mariadb.jdbc.Driver class provides a method called unloadDriver. You can close the application correctly by calling this method when the application shuts down. In the Nulab Account, this is done from the contextDestroyed method of a ServletContextListener.
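The original code is not shown in the post; here is a minimal sketch of such a listener (the class name is hypothetical, and compiling it requires the servlet API and the Connector/J jar on the classpath):

```java
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

public class MariaDbDriverCleanupListener implements ServletContextListener {

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        // Nothing to do at startup.
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        // Stops the driver's background threads so no threads are left
        // over when the web application is unloaded.
        org.mariadb.jdbc.Driver.unloadDriver();
    }
}
```

The listener can be registered via a listener entry in web.xml (or the @WebListener annotation on Servlet 3.0+).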

Ensuring quality with unit tests

Even though the production environment switched to Aurora, developers continued to use MySQL, and our Jenkins CI environment still uses MySQL as well. Although Aurora features MySQL compatibility, we did not know how complete that compatibility is.

We decided to ensure that our SQL produced no errors by running the unit tests against the staging environment that had been migrated to Aurora. On this server, the application works the same way as in the production environment; in addition, we made it a Jenkins slave and scheduled a build once a day.

If there was any incompatibility between MySQL and Aurora, we could detect it with this test.

So far, the test has never failed. We feel that Aurora’s quality is very high in terms of MySQL compatibility.

Summary

You can see Amazon’s excellence in many aspects of Aurora.

We have not had a database failure or failover since the migration, so we have not yet experienced the merits of Aurora’s failover in the production environment. In the staging environment, however, we have confirmed that the application is unaffected when a failover occurs.

The high scalability and availability of Aurora translate into higher scalability and availability for the system as a whole.

We believe Aurora’s compatibility with MySQL is truly remarkable. Even when infrastructure changes, we can continue with our operation and development without being too concerned.

I hope this article helps those who are considering migrating to Amazon Aurora!


 

Nulab is hiring engineers who love databases.
