Picture this: You’re on a business trip and need to extend your stay for a client meeting. With a few taps, you adjust your travel plans seamlessly in the Navan app. Success hinges on the app’s ability to process your request in real time and without error. Behind this seamless experience, the reliability and scalability of the software’s database infrastructure are essential.
At Navan, our mission is to keep travel and expense management running like clockwork for users worldwide. Sometimes that means making changes. Our decision to proactively move our MySQL database to Amazon Aurora was a strategic one, made to meet speed and reliability demands head-on.
Here’s the story of how we seamlessly migrated our production fleet.
Navan is committed to making travel and expense easy for everyone, everywhere. Choosing the right database platform to make that happen is like selecting the right infrastructure for a bridge — it needs to withstand heavy traffic and must be built to last.
To become the go-to travel solution worldwide, the Navan site reliability engineering (SRE) team designs and builds cloud infrastructure for scalability, resilience, and flexibility. Databases, a critical component of that infrastructure, are no exception.
Amazon Aurora is a database service known for its resilience and flexible scaling mechanisms, making it a good choice for Navan’s demanding workloads.
What is Amazon Aurora?
Amazon Aurora is a global-scale relational database management system (RDBMS) built for the cloud with full MySQL and PostgreSQL compatibility. Aurora’s MySQL-compatible engine runs on a high-performance, distributed storage subsystem. The underlying storage grows automatically and can expand to a maximum of 128 tebibytes (TiB). Aurora also automates database clustering and replication, which are typically among the most challenging aspects of database administration.
Here are the reasons why Navan chose Amazon Aurora:
Before we dive into how we migrated to Aurora, let’s take a quick look at Navan’s MySQL infrastructure:
Amazon Web Services offers two deployment options for Aurora: serverless and provisioned.
Serverless initially seemed appealing: it offers auto-scaling capacity, cost efficiency for bursty workloads, and no manual capacity management. However, the results in our staging environment were less than ideal:
Ultimately, we chose the provisioned deployment, which better aligned with Navan’s consistent traffic patterns.
Because the database infrastructure is so critical, Navan’s SRE team set a few requirements for the migration process so that our service level objectives (SLOs) and service level agreements (SLAs) would be maintained.
Preparation started in late 2023, and we had several milestones to reach before moving to production:
To meet the availability standards, we implemented a strategy that allowed for checkpoints and rollbacks throughout the process.
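To illustrate the idea of checkpoints and rollbacks, here is a minimal Python sketch. It is not Navan's actual tooling: the `Step` class, the `run_with_rollback` function, and the step names are hypothetical, and a real migration would replace the callables with operations against the databases.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One migration step with an action and a way to undo it (hypothetical)."""
    name: str
    apply: Callable[[], None]
    rollback: Callable[[], None]


def run_with_rollback(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, undo completed steps in reverse order.

    Returns a log of events so the outcome can be inspected afterwards.
    """
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        try:
            step.apply()
        except Exception as exc:
            log.append(f"failed: {step.name} ({exc})")
            # Undo the most recent checkpoint first.
            for done in reversed(completed):
                done.rollback()
                log.append(f"rolled back: {done.name}")
            return log
        completed.append(step)
        log.append(f"checkpoint: {step.name}")
    return log
```

Each completed step acts as a checkpoint, so a failure at any point unwinds only the work that was actually done.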
First, we created a special replica to copy a snapshot of the production database into the Aurora cluster. Once the snapshot was transferred, we set up live replication between the source and the Aurora cluster.
During the cutover maintenance window, we used a brief moment when there were no writes to the database to record the transaction log position. We then used this log position to set up replication from Aurora back to the old RDS database, which allowed us to roll back if needed.
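As a rough sketch of what setting up this reverse replication could look like, the Python below turns a recorded log position into the statements that would be run on the old RDS instance. It makes several assumptions not stated above: that the position comes from `SHOW MASTER STATUS` on Aurora (named `SHOW BINARY LOG STATUS` in newer MySQL versions), that binary logging is enabled on the Aurora cluster, and that the RDS stored procedure `mysql.rds_set_external_master` is used (newer engine versions name it `mysql.rds_set_external_source`). The host, user, and helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BinlogPosition:
    """Binary log coordinates recorded while writes are paused."""
    file: str
    position: int


def parse_master_status(row: dict) -> BinlogPosition:
    """Extract binlog coordinates from one row of SHOW MASTER STATUS."""
    return BinlogPosition(file=row["File"], position=int(row["Position"]))


def _quote(value: str) -> str:
    """Quote a MySQL string literal (escapes backslashes and single quotes)."""
    return "'" + value.replace("\\", "\\\\").replace("'", "''") + "'"


def reverse_replication_statements(
    aurora_host: str,
    port: int,
    user: str,
    password: str,
    pos: BinlogPosition,
    use_ssl: bool = True,
) -> List[str]:
    """Build the statements to run on the old RDS instance (assumed approach)."""
    set_source = (
        "CALL mysql.rds_set_external_master("
        f"{_quote(aurora_host)}, {port}, {_quote(user)}, {_quote(password)}, "
        f"{_quote(pos.file)}, {pos.position}, {1 if use_ssl else 0});"
    )
    return [set_source, "CALL mysql.rds_start_replication;"]
```

The function only builds SQL text. Running it against a live instance, and handling the replication user's credentials safely, is left to whatever database client and secret store are in use.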
Here are the steps we took, as diagrammed above:
Our plan worked flawlessly, and Navan is now set to scale its database operations as the company grows. The lessons from this four-month project will pave the way for data infrastructure growth and future upgrades.