Why Navan Moved to Amazon Aurora for Scalability

Discover why Navan chose Aurora to tackle its scalability and performance needs — and pulled off the switch without any disruptions.

Picture this: You’re on a business trip and need to extend your stay for a client meeting. With a few taps, you adjust your travel plans seamlessly in the Navan app. Success hinges on the app’s ability to process your request in real time and without error. Behind this seamless experience, the reliability and scalability of the software’s database infrastructure are essential.

At Navan, it’s our mission to help ensure that travel and expense management runs like clockwork for users worldwide. Sometimes that means making changes. The decision to proactively transition our MySQL databases to Amazon Aurora was a strategic move to meet speed and reliability demands head-on.

Here’s the story of how we seamlessly migrated our production fleet.

Why Navan Chose Amazon Aurora

Navan is committed to making travel and expense easy for everyone, everywhere. Choosing the right database platform to make that happen is like selecting the right infrastructure for a bridge — it needs to withstand heavy traffic and must be built to last.

To achieve the goal of being the go-to travel solution worldwide, the Navan site reliability engineering (SRE) team is driven to design and build a cloud infrastructure that is focused on scalability, resilience, and flexibility. Databases, a critical component of the infrastructure, are no exception.

Amazon Aurora is a database service known for its resilience and flexible scaling mechanisms, making it a good choice for Navan’s demanding workloads.

What is Amazon Aurora?

Amazon Aurora is a global-scale relational database management system (RDBMS) built for the cloud with full MySQL and PostgreSQL compatibility. Aurora’s MySQL database engine takes advantage of its high-performance and distributed storage subsystem. The underlying storage grows automatically and can expand to a maximum of 128 tebibytes (TiB). Aurora also automates database clustering and replication, which are typically among the most challenging aspects of database administration.

Here are the reasons why Navan chose Amazon Aurora:

Seamless Scalability: Aurora’s architecture separates compute from storage, with a distributed storage layer that maintains six copies of the data across three availability zones (AZs). Because storage grows automatically and independently of compute, capacity scales with demand.
Enhanced Availability: Aurora offers high availability (HA), disaster recovery (DR), and automatic failover, features that reduce traditional database management overhead. The recovery time objective (RTO), or maximum time it takes to restore normal operations, is typically measured in minutes; with Aurora, the data loss tolerance, or recovery point objective (RPO), is measured in seconds.
Global Reach: Aurora Global Database enables fast local reads and replicates data across regions with minimal impact on the primary cluster’s performance. It can also forward SQL writes from secondary regions to the primary region, eliminating the need for custom read/write routing mechanisms.
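
To make the "six copies across three AZs" point concrete, here is a minimal sketch of the quorum arithmetic. The quorum sizes (4 of 6 for writes, 3 of 6 for reads) are not from this article; they are an assumption taken from AWS's published descriptions of Aurora's storage design. The class and method names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageLayout:
    """Illustrative model of replica placement, not an AWS API."""

    zones: int = 3            # availability zones (from the article)
    copies_per_zone: int = 2  # 3 zones x 2 copies = 6 copies (from the article)
    write_quorum: int = 4     # assumption: 4-of-6 write quorum per AWS material
    read_quorum: int = 3      # assumption: 3-of-6 read quorum per AWS material

    @property
    def total_copies(self) -> int:
        return self.zones * self.copies_per_zone

    def surviving(self, failed_zones: int = 0, extra_failed_copies: int = 0) -> int:
        """Copies still reachable after losing whole zones plus individual copies."""
        lost = failed_zones * self.copies_per_zone + extra_failed_copies
        return max(self.total_copies - lost, 0)

    def can_write(self, failed_zones: int = 0, extra_failed_copies: int = 0) -> bool:
        return self.surviving(failed_zones, extra_failed_copies) >= self.write_quorum

    def can_read(self, failed_zones: int = 0, extra_failed_copies: int = 0) -> bool:
        return self.surviving(failed_zones, extra_failed_copies) >= self.read_quorum


if __name__ == "__main__":
    layout = StorageLayout()
    # Losing one full AZ leaves 4 copies: writes still reach quorum.
    print("write after 1 AZ lost:", layout.can_write(failed_zones=1))
    # Losing one AZ plus one more copy leaves 3: reads survive, writes do not.
    print("write after AZ+1 lost:", layout.can_write(failed_zones=1, extra_failed_copies=1))
    print("read after AZ+1 lost:", layout.can_read(failed_zones=1, extra_failed_copies=1))
```

Under these assumed quorum sizes, the layout keeps accepting writes through the loss of an entire AZ, which is the property that mattered most to us when weighing availability.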

Understanding Navan’s MySQL Infrastructure

Before we dive into how we migrated to Aurora, let’s take a quick look at Navan’s MySQL infrastructure:

Production Fleet: 50+ Amazon Web Services (AWS) Relational Database Service (RDS) instances, spanning development, staging, prime, and production environments across the AWS us-west-2 and eu-central-1 regions.
Data and Queries: 5+ TB of data stored, handling 20k+ queries per second across 50+ database clusters.
High Availability: Production clusters were configured for high availability with a primary plus replicas.

Serverless vs. Provisioned

Amazon Web Services offers two deployment options for Aurora: serverless and provisioned.

Serverless automatically scales based on demand and handles variable workloads without manual intervention, making it cost-efficient for unpredictable usage.
Provisioned, on the other hand, requires manual configuration of capacity and is suited for stable, predictable workloads.

While serverless initially seemed appealing for its auto-scaling capacity, cost efficiency with bursty workloads, and freedom from manual capacity management, the results in our staging environment were less than ideal:

Higher Baseline Cost: Approximately twice as much as provisioned
Delayed Auto-Scaling: Slow response to unpredictable workloads
Unexplained Failovers: Occurred during auto-scaling, leading to application errors
Limited Tuning Control: Minimal ability for manual adjustments
Suboptimal Support: Less effective support from AWS for serverless

Ultimately, we chose the provisioned deployment, which better aligned with Navan’s consistent traffic patterns.

Preparing for the Journey

Due to the criticality of the database infrastructure, Navan’s SRE team set up a few requirements for the upgrade process so that the service level objectives (SLOs) and service level agreements (SLAs) were maintained.

Rollback Capability: We needed to be able to roll back without disrupting service, because no matter how comprehensive testing and validation are, unexpected things always happen.
Global Customer Consideration: Our databases serve customers worldwide. To minimize potential impact, upgrades needed to occur when users were less likely to use Navan services.
Transparency in Upgrades: We wanted the upgrade process to be transparent to our applications, meaning no code changes were required so that developers could focus on their deliverables.

Preparation started in late 2023, and we had several milestones to reach before moving to production:

Infrastructure as Code (IaC): Built IaC resources for Aurora to implement best practices for version compatibility and system configurations across the board.
Development and Staging Clusters: Upgraded development and staging clusters to Aurora and ensured our applications and continuous integration and continuous delivery/deployment (CI/CD) pipelines were compatible with the new infrastructure.
Communication with Teams: Informed the application teams of our migration plans and timelines.

Let the Migration Begin

To meet the availability standards, we implemented a strategy that allowed for checkpoints and rollbacks throughout the process.

First, we created a special replica to copy a snapshot of the production database into the Aurora cluster. Once the snapshot was transferred, we set up live replication between the source and the Aurora cluster.

During the cutover maintenance window, we used a brief moment when there were no writes to the database to record the transaction log (binary log) position. This log position was used to set up replication from Aurora back to the old RDS database, allowing us to roll back if needed.
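
The position-recording step can be sketched roughly as follows. This is an illustration under stated assumptions, not our production tooling: it assumes binary-log-based replication, that the position is sampled on the new Aurora writer (for example from `SHOW MASTER STATUS`), and that reverse replication is configured on the old RDS instance with the RDS stored procedure `mysql.rds_set_external_master`. The helper names, host, user, and file name below are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class BinlogPosition(NamedTuple):
    file: str      # "File" column of SHOW MASTER STATUS
    position: int  # "Position" column of SHOW MASTER STATUS


def quiescent_position(
    samples: Sequence[BinlogPosition], required_stable: int = 3
) -> Optional[BinlogPosition]:
    """Return the position if the last `required_stable` samples are identical.

    An unchanged position across consecutive samples suggests no writes are
    landing, so it is safe to record. Returns None otherwise.
    """
    if required_stable < 2 or len(samples) < required_stable:
        return None
    tail = samples[-required_stable:]
    if all(sample == tail[0] for sample in tail):
        return tail[0]
    return None


def reverse_replication_sql(
    aurora_host: str, position: BinlogPosition, user: str, port: int = 3306
) -> str:
    """Build the call to run on the OLD RDS primary so it follows Aurora.

    Illustration only: the password is a placeholder, and plain string
    formatting like this is not safe for untrusted input.
    """
    return (
        "CALL mysql.rds_set_external_master("
        f"'{aurora_host}', {port}, '{user}', '<password>', "
        f"'{position.file}', {position.position}, 0);"
    )


if __name__ == "__main__":
    # Hypothetical samples taken a second apart during the cutover window.
    samples = [
        BinlogPosition("mysql-bin-changelog.000042", 120),
        BinlogPosition("mysql-bin-changelog.000042", 154),
        BinlogPosition("mysql-bin-changelog.000042", 154),
        BinlogPosition("mysql-bin-changelog.000042", 154),
    ]
    recorded = quiescent_position(samples)
    if recorded is not None:
        print(reverse_replication_sql("aurora.example.internal", recorded, "repl_user"))
```

After the call is issued on the old primary, replication would be started with `CALL mysql.rds_start_replication;`. Newer engine versions expose an equivalent procedure under a different name, so check the documentation for the version in use.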

Here are the steps we took, as diagrammed above:

Launch a migrate replica on RDS and a new Aurora cluster
Stop the replication from the old RDS primary to the migrate replica
Copy the data from the migrate replica to the new Aurora cluster using MySQL
Establish live replication from the old RDS primary to the new Aurora cluster
During the cutover, switch the old RDS primary to replicate from the new Aurora cluster so that a rollback chain is in place as a backup
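
The checkpoint-and-rollback idea behind these steps can be sketched as a small runner that applies steps in order and, if one fails, undoes the completed steps in reverse. This is a simplified model rather than the tooling we actually ran: the `Step` type, `run_with_rollback`, and `build_plan` are hypothetical names, and the step actions are stubs that stand in for real AWS and MySQL operations.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    apply: Callable[[], None]  # performs the step
    undo: Callable[[], None]   # reverts the step if a later one fails


def run_with_rollback(steps: List[Step]) -> List[str]:
    """Apply steps in order; on failure, undo completed steps in reverse."""
    log: List[str] = []
    done: List[Step] = []
    for step in steps:
        try:
            step.apply()
        except Exception as exc:
            log.append(f"failed: {step.name} ({exc})")
            for finished in reversed(done):
                finished.undo()
                log.append(f"undone: {finished.name}")
            return log
        done.append(step)
        log.append(f"done: {step.name}")
    return log


def build_plan(fail_at: Optional[str] = None) -> List[Step]:
    """Stub plan mirroring the five steps; `fail_at` simulates a failure."""
    names = [
        "launch migrate replica and Aurora cluster",
        "stop replication to migrate replica",
        "copy data from migrate replica to Aurora",
        "start live replication from old primary to Aurora",
        "cutover: old primary replicates from Aurora",
    ]

    def make_apply(name: str) -> Callable[[], None]:
        def apply() -> None:
            if name == fail_at:
                raise RuntimeError("simulated failure")
        return apply

    # Undo actions are no-ops here; real ones would tear down or re-point replication.
    return [Step(name, make_apply(name), lambda: None) for name in names]


if __name__ == "__main__":
    for line in run_with_rollback(build_plan(fail_at="copy data from migrate replica to Aurora")):
        print(line)
```

Keeping an explicit undo action next to each step makes it clear, before the maintenance window starts, what a rollback from any checkpoint would involve.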

Our plan worked flawlessly, and Navan is now set to scale its database operations as the company expands. Everything we learned from this four-month project will pave the way for data infrastructure growth and future upgrades.

Michelle Chen