This blog post is about how ChartMogul retired its last pieces of infrastructure on DigitalOcean, marking its migration to AWS as complete.

The journey was not your regular AWS migration as it involved moving our infrastructure from classic VMs to containers orchestrated by Kubernetes.

In a series of articles, we will share our experiences about:

Life With DigitalOcean

From our inception in 2014 until mid-2021, our entire infrastructure ran on DigitalOcean droplets (self-managed cloud virtual machines). We needed a cloud provider to get us off the ground quickly, reliably, and cost-effectively.

DigitalOcean made a lot of sense and was a great choice. We are where we are because of them. That choice gave us the freedom to focus on product building without worrying about scalability and infrastructure complexity – aspects that typically kick in at a later stage.

Every aspect of our infrastructure was provisioned, configured, and managed in-house, using configuration management and Infrastructure as Code tools (SaltStack and Terraform).

We kept growing over the years, and by 2019 we found ourselves looking at a fleet of somewhere around 50 machines in constant need of management, software updates, security patches, and so on. And with new projects in our pipeline, we expected our compute power needs to double by the end of 2020.

Why Move and Why Now?

As great a choice as DigitalOcean was, our organic growth had been pushing the boundaries of our setup over the years. We faced challenges in multiple areas, some fixable and preventable, some not.

Various Failures

AMS2 Region Deprecation

Our DigitalOcean region (AMS2) was announced as “soon to be retired”, meaning limited support. We could not secure additional resources on demand, and executing simple tasks usually meant long planning and wasted resources.

Simple things such as upgrading a Postgres version or provisioning a new machine to perform a task were becoming impossible.

Limited Hardware Choices

Being in the subscription analytics space means data-intensive operations, large data volumes, and a frequent need to scale accordingly.

Modern machines with more extensive hardware resources were only available in other regions. Network performance degradation was a frequent occurrence, and we soon realized that migrating to a different region was our best bet.

Lack of Modern Cloud Features and Managed Services

The volume of operational work needed to maintain our infrastructure, keep up with our growth rate, and deal with tech debt at the same time kept increasing.

We had to take a hard look at our setup and decide whether moving to a different DigitalOcean region or to a new cloud provider was the better choice.

Should We Stay or Should We Go?

We started by looking into the benefits of staying with DigitalOcean and simply moving to a new region – an easier, quicker, cheaper, less painful option.

But at the same time, we treated this move as an opportunity to modernize parts of our stack in service of expected user growth and an increased rate of progress.

By the end of our assessment, we realized that specific must-have requirements would be hard to achieve by staying and simply switching regions. The most important ones were:

This list of requirements along with the challenges listed in the previous section tipped the scale in favor of switching providers.

Why AWS?

Choosing a new cloud provider to power ChartMogul infrastructure was a long journey. We researched the market and discovered many tradeoffs and advantages a new provider could bring to the table.

Our options were Amazon Web Services (AWS), Google Cloud (GCP), and Azure. Ultimately, we decided to go with AWS. We list some of the main reasons below.

Team Expertise

We were already using some AWS services in production (e.g., S3 for storing incremental Postgres backups). More importantly, a few of our engineers had prior professional experience using various AWS services extensively in production systems.

Scalability

Data Security and Compliance

Data security has always been top of mind. Over the years, AWS security capabilities have grown substantially.

The new services AWS has developed around data security cover most of our needs in the container/Kubernetes space.

They play nicely with well-established features such as VPC isolation, fine-grained control of policies, and IAM roles.

Compliance-wise, we plan to become SOC 2 certified as soon as possible, and we found AWS compliance programs to be an advantage that can help fast-track that journey.

Managed Services

Postgres is at the heart of what we do at ChartMogul, and we’ve typically spent a lot of time actively managing our database fleet of machines to support our growth.

High availability and reliability of databases were becoming growing concerns, so we decided to evaluate multiple offers from major cloud providers with managed PostgreSQL. AWS RDS was the clear winner.

Managed Kubernetes was another major factor to consider, and here AWS went head to head with Google Cloud (GCP). Google’s managed Kubernetes (GKE) felt better than what AWS had at the time, but the comparison between RDS and Cloud SQL wasn’t close feature-wise.

Nowadays, however, it seems that AWS is catching up with EKS. We benefit from great RDS features such as flexible snapshots, backup durability (with SLA), read replicas for Postgres, painless upgrades, dedicated IOPS, CloudWatch metrics, Performance Insights, and the list goes on.

The Insane Number of AWS Services

At the time of writing, AWS offers over 200 services. Most of them give you instant access to managed offerings in areas such as compute, databases, data analytics, data warehousing, serverless, and storage.

Our engineering teams can now leverage top-notch integrations to solve core problems quickly and prioritize buy vs. build where it makes sense.

Disaster Recovery

The AWS cloud is an essential part of our Disaster Recovery plan. That’s because instances are easy to spin up, we can promote RDS read replicas to primary at the click of a button, snapshots are a breeze, we can host in multiple regions, and we have a top-notch integration with our IaC tool of choice.
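For readers who prefer the API over the console button, here is a minimal sketch of a replica promotion. It is an illustration, not our actual tooling: the helper name `promote_replica`, the instance identifier, and the 7-day backup retention are assumptions made for the example. The only AWS call used is `promote_read_replica` on the boto3 RDS client, which is passed in so the function can be exercised without AWS credentials.

```python
from typing import Any, Protocol, Tuple


class RdsClient(Protocol):
    """The one method of the boto3 RDS client that this sketch relies on."""

    def promote_read_replica(self, **kwargs: Any) -> dict: ...


def promote_replica(
    rds: RdsClient,
    replica_id: str,
    backup_retention_days: int = 7,  # hypothetical default, not a recommendation
) -> Tuple[str, str]:
    """Ask RDS to promote a read replica to a standalone instance.

    Returns the instance identifier and the status RDS reports right after
    the request. The promotion itself completes asynchronously.
    """
    response = rds.promote_read_replica(
        DBInstanceIdentifier=replica_id,
        # A positive retention period enables automated backups
        # on the promoted instance.
        BackupRetentionPeriod=backup_retention_days,
    )
    instance = response["DBInstance"]
    return instance["DBInstanceIdentifier"], instance["DBInstanceStatus"]
```

With boto3 installed and credentials configured, this would be called as `promote_replica(boto3.client("rds"), "my-replica")`, where `my-replica` is a placeholder identifier. Repointing applications at the promoted instance is a separate step that the sketch does not cover.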

AWS Credits

We secured $100k worth of credits through the AWS Startup program. We were able to plan, test, and complete our migration without considerable expenses.

Migration to AWS

Our migration from DigitalOcean to AWS was a ten-month-long journey. The entire effort was backed by volunteers from all of our engineering teams and driven by a DevOps engineer, a backend engineer, and our head of engineering.

Some things involved trial & error. We tried multiple ways of:

A perfect plan was in place, and everything looked good to go on paper, but we learned the hard way that things will not always go to plan.

At times, our near-zero downtime migration goal was at serious risk, and back to the drawing board we went.

Perseverance, drive, and fantastic team effort helped us overcome the challenges we faced.

Careful planning did wonders too. Given our capacity, we established early on that breaking down the actual migration into three stages (or days) would work best.

The Week Prior to D-Day

The Day Before D-Day

D-Day: Flicking the Switch

At this point, we were running our production workload on the shiny new infrastructure! We finished the whole thing in 10 hours (we initially estimated 8 hours – not too bad).

Challenges With AWS

The biggest struggle was with AWS DMS (Database Migration Service, the AWS managed service for moving databases into RDS).

It was not as easy to use as advertised. In our case with Postgres, it was not helpful. Eventually, we developed a custom way of moving data into AWS.

We also came to the hard realization that moving databases to AWS with zero downtime while continuing to support webhooks is complicated. We developed a custom approach to support this setup.

More on these custom approaches in future articles.

Future Articles in the Series

Look out for future articles documenting our migration journey from DigitalOcean to AWS. We will touch on topics such as: