How we saved a client from cloud disaster

As cloud and DevOps specialists, we often get hired to plan and execute a migration of a client's technical platform from on-premise locations to a cloud platform. This work involves several project stages to make sure the tech stack can be migrated to the cloud and gives the client more value than the on-prem solution did. Value in this context can be measured in terms such as:

Faster time to market.
Ability to scale the technical platform.
Improved control over running costs.

To migrate on-prem solutions to the cloud successfully, it's important to know how your existing tech stack will perform on a cloud platform, and what options you have for making the migration work.

This is the story of how we helped our client out of a failed migration.

The project

I joined the project as a cloud specialist, with the mandate to review and suggest improvements to the client's technical platform running in AWS. The technical stack was the result of a typical lift&shift operation, with services migrated from an on-prem location to EC2 instances in AWS. In short, this resulted in two AWS accounts, one for development and test and one for production use, with a self-managed OpenShift platform and lots of services running on EC2 instances.

The client had used another consultancy firm for the migration from their on-premise locations to AWS, and the work had been carried out primarily by one cloud architect at this firm. When this person was suddenly pulled from the project, my job was to take his place.

The analysis

As a cloud specialist in this situation, the first thing you do is create a bird's-eye view of how the platform is designed and implemented. You typically look at things such as IAM and security design, VPC and networking, IaC and team workflows. When you have a fair idea of how the architecture looks, you proceed by drilling down into details such as resource management, cost effectiveness and how services run on the platform. Understanding how a technical platform in AWS works can be anything from pretty straightforward to close to impossible, depending on a number of factors.

For this particular project, there was almost no documentation in place. Combined with the fact that most of the stack was set up manually on EC2 instances, without any IaC in sight, this made it very challenging to understand how things worked. Unfortunately, in my experience, this is how most projects look after a lift&shift operation.

Of course, if the tech people do not fully understand how the systems work, daily operations often take much more time than they should. For instance, all certificates in the entire system had to be managed manually, and there were lots of them. Every week, the developers had to send multiple support tickets, often related to systems failing because of certificate management issues.
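To give an idea of how little it takes to get ahead of certificate problems like these, here is a minimal sketch of an expiry check. It is not what ran in the client's environment; the function name, the 30-day threshold and the sample date are my own assumptions for illustration. It only relies on the standard library's ssl.cert_time_to_seconds, which parses the "notAfter" date format found in certificates.

```python
import ssl
import time

SECONDS_PER_DAY = 86_400


def days_until_expiry(not_after, now=None):
    """Return whole days left before a certificate's notAfter date.

    not_after uses the certificate date format, e.g. "Jan  5 09:34:43 2018 GMT".
    now is a Unix timestamp; it defaults to the current time.
    """
    if now is None:
        now = time.time()
    expires_at = ssl.cert_time_to_seconds(not_after)
    return int((expires_at - now) // SECONDS_PER_DAY)


def needs_renewal(not_after, threshold_days=30, now=None):
    """Flag a certificate that expires within threshold_days (hypothetical policy)."""
    return days_until_expiry(not_after, now) < threshold_days


if __name__ == "__main__":
    # Hypothetical certificate date, checked against the current time.
    sample = "Jan  5 09:34:43 2018 GMT"
    print(days_until_expiry(sample), needs_renewal(sample))
```

Run on a schedule against every certificate in the system, a check like this turns a surprise outage into a routine renewal task.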

Another problem with doing lift&shift to the cloud, in general, is the cost. Lift&shift often means taking a silo-like application from a bare-metal server (or a cheap VM) and putting it on an EC2 instance (if you are on AWS). I have seen examples where clients have gone from a $250 / month bare-metal server to a $3000 / month EC2 instance, because of the resource requirements of that particular instance. This project was no exception, and part of our work as cloud specialists is to suggest architectures that will lower the cost of resources in the cloud. This can, for example, mean that you rewrite a service so that the parts of it doing the heavy lifting are split out, migrate those parts to AWS Lambda, and pay only for the seconds the service actually runs. This will often bring costs down from thousands of dollars to a much more manageable level.
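The arithmetic behind that kind of saving can be sketched roughly as follows. Only the $3000 / month figure comes from the example above. The workload numbers (invocations, duration, memory) and the two price rates are assumptions for illustration, so check the current AWS Lambda pricing for your region before relying on them.

```python
def lambda_monthly_cost(invocations, avg_duration_s, memory_gb,
                        price_per_gb_second, price_per_million_requests):
    """Estimate a monthly pay-per-use bill: compute time plus request charges."""
    compute = invocations * avg_duration_s * memory_gb * price_per_gb_second
    requests = invocations / 1_000_000 * price_per_million_requests
    return compute + requests


# Always-on instance cost, taken from the example in the text.
ec2_monthly = 3000.0

# Hypothetical workload and assumed price rates (not from the project).
serverless_monthly = lambda_monthly_cost(
    invocations=2_000_000,
    avg_duration_s=1.5,
    memory_gb=2.0,
    price_per_gb_second=0.0000166667,
    price_per_million_requests=0.20,
)

print(f"Always-on instance: ${ec2_monthly:,.2f} / month")
print(f"Pay-per-use estimate: ${serverless_monthly:,.2f} / month")
```

The point is not the exact numbers, but that an always-on instance is billed for every hour of the month, while a pay-per-use service is billed only for the time the work actually takes.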

However, beyond the high costs and difficult operations, it was the following issues that made me raise a concern with the company management:

Disaster recovery - we had literally no way to recover the systems if that became necessary.
A very complex network setup, with lots of traffic going across VPCs without peering.

The plan

As a consultant with 20+ years of experience, I know that in a multinational corporation, a concern raised by a simple DevOps guy is not going to stir up any big changes at the executive level, at least not for a while. So I went ahead and made a disaster recovery plan in case we should need it. It basically meant preparing to migrate the whole tech stack from its existing AWS accounts to new accounts, greenfielding the whole thing if shit should hit the fan, so to speak.

The greenfielding plan included moving from the current OpenShift platform, which was poorly configured and had no managed solution backing it, to a more suitable (and much easier to manage) setup with Fargate on ECS. It also included enabling applications and databases to communicate over private networks, building robust CI/CD pipelines for provisioning infrastructure and cloud services, and off-loading work to serverless technology such as Lambda functions.

The incident

Despite my warnings, efforts to prevent a major incident were not prioritized by management. One afternoon, the nodes of the OpenShift cluster in the production environment suddenly started to die, one by one. Despite spending hours debugging the cluster, reading logs, going through implemented changes and doing all sorts of system tests, we couldn't find out what was causing the problems. Most of the client's services were down or not working correctly, so time was of the essence.

Much later, we managed to come up with a theory of why this incident happened: it was caused by a number of weaknesses that could all be traced back to the migration from the client's on-premise locations to AWS.

The fix

As mentioned, I had already prepared new VPCs and everything around building the infrastructure for running the services in ECS, so we were able to spin up a brand new platform in AWS within a few hours. I had also prepared configurations for the applications running on OpenShift, so that these could be migrated to ECS with relatively little pain. Even though we were able to move the applications to a new and happy place, there were still a number of databases, stand-alone monolithic services running on EC2 instances, and even a few cloud services that we hadn't had time to prepare for the move. With the help of VPC peering and years of experience working with cloud platforms, we were able to make all the services communicate with each other across multiple AWS accounts and networks.
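One detail worth checking before you lean on VPC peering is that the VPCs involved cannot have overlapping CIDR blocks. The check below is a minimal sketch of that precondition using Python's standard ipaddress module. The VPC names and address ranges are hypothetical and are not the client's actual network layout.

```python
from ipaddress import ip_network
from itertools import combinations


def find_overlaps(vpc_cidrs):
    """Return pairs of VPC names whose CIDR blocks overlap.

    vpc_cidrs maps a VPC name to its CIDR block, e.g. {"prod": "10.0.0.0/16"}.
    """
    networks = {name: ip_network(cidr) for name, cidr in vpc_cidrs.items()}
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical address plan: one legacy VPC and two planned VPCs.
planned = {
    "legacy-prod": "10.0.0.0/16",
    "new-prod": "10.1.0.0/16",
    "new-dev": "10.0.128.0/20",  # sits inside the legacy range on purpose
}

for name_a, name_b in find_overlaps(planned):
    print(f"Cannot peer {name_a} and {name_b}: CIDR blocks overlap")
```

Doing this while planning the new accounts, rather than during an incident, is what makes it possible to connect old and new networks quickly when you need to.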

During the following weeks, we proceeded to clean up the client's development and production environments by moving more services over to the new accounts. This work included migrating self-managed databases running on EC2 instances to managed and serverless services. We also reviewed the monoliths, took out the parts that accounted for a large chunk of the running AWS costs, and migrated these to AWS Lambda, which drastically reduced the EC2 costs.

With what was basically a new and improved cloud architecture, we were also able to design and implement processes for cost management, to control costs better in the future. For example, by introducing IaC and automating the provisioning of cloud services, we could very easily schedule downsizing of not only the services running in ECS, but also databases and other cloud services.
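The decision logic behind scheduled downsizing can be very simple. The sketch below assumes a policy of running at full size during weekday office hours and at a minimum size otherwise. The hours and task counts are hypothetical, and the code only computes the target size. Applying it to an ECS service or a database is left to whatever tooling does the provisioning.

```python
from datetime import datetime


def desired_count(now, peak_count, off_peak_count, start_hour=7, end_hour=19):
    """Return how many tasks a service should run at the given time.

    Full size on weekdays between start_hour (inclusive) and end_hour
    (exclusive); the off-peak size at all other times.
    """
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_office_hours = start_hour <= now.hour < end_hour
    return peak_count if is_weekday and in_office_hours else off_peak_count


# Hypothetical development service: 4 tasks in office hours, 0 otherwise.
for moment in (
    datetime(2024, 1, 8, 10, 0),   # Monday morning
    datetime(2024, 1, 8, 22, 0),   # Monday night
    datetime(2024, 1, 6, 10, 0),   # Saturday morning
):
    print(moment.isoformat(), desired_count(moment, peak_count=4, off_peak_count=0))
```

For development and test environments that nobody uses at night or on weekends, a rule like this removes well over half of the running hours in a week.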

We went from handling multiple system issue reports every week to seeing the number of issues drop to zero after we greenfielded the platform. In addition, we brought down costs by almost 50% for the organization's whole cloud operation, so this was a successful rescue mission in my book.

The lesson

Migrating to a cloud platform is not an easy task. Unfortunately, in my experience, many organizations do not invest enough in the migration process and risk ending up in a failed lift&shift situation, like our client in this article. In this particular case, our client got into problems because of, quite frankly, bad advice from the consulting firm that initiated the migration process. Because migrating most on-prem systems to a cloud vendor is complex, it is vital to have deep knowledge of cloud architecture and of the requirements of the systems that are being migrated.

A successful migration project involves the whole organization at some level, from management and product people to developers and DevOps professionals. As cloud specialists, our role is often to take ownership of the process, to plan and facilitate, and to make sure that the end result meets the organization's requirements. With all actors and components working together, we can make sure the transition to the cloud is a successful investment.