Architecture for cloud cost optimization

Search Google for a phrase such as "aws cost optimization" and you will get countless articles listing the most common techniques for reducing costs in AWS. Typically, these are actions like rightsizing, instance reservation, utilization monitoring and a couple of other common cost saving techniques, all reasonably easy to understand without being a cloud specialist.

Take rightsizing as an example. It basically means: "Make sure your AWS services, such as EC2, RDS or EBS volumes, match your workload's requirements for performance and capacity". Pretty simple to grasp, at least at a conceptual level. Another example of a common cost optimization task is scheduling, which means "Pay for what you need, when you need it". It is a bit more complex to implement, but the concept is still easy to understand.

The problem is that most of the "low hanging fruit" cost optimization techniques do not necessarily lead to drastically reduced costs. In this article I explain some of the reasons why, and how important a cloud native architecture is for running a cost effective cloud platform.

Definition of cost effectiveness

In general, there are a few different ways to define how cost effective an IT operation in the cloud is. There are the obvious ones that I have already mentioned, i.e. the tools and mechanisms that a cloud vendor offers for reducing running costs, such as EC2-bound costs (traffic, storage, operational minutes) and costs related to services like RDS, where you pay for requests and storage. In addition to these, there are other metrics, such as:

Continuous delivery and time to market
A cloud platform offers a very high delivery efficiency.
Scalability
For some organizations, high scalability is not only highly valued but also a hard requirement.
Availability
Possibly one of the top reasons to migrate to the cloud is having hundreds of managed services available.
Flexibility
At least the biggest cloud vendors offer good cloud computing solutions for both small and large IT operations.
DevOps instead of ops
Move platform operations into the teams by letting developers manage their own cloud services. Rich APIs make it easy to interact with services on a cloud platform.
Automation and IaC
I think most will agree that CLI tools and Terraform have made automation and scripting a lot easier than working with the huge configuration scripts needed to configure systems in the pre-cloud era.

I have worked with organizations with on-premise operations where it took one week to get a load balancer in place, and where we didn't have Kubernetes clusters with node auto-scaling, resulting in constant monitoring and fear of running out of nodes for our pods. If you go from an architecture like that to a cloud vendor, you have already made your IT operations more cost effective, in some respects.

It is important to remember that the services provided by a cloud vendor will almost always be more expensive, if you look at cost per hour, per MB, or any other metric we can measure cost in, than an on-premise provider offering a bare-bones server or VM. Spinning up an open source version of MongoDB on a bare metal server on Linode is dirt cheap. Provisioning an equivalent managed service in AWS, i.e. DocumentDB, is going to cost over $200 per month for three replicas of the smallest instance. Still, the DocumentDB service from AWS can be the most cost effective choice, depending on an organization's operational ability and functional requirements.

Take this blog for example. It runs on containers in ECS and uses DocumentDB to store articles and comments. We could have used a cheaper, relational database in RDS, but a document database is very suitable for storing articles with comments, and de-normalized data in general. So, a document database was one of the requirements when developing this blog. Other requirements were high availability, redundancy and backup. As a small business we have limited time and resources to spend developing and monitoring stuff we don't get paid for, so a managed service is the obvious choice. By our company's definition, using a pricier managed service from AWS is more cost effective, provided we take the necessary cost optimization measures.

Cost of cost optimization

I would like to briefly touch on this topic: how much does a cost optimization project cost, and is it worth it? I often hear this concern when working with clients, and the question of whether a cost management project is worth the investment is of course valid. However, it is almost never the case that cost optimization measures end up costing more than the organization saves. To understand why, we also need to understand how cloud services accumulate cost, and what the main cost drivers are.

Take a look at this illustration, describing the cost related to cloud operations, for a small product company, throughout a cost optimization project:

cost-optimization-phases-illustration

* Accumulated costs: What's on the bill from the cloud vendor + other costs directly related to managing cloud operations, such as hired consultants.
* Phase 1: Mapping the organization's needs and requirements, planning the architecture redesign and implementing a cloud native design.
* Phase 2: Implementing cost optimization measures, enabled by phase 1.

The organization is expanding its operations, and as a result, accumulated cloud costs rise. Total cloud costs are predicted to reach $15,000 by September.
The company initiates a cost optimization project by hiring a cloud specialist consultant and allocating 30% of one developer's time to assist the project.
When the cost optimization project starts, total accumulated cloud costs go up, but by the end of phase 1, costs are starting to drop.
By the end of phase 2, costs have dropped drastically. There are now tools in place to monitor and control cloud costs.
By the end of the project, cost management tools and knowledge are handed over to the team.

Because of the financial model of cloud computing, costs will go up if no measures are taken, even without any noticeable changes to the organization's IT operations (i.e. no new products deployed, no increase in the number of customers etc.). This is because many of the services are billed by used storage space and traffic, so cloud operations will see a slight increase in cost even without any changes to the platform. Of course, most IT operations are never in this "stalled" state, which means that in reality new stuff is being deployed, new databases are provisioned, backups accumulate, new technical solutions are tested etc.
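To make the "stalled" case concrete, here is a minimal sketch of how a storage bill creeps upwards when nothing on the platform changes except that data keeps accumulating. The function name and all numbers (starting volume, monthly growth, price per GB) are assumptions made up for illustration, not figures from a real bill.

```python
def monthly_storage_bills(initial_gb, added_gb_per_month, price_per_gb_month, months):
    """Return the storage bill for each month, assuming linear data growth.

    All parameters are hypothetical; real pricing depends on service,
    storage class and region.
    """
    bills = []
    for month in range(months):
        stored_gb = initial_gb + added_gb_per_month * month
        bills.append(stored_gb * price_per_gb_month)
    return bills


# Illustrative numbers only: 1 TB stored today, 50 GB of backups and logs
# added every month, and an assumed price of 0.10 per GB-month.
bills = monthly_storage_bills(1000, 50, 0.10, 12)
for month, bill in enumerate(bills, start=1):
    print(f"month {month:2d}: {bill:7.2f}")
```

Even in this simplified model the bill rises every month, although no new product has been deployed and no customers have been added.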

This is why the argument that a cost optimization project (if done correctly) can end up costing more than it saves is almost never true. In fact, cost optimization is a key factor when working with cloud computing, and should be part of the functional and technical architecture.

Architecture for optimal cost effectiveness

Believe me when I say that an effective cost optimization strategy for cloud computing is impossible without a correct architecture. As I mentioned above, there are several definitions of "cost effective", and the meaning varies between organizations, based on both functional and technical requirements. However, it is always possible to design a cloud native architecture without sacrificing qualities such as flexibility, scalability, security etc. Always!

Traffic as the main cost driver

Let's have a look at the following illustration, describing a scenario from a project we worked on some time ago (the example is simplified, the real case was a bit more complex):

without-vpc-peering

Our client had a functional requirement that data be shared from the production environment to the stage environment several times a day (for simplicity, I have called these Stage VPC and Production VPC). The data source in the production environment was on a public subnet, and traffic between the two VPCs was routed through a NAT gateway. As the amount of data that needed to be transferred grew, so did the data transfer costs (the NAT gateway is an expensive service in AWS).

The next illustration shows how we lowered the data transfer costs by 90%:

with-vpc-peering

By setting up VPC peering between the Stage VPC and the Production VPC, we managed to wipe out all costs accumulated by traffic through the NAT gateway. We had to make some technical changes, such as changing the CIDR on one VPC. This is an obvious solution when sharing data between two different VPCs, but it often requires changes to the technical systems at some level. For instance, you may need to rewrite configuration for APIs etc.
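The CIDR change was needed because two VPCs can only be peered when their IP ranges do not overlap. A minimal sketch of how that can be checked before planning a peering, using Python's standard `ipaddress` module. The CIDR blocks below are made-up examples, not the client's real ranges, and `cidrs_overlap` is a hypothetical helper name.

```python
import ipaddress


def cidrs_overlap(cidr_a: str, cidr_b: str) -> bool:
    """Return True if the two IPv4 CIDR blocks share at least one address."""
    network_a = ipaddress.ip_network(cidr_a)
    network_b = ipaddress.ip_network(cidr_b)
    return network_a.overlaps(network_b)


# Hypothetical ranges for illustration only.
production_vpc = "10.0.0.0/16"
stage_vpc_before = "10.0.128.0/20"  # sits inside the production range
stage_vpc_after = "10.1.0.0/16"     # moved to a separate range

print(cidrs_overlap(production_vpc, stage_vpc_before))
print(cidrs_overlap(production_vpc, stage_vpc_after))
```

Running a check like this against every VPC pair early on tells you whether a re-addressing exercise, with the configuration changes that follow from it, has to be part of the plan.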

Pay for what you need, when you need it

Maybe the most effective cost optimization measure is scheduling of cloud services. Scheduling basically means shutting down or pausing services when they are not in use. One requirement for being able to schedule services effectively is the use of automation tools and IaC, which most DevOps teams are pretty familiar with these days. But apart from the obvious need for automation and some way to deploy configuration by code, the most complex impediments to effective scheduling of environments are found at the architectural level.
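As a rough illustration of the idea, the sketch below decides whether a service should be running at a given time of day, and how large a share of the week a schedule keeps it up. The 07:00 to 18:00 window, the five-day week and the function names are assumptions for the example. A real setup would feed a decision like this into whatever automation tooling the team already uses.

```python
from datetime import time


def should_be_running(now, start=time(7, 0), stop=time(18, 0)):
    """Return True if `now` falls inside the daily window [start, stop)."""
    if start <= stop:
        return start <= now < stop
    # The window wraps past midnight, e.g. 22:00 to 06:00.
    return now >= start or now < stop


def weekly_running_fraction(hours_per_day, days_per_week):
    """Share of the 168 hours in a week that the service is up."""
    return hours_per_day * days_per_week / (24 * 7)


print(should_be_running(time(12, 30)))  # inside the assumed window
print(should_be_running(time(23, 0)))   # outside the assumed window
print(f"{weekly_running_fraction(11, 5):.0%} of the week")
```

With these assumed hours the service is up for roughly a third of the week, which shows why scheduling is attractive for anything billed by running time.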

A few years ago I was working on a large cost optimization project, with a very complex system architecture and lots of data flowing between different domains in the organization. For instance, the client had multiple Kafka consumers running in Kubernetes, loading data into the system from Kafka topics owned by other domains within the organization. The flow of data between domains proved to be one of the biggest obstacles to implementing cost optimization measures. When we scaled down the consumer pods in Kubernetes between 18:00 and 07:00, the consumers had to process, once they were back up, all the messages that had not been consumed while the pods were down. The autoscaler kicked in and scaled up the deployment to multiple pods, to consume the waiting messages as efficiently as possible. In the end, we didn't save any costs on the Kubernetes cluster with this approach.
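The reason is easy to see in a simplified model: if producers keep writing at the same rate around the clock, the total amount of work is unchanged, and pausing the consumers only moves that work into a shorter window. The sketch below is an idealized calculation with made-up rates and hypothetical function names. It ignores pod startup time, node-level pricing and autoscaler behaviour.

```python
import math


def pod_hours_always_on(msgs_per_hour, msgs_per_pod_hour, hours=24):
    """Pod-hours needed per day when consumers keep up with producers."""
    return msgs_per_hour * hours / msgs_per_pod_hour


def pod_hours_scheduled(msgs_per_hour, msgs_per_pod_hour, off_hours=13, hours=24):
    """Pod-hours needed per day when consumers are paused for `off_hours`.

    While up, the consumers must handle live traffic plus the backlog
    that built up while they were down.
    """
    on_hours = hours - off_hours
    backlog = msgs_per_hour * off_hours
    live = msgs_per_hour * on_hours
    return (live + backlog) / msgs_per_pod_hour


def peak_pods_scheduled(msgs_per_hour, msgs_per_pod_hour, off_hours=13, hours=24):
    """Pods needed to drain live traffic and backlog within the on-hours."""
    on_hours = hours - off_hours
    total = msgs_per_hour * hours
    return math.ceil(total / on_hours / msgs_per_pod_hour)


# Hypothetical rates: 10,000 messages/hour produced, 5,000 per pod-hour consumed.
print(pod_hours_always_on(10_000, 5_000))
print(pod_hours_scheduled(10_000, 5_000))
print(peak_pods_scheduled(10_000, 5_000))
```

In this model the pod-hours are identical with and without scheduling, and the scheduled variant needs more pods at peak, which matches what we saw on the cluster.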

This is an example of an architecture that is not compatible with scheduling cloud services, due to system interdependencies. In this particular case, the solution was to implement processes to sync scheduling of the Kafka producers, in the other domains, with scheduling of the consumer pods. This turned out to be a pretty complex exercise, because of the organizational structure of the company.

A monolith in the cloud

Another very common scenario when migrating on-premise systems to a cloud vendor is trying to make a monolithic application play ball in the cloud. First, let me make it clear that a monolithic approach to system development is not necessarily a bad thing. In fact, it could be the best architecture for many applications. However, running a monolith on a cloud platform in a cost efficient way requires some considerations that are often not necessary on-premise. The following should be considered:

Does the monolith utilize a lot of hardware resources (CPU and memory)?
Are the processes in the monolith long running and CPU expensive, or do they require burst performance?
Could short lived processes be extracted and moved to serverless services such as AWS Lambda?
What are the estimated costs for running the monolith in a container vs. a VM (EC2)?

monolith-to-lambda

A common scenario is a monolith that implements bulk operations for importing and exporting data. Instead of provisioning a very expensive EC2 instance because of the burst requirements of the bulk operations, it can be much more cost efficient to extract the import and export processes to a serverless technology, such as AWS Lambda. The illustration above shows how one monolith is split up into Lambdas and Fargate. There are many cases where this is NOT a cost efficient architecture, but it serves as an example of how you can save a lot of money by adapting a system architecture to be more cloud native.
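A back-of-the-envelope comparison can help decide whether such a split is worth investigating. The sketch below compares a monolith on one instance sized for the bursts with a smaller instance plus a serverless function for the bulk jobs. Every number is a placeholder assumption (hourly prices, job count, duration, memory, per-unit function prices), not current AWS pricing, so plug in figures from your own bill and the vendor's price list.

```python
HOURS_PER_MONTH = 730  # common approximation used in monthly estimates


def monthly_instance_cost(hourly_price):
    """Cost of one instance running all month."""
    return hourly_price * HOURS_PER_MONTH


def monthly_function_cost(invocations, avg_seconds, memory_gb,
                          price_per_gb_second, price_per_million_requests):
    """Cost of a serverless function billed per GB-second and per request."""
    compute = invocations * avg_seconds * memory_gb * price_per_gb_second
    requests = invocations / 1_000_000 * price_per_million_requests
    return compute + requests


# Placeholder prices and workload, for illustration only.
large_instance = monthly_instance_cost(hourly_price=0.40)   # sized for bursts
small_instance = monthly_instance_cost(hourly_price=0.10)   # sized for steady load
bulk_jobs = monthly_function_cost(
    invocations=60, avg_seconds=600, memory_gb=4,
    price_per_gb_second=0.0000167, price_per_million_requests=0.20,
)

print(f"monolith on large instance: {large_instance:8.2f}")
print(f"small instance + functions: {small_instance + bulk_jobs:8.2f}")
```

The picture flips if the bulk jobs run very often or for a long time, which is one of the cases where the split is not cost efficient.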

Conclusion

In this article I have emphasized the correlation between system architecture and cloud costs, or rather how certain details in a system architecture can make it difficult to implement efficient cost optimization measures. In most of the cost optimization projects we do for clients, we see that the biggest and most complex blockers for keeping costs down are anchored in details of the underlying architecture, which may or may not be trivial to change. An architecture that was acceptable on an on-premise platform can have bits and pieces that are considered technical debt when the system is running on a cloud platform. This is why it is so important to add cost optimization as a variable when you design a system for the cloud.