
Things I wish I knew before I married my Cloud Vendor

“The Cloud”. We’ve all heard about it, surely, but how much do we know about it, really? And, more importantly, what is our role in relation to The Cloud?
For some of us, The Cloud is just a space where we store things online… You know, like photos: they are ‘up there’, safely stored (hopefully) and ready to use whenever and wherever we want. That was me, about 7 years ago. Just a regular cost accountant, with some inkling of, and interest in, the depths of technology. Then, one day, I saw a job advert that changed everything for me: “Cost Optimisation Consultant”. The description suggested that the role was about understanding and managing the costs of technology, more specifically the cost of “The Cloud”. In other words, what I was reading was the junction where my cost accounting background intersected with my interest in software and technology. That was “my call”, and where my FinOps career started.
I must say, the road I took was not an easy, well-paved motorway. More accurately, the path I’m on feels like one of those patchy country roads (“B” roads if you are in the UK). The world of technology is ever-expanding, fast-moving, and can be overwhelmingly complex when you come from a different field - like finance and accounting. However, I did learn some tricks along the way and, while I’m no expert, I decided to start a series of blog posts to both share my experience and try to pave the way for those who are thinking of taking this path or have just started on it. So hop on, fasten your seatbelt, and join me on this ride (oh, yeah, and bring some takeaway coffee too).
So, for this first article, let’s start at the beginning, with one of the first things you’ll discover when you start your Cloud journey and tune in to some of the free training each cloud provider offers: Purchase Models.
Broadly speaking, there are two ways you can pay for the services you consume from your Cloud Provider: On-demand (pay for what you use) and Commitments (pay for what you think you are going to use). While the former boasts flexibility, the latter teases 30% to 70% lower rates. The catch, you wonder? You will have to stay married to your cloud services for 1 to 3 years (depending on your plan). Should you renounce your vows, you will find yourself in a disadvantageous divorce, having to pay for the remainder of your commitment whether you use it or not.
For many of us - those who are in a position to advise on these purchase options - this is where the night terrors come in. How can we be sure that we are making the right decision, mitigating risks and maximising discounts on behalf of the company? Let’s rip the band-aid off: you will never have full guarantees. There are plenty of stakeholders within the organization who make technology decisions day in and day out that could (and will) affect your bill. That said, there are ways of managing that risk while you grow in expertise and confidence. So I decided to break our options down into levels.
Level 1 (quick & easy) - Spend per hour
Whether you are new to this, or the company itself has only been operating for a short period of time (in which speed and flexibility were more important than optimizing for cost), your safest place to start is a commitment based on your hourly spend. In other words, you may not know which region you will be operating in, or what type or size of compute power you are going to use, or the account you will need the discounts in, but if you know roughly how much you will be spending on Compute (and that’s an important caveat), then you can commit to a certain hourly spend and benefit from discounts on it.
Of course, there will be small differences between cloud providers, but by and large, the hourly spend commitment works in the following way:
They apply to compute usage (a.k.a. virtual machines), including serverless compute and containers. Note that AWS also has a dedicated plan for Machine Learning, and Azure will cover anything that runs on underlying virtual machines.
They are based on the spend per hour. Not per day, not per month: per hour. So if at any given hour you don’t reach your committed spend, you’ll still pay for your commitment.
They come in 1- or 3-year commitments that can be paid all upfront, partially upfront, or monthly (no upfront). The more you commit, the higher the discount, which will be in the range of ~30% to ~70%.
They apply across accounts (there is a gotcha here that I’ll explain below in the “watch out for” section), regions, compute types, compute sizes and compute services.
They will be distributed and applied automatically by the cloud provider to maximize savings.
These types of discounts are known as “Savings Plans” in AWS and Azure, or “spend-based flexible committed use discounts (CUDs)” in GCP (a bit wordy, GCP, isn’t it?). The way to get started is by analysing your historic compute usage (ideally at an hourly level) to see what your baseline on-demand spend is. Alternatively, all of the aforementioned providers will offer you a recommendation, which is usually fairly accurate.
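If you like to see things in numbers, here is a minimal sketch of that baseline analysis in Python. Everything in it is an assumption for illustration: the CSV name, the column names and the 30% discount are placeholders, not any provider’s actual format or pricing.

```python
import pandas as pd

# Hypothetical export of hourly on-demand compute spend,
# with columns: timestamp, on_demand_spend
usage = pd.read_csv("hourly_compute_spend.csv", parse_dates=["timestamp"])

# A conservative baseline: the hourly spend level you exceed ~95% of
# the time. Committing at this level means you'd rarely pay for an
# hour of unused commitment.
baseline = usage["on_demand_spend"].quantile(0.05)

discount = 0.30  # placeholder; real rates depend on term and payment option
hourly_savings = baseline * discount

print(f"Suggested hourly commitment: ${baseline:,.2f}/hour")
print(f"Estimated savings at {discount:.0%}: ${hourly_savings:,.2f}/hour "
      f"(~${hourly_savings * 24 * 365:,.0f}/year)")
```

Whatever number comes out, treat it as a starting point to compare against your provider’s recommendation, not as the answer.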
Watch out for
While all this sounds simple and easy (and to be fair, it kinda is), there are a few things to consider before purchasing:
First and foremost, always talk to your engineers and developers, understand their plans, and make sure you are aware of any big upcoming changes that may impact your utilization (especially if they plan to reduce the compute size or the number of instances; if they plan to increase them, you can always buy more).
Always remember: these apply to compute and only compute hours (not database engines, not storage, not data transfer charges, etc.). So even if you are within a compute service, e.g. AWS EC2, Azure VMs or GCP Compute Engine, the discount will apply only to the hourly compute charge - not the SSDs attached to the instances, not the data transfer charges, nothing but the compute.
If you are in a multi-account environment, make sure you enable discount sharing across your accounts, otherwise the discount will be limited to the account you purchased it in. Also, in most cases it is best to purchase at a ‘payer account’ level (or any account that you don’t have workloads running on) to make sure that the discounts flow downstream and apply in a way that maximizes savings (otherwise they may apply first to all compute usage in a given account before distributing the remaining savings across the organization).
I would strongly advise doing your own scenario analysis at an hourly level before making any purchase, and comparing that against the provider’s recommendation (see the sketch just after this list). Not only will it help you better understand your business, it will also allow you to test out different options and find the sweet spot between savings and waste. It’s also a good sanity check against your provider (which, ultimately, is always trying to sell you something).
Read, and then read again, exactly which services are covered by your provider’s discount before you purchase.
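As a companion to the scenario-analysis advice above, here is a hedged sketch of what that sweep could look like, reusing the same hypothetical hourly export as the earlier snippet. It models the commitment in on-demand-equivalent dollars, which is a simplification of how providers actually account for it:

```python
import numpy as np
import pandas as pd

usage = pd.read_csv("hourly_compute_spend.csv", parse_dates=["timestamp"])
spend = usage["on_demand_spend"].to_numpy()

def net_savings(hourly_spend, commitment, discount=0.30):
    """Average net hourly savings for one candidate commitment level."""
    covered = np.minimum(hourly_spend, commitment)  # usage the commitment absorbs
    waste = commitment - covered                    # committed spend left unused
    # Covered usage saves `discount`; unused commitment is still paid for
    # at the discounted rate, costing (1 - discount) per wasted dollar.
    return (covered * discount - waste * (1 - discount)).mean()

candidates = np.linspace(spend.min(), spend.max(), 50)
best = max(candidates, key=lambda c: net_savings(spend, c))
print(f"Sweet spot: commit ~${best:,.2f}/hour "
      f"(avg net savings ${net_savings(spend, best):,.2f}/hour)")
```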
Pro tips:
GCP may have an advantage here, as it has a wider range of services (beyond compute) for which it offers a spend-based discount. Whether you are already locked in with GCP, or are just starting out and evaluating your options across providers, keep this in mind and see how it may impact your bottom line.
Compute is - normally - one of the top services you will spend money on, making it a great first target for lowering your monthly bill.
Talk to finance and understand your business needs. If your company has a healthy cash flow, they may prioritize discount rates over cash preservation and go for all-upfront options. Or money may be tight, in which case you may opt for a more modest discount and avoid a one-off lump payment.
If I were taking my first FinOps actions in a company, and I was really uncertain about future consumption, I would rather buy smaller commitments at shorter intervals than one big one every year (or every three years). This way you lower your risk and have regular expirations coming up over the 12-to-36-month period, at which you can re-evaluate whether you need to renew your commitment or not. A good place to start is one commitment per quarter (there’s a small sketch of this after these tips). A mix of 1- and 3-year commitments can also help you reduce the risk of changing usage patterns.
I said it before and I’ll say it again here: never make a decision (let alone a purchase) in isolation. Talk to developers, engineers, managers, even directors, about their plans within the horizon that your commitment will cover. If they are planning, for whatever reason, to significantly drop their compute usage, you will want to know. This could be a big workload being decommissioned, or even a migration to a different cloud provider for strategic reasons.
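To picture the quarterly laddering idea, here is a toy Python sketch. The tranche size and the 24-month window are made up; the point is only to show how coverage ramps up and how expirations create regular decision points:

```python
TRANCHE = 1.0  # $/hour bought per purchase (placeholder)
TERM = 12      # months each 1-year commitment lasts

purchases = [0, 3, 6, 9]  # buy one tranche per quarter in year one

for month in range(24):
    active = sum(TRANCHE for start in purchases
                 if start <= month < start + TERM)
    expiring = any(month == start + TERM for start in purchases)
    note = "  <- tranche expires: renew or let it lapse?" if expiring else ""
    print(f"Month {month:2d}: ${active:.2f}/hour committed{note}")
```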
Level 2 - Hourly spend commitment for a resource and service (measured in units)
If you are interested in a bit of history, these kinds of commitments/discounts were the first iteration of discounts, predating the spend-based commitments described above. But because they weren’t (and still aren’t) easy to manage, cloud providers needed to come up with an easier way to handle discounts. That’s it for the history lesson; let’s dive in.
In a way, this discount is similar to the ones in Level 1 above: it applies at an hourly level, comes with a 1- or 3-year lock, can be purchased all upfront, partially upfront or with no upfront, and will carry discounts of between ~30% and ~70%. The difference lies in how flexibly it applies, with this kind of commitment being locked to a region, a service, and even a family within a service (more nuance on this below).
Generally speaking, there won’t be many cases in which you would choose this option for compute discounts. One exception could be if you decided to engage a third-party provider offering autonomous commitment management; they will actively monitor and exchange commitments in near real time to match your demand. Where you will want to use these types of commitments is when you start targeting savings in services other than compute (e.g. databases) because, at the time of writing this article, these are the only discounts available for them.
This kind of discount is known as Reserved Instances (AWS), Reservations (Azure) or resource-based committed use discounts (GCP), and the premise is that you commit to spending a set amount of dollars per hour, for a specific service, in a specific region, within a specific instance family. Straight away you can see that a lot of flexibility goes out of the window. Now you need to know which region you are staying in and, more importantly, which instance family you will be using (which, in a fast-paced field like technology, with innovation measured in days rather than years, can feel a bit restrictive).
The specifics of what is covered will change from one provider to another, but by and large you can leverage these reservations for services like compute (albeit not advised), databases, cache, and search & analytics, among others. Same as before, the discounts can be shared across accounts within an organization, provided you have enabled that feature first, and it’s best to purchase them at a higher level (e.g. the payer account) to maximize savings - unless you are in GCP, in which case you will be forced to purchase them at a project level (why? I don’t know). There is, however, one important concept to understand here: “normalized usage”, or “footprint”.
Because Reservations are locked to a family within a service (a specific type of machine) rather than to the size of the machine, you will need to understand what your purchase equates to. To make it a bit clearer: Cloud services come in a variety of shapes and forms. For many of them, that means you can choose a “family” (e.g. “m5” in AWS RDS), which tells you the tech specs of a resource, such as the type of processor. Then, you can choose the size within that family (number of vCPUs and memory size). With that in mind, what you need to understand is that when you purchase reservations you are purchasing a relative unit of measure within a family. Let’s try a made-up example:
Day 1: you are running one Virtual Machine called “vmX1.small” at $100/hour.
Day 2: you decide to buy reservations to cover your usage of “vmX1.small”, which will give you a 30% discount. You start saving $30/hour and spending $70/hour. This covers 100% of your usage.
Day 3: one of the developers tells you that they need to scale because of increasing demand. They are now running on “vmX1.large”, costing $200/hour. You check the relative units of measure and realise that 1 small unit = 0.5 large units. Automatically, you are now covering only 50% of your current usage with discounts, meaning that your discount didn’t change (you are still saving $30/hour), and you will be spending $170/hour on your compute.
Day 4: you decide you want to increase your reservation to cover all of your usage. You purchase the same reservation as on day 2: 1 instance of “vmX1.small”. You now have a total of 2 “vmX1.small”, which equals 1 “vmX1.large”, bringing you back to 100% coverage and $60/hour in savings.
It sounds complex, but believe me, soon enough you will get the hang of it (and I bet you will end up with snappy spreadsheets full of formulas to automate these conversions).
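Here is what one of those conversion formulas could look like in Python rather than a spreadsheet. The size factors are invented to match the story above (one large is worth two smalls); real providers publish their own normalization tables, so treat these numbers as placeholders:

```python
# Made-up normalization factors, expressed in "small-equivalent" units.
SIZE_FACTORS = {"small": 1.0, "large": 2.0}

def coverage(reserved, running):
    """Fraction of running usage covered by reservations, in normalized units."""
    reserved_units = sum(SIZE_FACTORS[size] * n for size, n in reserved.items())
    running_units = sum(SIZE_FACTORS[size] * n for size, n in running.items())
    return min(reserved_units / running_units, 1.0)

print(coverage({"small": 1}, {"small": 1}))  # Day 2: 1.0 (100% covered)
print(coverage({"small": 1}, {"large": 1}))  # Day 3: 0.5 (50% covered)
print(coverage({"small": 2}, {"large": 1}))  # Day 4: 1.0 (100% again)
```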
Watch out for
Generally speaking, everything we considered in Level 1 applies here too: talk to developers, you will save only on the compute hours, enable cross-account discount sharing, and do your own hourly analysis.
Make sure you understand your cloud provider’s restrictions on reservations: which locks you are agreeing to (region, instance family, account, etc.) and how the units of conversion work. Note that not all services have these units of conversion (something we will explore in the next level), and some services, e.g. AWS OpenSearch, will not allow family flexibility, forcing you to choose the right instance family and size from the get-go.
Some providers have grace periods in which to return or refund reservations. Be aware of what these are and make sure to leverage them if you feel you’ve made a mistake or the circumstances have changed.
Different providers implement different degrees of flexibility depending on the option. AWS offers Convertible Reserved Instances, which allow you to split, group and change reservation configurations, and Azure will ask you to choose size flexibility when setting up your reservation. Make sure you understand these little subtleties of your provider so you don’t regret your purchase.
For GCP, these commitment discounts can only be purchased at a project level (rather than a billing account level). This means that if you have a centralized FinOps team (or sole practitioner) who manages purchases, they will need the right level of access to the specific project(s).
Level 3 - Hourly spend commitment for a resource and service (measured in capacity)
When you reach this level, things start to get complicated. While the time horizon remains the same (1- or 3-year commitments), what you are committing to moves from units (instances) to capacity. This means you now need to understand at what levels of processing your infrastructure performs before making a decision about the future. Forecasting this capacity can be quite tricky and customer-dependent. It is one thing to predict how many database instances you are going to need for your application, but quite another to estimate how many reads and writes your customers are going to perform at an hourly level.
For services like DynamoDB (AWS) and Cosmos DB (Azure), this ‘capacity’ can be measured in different units - throughput, reads, writes, etc. - and may fluctuate much more than a simple instance running for days. For someone like me, who comes from a finance background, this immediately sounds more complex: now I need to understand more abstract things that I can’t just go and count myself. This is why collaboration is key, and you need to forge partnerships with your engineers and developers. Even after you grasp the basics of these variables, they will be the best people to advise you on what levels of utilization to expect.
As before, historical data will be key, and analyzing previous consumption at an hourly level can help you estimate a safe baseline to commit to. What works for me when I’m preparing to purchase capacity is to break down my last 4 months of usage by day and hour, set a variable field to the right of my data representing my commitment, and use conditional formatting to highlight the hours of the day where I would be over-committing. To me, this is a first step and a visual aid for finding the sweet spot between waste and discounted rates. Once I have a ballpark figure, I can fine-tune it and do the proper measurement of spend and savings to maximize benefits. #spreadsheetlife
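For the spreadsheet-averse, here is roughly the same exercise in Python (pandas). The file name, column names, commitment value and units are all placeholders for whatever your own export looks like:

```python
import pandas as pd

usage = pd.read_csv("hourly_capacity_usage.csv", parse_dates=["timestamp"])
usage["date"] = usage["timestamp"].dt.date
usage["hour"] = usage["timestamp"].dt.hour

# A day-by-hour grid, mirroring the spreadsheet layout described above.
# Missing hours are treated as zero usage.
grid = usage.pivot_table(index="date", columns="hour",
                         values="capacity_units", aggfunc="sum").fillna(0)

COMMITMENT = 500  # the variable field you would tweak by hand (placeholder)

# The pandas stand-in for conditional formatting: flag every hour where
# usage falls below the commitment, i.e. where you'd be over-committed.
overcommitted = grid < COMMITMENT
share = overcommitted.to_numpy().mean()
print(f"Over-committed in {share:.1%} of hours at {COMMITMENT} units")
```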
Watch out for
AWS will not let you pay all upfront for DynamoDB reservations, so you will have to do partial upfront instead. Keep this in mind for cash flow purposes.
Be aware of your deployment configuration. Some of these services have an ‘on demand’ vs ‘provisioned’ option that cannot be easily changed. Discounts are likely to apply only to your ‘provisioned’ usage.
Equally important, and as I said before, make sure you understand which portion of your service you are covering, and what restrictions you have (e.g. AWS only offers discounts for Standard tables).
Silly as it may sound, make sure you understand the units you are comparing against your on-demand pricing. Sometimes (AWS, I’m looking at you) the reserved price is quoted per thousand capacity units, while the on-demand price is per single capacity unit (see the small example after this list).
You may need to purchase separate reservations for each capacity type, e.g. read and write capacity.
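As a tiny, entirely made-up illustration of that unit trap: if the reserved rate is quoted per thousand capacity units and the on-demand rate per single unit, you have to normalize before comparing, or the reservation will look absurdly cheap:

```python
# Invented prices for illustration only - not real provider rates.
on_demand_per_unit_hour = 0.00065    # $ per capacity unit per hour
reserved_per_1000_units_hour = 0.45  # $ per 1,000 capacity units per hour

# Normalize both to $ per unit-hour before comparing.
reserved_per_unit_hour = reserved_per_1000_units_hour / 1000
effective_discount = 1 - reserved_per_unit_hour / on_demand_per_unit_hour
print(f"Effective discount: {effective_discount:.0%}")  # ~31% with these numbers
```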
Pro tips
When doing scenario analysis, aim to understand the real savings, net of the waste from overcommitting at certain hours.
Because the 3-year discount is higher, it may be a good option to tier your purchases. This means committing for the longer term to the safer baseline, and using shorter commitments for the extra reservations that will sit idle for some hours a day (see the sketch after these tips).
Needless to say, talk to your developers and engineers to understand capacity needs and future plans.
Understand your business needs and be aware of ALL pricing options. It may be more convenient to switch to infrequent-access tiers than to reserve capacity.
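And a small sketch of the tiering tip, again over the hypothetical hourly export: commit for 3 years to the level you essentially never drop below, and layer a 1-year commitment on top for the band you hit most hours. The percentiles here are arbitrary starting points, not a rule:

```python
import numpy as np
import pandas as pd

usage = pd.read_csv("hourly_capacity_usage.csv", parse_dates=["timestamp"])
units = usage["capacity_units"].to_numpy()

# Tier 1: 3-year commitment for the "safe" floor (deepest discount).
tier_3yr = np.quantile(units, 0.01)

# Tier 2: 1-year commitment for the band used most hours of the day,
# accepting some wasted hours in exchange for a still-decent discount.
tier_1yr = np.quantile(units, 0.40) - tier_3yr

print(f"3-year tier: {tier_3yr:,.0f} units (lowest risk)")
print(f"1-year tier: {tier_1yr:,.0f} units on top (shorter lock, some waste)")
```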
Hidden level - Capacity reservation with a twist
There are some analytics query services (AWS Athena, Azure Data Lake, GCP BigQuery) that allow you to create short-term commitments. The way each cloud vendor has implemented these discounts is not uniform, but they are definitely worth exploring, both to reduce your costs and to make them more predictable.
If you have ever tried to predict or forecast analytics costs, you may have struggled with their variability. Normally, on-demand pricing models for these kinds of services are based on the total data scanned (measured in TB). Whether you are running ad-hoc reports for clients, or scanning data that varies in size, the reality is that trying to control these costs and prevent them from spiralling can be a real challenge. The solution cloud providers offer is to commit to the compute power needed to run those analytics instead - on a much shorter time frame (hourly or monthly).
So, at this level, we put the focus on workload management and resource allocation. You configure your processing power (Processing Units in AWS, Analytics Units in Azure, Pools in GCP) and assign your workloads to consume from it. Then you know how much you will be spending by the hour for the service.
Having an hourly, predictable rate for your analytics that can be dropped at short notice sounds ideal and low-risk. However, the catch is that now you are managing the resources that power your analytics queries. You will need to thoroughly assess your workloads’ requirements and manage them efficiently to avoid latency or response times that breach your SLAs (especially if they are customer-facing). At the same time, you also need to understand your use cases. If you are only running ad-hoc queries, very sporadically, it is unlikely you will benefit from these commitments. You need a steady flow of analytics to justify having dedicated resources to run them.
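A back-of-the-envelope way to test that last point, with entirely invented prices (check your provider’s price list): compare what you pay on demand for the data you scan against the flat cost of a dedicated pool, and see how much steady usage you need before the pool wins:

```python
# Invented prices for illustration only.
on_demand_per_tb = 5.00  # $ per TB scanned
pool_per_hour = 2.00     # $ per hour for a dedicated capacity pool

monthly_pool_cost = pool_per_hour * 24 * 30           # $1,440/month
break_even_tb = monthly_pool_cost / on_demand_per_tb  # 288 TB/month

print(f"The pool pays off above ~{break_even_tb:,.0f} TB scanned per month")
```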
Pro tips
This may be a good option if you are in a very siloed organization. It may be that, individually, none of the teams has a reason to leverage these kinds of commitments; combined, however, you may be able to create the right workload groups.
Be very wary of the latency you may be introducing to the system. Communicate this well to your stakeholders and make sure that, together, you make the best possible decision.
Monitor your resource allocation frequently to make sure that you are not wasting resources.
Closing thoughts
One of the pillars of FinOps is collaboration. Hopefully, you have noticed that concept coming through in this article. Cloud technology is extremely decentralized, so making a decision that will affect multiple teams across the organization cannot be a one-person job.
There are plenty of opportunities to start saving money quite easily, but timing is key. If you are going to buy commitments, you will want to be running in a stable enough environment. That means that if you know for sure you are wasting capacity and overprovisioning resources (very common after a ‘lift-and-shift’ to the cloud), you may want to address that first, and commit later.
Last, but not least, use the comment section below to let me know your thoughts, corrections if you have any, and requests for future articles. Ultimately, I want to build something that helps people close the gap between finance and technology, so the more feedback I get, the more useful I can (try to) be.

Thank you for reading!
