(Don’t) Let It Burn

On reducing infrastructure costs.

Cloud services have been on the rise for more than a decade now and there is a good reason for that. After all – the value proposition of these services is something which is hard to ignore. In essence they promise you that:

You won’t need to own the infrastructure nor maintain it.
The infrastructure will scale together with your needs (i.e. it is ‘elastic’), so you only pay for what you actually use.
Because of the two above you can now:
1. Focus on what you do best, without having to worry about infrastructure ‘nonsense’
2. Pay less

That’s a great value proposition, right?

Indeed, it sounds awesome and makes a lot of sense. And this is why cloud services – including services which provide you with computing power, storage and communication channels – have become the norm for almost all of the hi-tech companies out there. If you work in a hi-tech company, I bet you have heard these terms countless times:

AWS, GCS, Azure, BigQuery, S3, Kafka, Spark and so forth…

Now, here is the thing:

All (or almost all) of these services charge you per actual usage (processing time, storage size, messages per second, etc.). This essentially means that bad architecture decisions, or just bad code, can very easily generate a spike in costs.
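To make that sensitivity concrete, here is a rough back-of-the-envelope sketch. The price per million requests, the number of user actions and the 200-calls-per-action figure are all made-up assumptions for illustration, not any provider's real pricing.

```python
# Hypothetical numbers for illustration only; not real provider pricing.
PRICE_PER_MILLION_REQUESTS = 0.40  # dollars, assumed


def monthly_cost(requests_per_user_action: int, user_actions_per_month: int) -> float:
    """Return the monthly bill for a service that charges per request."""
    total_requests = requests_per_user_action * user_actions_per_month
    return total_requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS


actions = 50_000_000  # assumed number of user actions per month

# A careful implementation batches its work into one call per user action.
efficient = monthly_cost(1, actions)
# A careless one issues a separate call per item, say 200 calls per action.
wasteful = monthly_cost(200, actions)

print(f"efficient: ${efficient:,.2f}/month")
print(f"wasteful:  ${wasteful:,.2f}/month")
```

Same feature, same users, and the bill differs by the same factor as the number of calls. On a fixed-cost server that difference would mostly show up as load, not as an invoice.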

Ok… but what’s new in this statement? It has always been the case.

No… it’s not.

It may sound trivial if you ‘grew up’ in this reality, but it doesn’t necessarily have to be this way. For example – in the past, when a company maintained its own servers, you could abuse their processing power as much as you liked, and it would cost you about the same (give or take the electricity bill).

I’m just noting this point now and we’ll get to it later.

Hence (going back to the business model of these services) – you need to understand that your company’s costs are very sensitive to bad programming/coding.

I will also add that since their business model is based on usage, it is not in their best interest to encourage reduced usage. They most likely won’t give you easy ways to be alerted on spikes, and they won’t encourage you (or give you the means) to put countermeasures in place when usage doesn’t make sense business-wise. Bad programming is actually good for their business. [Just to clarify: I’m not hinting that anything is being done on purpose. There is simply no business incentive on their side to invest in this.]

The state of software engineering

Another thing that is happening in the market is that it’s becoming very hard to find good software engineers.

Why?

I’m observing two main reasons for that:

The trivial one – the market is very hot now that money is cheap and capital is much more available than it was 10 years ago. Tons of new startups are emerging and they all need working hands. What’s happening now is an order of magnitude worse than what happened in the dot-com bubble (in the early 2000s).
The rise of cloud services and the micro-services architecture are slowly transforming true software engineering into a lost art. Today, much of the written code is about integrating these services so they can talk with one another, resulting in relatively ‘dumb programming’. Algorithms, proper modeling and distributed programming are ‘beasts’ which are becoming less and less common.

What I’m observing and hearing (from friends, co-workers, people I mentor and my own company) is that due to the lack of quality engineers in the market, companies often have no choice but to settle for lower-quality engineers as their main workforce. The luckiest have one or two talented individuals in their teams.

Now, what happens when you mix an architecture which is sensitive to bad coding with bad coders?

Yep – no surprises here – a huge and unnecessary spike in infrastructure costs.

And that’s on top of the usual – more bugs, longer development cycles and non-maintainable code – which results in additional deferred costs as well.

Those extra costs are non-negligible. They shorten your company’s runway by a significant margin. In fact – here is my prediction –

I predict many more startups are going to close their doors due to software engineering issues than ever before. If the main reasons for closing companies in the past were a mismatch between the product and the customer needs, lack of funding, inflection points in the market and too many execution mistakes by the founders – then this era will introduce a whole new reason for companies shutting down:

Low-quality software which results in being too late to the market, or never making it to the market altogether.

Now, in this post I’m not going to focus on the delays caused by bad coding. I would like to focus on how you can help with reducing infrastructure costs.

Eh? Sorry… me? I’m not a developer. I didn’t cause this mess.

True. You didn’t cause it. You are not accountable for that.

But still – you want your company to be successful. Don’t you?

Assuming you do – there are some things you can do if you are a product manager or an entrepreneur.

Minimizing costs as a product manager

You can have an impact whenever the following kinds of features are in the works:

Features which involve intensive processing (CPU power)
Features which involve intensive data transfers or DB activities
Features which involve frequently handling a large volume of HTTP requests

Here is how you can help:

1) Join the engineering design reviews of your features

You might raise an eyebrow now. You might tell me: “Hey, Nati – you emphasized more than once that we must not get involved with the ‘how’ (how the feature is going to be implemented). So – what’s going on?”

You’re 100% right.

But I’m not asking you to get directly involved. I’m asking you to minimize the risks of spiking infrastructure costs by providing some product related feedback.

Being more specific:

Prepare in advance and bring to the meeting all the data & predictions regarding the expected loads, as far as you can tell. E.g. – how many requests or how many transactions are expected per minute/hour, what should be the expected batch size and so forth. Provide this data whenever required, or ‘push’ it proactively if you believe the engineers are working under false assumptions.
Provide insights on the characteristics of the expected load. For example – ‘90% of the URLs received are going to be irrelevant to us, so better to filter them out as the first step in the pipeline’. The engineers may miss that without you.
Just sit there quietly (feel free to work during the meeting) and answer their questions when they arise. Yeah – they should have asked these questions during the spec review, but in reality some things only come up at a later stage.
If you have an engineering background and you spot an engineering mistake – call it out. Yes, it’s their right to ignore you or disagree with you if they wish, but there is a good chance they won’t. You may have just saved your company a huge amount of money, or just a couple of sprints.

By doing the above – you’re not getting involved directly with their design, but rather providing invaluable input that can save a lot of development time, or just resources – which will result in reduced costs.

Only join design meetings for the features belonging to the group above, and only if you suspect the developers may get it all wrong. If you have a competent team and you don’t foresee much risk – skip these meetings altogether and let them do their job.
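To illustrate the kind of insight mentioned above (most incoming URLs being irrelevant), here is a minimal sketch of a filter-first pipeline. The relevance rule, the function names and the example URLs are all hypothetical; the point is only that the cheap check runs before the expensive step.

```python
from typing import Callable, Iterable, Iterator


def is_relevant(url: str) -> bool:
    """Cheap relevance check. The rule used here is a made-up example."""
    return "/articles/" in url


def expensive_processing(url: str) -> str:
    """Stand-in for the costly step (fetching, parsing, scoring, ...)."""
    return url.upper()


def pipeline(urls: Iterable[str], process: Callable[[str], str]) -> Iterator[str]:
    """Filter first, so the costly step only ever sees relevant URLs."""
    for url in urls:
        if is_relevant(url):
            yield process(url)


# Assumed input: only 1 URL in 10 is relevant, matching the 90% example.
urls = [
    f"https://example.com/articles/{i}" if i % 10 == 0
    else f"https://example.com/ads/{i}"
    for i in range(1000)
]

results = list(pipeline(urls, expensive_processing))
print(f"{len(urls)} URLs received, {len(results)} reached the expensive step")
```

If the expensive step is billed per call, knowing the 90% figure up front is the difference between paying for 1,000 calls and paying for 100.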

2) Make sure your specs (PRDs) address the required performance.

Put KPIs around that. For example: “The overall infrastructure costs must not increase by more than 0.5% following the introduction of this feature”.

Put warning signs around sensitive requirements where you have concerns that they may be implemented in a very inefficient way. For example: [after describing an operation] “This operation must be executed in less than 1 second”.
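A cost KPI like the 0.5% example above is easy to check mechanically once you have the monthly figures. The sketch below assumes you can get a baseline cost and a post-launch cost from your billing reports; the function names and the dollar amounts are hypothetical.

```python
def cost_increase_pct(baseline: float, current: float) -> float:
    """Percentage change in infrastructure cost relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline cost must be positive")
    return (current - baseline) / baseline * 100


def within_cost_kpi(baseline: float, current: float, max_increase_pct: float = 0.5) -> bool:
    """True if the cost increase stays within the limit stated in the spec."""
    return cost_increase_pct(baseline, current) <= max_increase_pct


# Assumed monthly costs in dollars, before and after the feature launch.
print(within_cost_kpi(40_000, 40_150))  # True: the increase is 0.375%
print(within_cost_kpi(40_000, 40_400))  # False: the increase is 1%
```

In practice the comparison should account for normal business growth over the same period, otherwise a good month for the product will look like a KPI violation.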

Minimizing costs as an entrepreneur

As an entrepreneur, your impact on cost savings is mainly in the early days of the company. If you neglect it early on, it will be much harder to change later down the road.

Here is how you can help even if you’re not the CTO:

Hiring – it goes without saying that you must always push towards hiring A+ engineers. If you come across one, and you don’t have the budget – then let someone else go. These people are so rare that you must grab them at all costs. Seriously. If you are not the hiring manager then communicate to the hiring manager that you are willing to make big sacrifices in order to get the top talent. And never agree to a hiring plan that doesn’t include an actual test of skills because some people can tell very nice, but untrue, stories about themselves.
Tech stack – betting on the right tech stack has a HUGE impact on your company going forward. So make sure it’s not a bet but rather a wise decision. Naturally, this is for the CTO to decide. If you are the CTO – then be professional. Never bet on a bleeding-edge technology that just came out unless you really have no other choice. Pick libraries and languages which have a huge amount of support, and open source libraries that have been tested by thousands of others. Don’t be a pioneer here. I know it’s tempting – but don’t go there. It’s your company and the stakes are already high enough by merely being a startup. Don’t add more unknowns, sexy as they may sound. If you are not the CTO – then make sure the above is considered by the CTO. You are one of the founders – so your word should mean something. You must also remind the CTO of the various business considerations that may affect the tech stack. For example: “Recall that by the end of the first year we’re expected to support 1M transactions per hour. Each transaction goes through a 5-stage pipeline and needs to complete processing within 3 seconds”.
Understand the cost structure and watch the infrastructure costs over time (or assign someone to do it for you). Make sure the costs are only increasing if the business is growing as well, and warn the CTO if this is not the case.
Think outside the box. Going back to the point I raised above – cloud services are very useful for many scenarios, but in your specific case they might not be the way to go. It doesn’t mean you need to maintain your own farm of servers, but maybe leasing some powerful dedicated servers with fast SSD drives might be all you need. When I was at Newsfusion we managed to carry a huge amount of algorithmic processing + serving hundreds of thousands of users using 4 dedicated servers which cost us $2K/mo in total. Today I see a similar amount of work being done by other companies using cloud services and it costs them about 10X that… something to consider…
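Watching whether costs grow only when the business grows, as suggested in the list above, can be as simple as comparing two growth rates month over month. The sketch below is one possible way to do it; the choice of transactions as the business metric, the 10-point tolerance and the figures are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Month:
    name: str
    infra_cost: float   # dollars spent on infrastructure
    transactions: int   # assumed business-volume metric; pick your own


def flag_cost_outpacing_growth(months: List[Month], tolerance_pct: float = 10.0) -> List[str]:
    """Return the months where cost growth exceeded business growth
    by more than tolerance_pct percentage points."""
    flagged = []
    for prev, cur in zip(months, months[1:]):
        cost_growth = (cur.infra_cost - prev.infra_cost) / prev.infra_cost * 100
        volume_growth = (cur.transactions - prev.transactions) / prev.transactions * 100
        if cost_growth - volume_growth > tolerance_pct:
            flagged.append(cur.name)
    return flagged


# Made-up history for illustration.
history = [
    Month("Jan", 10_000, 1_000_000),
    Month("Feb", 11_000, 1_100_000),  # cost and volume both grow by 10%
    Month("Mar", 15_400, 1_150_000),  # cost grows 40%, volume under 5%
]

print(flag_cost_outpacing_growth(history))
```

A flagged month is not proof of bad code, only a prompt to go and ask the CTO what changed.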

That wraps up the post for today.

If you found this post/series useful – let me know in the comments. If you think others can benefit from it – feel free to share it with them.

Thank you, and until next time 🙂
