Oh… We Need an Alert for That…

Any product that finds its way to production needs some level of monitoring. After all, it’s serving users and customers, and you’d like to know as soon as something stops working, ideally before the customer finds out.

At first – you may raise an eyebrow. Why should you care? The engineering team needs to place their own monitoring system and make sure everything is always up, right?

On some level, you are correct. But what many product managers miss is that there are two layers of monitoring & alerts – infrastructure and business ones. Those two are different, and one of them is totally owned by you, whether you like it or not.

Let’s go over them briefly.

Infrastructure monitoring & alerts

This is the layer which you are traditionally less involved with. By ‘infrastructure’ I refer to all services and processes which are running constantly and making sure the product is available and working without any critical issues.

You can divide its scope of responsibility into two:

Hardware and operating system issues. For example, raising alerts when a process or a server is down, when a machine is out of memory, when there is no more disk space, when the backup disk is full, etc.
Bugs, meaning failures to deliver on a designed behavior. Examples: a scheduled report to a customer that didn’t trigger, a failure to process a message queue, request parameters that are not handled properly, concurrency issues and more. Not all bugs are born equal, of course, and not all of them need to be treated as production issues. This decision actually does fall in your court, but we’ll get to it in a second.

Business alerts & monitoring

This layer is often overlooked or not well understood. Most of the people I have met believe that if they have proper monitoring in place for what I defined as the ‘infrastructure’ layer, then they are safe. I’d argue that they are wrong and that it’s not enough. Why?

There is a (somewhat known) saying: “The operation was successful, but the patient died”. In our context it means that all the engineering components are working as designed, but the overall business metrics and KPIs are nevertheless going the wrong way.

Examples:

The revenue generated by your platform for customer X went down by more than 20% in the last 3 days.
User engagement with your app dropped significantly during the last week.
No transactions were made for one of your product’s categories in the last 24 hours.
The overall complaint rate for your product has crossed a red line.

At the end of the day, we want the ‘patient to live’, and everything else is just a means to achieve that. Sometimes, though, people confuse the means with the desired results. They will claim: “What do you want from my team? Everything is working as you wanted”. From their perspective, they are right. You need to be the responsible adult and make it clear to everyone that although all the systems are working fine, something is broken on the business level and needs to be addressed ASAP.

This is where the business monitoring and alerts come in.

How could it be that there are no bugs and yet the KPIs are down?

Oh… plenty of reasons. Here are the main ones:

One or more of the business assumptions were wrong and the users are not responding to the product as expected.
A bad configuration for one of the customers.
Your code integrates with third-party software that doesn’t behave as expected (a bug, API changes, etc.).
Something fundamental in the ecosystem has changed (for example, the IDFA changes that Apple announced in 2020).

Wait, I’ve built plenty of dashboards that track the KPIs. Isn’t it enough?

Unless you can afford a daily routine that goes over all the dashboards, drills down into each of your customers and checks their vitals, the answer is ‘no’.

For most of the people I know and for most products I’m aware of – this is not feasible. I mean – you might be able to pull it off for a day or two – but making it a routine is a no-go since it’s too time-consuming.

I will raise an exception, though. For some of the B2C products – it may be possible to skim through a well designed business dashboard and get the ‘health’ of the business in just a few minutes. This is because all of your users are probably falling in the same ‘group’ so it’s either ‘everything is cool’ or ‘everything is terrible’.

But for the rest of us out there – we’ll need to deploy some sort of business monitoring & alerting system.

Is this a priority? What’s the risk of not having such a system?

For various reasons, some of which are objective and some of which are just a result of ignorance, the task of designing and building a monitoring system usually doesn’t receive the priority it deserves and may very well be absent from the roadmap.

The end result is that your customers are becoming your ‘business QA’.

Did it ever happen to you that a customer reached out to you (or worse, to your CEO) and asked why the product stopped working, or was only half-functioning for them, while you were totally clueless about it?

If it happened to you – then you know how it looks and how it feels: you look like an amateur and it feels quite embarrassing.

I am not saying that such a monitoring layer needs to be your first priority. Nope.

Your first priority needs to be getting to product-market fit, and then focusing on growth. However, in parallel, when your product starts to get traction with enough users or meaningful customers, it is time to start prioritizing such a system. You should consider this an effort to buy trust from your users, as the end result is that the product looks much more reliable (well… at least in their eyes) because you spot the issues and fix them before the users notice them.

However, if you keep postponing this, you do risk your product’s reputation. The longer you postpone, the longer you stay blind to negative business trends, and the more you risk customer churn.

There is also a personal risk involved. The more complaints the business guys receive from customers about negative trends that should have been discovered by your team, the less professional you and your team look. You know that a bad reputation is hard to recover from, so consider this as well.

Ok. I’m convinced. How do I design such a system?

You start by looking at the various KPIs that matter (not only the north star) and deciding, for each one, how large a negative trend has to be before you want to be alerted about it.

For example:

For many companies, gross revenue is an important KPI to track. If your product directly correlates to the generation of revenue, then you’d certainly like to be alerted when there is a negative trend for more than 3 days.

Another example: let’s say your area of responsibility is the reporting dashboards for the customer. Clearly your product doesn’t correlate directly with the generation of revenue. So what type of trends would you like to be aware of in such a product?

I’d assume that one of the KPIs that matter to you is the usage of various reporting tables. If the usage of a particular screen is trending down significantly for a period of time – then you’d like to be alerted about this. Not so?

Hence, it all starts with the business KPIs that matter to you. Map them.

Then – as for any monitoring system – there are a few parameters for each alert that you’ll need to define:

The time window to check (e.g. – the last 3 days)
The matching period to compare to (e.g. – the same days a week before)
The threshold (e.g. – a drop of more than 20% / 30% / whatever)
Scope (e.g. – one customer / a segment of customers / all)

For example – you can define that the monitoring system needs to trigger an alert in the following case:

If the revenues for customer X are down by more than 20% for 3 days in a row compared to the same days a week before – then an alert needs to be triggered.
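
A rule like this can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `AlertRule` fields and the `should_alert` function are hypothetical names, and the daily KPI values are assumed to be available as a simple date-to-value mapping. Your real data model will dictate the actual shape.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class AlertRule:
    """The four alert parameters (all field names are hypothetical)."""
    window_days: int          # time window to check, e.g. the last 3 days
    compare_offset_days: int  # matching period, e.g. 7 = same days a week before
    drop_threshold: float     # e.g. 0.20 = down by more than 20%
    scope: str                # e.g. "customer:X" or "all"


def should_alert(rule: AlertRule, daily_values: dict[date, float], today: date) -> bool:
    """Return True only if every day in the window dropped by more than the
    threshold compared to its matching day in the comparison period."""
    for days_back in range(1, rule.window_days + 1):
        day = today - timedelta(days=days_back)
        baseline_day = day - timedelta(days=rule.compare_offset_days)
        current = daily_values.get(day)
        baseline = daily_values.get(baseline_day)
        # Assumption: missing data or a zero baseline is not treated as a drop here.
        if current is None or baseline is None or baseline <= 0:
            return False
        drop = (baseline - current) / baseline
        if drop <= rule.drop_threshold:
            return False
    return True


# Illustrative numbers only: revenue of 1000/day a week ago, 700/day in the last 3 days.
today = date(2024, 3, 11)
revenue = {today - timedelta(days=d): 1000.0 for d in range(8, 11)}
revenue.update({today - timedelta(days=d): 700.0 for d in range(1, 4)})
rule = AlertRule(window_days=3, compare_offset_days=7, drop_threshold=0.20, scope="customer:X")
print(should_alert(rule, revenue, today))  # True: a 30% drop on each of the 3 days
```

Note the design choice of requiring every day in the window to breach the threshold; that is what ‘3 days in a row’ means here, and it keeps a single bad day from triggering the alert.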

In the same breath – you can define an additional alert for the overall revenues of all customers (a bigger scope). Both of them are useful and can reside side by side.

If you decide to skip the alerts for individual customers, you may miss issues, such as a bad configuration for one of your customers, that don’t apply to the rest.

Therefore, I always recommend adding both: alerts that cover the overall trends and alerts for individual customers. The additional effort is usually negligible.
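
To see why both scopes matter, here is a minimal Python sketch. The function names, the `"ALL"` label and the numbers are made up for illustration; it assumes you already have the KPI totals per customer for the current window and for the comparison period.

```python
def drop_ratio(current: float, baseline: float) -> float:
    """Fractional drop versus the baseline (0.25 means down by 25%)."""
    return (baseline - current) / baseline if baseline > 0 else 0.0


def scopes_to_alert(current_by_customer: dict[str, float],
                    baseline_by_customer: dict[str, float],
                    threshold: float = 0.20) -> list[str]:
    """Check each customer, then the overall total; return every scope in breach."""
    breached = []
    for customer, baseline in baseline_by_customer.items():
        current = current_by_customer.get(customer, 0.0)
        if drop_ratio(current, baseline) > threshold:
            breached.append(customer)
    total_current = sum(current_by_customer.values())
    total_baseline = sum(baseline_by_customer.values())
    if drop_ratio(total_current, total_baseline) > threshold:
        breached.append("ALL")
    return breached


# Illustrative numbers: a small customer collapses while the total barely moves.
baseline = {"A": 1000.0, "B": 100.0}
current = {"A": 990.0, "B": 40.0}
print(scopes_to_alert(current, baseline))  # ['B']
```

Customer B is down 60%, but the overall total is down only about 6%, so an overall-only alert would stay silent.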

Aside from the KPIs, you can define business alerts for any other use cases which may risk:

Customer churn
Significant loss of revenue

What are the pitfalls I should watch for?

The main pitfalls, as you’ll discover yourself, are the same pitfalls that characterize any alerting system. Here they are:

Having too many alerts, to the point that they become meaningless and nobody watches them anymore
Having a wrong threshold that causes either too many false alarms or missed important events

As for the first: my recommendation is to start with the ‘must haves’ and grow organically from there. By ‘organically’ I mean waiting for a real-life event to occur and then adding it to the system, so that similar future events will be captured ahead of time.

As for the second: you need to understand that this is going to happen anyway for the first couple of weeks after releasing the alert, no matter how much research you put into defining the thresholds. This is why I recommend not spending too much time researching the best setting for a given threshold. Just set a number that sounds reasonable, and make sure it’s easy to modify based on reality. Once the alert is live, if you get too many false alarms, increase the threshold. If you get too few alerts and miss real events, reduce it. Now, ‘getting too few’ may take some time to discover, so it’s better to start with a lower threshold and increase it in response to false alarms.
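
If you do have some history at hand, a quick sanity check is to count how many alerts each candidate threshold would have fired. The sketch below is a rough illustration, not a recommended method: `count_alerts` is a hypothetical helper and the list of daily drops is invented.

```python
def count_alerts(daily_drops: list[float], threshold: float) -> int:
    """Count the days on which the observed drop exceeded the threshold."""
    return sum(1 for drop in daily_drops if drop > threshold)


# Invented history: the fractional drop of a KPI versus the same day a week before.
history = [0.02, 0.05, 0.31, 0.12, 0.08, 0.22, 0.01, 0.27, 0.04, 0.15]

for threshold in (0.10, 0.20, 0.30):
    print(f"threshold {threshold:.0%}: {count_alerts(history, threshold)} alerts")
```

On this made-up history the 10% threshold fires 5 times, 20% fires 3 times and 30% fires once, which shows the trade-off between false alarms and missed events in miniature.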

Are there any ready-made tools so the engineering team won’t need to spend their time on this?

Could be. But as of the time of writing, I’m not aware of any such tools. There are plenty of monitoring platforms out there, but none that are tuned towards monitoring business KPIs, as far as I can tell (if you know of one, let me know in the comments).

I believe the reason for that is that such a monitoring framework is highly dependent on the specific business KPIs and on the small implementation details of how the data is structured and collected.

It means that you’ll need to go through the whole feature delivery process (you can read about it here). Gather the requirements (the internal stakeholders are the business guys and you, the product team), write a spec, do a spec review and hand it over to engineering for implementation.

I recommend Slack or any other internal messaging platform you use for reporting the alerts. For example, a dedicated biz-alerts channel in Slack.
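
As a rough idea of what the reporting side could look like, here is a minimal Python sketch that posts an alert through a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names and the message wording are assumptions for illustration, and the webhook URL is something you would create for your own channel.

```python
import json
import urllib.request


def build_alert_message(kpi: str, scope: str, drop: float, window_days: int) -> dict:
    """Build the JSON payload for the alert (the wording is just an example)."""
    text = f":warning: {kpi} for {scope} is down {drop:.0%} over the last {window_days} days"
    return {"text": text}


def post_to_slack(webhook_url: str, message: dict) -> None:
    """POST the payload to a Slack incoming webhook URL."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises an HTTPError if Slack rejects the request.
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()


message = build_alert_message("Revenue", "customer X", 0.23, 3)
print(message["text"])
# To actually send it, call post_to_slack with your own webhook URL:
# post_to_slack(your_webhook_url, message)
```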

One thing to understand is that this is a ‘breathing system’ of its own. It will change and grow as time goes by and more business cases become known. Therefore, it should be designed as such.

The product manager role in the infrastructure alerts

Before we wrap it up – I want to address what you should care about when it comes to the infrastructure alerts.

As I see it, the general instructions to the engineering team should be something in this spirit:

“Make sure that:

The hardware is online and accessible
The servers have enough resources and are not near exhaustion (utilization below 70% on average, with peaks below 90%)
The system processes and services are running and accessible. If a process dies for whatever reason – it needs to recover automatically
The data is backed up and the backups are tested periodically
The costs of the infrastructure make sense and there are no weird spikes or bad trends

”
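
The resource guideline above (below 70% on average, peaks below 90%) translates into a very small check. This is a sketch with hypothetical names; it assumes utilization samples are collected as fractions between 0 and 1, and in practice your engineering team’s monitoring platform would do this for you.

```python
from statistics import mean


def resource_is_healthy(utilization_samples: list[float],
                        max_average: float = 0.70,
                        max_peak: float = 0.90) -> bool:
    """True if average utilization is below max_average and no sample reaches max_peak."""
    if not utilization_samples:
        # Assumption: no data is treated as unhealthy, since we cannot tell.
        return False
    return mean(utilization_samples) < max_average and max(utilization_samples) < max_peak


print(resource_is_healthy([0.50, 0.60, 0.85]))  # True: average 65%, peak 85%
print(resource_is_healthy([0.50, 0.60, 0.95]))  # False: the peak is above 90%
```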

I probably missed a bullet or two, but I’m sure you can fill in what’s missing. These instructions are general enough to fit most companies.

Your main responsibility here, though, is regarding the second part: the bugs for which we want to be alerted. This is trickier.

The overall approach here is to aim mainly for production issues: bugs that are severe enough to cause loss of revenue or customer churn.

This is also an ongoing effort as alerts will be added by the engineering team as time goes by.

That wraps up the post for today.

If you found this post/series useful – let me know in the comments. If you think others can benefit from it – feel free to share it with them.

Thank you, and until next time 🙂
