CodingBlocks

Site Reliability Engineering – Monitoring Distributed Systems

May 23 '22

We haven’t finished the Site Reliability Engineering book yet as we learn how to monitor our system while the deals at Costco are so good, Allen thinks they’re fake, Joe hasn’t attended a math class in a while, and Michael never had AOL.

The full show notes for this episode are available at https://www.codingblocks.net/episode185.

News

Thank you for the reviews! just_Bri, 1234556677888999900000, Mannc, good beer hunter
- Want to help out the show? Leave us a review!
Great post from @msuriar who was an actual SRE at Google and had great feedback on our episode on toil and lots of other stuff! (suriar.net)
Some other great recommendations from @msuriar:
- Slack’s Outage on January 4th 2021 (slack.engineering)
- Post-Incident Review on the Atlassian April 2022 outage (Atlassian)
Great episode on All The Code featuring Brandon Lyons and his journey to Microsoft. (ListenNotes.com)

Couldn’t resist posting this:

Costco thinks that its fans are going to buy these shirts. Pfft, no way. But if it were on coupon…

Survey Says

Which book should we finish?

Site Reliability Engineering
Designing Data Intensive Applications
Domain Driven Design
Clean Code
Clean Architecture
The Imposter's Handbook
The DevOps Handbook
ANY BOOK. JUST FINISH ONE!
I actually like that you leave some of it for me to read on my own.
Just move on to another book. Ain't nobody got time to go back to those old books.


Monitor Some of the Things

Terminology

Monitoring – Collecting, processing, and aggregating quantitative information about a system.
White-box monitoring – Monitoring based on metrics exposed by a system, such as logs or JVM profiling.
Black-box monitoring – Monitoring a system as a user would see it.
Dashboard – Provides a summary view of the most important service metrics. May display team information, ticket queue size, high priority bugs, current on call engineer, recent pushes, etc.
Alert – Notification intended to be read by a human, such as tickets, email alerts, pages, etc.
Root cause – A defect, that if corrected, creates a high confidence level that the same issue won’t be seen again. There can be multiple root causes for a particular incident (including a lack of testing!)
Node and machine – A single instance of a running kernel.
- Kernel – The core of the operating system. Generally controls everything on the system, always resident in memory, and facilitates interactions between the system hardware and software. (Wikipedia)
- There could be multiple services worth monitoring on the same node that could be either related or unrelated.
Push – Any change to a running service or its configuration.

Why Monitor?

Cover of the "Site Reliability Engineering" book from O'Reilly

The famous “SRE Book” from Google

Some of the main reasons include:
- To analyze trends,
- To compare changes over time,
- To alert when there’s a problem,
- To build dashboards to answer basic questions, and
- To do ad hoc analysis when things change to identify what may have caused the change.
Monitoring lets you know when the system is broken or may be about to break.
- You should never alert just because something seems off.
  - Paging a human is an expensive use of time.
  - Too many pages may be seen as noise and reduce the likelihood of thorough investigation.
  - Effective alerting systems have good signal and very low noise.

Setting Reasonable Expectations for Monitoring

Monitoring complex systems is a major undertaking.
- The book mentions that Google SRE teams with 10-12 members have one or two people focused on building and maintaining their monitoring systems for their service.
  - They’ve reduced the headcount needed for maintaining these systems as they’ve centralized and generalized their monitoring systems, but there’s still at least one human dedicated to the monitoring system.
  - They also ensure that it’s not a requirement that an SRE stare at the screen to identify when a problem comes up.
Google has since moved to simpler and faster monitoring systems that provide better tools for ad hoc analysis and avoid systems that try to determine causality.
- This doesn’t mean they don’t monitor for major changes in common trends.
SREs at Google seldom use tiered rule triggering.
- Why? Because they’re constantly changing their service and/or infrastructure.
- When they do alert on these dependent types of rules, it’s when there’s a common task that’s carried out that is relatively simple.
It is critical that, from the instant a production issue arises, the monitoring system alerts a human quickly and provides an easy-to-follow process that people can use to find the root cause.
Alerts need to be simple to understand and represent the failure clearly.

Symptoms vs Causes

A monitoring system should answer these two questions:
- What is broken? This is the symptom.
- Why is it broken? This is the cause.
The book says that drawing the line between the what and why is one of the most important ways to make a good monitoring system with high quality signals and low noise.
An example might be:
- Symptom: The web server is returning 500s or 404s,
- Cause: The database server ran out of hard-drive space.

Black-Box vs White-Box

Google SREs use white-box monitoring heavily and rely on black-box monitoring much less, except for critical uses.
- White-box monitoring relies on inspecting the internals of a system.
- Black-box monitoring is symptom oriented and helps identify unplanned issues.
An interesting takeaway is that white-box monitoring exposes issues that may otherwise be hidden by things like retries.
A symptom for one team can be a cause for another.
White-box monitoring is crucial for telemetry.
- Example: The website thinks the database is slow, but does the database itself think it is slow? If not, there may be a network issue.
For alerting, the benefit of black-box monitoring is that it indicates a problem that is happening right now. The drawback is that it is basically useless for warning you that a problem may be about to happen.
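
To make the point about retries concrete, here is a minimal Python sketch. It is not from the book or the episode: the Backend and Service classes, the retry limit, and the status codes are all hypothetical, chosen only to show how a black-box probe can see a healthy response while a white-box counter inside the service shows the dependency is struggling.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical flaky dependency: fails the first `failures` calls, then succeeds."""
    failures: int
    calls: int = 0

    def query(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("backend unavailable")
        return "ok"


@dataclass
class Service:
    """Hypothetical service that retries its backend and counts the retries."""
    backend: Backend
    max_attempts: int = 3
    retry_count: int = 0  # white-box metric: only visible from inside the service

    def handle_request(self) -> int:
        """Return an HTTP-style status code, which is all a black-box probe sees."""
        for attempt in range(1, self.max_attempts + 1):
            try:
                self.backend.query()
                return 200
            except ConnectionError:
                # Only count a retry if another attempt will actually be made.
                if attempt < self.max_attempts:
                    self.retry_count += 1
        return 500


service = Service(Backend(failures=2))
status = service.handle_request()  # black-box view: just the response
retries = service.retry_count      # white-box view: what happened inside
print(f"black-box status={status}, white-box retries={retries}")
```

In this toy run the probe sees a 200 even though the backend failed twice. Only the internal retry counter reveals the trouble, which is the kind of early warning black-box monitoring cannot give.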

Four Golden Signals

Latency – The time it takes to service a request.
- Important to separate successful request latency vs failed request latency.
- A slow error is worse than a fast error!
Traffic – How much demand is being placed on your system, such as requests per second for a web service or I/O throughput for streaming audio/video.
Errors – The rate of requests that fail, either explicitly or implicitly.
- Explicit errors are things like a 500 HTTP response.
- Implicit might be any request that took over 2 seconds to finish if your goal is to respond in less than 2 seconds.
Saturation – How full your service is.
- A measure of resources that are the most constrained, such as CPU or I/O, but note that things usually start to degrade before 100% utilization.
  - This is why having a utilization target is important.
- Latency increases are often indicators of saturation.
- Measuring the 99th percentile response time over a small interval can be an early signal of saturation.
- Saturation is also concerned with predicting imminent issues, like filling up drive space.
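
As a rough illustration of the signals above, the following Python sketch summarizes one window of request records. Everything in it is an assumption made for the example: the Request record, the golden_signals helper, treating status codes of 500 and above as explicit errors, and the 2-second latency goal taken from the implicit-error example. Saturation is left out because it needs resource data (CPU, I/O, disk) rather than request logs, and the window is assumed to be non-empty.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int       # HTTP status code
    latency_s: float  # time taken to serve the request, in seconds


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_s, latency_goal_s=2.0):
    """Summarize latency, traffic, and errors for one non-empty window of requests."""
    explicit = [r for r in requests if r.status >= 500]  # explicit errors
    ok = [r for r in requests if r.status < 500]
    # Implicit errors: the request "succeeded" but missed the latency goal.
    implicit = [r for r in ok if r.latency_s > latency_goal_s]
    return {
        "traffic_rps": len(requests) / window_s,
        "error_rate": (len(explicit) + len(implicit)) / len(requests),
        # Latency is reported separately for successful and failed requests.
        "p99_success_latency_s": percentile([r.latency_s for r in ok], 99),
        "p99_failed_latency_s": (
            percentile([r.latency_s for r in explicit], 99) if explicit else None
        ),
    }


window = (
    [Request(200, 0.1)] * 6    # fast successes
    + [Request(200, 2.5)]      # slow success, counted as an implicit error
    + [Request(500, 0.05)] * 2 # fast errors
    + [Request(500, 3.0)]      # slow error
)
signals = golden_signals(window, window_s=10)
print(signals)
```

Keeping the failed-request latency separate makes the slow error stand out. If it were mixed in with the successes, the fast errors would drag the overall numbers down and hide it.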

Resources we Like

Links to Google’s free books on Site Reliability Engineering (sre.google)
Slack’s Outage on January 4th 2021 (slack.engineering)
Post-Incident Review on the Atlassian April 2022 outage (Atlassian)
Great episode on All The Code featuring Brandon Lyons and his journey to Microsoft. (ListenNotes.com)

Tip of the Week

Prometheus has configurations that let you tune how often it looks for metrics, such as the scrape_interval. Scrape too often and you’re wasting resources; not often enough and you can miss important information and get false alerts. (Prometheus)
There’s a reason WordPress is so popular. It’s fast and easy to set up, especially if you use Webinonly. (Webinonly.com)
Looking for great encryption libraries for Java or PHP? Check out Bouncy Castle! (Bouncy Castle)
Big thanks to @bicylerepairmain for the tip on running lines of code in VS Code with a keyboard shortcut. The option workbench.action.terminal.runSelectedText is under File -> Preferences -> Keyboard Shortcuts. (Stack Overflow)
Need to see all of the files you’ve changed since you branched off of a commit? Use git diff --name-only COMMIT_ID_SHA HEAD. (git-scm.com)
- Couple with Allen’s tip from episode 182 to make it easier to find that starting point!
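
Going back to the scrape_interval tip above, here is a toy Python model of why the interval matters. It is not Prometheus itself: the cpu_percent gauge, the 20-second spike, and the two intervals are made up for illustration, and real scrape timing is less tidy than this.

```python
def cpu_percent(t):
    """Hypothetical gauge: 10% baseline with a 20-second spike to 100%."""
    return 100 if 65 <= t < 85 else 10


def scrape(gauge, interval_s, duration_s):
    """Sample the gauge every interval_s seconds, as a polling scraper would."""
    return [gauge(t) for t in range(0, duration_s, interval_s)]


fast = scrape(cpu_percent, interval_s=15, duration_s=300)
slow = scrape(cpu_percent, interval_s=60, duration_s=300)
print(f"15s interval: {len(fast)} samples, max={max(fast)}")
print(f"60s interval: {len(slow)} samples, max={max(slow)}")
```

In this model the 15-second interval catches the spike at the cost of four times as many samples, while the 60-second interval never sees it. That is the trade-off the tip describes.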

