

We haven’t finished the Site Reliability Engineering book yet as we learn how to monitor our system while the deals at Costco are so good, Allen thinks they’re fake, Joe hasn’t attended a math class in a while, and Michael never had AOL.
The full show notes for this episode are available at https://www.codingblocks.net/episode185.
Sponsors
- Retool – Stop wrestling with UI libraries, hacking together data sources, and figuring out access controls, and instead start shipping apps that move your business forward.
- Shortcut – Project management has never been easier. Check out how Shortcut is project management without all the management.
News
- Thank you for the reviews! just_Bri, 1234556677888999900000, Mannc, good beer hunter
- Want to help out the show? Leave us a review!
- Great post from @msuriar who was an actual SRE at Google and had great feedback on our episode on toil and lots of other stuff! (suriar.net)
- Some other great recommendations from @msuriar:
- Slack’s Outage on January 4th 2021 (slack.engineering)
- Post-Incident Review on the Atlassian April 2022 outage (Atlassian)
- Great episode on All The Code featuring Brandon Lyons and his journey to Microsoft. (ListenNotes.com)
Couldn’t resist posting this:
Survey Says
Which book should we finish?
- Site Reliability Engineering
- Designing Data Intensive Applications
- Domain Driven Design
- Clean Code
- Clean Architecture
- The Imposter's Handbook
- The DevOps Handbook
- ANY BOOK. JUST FINISH ONE!
- I actually like that you leave some of it for me to read on my own.
- Just move on to another book. Ain't nobody got time to go back to those old books.
Monitor Some of the Things
Terminology
- Monitoring – Collecting, processing, and aggregating quantitative information about a system.
- White-box monitoring – Monitoring based on metrics exposed by a system, such as logs or JVM profiling.
- Black-box monitoring – Monitoring a system as a user would see it.
- Dashboard – Provides a summary view of the most important service metrics. May display team information, ticket queue size, high priority bugs, current on call engineer, recent pushes, etc.
- Alert – Notification intended to be read by a human, such as tickets, email alerts, pages, etc.
- Root cause – A defect that, if corrected, creates a high confidence level that the same issue won’t be seen again. There can be multiple root causes for a particular incident (including a lack of testing!).
- Node and machine – A single instance of a running kernel.
- Kernel – The core of the operating system. Generally controls everything on the system, always resident in memory, and facilitates interactions between the system hardware and software. (Wikipedia)
- There could be multiple services worth monitoring on the same node that could be either related or unrelated.
- Push – Any change to a running service or its configuration.
Why Monitor?
- Some of the main reasons include:
- To analyze trends.
- To compare changes over time.
- To alert when there’s a problem.
- To build dashboards to answer basic questions.
- To do ad hoc analysis when things change, to identify what may have caused the change.
- Monitoring lets you know when the system is broken or may be about to break.
- You should never alert just because something seems off.
- Paging a human is an expensive use of time.
- Too many pages may be seen as noise and reduce the likelihood of thorough investigation.
- Effective alerting systems have good signal and very low noise.
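One way to picture "good signal and very low noise" is to page only when a problem persists rather than on a single blip. The sketch below is an assumption for illustration, not something the book or the episode prescribes: the `should_page` name, the 5% threshold, and the three-window rule are all hypothetical.

```python
def should_page(error_ratios, threshold=0.05, sustained_windows=3):
    """Return True only if the most recent windows ALL exceed the threshold.

    error_ratios: error ratio per evaluation window, oldest first.
    threshold and sustained_windows are hypothetical values for this sketch.
    """
    if len(error_ratios) < sustained_windows:
        return False
    recent = error_ratios[-sustained_windows:]
    return all(ratio > threshold for ratio in recent)


# A single spike that recovers on its own should not page anyone.
blip = [0.01, 0.20, 0.01, 0.02]
# A sustained elevation is worth a human's time.
outage = [0.01, 0.10, 0.12, 0.15]

print("page on blip:", should_page(blip))
print("page on outage:", should_page(outage))
```

The trade-off is that waiting for several windows delays the page, so the window count has to be balanced against how quickly a human needs to know.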
Setting Reasonable Expectations for Monitoring
- Monitoring complex systems is a major undertaking.
- The book mentions that Google SRE teams with 10-12 members have one or two people focused on building and maintaining their monitoring systems for their service.
- They’ve reduced the headcount needed for maintaining these systems as they’ve centralized and generalized their monitoring systems, but there’s still at least one human dedicated to the monitoring system.
- They also ensure that it’s not a requirement that an SRE stare at the screen to identify when a problem comes up.
- Google has since moved to simpler and faster monitoring systems that provide better tools for ad hoc analysis, and avoids systems that try to determine causality.
- This doesn’t mean they don’t monitor for major changes in common trends.
- SREs at Google seldom use tiered rule triggering.
- Why? Because they’re constantly changing their service and/or infrastructure.
- When they do alert on these dependent types of rules, it’s when there’s a common task that’s carried out that is relatively simple.
- It is critical that, from the instant a production issue arises, the monitoring system alerts a human quickly and provides an easy-to-follow process that people can use to find the root cause quickly.
- Alerts need to be simple to understand and represent the failure clearly.
Symptoms vs Causes
- A monitoring system should answer these two questions:
- What is broken? This is the symptom.
- Why is it broken? This is the cause.
- The book says that drawing the line between the what and why is one of the most important ways to make a good monitoring system with high quality signals and low noise.
- An example might be:
- Symptom: The web server is returning 500s or 404s,
- Cause: The database server ran out of hard-drive space.
Black-Box vs White-Box
- Google SREs use white-box monitoring heavily, and use black-box monitoring much less, except for critical uses.
- White-box monitoring relies on inspecting the internals of a system.
- Black-box monitoring is symptom oriented and helps identify unplanned issues.
- An interesting takeaway is that white-box monitoring exposes issues that may be hidden by things like retries.
- A symptom for one team can be a cause for another.
- White-box monitoring is crucial for telemetry.
- Example: The website thinks the database is slow, but does the database think it is slow? If not, there may be a network issue.
- The benefit of black-box monitoring for alerting is that it indicates a problem that is currently happening, but it is basically useless for letting you know that a problem may be about to happen.
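To make the point about retries concrete, here is a minimal sketch in which a black-box probe reports a healthy service while the white-box counters show that two retries were needed. `FlakyBackend`, `Service`, and `black_box_probe` are hypothetical names made up for this illustration, not part of any real monitoring tool.

```python
class FlakyBackend:
    """Hypothetical dependency that fails its first few calls."""

    def __init__(self, failures_before_success):
        self.failures_before_success = failures_before_success
        self.calls = 0

    def query(self):
        self.calls += 1
        if self.calls <= self.failures_before_success:
            raise ConnectionError("backend unavailable")
        return "ok"


class Service:
    """Hypothetical service that retries its backend and counts what happened."""

    def __init__(self, backend, max_attempts=3):
        self.backend = backend
        self.max_attempts = max_attempts
        # White-box metrics: internal counters the service chooses to expose.
        self.metrics = {"requests": 0, "retries": 0, "failures": 0}

    def handle_request(self):
        self.metrics["requests"] += 1
        for attempt in range(1, self.max_attempts + 1):
            try:
                return self.backend.query()
            except ConnectionError:
                if attempt == self.max_attempts:
                    self.metrics["failures"] += 1
                    raise
                self.metrics["retries"] += 1


def black_box_probe(service):
    """Sees only what a user would see: did the request succeed?"""
    try:
        return service.handle_request() == "ok"
    except ConnectionError:
        return False


service = Service(FlakyBackend(failures_before_success=2))
probe_ok = black_box_probe(service)
print("black-box probe healthy:", probe_ok)
print("white-box metrics:", service.metrics)
```

From the outside the request simply succeeded. Only the retry counter hints that the backend is struggling, which is the kind of early warning black-box monitoring cannot give.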
Four Golden Signals
- Latency – The time it takes to service a request.
- Important to separate successful request latency vs failed request latency.
- A slow error is worse than a fast error!
- Traffic – How much demand is being placed on your system, such as requests per second for a web service or, for streaming audio/video, I/O throughput.
- Errors – The rate of requests that fail, either explicitly or implicitly.
- Explicit errors are things like a 500 HTTP response.
- Implicit errors might be any request that took over 2 seconds to finish if your goal is to respond in less than 2 seconds.
- Saturation – How full your service is.
- A measure of resources that are the most constrained, such as CPU or I/O, but note that things usually start to degrade before 100% utilization.
- This is why having a utilization target is important.
- Latency increases are often indicators of saturation.
- Measuring 99th percentile response time over a small interval can be an early signal of saturation.
- Saturation is also concerned with predicting imminent issues, such as a drive filling up.
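The four signals above can be sketched as a small calculation over one window of request records. This is an illustration under stated assumptions rather than how Google computes them: the `Request` record, the `golden_signals` function, the 2 second latency goal (borrowed from the implicit error example above), and the 80% utilization target are all hypothetical.

```python
import math
from dataclasses import dataclass

LATENCY_GOAL_SECONDS = 2.0  # assumed goal, matching the 2 second example above
UTILIZATION_TARGET = 0.80   # assumed target; pick one that fits your service


@dataclass
class Request:
    status: int     # HTTP status code
    seconds: float  # time taken to service the request


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def average(values):
    return sum(values) / len(values) if values else None


def golden_signals(requests, window_seconds, utilization):
    # Latency: keep successful and failed requests separate.
    ok = [r.seconds for r in requests if r.status < 500]
    failed = [r.seconds for r in requests if r.status >= 500]
    # Errors: explicit (5xx) plus implicit (too slow to meet the goal).
    explicit = sum(1 for r in requests if r.status >= 500)
    implicit = sum(
        1 for r in requests
        if r.status < 500 and r.seconds > LATENCY_GOAL_SECONDS
    )
    total = len(requests)
    all_latencies = [r.seconds for r in requests]
    return {
        "latency_ok_avg": average(ok),
        "latency_failed_avg": average(failed),
        "latency_p99": percentile(all_latencies, 99) if requests else None,
        "traffic_rps": total / window_seconds,
        "error_rate": (explicit + implicit) / total if total else 0.0,
        # Saturation: compare against a target, not against 100%.
        "over_utilization_target": utilization > UTILIZATION_TARGET,
    }


sample = [
    Request(200, 0.5),
    Request(200, 1.0),
    Request(200, 3.0),   # succeeded, but too slow: an implicit error
    Request(500, 0.25),  # an explicit error
]
signals = golden_signals(sample, window_seconds=2, utilization=0.85)
for name, value in signals.items():
    print(name, value)
```

Keeping failed latency in its own bucket avoids fast errors dragging the success average down and hiding a slowdown.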
Resources we Like
- Links to Google’s free books on Site Reliability Engineering (sre.google)
- Slack’s Outage on January 4th 2021 (slack.engineering)
- Post-Incident Review on the Atlassian April 2022 outage (Atlassian)
- Great episode on All The Code featuring Brandon Lyons and his journey to Microsoft. (ListenNotes.com)
Tip of the Week
- Prometheus has configurations that let you tune how often it looks for metrics, i.e. the scrape_interval. Scrape too often and you’re wasting resources; not often enough and you can miss important information and get false alerts. (Prometheus)
- There’s a reason WordPress is so popular. It’s fast and easy to set up, especially if you use Webinonly. (Webinonly.com)
- Looking for great encryption libraries for Java or PHP? Check out Bouncy Castle! (Bouncy Castle)
- Big thanks to @bicylerepairmain for the tip on running lines of code in VS Code with a keyboard shortcut. The option workbench.action.terminal.runSelectedText is under File -> Preferences -> Keyboard Shortcuts. (Stack Overflow)
- Need to see all of the files you’ve changed since you branched off of a commit? Use git diff --name-only COMMIT_ID_SHA HEAD. (git-scm.com)
- Couple with Allen’s tip from episode 182 to make it easier to find that starting point!