DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Project: One App — Three Probes — Real Failures

1
Comments
3 min read
Ring 0 Deployment Safety Protocol (Post-CrowdStrike)

1
Comments 1
2 min read
How a Kubernetes Autoscaling Incident Took Down Our API — and How I Now Debug It in Minutes

Comments 1
2 min read
Kubernetes In-Place Pod Resize

Comments
3 min read
Datadog: Observability Lessons from 50+ AWS Apps

4
Comments
7 min read
Lessons in Testing, Performance, and Legacy Systems from /dev/mtl 2025

Comments
7 min read
Turning block/goose into an AI SRE Agent

1
Comments
3 min read
Rightsizing Kubernetes Requests with the In-Place Vertical Pod Autoscaler

2
Comments
3 min read
AWS Security Series: AWS Access Key is Compromised. Now What? An Incident Response Playbook.

Comments
3 min read
Kubernetes Is Not a Container Platform (And That Changes Everything)

Comments
1 min read
What is performance engineering: A Gatling take

Comments
8 min read
Announcing Reliability Delta: Clear, Objective Insight into Whether Your Release Made Your System Better or Worse

Comments
4 min read
What 100+ Production Incidents Taught Me About System Design

9
Comments 5
5 min read
Production Canary Architecture (what actually guarantees zero downtime)

3
Comments
3 min read
Utilizing the Go 1.25 Flight Recorder with tracing middleware

1
Comments
6 min read