DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Spegel, Pixie, and Why :latest Is Evil

4 min read
Project: One App — Three Probes — Real Failures

3 min read
Ring 0 Deployment Safety Protocol (Post-CrowdStrike)

2 min read
How a Kubernetes Autoscaling Incident Took Down Our API — and How I Now Debug It in Minutes

2 min read
Kubernetes In-Place Pod Resize

3 min read
Datadog: Observability Lessons from 50+ AWS Apps

7 min read
Lessons in Testing, Performance, and Legacy Systems from /dev/mtl 2025

7 min read
Turning block/goose into an AI SRE Agent

3 min read
Rightsizing Kubernetes Requests with the In-Place Vertical Pod Autoscaler

3 min read
AWS Security Series: AWS Access Key is Compromised. Now What? An Incident Response Playbook.

3 min read
Kubernetes Is Not a Container Platform (And That Changes Everything)

1 min read
What is performance engineering: A Gatling take

8 min read
Announcing Reliability Delta: Clear, Objective Insight into Whether Your Release Made Your System Better or Worse

4 min read
What 100+ Production Incidents Taught Me About System Design

5 min read
Production Canary Architecture (what actually guarantees zero downtime)

3 min read