Around IT In 256 Seconds

#20: Chaos engineering

Oct 26 '20

We tend to focus on testing happy paths and expected edge cases. But how do you make sure that your system can survive minor infrastructure and network failures, as well as application bugs? Especially in microservice or serverless environment, where there are tons of moving parts. I've seen too many times systems that fail miserably because some minor dependency was malfunctioning. For example, you have a tiny service that displays a small social widget on your website. When that service is down, the rest of the website should work. But without proper care and testing, you may end up with global HTTP 503 failure. Code reviews and unit tests are fine, but the ultimate test is... turning off that service on production. And making sure the rest actually works. This is called chaos engineering.

Believe it or not, many organizations do practice deliberately injecting faults into production. Now, turning off a service's instance on production is probably the easiest test you can conduct. The client must catch an exception and handle the failure gracefully. Sometimes by retrying, hoping to reach another healthy instance. Sometimes by returning a fallback value that's less relevant or up-to-date. Ideally, the end-user should not realize one of the services is down. Of course, that would mean that a failed service is not needed at all and can be shut down forever. So in practice, we expect visible, but insignificant degrade in service quality.

Get the new episode straight to your mailbox: https://256.nurkiewicz.com/newsletter

Episode source

Around IT In 256 Seconds Follow

#20: Chaos engineering

Around IT In 256 Seconds