Release It! Second Edition
A fascinating exploration into software patterns designed to make software more reliable.
A long time has passed since I last looked through this book, but it really stood out to me as a must-read for software engineers. The topics that stood the test of time:
- Applying Back Pressure
- The Circuit Breaker Pattern
- Stability Anti-Patterns
While I won’t get into the stability anti-patterns here, that part of the book is full of lessons on how software fails in production and leads to downtime. After setting the stage for what will go wrong in production, the book teaches you how to design and build resilient software systems through better software patterns (discussed here) and organization/process changes.
The reason I loved this book, and the reason this might be my favorite software engineering book to date, is the battle-tested advice and the stories behind why these patterns are necessary. Software is never bug-free, and teaching readers to design their systems to account for that fact is an invaluable lesson. Can’t recommend this book enough.
Early on in my career I found these software patterns to be the most novel, and worth sharing:
Shed Load and Apply Back Pressure
Whether you are deploying services to be consumed by the rest of the world or internally in a service-oriented architecture, there is a risk that your consumers will ask too much of you. That might happen because more requests arrive than you can handle, or because a downstream service you rely on is in a degraded state. The problem remains the same: you can’t always respond to every request that comes in. That raises the question: how should your service behave when it can’t meet the demand? Typically, services completely fall over and stop responding to requests.
A better approach is to be proactive and design a strategy for managing load when you know you can’t keep up.
Nygard highlights two approaches:
- Load Shedding
- Applying Back Pressure
Which approach do you choose and when?
Load Shedding
When you are serving the outside world, the recommended approach is to shed load, especially if you are running multiple instances behind a load balancer. If you want to provide a good user experience for your customers, you want to avoid making them wait on requests that hang. Instead of giving them a slow response (which might eventually fail anyway), direct them to instances of the service that have room to handle the request more efficiently.
If your service starts taking too long to respond (as defined by your SLA), it’s time to react and start shedding load. A common approach is to start returning 503s to the load balancer so that the load balancer knows to send future requests to other instances of your service.
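To make the idea concrete, here is a minimal sketch of latency-based load shedding in Python. It is my own illustration, not code from the book: the `LoadShedder` class, the `handle` function, the averaging rule, and the `Retry-After` value are all assumptions chosen to keep the example small.

```python
import time
from collections import deque


class LoadShedder:
    """Hypothetical helper: sheds load while recent latency exceeds a budget."""

    def __init__(self, sla_seconds, window_seconds=10.0, clock=time.monotonic):
        self.sla_seconds = sla_seconds        # assumed latency budget from the SLA
        self.window_seconds = window_seconds  # how long a sample stays relevant
        self.clock = clock                    # injectable so tests can control time
        self.samples = deque()                # (timestamp, latency) pairs

    def record(self, latency):
        """Remember how long a completed request took."""
        self.samples.append((self.clock(), latency))

    def should_shed(self):
        """True when the average latency inside the window is over budget."""
        cutoff = self.clock() - self.window_seconds
        # Old samples expire, so the service starts accepting work again
        # once the slow period has aged out of the window.
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()
        if not self.samples:
            return False
        average = sum(lat for _, lat in self.samples) / len(self.samples)
        return average > self.sla_seconds


def handle(shedder, do_work):
    """Return (status, headers, body), rejecting early instead of hanging."""
    if shedder.should_shed():
        # 503 lets the load balancer route future requests elsewhere.
        return 503, {"Retry-After": "1"}, "overloaded"
    start = shedder.clock()
    body = do_work()
    shedder.record(shedder.clock() - start)
    return 200, {}, body
```

A real service would more likely look at a latency percentile or the number of in-flight requests than at a plain average, but the shape is the same: measure, compare against the budget, and reject early rather than let requests pile up.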
Back Pressure
When running in a service-oriented architecture (aka microservices), you might be better off implementing back pressure to help manage load before resorting to load shedding. With back pressure, instead of dropping requests you signal to your clients that they need to wait before sending more. The concept is easiest to explain in a system that uses asynchronous messaging in a producer-consumer fashion. Producers add messages to a queue asynchronously. When the consumer working through that queue determines that it can’t handle the volume, it applies back pressure by slowing down the rate at which producers can add new messages. With the producers slowed down, the system can hopefully catch up. A degraded user experience is better than a complete outage.
While that is one of the more elegant approaches to implementing back pressure, it can also be implemented by blocking request threads when too many requests come in, or by responding with an error message indicating that the message was dropped.
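Both variants can be sketched with a bounded queue from Python's standard library. This is my own illustration under simple assumptions (a single in-process queue, and hypothetical function names `produce_blocking`, `produce_or_reject`, and `consume`), not an implementation from the book.

```python
import queue
import threading


def produce_blocking(q, message):
    """Back pressure by blocking: put() waits while the bounded queue is full."""
    q.put(message)


def produce_or_reject(q, message, wait_seconds=0.0):
    """Back pressure by rejection: report failure instead of waiting forever."""
    try:
        if wait_seconds > 0:
            q.put(message, timeout=wait_seconds)
        else:
            q.put_nowait(message)
        return True
    except queue.Full:
        # The caller learns the message was not accepted and can back off.
        return False


def consume(q, handled, stop):
    """Drain the queue until the stop sentinel arrives."""
    while True:
        message = q.get()
        if message is stop:
            break
        handled.append(message)


if __name__ == "__main__":
    work = queue.Queue(maxsize=2)   # the bound is what creates back pressure
    handled, stop = [], object()
    worker = threading.Thread(target=consume, args=(work, handled, stop))
    worker.start()
    for i in range(5):
        produce_blocking(work, i)   # slows down whenever the consumer lags
    produce_blocking(work, stop)
    worker.join()
    print(handled)
```

The important design choice is the `maxsize` bound. An unbounded queue never pushes back on producers, so the overload just moves into memory.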
The Circuit Breaker Pattern
As the name suggests, the circuit breaker pattern is borrowed from electrical engineering. Applied to software, it manages the danger of operations that are likely to time out or fail. The breaker acts as a gateway to these operations and tracks how they perform. If they stop living up to the SLA, the circuit breaker can “trip” into the open state, so that operations gated by it fail fast instead of being attempted.
There are several reasons why this is a good thing. When you are communicating with a service that is getting slower and slower, a circuit breaker prevents you from a) adding more load to a struggling system and b) wasting time on operations that are likely to time out and fail.
After “a while” (Note: There are super interesting implementation examples/strategies in the book that I won’t get into here) the circuit breaker enters a “half-open” state and starts letting requests through to test whether the likelihood of success has improved. If the circuit breaker likes what it sees, it can “close” and resume sending all requests through again.
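The closed, open, and half-open states can be sketched in a few lines of Python. This is a simplified, single-threaded illustration of my own, not one of the strategies from the book: the `CircuitBreaker` and `CircuitOpenError` names, the consecutive-failure threshold, and the fixed reset timeout are all assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without attempting it."""


class CircuitBreaker:
    """Hypothetical breaker that trips after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # the "a while" before probing again
        self.clock = clock                  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None               # None means the breaker is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, operation):
        state = self.state
        if state == "open":
            # Fail fast: do not add load to a struggling dependency.
            raise CircuitOpenError("circuit is open")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-trip after a failed probe
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a remote call then looks like `breaker.call(lambda: fetch_profile(user_id))`, where `fetch_profile` stands in for whatever operation you want to protect. A production breaker would also need thread safety and would usually limit how many trial requests pass through while half-open; this sketch lets every call through in that state.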