Chaos Engineering

“Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” (Principles of Chaos Engineering) In modern, rapidly evolving distributed systems, components fail all the time. These failures can be complex because they can cascade across systems. System weaknesses such as latency, race conditions, and Byzantine failures can be exacerbated by large traffic volumes. Chaos engineering is key to discovering how these complex failures may affect the system and then validating over time that the weaknesses have been overcome. ...

August 13, 2019 · (updated December 21, 2024) · 6 min · Pradeep Loganathan

Retry Pattern

The retry pattern is an important way to make applications and services more resilient to transient failures. A transient failure is a common type of failure in a cloud-based distributed architecture, often caused by the nature of the network itself (loss of connectivity, request timeouts, and so on). Transient faults occur when services are hosted separately and communicate over the wire, most commonly over HTTP. These faults are expected to be short-lived, so repeating a request that previously failed may succeed on a subsequent attempt. ...

August 10, 2018 · (updated August 31, 2022) · 3 min · Pradeep Loganathan
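
The retry behaviour described in the excerpt above can be sketched in a few lines. This is a minimal illustration, not the post's own implementation: `TransientError`, `retry`, and the parameters `max_attempts`, `base_delay`, and `sleep` are hypothetical names chosen for the example, and the exponential backoff with jitter is an assumed (though common) way to space out attempts.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for short-lived faults (timeouts, dropped connections)."""


def retry(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on TransientError up to max_attempts times.

    The wait doubles after each failed attempt and includes random jitter so
    that many clients do not retry in lockstep. `sleep` is injectable so the
    helper can be exercised without real waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure to the caller
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
```

Only errors classified as transient are retried here; anything else propagates immediately, since repeating a request that fails for a permanent reason would only add load.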

Bulkhead Isolation

The Bulkhead Isolation pattern is used to make sure that all our services work independently of each other, so that a failure in one does not cause a failure in another. Techniques such as the single-responsibility pattern, the asynchronous-communication pattern, Circuit Breaker, and the Retry pattern help stop a failure from propagating throughout the whole application. To implement this pattern, we also need to partition the overall system into a few isolated groups, so that a failure in one partition does not spread to the others. Containerization and microservices are one option for building partitioned and isolated systems. ...

July 8, 2018 · (updated August 31, 2022) · 1 min · Pradeep Loganathan
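
The partitioning idea in the excerpt above can also be applied inside a single process. The sketch below is one possible illustration, assuming a per-dependency concurrency limit enforced with a semaphore; `Bulkhead`, `BulkheadFullError`, and the partition names are hypothetical and do not come from the post.

```python
import threading


class BulkheadFullError(Exception):
    """Hypothetical error raised when a partition has no free slots."""


class Bulkhead:
    """Caps concurrent calls into one partition so it cannot exhaust shared capacity."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Reject immediately instead of queueing, so a slow dependency
        # cannot tie up callers that belong to other partitions.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            return operation()
        finally:
            self._slots.release()


# Each dependency gets its own partition: saturating one leaves the other usable.
payments = Bulkhead("payments", max_concurrent=1)
search = Bulkhead("search", max_concurrent=1)
```

Rejecting rather than waiting is a design choice for this sketch; a real system might instead queue with a timeout or fall back to a cached response.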