CLOUD DESIGN PATTERNS

Retry pattern

Nov 25, 2022

The Retry pattern handles transient errors in distributed systems by transparently repeating failed operations.

Problem context

Cloud environments and the services that run in them are subject to temporary faults, so systems need to detect these faults and react accordingly. Faults take various forms, such as a lost connection to the service's network, a temporarily unavailable service, client or server errors, or network overload.
A suitable approach is to build in self-correcting mechanisms that respond to faults automatically, for example by resubmitting a request after a certain period of time, to keep the system stable. One such mechanism is retries.

"Retries are similar to a powerful medicine: useful in the right dose, but can cause significant damage when used too much." (Amazon Builders' Library)

Scenarios

There are several strategies for responding to a failed request.

Cancellation

The first strategy is cancellation. It applies when further attempts can be expected to fail as well. Repeating the request then makes no sense, because the problem must be diagnosed and fixed before the operation can be resumed. An example is a query sent to the service with incorrect parameters.
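The decision between cancelling and retrying can be sketched as a small classifier. This is only an illustration: the article does not name a protocol, so HTTP status codes are assumed here, and both the set of codes and the function name should_retry are hypothetical choices rather than a fixed rule.

```python
# Assumed classification of HTTP status codes; a real service should
# document which of its errors are safe to retry.
RETRYABLE_STATUS = {408, 429, 502, 503, 504}


def should_retry(status_code: int) -> bool:
    """Return True only for statuses that usually signal a transient fault."""
    return status_code in RETRYABLE_STATUS


print(should_retry(400))  # False: bad request parameters, so cancel
print(should_retry(503))  # True: service unavailable, may recover
```

A request rejected with 400 would fail the same way on every attempt, so it is cancelled and reported instead of repeated.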

Immediate retry

The second strategy is retry, which can be performed either immediately or after a delay. Immediate retry suits rare errors whose cause is unknown or unusual.
It is a selfish strategy, because it can occupy too many resources of the servers that receive the requests. Repeated requests are sent immediately, so their number can grow very quickly.
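A minimal sketch of an immediate retry might look like the following. The helper name retry_immediately and its parameters are hypothetical, and treating Python's built-in ConnectionError as the transient fault is an assumption. The number of attempts is kept small on purpose, since nothing slows the requests down.

```python
def retry_immediately(operation, max_attempts=2, retry_on=(ConnectionError,)):
    """Call operation, repeating at once on failure, up to max_attempts calls."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: let the caller see the last error.
```

Errors that are not listed in retry_on propagate on the first failure, which matches the cancellation strategy.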

Backoff

Retrying after a delay, known as backoff, addresses problems such as network faults, where waiting gives the network load time to subside before the next attempt. It is generally the recommended method, because it requests external services less aggressively and so does not consume more resources than needed.

How should the delay be chosen? The best answer is that it depends on the nature of the error.

The delay can grow between attempts, for example linearly or exponentially. Implementations usually also limit the delay to a maximum value, known as capped backoff, so that it does not grow without bound, and limit the number of attempts so that retries do not continue indefinitely.
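Linear and exponential capped backoff can be sketched as follows. All names and default values (base, cap, max_attempts) are hypothetical and chosen only for illustration, and ConnectionError again stands in for a transient fault. The sleep parameter exists so the waiting can be replaced in tests.

```python
import time


def linear_delay(attempt, base=0.5, cap=10.0):
    """Delay grows by a fixed step per attempt, limited to cap seconds."""
    return min(cap, base * attempt)


def exponential_delay(attempt, base=0.5, cap=10.0):
    """Delay doubles with each attempt, limited to cap seconds."""
    return min(cap, base * 2 ** (attempt - 1))


def retry_with_backoff(operation, max_attempts=5, delay_for=exponential_delay,
                       retry_on=(ConnectionError,), sleep=time.sleep):
    """Retry operation, waiting delay_for(attempt) seconds between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # Attempt limit reached: stop retrying.
            sleep(delay_for(attempt))


print([linear_delay(n) for n in range(1, 5)])       # [0.5, 1.0, 1.5, 2.0]
print([exponential_delay(n) for n in range(1, 7)])  # [0.5, 1.0, 2.0, 4.0, 8.0, 10.0]
```

In the exponential case the sixth delay would be 16 seconds without the cap, and the cap holds it at 10.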

Logging

Alongside the retry strategy, the system should log the events that occur, including every request or other operation that fails. Properly collected logs make it possible to analyze the specifics of the errors, how often they occur, and the factors that contribute to them.
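One way to record retry events is with Python's standard logging module, as in the sketch below. The logger name "retry", the helper retry_with_logging, and the message wording are hypothetical. Delays are left out to keep the example focused on what gets logged.

```python
import logging

logger = logging.getLogger("retry")  # hypothetical logger name


def retry_with_logging(operation, max_attempts=3, retry_on=(ConnectionError,)):
    """Retry operation and log every failed attempt and the final outcome."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = operation()
        except retry_on as exc:
            if attempt == max_attempts:
                logger.error("giving up after %d attempts: %r", attempt, exc)
                raise
            logger.warning("attempt %d/%d failed: %r", attempt, max_attempts, exc)
        else:
            if attempt > 1:
                logger.info("succeeded on attempt %d", attempt)
            return result
```

Logging the attempt number and the exception gives later analysis the error type and frequency without extra instrumentation.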

Multilayered architecture issues

Most system architectures are multilayered: the layers depend on each other, and a call to a service in one layer triggers calls to services in other layers. As a result, when a layer retries a call that in turn triggers retries in the layers below it, the number of requests multiplies and the efficiency of the system drops significantly. This can eventually clog the system. Where possible, it is therefore best to perform retries at a single point in the architecture rather than in every layer or server.
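The multiplication is easy to work out. Assuming every layer retries independently and every call fails, the worst-case number of calls that reach the deepest service is the product of the attempts made at each layer. The function name worst_case_calls and the three-layer setup are hypothetical.

```python
import math


def worst_case_calls(attempts_per_layer):
    """Worst-case calls reaching the deepest service for one incoming request."""
    return math.prod(attempts_per_layer)


print(worst_case_calls([3, 3, 3]))  # 27: every layer makes up to 3 attempts
print(worst_case_calls([3, 1, 1]))  # 3: only the outermost layer retries
```

With three layers making three attempts each, a single user request can turn into 27 calls against a service that is already failing. Retrying in one place keeps that number at 3.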

Conclusions

This pattern makes sense mainly in distributed systems, and only when communication errors are likely to resolve relatively quickly. If they are not, constant repetition of requests wastes resources on operations that are doomed to fail anyway.
Retry is also not a remedy when communication errors are caused by inadequately scaled services. If a service does not respond because it is overloaded, applying the retry pattern in the hope that traffic will decrease is not the solution. In that case, the overloaded service should be scaled up or out to meet the load.

References

[1] Retry Pattern

[2] Timeouts, retries, and backoff with jitter

About the CDP Series

The number of solutions being built in the cloud is increasing by the day. As a result, solutions to common problems are shaping up in the form of design patterns. They ensure the reliability, security, and scalability of systems.

This series has been created to explain the commonly used cloud design patterns in a simple and understandable way.
I hope that everyone will find something valuable here.