Cloud Design Patterns

Throttling pattern

Nov 11, 2022

Throttling is the pattern of controlling the consumption of resources by an application instance, an individual tenant, or an entire service over a given period of time.


Problem context

Cloud applications are exposed to varying loads depending on the service model: handling requests sent to your SaaS application, hosting services on PaaS, or running VMs in an IaaS setup.

A common example is load that varies over time. Traffic is often time-dependent, peaking on business days or at the beginning and end of the day, week, or month, and it can increase significantly. The application must anticipate this and be able to handle the increased traffic. At the same time, the system must keep meeting its SLAs and ensure that no single node monopolizes the available resources.

To meet these expectations, the cloud infrastructure must handle varying levels of traffic, and the associated costs, while ensuring the highest possible availability and compliance with service level agreements.


Let’s look at sample scenarios to help us understand the problem described earlier.

One-sided throttling

The fundamental way to control infrastructure load is to implement throttling on one side:

  • provider-side,
  • consumer-side.

To be clear, the person icon on the consumer side does not necessarily mean a human is sending the requests. It could equally be an external service sending requests to the provider.

The former assumes that throttling is implemented only on the provider's side and that no limits are enforced on clients. The client has complete discretion in using the provider's resources, which can overload the provider and force it to reject or queue requests. Inadequate management on the provider's side degrades service performance and hurts the customer experience.

The second case, consumer-side throttling, is a bit different: it requires a proper understanding of the traffic that clients send to the service and setting SLAs at appropriate levels. There is a risk that the sum of the agreed maximum client rates exceeds what the provider can handle, in which case the service could be overloaded at peak times. It is therefore essential to understand how clients actually use the service so the architecture can be scaled up or out to meet their requirements.
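As a rough sketch of consumer-side throttling, a client can apply a token-bucket limiter to its own outgoing requests so that it never exceeds its agreed rate. The class name, rate, and burst size below are illustrative assumptions, not part of any particular SDK:

```python
import time

class ClientRateLimiter:
    """Token-bucket limiter applied on the consumer side (illustrative).

    The consumer agrees not to exceed `rate` requests per second, so the
    provider can plan capacity around the sum of all client rates.
    """

    def __init__(self, rate: float, burst: int):
        self.rate = rate            # tokens refilled per second
        self.capacity = burst       # maximum burst size
        self.tokens = float(burst)  # start with a full bucket
        self.last = time.monotonic()

    def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        while True:
            now = time.monotonic()
            # Refill tokens proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for the next token to accrue.
            time.sleep((1 - self.tokens) / self.rate)

# A consumer capped at 50 requests/second with a burst of 5:
limiter = ClientRateLimiter(rate=50, burst=5)
# limiter.acquire()  # call before each outgoing request
```

Blocking with `acquire()` smooths the client's traffic; a variant that raises or returns False instead of sleeping would let the caller decide how to back off.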

Application side-only throttling

Since a system's architecture is often composed of a number of services communicating internally, this internal traffic must be managed properly as well. The design and monitoring stages are therefore critical to keeping the entire traffic flow under control.

Let’s consider the situation shown below:

We have an application consisting of several components. Consumers send different requests at different frequencies. Knowing the system we have designed, we know that a single request sent to our system triggers internal communication with the backend services, and that each service receives a specific range of requests per frontend call. In addition, each service can handle a certain maximum amount of traffic per second of operation, defined as its MAX RATE.

Let’s think about how the system behaves with two consumers. Each client sends at most 50 and 70 requests, respectively, in any given second, for a total of 120 requests per second. Our frontend can handle this, but what happens on the backend? Every request sent to the frontend causes it to send a correspondingly larger number of requests to the backend services.
If we assume that in a given second 100 requests reach the frontend, the frontend sends up to 1,000 requests to Service #1, up to 100 requests to Service #2, and up to 10,000 requests to Service #3. Hence the need for a solution: services #1 and #3 face an intense workload if they are to handle all of those requests.
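The fan-out arithmetic above can be sketched in a few lines. The fan-out factors match the scenario (100 frontend requests become 1,000 / 100 / 10,000 backend requests); the MAX RATE values are hypothetical, chosen only to show which services end up overloaded:

```python
# Fan-out factors from the scenario: one frontend request triggers up to
# 10 calls to Service #1, 1 call to Service #2, and 100 calls to Service #3.
FAN_OUT = {"service_1": 10, "service_2": 1, "service_3": 100}

# Assumed per-service capacities (MAX RATE, requests per second) for illustration.
MAX_RATE = {"service_1": 800, "service_2": 500, "service_3": 5000}

def backend_load(frontend_rps: int) -> dict:
    """Worst-case backend request rate for a given frontend rate."""
    return {svc: frontend_rps * factor for svc, factor in FAN_OUT.items()}

def overloaded(frontend_rps: int) -> list:
    """Services whose MAX RATE is exceeded at this frontend rate."""
    load = backend_load(frontend_rps)
    return [svc for svc, rps in load.items() if rps > MAX_RATE[svc]]

print(backend_load(100))  # {'service_1': 1000, 'service_2': 100, 'service_3': 10000}
print(overloaded(100))    # ['service_1', 'service_3']
```

With these assumed capacities, services #1 and #3 exceed their MAX RATE at 100 frontend requests per second, which is exactly the situation the two remedies below address.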

There are generally two ways out of this situation:

  1. services #1 and #3 can be scaled up or out to increase their capacity to handle the number of requests,
  2. the communication between the frontend and services #1 and #3 can be throttled according to their capacity.

It is also worth noting what happens when a new consumer (Client #3) joins our system: the number of requests required on the backend increases significantly. This raises the question of whether scaling up services is always the right approach, because, in general, the more consumers there are, the more we have to scale. Moreover, the system will often not be used that intensively, leaving many of those resources idle.

Full throttling support

The last and most controlled scenario is full throttling, that is, throttling on both the consumer and the provider side. I’ll skip its discussion, as it is a composite of the previous cases.


As indicated earlier, solving these problems depends entirely on the business case, the system design, and the available resources. For this reason, choosing an appropriate throttling strategy is vital to adequately address the specific needs of the system.


The most obvious strategy for handling load is autoscaling. Depending on consumer demand, system resources are scaled until they meet the load generated between the consumers and the provider. This method is not without drawbacks, however, as autoscaling reacts with a delay. As a result, an extremely fast increase in the system’s load may create a bottleneck before autoscaling has had time to respond.

The second way to solve such problems is to set a permissible threshold. If some part of the system exceeds it, it will be throttled. Being throttled can manifest itself in a variety of ways:

  • rejecting requests from consumers who have called the API too many times,
  • deferring the handling of a request, i.e., deliberately delaying it and informing the consumer that it can retry later,
  • deactivating less critical services to free up request capacity for services essential to the system’s basic functionality,
  • using the Queue-Based Load Leveling pattern,
  • using the Priority Queue pattern.
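The first option in the list, rejecting requests above a threshold, can be sketched with a sliding-window counter on the provider side. The class name and the limits are assumptions for illustration; in an HTTP service the `False` branch would typically map to a 429 response with a Retry-After header:

```python
import time
from collections import deque

class SlidingWindowThrottle:
    """Provider-side throttle: reject requests above a per-window threshold
    (illustrative sketch)."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps = deque()  # arrival times of accepted requests

    def allow(self) -> bool:
        now = time.monotonic()
        # Drop timestamps that have fallen out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)
            return True
        return False  # caller would respond with e.g. HTTP 429 + Retry-After

# At most 3 requests per second; the 4th and 5th in a burst are rejected:
throttle = SlidingWindowThrottle(max_requests=3, window_seconds=1.0)
results = [throttle.allow() for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

The same counter could instead queue the rejected requests, which is where the Queue-Based Load Leveling and Priority Queue patterns from the list above come in.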

A suitable solution may be to combine threshold-based throttling with autoscaling. While the system is scaling, the configured thresholds keep it responsive.

Keep in mind that such a combination makes sense for a system that encounters increased traffic on a fairly regular basis. Autoscaling to handle very infrequent spikes may only add unnecessary cost to maintaining the system. This is not a hard rule, however: if the system serves a critical purpose, it may be better to permanently provision it above its resource requirements so that it can always meet consumer demand quickly.

About the CDP Series

The number of solutions being built in the cloud is increasing by the day. As a result, solutions to common problems are shaping up in the form of design patterns. They ensure the reliability, security, and scalability of systems.

This series has been created to explain the commonly used cloud design patterns in a simple and understandable way.
I hope that everyone will find something valuable here.



