Request collapsing

Request collapsing is the practice of combining multiple requests for the same object into a single request to origin, and then potentially using the resulting response to satisfy all pending requests.

This is an important concept in the operation of Fastly's caching network, for two reasons:

  1. It solves an important problem, by preventing the expiry of a very highly demanded object in the cache from causing an immediate flood of requests to an origin server, which might otherwise overwhelm it or consume expensive resources.
  2. The need to perform request collapsing explains several significant aspects of the design of our VCL request lifecycle.

Requests processed by vcl_miss (i.e. cache misses) will qualify for request collapsing unless the request is rerouted to vcl_pass by returning pass in vcl_miss or setting req.hash_ignore_busy in vcl_recv. Collapsing is also disabled automatically if a request matches a hit-for-pass object in cache during a cache lookup. In this case, control will move from vcl_hash directly to vcl_pass without calling miss or hit.
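As a sketch, the two explicit ways to opt a request out of collapsing might look like this (the path and header conditions are hypothetical, purely for illustration):

```vcl
sub vcl_recv {
  # Disable collapsing but keep the response cacheable:
  # concurrent requests for the same object fetch independently.
  if (req.url ~ "^/no-collapse/") {        # hypothetical path
    set req.hash_ignore_busy = true;
  }
  return(lookup);
}

sub vcl_miss {
  # Reroute to vcl_pass: this request will not collapse,
  # and its response will not be cached.
  if (req.http.X-Force-Pass) {             # hypothetical header
    return(pass);
  }
  return(fetch);
}
```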

A hit-for-pass object is created and inserted into cache when return(pass) occurs in vcl_fetch for a cacheable response. In this situation, requests that have joined the waiting list are dequeued and processed separately, as are future requests that arrive during the TTL of the hit-for-pass object. See below for more details.

These scenarios are laid out below (non-default behaviors in bold):

| Scenario | Collapsing | Overall result |
|----------|------------|----------------|
| return(lookup) in recv, cache miss, return(fetch) in miss, return(deliver) in fetch (default behavior) | Enabled | Requests will collapse and receive copies of the same object when it's received from origin. |
| return(lookup) in recv, cache miss, return(fetch) in miss, **return(pass) in fetch** | Enabled | Requests will collapse, blocking on the first origin fetch, but when the response is received, the waiting requests will be dequeued and processed individually. A hit-for-pass object will be created. Subsequent requests will follow hit-for-pass rules (see below). |
| return(lookup) in recv, **cache hit-for-pass** | Disabled | Request will not be eligible for collapsing. Other requests for the same object received while waiting for the first to be received from origin will initiate independent fetches to origin. Response will not be cacheable. |
| **return(pass) in recv or miss** | Disabled | As above. |
| **set req.hash_ignore_busy = true in recv** | Disabled | Request will not be eligible for collapsing. Other requests for the same object received while waiting for the first to be received from origin will initiate independent fetches to origin. Resulting origin response will be cacheable. |

Request collapsing is not affected by whether a request has been restarted.

WARNING: If a request exits vcl_fetch with beresp.cacheable set to false, then the response cannot be used to satisfy waiting clients, but we also cannot create a hit-for-pass marker. In this situation the next request in the queue will be sent to origin and the remaining clients will wait on that. Normally this results in the same outcome repeatedly, so to avoid it, you may prefer to always set beresp.cacheable to true and, where you don't want to cache an object, return(pass) from vcl_fetch.
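One way to follow that advice in vcl_fetch is sketched below; the status-code condition is illustrative, standing in for whatever "don't cache this" logic a service actually uses:

```vcl
sub vcl_fetch {
  if (beresp.status >= 500) {
    # Keep the response 'cacheable' so a hit-for-pass marker can be
    # created, instead of leaving waiting clients to retry serially.
    set beresp.cacheable = true;
    return(pass);
  }
  return(deliver);
}
```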

Without request collapsing - the cache stampede problem

Request collapsing is important because in practice the interval between requests for the same object is often smaller than the time required to fetch the object from origin. For example, the home page of a popular website might be requested 50 times per second, but might take 500ms to fetch from origin, including the network latency and origin processing time. Without request collapsing, the origin would be required to generate the same page multiple times concurrently:

Illustration of multiple requests going to origin at the same time

This might not be a problem if the number of concurrent requests is small, but for an object requested 50 times per second, with a 500ms fetch latency, the origin would be processing 25 concurrent requests for the same object before Fastly had the opportunity to store it in cache. This may also mean that where an object is already in cache, the moment it expires or is evicted the origin server will receive a sudden deluge of requests and become overloaded. This kind of effect is often called a cache stampede problem.

The waiting list mechanism

The answer is to have a queue, which Fastly calls a 'waiting list'. When the second request arrives, we know that we're already in the process of fetching the resource that they want, so instead of starting a second fetch, we can attach user 2's request to user 1's. When the fetch completes, the response can be saved in the cache and served simultaneously to both waiting users (and anyone else that's joined in the meantime).

Illustration of multiple requests being collapsed into one

Uncacheable and private responses

Request collapsing is essential to manage demand for high-traffic resources, but it creates numerous edge cases and exceptions that have to be addressed. The first and most obvious is what happens if the resource, once it arrives from the origin, is marked private.

Cache-Control: private

HINT: One of many reasons you should be using Cache-Control and not Expires for your caching configuration is the ability to use directives like private. The private directive tells Fastly that this response is good for only one user: the one that originally triggered the request. Any other requests that joined the waiting list must now be dequeued and processed individually.

Responses not marked private may be used to satisfy queued requests. Even a response with max-age=0 or the equivalent no-cache can be used to satisfy all the waiting clients, and the next request for the object will once again be a MISS. If your reason for giving a response a zero cache lifetime is that it contains content intended for a single user only, then ensure it is correctly marked private to avoid having this content sent to more than one user.


When a private cache-control directive or other logic causes a pass outcome in vcl_fetch, it prevents a response from being used to satisfy the clients on the waiting list, and triggers some of the queued requests to be dequeued. Since the cache state may have changed, each of these queued requests will now check cache again, and if they don't find a hit, they will all be sent to origin individually. However, whether this is done concurrently or consecutively depends on whether the pass response from vcl_fetch is cacheable. A cacheable response is one where beresp.cacheable is true.

If the pass response is cacheable, then a marker is placed in cache that says "don't collapse requests for this resource", called a hit-for-pass object. Requests already queued, and future incoming requests, will now hit that marker and trigger fetches to the origin immediately, which will not block or affect each other. Hit-for-pass objects have a 2 minute TTL by default, but will respect beresp.ttl, subject to a minimum of 2 minutes and a maximum of 1 hour (so a TTL of zero will still set a hit-for-pass object for long enough to allow the queued requests to be dequeued and processed concurrently). Each completed request will dequeue more of the waiting requests until the waiting list is empty.
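As a sketch, creating a hit-for-pass marker with an explicit lifetime might look like the following (checking Cache-Control for private here simply makes the behavior explicit; the TTL value is illustrative):

```vcl
sub vcl_fetch {
  if (beresp.http.Cache-Control ~ "private") {
    set beresp.cacheable = true;   # required for a hit-for-pass marker
    set beresp.ttl = 300s;         # marker lifetime; clamped to 2 min - 1 hour
    return(pass);
  }
  return(deliver);
}
```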

HINT: Even though a hit-for-pass object is an efficient mechanism for managing objects that cannot be served from cache, it's even better to pass in vcl_recv, to avoid building a waiting list in the first place. If you can predict (ahead of talking to an origin server) that a request should be passed, return(pass) from vcl_recv is recommended.
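For example, if a URL prefix is known to serve per-user content, a sketch like this avoids building a waiting list entirely (the path is hypothetical):

```vcl
sub vcl_recv {
  # Predictably uncacheable: skip the cache (and collapsing) up front.
  if (req.url ~ "^/account/") {    # hypothetical per-user path
    return(pass);
  }
  return(lookup);
}
```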

If the pass response is not cacheable, then the queued requests are still dequeued, but the first one to create a new fetch to origin will set up a new waiting list and the other requests will most likely join it. If the result of that fetch is also not cacheable (likely) then this process repeats with only one request for the object proceeding to origin at a time. This may result in client side timeouts for popular objects, so the use of hit-for-pass is highly recommended.

Streaming miss

Conventionally, the period of time in which duplicate requests are most likely to arrive is between receiving the first request and starting to receive the response from origin. However, when the response starts being received, it doesn't always arrive all at once, and may actually take a long time to download. In the case of large files, slow connections, or streams (e.g. video or server-sent events), downloading the response may take minutes. In this case, duplicate requests are actually more likely to be received during this second period, after vcl_fetch has already finished running.

The behavior in this situation depends on two factors: first, the configuration of beresp.do_stream. If false (default), Fastly will wait for the object to finish downloading, and will then deliver it to all the clients on the waiting list. If true, the response will start streaming to all clients on the waiting list simultaneously (using separate buffers, accounting for the fact that each client may be capable of receiving data at a different rate).
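Enabling streaming is a one-line change in vcl_fetch; in this sketch the content-type check is illustrative, standing in for whatever responses a service wants to stream:

```vcl
sub vcl_fetch {
  # Stream the body to waiting clients as it downloads from origin,
  # rather than buffering the whole object first.
  if (beresp.http.Content-Type ~ "^(video/|text/event-stream)") {
    set beresp.do_stream = true;
  }
  return(deliver);
}
```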

The second factor determines whether more new requests can receive the same stream if they arrive after vcl_fetch has run. This is influenced by the TTL of the object established by vcl_fetch. If we are still receiving a response from origin 10s after the headers were received, and vcl_fetch established a 1 minute TTL for this object, new requests will immediately receive the contents of the object downloaded so far, and will then be in sync with the other clients to which we are sending the same object. If the object has already expired, it will continue streaming to clients already receiving it, but new requests for the same object will initiate a new fetch to origin, even though the current one is still being downloaded.

Interaction with clustering and shielding

Requests may arrive on any one of thousands of Fastly edge servers. The vcl_hash process determines a cache address for each object and will normally forward the request from this initial server (the delivery server) to another server in the POP which is responsible for the storage of that specific object (the fetch server). This process, which we call clustering, is intended to increase cache efficiency, and also to reduce the number of requests that are forwarded to origin, because all the fetches for a particular object will be focused through a single cache node that can then efficiently collapse them into a single origin request. Internally, Fastly also operates request collapsing between the delivery node and the fetch node.

Disabling clustering therefore has the potential to significantly increase traffic to origin not just due to a poorer cache hit ratio, but also due to the inability to collapse concurrent requests that originate on different delivery servers.

It is also possible to route requests through up to two Fastly POPs before they reach origin, normally whichever one is the initial handler for the request, and a fixed second location which is physically proximate to your origin. This kind of configuration is known as shielding. In this scenario, requests from POPs across the Fastly network are focused into a single POP, allowing it to perform request collapsing before forwarding a single request to origin.

Illustration of clustering and shielding in effect

The combination of clustering and shielding is an efficient way to dramatically reduce origin load for highly requested objects on high traffic sites.