Healthchecks

Backends attached to Fastly services have a health status of either healthy or sick. We determine this status by regularly sending a predefined HTTP request to the backend and checking that we get back the expected response. This regular polling of the backend is a healthcheck.

Creating healthchecks

You can create a healthcheck using fastly healthcheck create in the CLI, the healthcheck API endpoint, the web interface, or in VCL as part of a backend { ... } declaration.

Healthchecks can be configured to poll at an interval of your choice determined by the check_interval property in the API or the interval property in the VCL backend { ... } declaration. The sensitivity of the healthcheck is determined by other properties (which all apply both in the API and in VCL):

  • window: The number of healthcheck results to keep track of
  • threshold: The number of healthchecks that must pass (within the window) for a backend to be considered healthy
  • initial: The number of successful healthcheck results to pre-populate into the window at startup
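How these three properties interact is easiest to see with a small model. The Python sketch below is a toy simulation, not Fastly's implementation: the `HealthWindow` class is hypothetical, and it assumes the pre-populated passes occupy the most recent slots of the window, which the text above does not specify.

```python
from collections import deque


class HealthWindow:
    """Toy model of a healthcheck result window (not Fastly's implementation)."""

    def __init__(self, window: int, threshold: int, initial: int) -> None:
        if not (0 <= initial <= window and 0 < threshold <= window):
            raise ValueError("expected 0 <= initial <= window and 0 < threshold <= window")
        self.threshold = threshold
        # Assumption: the window starts with `initial` passes in the newest
        # slots and failures everywhere else.
        seed = [False] * (window - initial) + [True] * initial
        self.results = deque(seed, maxlen=window)

    def record(self, passed: bool) -> None:
        # Appending to a full deque drops the oldest result automatically.
        self.results.append(passed)

    @property
    def healthy(self) -> bool:
        # Healthy when enough results inside the window are passes.
        return sum(self.results) >= self.threshold


checks = HealthWindow(window=5, threshold=3, initial=1)
states = [checks.healthy]
for outcome in (True, True, False, False, False):
    checks.record(outcome)
    states.append(checks.healthy)
print(states)  # [False, False, True, True, True, False]
```

In this model the backend turns healthy after two passing checks and stays healthy through two failures, because three passes are still inside the window. A larger gap between window and threshold tolerates more failures before the status flips.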

Controlling initial state

Backends that do not have a healthcheck are considered healthy at all times, including immediately upon initialization.

Backends that do have a healthcheck are considered sick upon service initialization. They are marked healthy once enough successful healthchecks have completed to reach the configured threshold. To have a health-checked backend considered healthy immediately upon initialization, set initial to the same value as threshold. The threshold is then met immediately, before any healthcheck requests are sent.

IMPORTANT: VCL services are initialized when the first request to the service is received. As a result, if initial is less than threshold, the first request made in each POP following the first deployment of the service will likely result in an HTTP 503 (service unavailable) response. Subsequent requests will continue to encounter unhealthy backends until enough successful healthchecks have been performed.
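As a rough guide to how long a newly initialized backend might stay sick, the gap between threshold and initial can be multiplied by the check interval. The sketch below assumes every healthcheck passes and that checks run exactly one interval apart. Both function names are hypothetical, and real timing depends on when the first check is scheduled.

```python
def checks_until_healthy(threshold: int, initial: int) -> int:
    """Passing checks still needed before the threshold is met."""
    return max(0, threshold - initial)


def approx_seconds_until_healthy(threshold: int, initial: int, interval_s: float) -> float:
    """Rough estimate only: assumes all checks pass, one per interval."""
    return checks_until_healthy(threshold, initial) * interval_s


# initial below threshold: the backend starts sick.
print(checks_until_healthy(threshold=3, initial=1))                         # 2
print(approx_seconds_until_healthy(threshold=3, initial=1, interval_s=10))  # 20
# initial equal to threshold: healthy from the start.
print(checks_until_healthy(threshold=3, initial=3))                         # 0
```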

Redeployments of existing services do not affect health status unless a backend is renamed or added, or the healthcheck configuration of the backend changes.

Understanding healthcheck traffic volume

The number of healthcheck requests that are received by your backend server is likely to be much higher than your setting of interval/check_interval may suggest. This is due to a number of effects:

  1. Each Fastly POP handles healthchecks independently, but shares the results of the healthcheck within the POP in a process called healthcheck amortization. After the service is activated, the number of healthcheck instances for each backend definition will gradually approach the number of operational Fastly POPs that are handling traffic for your service.
  2. If the backend's hostname resolves to multiple IP addresses, a separate healthcheck will be sent to each one.
  3. If you create the same backend on multiple Fastly services and give each of them a healthcheck, then by default they will run independently, even if the healthcheck request is identical.

HINT: A realistic "worst case" scenario based on the above details might be one where you have 50 Fastly services that all use the same backend, that backend's DNS lookup returns 5 A records, and around 100 Fastly POPs handle traffic for the services. In this situation, configuring the backend with a check_interval of 1000ms (1 per second) would actually result in:

[1 check/sec] x [100 POPs] x [50 services] x [5 IPs] = 25,000 requests/sec
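The same estimate can be reproduced for your own numbers. This sketch simply multiplies the four factors. The function name is hypothetical, and it assumes one healthcheck instance per POP (amortization enabled) and no shared healthchecks between services.

```python
def healthcheck_requests_per_second(
    check_interval_ms: int, pops: int, services: int, ips: int
) -> float:
    """Upper-bound estimate of healthcheck requests reaching one backend hostname."""
    checks_per_second = 1000 / check_interval_ms
    return checks_per_second * pops * services * ips


rate = healthcheck_requests_per_second(
    check_interval_ms=1000, pops=100, services=50, ips=5
)
print(f"{rate:,.0f} requests/sec")  # 25,000 requests/sec
```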

It's also possible for Fastly POPs to briefly have amortized healthchecks disabled, for example during the initial deployment of a new POP. In such situations healthcheck rates may increase temporarily.

To reduce healthcheck traffic, first consider applying the same share_key to backends that are identical across multiple Fastly services, which will enable them to share the same healthcheck. If you have many similar Fastly services (e.g., staging and other non-live environments) then, given that healthchecks are performed independently in each Fastly POP, it's a good idea to manually define the backend in VCL, so that the share_key is the same across all services. The share_key property is only customizable in a manual backend { ... } definition.

To further reduce healthcheck traffic, consider increasing the interval/check_interval or reducing the number of IP addresses returned from a DNS query of the backend's hostname.
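To size the interval against a traffic budget, the estimate can be inverted. The sketch below is an approximation under stated assumptions: `independent_checks` is a hypothetical parameter counting healthchecks that do not share results, so it equals the number of services without a common share_key and is assumed to drop to 1 when all of them share one.

```python
import math


def min_check_interval_ms(
    max_requests_per_second: float, pops: int, independent_checks: int, ips: int
) -> int:
    """Smallest interval (ms) that keeps healthcheck traffic within the budget."""
    probes = pops * independent_checks * ips
    return math.ceil(1000 * probes / max_requests_per_second)


# 50 services with separate healthchecks, 100 POPs, 5 IPs, budget of 100 requests/sec.
print(min_check_interval_ms(100, pops=100, independent_checks=50, ips=5))  # 250000
# The same services sharing one healthcheck through a common share_key.
print(min_check_interval_ms(100, pops=100, independent_checks=1, ips=5))   # 5000
```

Under these assumptions, sharing the healthcheck lets the interval be 50 times shorter for the same load on the backend.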

Using healthchecks to route requests to healthy backends

Fastly will not send HTTP requests to backends that are sick.

For the simplest possible service configuration (a single backend, and no content in cache), the effect of the backend being sick is that all end-user requests will elicit a Fastly-generated 503 Service unavailable response. This may still be better than not having a healthcheck at all, because a backend server that is failing might output unpredictable content.

This can be demonstrated by assigning an always-sick backend as the backend for all requests: every request then receives the 503 response.

Healthchecks have more impact and provide a more seamless user experience when applied to services with multiple backends because it's then possible to intelligently select a healthy backend in preference to a sick one. You can select a healthy backend using custom logic or by setting up a director.

Using directors

A director is a grouping mechanism that balances traffic across its member backends based on a specified strategy. Backends are excluded from directors while they are unhealthy, so Fastly will instead choose a healthy backend using the strategy specified in the director definition. For example, a director that uses the random strategy will choose a random backend from the healthy members.
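The selection behavior can be modeled outside VCL. The Python sketch below is not a director definition. It only illustrates "exclude sick members, then pick at random", using hypothetical backend names and ignoring director options such as weights.

```python
import random
from typing import Optional


def pick_backend(members: dict[str, bool], rng: random.Random) -> Optional[str]:
    """Pick a random healthy member, or None if every member is sick."""
    healthy = [name for name, is_healthy in members.items() if is_healthy]
    return rng.choice(healthy) if healthy else None


members = {"origin_a": True, "origin_b": False, "origin_c": True}
choice = pick_backend(members, random.Random())
print(choice)  # origin_a or origin_c, never the sick origin_b
```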

For more information on how to set up directors and all of the director strategies available, see the directors reference docs.

Manually using VCL

The req.backend.healthy and backend.{NAME}.healthy VCL variables can be used to query the health status of the currently selected backend, or a nominated backend, at runtime. This information can be used as part of a hand-built routing solution. For example, if you have deployed infrastructure at geographically distributed locations, you may want to map Fastly edge locations to the most appropriate origin location. However, if one of your origins is down, you might want to instead use the next closest one.
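The "next closest origin" idea reduces to walking an ordered preference list and taking the first healthy entry. The sketch below shows that logic in Python rather than VCL. The region and origin names are hypothetical, and the `health` dictionary stands in for the values that backend.{NAME}.healthy would report at runtime.

```python
from typing import Optional


def choose_origin(preferences: list[str], health: dict[str, bool]) -> Optional[str]:
    """Return the first healthy origin in preference order, or None."""
    for name in preferences:
        if health.get(name, False):  # unknown origins are treated as sick
            return name
    return None


# Hypothetical mapping from edge region to origins, nearest first.
nearest_first = {
    "europe": ["origin_eu", "origin_us", "origin_ap"],
    "asia": ["origin_ap", "origin_us", "origin_eu"],
}
health = {"origin_eu": False, "origin_us": True, "origin_ap": True}

print(choose_origin(nearest_first["europe"], health))  # origin_us
print(choose_origin(nearest_first["asia"], health))    # origin_ap
```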

These same variables can also be used if you wish to create an API to query the health status of your backends. For example, requests to /api/origin-status can be intercepted and answered with a dynamically generated JSON response that reports the current health status of each backend.

Including the server.datacenter variable in the JSON response identifies the Fastly POP that answered, which matters because the health status of a particular backend may be different in each POP. For a convenient way to make the request to every Fastly POP at once, see the /content/edge_check API endpoint.

Health-checking mechanics

If amortized healthchecks are enabled at both the service and POP level, each registered backend will be healthchecked by one designated Fastly cache server, which will then distribute the results to the rest of the POP. This server is also responsible for performing DNS lookups on the backend hostname, and distributing the resulting host addresses.

[Figure: Illustration of the healthcheck mechanism]

Partial health

If a backend has multiple IP addresses and some, but not all, are sick, then the backend will be considered healthy. Fastly will continue to route traffic to it using the healthy IPs and will consider it to be healthy as part of assessing the health of any director that the backend is a member of.

By default, we will register up to 16 IPs for each backend. If more than 16 addresses are returned from a DNS query for a backend hostname, we will use only 16 of them. In some circumstances this limit can be increased; if you need more, contact support. Keep in mind that it may be better to split up large IP pools into multiple backends so that Fastly can assign different health statuses to each of them.

When forwarding live traffic to a healthy backend that has more than one healthy IP, Fastly cache servers will select a random healthy address.
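The partial-health rules above can be summarized in a few lines. This is a sketch under assumptions: the function names are hypothetical, the addresses come from a documentation range, and keeping the first 16 addresses is a guess, since the text does not say which 16 are used.

```python
import random
from typing import Optional

MAX_IPS_PER_BACKEND = 16  # default limit described above


def register_ips(resolved: list[str]) -> list[str]:
    # Assumption: the first 16 addresses are kept.
    return resolved[:MAX_IPS_PER_BACKEND]


def backend_is_healthy(ip_health: dict[str, bool]) -> bool:
    # The backend counts as healthy while at least one address is healthy.
    return any(ip_health.values())


def pick_address(ip_health: dict[str, bool], rng: random.Random) -> Optional[str]:
    # Live traffic goes to a random healthy address.
    healthy = [ip for ip, ok in ip_health.items() if ok]
    return rng.choice(healthy) if healthy else None


ip_health = {"192.0.2.1": False, "192.0.2.2": True, "192.0.2.3": True}
print(backend_is_healthy(ip_health))             # True
print(pick_address(ip_health, random.Random()))  # 192.0.2.2 or 192.0.2.3
```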

DNS caching

Fastly honors the DNS TTL of backend hostnames. However, DNS results are renewed only when needed for healthchecking purposes, so the time between DNS queries also depends on the frequency of the healthcheck (determined by the interval or check_interval parameter).

Backend requests resulting from live end-user traffic to your Fastly service do not trigger DNS lookups and will always use cached results. As a result, we may use stale DNS data for short periods depending on the frequency of the healthcheck attached to a backend.
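A simple way to reason about this staleness is to assume that a DNS lookup can only happen on a healthcheck tick, so the effective refresh period is the TTL rounded up to a whole number of check intervals. The function below is a hypothetical model of that assumption, not a description of Fastly's scheduler.

```python
import math


def dns_refresh_delay_s(ttl_s: float, check_interval_s: float) -> float:
    """Seconds between DNS lookups if they only occur on healthcheck ticks."""
    ticks = max(1, math.ceil(ttl_s / check_interval_s))
    return ticks * check_interval_s


for ttl_s, interval_s in ((30, 60), (90, 60), (300, 60)):
    delay = dns_refresh_delay_s(ttl_s, interval_s)
    # Staleness is how long a cached record outlives its TTL.
    print(f"TTL {ttl_s}s, interval {interval_s}s: refresh after {delay}s, "
          f"stale for up to {delay - ttl_s}s")
```

In this model, a TTL of 30 seconds with a 60 second interval leaves records stale for up to 30 seconds, while a TTL that is an exact multiple of the interval adds no extra staleness.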

If a DNS lookup triggered by a healthcheck fails (i.e., the response is not one of "NOERROR", "NODATA", or "NXDOMAIN"), we will continue to use stale DNS data for both the healthcheck and the forwarding of backend traffic for a short period. If the lookup continues to fail after this period, we will clear the stale IPs, mark the backend as sick, and keep attempting to obtain fresh DNS results at the healthcheck interval.