Outlier detection
Configure passive health checks and remove unhealthy hosts from the load balancing pool with an outlier detection policy.
About outlier detection
Outlier detection is an important part of building resilient apps. An outlier detection policy sets up several conditions, such as consecutive error thresholds and ejection percentages, that kgateway uses to determine whether a host is unhealthy. When an unhealthy host is detected, the outlier detection policy defines how that host is removed from the pool of healthy destinations that traffic is sent to. Your apps then have time to recover before they are added back to the load balancing pool and checked again for consecutive errors.
Before you begin
- Follow the Get started guide to install kgateway.
- Follow the Sample app guide to create a gateway proxy with an HTTP listener and deploy the httpbin sample app.
- Get the external address of the gateway and save it in an environment variable.
export INGRESS_GW_ADDRESS=$(kubectl get svc -n kgateway-system http -o jsonpath="{.status.loadBalancer.ingress[0]['hostname','ip']}")
echo $INGRESS_GW_ADDRESS
If your environment does not assign an external address to the gateway service, port-forward the gateway proxy on port 8080 instead and use localhost:8080 in the requests that follow.
kubectl port-forward deployment/http -n kgateway-system 8080:8080
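Optionally, send a test request to confirm that the gateway and the httpbin app respond before you continue. This sketch assumes the HTTPRoute from the Sample app guide matches the www.example.com host; adjust the host header if your route differs.
curl -i http://$INGRESS_GW_ADDRESS:8080/headers -H "host: www.example.com:8080"
You should get back a 200 HTTP response code along with the request headers that httpbin echoes.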
Set up outlier detection
- Scale the httpbin app to 2 replicas.
kubectl scale deploy/httpbin -n httpbin --replicas=2
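Optionally, wait for the new replica to become ready before you continue.
kubectl rollout status deploy/httpbin -n httpbin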
- Verify that you see two replicas of the httpbin app.
kubectl get pods -n httpbin
Example output:
NAME                      READY   STATUS    RESTARTS   AGE
httpbin-577649ddb-lsgp8   2/2     Running   0          31d
httpbin-577649ddb-q9b92   2/2     Running   0          3s
- Send a few requests to the httpbin app. Because both httpbin replicas are exposed under the same service, requests are automatically load balanced across all healthy replicas. This and the following steps show two variants of each request: one that uses the external gateway address and one that uses the port-forwarded localhost address. Use the variant that matches your setup.
for i in {1..5}; do curl -vi http://$INGRESS_GW_ADDRESS:8080/status/200 -H "host: www.example.com:8080" ; done
for i in {1..5}; do curl -vi localhost:8080/status/200 -H "host: www.example.com:8080"; done
Example output for one request:
* Request completely sent off
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< access-control-allow-credentials: true
access-control-allow-credentials: true
< access-control-allow-origin: *
access-control-allow-origin: *
< content-length: 0
content-length: 0
< x-envoy-upstream-service-time: 1
x-envoy-upstream-service-time: 1
< server: envoy
server: envoy
...
- Review the logs for both replicas. Verify that you see log entries for the 5 requests spread across both replicas. For example, one replica might have 2 log entries and the other 3.
kubectl logs <httpbin-replica> -n httpbin -f
Example output for one request:
time="2025-09-15T21:11:52.8514" status=200 method="GET" uri="/status/200" size_bytes=0 duration_ms=0.03 user_agent="curl/8.7.1" client_ip=10.X.X.XX
- Create a BackendConfigPolicy with your outlier detection policy. The following example ejects an unhealthy host for one hour after the host returns a single 5XX HTTP response code. Note that the maximum ejection percentage is set to 80%. Because of that, only one of the two httpbin replicas can be ejected at any given time: ejecting both replicas would equal 100%, which exceeds the 80% threshold. The table after the command describes each setting.
kubectl apply -f- <<EOF
kind: BackendConfigPolicy
apiVersion: gateway.kgateway.dev/v1alpha1
metadata:
  name: httpbin-policy
  namespace: httpbin
spec:
  targetRefs:
    - name: httpbin
      group: ""
      kind: Service
  outlierDetection:
    interval: 2s
    consecutive5xx: 1
    baseEjectionTime: 1h
    maxEjectionPercent: 80
EOF
Setting | Description
--- | ---
interval | The time interval after which the hosts are evaluated to determine whether they are healthy. In this example, the hosts are evaluated every 2 seconds. If not set, this field defaults to 10s.
consecutive5xx | The number of consecutive server-side errors, such as 5XX HTTP response codes for HTTP traffic or connection failures for TCP traffic, before a host is ejected from the load balancing pool. In this example, a host is removed after it returns one 5XX HTTP response code. If not set, ejection occurs after 5 consecutive errors by default. If this field is set to 0, passive health checks are disabled.
baseEjectionTime | The duration that a host is removed from the load balancing pool before a new evaluation starts. If not set, this field defaults to 30s.
maxEjectionPercent | The maximum percentage of hosts that can be ejected from the load balancing pool. In this example, 80% of all hosts can be ejected at a given time. If not set, this field defaults to 10 percent.
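Optionally, verify that the BackendConfigPolicy was created. The command reads the resource back by its kind name, matching the delete command in the cleanup steps. Also note that the 1h baseEjectionTime keeps an ejected host out of the pool for a long time; if you plan to repeat the ejection experiment several times, a shorter value such as 30s makes it easier to reset between runs.
kubectl get backendconfigpolicy httpbin-policy -n httpbin -o yaml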
- Repeat the requests to the httpbin app. In the log output for both httpbin replicas, verify that all requests are still spread across both httpbin instances.
for i in {1..5}; do curl -vi http://$INGRESS_GW_ADDRESS:8080/status/200 -H "host: www.example.com:8080" ; done
for i in {1..5}; do curl -vi localhost:8080/status/200 -H "host: www.example.com:8080"; done
- Force one httpbin replica to return a 503 HTTP response code. This response code triggers the outlier detection policy and automatically removes that replica from the load balancing pool for 1 hour.
curl -vik http://$INGRESS_GW_ADDRESS:8080/status/503 -H "host: www.example.com:8080"
curl -vi localhost:8080/status/503 -H "host: www.example.com:8080"
Example output:
* Request completely sent off
< HTTP/1.1 503 Service Unavailable
HTTP/1.1 503 Service Unavailable
< access-control-allow-credentials: true
access-control-allow-credentials: true
< access-control-allow-origin: *
...
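To confirm the ejection on the Envoy side, you can port-forward the gateway proxy's admin port (19000, the same port that a later step uses for the stats endpoint) and inspect the cluster status. The grep pattern and the failed_outlier_check flag shown here reflect typical Envoy admin output for the httpbin cluster; treat this as a sketch and adjust the pattern to your cluster name if needed.
kubectl port-forward deploy/http -n kgateway-system 19000 &
curl -s http://localhost:19000/clusters | grep httpbin | grep health_flags
An ejected host is usually listed with health_flags::/failed_outlier_check, while the remaining host shows health_flags::healthy. Stop this port-forward (for example with kill %1) before you continue so that it does not conflict with the later port-forward step.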
- Send a few more requests to the httpbin app. In the logs for both replicas, verify that all requests now go only to the instance that is still considered healthy.
for i in {1..5}; do curl http://$INGRESS_GW_ADDRESS:8080/status/200 -H "host: www.example.com:8080" ; done
for i in {1..5}; do curl -vi localhost:8080/status/200 -H "host: www.example.com:8080"; done
Example log output of the healthy instance:
time="2025-09-15T21:19:02.0808" status=200 method="GET" uri="/status/200" size_bytes=0 duration_ms=0.03 user_agent="curl/8.7.1" client_ip=10.X.X.XX time="2025-09-15T21:19:02.2452" status=200 method="GET" uri="/status/200" size_bytes=0 duration_ms=0.03 user_agent="curl/8.7.1" client_ip=10.X.X.XX time="2025-09-15T21:19:02.4053" status=200 method="GET" uri="/status/200" size_bytes=0 duration_ms=0.04 user_agent="curl/8.7.1" client_ip=10.X.X.XX time="2025-09-15T21:19:02.6067" status=200 method="GET" uri="/status/200" size_bytes=0 duration_ms=0.01 user_agent="curl/8.7.1" client_ip=10.X.X.XX time="2025-09-15T21:19:02.7604" status=200 method="GET" uri="/status/200" size_bytes=0 duration_ms=0.01 user_agent="curl/8.7.1" client_ip=10.X.X.XX
Example log output of the unhealthy instance:
time="2025-09-15T21:17:25.4035" status=503 method="GET" uri="/status/503" size_bytes=0 duration_ms=0.04 user_agent="curl/8.7.1" client_ip=10.X.X.XX
- Force the second httpbin replica to return a 503 HTTP response code. Note that the outlier detection policy allows at most 80% of all hosts to be ejected at a given time. Because ejecting both replicas would equal 100%, this host is not removed from the load balancing pool. The instance is still considered healthy and can receive requests. In the log output for this healthy instance, verify that you see the log entry for the 503 request.
curl -vik http://$INGRESS_GW_ADDRESS:8080/status/503 -H "host: www.example.com:8080"
curl -vi localhost:8080/status/503 -H "host: www.example.com:8080"
Example log output of the healthy instance:
time="2025-09-15T21:20:27.1117" status=503 method="GET" uri="/status/503" size_bytes=0 duration_ms=0.02 user_agent="curl/8.7.1" client_ip=10.X.X.XX
- Send a few more requests to the httpbin app. In the logs for both replicas, verify that all requests are still routed to the same instance, because that instance was not removed from the load balancing pool.
for i in {1..5}; do curl http://$INGRESS_GW_ADDRESS:8080/status/200 -H "host: www.example.com:8080" ; done
for i in {1..5}; do curl -vi localhost:8080/status/200 -H "host: www.example.com:8080"; done
Example log output for the healthy instance:
# Previous log output
time="2025-09-15T21:20:27.1117" status=503 method="GET" uri="/status/503" size_bytes=0 duration_ms=0.02 user_agent="curl/8.7.1" client_ip=10.0.9.76
# New log output
time="2025-09-15T21:25:11.4236" status=200 method="GET" uri="/status/200" size_bytes=0 duration_ms=0.02 user_agent="curl/8.7.1" client_ip=10.0.15.215
time="2025-09-15T21:25:11.5833" status=200 method="GET" uri="/status/200" size_bytes=0 duration_ms=0.03 user_agent="curl/8.7.1" client_ip=10.0.9.76
time="2025-09-15T21:25:11.7473" status=200 method="GET" uri="/status/200" size_bytes=0 duration_ms=0.03 user_agent="curl/8.7.1" client_ip=10.0.9.76
time="2025-09-15T21:25:11.9098" status=200 method="GET" uri="/status/200" size_bytes=0 duration_ms=0.01 user_agent="curl/8.7.1" client_ip=10.0.15.215
time="2025-09-15T21:25:12.0824" status=200 method="GET" uri="/status/200" size_bytes=0 duration_ms=0.01 user_agent="curl/8.7.1" client_ip=10.0.9.76
- Port-forward the gateway proxy pod on port 19000.
kubectl port-forward deploy/http -n kgateway-system 19000
- Open the Prometheus stats endpoint, such as http://localhost:19000/stats/prometheus, and look for the following metrics.
envoy_cluster_outlier_detection_ejections_consecutive_5xx: The number of times a host qualified for ejection. In this example, the number is 2, because both hosts qualified for ejection.
envoy_cluster_outlier_detection_ejections_enforced_consecutive_5xx: The number of times an ejection was enforced. In this example, the number is 1, because ejecting the second host would have exceeded the maximum ejection percentage in your outlier detection policy.
Example output:
envoy_cluster_outlier_detection_ejections_consecutive_5xx{envoy_cluster_name="kube_httpbin_httpbin_8000"} 2
envoy_cluster_outlier_detection_ejections_enforced_consecutive_5xx{envoy_cluster_name="kube_httpbin_httpbin_8000"} 1
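To filter the stats from the command line instead of a browser, you can query the same admin endpoint directly; the grep pattern below is just one way to narrow the output. If the second ejection was skipped because of the maximum ejection percentage, you might also see a nonzero envoy_cluster_outlier_detection_ejections_overflow counter, which counts ejections that Envoy aborted to stay under that limit.
curl -s http://localhost:19000/stats/prometheus | grep envoy_cluster_outlier_detection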
Cleanup
- Scale down the httpbin app to 1 replica.
kubectl scale deploy/httpbin -n httpbin --replicas=1
- Remove the BackendConfigPolicy.
kubectl delete backendconfigpolicy httpbin-policy -n httpbin
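Optionally, confirm that the cleanup succeeded: the policy should be gone and only one httpbin replica should remain.
kubectl get backendconfigpolicy -n httpbin
kubectl get pods -n httpbin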