Designing kgateway for Scalability – Not All Gateways Are Created Equal

Lin Sun & Yuval Kohavi

May 7, 2025

With the Kubernetes Gateway API becoming the de facto standard for managing traffic into, out of, and within clusters, a growing number of gateways now implement this API. While gateways are often thought of as interchangeable, the choice can have major implications—especially in terms of scale.

Before comparing features, a critical consideration is whether a gateway is built on a solid, reliable foundation.

Aren’t All Envoy-Based Gateways the Same?

It’s a fair question. Many gateways use the Envoy proxy under the hood to enforce routing, security, and other traffic policies. But despite sharing the same data plane, not all gateways are equal: the control plane makes all the difference.

The control plane is responsible for translating Kubernetes Gateway API resources into actual Envoy configuration. This translation layer can be simple for a few routes, but when you’re managing 20,000 resources (which will translate into 500,000+ lines of Envoy config), the efficiency and scalability of the control plane become critical—as I noted in a previous LinkedIn post.
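To make that concrete, here is a rough, simplified illustration (not kgateway's exact output; the names and structure are hypothetical) of what a single HTTPRoute rule turns into on the Envoy side: one rule with a path match and a backend reference becomes an Envoy route entry pointing at a generated cluster, and every referenced backend becomes its own cluster with endpoints.

# Illustrative only: a fragment of an Envoy RouteConfiguration that a rule like
#   "match path prefix /foo/1 -> backendRef Service foo, port 8080"
# might roughly translate into. Actual names and structure produced by kgateway differ.
virtual_hosts:
- name: http
  domains: ["*"]
  routes:
  - match:
      prefix: /foo/1
    route:
      cluster: default_foo_8080   # hypothetical generated cluster name

Multiply that by tens of thousands of routes and backends, and how much of this configuration must be recomputed on each change becomes the dominant factor in how quickly an update takes effect.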

Designing kgateway for Scalability

When we started building Gloo (now kgateway) seven years ago, we used a snapshot-based model that recalculated everything on every update, whether it was a new route, a route update, or a backend change. This meant even small changes triggered a full control plane recalculation. With Kubernetes, where pods and backends change constantly, this was not scalable.

Managing resource dependencies via manual references like targetRefs and extensionRefs also proved cumbersome. Dependency resolution sometimes failed, creating reliability issues.

Learning from that experience, we designed kgateway using the battle-tested krt framework to handle dependency tracking automatically. Now, only affected objects are re-translated when a change occurs—ensuring fast updates and efficient scaling.

Let’s walk through a few core scenarios to illustrate how kgateway scales to support large teams and applications. Starting from a setup of 10,000 routes and backends, we’ll look at how kgateway handles adding a new route, updating a route, deleting a route, and the control plane becoming unavailable.

Core Scalability Scenarios

When a New Route Is Added

In a production system with thousands of existing routes, adding a new one should be near-instant. Ideally, the route becomes effective within 1 second, without needing a periodic timer or restart. This ensures teams can test and deploy with confidence.

Watch the full demo in the Test These Scenarios Yourself section below.
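If you want to put a rough number on "near-instant" yourself, one simple check (assuming you have completed the setup in the Test These Scenarios Yourself section below, with the port-forward running and bash as your shell) is to apply one more route and time a polling loop until it starts serving:

# Apply a route beyond the ones created in the steps below (10001 is arbitrary),
# then measure how long it takes before the gateway serves it.
go run main.go apply -f routes.yaml --start 10001 --iterations 1
time ( until curl -sf -o /dev/null http://localhost:8080/foo/10001; do sleep 0.1; done )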

When a Route Is Updated

Updates to an existing route should not interrupt traffic. The old route should be replaced atomically, and the new route should take effect immediately. Any delay or downtime can lead to outages or security issues—especially in critical systems.

Watch the full demo in the Test These Scenarios Yourself section below.
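A simple way to verify this yourself (again assuming the setup from the section below) is to keep requests flowing against the route you are about to update and watch for any non-200 responses while you apply the change in another terminal:

# Continuous probe against the route being updated (route /foo/10000 from the
# steps below); any non-200 status printed here indicates interrupted traffic.
while true; do
  code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/foo/10000)
  [ "$code" != "200" ] && echo "non-200 response: $code"
  sleep 0.05
done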

When a Route Is Removed

Removing a route should instantly revoke access. Lingering routes, even for a few seconds, can pose security risks and violate compliance requirements.

Watch the full demo in the Test These Scenarios Yourself section below.

When the Control Plane Is Restarted or Scaled

The system should remain fully operational when the control plane restarts or scales out. Since the Envoy proxies are already configured, their behavior should not be affected. Once the control plane returns, it should not trigger unnecessary reconfigurations if there’s been no change.
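You can observe this from the proxy itself. As a hedged example (assuming the proxy's Envoy admin interface is exposed on port 19000; adjust the port if your install differs), Envoy's xDS update counters should stay flat across a control plane restart when nothing has changed:

# Compare Envoy's xDS update counters before and after restarting the control
# plane; with no config change, these counters should not keep climbing.
kubectl port-forward deployment/http -n kgateway-system 19000:19000 &
curl -s http://localhost:19000/stats | grep -E 'lds.update_success|cds.update_success|update_rejected'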

Test These Scenarios Yourself

  1. Install kgateway.

  2. Create the http gateway proxy using a Gateway resource:

$ kubectl apply -f - <<EOF
kind: Gateway
apiVersion: gateway.networking.k8s.io/v1
metadata:
 name: http
 namespace: kgateway-system
spec:
 gatewayClassName: kgateway
 listeners:
 - protocol: HTTP
   port: 8080
   name: http
   allowedRoutes:
     namespaces:
       from: All
EOF
  3. (Optional) Install the OpenTelemetry Collector and kube-prometheus-stack to observe CPU, memory, and Envoy metrics, for example as shown below. (Make sure your cluster has adequate resources if doing this.)
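One possible installation uses the community Helm charts (chart names and values here are only a suggestion; recent versions of the OpenTelemetry chart require choosing a collector image explicitly):

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
helm install otel-collector open-telemetry/opentelemetry-collector -n monitoring \
  --set mode=deployment --set image.repository=otel/opentelemetry-collector-k8s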

  4. Clone the kgateway repo. From the cloned directory, use the applier utility to load 10,000 routes and backends (from 0 to 9999). Send a few random requests to confirm the routes are effective immediately.

cd hack/utils/applier
wget https://raw.githubusercontent.com/linsun/gateway-tests/refs/heads/main/scale/routes.yaml
go run main.go apply -f routes.yaml --iterations 10000
kubectl port-forward deployment/http -n kgateway-system 8080:8080 &
curl http://localhost:8080/foo/9
curl http://localhost:8080/foo/99
curl http://localhost:8080/foo/9999
  5. Add the 10,001st route and backend, and check that requests to the newly added route take effect immediately.
go run main.go apply -f routes.yaml --start 10000 --iterations 1
curl http://localhost:8080/foo/10000
  6. Update the 10,001st route. Send the request with -v (verbose); you should see the response header hello: kgateway being added:
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
 name: route-10000
 namespace: default
spec:
 parentRefs:
 - group: gateway.networking.k8s.io
   kind: Gateway
   name: http
   namespace: kgateway-system
 rules:
 - backendRefs:
   - group: gateway.kgateway.dev
     kind: Backend
     name: backend-10000
     weight: 1
   filters:
   - type: URLRewrite
     urlRewrite:
       path:
         replacePrefixMatch: /anything/
         type: ReplacePrefixMatch
   - type: ResponseHeaderModifier
     responseHeaderModifier:
       add:
         - name: hello
           value: kgateway
   matches:
   - path:
       type: PathPrefix
       value: /foo/10000
EOF
curl http://localhost:8080/foo/10000 -v
  7. Delete the 10,001st route, and confirm the route is no longer functional:
kubectl delete httproute route-10000
curl http://localhost:8080/foo/10000
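To make the check explicit, you can print just the status code; once the route is removed, the gateway should return a 404 rather than a backend response:

curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/foo/10000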
  8. Delete the kgateway control plane. The CPU/memory of the Envoy proxy managed by the control plane should remain stable, and all requests should continue to work:
for i in {2..1009}
do
  curl http://localhost:8080/foo/9999
  sleep 0.1
  date
done

In a new terminal window:

kubectl delete pod -l kgateway=kgateway -n kgateway-system
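While the control plane pod is being recreated, you can also confirm that the proxy itself is untouched (kubectl top assumes metrics-server is installed in your cluster):

# The proxy pod for the "http" gateway should not restart, and its CPU/memory
# should stay roughly flat; only the control plane pod is recreated.
kubectl get pods -n kgateway-system -w
kubectl top pod -n kgateway-system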

You can also watch the video demo here.

Wrapping Up

Scalability and stability are the foundation of kgateway. Before we add new features, we ensure the core is reliable under heavy load and across failure scenarios. If the foundation isn’t solid, everything built on top is at risk of collapsing.

To dive deeper into our control plane design, check out our previous blog post. We’d love to hear your thoughts—join us on the kgateway Slack channel or follow us on LinkedIn to keep the conversation going.