Model failover
Prioritize the failover of requests across different models from an LLM provider.
About failover
Failover is a way to keep services running smoothly by automatically switching to a backup system when the main one fails or becomes unavailable.
For AI gateways, you can set up failover across the models of an LLM provider in a priority order that you choose. If the primary model goes down, slows, or has any other issue, the gateway quickly switches to a backup model from the same provider, which keeps the service running without interruption.
This approach increases the resiliency of your environment by ensuring that apps that call LLMs keep working, even if one model has issues.
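As an illustration only, the following shape sketch shows how this priority order is expressed in a kgateway Backend. The complete, working manifest appears later in this guide.

```yaml
# Shape sketch only: pools under priorities are tried in order,
# so traffic fails over from the first pool to later ones on errors.
spec:
  type: AI
  ai:
    multipool:
      priorities:
      - pool: []   # priority 1: the primary model
      - pool: []   # priority 2: the first fallback
      - pool: []   # priority 3: the last resort
```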
Before you begin
- Set up AI Gateway.
- Authenticate to the LLM provider, such as by storing an API key in a Kubernetes secret, as sketched after this list.
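For example, the Backend in this guide references a secret named openai-secret in the kgateway-system namespace. The following sketch assumes the API key is stored as an Authorization header value; the exact key/value format that your gateway expects may differ, so follow the authentication guide.

```sh
# Sketch: store an OpenAI API key in the secret that the Backend references.
# The expected key/value format may differ; check the authentication guide.
kubectl create secret generic openai-secret -n kgateway-system \
  --from-literal="Authorization=Bearer $OPENAI_API_KEY"
```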
Get the external address of the gateway and save it in an environment variable.
```sh
export INGRESS_GW_ADDRESS=$(kubectl get svc -n kgateway-system ai-gateway -o jsonpath="{.status.loadBalancer.ingress[0]['hostname','ip']}")
echo $INGRESS_GW_ADDRESS
```
If your cluster does not assign an external address to the gateway, you can instead port-forward the gateway deployment for local testing.

```sh
kubectl port-forward deployment/ai-gateway -n kgateway-system 8080:8080
```
Fail over to other models
In this example, you create a Backend with multiple pools for the same LLM provider. Each pool represents a specific model from the LLM provider, and the gateway fails over between the pools in the priority order that you define. For more information, see the MultiPool API reference docs.
1. Create or update the Backend for your LLM provider. The priority order of the models is as follows:

   - OpenAI `gpt-4o` model
   - OpenAI `gpt-4.0-turbo` model
   - OpenAI `gpt-3.5-turbo` model
```yaml
kubectl apply -f- <<EOF
apiVersion: gateway.kgateway.dev/v1alpha1
kind: Backend
metadata:
  labels:
    app: model-failover
  name: model-failover
  namespace: kgateway-system
spec:
  type: AI
  ai:
    multipool:
      priorities:
      - pool:
        - provider:
            openai:
              model: "gpt-4o"
              authToken:
                kind: SecretRef
                secretRef:
                  name: openai-secret
      - pool:
        - provider:
            openai:
              model: "gpt-4.0-turbo"
              authToken:
                kind: SecretRef
                secretRef:
                  name: openai-secret
      - pool:
        - provider:
            openai:
              model: "gpt-3.5-turbo"
              authToken:
                kind: SecretRef
                secretRef:
                  name: openai-secret
EOF
```
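Optionally, you can verify that the Backend was created and that its spec lists the three pools in the expected priority order. This extra check is a suggestion, not part of the original steps:

```sh
# Inspect the Backend to confirm the three prioritized pools.
kubectl get backend model-failover -n kgateway-system -o yaml
```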
2. Create an HTTPRoute resource that routes incoming traffic on the `/model` path to the Backend that you created in the previous step. In this example, the URLRewrite filter rewrites the path from `/model` to the path of the API in the LLM provider that you want to use, such as `/v1/chat/completions` for OpenAI.

```yaml
kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-failover
  namespace: kgateway-system
  labels:
    app: model-failover
spec:
  parentRefs:
  - name: ai-gateway
    namespace: kgateway-system
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /model
    filters:
    - type: URLRewrite
      urlRewrite:
        path:
          type: ReplaceFullPath
          replaceFullPath: /v1/chat/completions
    backendRefs:
    - name: model-failover
      namespace: kgateway-system
      group: gateway.kgateway.dev
      kind: Backend
EOF
```
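Optionally, you can check the HTTPRoute status conditions to confirm that the ai-gateway parent accepted the route before you send traffic. This verification step is not part of the original guide:

```sh
# Check the route's status conditions for acceptance by the ai-gateway parent.
kubectl get httproute model-failover -n kgateway-system -o yaml
```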
3. Send a request to observe the failover. In your request, do not specify a model. Instead, the Backend automatically uses the model from the first pool in the priority order.
curl -v "$INGRESS_GW_ADDRESS:8080/model" -H content-type:application/json -d '{ "messages": [ { "role": "user", "content": "What is kubernetes?" } ]}' | jq
curl -v "localhost:8080/model" -H content-type:application/json -d '{ "messages": [ { "role": "user", "content": "What is kubernetes?" } ]}' | jq
Example output: Note that the response comes from the `gpt-4o` model, which is the first model in the priority order from the Backend.

```json
{
  "id": "chatcmpl-BFQ8Lldo9kLC56S1DFVbIonOQll9t",
  "object": "chat.completion",
  "created": 1743015077,
  "model": "gpt-4o-2024-08-06",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Kubernetes is an open-source container orchestration platform designed to automate the deployment, scaling, and management of containerized applications. Originally developed by Google, it is now maintained by the Cloud Native Computing Foundation (CNCF).\n\nKubernetes provides a framework to run distributed systems resiliently. It manages containerized applications across a cluster of machines, offering features such as:\n\n1. **Automatic Bin Packing**: It can optimize resource usage by automatically placing containers based on their resource requirements and constraints while not sacrificing availability.\n\n2. **Self-Healing**: Restarts failed containers, replaces and reschedules containers when nodes die, and kills and reschedules containers that are unresponsive to user-defined health checks.\n\n3. **Horizontal Scaling**: Scales applications and resources up or down automatically, manually, or based on CPU usage.\n\n4. **Service Discovery and Load Balancing**: Exposes containers using DNS names or their own IP addresses and balances the load across them.\n\n5. **Automated Rollouts and Rollbacks**: Automatically manages updates to applications or configurations and can rollback changes if necessary.\n\n6. **Secret and Configuration Management**: Enables you to deploy and update secrets and application configuration without rebuilding your container images and without exposing secrets in your stack configuration and environment variables.\n\n7. **Storage Orchestration**: Allows you to automatically mount the storage system of your choice, whether from local storage, a public cloud provider, or a network storage system.\n\nBy providing these functionalities, Kubernetes enables developers to focus more on creating applications, while the platform handles the complexities of deployment and scaling. It has become a de facto standard for container orchestration, supporting a wide range of cloud platforms and minimizing dependencies on any specific infrastructure.",
        "refusal": null,
        "annotations": []
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  ...
}
```
Cleanup
You can remove the resources that you created in this guide.

```sh
kubectl delete backend,httproute -n kgateway-system -l app=model-failover
```
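To confirm that the cleanup succeeded, you can list the labeled resources again; the command should report that no resources were found:

```sh
# After cleanup, this should return no Backend or HTTPRoute resources.
kubectl get backend,httproute -n kgateway-system -l app=model-failover
```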
Next
Explore other AI Gateway features.
- Pass in functions that the LLM can request to call, as a step towards agentic AI.
- Set up prompt guards to block unwanted requests and mask sensitive data.
- Enrich your prompts with system prompts to improve LLM outputs.