This extension upgrades an ext-proc-capable proxy or gateway, such as Envoy Gateway, kGateway, or the GKE Gateway, into an inference gateway, built for inference platform teams self-hosting large language models on Kubernetes. It makes it easy to expose your local OpenAI-compatible chat completion endpoints to other workloads on or off the cluster and to control access to them, or to integrate your self-hosted models alongside model-as-a-service providers in a higher-level AI gateway such as LiteLLM, Solo AI Gateway, or Apigee.
The inference gateway:
- Improves the tail latency and throughput of LLM completion requests against Kubernetes-hosted model servers using an extensible request-scheduling algorithm that is KV-cache and request-cost aware, avoiding evictions and queueing as load increases
- Provides Kubernetes-native declarative APIs to route client model names to use-case-specific LoRA adapters and to control incremental rollout of new adapter versions, A/B traffic splitting, and safe blue-green base model and model server upgrades (see the example manifests after this list)
- Adds end-to-end observability around service objective attainment
- Ensures operational guardrails between different client model names, allowing a platform team to safely serve many different GenAI workloads on the same pool of shared foundation model servers for higher utilization and fewer required accelerators
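For illustration, the sketch below shows roughly how a platform team might map a client-facing model name to weighted LoRA adapter versions on a shared pool of model servers. The API group, version, and field names here are assumptions for the purpose of the example and may differ from the current schema; consult the API documentation on the website for the authoritative definitions.

```yaml
# Hypothetical example; resource kinds and fields are assumptions,
# check the published API reference for the exact schema.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: base-model-pool
spec:
  # Selects the model server pods that serve the shared base model.
  selector:
    app: vllm-llama3-8b
  targetPortNumber: 8000
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chatbot
spec:
  # Client-facing model name used in OpenAI-compatible requests.
  modelName: chatbot
  criticality: Critical
  poolRef:
    name: base-model-pool
  # Incremental rollout: split traffic between two LoRA adapter versions.
  targetModels:
    - name: chatbot-lora-v1
      weight: 90
    - name: chatbot-lora-v2
      weight: 10
```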
It currently requires a version of vLLM that exposes the metrics needed to predict traffic load, as defined in the model server protocol. Support for Google's JetStream, NVIDIA Triton, text-generation-inference, and SGLang is coming soon.
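As a rough illustration of the kind of signals the load-aware scheduling consumes, a model server scrape might include queue depth and KV-cache utilization along these lines. The metric names below are vLLM-style Prometheus names shown only as an example, not the authoritative protocol definition:

```text
# Illustrative scrape output; see the model server protocol for the
# authoritative metric names and labels.
vllm:num_requests_running{model_name="base-model"} 4
vllm:num_requests_waiting{model_name="base-model"} 2
vllm:gpu_cache_usage_perc{model_name="base-model"} 0.63
```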
This project is alpha (0.1 release). It should not be used in production yet.
Follow our Getting Started Guide to get the inference-extension up and running on your cluster!
See our website at https://gateway-api-inference-extension.sigs.k8s.io/ for detailed API documentation on leveraging our Kubernetes-native declarative APIs.
Coming soon!
Follow this README to learn more about running the inference-extension end-to-end test suite on your cluster.
Our community meeting is held weekly on Thursdays at 10AM PDT (Zoom, Meeting Notes).
We currently use the #wg-serving Slack channel for communications.
Contributions are readily welcomed; follow the dev guide to get started!
Participation in the Kubernetes community is governed by the Kubernetes Code of Conduct.