This extension upgrades an ext-proc-capable proxy or gateway, such as Envoy Gateway, kGateway, or the GKE Gateway, into an inference gateway, built for inference platform teams self-hosting large language models on Kubernetes. It makes it easy to expose your local OpenAI-compatible chat completion endpoints to other workloads on or off the cluster and to control access to them, or to integrate your self-hosted models alongside model-as-a-service providers in a higher-level AI gateway such as LiteLLM, Solo AI Gateway, or Apigee.
The inference gateway:
- Improves the tail latency and throughput of LLM completion requests against Kubernetes-hosted model servers using an extensible request-scheduling algorithm that is KV-cache and request-cost aware, avoiding evictions and queueing as load increases
- Provides Kubernetes-native declarative APIs to route client model names to use-case-specific LoRA adapters and to control incremental rollout of new adapter versions, A/B traffic splitting, and safe blue-green base model and model server upgrades (see the example manifests after this list)
- Adds end-to-end observability around service objective attainment
- Ensures operational guardrails between different client model names, allowing a platform team to safely serve many different GenAI workloads on the same pool of shared foundation model servers for higher utilization and fewer required accelerators
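For illustration, the sketch below shows roughly how a platform team might map a client-facing model name to weighted LoRA adapter versions on a shared pool of model servers. The API group, version, and field names here are assumptions for the purpose of the example and may differ from the current schema; consult the API documentation on the website for the authoritative definitions.

```yaml
# Hypothetical example; resource kinds and fields are assumptions,
# check the published API reference for the exact schema.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: base-model-pool
spec:
  # Selects the model server pods that serve the shared base model.
  selector:
    app: vllm-llama3-8b
  targetPortNumber: 8000
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chatbot
spec:
  # Client-facing model name used in OpenAI-compatible requests.
  modelName: chatbot
  criticality: Critical
  poolRef:
    name: base-model-pool
  # Incremental rollout: split traffic between two LoRA adapter versions.
  targetModels:
    - name: chatbot-lora-v1
      weight: 90
    - name: chatbot-lora-v2
      weight: 10
```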
It currently requires a version of vLLM that exposes the metrics needed to predict traffic load, as defined in the model server protocol. Support for Google's JetStream, NVIDIA Triton, text-generation-inference, and SGLang is coming soon.
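As a rough illustration of the kind of signals the load-aware scheduling consumes, a model server scrape might include queue depth and KV-cache utilization along these lines. The metric names below are vLLM-style Prometheus names shown only as an example, not the authoritative protocol definition:

```text
# Illustrative scrape output; see the model server protocol for the
# authoritative metric names and labels.
vllm:num_requests_running{model_name="base-model"} 4
vllm:num_requests_waiting{model_name="base-model"} 2
vllm:gpu_cache_usage_perc{model_name="base-model"} 0.63
```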
This project is alpha (0.1 release). It should not be used in production yet.
Follow our Getting Started Guide to get the inference-extension up and running on your cluster!
See our website at https://gateway-api-inference-extension.sigs.k8s.io/ for detailed API documentation on leveraging our Kubernetes-native declarative APIs.
Coming soon!
Follow this README to learn more about running the inference-extension end-to-end test suite on your cluster.
Our community meeting is held weekly on Thursdays at 10AM PDT (Zoom, Meeting Notes).
We currently use the #wg-serving Slack channel for communications.
Contributions are readily welcomed; follow the dev guide to get started!
Participation in the Kubernetes community is governed by the Kubernetes Code of Conduct.