Overview
Once a fine-tuned language model has been deployed across multiple nodes using KubeRay and served via vLLM, the next step is often to make that model easily accessible to applications, tools, and users. This is where LiteLLM comes in: it acts as a lightweight, standardized proxy layer that exposes LLMs—regardless of their backend—as an OpenAI-compatible API.
Integrating a fine-tuned model running on vLLM (backed by KubeRay) into LiteLLM allows you to route requests to it as if it were any other model served by OpenAI, Anthropic, or HuggingFace. This abstraction enables drop-in compatibility with tools and libraries that already speak the OpenAI API format (like LangChain, AutoGen, LangGraph, and OpenWeb UI), without needing to modify backend-specific client logic.
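For example, an application built on an OpenAI SDK can often be repointed at LiteLLM through configuration alone. The lines below are a minimal sketch, with the hostname, key, and model name as placeholders; note that on the Cake platform requests must also carry the gateway JWT header described under Security below, which an off-the-shelf SDK would need to send as an extra default header.
# Point an OpenAI-SDK-based application at the LiteLLM proxy instead of OpenAI's cloud.
# Many recent OpenAI SDK clients read these two environment variables; older clients may need the base URL passed explicitly.
export OPENAI_BASE_URL='https://litellm.[my_platform_root_url]/v1'
export OPENAI_API_KEY='<LiteLLM virtual key>'
# The application can now request the fine-tuned model by name, e.g. "Llama-3.2-3B-Instruct", just as it would request any hosted model.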
Why Add KubeRay/vLLM Models into LiteLLM?
Unified API Across Diverse Backends
LiteLLM provides a single, consistent API interface for all LLMs—whether hosted by third parties or self-hosted via vLLM. By adding your fine-tuned model to LiteLLM, it becomes instantly usable with any frontend, orchestration framework, or agent system that expects the OpenAI API spec (e.g., /v1/chat/completions).
Simplified Integration into Applications
Applications don’t need to care whether a model is served from vLLM on Ray or from OpenAI’s cloud. LiteLLM handles the routing, authentication, and backend differences, allowing developers to switch models with a config change instead of rewriting code.
Access Control, Rate Limiting, and Observability
LiteLLM provides middleware capabilities such as token-based auth, rate limiting, logging, and tracing hooks (e.g., LangFuse integration). This is particularly useful when exposing your internal vLLM-deployed model to internal teams, external users, or production systems.
Multi-Model Routing and Fallbacks
With LiteLLM, you can route traffic between multiple models—e.g., use your fine-tuned vLLM model for specific tasks while falling back to OpenAI for general ones. This setup supports A/B testing, shadow deployments, and smooth transitions between model versions.
Infrastructure Decoupling and Maintainability
Instead of embedding serving logic or endpoint-specific code into each consuming application, you centralize the connection logic in LiteLLM. This decouples app teams from infrastructure changes and makes model upgrades or replacements seamless.
Adding a fine-tuned model served via KubeRay and vLLM into LiteLLM transforms it from a low-level system component into a fully integrated, API-accessible resource—ready for experimentation, monitoring, and production use. It provides a clean boundary between model infrastructure and application logic, enhancing both scalability and developer productivity.
Key References
The key reference documents for LiteLLM are located at:
LiteLLM Getting Started - Main location for LiteLLM info https://docs.litellm.ai/docs/
LiteLLM UI Quick Start - Discusses the LiteLLM Administration interface https://docs.litellm.ai/docs/proxy/ui
LiteLLM has a number of powerful features. We recommend reading about:
Spend Tracking https://docs.litellm.ai/docs/proxy/cost_tracking
Secrets Management https://docs.litellm.ai/docs/secret
MCP Tool Endpoints https://docs.litellm.ai/docs/mcp
Instructions
User management for LiteLLM
Required permissions
You must be a cluster admin with full Kubernetes access
Steps
If no one has set up LiteLLM, obtain the LiteLLM "Master Key" using the following command:
kubectl -n litellm get secret litellm -o jsonpath='{.data.masterKey}' | base64 -d
Note that a trailing % in the terminal output is just the shell marking the missing end-of-line; it is not part of the key! (A quick way to verify the key is shown after these steps.)
Log in at https://litellm.[my_platform_root_url]/ui with the username admin and the master key above as the password.
Feel free to invite other users as needed: https://litellm.[my_platform_root_url]/ui/?page=users
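As noted in the first step, you can confirm that the master key works against the API before using it in the UI. The call below is a minimal sketch, assuming you already have a gateway JWT token as described in the Security section below (the hostname is a placeholder):
# List the models registered in LiteLLM; a successful response confirms the master key is valid.
curl 'https://litellm.[my_platform_root_url]/models' -H 'Authorization: Bearer <master key from the previous step>' -H 'X-Cake-Authorization: <JWT Token>'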
Adding a Ray LLM model to LiteLLM
Requirements
You must have admin access in LiteLLM
You must have read access to the RayService configuration hosting your Ray LLM model
Steps
Obtain the RayService cluster internal hostname and port number. If you are using the default shared Ray Serve cluster whose dashboard is available on /rayserve/, then the hostname is ray-service-head-svc.ray-service.svc.cluster.local and the Ray Serve port is the default 8000.
Identify the name of the model. This is typically model_id in your Ray LLM configuration. (A sketch for verifying both the endpoint and the model name from a cluster shell follows these steps.)
In the LiteLLM Admin UI, visit Models > Add Model.
Provider: OpenAI-Compatible Endpoints (Together AI, etc.)
LiteLLM Model Name(s): Custom model name (Enter below)
Custom model name: The model_id identified in step 2, prefixed with openai/. For example, openai/Llama-3.2-3B-Instruct
If you have uploaded a LoRA,
Model mappings: leave as is
Mode: leave blank
Existing credentials: leave blank
API Base: http://[your_ray_service_hostname]:[your_ray_service_port]/v1
OpenAI API Key: enter any non-blank value (required placeholder, ignored)
Finally, click "Test Connect".
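If you want to double-check the values from the first two steps before filling in the form, the commands below are a hedged sketch: the namespace and service name assume the default shared Ray Serve cluster described above, and they presume the Ray LLM deployment exposes the usual OpenAI-compatible /v1/models route.
# Confirm the Ray Serve head service and its serve port (8000 by default).
kubectl -n ray-service get svc ray-service-head-svc
# List the model IDs served by Ray Serve from inside the cluster.
kubectl -n ray-service run ray-llm-check --rm -it --restart=Never --image=curlimages/curl --command -- curl -s http://ray-service-head-svc.ray-service.svc.cluster.local:8000/v1/models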
Security
The LiteLLM Security and Data Privacy guide is located at https://docs.litellm.ai/docs/data_security. NOTE: The Cake deployment of LiteLLM is what LiteLLM calls “self-hosted” in that documentation.
Cake does not currently trust the default security model of LiteLLM. Therefore, we deploy it behind an Envoy gateway that requires a JWT token, generated via OAuth 2.0 and attached to the request headers, before a LiteLLM request can pass through the gateway.
curl 'https://litellm.aidp.pwell.net/models' -H 'Authorization: Bearer <LiteLLM Key>' -H 'X-Cake-Authorization: <JWT Token>'
See Cake’s guide to accessing cluster resources externally for information on getting a JWT token. It is located here: Accessing Cake Platform Resources Externally
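Once a model has been added, requests in the OpenAI chat-completions format go through the same gateway. The call below is a minimal sketch, with the key and token as placeholders and the model name set to whatever LiteLLM lists for your model (e.g., Llama-3.2-3B-Instruct):
curl 'https://litellm.aidp.pwell.net/v1/chat/completions' -H 'Authorization: Bearer <LiteLLM Key>' -H 'X-Cake-Authorization: <JWT Token>' -H 'Content-Type: application/json' -d '{"model": "Llama-3.2-3B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'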
Troubleshooting
Common errors seen in LiteLLM are listed here: https://docs.litellm.ai/docs/proxy/debugging#common-errors
Logs
System logs for LiteLLM can be accessed via Lens. The important pods for LiteLLM are located in the litellm namespace. Generally, the litellm container log in the litellm pod is the most informative.
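If you prefer the command line over Lens, the same logs can be pulled with kubectl (the pod name will vary per deployment):
# Find the LiteLLM pod, then tail its litellm container log.
kubectl -n litellm get pods
kubectl -n litellm logs <litellm pod name> -c litellm --tail=200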
Metrics
Metrics for LiteLLM can be found in Grafana in the Kubernetes > Pods and Kubernetes > Namespace (pods) dashboards.