Add vLLM Model to LiteLLM

Overview

Once a fine-tuned language model has been deployed across multiple nodes using KubeRay and served via vLLM, the next step is often to make that model easily accessible to applications, tools, and users. This is where LiteLLM comes in: it acts as a lightweight, standardized proxy layer that exposes LLMs—regardless of their backend—as an OpenAI-compatible API.


Integrating a fine-tuned model running on vLLM (backed by KubeRay) into LiteLLM allows you to route requests to it as if it were any other model served by OpenAI, Anthropic, or HuggingFace. This abstraction enables drop-in compatibility with tools and libraries that already speak the OpenAI API format (like LangChain, AutoGen, LangGraph, and Open WebUI), without needing to modify backend-specific client logic.
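
As a minimal sketch of what this looks like from a client's perspective (the URL and key below are placeholders, and external calls may additionally require the gateway header described in the Security section), many tools built on the official OpenAI SDKs only need two settings changed to use a model served through LiteLLM:

# Point an OpenAI-compatible tool or SDK at the LiteLLM proxy instead of OpenAI.
# The URL and key are placeholders; substitute your platform root URL and a LiteLLM key.
export OPENAI_BASE_URL="https://litellm.[my_platform_root_url]/v1"
export OPENAI_API_KEY="<LiteLLM Key>"

Tools that do not read these environment variables generally expose equivalent "base URL" and "API key" settings in their own configuration.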

Why Add KubeRay/vLLM Models into LiteLLM?

  1. Unified API Across Diverse Backends
    LiteLLM provides a single, consistent API interface for all LLMs—whether hosted by third parties or self-hosted via vLLM. By adding your fine-tuned model to LiteLLM, it becomes instantly usable with any frontend, orchestration framework, or agent system that expects the OpenAI API spec (e.g., /v1/chat/completions).

  2. Simplified Integration into Applications
    Applications don’t need to care whether a model is served from vLLM on Ray or from OpenAI’s cloud. LiteLLM handles the routing, authentication, and backend differences, allowing developers to switch models with a config change instead of rewriting code.

  3. Access Control, Rate Limiting, and Observability
    LiteLLM provides middleware capabilities such as token-based auth, rate limiting, logging, and tracing hooks (e.g., LangFuse integration). This is particularly useful when exposing your internal vLLM-deployed model to internal teams, external users, or production systems.

  4. Multi-Model Routing and Fallbacks
    With LiteLLM, you can route traffic between multiple models—e.g., use your fine-tuned vLLM model for specific tasks while falling back to OpenAI for general ones. This setup supports A/B testing, shadow deployments, and smooth transitions between model versions.

  5. Infrastructure Decoupling and Maintainability
    Instead of embedding serving logic or endpoint-specific code into each consuming application, you centralize the connection logic in LiteLLM. This decouples app teams from infrastructure changes and makes model upgrades or replacements seamless.

Adding a fine-tuned model served via KubeRay and vLLM into LiteLLM transforms it from a low-level system component into a fully integrated, API-accessible resource—ready for experimentation, monitoring, and production use. It provides a clean boundary between model infrastructure and application logic, enhancing both scalability and developer productivity.

Key References

The key reference documents for LiteLLM are located at https://docs.litellm.ai/.

LiteLLM has a number of powerful features; in particular, we recommend reading about virtual keys and user management, model routing and fallbacks, and logging and observability.


Instructions

User management for LiteLLM

Required permissions

  • You must be a cluster admin with full Kubernetes access

Steps

  1. If no one has set up LiteLLM, obtain the LiteLLM "Master Key" using the following command:

kubectl -n litellm get secret litellm -o jsonpath='{.data.masterKey}' | base64 -d

Mind that a trailing % printed by your shell is an end-of-line marker (the secret has no trailing newline) and not part of the key!

  2. Log in at https://litellm.[my_platform_root_url]/ui with the username admin and the master key obtained above as the password.


  3. Feel free to invite other users here (a command-line alternative is sketched below): https://litellm.[my_platform_root_url]/ui/?page=users
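
If you prefer to script user creation rather than use the UI, the sketch below assumes the standard LiteLLM proxy user-management endpoint (/user/new), authenticated with the master key; the exact fields accepted can vary between LiteLLM versions, and the hostname, email address, and role are illustrative placeholders:

# Create (invite) a user via the LiteLLM proxy API instead of the UI.
curl -X POST 'https://litellm.[my_platform_root_url]/user/new' \
  -H 'Authorization: Bearer <Master Key>' \
  -H 'Content-Type: application/json' \
  -d '{"user_email": "teammate@example.com", "user_role": "internal_user"}'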


Adding a Ray LLM model to LiteLLM

Requirements

  • You must have admin access in LiteLLM

  • You must have read access to the RayService configuration hosting your Ray LLM model

Steps

  1. Obtain the RayService cluster-internal hostname and port number. If you are using the default shared Ray Serve cluster whose dashboard is available at /rayserve/, then the hostname is ray-service-head-svc.ray-service.svc.cluster.local and the Ray Serve port is the default 8000.

  2. Identify the name of the model. This is typically model_id in your Ray LLM configuration.

  3. In the LiteLLM Admin UI, visit Models > Add Model.

    1. Provider: OpenAI-Compatible Endpoints (Together AI, etc.)

    2. LiteLLM Model Name(s): Custom model name (Enter below)

    3. Custom model name: The model_id identified in step 2, prefixed with openai/. For example, openai/Llama-3.2-3B-Instruct

      1. If you have uploaded a LoRA, 

    4. Model mappings: leave as is

    5. Mode: leave blank

    6. Existing credentials: leave blank

    7. API Base: http://[your_ray_service_hostname]:[your_ray_service_port]/v1

    8. OpenAI API Key: enter any non-blank value (required placeholder, ignored)

  4. Finally, click "Test Connect". You can also verify the Ray Serve backend directly from the command line, as sketched below.
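
Before (or instead of) relying on "Test Connect", you can sanity-check the backend yourself. The sketch below assumes the default shared cluster described in step 1 and the example model_id from step 3; the curl commands must run from inside the cluster (for example, from a debug pod), and they assume the Ray LLM application exposes the standard OpenAI-compatible routes:

# Confirm the Ray Serve head service name and port (defaults from step 1).
kubectl -n ray-service get svc

# List the models exposed by the Ray Serve / vLLM OpenAI-compatible endpoint;
# the model_id from step 2 should appear in the response.
curl http://ray-service-head-svc.ray-service.svc.cluster.local:8000/v1/models

# Send a minimal chat completion directly to the backend.
# Note: the model name here is the raw model_id, not the openai/-prefixed LiteLLM name.
curl http://ray-service-head-svc.ray-service.svc.cluster.local:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "Llama-3.2-3B-Instruct", "messages": [{"role": "user", "content": "ping"}]}'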

Security

The LiteLLM Security and Data Privacy guide is at https://docs.litellm.ai/docs/data_security. NOTE: The Cake deployment of LiteLLM is what LiteLLM calls "self-hosted" in that documentation.

Cake does not currently trust the default security model of LiteLLM. Therefore, we deploy it behind an Envoy gateway that requires a JWT token, generated via OAuth 2.0 and attached as a request header, before a LiteLLM request can pass through the gateway. For example:

curl 'https://litellm.aidp.pwell.net/models' -H 'Authorization: Bearer <LiteLLM Key>' -H 'X-Cake-Authorization: <JWT Token>'


See Cake’s guide to accessing cluster resources externally for information on getting a JWT token. It is located here: Accessing Cake Platform Resources Externally
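
Putting the pieces together, a chat completion request for a model added above, sent from outside the cluster through the Envoy gateway, carries both headers. This is a sketch only: the hostname follows the earlier example, and the model name is a placeholder for whatever name you registered in LiteLLM:

curl 'https://litellm.aidp.pwell.net/v1/chat/completions' \
  -H 'Authorization: Bearer <LiteLLM Key>' \
  -H 'X-Cake-Authorization: <JWT Token>' \
  -H 'Content-Type: application/json' \
  -d '{"model": "<model_name_in_litellm>", "messages": [{"role": "user", "content": "Hello"}]}'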

Troubleshooting

Common errors seen in LiteLLM are documented here: https://docs.litellm.ai/docs/proxy/debugging#common-errors


Logs

System logs for LiteLLM can be accessed via Lens. The important pods for LiteLLM are located in the litellm namespace. Generally, the litellm container log in the litellm pod is the most informative.
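
If you prefer the command line to Lens, a minimal sketch (the pod name is a placeholder; look it up first):

# Find the LiteLLM pod(s), then tail the litellm container's logs.
kubectl -n litellm get pods
kubectl -n litellm logs <litellm-pod-name> -c litellm --tail=200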

Metrics

Metrics for LiteLLM can be found in Grafana, in the Kubernetes > Pods and Kubernetes > Namespace (pods) dashboards.