Overview
Deploying large, fine-tuned language models in production introduces significant challenges related to memory constraints, computational overhead, and serving performance. To address these issues at scale, the Cake AI platform combines KubeRay and vLLM, enabling models to be deployed across multiple nodes within a Kubernetes cluster. This architecture makes it possible to serve models with massive parameter sizes that would otherwise exceed the limits of a single machine.
KubeRay acts as the orchestration layer, bringing Ray’s distributed computing capabilities to Kubernetes. Ray allows components of a model—such as inference workers, tokenizers, or pipeline stages—to run as separate actors across different nodes. When paired with vLLM, a high-throughput inference engine built specifically for serving transformer models efficiently, the system supports multi-node, multi-GPU inference with dynamic batching and memory paging for optimal performance.
Why Use KubeRay + vLLM together for Multi-Node Model Serving?
Support for Large Model Weights Across Nodes
Fine-tuned LLMs, especially those based on 13B, 30B, or 65B parameter architectures, often require more GPU memory than any single node can provide. Ray’s distributed actor model, orchestrated by KubeRay, enables model sharding and parallelism across multiple nodes and GPUs. This makes it possible to deploy and serve enormous models by breaking them into manageable pieces that communicate efficiently at runtime.
High-Throughput Inference at Scale
vLLM’s optimized engine delivers continuous batching, efficient prefill/decode separation, and paged attention—resulting in low-latency, high-throughput serving even under heavy concurrent loads. When deployed with Ray across multiple nodes, this allows the platform to serve many users simultaneously without degradation in response time or model quality.
Dynamic Cluster Scaling and Resource Management
KubeRay provides Kubernetes-native orchestration of Ray clusters, automatically handling node provisioning, task distribution, and failure recovery. As load increases, the system can scale horizontally—spinning up new pods and assigning Ray actors dynamically to make full use of the available GPU and CPU resources across the cluster.
Seamless Deployment and Observability
With declarative configuration via Kubernetes manifests, deploying a fine-tuned model across a Ray cluster becomes streamlined and repeatable. The infrastructure also integrates easily with observability stacks—such as Prometheus, Grafana, and LangFuse—providing deep visibility into system behavior and model performance across distributed nodes.
KubeRay and vLLM together enable scalable, fault-tolerant, and memory-efficient deployment of fine-tuned models that would otherwise be infeasible. This multi-node deployment strategy unlocks production-grade performance for even the largest LLMs, while maintaining flexibility, observability, and operational simplicity.
Key References
The key reference documents for Ray, KubeRay and vLLM are:
Ray LLM - This is the Ray API that wraps vLLM https://docs.ray.io/en/latest/serve/llm/serving-llms.html
KubeRay - KubeRay integrates Ray with Kubernetes, providing RayService which manages Ray Clusters and Ray Serve applications https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/rayservice-quick-start.html#kuberay-rayservice-quickstart
vLLM - vLLM is the core inference engine, and is extensively documented. In general, Ray LLM’s documentation should be consulted first, as it uses vLLM in a specific way to achieve distributed inference https://docs.vllm.ai/en/latest/usage/index.html
Ray LLM, KubeRay and vLLM combine to create a powerful distributed inference system. These components are on the cutting edge of development, so extra care should be taken to ensure that the correct versions and configuration are used. The approach and the issues to watch out for are documented below.
Instructions
Initial Setup
Cake has already deployed the infrastructure for Ray, KubeRay and vLLM. Cake has also deployed a default KubeRay service cluster. It is generally best practice to deploy multiple KubeRay service clusters in your environment.
Production Note
Production models should be served from a cluster running in a namespace with limited access. To create additional clusters, create a new RayService YAML that uses your preferred namespace to launch worker pods. You can manually deploy these YAMLs via:
kubectl apply -f myservice.yaml
However, this cluster may disappear since it is not deployed as a GitOps resource. To ensure its long-term survival, we recommend deploying it as a Cake overlay. See the GitOps and Overlays help document here:
Weight Caching
Base model weights should be cached on the CakeFS shared filesystem. This is available in Jupyter notebooks as /home/jovyan/shared and in Ray as /home/ray/shared. The model weights can be stored anywhere in the shared drive, but we recommend shared/models/. Even though Ray supports downloading base model weights from HuggingFace at startup, we do not recommend this for several reasons:
Having models outside of the cluster adds a point of failure
HuggingFace will throttle your weight downloads without enterprise credentials
Model providers can modify weights for an existing model version. This may cause a performance degradation that would not be seen until a Ray worker was restarted and redownloaded weights from HuggingFace
Cached weights will load faster in vLLM
Instead, we recommend downloading your model weights and caching them in CakeFS.
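If you are working from a Jupyter notebook, a minimal sketch of this download using the huggingface_hub package (assumed to be installed in your environment; the repo ID and token handling are illustrative) looks like the following:

from huggingface_hub import snapshot_download

# Downloads into the standard HuggingFace cache layout (models--<org>--<name>/snapshots/<id>),
# which matches the model_source path format used in the RayService YAML below.
snapshot_path = snapshot_download(
    repo_id="meta-llama/Llama-3.3-70B-Instruct",
    cache_dir="/home/jovyan/shared/models",  # the same location is /home/ray/shared/models on a Ray worker
    # token="<HF_TOKEN>",                    # required for gated models such as Llama
)
print(snapshot_path)  # re-root this path at /home/ray/shared/models to use it as model_source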
If you are using LoRA adapters, they must be stored in an s3 bucket accessible to the Ray workers so they can download the adapters at runtime. By default, we recommend using MLflow’s adapter storage; bucket access has already been granted for Ray workers. You can also cut and paste the s3 location from MLflow’s UI for use in your KubeRay YAML.
NOTE: Due to a current requirement of Ray LLM, LoRA adapters cannot live on a shared filesystem; they must live in object storage such as s3.
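If you prefer to look up the S3 location programmatically rather than copying it from the MLflow UI, a minimal sketch using the MLflow client (the model name and version are placeholders) is:

from mlflow import MlflowClient

# Returns the artifact location of a registered model version, e.g. an s3:// path
# that can be referenced from dynamic_lora_loading_path in the RayService YAML.
client = MlflowClient()
uri = client.get_model_version_download_uri("<REGISTERED_MODEL_NAME>", "<VERSION>")
print(uri)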
Defining the RayService
A default Cake RayService is defined in the platform/deploys section of your GitHub repository:
NOTE: This location is changing soon in a new version of Cake. We are moving to project-based Ray clusters and moving all defined Ray clusters to ‘Cake Overlays’.
Cake provides a number of examples, including one for the multi-node LLM deployment.
The key section of the YAML example below, for the definition of the LLM service, is the ‘Server Deployment’ section:
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: 70b-shared
  namespace: ray-service
spec:
  # -------- Server Deployment --------
  serveConfigV2: |
    applications:
      - name: 70b-shared
        route_prefix: "/"
        import_path: "ray.serve.llm:build_openai_app"
        runtime_env:
          env_vars:
            VLLM_USE_V1: "0"
            ENGINE_START_TIMEOUT_S: "1600"
            SERVE_LOG_LEVEL: "DEBUG"
            # Ray DEBUG traces are on. Options: WARNING, ERROR, INFO
            VLLM_LOGGING_LEVEL: "DEBUG"
            # vLLM DEBUG traces are on. Options: WARNING, ERROR, INFO
        args:
          llm_configs:
            - model_loading_config:
                # -- base model -----------
                model_id: "llama-3.3-70B-Instruct"
                # name of model used by vLLM
                model_source: "/home/ray/shared/models/models--meta-llama--Llama-3.3-70B-Instruct/snapshots/<SNAPSHOT_ID>"
                # shared location of the base model snapshot on the shared CakeFS drive (AWS EFS)
              # -- lora attachments ---------
              lora_config:
                dynamic_lora_loading_path: "s3://<BUCKET_URL>/<MLFLOW_BUCKET_PATH>/artifacts" # it is important that LoRAs are in s3. Easiest to pull directly from MLflow
              # -- deployment tuning ---------
              deployment_config:
                health_check_timeout_s: 600
                autoscaling_config:
                  min_replicas: 1
                  # how many instances of the model you want
                  max_replicas: 1
              engine_kwargs:
                tensor_parallel_size: 4
                # determines number of GPUs on a single node
                pipeline_parallel_size: 2
                # determines number of nodes to split layers on
                tokenizer_pool_size: 2
                # configured based on CPU resources, parallelizing the text preprocessing step
                tokenizer_pool_extra_config: "{\"runtime_env\": {}}"
                max_model_len: 8192
                gpu_memory_utilization: 0.96
                trust_remote_code: true
  # -------- Ray cluster topology --------
  rayClusterConfig:
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-llm:2.47.0-py311-cu124
              volumeMounts:
                - name: ray-service-shared-volume
                  mountPath: /home/ray/shared
          volumes:
            - name: ray-service-shared-volume
              persistentVolumeClaim:
                claimName: shared-volume
    workerGroupSpecs:
      - groupName: gpu-workers
        rayStartParams:
          # num-cpus: "1" # DO NOT SET THIS, it will cause the worker to hang when it is unable to start a ray task to download the model
          num-gpus: "4"
        minReplicas: 1
        maxReplicas: 1
        template:
          spec:
            nodeSelector:
              node.kubernetes.io/instance-type: g6e.12xlarge
              # hint for scheduler to select nodes
            containers:
              - name: ray-worker
                image: rayproject/ray-llm:2.47.0-py311-cu124
                # determines what libraries and Python version are available to your code
                resources:
                  limits:
                    nvidia.com/gpu: "4" # scheduler hint
                volumeMounts:
                  - name: ray-service-shared-volume
                    mountPath: /home/ray/shared
                    # mount CakeFS shared drive
            volumes:
              - name: ray-service-shared-volume
                persistentVolumeClaim:
                  claimName: shared-volume
Ray Service Configuration
Basic configuration of the service involves setting the model_id and model_source.
Additionally, two key configuration parameters should be tuned to the hardware:
The ‘tensor_parallel_size’ determines how many GPUs are used on each node.
The ‘pipeline_parallel_size’ determines how many nodes the model’s layers are split across.
Our example is configured to run on a two-node system with 4 GPUs per node (for example, two g6e.12xlarge instances), and so it uses tensor_parallel_size: 4 and pipeline_parallel_size: 2, for 8 GPUs in total.
Additional configuration options such as gpu_memory_utilization and max_model_len should be tuned on a per-model basis and depend on the specific use case and hardware being used. See vLLM’s optimization guide for more information: https://docs.vllm.ai/en/latest/configuration/optimization.html#multimodal-models
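As a quick sanity check on the sizing, the number of GPUs used by one model replica is the product of the two parallelism settings, and it must fit within the GPUs provided by the worker group. A small illustrative calculation, matching the example above:

# One replica uses tensor_parallel_size * pipeline_parallel_size GPUs,
# which must fit within nodes * gpus_per_node of the worker group.
tensor_parallel_size = 4    # GPUs per node (a g6e.12xlarge has 4 GPUs)
pipeline_parallel_size = 2  # nodes the layers are split across
nodes, gpus_per_node = 2, 4

gpus_needed = tensor_parallel_size * pipeline_parallel_size
assert gpus_needed <= nodes * gpus_per_node, "parallelism exceeds available GPUs"
print(f"One replica uses {gpus_needed} GPUs across {pipeline_parallel_size} nodes")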
Calling the deployed LLM
After deploying, the RayService is available inside the cluster at the following location:
ray-service-head-svc.ray-service.svc.cluster.local:8000/<service-name>/
Since this is vLLM under the hood, you can use any of the vLLM API calls. Since vLLM is OpenAI-compatible, any typical OpenAI API call will work. For further information, see vLLM’s documentation.
For example, you can list the available models that have been deployed. If you’ve deployed LoRAs, they will be shown as well:
curl ray-service-head-svc.ray-service.svc.cluster.local:8000/llama33-70b/v1/models | jq .
{
  "data": [
    {
      "id": "meta-llama/Llama-3.2-1B-Instruct",
      "object": "model",
      "owned_by": "organization-owner",
      "permission": [],
      "rayllm_metadata": {
        "model_id": "meta-llama/Llama-3.2-1B-Instruct",
        "input_modality": "text",
        "max_request_context_length": null
      }
    }
  ],
  "object": "list"
}
An example of an API call for text completion is:
curl ray-service-head-svc.ray-service.svc.cluster.local:8000/llama33-70b/v1/completions -H "Content-Type: application/json" -d '{ "model": "meta-llama/Llama-3.2-1B-Instruct", "prompt": "how do i fix my car"}'
{
  "id": "meta-llama/Llama-3.2-1B-Instruct-ab741eee-5852-4ee0-b0a9-05cb14bfac78",
  "object": "text_completion",
  "created": 1750267230,
  "model": "meta-llama/Llama-3.2-1B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": "' Learning to DIY: If you're handy with tools, you can learn to replace your car's lighting systems yourself. However,....",
      "logprobs": {
        "text_offset": [],
        "token_logprobs": [],
        "tokens": [],
        "top_logprobs": []
      },
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 338,
    "completion_tokens": 331,
    "prompt_tokens_details": null
  }
}
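Because the endpoint is OpenAI-compatible, you can also call it from Python. The sketch below uses the openai package (assumed to be installed); the route prefix and model name must match your RayService configuration, and the placeholder API key is only there because the client requires one:

from openai import OpenAI

# Point the client at the in-cluster Ray Serve endpoint; "llama33-70b" is the route prefix
# used in the curl examples above and must match your deployment.
client = OpenAI(
    base_url="http://ray-service-head-svc.ray-service.svc.cluster.local:8000/llama33-70b/v1",
    api_key="unused",
)

# List the deployed base model and any LoRA adapters
for model in client.models.list():
    print(model.id)

# Text completion against the deployed model_id
completion = client.completions.create(
    model="llama-3.3-70B-Instruct",
    prompt="how do i fix my car",
    max_tokens=128,
)
print(completion.choices[0].text)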
Automating Deployment with a Workflow Engine
Cake includes multiple workflow and pipeline engines. These applications are designed so that you can automate tasks as collections of containers that run as part of a Directed Acyclic Graph (DAG). In many cases, it makes sense to use one of the DAG engines together with MLflow. The flow can be something like:
This workflow can be run on a recurring basis for different models, looking for changes.
Production Considerations
Production Namespaces
It is totally fine (and even preferred) to have data scientists launching RayServices and clusters in their own personal or team project namespaces; however, this is not safe for critical models running in production. Cake has been built with this dichotomy in mind. Cake recommends running production models in a Kubernetes Namespace that limits access to only Cluster Administrators and specific ServiceAccounts.
Deployment NodePool
Where your production model will be deployed is something you should consider. Cake is deployed in a Kubernetes cluster that can autoscale. The Kubernetes scheduler will work to place your workloads on underutilized capacity, which is great from a cost perspective.
However, when you have Cake workloads that are critical, it is generally better to isolate them onto their own production nodes. This can be accomplished by tainting a set of nodes and adding a pod toleration in your deployment YAML files. If you want a Karpenter NodePool created with a specific production taint, let Cake support know the taint you would prefer and the node types that should be available. Once you have tainted nodes, you will need to add tolerations to your KubeRay head and worker pod specs in your YAML:
tolerations:
  - key: "production"
    operator: "Exists"
    effect: "NoSchedule"
Taints and Tolerations are discussed here: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
Karpenter NodePool discussion here: https://karpenter.sh/docs/concepts/nodepools/
GitOps
It is generally better to standardize the deployment of production models rather than doing things ad hoc (which may make sense during model development and testing). One good way to standardize deployments is via a CI/CD tool like GitHub Actions. The actual flow will depend on your CI/CD system and your specific preferred process, but the high-level steps are usually the same:
A data scientist promotes an experiment to a versioned model in MLflow’s model registry.
A data scientist declares a fine-tuned model ready for production by aliasing the model version with some indicator that it is production ready. Examples are aliases like ‘Champion’ or ‘Challenger’.
A CI/CD task looks for a model version with a particular alias and, if it has changed, updates a RayService YAML with the new model info and checks the updated YAML into the appropriate overlay location in the Cake cluster Git repo.
Code for getting the model version associated with an MLflow alias is below:
from mlflow import MlflowClient

def print_model_info(rm):
    print("--Model--")
    print("name: {}".format(rm.name))
    print("aliases: {}".format(rm.aliases))

def print_model_version_info(mv):
    print("--Model Version--")
    print("Name: {}".format(mv.name))
    print("Version: {}".format(mv.version))
    print("Aliases: {}".format(mv.aliases))

# Call the model registry API
name = "<REGISTERED_MODEL_NAME>"  # name of the registered model in MLflow
client = MlflowClient()
client.create_registered_model(name)  # only needed if the registered model does not already exist

# Get model version by alias
alias_mv = client.get_model_version_by_alias(name, "test-alias")
print()
print_model_version_info(alias_mv)
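The YAML-update step in the CI/CD task can be sketched in Python as well. This is only an illustration: it assumes PyYAML is available, uses placeholder file and model names, and which field you rewrite (dynamic_lora_loading_path, model_source, etc.) depends on how your fine-tuned weights are deployed:

import yaml
from mlflow import MlflowClient

# Find the production-aliased model version and its artifact location in S3
client = MlflowClient()
mv = client.get_model_version_by_alias("<REGISTERED_MODEL_NAME>", "Champion")
artifact_uri = client.get_model_version_download_uri(mv.name, mv.version)

# serveConfigV2 is itself a YAML string embedded in the RayService resource, so parse it separately
with open("myservice.yaml") as f:
    rayservice = yaml.safe_load(f)
serve_config = yaml.safe_load(rayservice["spec"]["serveConfigV2"])

# Point the LoRA loading path at the new artifacts (adjust to your layout)
llm_config = serve_config["applications"][0]["args"]["llm_configs"][0]
llm_config["lora_config"]["dynamic_lora_loading_path"] = artifact_uri

# Serialize the serve config back into the manifest and write it out for the Git commit
rayservice["spec"]["serveConfigV2"] = yaml.safe_dump(serve_config)
with open("myservice.yaml", "w") as f:
    yaml.safe_dump(rayservice, f)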
Information on adding things as overlays to Cake is here: Cake GitOps and Overlays
NOTE: Long term, we expect MLflow will add a Webhooks API for alias changes, which could simplify the YAML update. See: https://github.com/MLflow/MLflow/issues/14677
(Diagram: ArgoCD updates the RayService running in production)
Security
Key References
An extensive discussion of KubeRay security is available here:
https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/kuberay-auth.html
General Ray security information is available here:
https://docs.ray.io/en/latest/ray-security/index.html
Cake Security
Cake protects all resources running in its cluster with an Envoy gateway. This gateway prevents any access to Cake models without a valid JWT token. This means that by default, all models are only callable internally. We recommend using LiteLLM to proxy models for external access.
However, LiteLLM is focused on transformer models. There may be models that need to be called externally without a LiteLLM proxy. For each such model, various cluster resources will need to be created, including Istio VirtualServices and Gateways. Please contact Cake support if you need to proxy a model.
If you want to make the model available externally, follow this guide:
Accessing Cake Resources Externally
Troubleshooting
A troubleshooting guide for Ray Services is here:
http://docs.ray.io/en/latest/cluster/kubernetes/troubleshooting/rayservice-troubleshooting.html
General troubleshooting info for Ray is here:
https://docs.ray.io/en/latest/cluster/kubernetes/troubleshooting/troubleshooting.html
A troubleshooting guide for vLLM is here:
https://docs.vllm.ai/en/stable/usage/troubleshooting.html
Logs and Metrics
System logs and metrics for Ray and vLLM are discussed in this help document:
Monitoring your Ray Deployed Models with Prometheus and Grafana