Deploying a fine-tuned model across multiple nodes with KubeRay and vLLM

Overview

Deploying large, fine-tuned language models in production introduces significant challenges related to memory constraints, computational overhead, and serving performance. To address these issues at scale, the Cake AI platform combines KubeRay and vLLM, enabling models to be deployed across multiple nodes within a Kubernetes cluster. This architecture makes it possible to serve models with massive parameter sizes that would otherwise exceed the limits of a single machine.

KubeRay acts as the orchestration layer, bringing Ray’s distributed computing capabilities to Kubernetes. Ray allows components of a model—such as inference workers, tokenizers, or pipeline stages—to run as separate actors across different nodes. When paired with vLLM, a high-throughput inference engine built specifically for serving transformer models efficiently, the system supports multi-node, multi-GPU inference with dynamic batching and memory paging for optimal performance.

Why Use KubeRay + vLLM Together for Multi-Node Model Serving?

  1. Support for Large Model Weights Across Nodes
    Fine-tuned LLMs, especially those based on 13B, 30B, or 65B parameter architectures, often require more GPU memory than any single node can provide. Ray’s distributed actor model, orchestrated by KubeRay, enables model sharding and parallelism across multiple nodes and GPUs. This makes it possible to deploy and serve enormous models by breaking them into manageable pieces that communicate efficiently at runtime.

  2. High-Throughput Inference at Scale
    vLLM’s optimized engine delivers continuous batching, efficient prefill/decode separation, and paged attention—resulting in low-latency, high-throughput serving even under heavy concurrent loads. When deployed with Ray across multiple nodes, this allows the platform to serve many users simultaneously without degradation in response time or model quality.

  3. Dynamic Cluster Scaling and Resource Management
    KubeRay provides Kubernetes-native orchestration of Ray clusters, automatically handling node provisioning, task distribution, and failure recovery. As load increases, the system can scale horizontally—spinning up new pods and assigning Ray actors dynamically to make full use of the available GPU and CPU resources across the cluster.

  4. Seamless Deployment and Observability
    With declarative configuration via Kubernetes manifests, deploying a fine-tuned model across a Ray cluster becomes streamlined and repeatable. The infrastructure also integrates easily with observability stacks—such as Prometheus, Grafana, and LangFuse—providing deep visibility into system behavior and model performance across distributed nodes.

KubeRay and vLLM together enable scalable, fault-tolerant, and memory-efficient deployment of fine-tuned models that would otherwise be infeasible to serve. This multi-node deployment strategy unlocks production-grade performance for even the largest LLMs, while maintaining flexibility, observability, and operational simplicity.

Key References

The key reference documents for Ray, KubeRay and vLLM are:

  • Ray Serve: https://docs.ray.io/en/latest/serve/index.html

  • KubeRay: https://docs.ray.io/en/latest/cluster/kubernetes/index.html

  • vLLM: https://docs.vllm.ai/en/latest/

Ray LLM, KubeRay and vLLM combine to create a powerful distributed inference system. These components are on the cutting edge of development, so take extra care to use the correct versions and configuration. The sections below document the approach and the issues to watch out for.

Instructions

Initial Setup

Cake has already deployed the infrastructure for Ray, KubeRay and vLLM. Cake has also deployed a default KubeRay service cluster. It is generally best practice to deploy multiple KubeRay service clusters in your environment.

Production Note

Production models should be served from a cluster running in a namespace with limited access. To create additional clusters, create a new RayService YAML that specifies your preferred namespace for launching worker pods. You can deploy these YAMLs manually with:

kubectl apply -f myservice.yaml

However, a cluster deployed this way may disappear because it is not managed as a GitOps resource. To ensure its long-term survival, we recommend deploying it as a Cake overlay. See the GitOps and Overlays help document here:

Cake Overlays

Weight Caching

Base model weights should be cached on the CakeFS shared filesystem. This is available in Jupyter notebooks as /home/jovyan/shared and in Ray as /home/ray/shared. The model weights can be stored anywhere on the shared drive, but we recommend shared/models/. Even though Ray supports downloading base model weights from HuggingFace at startup, we do not recommend this for several reasons:

  1. Having models outside of the cluster adds a point of failure

  2. HuggingFace will throttle your weight downloads without enterprise credentials

  3. Model providers can modify weights for an existing model version. This may cause a performance degradation that would not be seen until a Ray worker was restarted and re-downloaded weights from HuggingFace

  4. Cached weights will load faster in vLLM

Instead, we recommend downloading your model weights once and caching them in CakeFS, as sketched below.
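
For example, from a Jupyter notebook you can pre-populate the cache with huggingface_hub. The snippet below is a minimal sketch, assuming the shared/models/ convention above and that your environment has access to the model on HuggingFace:

# Download the base model weights into the shared CakeFS cache once, so Ray
# workers can load them from /home/ray/shared/models at startup.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="meta-llama/Llama-3.3-70B-Instruct",
    cache_dir="/home/jovyan/shared/models",  # same location as /home/ray/shared/models on Ray workers
)
print(local_path)  # ends in .../models--meta-llama--Llama-3.3-70B-Instruct/snapshots/<snapshot id>

The resulting snapshot path is what you reference as model_source in the RayService YAML shown later in this document.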

If you are using LoRA adapters, they must be stored in an S3 bucket accessible to the Ray workers so they can download the adapters. By default, we recommend using MLflow’s adapter storage; bucket access has already been granted to the Ray workers. You can cut and paste the S3 location from MLflow’s UI for use in your KubeRay YAML.

NOTE: Due to a current requirement of Ray LLM, LoRA adapters cannot live on a shared filesystem; they must live in object storage such as S3.
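
If you prefer not to copy the path by hand, the adapter's S3 location can also be looked up programmatically from MLflow. This is a minimal sketch, assuming the adapter was registered as an MLflow model version and that MLflow's artifact store is the S3 bucket referenced above; the model name and version are placeholders:

# Resolve the S3 artifact location of a registered LoRA adapter from MLflow,
# for use as dynamic_lora_loading_path in the RayService YAML.
from mlflow import MlflowClient

client = MlflowClient()
download_uri = client.get_model_version_download_uri("my-lora-adapter", "3")  # placeholder name/version
print(download_uri)  # e.g. s3://<bucket>/<path>/artifacts/...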

Defining the RayService

Cake’s default RayService is defined in the platform/deploys section of your GitHub repository:

https://github.com/<company>/<repo>/blob/main/platform/deploys/prod-us-west-2/overlays/ray-service/patches/ray.yaml

NOTE: This location is changing soon in a new version of Cake. We are moving to project-based Ray clusters and moving all defined Ray clusters to ‘Cake Overlays’.

Cake provides a number of examples, including one for the multi-node LLM deployment.

The key section of the YAML example below, for the definition of the LLM service, is the ‘Server Deployment’ section:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: 70b-shared
  namespace: ray-service
spec:
  # -------- Server Deployment --------
  serveConfigV2: |
    applications:
      - name: 70b-shared
        route_prefix: "/"
        import_path: "ray.serve.llm:build_openai_app"
        runtime_env:
          env_vars:
            VLLM_USE_V1: "0"
            ENGINE_START_TIMEOUT_S: "1600"
            SERVE_LOG_LEVEL: "DEBUG"    # Ray Serve DEBUG traces are on. Options: WARNING, ERROR, INFO
            VLLM_LOGGING_LEVEL: "DEBUG" # vLLM DEBUG traces are on. Options: WARNING, ERROR, INFO
        args:
          llm_configs:
            - model_loading_config:
                # -- base model -----------
                model_id: "llama-3.3-70B-Instruct" # name of model used by vLLM
                model_source: "/home/ray/shared/models/models--meta-llama--Llama-3.3-70B-Instruct/snapshots/<SNAPSHOT_ID>" # location of the base model snapshot on the shared CakeFS drive (AWS EFS)
              # -- lora attachments ---------
              lora_config:
                dynamic_lora_loading_path: "s3://<BUCKET_URL>/<MLFLOW_BUCKET_PATH>/artifacts" # it's important that LoRAs are in s3; easiest to pull directly from MLflow
              # -- deployment tuning ---------
              deployment_config:
                health_check_timeout_s: 600
                autoscaling_config:
                  min_replicas: 1 # how many instances of the model you want
                  max_replicas: 1
              engine_kwargs:
                tensor_parallel_size: 4
                # determines number of GPUs on a single node
                pipeline_parallel_size: 2 
                # determines number of nodes to split layers on 
                tokenizer_pool_size: 2  
                # configured based on CPU resources, parallelizing the text preprocessing step
                tokenizer_pool_extra_config: "{\"runtime_env\": {}}"
                max_model_len: 8192
                gpu_memory_utilization: 0.96
                trust_remote_code: true
  # -------- Ray cluster topology --------
  rayClusterConfig:
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-llm:2.47.0-py311-cu124
              volumeMounts:
                    - name: ray-service-shared-volume
                      mountPath: /home/ray/shared
          volumes:
            - name: ray-service-shared-volume
              persistentVolumeClaim:
                claimName: shared-volume
    workerGroupSpecs:
      - groupName: gpu-workers
        rayStartParams:
          # num-cpus: "1" # DO NOT SET THIS; it will cause the worker to hang when it is unable to start a Ray task to download the model
          num-gpus: "4"
        minReplicas: 1
        maxReplicas: 1
        template:
          spec:
            nodeSelector:
              node.kubernetes.io/instance-type: g6e.12xlarge 
               # hint for scheduler to select nodes 
            containers:
              - name: ray-worker
                image: rayproject/ray-llm:2.47.0-py311-cu124      
                # determines what libraries and python avail for code
                resources:
                  limits:
                    nvidia.com/gpu: "4" #scheduler hint
                volumeMounts:
                  - name: ray-service-shared-volume
                    mountPath: /home/ray/shared 
                    # mount CakeFS shared drive
            volumes:
              - name: ray-service-shared-volume
                persistentVolumeClaim:
                  claimName: shared-volume

Ray Service Configuration

Basic configuration of the service involves setting the model_id and model_source.

Additionally, two key configuration parameters should be tuned to match your hardware:

  • The ‘tensor_parallel_size’ determines how many GPUs will be used on each node.

  • The ‘pipeline_parallel_size’ determines how many nodes the model’s layers are split across.

Our example is configured to run on a two-node system with 4 GPUs per node (for example, two g6e.12xlarge instances), so it uses tensor_parallel_size: 4 and pipeline_parallel_size: 2, sharding the model across 8 GPUs in total.

Additional configuration options such as gpu_memory_utilization and max_model_len should be tuned on a per-model basis and depend on the specific use case and hardware being used. See vLLM’s optimization guide for more information: https://docs.vllm.ai/en/latest/configuration/optimization.html#multimodal-models

Calling the deployed LLM

After deploying, the RayService is available inside the cluster at the following location:

ray-service-head-svc.ray-service.svc.cluster.local:8000/<service-name>/

Since this is vLLM under the hood, you can use any of the vLLM API calls, and because vLLM is OpenAI-compatible, any typical OpenAI API call will work. For further information, see vLLM’s documentation.

For example, you can list the models that have been deployed; if you have deployed LoRA adapters, they will be shown as well:

curl ray-service-head-svc.ray-service.svc.cluster.local:8000/llama33-70b/v1/models | jq .

{
  "data": [
    {
      "id": "meta-llama/Llama-3.2-1B-Instruct",
      "object": "model",
      "owned_by": "organization-owner",
      "permission": [],
      "rayllm_metadata": {
        "model_id": "meta-llama/Llama-3.2-1B-Instruct",
        "input_modality": "text",
        "max_request_context_length": null
      }
    }
  ],
  "object": "list"
}

An example of an API call for text completion is:

curl ray-service-head-svc.ray-service.svc.cluster.local:8000/llama33-70b/v1/completions -H "Content-Type: application/json"   -d '{ "model": "meta-llama/Llama-3.2-1B-Instruct", "prompt": "how do i fix my car"}' 
{
  "id": "meta-llama/Llama-3.2-1B-Instruct-ab741eee-5852-4ee0-b0a9-05cb14bfac78",
  "object": "text_completion",
  "created": 1750267230,
  "model": "meta-llama/Llama-3.2-1B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": "' Learning to DIY: If you're handy with tools, you can learn to replace your car's lighting systems yourself. However,....",
      "logprobs": {
        "text_offset": [],
        "token_logprobs": [],
        "tokens": [],
        "top_logprobs": []
      },
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 338,
    "completion_tokens": 331,
    "prompt_tokens_details": null
  }
}
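
The same endpoint can also be called from Python with the standard OpenAI client, since the API is OpenAI-compatible. This is a minimal sketch; the base_url path and model name are taken from the examples above, and the assumption that no real API key is needed for in-cluster calls may not hold if you have added an auth proxy in front of the service:

# Call the Ray Serve / vLLM endpoint with the OpenAI Python client.
from openai import OpenAI

client = OpenAI(
    base_url="http://ray-service-head-svc.ray-service.svc.cluster.local:8000/llama33-70b/v1",
    api_key="unused",  # placeholder; the in-cluster examples above do not pass a real key
)

response = client.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    prompt="how do i fix my car",
    max_tokens=256,
)
print(response.choices[0].text)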

Automating Deployment with a Workflow Engine

Cake includes multiple Workflow and Pipeline engines. These applications are designed so that you can automate tasks as collections of containers that run as part of a Directed Acyclic Graph (DAG). In many cases, it makes sense to combine one of the DAG engines with MLflow. The flow can be something like: poll the MLflow model registry for a newly promoted model version, update the RayService YAML with the new model or adapter location, and apply or commit the change.

This workflow can be run on a recurring basis for different models, looking for changes.

Production Considerations

Production Namespaces

It is totally fine (and even preferred) to have data scientists launching RayServices and Clusters in their own personal or team project namespaces; however, this is not safe for critical models running in production. Cake has been built with this dichotomy in mind. Cake recommends running production models in a Kubernetes Namespace that limits access to only Cluster Administrators and specific ServiceAccounts.

Deployment NodePool

Where your production model will deploy is something you should consider. Cake is deployed in a Kubernetes cluster that can autoscale. The Kubernetes scheduler will work to deploy your workloads on underutilized capacity. This is great from a cost perspective.

However, when you have Cake workloads that are critical, it is generally better to isolate them onto their own production nodes. This can be accomplished by tainting a set of nodes and adding a pod toleration in your deployment YAML files. If you want a Karpenter NodePool created with a specific production taint, let Cake support know the taint you would prefer and the node types that should be available. Once you have tainted nodes, you will need to add tolerations to the head and worker pod specs in your KubeRay YAML:

tolerations:
- key: "production"
  operator: "Exists"
  effect: "NoSchedule"

Taints and Tolerations are discussed here: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/

Karpenter NodePool discussion here:  https://karpenter.sh/docs/concepts/nodepools/

GitOps

It is generally better to standardize the deployment of production models rather than doing things ad hoc (which may make sense during model development and testing). One good way to standardize deployments is via a CI/CD tool like GitHub Actions. The actual flow will depend on your CI/CD system and your specific preferred process, but the high-level steps are usually the same:

  1. A data scientist promotes an experiment to a versioned model in MLflow’s model registry.

  2. The data scientist declares a fine-tuned model ready for production by aliasing the model version with an indicator that it is production ready, such as ‘Champion’ or ‘Challenger’.

  3. A CI/CD task looks for a model version with a particular alias; if it has changed, the task updates the RayService YAML with the new model info and checks the updated YAML into the appropriate overlay location in the Cake cluster Git repo.

Code for getting the model version associated with an MLflow alias is shown below:

from mlflow import MlflowClient

def print_model_info(rm):
    print("--Model--")
    print("name: {}".format(rm.name))
    print("aliases: {}".format(rm.aliases))

def print_model_version_info(mv):
    print("--Model Version--")
    print("Name: {}".format(mv.name))
    print("Version: {}".format(mv.version))
    print("Aliases: {}".format(mv.aliases))

# Connect to the MLflow model registry
client = MlflowClient()
name = "my-finetuned-model"  # placeholder: the registered model name in MLflow

# Get the model version by alias
alias_mv = client.get_model_version_by_alias(name, "test-alias")

print()
print_model_version_info(alias_mv)
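
To complete step 3, the CI/CD task can then patch the RayService YAML with the new model or adapter location before committing it back to the Git repo. The snippet below is a hypothetical sketch using PyYAML; the file path and key names mirror the example manifest earlier in this document, and the URI would typically come from MLflow (for example via MlflowClient.get_model_version_download_uri):

# Patch the RayService manifest with a newly promoted LoRA adapter location.
import yaml

RAYSERVICE_PATH = "platform/deploys/prod-us-west-2/overlays/ray-service/patches/ray.yaml"  # example path

def update_lora_path(new_s3_uri):
    with open(RAYSERVICE_PATH) as f:
        manifest = yaml.safe_load(f)

    # serveConfigV2 is a YAML document embedded as a string, so parse it separately.
    serve_config = yaml.safe_load(manifest["spec"]["serveConfigV2"])
    llm_config = serve_config["applications"][0]["args"]["llm_configs"][0]
    llm_config["lora_config"]["dynamic_lora_loading_path"] = new_s3_uri

    manifest["spec"]["serveConfigV2"] = yaml.safe_dump(serve_config, sort_keys=False)
    with open(RAYSERVICE_PATH, "w") as f:
        yaml.safe_dump(manifest, f, sort_keys=False)

# e.g. update_lora_path("s3://<BUCKET_URL>/<MLFLOW_BUCKET_PATH>/artifacts")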

Information on adding things as overlays to Cake is here: Cake GitOps and Overlays

NOTE: Long term, we expect MLflow will add a Webhooks API for alias changes, which could simplify the YAML update step (see https://github.com/mlflow/mlflow/issues/14677). Once the updated YAML is merged, ArgoCD updates the RayService running in production.

Security

Key References

An extensive discussion of KubeRay security is available here:

https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/kuberay-auth.html

General Ray security information is available here:

https://docs.ray.io/en/latest/ray-security/index.html

Cake Security

Cake protects all resources running in its cluster with an Envoy gateway. This gateway prevents any access to Cake models without a valid JWT token. This means that by default, all models are only callable internally. We recommend using LiteLLM to proxy models for external access.

However, LiteLLM is focused on transformer models, and there may be models that need to be called externally without a LiteLLM proxy. For each such model, various cluster resources will need to be created, including Istio VirtualServices and Gateways. Please contact Cake support if you need to proxy a model.

If you want to make the model available externally, follow this guide:

Accessing Cake Resources Externally

Troubleshooting

A troubleshooting guide for RayServices is here:

https://docs.ray.io/en/latest/cluster/kubernetes/troubleshooting/rayservice-troubleshooting.html

General troubleshooting info for Ray is here:

https://docs.ray.io/en/latest/cluster/kubernetes/troubleshooting/troubleshooting.html

A troubleshooting guide for vLLM is here:

https://docs.vllm.ai/en/stable/usage/troubleshooting.html

Logs and Metrics

System logs and metrics for Ray and vLLM are discussed in this help document:

Monitoring your Ray Deployed Models with Prometheus and Grafana