RayService Autoscaling


RayService Autoscaling Example

Ray’s built-in autoscaler grows or shrinks your cluster automatically whenever its current set of allocated nodes cannot satisfy pending resource requests. The example below walks through a basic configuration and shows how to test it. We’ll use the load-generation tool Blazemeter, which sends client calls to the LLM over a long period of time in order to trigger autoscaling.

See the official Ray documentation for more information on basic autoscaling here:

https://docs.ray.io/en/latest/serve/autoscaling-guide.html#resnet-autoscaling-example

Those docs define the important autoscaling parameters that we use in this example as:

  • target_ongoing_requests is the average number of ongoing requests per replica that the Serve autoscaler tries to ensure. You can adjust it based on your request processing length (the longer the requests, the smaller this number should be) as well as your latency objective (the shorter you want your latency to be, the smaller this number should be).

  • max_replicas is the maximum number of replicas for the deployment. Set this to ~20% higher than what you think you need for peak traffic.

Here is a RayService YAML which includes an autoscaling_config section and deploys the Llama 3.2 3B Instruct LLM from CakeFS (the shared EFS drive in your Cake cluster):

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llama32-3b-autoscale
  namespace: ray-service
spec:
  serveConfigV2: |

    applications:
      - name: llama32-3b-autoscale
        route_prefix: "/"
        import_path: "ray.serve.llm:build_openai_app"
        runtime_env:
          env_vars:
            VLLM_USE_V1: "0"
            ENGINE_START_TIMEOUT_S: "1600"
            SERVE_LOG_LEVEL: "DEBUG"
            VLLM_LOGGING_LEVEL: "DEBUG"
            HF_TOKEN: "YOUR_HF_TOKEN"
        args:
          llm_configs:
            - model_loading_config:
                model_id: "Llama-3.2-3B-Instruct"
                model_source: "/home/ray/shared/models/models--meta-llama--Llama-3.2-3B-Instruct/snapshots/0cb88a4f764b7a12671c53f0838cd831a0843b95"
              deployment_config:
                health_check_timeout_s: 600
                autoscaling_config:
                  min_replicas: 1
                  max_replicas: 4
                  target_ongoing_requests: 1
              engine_kwargs:
                tensor_parallel_size: 1
                pipeline_parallel_size: 1
                tokenizer_pool_size: 2
                tokenizer_pool_extra_config: "{\"runtime_env\": {}}"
                max_model_len: 8192
                trust_remote_code: true
  rayClusterConfig:
    rayVersion: '2.47.1' # should match the Ray version in the image of the containers
    enableInTreeAutoscaling: true
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        metadata:
          annotations:
            karpenter.sh/do-not-disrupt: "true"
        spec:
          containers:
          - env:
            - name: RAY_GRAFANA_IFRAME_HOST
              value: "https://<CLUSTER_NAME>/grafana"
            - name: RAY_GRAFANA_HOST
              value: "http://kube-prometheus-stack-grafana.monitoring.svc.cluster.local:80"
            - name: RAY_PROMETHEUS_HOST
              value: "http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090"
            name: ray-head
            image: rayproject/ray-llm:2.47.1-py311-cu124
            imagePullPolicy: Always
            resources:
              requests:
                cpu: 500m
                memory: 1200Mi
            # Allows us to use ptrace for debugging (CPU Flame Chart)
            securityContext:
              capabilities:
                add:
                - SYS_PTRACE
            volumeMounts:
            - name: log-volume
              mountPath: /tmp/ray
            - name: ray-service-shared-volume
              mountPath: /home/ray/shared
            ports:
            - containerPort: 6379
              name: gcs-server
              protocol: TCP
            - containerPort: 8265 # Ray dashboard
              name: dashboard
              protocol: TCP
            - containerPort: 10001
              name: client
              protocol: TCP
            - containerPort: 8000
              name: serve
              protocol: TCP
          volumes:
          - name: log-volume
            emptyDir: {}
          - name: ray-service-shared-volume
            persistentVolumeClaim:
              claimName: shared-volume
    workerGroupSpecs:
    - groupName: gpu
      minReplicas: 1
      maxReplicas: 4
      numOfHosts: 1
      rayStartParams:
        num-gpus: "1"
      template:
        metadata:
          annotations:
            karpenter.sh/do-not-disrupt: "true"
        spec:
          nodeSelector:
            karpenter.sh/nodepool: ray-serve-gpu
            karpenter.k8s.aws/instance-family: g6e
          containers:
          - name: ray-worker
            image: rayproject/ray-llm:2.47.1-py311-cu124
            imagePullPolicy: Always
            resources:
              limits:
                nvidia.com/gpu: 1
              requests:
                cpu: 6000m
                memory: 20Gi
            # Allows us to use ptrace for debugging (CPU Flame Chart)
            securityContext:
              capabilities:
                add:
                - SYS_PTRACE
            volumeMounts:
            - mountPath: /tmp/ray
              name: log-volume
            - name: ray-service-shared-volume
              mountPath: /home/ray/shared
            ports:
            - containerPort: 6379
              name: gcs-server
              protocol: TCP
            - containerPort: 8265 # Ray dashboard
              name: dashboard
              protocol: TCP
            - containerPort: 10001
              name: client
              protocol: TCP
            - containerPort: 8000
              name: serve
              protocol: TCP
          volumes:
          - name: log-volume
            emptyDir: {}
          - name: ray-service-shared-volume
            persistentVolumeClaim:
              claimName: shared-volume

The important section for autoscaling is below:

autoscaling_config:
  min_replicas: 1
  max_replicas: 4
  target_ongoing_requests: 1


Setting max_replicas: 4 tells the autoscaler to launch up to 4 GPU workers, and setting target_ongoing_requests to a low value encourages the autoscaler to launch workers eagerly, keeping the number of queued client requests to a minimum.

Note: An alternative syntax for autoscaling, using the num_replicas parameter, also exists; see the snippet and the sketch below. However, we recommend the autoscaling_config syntax above because it gives you more control.

num_replicas: auto
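
For reference, here is a rough sketch of how that alternative would look inside deployment_config, replacing the autoscaling_config block (the health check timeout is carried over from the example above):

deployment_config:
  health_check_timeout_s: 600
  num_replicas: auto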

After applying this YAML, we can start Blazemeter with a scenario file such as the following, which ramps up to 10 virtual users over 10 seconds and holds that load constant for 30 minutes. The think-time is set to 5s to add a slight delay between each virtual user’s requests.

load-test.yaml

version: 1
execution:
  - scenario: llama_chat        # reference to the scenario below
    concurrency: 10             # peak VUs
    ramp-up: 10s                # time to reach peak (1 → 10)
    steps: 10                   # linear ramp (start at 1, add 1 each step)
    hold-for: 30m               # keep peak load; adjust or remove as needed
scenarios:
  llama_chat:
    default-address: http://localhost:8000
    requests:
      - label: chat_completion
        method: POST
        url: /v1/chat/completions
        headers:
          Content-Type: application/json
        body: |

          {
            "model": "Llama-3.2-3B-Instruct",
            "messages": [
              { "role": "system", "content": "You are a helpful assistant." },
              { "role": "user",
                "content": "${__RandomFromList(Describe the moon.,Name three prime numbers.,Who wrote Faust?,Give me a haiku on rain.,Translate 'good morning' to Spanish.,)}"
              }
            ],
            "temperature": 0.9,
            "max_tokens": 2000
          }
        think-time: 5s          # per-VU pause before the next iteration

To start the Blazemeter test, port-forward to the Ray Serve service on port 8000, and run:

uv run --with bzt bzt -report load-test.yaml
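
The port-forward itself can be set up with kubectl in a separate terminal. The service name below assumes KubeRay’s usual <rayservice-name>-serve-svc naming convention, so adjust it if your cluster differs:

kubectl port-forward -n ray-service svc/llama32-3b-autoscale-serve-svc 8000:8000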

As it runs, you’ll see statistics and response codes (HTTP 200):

For more about Blazemeter usage, see the docs at https://github.com/Blazemeter/taurus/blob/master/site/dat/docs/CommandLine.md and https://gettaurus.org/kb/Basic1/#Scaling-With-Cloud-Provisioning

You can use k9s or Lens to open a shell in one of the GPU workers to see the low-level details of GPU, CPU, and RAM usage on the worker nodes.
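
If you prefer plain kubectl, something along these lines opens a shell in a worker pod (the pod name is illustrative; list the pods first to find yours):

kubectl get pods -n ray-service
kubectl exec -it -n ray-service <gpu-worker-pod-name> -- bash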

You can use the following commands to install and run the excellent nvitop tool:

python3 -m pip install --user pipx
python3 -m pipx ensurepath
. ~/.bashrc
pipx run nvitop -m full -c --colorful

Nvitop shows GPU memory and utilization, driver version, CUDA version, as well as GPU type, CPU and RAM, and much more:

Once the autoscaler launches pods, Karpenter will go to work launching Kubernetes nodes to accommodate them.

They will appear in k9s:
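
You can also watch the pods and nodes from the command line, for example:

kubectl get pods -n ray-service -w
kubectl get nodes -l karpenter.sh/nodepool=ray-serve-gpu -w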

On the Ray Dashboard, the Replica count will show 4:

You can also run Ray CLI commands to find out more about what’s running. For example, running ray status will show 4 GPUs active:

(base) ray@llama32-3b-autoscale-raycluster-gxq4h-gpu-worker-c2wsl:~$ ray status

======== Autoscaler status: 2025-07-07 13:43:31.532417 ========
Node status
---------------------------------------------------------------
Active:
 4 gpu
 1 headgroup

Pending:
 (no pending nodes)

Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Total Usage:

 6.0/25.0 CPU (4.0 used of 4.0 reserved in placement groups)
 4.0/4.0 GPU (4.0 used of 4.0 reserved in placement groups)
 0B/1.09TiB memory
 118.42KiB/79.85GiB object_store_memory
Total Constraints:
 (no request_resources() constraints)
Total Demands:
 (no resource demands)
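
Another useful check, from a shell on the head or worker pod, is the Serve CLI status command, which reports application, deployment, and replica states (output not shown here):

serve status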

The Ray Dashboard Cluster page shows details on each node, including GRAM and GPU usage:

Clicking View Config, you can see the full set of parameters that Ray used, including defaults we did not specify, such as smoothing_factor, downscale_delay_s, and many more:

Once the load test completes, Ray waits 10 minutes by default (the downscale_delay_s default of 600 seconds) before shutting down the worker pods, and Kubernetes then shuts down the GPU nodes.
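
If you want the cluster to scale down sooner while experimenting, that delay can be overridden in the same autoscaling_config block, for example:

autoscaling_config:
  min_replicas: 1
  max_replicas: 4
  target_ongoing_requests: 1
  downscale_delay_s: 120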

In the Ray Cluster Dashboard, the workers that were scaled down show as “Dead”, with the reason “Expected termination” indicating that they were intentionally terminated by the autoscaler.

Finally, the Ray Serve Dashboard shows that the replica count is now back to 1:

Advanced Ray Serve Autoscaling

See the Ray documentation for coverage of more advanced configuration settings:

https://docs.ray.io/en/latest/serve/advanced-guides/advanced-autoscaling.html

See the official docs for the AutoscalingConfig settings:

https://docs.ray.io/en/latest/serve/api/doc/ray.serve.config.AutoscalingConfig.html

Conclusion

This example demonstrates how to configure and test Ray's autoscaling capabilities for a RayService deployment. By leveraging tools like Blazemeter for load generation and observing the behavior through the Ray Dashboard and CLI commands, you can effectively confirm that your Ray cluster scales out to handle increased demand and scales back down when the load subsides, optimizing resource utilization and maintaining service availability.