RayService Autoscaling Example
Ray includes a built-in autoscaler that grows or shrinks your cluster automatically whenever the current set of allocated nodes cannot satisfy pending resource requests. The example below shows the basic configuration options and how to test them. We’ll use the load-generation tool Blazemeter, which sends client calls to the LLM over a sustained period in order to trigger autoscaling.
See the official Ray documentation for more information on basic autoscaling here:
https://docs.ray.io/en/latest/serve/autoscaling-guide.html#resnet-autoscaling-example
Those docs define the important autoscaling parameters that we use in this example as:
target_ongoing_requests is the average number of ongoing requests per replica that the Serve autoscaler tries to ensure. You can adjust it based on your request processing length (the longer the requests, the smaller this number should be) as well as your latency objective (the shorter you want your latency to be, the smaller this number should be).
max_replicas is the maximum number of replicas for the deployment. Set this to ~20% higher than what you think you need for peak traffic.
Here is a RayService YAML that includes an autoscaling_config section and deploys the Llama 3.2-3B LLM from CakeFS (a shared EFS drive in your Cake cluster):
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llama32-3b-autoscale
  namespace: ray-service
spec:
  serveConfigV2: |
    applications:
      - name: llama32-3b-autoscale
        route_prefix: "/"
        import_path: "ray.serve.llm:build_openai_app"
        runtime_env:
          env_vars:
            VLLM_USE_V1: "0"
            ENGINE_START_TIMEOUT_S: "1600"
            SERVE_LOG_LEVEL: "DEBUG"
            VLLM_LOGGING_LEVEL: "DEBUG"
            HF_TOKEN: "YOUR_HF_TOKEN"
        args:
          llm_configs:
            - model_loading_config:
                model_id: "Llama-3.2-3B-Instruct"
                model_source: "/home/ray/shared/models/models--meta-llama--Llama-3.2-3B-Instruct/snapshots/0cb88a4f764b7a12671c53f0838cd831a0843b95"
              deployment_config:
                health_check_timeout_s: 600
                autoscaling_config:
                  min_replicas: 1
                  max_replicas: 4
                  target_ongoing_requests: 1
              engine_kwargs:
                tensor_parallel_size: 1
                pipeline_parallel_size: 1
                tokenizer_pool_size: 2
                tokenizer_pool_extra_config: "{\"runtime_env\": {}}"
                max_model_len: 8192
                trust_remote_code: true
  rayClusterConfig:
    rayVersion: '2.47.1' # should match the Ray version in the image of the containers
    enableInTreeAutoscaling: true
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        metadata:
          annotations:
            karpenter.sh/do-not-disrupt: "true"
        spec:
          containers:
            - env:
                - name: RAY_GRAFANA_IFRAME_HOST
                  value: "https://<CLUSTER_NAME>/grafana"
                - name: RAY_GRAFANA_HOST
                  value: "http://kube-prometheus-stack-grafana.monitoring.svc.cluster.local:80"
                - name: RAY_PROMETHEUS_HOST
                  value: "http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090"
              name: ray-head
              image: rayproject/ray-llm:2.47.1-py311-cu124
              imagePullPolicy: Always
              resources:
                requests:
                  cpu: 500m
                  memory: 1200Mi
              # Allows us to use ptrace for debugging (CPU Flame Chart)
              securityContext:
                capabilities:
                  add:
                    - SYS_PTRACE
              volumeMounts:
                - name: log-volume
                  mountPath: /tmp/ray
                - name: ray-service-shared-volume
                  mountPath: /home/ray/shared
              ports:
                - containerPort: 6379
                  name: gcs-server
                  protocol: TCP
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                  protocol: TCP
                - containerPort: 10001
                  name: client
                  protocol: TCP
                - containerPort: 8000
                  name: serve
                  protocol: TCP
          volumes:
            - name: log-volume
              emptyDir: {}
            - name: ray-service-shared-volume
              persistentVolumeClaim:
                claimName: shared-volume
    workerGroupSpecs:
      - groupName: gpu
        minReplicas: 1
        maxReplicas: 4
        numOfHosts: 1
        rayStartParams:
          num-gpus: "1"
        template:
          metadata:
            annotations:
              karpenter.sh/do-not-disrupt: "true"
          spec:
            nodeSelector:
              karpenter.sh/nodepool: ray-serve-gpu
              karpenter.k8s.aws/instance-family: g6e
            containers:
              - name: ray-worker
                image: rayproject/ray-llm:2.47.1-py311-cu124
                imagePullPolicy: Always
                resources:
                  limits:
                    nvidia.com/gpu: 1
                  requests:
                    cpu: 6000m
                    memory: 20Gi
                # Allows us to use ptrace for debugging (CPU Flame Chart)
                securityContext:
                  capabilities:
                    add:
                      - SYS_PTRACE
                volumeMounts:
                  - mountPath: /tmp/ray
                    name: log-volume
                  - name: ray-service-shared-volume
                    mountPath: /home/ray/shared
                ports:
                  - containerPort: 6379
                    name: gcs-server
                    protocol: TCP
                  - containerPort: 8265 # Ray dashboard
                    name: dashboard
                    protocol: TCP
                  - containerPort: 10001
                    name: client
                    protocol: TCP
                  - containerPort: 8000
                    name: serve
                    protocol: TCP
            volumes:
              - name: log-volume
                emptyDir: {}
              - name: ray-service-shared-volume
                persistentVolumeClaim:
                  claimName: shared-volume
The important section for autoscaling is below:
autoscaling_config:
  min_replicas: 1
  max_replicas: 4
  target_ongoing_requests: 1
Setting max_replicas: 4 tells the autoscaler to launch up to 4 GPU workers, and target_ongoing_requests is set to a low value to encourage the autoscaler to launch workers eagerly and keep the number of queued client requests to a minimum.
Note: An alternative syntax that enables autoscaling through the num_replicas parameter also exists, shown below. However, we recommend the explicit autoscaling_config syntax above, which gives you more control.
num_replicas: auto
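Assuming the manifest above is saved as llama32-3b-autoscale.yaml (the file name is just an example), you can apply it and watch the initial pods come up with commands along these lines:

# Apply the RayService manifest
kubectl apply -f llama32-3b-autoscale.yaml

# Watch the head pod and the first GPU worker pod start
kubectl -n ray-service get pods -w

# Check the RayService status until the Serve application reports ready
kubectl -n ray-service get rayservice llama32-3b-autoscale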
After applying this YAML, we can start Blazemeter, using a configuration file such as the following, to launch 10 virtual users that ramp up over 10 seconds and then hold that load constant for 30 minutes. The think-time is set to 5s to leave a slight delay between each virtual user’s requests.
load-test.yaml
version: 1

execution:
  - scenario: llama_chat   # reference to the scenario below
    concurrency: 10        # peak VUs
    ramp-up: 10s           # time to reach peak (1 → 10)
    steps: 10              # linear ramp (start at 1, add 1 each step)
    hold-for: 30m          # keep peak load; adjust or remove as needed

scenarios:
  llama_chat:
    default-address: http://localhost:8000
    requests:
      - label: chat_completion
        method: POST
        url: /v1/chat/completions
        headers:
          Content-Type: application/json
        body: |
          {
            "model": "Llama-3.2-3B-Instruct",
            "messages": [
              { "role": "system", "content": "You are a helpful assistant." },
              { "role": "user",
                "content": "${__RandomFromList(Describe the moon.,Name three prime numbers.,Who wrote Faust?,Give me a haiku on rain.,Translate 'good morning' to Spanish.,)}"
              }
            ],
            "temperature": 0.9,
            "max_tokens": 2000
          }
    think-time: 5s         # per-VU pause before the next iteration
To start the Blazemeter test, first port-forward local port 8000 to the Ray Serve service.
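A typical port-forward command looks like the following; the service name assumes KubeRay’s default <rayservice-name>-serve-svc naming convention, so verify it first with kubectl -n ray-service get svc:

# Forward local port 8000 to the Serve service created for the RayService
kubectl -n ray-service port-forward svc/llama32-3b-autoscale-serve-svc 8000:8000

# Optional: send one request to confirm the endpoint responds before load testing
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Llama-3.2-3B-Instruct", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32}'

Then, in a separate terminal, start the test: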
uv run --with bzt bzt -report load-test.yaml
As it runs, you’ll see statistics and response codes (HTTP 200):
For more about Blazemeter usage, see docs at https://github.com/Blazemeter/taurus/blob/master/site/dat/docs/CommandLine.md and https://gettaurus.org/kb/Basic1/#Scaling-With-Cloud-Provisioning?utm_source=BM&utm_medium=kb&utm_campaign=creating-a-new-taurus-test
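For quick experiments, Taurus also supports overriding settings from the command line instead of editing the file; a sketch (check the CLI docs above for the exact override syntax):

# Run a shorter, heavier test without editing load-test.yaml
uv run --with bzt bzt -report load-test.yaml \
  -o execution.0.concurrency=20 \
  -o execution.0.hold-for=5m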
You can use k9s or Lens to open a shell in one of the GPU workers to see the low-level details of GPU, CPU, and RAM usage on the worker nodes.
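If you prefer plain kubectl, something like the following works as well; the label selector assumes the ray.io/node-type label that KubeRay applies to worker pods:

# List the GPU worker pods
kubectl -n ray-service get pods -l ray.io/node-type=worker

# Open a shell in one of them (substitute a real pod name from the list above)
kubectl -n ray-service exec -it <gpu-worker-pod-name> -- bash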
You can use the following commands to install and run the excellent nvitop command:
python3 -m pip install --user pipx
python3 -m pipx ensurepath
. ~/.bashrc
pipx run nvitop -m full -c --colorful
Nvitop shows GPU memory and utilization, driver version, CUDA version, as well as GPU type, CPU and RAM, and much more:
Once the autoscaler launches pods, Karpenter will go to work launching Kubernetes nodes to accommodate them.
The new nodes will appear in k9s.
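You can also watch them arrive from the command line; for example:

# Watch nodes join the cluster, with a column showing which Karpenter node pool provisioned each one
kubectl get nodes -L karpenter.sh/nodepool --watch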
On the Ray Dashboard, the Replica count will show 4:
You can also run Ray CLI commands to find out more about what’s running. For example, running ray status will show 4 GPUs active:
(base) ray@llama32-3b-autoscale-raycluster-gxq4h-gpu-worker-c2wsl:~$ ray status
======== Autoscaler status: 2025-07-07 13:43:31.532417 ========
Node status
---------------------------------------------------------------
Active:
 4 gpu
 1 headgroup
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 6.0/25.0 CPU (4.0 used of 4.0 reserved in placement groups)
 4.0/4.0 GPU (4.0 used of 4.0 reserved in placement groups)
 0B/1.09TiB memory
 118.42KiB/79.85GiB object_store_memory
Total Constraints:
 (no request_resources() constraints)
Total Demands:
 (no resource demands)
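Other Ray state CLI commands are useful here as well; for example, from a shell on the head or a worker pod:

# List every node in the Ray cluster with its resources and state
ray list nodes

# Summarize the actors (including the Serve replicas) running across the cluster
ray summary actors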
The Ray Dashboard Cluster page shows details on each node, including GRAM and GPU usage:
Clicking View Config, you can see the full set of parameters that Ray used, including defaults we did not specify, such as smoothing_factor, downscale_delay_s, and many more.
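Similar information is available from the Serve CLI; for example, from a shell on the head pod:

# Print the Serve application config currently deployed on the cluster
serve config

# Show per-application and per-deployment status, including current replica counts
serve status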
Once the load test is complete, Ray waits 10 minutes by default (the downscale_delay_s default of 600 seconds) before scaling back in, at which point the extra worker pods shut down and Karpenter removes the idle GPU nodes.
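You can watch the scale-down with the same commands used earlier:

# Worker pods terminate as Serve scales the deployment back in
kubectl -n ray-service get pods -w

# The idle GPU nodes disappear once their pods are gone
kubectl get nodes -L karpenter.sh/nodepool --watch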
In the Ray Cluster Dashboard, the workers that were scaled down will show as “Dead”, with the reason “Expected termination” indicating that they were intentionally terminated by the autoscaler.
Finally, the Ray Serve Dashboard shows that the replica count is back to 1:
Advanced Ray Serve Autoscaling
See the Ray documentation for coverage of more advanced configuration settings:
https://docs.ray.io/en/latest/serve/advanced-guides/advanced-autoscaling.html
See the official docs for the AutoscalingConfig settings:
https://docs.ray.io/en/latest/serve/api/doc/ray.serve.config.AutoscalingConfig.html
Conclusion
This example demonstrates how to configure and test Ray's autoscaling capabilities for a RayService deployment. By leveraging tools like Blazemeter for load generation and observing the behavior through the Ray Dashboard and CLI commands, you can effectively confirm that your Ray cluster scales out to handle increased demand and scales back down when the load subsides, optimizing resource utilization and maintaining service availability.