Getting Started with Ray


Introduction

Ray is an open-source framework for building and running distributed applications without managing complex infrastructure or low-level communication protocols. It provides the foundation for compute-intensive workflows such as distributed model training, real-time inference, multi-agent orchestration, and large-scale experimentation. With Ray, teams can scale a workload from a laptop to a Kubernetes cluster using the same codebase, gaining elasticity, speed, and fault tolerance for AI engineering and production operations.

Key benefits of using Ray include:

  • Python-Native Distributed Execution: Write distributed applications using familiar Python idioms (e.g., the @ray.remote decorator) without having to manage threads, processes, or networking.

  • Unified Framework for ML Workloads: Supports a broad suite of AI use cases via built-in libraries:

    • Ray Train for distributed model training

    • Ray Tune for scalable hyperparameter tuning

    • Ray Serve for low-latency model serving

    • Ray Data for distributed preprocessing and ETL

    • Ray RLlib for reinforcement learning

  • Autoscaling and Resource-Aware Scheduling: Dynamically scales compute resources up or down across Kubernetes or cloud environments, with per-task and per-actor control over CPU, GPU, and memory requests.

  • Built-In Observability and Debugging: Offers a dashboard, event timelines, task graphs, and logs to debug and optimize distributed execution across thousands of tasks and actors.

  • Composable with Other Infrastructure: Integrates with tools such as MLflow, LangChain, OpenTelemetry, Prometheus, and vLLM, which supports the development of complex AI systems.
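
To make the first and third benefits more concrete, here is a minimal sketch of running a plain Python function as parallel Ray tasks with an explicit CPU request. The names square and run_squares are hypothetical and used only for illustration; the sketch assumes Ray is installed (pip install ray) and runs on a single machine rather than a cluster. Ray is imported inside the function only so that the plain function stays usable where Ray is not available; in application code, a top-level import with the @ray.remote decorator is the more common form.

```python
from typing import List


def square(x: int) -> int:
    """Plain Python function; nothing here is Ray-specific."""
    return x * x


def run_squares(values: List[int]) -> List[int]:
    """Run square() once per value as Ray tasks and return results in input order.

    Hypothetical helper for illustration; assumes a local, single-machine Ray runtime.
    """
    import ray  # lazy import: the module still loads where Ray is not installed

    # Start a local Ray runtime, or reuse one that is already running.
    ray.init(ignore_reinit_error=True)
    try:
        # Equivalent to decorating square with @ray.remote(num_cpus=1):
        # each task asks the scheduler for one CPU.
        remote_square = ray.remote(num_cpus=1)(square)

        # .remote() returns immediately with an object reference (a future).
        refs = [remote_square.remote(v) for v in values]

        # ray.get() blocks until all tasks finish and preserves list order.
        return ray.get(refs)
    finally:
        ray.shutdown()
```

Calling run_squares([1, 2, 3]) returns [1, 4, 9]. The same task code can run against a multi-node cluster by connecting ray.init() to that cluster instead of starting a local runtime; the details depend on how the cluster is deployed.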

Ray is used across training and inference workloads, agentic runtime environments, multi-step AI pipelines, and evaluation systems. It underpins services that require parallelism, resilience, and rapid iteration, and it is central to deploying high-throughput, low-latency AI features at scale. Adopting Ray lets teams build, scale, and operate AI systems with distributed computing as a first-class capability across the platform.

Important Links

Main Site

Documentation