Introduction
SkyPilot is a framework for running machine learning workloads on any cloud. This guide walks you through setting up SkyPilot and using it to launch GPU-enabled instances, manage clusters, and run jobs.
In the following sections, you'll learn how to:
Set up your environment with the necessary credentials
Create and launch a SkyPilot cluster
Connect to your cluster and verify its resources
Terminate your cluster and review costs
By the end of this guide, you'll have a solid understanding of how to leverage SkyPilot for your machine learning and data science projects. Let's get started!
Important Links
Main Site
Step 1: Standard Installation
Launch a JupyterLab Notebook Server using the image kflow/skypilot-multi-cloud:v0.0.10
You can specify minimal RAM and GPU settings for this JupyterLab Notebook Server, since it’s going to be used to launch a GPU instance with SkyPilot.
Export the AWS credentials of a user who has permission to launch instances in your AWS account:
export AWS_REGION=ca-central-1
export AWS_ACCESS_KEY_ID=AKI*********************
export AWS_SECRET_ACCESS_KEY=8ae***********************
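Before moving on, it can help to confirm that all three variables are actually set in the current shell. The following is a minimal sketch; check_aws_env is a hypothetical helper name, and it only reports which variables are missing without printing any secret values:

```shell
# Hypothetical helper: report which required AWS variables are missing,
# without echoing any secret values.
check_aws_env() {
  missing=0
  for name in AWS_REGION AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

if check_aws_env; then
  echo "AWS environment variables are set"
fi
```

The function returns a non-zero status when any variable is missing, so it can also guard a longer setup script.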
Run sky to make sure it’s installed.
Run sky status to see the status of your clusters and jobs.
Run sky check to see which clouds SkyPilot can access with your credentials.
Create the AWS credentials file that SkyPilot looks for. The file can be empty, because you already exported your AWS_ environment variables. The -p flag keeps mkdir from failing if the directory already exists.
mkdir -p ~/.aws
touch ~/.aws/credentials
Step 2: Project App Installation (if applicable)
Create a hello-sky.yaml file to define a cluster:
name: hello-sky

num_nodes: 1

resources:
  cpus: 1+
  memory: 8+
  disk_size: 64
  ordered:
    - cloud: aws
      use_spot: true
      accelerators: A10G:1

workdir: .

setup: |
  echo "Running setup..."

run: |
  echo "Hello, SkyPilot!"
  conda env list
  ls -la /
  echo "All Done!!!"
Launch using this command:
sky launch -c test-a10g hello-sky.yaml
When the launch finishes, SkyPilot prints a summary of useful commands:
📋 Useful Commands
Job ID: 1
├── To cancel the job: sky cancel test-a10g 1
├── To stream job logs: sky logs test-a10g 1
└── To view job queue: sky queue test-a10g
Cluster name: test-a10g
├── To log into the head VM: ssh test-a10g
├── To submit a job: sky exec test-a10g yaml_file
├── To stop the cluster: sky stop test-a10g
└── To teardown the cluster: sky down test-a10g
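A GPU cluster keeps billing for as long as it runs, so it can help to schedule teardown at launch time instead of relying on a manual sky down. The following is a minimal sketch, assuming the -i (idle minutes before autostop) and --down options of sky launch as described in the SkyPilot CLI help. launch_cmd is a hypothetical helper, and the 10-minute idle window is an arbitrary choice. The helper only prints the command so you can review it before running it:

```shell
# Hypothetical helper: print a launch command that tears the cluster
# down after it has been idle for the given number of minutes.
launch_cmd() {
  cluster="$1"
  task_yaml="$2"
  idle_minutes="${3:-10}"   # assumed default, not taken from the guide
  echo "sky launch -c $cluster -i $idle_minutes --down $task_yaml"
}

launch_cmd test-a10g hello-sky.yaml 10
```

Run sky launch --help in your own installation to confirm that both options are available in your SkyPilot version.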
Step 3: Initial Setup
Once the cluster has started, you can ssh into the machine:
ssh test-a10g
(base) jovyan@skypilot-0: ssh test-a10g
Warning: Permanently added '52.42.117.2' (ED25519) to the list of known hosts.
Welcome to Ubuntu 22.04.5 LTS (GNU/Linux 6.8.0-1015-aws x86_64)
* Documentation: <https://help.ubuntu.com>
* Management: <https://landscape.canonical.com>
* Support: <https://ubuntu.com/pro>
System information as of Mon Nov 18 15:56:00 UTC 2024
System load: 0.98 Processes: 282
Usage of /: 30.8% of 61.84GB Users logged in: 0
Memory usage: 1% IPv4 address for ens5: 172.31.8.35
Swap usage: 0%
* Ubuntu Pro delivers the most comprehensive open source security and
compliance features.
<https://ubuntu.com/aws/pro>
Expanded Security Maintenance for Applications is not enabled.
23 updates can be applied immediately.
19 of these updates are standard security updates.
To see these additional updates run: apt list --upgradable
Enable ESM Apps to receive additional future security updates.
See <https://ubuntu.com/esm> or run: sudo pro status
The list of available updates is more than a week old.
To check for new updates run: sudo apt update
New release '24.04.1 LTS' available.
Run 'do-release-upgrade' to upgrade to it.
Last login: Mon Nov 18 15:56:05 2024 from 52.60.60.24
(base) ubuntu@ip-172-31-8-35:~$ whoami
ubuntu
Run nvidia-smi to check GPU status.
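For a scripted check, nvidia-smi can print just the GPU name and total memory. The sketch below assumes the cluster from this guide, so it looks for "A10G" in the reported name. has_gpu is a hypothetical helper, and the exact name string the driver reports may differ:

```shell
# Hypothetical helper: succeed if any line on stdin mentions the
# GPU model passed as the first argument.
has_gpu() {
  grep -q -- "$1"
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # Query only the fields needed, one CSV line per GPU, no header row.
  if nvidia-smi --query-gpu=name,memory.total --format=csv,noheader | has_gpu "A10G"; then
    echo "A10G detected"
  else
    echo "A10G not found" >&2
  fi
else
  echo "nvidia-smi is not installed on this machine" >&2
fi
```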
Run htop to see system processes, memory, and CPU usage.
Step 4: Explore Key Features
Once you are done, terminate the cluster:
$ sky down test-a10g
Terminating 1 cluster: test-a10g. Proceed? [Y/n]: y
Terminating cluster test-a10g...done.
Terminating 1 cluster ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Verify that everything is shut down:
$ sky status
Clusters
No existing clusters.
Managed jobs
No in-progress managed jobs. (See: sky jobs -h)
Services
No live services. (See: sky serve -h)
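In a script, the same verification can be done by searching the sky status output for the "No existing clusters" message shown above. This is a rough sketch: no_clusters_left is a hypothetical helper, and it assumes the message text matches your SkyPilot version:

```shell
# Hypothetical helper: succeed only if the text on stdin reports
# that no clusters remain.
no_clusters_left() {
  grep -q "No existing clusters"
}

if command -v sky >/dev/null 2>&1; then
  if sky status | no_clusters_left; then
    echo "all clusters are down"
  else
    echo "clusters still exist; check 'sky status'" >&2
  fi
fi
```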
Run sky cost-report to get a report of cluster costs:
$ sky cost-report
Clusters
NAME       LAUNCHED     DURATION  RESOURCES                                            STATUS      COST/hr  COST (est.)
test-a10g  15 mins ago  4m 27s    1x AWS(g5.4xlarge[Spot], {'A10G': 1}, disk_size=64)  TERMINATED  $ 0.23   $ 0.02
Total Cost: $0.02
Showing up to 5 most recent clusters. To see all clusters in history, pass the --all flag.
This feature is experimental. Costs for clusters with auto{stop,down} scheduled may not be accurate.
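The estimate in the report is consistent with the hourly rate multiplied by the run time. The sketch below reproduces that arithmetic, assuming the cost is simply rate times duration, rounded to cents; the report itself does not say how it rounds. estimate_cost is a hypothetical helper:

```shell
# Hypothetical helper: estimate_cost RATE_PER_HOUR SECONDS
# Prints the cost in dollars, rounded to two decimal places.
estimate_cost() {
  awk -v rate="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", rate * secs / 3600 }'
}

# 4m 27s = 4 * 60 + 27 = 267 seconds at $0.23 per hour.
estimate_cost 0.23 267   # prints 0.02
```

The unrounded value is about $0.017, which rounds to the $0.02 shown in the report.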
Conclusion
SkyPilot offers a flexible way to manage machine learning workloads across cloud platforms. By following the steps in this guide, you can set up your environment, launch a GPU-enabled instance, verify its resources, and tear it down while keeping track of cost. From here, explore SkyPilot's other features, such as the managed jobs and services listed by sky status, and integrate it into your own machine learning workflow.