Getting Started with SkyPilot


Introduction

Welcome to this introduction to SkyPilot, a powerful framework for running machine learning workloads on any cloud. This guide will walk you through the process of setting up and using SkyPilot to launch GPU-enabled instances, manage clusters, and run jobs efficiently across multiple cloud providers.

In the following sections, you'll learn how to:

  • Set up your environment with the necessary credentials

  • Create and launch a SkyPilot cluster

  • Connect to your cluster and verify its resources

  • Terminate your cluster and review costs

By the end of this guide, you'll have a solid understanding of how to leverage SkyPilot for your machine learning and data science projects. Let's get started!

Important Links

Main Site

Documentation

Step 1: Set Up Your Environment

Launch a JupyterLab Notebook Server using the image kflow/skypilot-multi-cloud:v0.0.10

This notebook server only launches the GPU instance through SkyPilot and does not run the workload itself, so minimal RAM and GPU settings are sufficient.

Export the AWS credentials of a user that has permission to launch instances in your AWS account:

export AWS_REGION=ca-central-1
export AWS_ACCESS_KEY_ID=AKI*********************
export AWS_SECRET_ACCESS_KEY=8ae***********************
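
Before moving on, it can help to confirm that all three variables are set in the current shell. The sketch below is one way to do that without printing the secret values. check_aws_env is a hypothetical helper name, not something SkyPilot or AWS provides.

```shell
# Sketch: report which required AWS variables are missing, without echoing their values.
# check_aws_env is a hypothetical helper, not part of SkyPilot or the AWS CLI.
check_aws_env() {
  missing=0
  for name in AWS_REGION AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
    # Indirect lookup that works in POSIX sh: read the variable named by $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    else
      echo "set: $name"
    fi
  done
  unset value
  return "$missing"
}

check_aws_env || echo "Export the missing variables before running sky."
```

The function returns a non-zero status when any variable is empty or unset, so it can also gate a longer setup script.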

Run sky to make sure it’s installed.

Run sky status to see the status of clusters, managed jobs, and services.

Run sky check to see which clouds SkyPilot can access with your credentials.
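
If you script these checks, a small guard avoids confusing errors when the CLI is missing. This is a minimal sketch: have_cmd and sky_preflight are hypothetical helper names, and the only SkyPilot commands it calls are sky check and sky status.

```shell
# Sketch: run the SkyPilot checks only when the CLI is actually on PATH.
# have_cmd and sky_preflight are hypothetical helper names.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

sky_preflight() {
  if ! have_cmd sky; then
    echo "sky not found on PATH; is this the SkyPilot notebook image?"
    return 1
  fi
  sky check    # which clouds are usable with the current credentials
  sky status   # existing clusters, managed jobs, and services
}

sky_preflight || echo "Preflight did not complete."
```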

Run these commands to create the AWS credentials file. The file can be empty because you already exported the AWS_ environment variables.

mkdir -p ~/.aws
touch ~/.aws/credentials
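
If you expect to repeat this setup, the same step can be written so that it is safe to re-run and keeps the directory private to your user. The following is a sketch under that assumption. ensure_aws_dir is a hypothetical helper name, and tightening the permissions is an extra precaution rather than something this guide requires.

```shell
# Sketch: create the AWS directory and an empty credentials file, safe to re-run.
# ensure_aws_dir is a hypothetical helper name.
ensure_aws_dir() {
  dir="$1"
  mkdir -p "$dir"               # -p: no error if the directory already exists
  touch "$dir/credentials"      # creates the file if missing; existing contents are untouched
  chmod 700 "$dir"              # directory accessible by the owner only
  chmod 600 "$dir/credentials"  # file readable and writable by the owner only
}
```

Call it as ensure_aws_dir "$HOME/.aws" to get the same result as the two commands above.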

Step 2: Create and Launch a Cluster

Create a hello-sky.yaml file to define a cluster:

name: hello-sky
num_nodes: 1
resources:
  cpus: 1+
  memory: 8+
  disk_size: 64
  ordered:
  - cloud: aws
    use_spot: true
    accelerators: A10G:1
workdir: .
setup: |
  echo "Running setup..."
run: |
  echo "Hello, SkyPilot!"
  conda env list
  ls -la /
  echo "All Done!!!"
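
One way to create this file from the JupyterLab terminal is a shell heredoc, sketched below. The inline comments are a reading of each field based on the SkyPilot task YAML format, so check them against the documentation for your version.

```shell
# Sketch: write the task file from a terminal. The quoted 'EOF' stops the shell
# from expanding anything inside the YAML.
cat > hello-sky.yaml <<'EOF'
name: hello-sky
num_nodes: 1
resources:
  cpus: 1+        # at least 1 vCPU
  memory: 8+      # at least 8 GB of memory
  disk_size: 64   # disk size in GB
  ordered:        # candidate resources, in order of preference
  - cloud: aws
    use_spot: true          # request a spot instance
    accelerators: A10G:1    # one NVIDIA A10G GPU
workdir: .        # local directory synced to the cluster
setup: |
  echo "Running setup..."
run: |
  echo "Hello, SkyPilot!"
  conda env list
  ls -la /
  echo "All Done!!!"
EOF

# Confirm the file landed and contains the GPU request.
grep -n "accelerators" hello-sky.yaml
```

The comments sit outside the setup and run blocks on purpose, because text inside a YAML block scalar is passed to the shell as part of the script.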

Launch using this command:

sky launch -c test-a10g hello-sky.yaml

When the launch completes, SkyPilot prints a summary of useful commands:

📋 Useful Commands
Job ID: 1
├── To cancel the job:          sky cancel test-a10g 1
├── To stream job logs:         sky logs test-a10g 1
└── To view job queue:          sky queue test-a10g
Cluster name: test-a10g
├── To log into the head VM:    ssh test-a10g
├── To submit a job:            sky exec test-a10g yaml_file
├── To stop the cluster:        sky stop test-a10g
└── To teardown the cluster:    sky down test-a10g
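
To try these commands in sequence without accidentally acting on the cluster, you can wrap them in a small dry-run helper. This is only a sketch: run_cmd and the DRY_RUN variable are hypothetical names, and the sky subcommands are the ones listed in the summary above.

```shell
# Sketch: print each command before running it, or only print it when DRY_RUN=1.
# run_cmd and DRY_RUN are hypothetical names, not SkyPilot features.
run_cmd() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

DRY_RUN=1   # set to 0 to actually execute the commands
run_cmd sky exec test-a10g hello-sky.yaml   # submit the task again as a new job
run_cmd sky queue test-a10g                 # list jobs on the cluster
run_cmd sky logs test-a10g 1                # stream logs for job 1
```

With DRY_RUN=1 the helper only echoes each command prefixed with "+", which makes it easy to review the sequence before running it for real.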

Step 3: Connect to Your Cluster

Once the cluster has started, you can ssh into the machine:

(base) jovyan@skypilot-0: ssh test-a10g
Warning: Permanently added '52.42.117.2' (ED25519) to the list of known hosts.
Welcome to Ubuntu 22.04.5 LTS (GNU/Linux 6.8.0-1015-aws x86_64)
 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/pro
 System information as of Mon Nov 18 15:56:00 UTC 2024
  System load:  0.98               Processes:             282
  Usage of /:   30.8% of 61.84GB   Users logged in:       0
  Memory usage: 1%                 IPv4 address for ens5: 172.31.8.35
  Swap usage:   0%
 * Ubuntu Pro delivers the most comprehensive open source security and
   compliance features.
   https://ubuntu.com/aws/pro
Expanded Security Maintenance for Applications is not enabled.
23 updates can be applied immediately.
19 of these updates are standard security updates.
To see these additional updates run: apt list --upgradable
Enable ESM Apps to receive additional future security updates.
See https://ubuntu.com/esm or run: sudo pro status
The list of available updates is more than a week old.
To check for new updates run: sudo apt update
New release '24.04.1 LTS' available.
Run 'do-release-upgrade' to upgrade to it.
Last login: Mon Nov 18 15:56:05 2024 from 52.60.60.24
(base) ubuntu@ip-172-31-8-35:~$ whoami
ubuntu

Run nvidia-smi to check GPU status:
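
For a scripted version of this check, you can count the GPUs that nvidia-smi reports and compare the count with what the task requested. The sketch below assumes one GPU per output line when using the --query-gpu and --format=csv,noheader options. expect_gpu_count is a hypothetical helper name.

```shell
# Sketch: compare the number of GPUs listed on stdin with an expected count.
# expect_gpu_count is a hypothetical helper name.
expect_gpu_count() {
  expected="$1"
  # Count non-empty input lines; each line is assumed to describe one GPU.
  actual=$(grep -c . || true)
  if [ "$actual" -eq "$expected" ]; then
    echo "ok: $actual GPU(s) visible"
  else
    echo "expected $expected GPU(s), found $actual"
    return 1
  fi
}

# Usage on the cluster:
#   nvidia-smi --query-gpu=name --format=csv,noheader | expect_gpu_count 1
# Usage from the notebook server, over SSH:
#   ssh test-a10g nvidia-smi --query-gpu=name --format=csv,noheader | expect_gpu_count 1
```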

Run htop to see system processes, memory, and CPU usage:

Step 4: Terminate the Cluster and Review Costs

Once you are done, terminate the cluster:

$ sky down test-a10g
Terminating 1 cluster: test-a10g. Proceed? [Y/n]: y
Terminating cluster test-a10g...done.
Terminating 1 cluster ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

Verify that everything is shut down:

$ sky status
Clusters
No existing clusters.
Managed jobs
No in-progress managed jobs. (See: sky jobs -h)
Services
No live services. (See: sky serve -h)
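
If you want a script to confirm the teardown, one option is to look for the "No existing clusters." line in the status output. This is a minimal sketch: no_clusters_left is a hypothetical helper name, and the message text is copied from the output above, so it may differ in other SkyPilot versions.

```shell
# Sketch: succeed only if the text on stdin contains the "no clusters" message.
# no_clusters_left is a hypothetical helper name; the message is taken from the
# sky status output shown above and may change between versions.
no_clusters_left() {
  grep -q "No existing clusters\."
}

# Usage (requires the sky CLI):
#   sky status | no_clusters_left && echo "All clusters are gone."
```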

Run sky cost-report to get a report of cluster costs:

$ sky cost-report
Clusters
NAME       LAUNCHED     DURATION  RESOURCES                                            STATUS      COST/hr  COST (est.)  
test-a10g  15 mins ago  4m 27s    1x AWS(g5.4xlarge[Spot], {'A10G': 1}, disk_size=64)  TERMINATED  $ 0.23   $ 0.02       
Total Cost: $0.02
Showing up to 5 most recent clusters. To see all clusters in history, pass the --all flag.
This feature is experimental. Costs for clusters with auto{stop,down} scheduled may not be accurate.
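
The estimate in this report can be cross-checked by hand. The sketch below assumes the cost is the hourly rate multiplied by the runtime, which is consistent with the numbers shown but is not stated by the report itself. estimate_cost is a hypothetical helper name.

```shell
# Sketch: estimate cost as hourly rate x runtime. estimate_cost is a hypothetical helper.
estimate_cost() {
  rate_per_hour="$1"   # USD per hour, e.g. 0.23
  minutes="$2"
  seconds="$3"
  awk -v rate="$rate_per_hour" -v m="$minutes" -v s="$seconds" \
    'BEGIN { printf "%.2f\n", rate * (m * 60 + s) / 3600 }'
}

estimate_cost 0.23 4 27   # prints 0.02
```

With the values from the report, 4m 27s is 267 seconds, and 0.23 × 267 / 3600 is about 0.017, which rounds to the $0.02 shown.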

Conclusion

SkyPilot gives you a single workflow for running machine learning workloads across cloud platforms. In this guide, you set up your credentials, launched a GPU-enabled spot instance, verified its resources, terminated the cluster, and reviewed its cost. From here, explore SkyPilot's other features, such as managed jobs and services, and integrate it into your own machine learning projects.