Use GPU on interactive session#

List of partitions#

The following command shows the list of partitions, together with their nodes (NODELIST) and GPU names (GRES):

sinfo -o "%.10P %.5a %.10l %.6D %.6t %.20N %.10G"

 PARTITION AVAIL  TIMELIMIT  NODES  STATE             NODELIST       GRES
  compute*    up 3-00:00:00      1 drain*              scs0123     (null)
  compute*    up 3-00:00:00      2  down*       scs[0022,0050]     (null)
  compute*    up 3-00:00:00     36    mix scs[0007,0009-0010,0     (null)
  compute*    up 3-00:00:00     35  alloc scs[0001-0006,0008,0     (null)
  compute*    up 3-00:00:00     48   idle scs[0049,0051-0062,0     (null)
  compute*    up 3-00:00:00      1   down              scs0100     (null)
developmen    up      30:00      1 drain*              scs0123     (null)
developmen    up      30:00      2  down*       scs[0022,0050]     (null)
developmen    up      30:00     36    mix scs[0007,0009-0010,0     (null)
developmen    up      30:00     35  alloc scs[0001-0006,0008,0     (null)
developmen    up      30:00     48   idle scs[0049,0051-0062,0     (null)
developmen    up      30:00      1   down              scs0100     (null)
       gpu    up 2-00:00:00      1    mix              scs2003 gpu:v100:2
       gpu    up 2-00:00:00      2  alloc       scs[2001-2002] gpu:v100:2
       gpu    up 2-00:00:00      1   idle              scs2004 gpu:v100:2
  accel_ai    up 2-00:00:00      2    mix       scs[2041,2043] gpu:a100:8
  accel_ai    up 2-00:00:00      3   idle  scs[2042,2044-2045] gpu:a100:8
accel_ai_d    up    2:00:00      2    mix       scs[2041,2043] gpu:a100:8
accel_ai_d    up    2:00:00      3   idle  scs[2042,2044-2045] gpu:a100:8
accel_ai_m    up   12:00:00      1   idle              scs2046 gpu:1g.5gb
s_highmem_    up 3-00:00:00      2   idle       scs[0151-0152]     (null)
s_compute_    up 3-00:00:00      1    mix              scs3001     (null)
s_compute_    up 3-00:00:00      1   idle              scs3003     (null)
s_compute_    up    1:00:00      1    mix              scs3001     (null)
s_compute_    up    1:00:00      1   idle              scs3003     (null)
 s_gpu_eng    up 2-00:00:00      1   idle              scs2021 gpu:v100:4

In this example I will use the accel_ai partition, because I have access to it.

  • Go here: https://scw.bangor.ac.uk/en/projects/memberships/ to check your memberships.

  • Make sure the STATE is idle or mix, not drain or down.

  • Here, each accel_ai node has 8 NVIDIA A100 40 GB GPUs.

  • We can use any of the idle nodes scs[2042,2044-2045].
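The selection rule above (a GPU partition whose nodes are idle or mix) can be sketched in Python. This is only an illustration and not part of the original workflow: the helper name usable_gpu_rows is hypothetical, and SAMPLE is a handful of rows copied from the sinfo output above, so nothing is queried from the cluster.

```python
# A few rows copied from the sinfo output shown earlier (illustrative sample only).
SAMPLE = """\
  compute*    up 3-00:00:00      1 drain*              scs0123     (null)
       gpu    up 2-00:00:00      2  alloc       scs[2001-2002] gpu:v100:2
       gpu    up 2-00:00:00      1   idle              scs2004 gpu:v100:2
  accel_ai    up 2-00:00:00      2    mix       scs[2041,2043] gpu:a100:8
  accel_ai    up 2-00:00:00      3   idle  scs[2042,2044-2045] gpu:a100:8
"""


def usable_gpu_rows(sinfo_text):
    """Return (partition, state, nodelist, gres) for GPU rows in state idle or mix."""
    rows = []
    for line in sinfo_text.splitlines():
        fields = line.split()
        # The format string used above yields exactly seven columns per row.
        if len(fields) != 7 or fields[0] == "PARTITION":
            continue
        partition, _avail, _timelimit, _nodes, state, nodelist, gres = fields
        if gres.startswith("gpu:") and state in ("idle", "mix"):
            rows.append((partition, state, nodelist, gres))
    return rows


for partition, state, nodelist, gres in usable_gpu_rows(SAMPLE):
    print(partition, state, nodelist, gres)
```

To use it on live data you would pass in the text printed by the sinfo command above instead of SAMPLE.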

Start an interactive session#

First, use salloc to reserve resources.

salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:1

Replace scw1901 with your own project code. You can check your memberships here: https://scw.bangor.ac.uk/en/projects/memberships/

As discussed earlier, we will use a node from accel_ai, and Slurm will assign one that is available. I requested only one GPU (--gres=gpu:1).

[s.1915438@sl1 experiment]$ salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:1
salloc: Granted job allocation 7133017
salloc: Waiting for resource configuration
salloc: Nodes scs2042 are ready for job
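If you request allocations often, it can help to assemble the command from its parts so that only the account, partition and GPU count change. The sketch below is an assumption of how that might look: build_salloc is a hypothetical helper, it only builds and prints the command string, and it uses the same four options as the command above.

```python
import shlex


def build_salloc(account, partition, gpus, nodes=1):
    """Return the salloc command as an argument list (nothing is executed)."""
    if gpus < 1:
        raise ValueError("request at least one GPU")
    return [
        "salloc",
        f"--nodes={nodes}",
        f"--account={account}",
        f"--partition={partition}",
        f"--gres=gpu:{gpus}",
    ]


# Values taken from the example above; substitute your own project code.
cmd = build_salloc(account="scw1901", partition="accel_ai", gpus=1)
print(shlex.join(cmd))
```

The printed line can be pasted into the login shell; changing gpus to 2 gives the request used in the two-GPU case.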

You can now see your allocation using squeue:

[s.1915438@sl1 experiment]$ squeue --user=s.1915438
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           7133017  accel_ai     bash s.191543  R      14:08      1 scs2042
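The same details are also available from inside the allocation through environment variables that Slurm sets for the salloc shell. A minimal sketch, assuming the standard SLURM_JOB_ID, SLURM_JOB_PARTITION and SLURM_JOB_NODELIST variables are present (the function name slurm_allocation is hypothetical, and outside an allocation the values are simply None):

```python
import os

# Variables documented by Slurm for shells started by salloc.
SLURM_VARS = ("SLURM_JOB_ID", "SLURM_JOB_PARTITION", "SLURM_JOB_NODELIST")


def slurm_allocation(env=None):
    """Return the allocation details found in the environment (None if unset)."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in SLURM_VARS}


print(slurm_allocation())
```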

Loading Anaconda#

You can see a list of available modules using module avail. Load Anaconda using module load anaconda/3, or type module load ana and press TAB to complete the name.

Once Anaconda is loaded, the conda command becomes available.

Create a new Conda env to install PyTorch (otherwise skip this section)#

Now that the conda command is recognised, use conda create --name ml to create a new Conda environment named ml (any other name works too).

Activate the Conda env#

First activate the base Conda env using source activate. Then type conda env list to see a list of Conda envs. Load the newly created Conda env using conda activate ml.
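To confirm from Python which environment is active, one option is to read the CONDA_DEFAULT_ENV variable that conda sets on activation. This is a small sketch rather than a required step, and active_conda_env is a hypothetical helper name:

```python
import os
import sys


def active_conda_env(env=None):
    """Return the name of the active Conda env, or None if none is active."""
    env = os.environ if env is None else env
    return env.get("CONDA_DEFAULT_ENV")


print("Conda env:", active_conda_env())
# The interpreter path should point inside the ml env once it is activated.
print("Python:", sys.executable)
```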

Install PyTorch#

  • Go here: https://pytorch.org/get-started/locally/

  • Get the command to install a stable PyTorch build with the latest CUDA.

  • As of 15 March 2022, the latest stable release was 1.11.0.

  • Copy the command shown at the bottom of the selection table: conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

  • Run this in the ml Conda env and wait. Installing PyTorch with CUDA 11.3 takes a while.
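Once the install finishes, a quick check of what was installed can save a failed run later. The sketch below is an assumption about how you might verify it, with the hypothetical helper name describe_torch_install. It reports the PyTorch version and the CUDA version the build was compiled against (torch.version.cuda), which can be read even on a machine without a GPU.

```python
import importlib.util


def describe_torch_install():
    """Report whether torch is importable and which CUDA version it was built for."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch

    return {
        "installed": True,
        "torch": torch.__version__,
        # None here would indicate a CPU-only build.
        "cuda_build": torch.version.cuda,
    }


print(describe_torch_install())
```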

Run a python file#

Write any PyTorch script. For example, I created gpu.py, which checks the availability of CUDA and prints the GPU name.

(ml) [s.1915438@sl1 experiment]$ cat gpu.py
import torch

# PyTorch version and whether CUDA is usable from this process
print(torch.__version__)
print(f"Is available: {torch.cuda.is_available()}")

# Index of the currently selected GPU; raises if no GPU can be used
try:
    print(f"Current Devices: {torch.cuda.current_device()}")
except Exception:
    print('Current Devices: Torch is not compiled for GPU or No GPU')

print(f"No. of GPUs: {torch.cuda.device_count()}")

# Name of the first GPU
try:
    print(f"GPU Name:{torch.cuda.get_device_name(0)}")
except Exception:
    print('GPU Name: No GPU available')

The easiest way to copy the script to the cluster is SFTP, for example with FileZilla, an SFTP client.

Open FileZilla and type sftp://sunbird.swansea.ac.uk into the Host box. Enter your username (for example, s.1915438) and your university password in the Username and Password boxes, then transfer the Python script to a directory of your choice. Next time, you can go to the Server menu and click Reconnect to log in without retyping these details.

In the Sunbird SSH session, go to the directory where you transferred gpu.py and run srun python gpu.py:

(ml) [s.1915438@sl1 experiment]$ srun python gpu.py
1.11.0
Is available: True
Current Devices: 0
No. of GPUs: 1
GPU Name:NVIDIA A100-PCIE-40GB

It took me hours to work out everything written here. When you are done, type exit to release the node.

Two GPUs#

Exiting the interactive session also unloads the conda module, so reload the anaconda/3 module and activate the ml Conda env again.

Allocate 2 GPUs:#

[s.1915438@sl1 experiment]$ salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:2
salloc: Granted job allocation 7133023
salloc: Waiting for resource configuration
salloc: Nodes scs2042 are ready for job

Modify the gpu.py file using nano#

(ml) [s.1915438@sl1 experiment]$ cat gpu.py
import torch

# PyTorch version and whether CUDA is usable from this process
print(torch.__version__)
print(f"Is available: {torch.cuda.is_available()}")

# Index of the currently selected GPU; raises if no GPU can be used
try:
    print(f"Current Devices: {torch.cuda.current_device()}")
except Exception:
    print('Current Devices: Torch is not compiled for GPU or No GPU')

print(f"No. of GPUs: {torch.cuda.device_count()}")

# Name of the first GPU
try:
    print(f"GPU Name:{torch.cuda.get_device_name(0)}")
except Exception:
    print('GPU Name: No GPU available')

# Name of the second GPU
try:
    print(f"GPU Name:{torch.cuda.get_device_name(1)}")
except Exception:
    print('GPU Name: No GPU available')

Run the python script for 2 GPUs#

(ml) [s.1915438@sl1 experiment]$ srun python gpu.py
1.11.0
Is available: True
Current Devices: 0
No. of GPUs: 2
GPU Name:NVIDIA A100-PCIE-40GB
GPU Name:NVIDIA A100-PCIE-40GB
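Hard-coding device indices 0 and 1 works for two GPUs, but the script has to be edited again for any other count. A possible generalisation, offered as a sketch rather than the method used above, loops over however many GPUs torch.cuda.device_count() reports. The helper name visible_gpu_names is hypothetical, and it returns an empty list if PyTorch is not installed or no GPU is visible.

```python
def visible_gpu_names():
    """Return the names of all GPUs visible to this process."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return []
    # device_count() is 0 when no GPU is visible, so the list is simply empty.
    return [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]


for index, name in enumerate(visible_gpu_names()):
    print(f"GPU {index}: {name}")
```

Saved as a script and run with srun python inside the allocation, this would print one line per allocated GPU.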