Interactive GPU best practices#
Submitting a job with sbatch is straightforward: write a job script and submit it. For testing, however, we often prefer an interactive session with salloc or srun.
A job script submitted with sbatch and srun --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:2 python main.py do the same work. The only difference is that srun prints the output directly to your terminal, whereas sbatch writes it to a file, which we have to follow with tail -f to see the output as it arrives.
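To make the comparison concrete, here is a minimal sketch of a batch script that requests the same resources as the srun command above. The file name job.sh and the output file name job-%j.out are assumptions chosen for illustration; if --output is omitted, Slurm writes to slurm-<jobid>.out instead.

```shell
# Write a minimal job script (job.sh is a hypothetical file name).
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --account=scw1901
#SBATCH --partition=accel_ai
#SBATCH --gres=gpu:2
# %j in the output file name expands to the job ID.
#SBATCH --output=job-%j.out

python main.py
EOF

# Submit the script, then follow the output file as it is written:
#   sbatch job.sh
#   tail -f job-<jobid>.out
```

The sbatch and tail commands are left as comments so the snippet can be pasted on a machine without Slurm; on the cluster, run them by hand with the job ID that sbatch prints.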
As mentioned in the training, salloc is primarily for “making sure it works on a small test run” before moving up to running production jobs in batch mode. For test jobs, the accel_ai_mig partition provides a large number of small (partitioned) GPUs to ensure that there is always one available for interactive tests, so you don’t have to queue.
NVIDIA Multi-Instance GPU (MIG) divides a GPU into multiple smaller instances. When you run nvidia-smi on a node in a MIG partition, you will see something like this:
[s.1915438@scs2046 ~]$ nvidia-smi
Tue May 3 14:40:17 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... On | 00000000:27:00.0 Off | On |
| N/A 39C P0 34W / 250W | 45MiB / 40960MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-PCI... On | 00000000:28:00.0 Off | On |
| N/A 40C P0 36W / 250W | 45MiB / 40960MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-PCI... On | 00000000:43:00.0 Off | On |
| N/A 42C P0 36W / 250W | 45MiB / 40960MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-PCI... On | 00000000:44:00.0 Off | On |
| N/A 43C P0 36W / 250W | 45MiB / 40960MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-PCI... On | 00000000:A3:00.0 Off | On |
| N/A 37C P0 36W / 250W | 45MiB / 40960MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-PCI... On | 00000000:A4:00.0 Off | On |
| N/A 37C P0 35W / 250W | 45MiB / 40960MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-PCI... On | 00000000:C3:00.0 Off | On |
| N/A 36C P0 35W / 250W | 45MiB / 40960MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-PCI... On | 00000000:C4:00.0 Off | On |
| N/A 37C P0 37W / 250W | 45MiB / 40960MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 3 0 0 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 5 0 1 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 6 0 2 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 9 0 3 | 6MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 3 0 0 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 4 0 1 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 5 0 2 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 13 0 3 | 6MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 2 3 0 0 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 2 4 0 1 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 2 5 0 2 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 2 13 0 3 | 6MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 3 3 0 0 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 3 5 0 1 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 3 6 0 2 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 3 9 0 3 | 6MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 4 3 0 0 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 4 4 0 1 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 4 5 0 2 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 4 13 0 3 | 6MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 5 3 0 0 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 5 4 0 1 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 5 5 0 2 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 5 13 0 3 | 6MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 6 3 0 0 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 6 5 0 1 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
See only allocated GPUs in nvidia-smi#
By default, nvidia-smi shows all the GPUs on the node, which makes it hard to see which ones are allocated to you. This is where bash -c 'echo $CUDA_VISIBLE_DEVICES' comes in handy: it shows the ID numbers of the CUDA devices visible to your job. We can then use nvidia-smi -i <GPU number> to see the usage of those specific GPUs. Also, srun --pty bash gives direct shell access on the compute node, so we do not have to write srun before each command. Here is an example.
[s.1915438@sl1 ~]$ salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:2
salloc: Granted job allocation 7165023
salloc: Waiting for resource configuration
salloc: Nodes scs2041 are ready for job
[s.1915438@sl1 ~]$ srun --pty bash
[s.1915438@scs2041 ~]$ bash -c 'echo $CUDA_VISIBLE_DEVICES'
2,3
[s.1915438@scs2041 ~]$ nvidia-smi -i 2,3
Tue May 3 20:20:49 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 2 NVIDIA A100-PCI... On | 00000000:43:00.0 Off | 0 |
| N/A 42C P0 47W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-PCI... On | 00000000:44:00.0 Off | 0 |
| N/A 44C P0 46W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[s.1915438@scs2041 ~]$
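The two steps in the session above (read $CUDA_VISIBLE_DEVICES, then pass it to nvidia-smi -i) can be wrapped in a small shell function. This is only a convenience sketch: the function name show_my_gpus and the NVIDIA_SMI override variable are made up for this example and are not part of Slurm or the NVIDIA tools.

```shell
# Show nvidia-smi output only for the GPUs allocated to the current job.
# NVIDIA_SMI is a hypothetical override (not a Slurm or NVIDIA variable) that
# lets the function be tried on a machine without nvidia-smi installed.
show_my_gpus() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "CUDA_VISIBLE_DEVICES is empty; run this inside an allocation" >&2
        return 1
    fi
    # Pass the comma-separated device list straight to nvidia-smi -i.
    "${NVIDIA_SMI:-nvidia-smi}" -i "$CUDA_VISIBLE_DEVICES"
}
```

After srun --pty bash, calling show_my_gpus in the session above would run nvidia-smi -i 2,3. As the next section shows, this only works when the IDs in $CUDA_VISIBLE_DEVICES match the numbering nvidia-smi uses.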
Problem with MIG partition#
Here is a detailed post of mine: https://unix.stackexchange.com/questions/701266/contradiction-in-gpu-numbering-of-cuda-visible-devices-and-nvidia-smi-when
$CUDA_VISIBLE_DEVICES and nvidia-smi -i <GPU_NUMBER> work fine together without MIG. However, they do not seem to agree on the GPU numbering when MIG is active. Here is an example:
[s.1915438@sl1 ~]$ salloc --nodes=1 --account=scw1901 --partition=accel_ai_mig --gres=gpu:2
salloc: Granted job allocation 7165025
salloc: Waiting for resource configuration
salloc: Nodes scs2046 are ready for job
[s.1915438@sl1 ~]$ srun --pty bash
[s.1915438@scs2046 ~]$ bash -c 'echo $CUDA_VISIBLE_DEVICES'
84,255
[s.1915438@scs2046 ~]$ nvidia-smi -i 84,255
No devices were found
[s.1915438@scs2046 ~]$
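A quick way to see the mismatch is to compare each ID in $CUDA_VISIBLE_DEVICES against the physical GPU indices that nvidia-smi reports on the node. The sketch below is a diagnostic only and does not resolve the numbering problem; the function name check_visible_devices and the NVIDIA_SMI override are hypothetical names introduced for this example.

```shell
# Report which entries of CUDA_VISIBLE_DEVICES are physical GPU indices
# known to nvidia-smi on this node.
# NVIDIA_SMI is a hypothetical override used only to try the function
# on a machine without nvidia-smi.
check_visible_devices() {
    local smi="${NVIDIA_SMI:-nvidia-smi}"
    local known id rc=0
    # One physical GPU index per line, e.g. 0..7 on an eight-GPU node.
    known=$("$smi" --query-gpu=index --format=csv,noheader | tr -d ' ')
    local IFS=','   # split CUDA_VISIBLE_DEVICES on commas
    for id in ${CUDA_VISIBLE_DEVICES:-}; do
        if printf '%s\n' "$known" | grep -qx -- "$id"; then
            echo "$id: matches a physical GPU index"
        else
            echo "$id: not a physical GPU index on this node"
            rc=1
        fi
    done
    return "$rc"
}
```

On the non-MIG node above, IDs 2 and 3 would both match. On the MIG node, 84 and 255 would be reported as unknown, which is consistent with the "No devices were found" message. Running nvidia-smi -L on the node additionally lists each GPU and MIG device together with its UUID, which may help identify the allocated instances.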
I have submitted a bug report on NVIDIA’s website and will update this notebook as soon as I get a reply.