Interactive GPU best practices

Submitting a job with sbatch is straightforward: write a job script and submit it. For testing, however, we often prefer an interactive session started with salloc or srun.

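A job script submitted with sbatch and the command srun --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:2 python main.py do the same work. The only difference is that srun prints the output straight to your terminal, whereas sbatch writes it to a file (slurm-<jobid>.out by default), which we have to follow with tail -f to see the output.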
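To make the comparison concrete, here is a minimal sketch of a helper that submits a job script and then follows its output, so the batch workflow feels closer to srun. It assumes the job script does not set --output, so that Slurm writes to its default slurm-<jobid>.out file in the submission directory, and it assumes GNU tail. The names submit_and_follow and job.sh are made up for illustration.

```shell
# Example job.sh (hypothetical), equivalent to the srun command above:
#   #!/bin/bash
#   #SBATCH --nodes=1
#   #SBATCH --account=scw1901
#   #SBATCH --partition=accel_ai
#   #SBATCH --gres=gpu:2
#   python main.py

# Submit a batch script and follow its output file as the job writes to it.
# Assumption: the script does not override --output, so Slurm uses the
# default file name slurm-<jobid>.out in the current directory.
submit_and_follow() {
    local script="$1" jobid
    # --parsable makes sbatch print only the job ID (plus ";cluster" on
    # multi-cluster setups) instead of "Submitted batch job <id>".
    jobid=$(sbatch --parsable "$script") || return 1
    jobid=${jobid%%;*}
    echo "Submitted job ${jobid}; press Ctrl-C to stop following the output"
    # GNU tail -F keeps retrying until the file exists, which covers the
    # time the job spends waiting in the queue.
    tail -F "slurm-${jobid}.out"
}
```

After sourcing this in your login shell, submit_and_follow job.sh would replace the separate sbatch and tail -f steps. Stopping tail with Ctrl-C does not cancel the job.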

As mentioned in the training, salloc is primarily for “making sure it works on a small test run” before moving up to running production jobs in batch mode. For test jobs, the accel_ai_mig partition provides a large number of small (partitioned) GPUs to ensure that there is always one available for interactive tests, so you don’t have to queue.

NVIDIA Multi-Instance GPU (MIG) divides a physical GPU into multiple smaller instances. When you run nvidia-smi on a node in a MIG partition, you will see something like this:

[s.1915438@scs2046 ~]$ nvidia-smi
Tue May  3 14:40:17 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:27:00.0 Off |                   On |
| N/A   39C    P0    34W / 250W |     45MiB / 40960MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  On   | 00000000:28:00.0 Off |                   On |
| N/A   40C    P0    36W / 250W |     45MiB / 40960MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCI...  On   | 00000000:43:00.0 Off |                   On |
| N/A   42C    P0    36W / 250W |     45MiB / 40960MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCI...  On   | 00000000:44:00.0 Off |                   On |
| N/A   43C    P0    36W / 250W |     45MiB / 40960MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-PCI...  On   | 00000000:A3:00.0 Off |                   On |
| N/A   37C    P0    36W / 250W |     45MiB / 40960MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-PCI...  On   | 00000000:A4:00.0 Off |                   On |
| N/A   37C    P0    35W / 250W |     45MiB / 40960MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-PCI...  On   | 00000000:C3:00.0 Off |                   On |
| N/A   36C    P0    35W / 250W |     45MiB / 40960MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-PCI...  On   | 00000000:C4:00.0 Off |                   On |
| N/A   37C    P0    37W / 250W |     45MiB / 40960MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    3   0   0  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    5   0   1  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    6   0   2  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   3  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    3   0   0  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    4   0   1  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    5   0   2  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   13   0   3  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    3   0   0  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    4   0   1  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    5   0   2  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2   13   0   3  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    3   0   0  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    5   0   1  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    6   0   2  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    9   0   3  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  4    3   0   0  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  4    4   0   1  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  4    5   0   2  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  4   13   0   3  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  5    3   0   0  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  5    4   0   1  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  5    5   0   2  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  5   13   0   3  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  6    3   0   0  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  6    5   0   1  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

See only allocated GPUs in nvidia-smi

By default, nvidia-smi shows every GPU on the node, which makes it hard to see which ones are allocated to you. This is where bash -c 'echo $CUDA_VISIBLE_DEVICES' comes in handy: it prints the IDs of the CUDA devices allocated to your job. We can then use nvidia-smi -i <GPU number> to see the usage of those specific GPUs. Also, srun --pty bash opens a shell directly on the compute node, so we do not have to write srun before each command. Here is an example.

[s.1915438@sl1 ~]$ salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:2
salloc: Granted job allocation 7165023
salloc: Waiting for resource configuration
salloc: Nodes scs2041 are ready for job
[s.1915438@sl1 ~]$ srun --pty bash
[s.1915438@scs2041 ~]$ bash -c 'echo $CUDA_VISIBLE_DEVICES'
2,3
[s.1915438@scs2041 ~]$ nvidia-smi -i 2,3
Tue May  3 20:20:49 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   2  NVIDIA A100-PCI...  On   | 00000000:43:00.0 Off |                    0 |
| N/A   42C    P0    47W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCI...  On   | 00000000:44:00.0 Off |                    0 |
| N/A   44C    P0    46W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[s.1915438@scs2041 ~]$
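
The two steps in the session above can be combined into one small shell function. This is a minimal sketch, assuming a non-MIG partition where CUDA_VISIBLE_DEVICES holds plain GPU indices such as 2,3. The name my_gpus is made up for illustration.

```shell
# Show nvidia-smi output only for the GPUs allocated to the current job.
# Assumption: non-MIG partition, so CUDA_VISIBLE_DEVICES contains plain
# GPU indices (for example "2,3") that nvidia-smi -i understands.
my_gpus() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "CUDA_VISIBLE_DEVICES is not set; are you inside a GPU allocation?" >&2
        return 1
    fi
    # Any extra arguments are passed through to nvidia-smi unchanged.
    nvidia-smi -i "$CUDA_VISIBLE_DEVICES" "$@"
}
```

Define the function in the shell opened by srun --pty bash (or in your ~/.bashrc) and run my_gpus instead of typing the device IDs by hand.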

Problem with MIG partition

Here is a detailed post of mine: https://unix.stackexchange.com/questions/701266/contradiction-in-gpu-numbering-of-cuda-visible-devices-and-nvidia-smi-when

$CUDA_VISIBLE_DEVICES and nvidia-smi -i <GPU_NUMBER> work fine together without MIG. However, they do not appear to agree on the GPU numbering when MIG is active. Here is an example:

[s.1915438@sl1 ~]$ salloc --nodes=1 --account=scw1901 --partition=accel_ai_mig --gres=gpu:2
salloc: Granted job allocation 7165025
salloc: Waiting for resource configuration
salloc: Nodes scs2046 are ready for job
[s.1915438@sl1 ~]$ srun --pty bash
[s.1915438@scs2046 ~]$ bash -c 'echo $CUDA_VISIBLE_DEVICES'
84,255
[s.1915438@scs2046 ~]$ nvidia-smi -i 84,255
No devices were found
[s.1915438@scs2046 ~]$
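
One way to spot this mismatch early is to compare each entry of CUDA_VISIBLE_DEVICES with the GPU indices that nvidia-smi itself reports. The following is only a diagnostic sketch, not a fix for the numbering problem. The name check_visible_devices is made up for illustration, and the function relies on nvidia-smi --query-gpu=index --format=csv,noheader printing one physical GPU index per line.

```shell
# Report which entries of CUDA_VISIBLE_DEVICES are GPU indices that
# nvidia-smi knows about. Returns non-zero if any entry is unknown.
check_visible_devices() {
    local known id status=0
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "CUDA_VISIBLE_DEVICES is not set" >&2
        return 1
    fi
    # One physical GPU index per line, e.g. 0 to 7 on an eight-GPU node.
    known=$(nvidia-smi --query-gpu=index --format=csv,noheader) || return 1
    # Split the comma-separated list and look up each entry as a whole line.
    for id in $(printf '%s' "$CUDA_VISIBLE_DEVICES" | tr ',' ' '); do
        if printf '%s\n' "$known" | tr -d ' ' | grep -qxF -- "$id"; then
            echo "$id: matches a GPU index reported by nvidia-smi"
        else
            echo "$id: not a GPU index known to nvidia-smi"
            status=1
        fi
    done
    return "$status"
}
```

With the values from the session above (84,255 on a node whose GPUs are numbered 0 to 7), both entries should be flagged as unknown, whereas 2,3 on the non-MIG partition should pass.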

I have submitted a bug report on NVIDIA’s website. I will update this notebook as soon as I get a reply.