Building an Apptainer image of Modulus v22.09#

Instead of creating a definition file from scratch we can also convert the Docker iamge to Apptainer image.

sudo apptainer build modulus.img docker-archive://modulus_image_v22.09.tar.gz

If the Docker image is downloaded from NVIDIA NGC, then just change the URI location.

sudo apptainer build modulus.img docker://modulus_image_v22.09.tar.gz

Getting the examples using the SSH key#

Make sure you have uploaded your SSH key to Gitlab.

Downlaod source code: git clone git@gitlab.com:nvidia/modulus/modulus.git
Download the examples: git clone git@gitlab.com:nvidia/modulus/examples.git
Download examples with checkpoints: git clone git@gitlab.com:nvidia/modulus/examples.git

Port forwarding the Jupyter-lab#

As per this article, you don’t need to map the ports from within the Apptainer image. I am not sure if this is the most efficient way to run a jupyter server which listens to any connection. Will deploy a script in future as Modulus supports interactive development in the newer versions.

[s.1915438@sl1 modulus22.09_apptainer]$ apptainer shell --contain --cleanenv modulus_22.09.img
Apptainer> jupyter lab --ip 0.0.0.0 --no-browser --port=8888

Now one can port forward easily as follows:

ssh -N -L 8888:localhost:8888 s.1915438@sunbird.swansea.ac.uk

In 8888:localhost:8888 the syntax is LOCAL_PORT:HOSTNAME:REMOTE_PORT

For compute node the same can be

ssh -N -L 8888:scs2041:8888 s.1915438@sunbird.swansea.ac.uk

Checking if the port is already occupied#

[s.1915438@sl1 ~]$ netstat -tulpn | grep :8888
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 0.0.0.0:8888            0.0.0.0:*               LISTEN      220549/python3.8

If a process is hindering then kill the process as follows:

(base) hell@Dell-Precision-T7910:~$ netstat -tulpn | grep :8888
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 127.0.0.1:8888          0.0.0.0:*               LISTEN      114998/ssh
tcp6       0      0 ::1:8888                :::*                    LISTEN      114998/ssh
(base) hell@Dell-Precision-T7910:~$ kill -9 114998

A robust way to execute jobs#

Allocate the resources Currently, Apptainer is fully functional on scs2043 only.

srun --pty --account=scw1901 --gres=gpu:1 --partition=accel_ai --nodelist=scs2043 /bin/bash

or just use this one when Apptainer is implemented system-wide.

srun --pty --account=scw1901 --gres=gpu:1 --partition=accel_ai /bin/bash

Load Apptainer

module load apptainer/1.0.3

Start Modulus Apptainer image

Use this if you want to bind the $(PWD).

apptainer shell --nv --contain --cleanenv --bind "$(pwd)":/data,/tmp:/tmp --env CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES "/home/scratch/s.1915438/modulus22.09_apptainer/modulus_22.09.img"

Or use this to bind the whole scratch partition.

apptainer shell --nv --contain --cleanenv --bind "/scratch/s.1915438/":/data,/tmp:/tmp --home /data --env CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES "/scratch/s.1915438/modulus22.09_apptainer/modulus_22.09.img"

--nv : allows pasing NVIDIA GPUs
--contain: no volume are binded by default
Start the jupyter-lab server

jupyter lab --ip 0.0.0.0 --no-browser --port=8888

Port forward Start a new terminal. Use this format to port forward; LOCAL_PORT:HOSTNAME:REMOTE_PORT

ssh -N -L 8888:scs2041:8888 s.1915438@sunbird.swansea.ac.uk

just copy the address from the Jupyter server terminal:

http://hostname:8888/?token=43ea95796cc4dc4c64325ab5687701c59f9dc7a2ede559a0

and rename the hostname to the localhost

http://localhost:8888/?token=43ea95796cc4dc4c64325ab5687701c59f9dc7a2ede559a0

For parallel training:

mpirun -np 2 python ldc_2d.py

where 2 is the number of GPUs allocated to the job. Make sure CUDA_VISIBLE_DEVICES is showing GPU numbers inside the container since we exported it from the host system.

Run this notebook exampel to learn more. Looks like the GPU is working.

$image.png$

Have to see if there is any advantage with Hydra manager.