Set up Nvidia Modulus v22.03 on Sunbird using an interactive GPU session#

As of 17 Apr 2022, the link to the Modulus documentation is a bit hard to find. Here it is: https://docs.nvidia.com/deeplearning/modulus/index.html

Installation#

The Conda environment turned out to cause a lot of issues, so I will use a Python virtual environment without JupyterLab.

Installing latest Python#

The Python versions installed system-wide on Sunbird are very old. The latest one is 3.6.

[s.1915438@sl1 ~]$ ls /usr/bin/python*
/usr/bin/python  /usr/bin/python2  /usr/bin/python2.7  /usr/bin/python2.7-config  /usr/bin/python2-config  /usr/bin/python3  /usr/bin/python3.6  /usr/bin/python3.6m  /usr/bin/python-config

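If we want to create a virtual environment with a recent Python, we can borrow the Python from a Conda environment.

Create a new Conda environment as follows. In my case this gave an environment with a recent Python. If the environment ends up without its own Python, appending a package spec such as python=3.9 to the conda create command requests one explicitly.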

module load anaconda/2021.05
conda create --name modulus
source activate modulus

Let us check the Python version in the modulus environment.

(modulus) [s.1915438@sl1 ~]$ which python
/lustrehome/home/s.1915438/modulus/bin/python
(modulus) [s.1915438@sl1 ~]$ python --version
Python 3.9.12

We can use this Python to create our virtual environment as follows. I will create it on the /scratch/ partition because it is faster than the /lustrehome/ partition.

(modulus) [s.1915438@sl1 ~]$ cd /scratch/s.1915438
(modulus) [s.1915438@sl2 s.1915438]$ mkdir env
(modulus) [s.1915438@sl2 s.1915438]$ ls
ansys195  env  jupyter_env.sh  jupyter_log  jupyter.sh  modulus  Modulus_examples  Modulus_source
(modulus) [s.1915438@sl2 s.1915438]$ cd env
(modulus) [s.1915438@sl2 env]$ python3 -m venv modulus
(modulus) [s.1915438@sl2 env]$

Now it is time to leave the Conda environment. The simplest way is to re-establish the SSH connection (conda deactivate may also work).

Running Python virtual environment#

The Python virtual environment can be activated using this command:

[s.1915438@sl1 ~]$ cd /scratch/s.1915438
[s.1915438@sl1 s.1915438]$ source env/modulus/bin/activate
(modulus) [s.1915438@sl1 s.1915438]$

Now we can check the Python version:

(modulus) [s.1915438@sl1 s.1915438]$ which python
/scratch/s.1915438/env/modulus/bin/python
(modulus) [s.1915438@sl1 s.1915438]$ python --version
Python 3.9.12
(modulus) [s.1915438@sl1 s.1915438]$
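
The same check can be made from inside Python, which helps when it is unclear whether the Conda or the venv interpreter is running. This is a minimal sketch: the helper name in_virtualenv is my own, and it relies on the standard behaviour that venv makes sys.prefix differ from sys.base_prefix.

```python
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter runs inside a venv.

    Inside a venv, sys.prefix points at the environment, while
    sys.base_prefix points at the Python that created it.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


print("interpreter:", sys.executable)
print("version    :", ".".join(map(str, sys.version_info[:3])))
print("inside venv:", in_virtualenv())
```

With the environment activated, the interpreter path should point into /scratch/s.1915438/env/modulus.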

Installing Pytorch#

Remember to install a PyTorch build that supports the Nvidia A100. Version '1.11.0+cu102', i.e. 1.11 built with CUDA 10.2, is incompatible and produces the following warning:

(modulus) [s.1915438@sl2 helmholtz]$ srun python helmholtz.py
/scratch/s.1915438/env/modulus/lib/python3.9/site-packages/torch/cuda/__init__.py:145: UserWarning:
NVIDIA A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.

So install a build for a newer CUDA version, such as '1.11.0+cu113', using pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113.
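
To check an installed build before starting a long run, the sketch below compares the GPU's compute capability with the architectures PyTorch was compiled for. The helper names (arch_major, gpu_supported, report) are my own, and matching on the major version only is an assumption that approximates the check behind the warning above, not a copy of it.

```python
def arch_major(arch):
    """Extract the major compute capability from a tag such as 'sm_80'."""
    return int(arch.split("_")[1]) // 10


def gpu_supported(capability, arch_list):
    """Approximate check: some compiled 'sm_' arch shares the GPU's major version."""
    major, _minor = capability
    return any(arch_major(a) == major for a in arch_list if a.startswith("sm_"))


def report():
    """Print what the installed PyTorch build supports (needs torch and a GPU)."""
    import torch  # imported here so the helpers above work without torch

    print("torch:", torch.__version__, "CUDA:", torch.version.cuda)
    if not torch.cuda.is_available():
        print("No CUDA device visible; run this through srun on a GPU node.")
        return
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    print("device:", torch.cuda.get_device_name(0), capability)
    print("supported:", gpu_supported(capability, arch_list))


# Values taken from the warning above: an A100 (sm_80) with the cu102 build.
print(gpu_supported((8, 0), ["sm_37", "sm_50", "sm_60", "sm_70"]))  # False
```

Calling report() from a script started with srun inside the GPU allocation shows the values for the actual node.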

Installing Nvidia Modulus v22.03#

A requirements.txt file is present in this directory. It contains the command to install the prerequisites for Modulus, reproduced below. Please do not follow Nvidia’s online instructions.

pip3 install matplotlib transforms3d future typing numpy quadpy numpy-stl==2.11.2 h5py sympy==1.5.1 termcolor psutil symengine==0.6.1 numba Cython chaospy torch_optimizer vtk omegaconf hydra-core einops timm tensorboard pandas orthopy ndim


pip3 install -U https://github.com/paulo-herrera/PyEVTK/archive/v1.1.2.tar.gz
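
After the installs finish, a quick way to see whether anything is missing is to ask Python to locate each module without importing it. This is only a sketch: the MODULES list is a subset of the prerequisites, and the import names in it are my assumptions, since they can differ from the pip package names.

```python
from importlib.util import find_spec

# Import names are assumptions: they can differ from the pip package names
# (for example numpy-stl is imported as 'stl' and hydra-core as 'hydra').
MODULES = [
    "matplotlib", "transforms3d", "numpy", "quadpy", "stl", "h5py",
    "sympy", "termcolor", "psutil", "symengine", "numba", "Cython",
    "chaospy", "torch_optimizer", "vtk", "omegaconf", "hydra", "einops",
    "timm", "tensorboard", "pandas", "orthopy", "ndim",
]


def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(MODULES)
print("missing:", ", ".join(missing) if missing else "none")
```

Run it with the virtual environment activated; anything listed as missing can then be installed with pip3 individually.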

Go to the Nvidia Modulus source directory and install Modulus into the modulus virtual environment.

[s.1915438@sl1 Modulus]$ ls
accompanying_licences  build  changelog_tensorflow.md  dist  Dockerfile  external  MANIFEST.in  modulus  modulus.egg-info  NVIDIA-OptiX-SDK-7.0.0-linux64.sh  README.md  requirements.txt  setup.cfg  setup.py
[s.1915438@sl1 Modulus]$ pwd
/scratch/s.1915438/Modulus_source/Modulus
[s.1915438@sl1 Modulus]$ python setup.py install

After some time you should see a success message:

Using /scratch/s.1915438/modulus/lib/python3.9/site-packages
Finished processing dependencies for modulus==22.3

Installing PySDF#

Related forum thread: https://forums.developer.nvidia.com/t/modulus-22-03-bare-metal-installation-no-module-named-easy-install/210970

Copy the PySDF files from the previous release, v21.06 (./Modulus/external/pysdf), into ./Modulus/external of v22.03. I am doing this because, with Python 3.9, egg files can no longer be installed using easy_install, which is the default method for installing PySDF in Modulus v22.03.

Now we can proceed with the instructions from the older manual, as follows.

(/scratch/s.1915438/modulus) [s.1915438@sl1 Modulus]$ pwd
/scratch/s.1915438/Modulus_source/Modulus
(/scratch/s.1915438/modulus) [s.1915438@sl1 Modulus]$ cd external/
(/scratch/s.1915438/modulus) [s.1915438@sl1 external]$ ls
eggs  lib  pysdf
(/scratch/s.1915438/modulus) [s.1915438@sl1 external]$ export LD_LIBRARY_PATH=$(pwd)/pysdf/:${LD_LIBRARY_PATH}

Now install PySDF:

(modulus) [s.1915438@sl2 pysdf]$ pwd
/scratch/s.1915438/Modulus_source/Modulus/external/pysdf
(modulus) [s.1915438@sl2 pysdf]$ python setup.py install

After some time you will see:

Installed /scratch/s.1915438/env/modulus/lib/python3.9/site-packages/pysdf-0.1-py3.9-linux-x86_64.egg
Processing dependencies for pysdf==0.1
Finished processing dependencies for pysdf==0.1

Running an interactive GPU session#

salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:2

Set the number of GPUs as needed; the number of CPUs does not matter here.

(modulus) [s.1915438@sl2 helmholtz]$ salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:1
salloc: Granted job allocation 7161838
salloc: Waiting for resource configuration
salloc: Nodes scs2041 are ready for job

We can see our job in two ways: using squeue --user=s.1915438 or squeue --partition=accel_ai.

[s.1915438@sl2 ~]$ squeue --partition=accel_ai
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           7161842  accel_ai     bash s.191543  R       0:38      1 scs2041
           7161825  accel_ai Eval_ens   a.bip5  R    1:08:17      1 scs2041
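
Note that the USER column in the output above is truncated (s.191543 instead of s.1915438), which matters when filtering the table by user name. The sketch below parses the default squeue table and matches truncated names by prefix. The function names are my own, the prefix match can in principle also pick up another user with a similar name, and job names containing spaces would break the simple whitespace split.

```python
def parse_squeue(text):
    """Turn squeue's default table into a list of dicts keyed by column name."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # maxsplit keeps a NODELIST(REASON) value containing spaces in one field.
    return [dict(zip(header, line.split(None, len(header) - 1))) for line in lines[1:]]


def jobs_for_user(jobs, user):
    """Select jobs whose (possibly truncated) USER column is a prefix of user."""
    return [job for job in jobs if user.startswith(job["USER"])]


SAMPLE = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           7161842  accel_ai     bash s.191543  R       0:38      1 scs2041
           7161825  accel_ai Eval_ens   a.bip5  R    1:08:17      1 scs2041
"""

for job in jobs_for_user(parse_squeue(SAMPLE), "s.1915438"):
    print(job["JOBID"], job["ST"], job["NODELIST(REASON)"])
```

For the sample table this prints the one job belonging to s.1915438, i.e. 7161842 in state R on scs2041.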

Running Nvidia Modulus example#

We can use srun to run any Python script on the allocated GPU node as follows:

(modulus) [s.1915438@sl2 seismic_wave]$ srun python wave_2d.py
training:
  max_steps: 40000
  grad_agg_freq: 1
  rec_results_freq: 1000
  :
  <Output continues>

Cancelling model training#

Nvidia Modulus keeps training for a long time (the output above shows max_steps: 40000) and stores the data in the checkpoint folder. We can cancel the training at any time, for example once the loss is satisfactory, by pressing Ctrl+C multiple times.

SDF library and STL file support do not work yet#

This is something I still have to look into. For now, here is the error:

(modulus) [s.1915438@sl1 s.1915438]$ cd Modulus_examples/examples/aneurysm/
(modulus) [s.1915438@sl1 aneurysm]$ ls
aneurysm.py  conf  openfoam  stl_files
(modulus) [s.1915438@sl1 aneurysm]$ srun python aneurysm.py
Error importing pysdf. Make sure 'libsdf.so' is in LD_LIBRARY_PATH and pysdf is installed
Traceback (most recent call last):
  File "/scratch/s.1915438/Modulus_examples/examples/aneurysm/aneurysm.py", line 25, in <module>
    from modulus.geometry.tessellation.tessellation import Tessellation
  File "/scratch/s.1915438/env/modulus/lib/python3.9/site-packages/modulus-22.3-py3.9.egg/modulus/geometry/tessellation/tessellation.py", line 11, in <module>
    import pysdf.sdf as pysdf
ImportError: libsdf.so: cannot open shared object file: No such file or directory
srun: error: scs2041: task 0: Exited with exit code 1
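
A first diagnostic step is to check whether libsdf.so can actually be found on LD_LIBRARY_PATH in the shell that launches srun. The export shown in the PySDF section only lasts for the shell session it was typed in, so one possible cause (not confirmed here) is that the variable is no longer set. The sketch below, with a helper name of my own choosing, lists the directories on the path that contain the file.

```python
import os


def find_library(filename, search_path):
    """Return the directories on a colon-separated path that contain filename."""
    hits = []
    for directory in search_path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


search_path = os.environ.get("LD_LIBRARY_PATH", "")
hits = find_library("libsdf.so", search_path)
print("LD_LIBRARY_PATH:", search_path or "(empty)")
print("libsdf.so found in:", hits or "nowhere on LD_LIBRARY_PATH")
```

Running it both directly and through srun shows whether the login shell and the job step see the same path.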