Issue with TensorFlow#

TensorFlow is notoriously difficult to install on HPC clusters. Simply running pip install almost never works, because the cluster's environment is usually not set up to support it. This guide shows how to run TensorFlow on an HPC cluster using the Apptainer (formerly Singularity) container system.

Prerequisites#

There are a few things we need to figure out before we can install TensorFlow on an HPC cluster.

CUDA version#

Typically, HPC clusters have multiple versions of CUDA installed. We need to know which versions of CUDA are available so that we can install a matching version of TensorFlow.

On Swansea University's Sunbird cluster, we can check the available versions of CUDA using the following command:

module avail CUDA

This gives me:

-------------------------------- /apps/modules/libraries ---------------------------------
CUDA/10.0 CUDA/11.2 CUDA/11.4 CUDA/11.6 CUDA/8.0  CUDA/9.1
CUDA/10.1 CUDA/11.3 CUDA/11.5 CUDA/11.7 CUDA/9.0  CUDA/9.2

Based on this, we need to figure out which version of TensorFlow is compatible with the CUDA versions available to us. This can be found on the TensorFlow website. For example, tensorflow-2.11.0 is compatible with CUDA 11.2, so we will go with tensorflow-2.11. This requires Python 3.7-3.10; I am going with Python 3.8.
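The availability check can also be scripted. The following is a minimal sketch, assuming the output of module avail CUDA has been captured as text (the sample is the Sunbird listing shown above). The helper name available_cuda_versions and the variable wanted are illustrative only and not part of any tool.

```python
import re

# Sample text copied from the `module avail CUDA` listing above.
MODULE_AVAIL_OUTPUT = """
-------------------------------- /apps/modules/libraries ---------------------------------
CUDA/10.0 CUDA/11.2 CUDA/11.4 CUDA/11.6 CUDA/8.0  CUDA/9.1
CUDA/10.1 CUDA/11.3 CUDA/11.5 CUDA/11.7 CUDA/9.0  CUDA/9.2
"""


def available_cuda_versions(text):
    """Return the CUDA versions named in `module avail` output, sorted numerically."""
    versions = re.findall(r"CUDA/(\d+(?:\.\d+)*)", text)
    return sorted(set(versions), key=lambda v: tuple(int(p) for p in v.split(".")))


versions = available_cuda_versions(MODULE_AVAIL_OUTPUT)
wanted = "11.2"  # the CUDA version paired with tensorflow-2.11 in the text above
print(versions)
print(f"CUDA {wanted} available: {wanted in versions}")
```

Sorting on integer tuples rather than on strings keeps 8.0 and 9.x ahead of 10.x and 11.x.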

What is the problem with `pip install`#

The short answer is "I don't know". It probably requires some tinkering with environment variables and paths. I have tried to install TensorFlow with pip install on an HPC cluster and it has never worked.

Importing tensorflow in a Python script gives me the following warnings:

2024-03-01 17:03:41.627877: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-01 17:03:45.430288: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/extras/CUPTI/lib64:/usr/local/cuda-11.2/lib64:/opt/slurm/23.02.6/el7/lib:/opt/slurm/23.02.6/el7/lib:/opt/slurm/23.02.6/el7/lib:/opt/slurm/23.02.6/el7/lib
2024-03-01 17:03:45.430764: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/extras/CUPTI/lib64:/usr/local/cuda-11.2/lib64:/opt/slurm/23.02.6/el7/lib:/opt/slurm/23.02.6/el7/lib:/opt/slurm/23.02.6/el7/lib:/opt/slurm/23.02.6/el7/lib
2024-03-01 17:03:45.430792: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

As a result, any tensor created falls back to the CPU, which is not ideal for training large models.
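When the log is long, it can help to pull out just the names of the libraries that failed to load. This is a small sketch, assuming the warnings keep the "Could not load dynamic library '...'" wording seen above. The sample lines are abbreviated from that log, and missing_libraries is a hypothetical helper name.

```python
import re

# Two warning lines abbreviated from the log above (timestamps and LD_LIBRARY_PATH trimmed).
LOG = """\
W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
"""


def missing_libraries(log_text):
    """Return the shared libraries reported as not loadable, in order, without repeats."""
    seen = []
    for name in re.findall(r"Could not load dynamic library '([^']+)'", log_text):
        if name not in seen:
            seen.append(name)
    return seen


missing = missing_libraries(LOG)
print(missing)
```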

Apptainer to the rescue#

TensorFlow publishes Docker images on Docker Hub. There we can find tensorflow/tensorflow:2.11.1-gpu, which is compatible with CUDA 11.2.

Now we need to pull this image with apptainer/singularity. This can be done using the following command:

module load apptainer/1.0.3
apptainer pull tensorflow_tensorflow_2.11.1-gpu.sif docker://tensorflow/tensorflow:2.11.1-gpu

This pulls the image from Docker Hub and saves it in the current directory as tensorflow_tensorflow_2.11.1-gpu.sif. This is the image file we will use to run TensorFlow on the HPC cluster. It contains everything TensorFlow needs, including CUDA, cuDNN, and other dependencies.

Just to reiterate, we need to load two modules.

module load apptainer/1.0.3
module load CUDA/11.2

Now we can run the image using the following command:

apptainer run --nv tensorflow_tensorflow_2.11.1-gpu.sif

Now we can run the following python script to check if tensorflow is working:

import tensorflow as tf
my_variable = tf.Variable([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(my_variable.device)

This should print something like:

/job:localhost/replica:0/task:0/device:GPU:0

This means that tensorflow is using the GPU.
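A slightly more general check than looking at the device of a single variable is to ask TensorFlow which GPUs it can see. The sketch below uses tf.config.list_physical_devices. The wrapper visible_gpus is a hypothetical helper, and the import is guarded so the snippet also runs in an environment where TensorFlow is missing.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment")
elif gpus:
    print("GPUs visible to TensorFlow:", gpus)
else:
    print("No GPU visible; tensors will be placed on the CPU")
```

An empty list inside the container usually means the image was started without --nv or the job was not given a GPU.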

This image is very generic. We might want to install additional packages such as wandb, hydra, or any other package. For this we can build a new image with all the packages we need.

We need to build the image on our local machine, because --fakeroot might be disabled on the cluster. So install Apptainer on your local PC. Unfortunately, Apptainer is not available for Windows.

Create a definition file called tensorflow.def with the following content:

Bootstrap: localimage
From: /home/hell/Desktop/temp/tensorflow_wandb/tensorflow_2.11.1-gpu.sif


%post
    pip install wandb xgboost scikit-learn seaborn statsmodels

The From field should point to the image we pulled from Docker Hub. The %post section installs the additional packages. We could also install packages with pip in the terminal after starting the image, but then we would have to reinstall them every time we run it, so it is better to build a new image that already contains everything we need.

Now we can build the image using the following command:

sudo apptainer build tensorflow_wandb.sif tensorflow.def

This creates a new image called tensorflow_wandb.sif with all the packages we need. After transferring it to the HPC cluster, we can run it there using the following command:

apptainer run --nv tensorflow_wandb.sif
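
Once inside the new image, a quick way to confirm that the %post step worked is to check that each package can be found. This is a minimal sketch using importlib from the standard library. The list EXPECTED mirrors the pip install line in the definition file, and missing_packages is a hypothetical helper name.

```python
import importlib.util

# Import names for the packages installed in %post above.
# Note that scikit-learn is imported as `sklearn`.
EXPECTED = ["wandb", "xgboost", "sklearn", "seaborn", "statsmodels"]


def missing_packages(import_names):
    """Return the names from import_names that cannot be found in this environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


missing = missing_packages(EXPECTED)
print("All packages found" if not missing else f"Missing: {missing}")
```

find_spec only locates a package without importing it, so the check stays fast even for heavy libraries.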

Binding directories#

We might want to bind directories to the image. This is useful for reading and writing files. For example, we might want to bind the directory containing the data to the image. This can be done using the following command:

apptainer run --nv --bind /path/to/data:/data tensorflow_wandb.sif

Note: The --nv flag is used to enable GPU support. The --bind flag is used to bind directories to the image. The --bind flag can be used multiple times to bind multiple directories to the image.
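
When several directories are involved, the command line gets long. As an illustration only, the sketch below assembles the same apptainer run command from a list of (host, container) pairs, repeating --bind once per pair. The function name apptainer_run_command and the /path/to/results directory are hypothetical.

```python
import shlex


def apptainer_run_command(image, binds, gpu=True):
    """Build the argument list for `apptainer run` with one --bind per (host, container) pair."""
    command = ["apptainer", "run"]
    if gpu:
        command.append("--nv")  # enable GPU support, as described above
    for host_path, container_path in binds:
        command += ["--bind", f"{host_path}:{container_path}"]
    command.append(image)
    return command


# Hypothetical host paths; replace them with real directories on the cluster.
command = apptainer_run_command(
    "tensorflow_wandb.sif",
    [("/path/to/data", "/data"), ("/path/to/results", "/results")],
)
print(shlex.join(command))
# prints: apptainer run --nv --bind /path/to/data:/data --bind /path/to/results:/results tensorflow_wandb.sif
```

The returned list can be passed directly to subprocess.run if you prefer to launch the container from Python.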

Submitting jobs#

So far we have been running the image interactively. We might instead want to submit a job to the cluster. Say our code and files reside in a directory called project. We can submit a job to the cluster using the following job script:

#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks=1
#SBATCH --job-name ML_tf
#SBATCH -o batch_output.log
#SBATCH -e batch_error.log
#SBATCH --gres=gpu:1
#SBATCH --account=scw1901
#SBATCH --partition=accel_ai

module load apptainer/1.0.3
module load CUDA/11.2

cd /scratch/s.1915438/ # location of the apptainer image
# To run a wandb sweep (--nv enables GPU support inside the container)
apptainer exec --nv --bind /scratch/s.1915438/project/:/data tensorflow_wandb.sif /bin/bash -c "cd /data && wandb agent your_sweep_id"

# OR, instead of the line above, simply run a python file
apptainer exec --nv --bind /scratch/s.1915438/project/:/data tensorflow_wandb.sif /bin/bash -c "cd /data && python your_script.py"
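
Batch jobs are harder to debug than interactive runs, so it can help to log a little context at the top of your_script.py. This is a sketch under two assumptions: that the container inherits the host environment variables, and that Slurm sets SLURM_JOB_ID and, on clusters configured to do so for --gres=gpu jobs, CUDA_VISIBLE_DEVICES. The helper name job_context is hypothetical.

```python
import os


def job_context():
    """Collect a few details that help when reading batch_output.log later."""
    return {
        "slurm_job_id": os.environ.get("SLURM_JOB_ID", "not set"),
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES", "not set"),
        "working_directory": os.getcwd(),
    }


context = job_context()
for key, value in context.items():
    print(f"{key}: {value}")
```

If the working directory printed in the log is not /data, the cd inside the bash -c string did not take effect as expected.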

My guide on wandb.