Efficiently train model

I just realised that we have to set the LD_LIBRARY_PATH environment variable in every new shell session before we can use the tessellation geometry.
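One way to avoid repeating the export by hand is a small shell function that can be sourced from a job script or a shell startup file. This is only a minimal sketch and not part of Modulus: the function name prepend_ld_path is made up for illustration, and the library directory in the usage line is the one used in the scripts in these notes. It skips the directory if it is already on the path, so calling it twice is harmless.

```shell
# Hypothetical helper: prepend a directory to LD_LIBRARY_PATH exactly once.
prepend_ld_path() {
    _dir=$1
    case ":${LD_LIBRARY_PATH:-}:" in
        *":${_dir}:"*)
            # Already on the path; leave it unchanged.
            ;;
        *)
            # Prepend, adding a ':' separator only if the path was non-empty.
            LD_LIBRARY_PATH="${_dir}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
            ;;
    esac
    export LD_LIBRARY_PATH
}

# Usage (path taken from the scripts in these notes):
# prepend_ld_path /scratch/s.1915438/Modulus_source/Modulus/external/lib
```

The function has to be sourced into the current shell, not run as a separate script, because an export made in a child process does not reach the parent shell.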

I made a job script, submitted with sbatch, that sets the variable and starts JupyterLab on a GPU node.

#!/bin/bash
#SBATCH --nodes 1
#SBATCH --cpus-per-task 8
#SBATCH --time 06:00:00
#SBATCH --ntasks=1
#SBATCH --job-name Nvidia-modulus-jupyter-lab
#SBATCH -o /scratch/s.1915438/jupyter_log/jupyter-lab-%J.log
#SBATCH -e /scratch/s.1915438/jupyter_log/jupyter-lab-%J.log
#SBATCH --gres=gpu:1
#SBATCH --account=scw1901
#SBATCH --partition=accel_ai

# get tunneling info

port=8888
node=$(hostname -s)
user=$(whoami)


# activate the Python venv
cd /scratch/s.1915438
source modulus_pysdf/modulus_pysdf/bin/activate

# set environment variable for PySDF
cd /scratch/s.1915438/Modulus_source/Modulus/external
export LD_LIBRARY_PATH=$(pwd)/lib/:${LD_LIBRARY_PATH}

cd /scratch/s.1915438
# Run Jupyter lab
jupyter-lab --no-browser --port=${port} --ip=${node}
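The job script sets port, node and user but never prints them, so the tunnel command still has to be assembled by hand. A minimal sketch of a helper that writes the matching command to the job log is shown here. The function name print_tunnel_hint is hypothetical, the login host is the one from the ssh commands in these notes, and it assumes the compute node is reachable from the login node under its short hostname.

```shell
# Hypothetical helper: print the ssh command that forwards a local port
# to JupyterLab running on a compute node, going through the login node.
print_tunnel_hint() {
    _port=$1        # port JupyterLab listens on
    _node=$2        # short hostname of the compute node
    _user=$3        # cluster username
    _login_host=$4  # login node to tunnel through
    echo "ssh -N -L ${_port}:${_node}:${_port} ${_user}@${_login_host}"
}

# In the job script, after port, node and user are set:
# print_tunnel_hint "${port}" "${node}" "${user}" sunbird.swansea.ac.uk
```

The printed command is meant to be run on the local machine. The -N flag tells ssh to forward the port without running a remote command.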

However, a better option is to start JupyterLab on the cluster inside a Python venv of our choice and forward its port over SSH.

Create a bash file in /scratch or another location of your choice.

cd /scratch/s.1915438
source modulus_pysdf/modulus_pysdf/bin/activate

# set environment variable for PySDF
cd /scratch/s.1915438/Modulus_source/Modulus/external
export LD_LIBRARY_PATH=$(pwd)/lib/:${LD_LIBRARY_PATH}

cd /scratch/s.1915438
# Run Jupyter lab
jupyter-lab

I named it new_modulus.sh.

Before we port forward, we have to upgrade JupyterLab inside the venv.

pip install --upgrade jupyterlab

[s.1915438@sl2 ~]$ cd /scratch/s.1915438
[s.1915438@sl2 s.1915438]$ source modulus_pysdf/modulus_pysdf/bin/activate
(modulus_pysdf) [s.1915438@sl2 s.1915438]$ pip install --upgrade jupyterlab

Now, to port forward JupyterLab, we can run this on the local machine:

ssh -L 8888:localhost:8888 -t s.1915438@sunbird.swansea.ac.uk "bash /scratch/s.1915438/new_modulus.sh"

If local port 8888 is not free, we can bind a different local port. The general form is:

ssh -L local_port:destination_server_ip:sunbird_port ssh_server_hostname

ssh -L 8889:localhost:8888 -t s.1915438@sunbird.swansea.ac.uk "bash /scratch/s.1915438/new_modulus.sh"
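To make the mapping between the general form and the concrete command explicit, here is a small sketch that fills in the local and remote ports. The function name tunnel_cmd is hypothetical. The user, host and script path are the ones used above, and the remote port defaults to 8888 on the assumption that JupyterLab keeps its default port on the cluster.

```shell
# Hypothetical helper: fill in the general ssh -L form for this setup.
# $1 = local port to bind
# $2 = port JupyterLab uses on the cluster (defaults to 8888)
tunnel_cmd() {
    _local_port=$1
    _remote_port=${2:-8888}
    printf 'ssh -L %s:localhost:%s -t %s "bash %s"\n' \
        "${_local_port}" "${_remote_port}" \
        "s.1915438@sunbird.swansea.ac.uk" \
        "/scratch/s.1915438/new_modulus.sh"
}

# Print the command for local port 8889
tunnel_cmd 8889
```

Only the first number changes when the local port is busy. With local port 8889, JupyterLab is opened at localhost:8889 in the local browser while it still listens on 8888 on the cluster.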

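To train a model from the command line, request an interactive GPU allocation, open a shell on the allocated node, and run the script:

salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:1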
srun --pty bash
python FILENAME

For example, for the aneurysm example:

salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:1
srun --pty bash
python aneurysm.py

The training configuration for this run starts with:

training:
  max_steps: 1500000
  grad_agg_freq: 1
  rec_results_freq: 10000
  rec_validation_freq: ${training.rec_results_freq}
  rec_inference_freq: ${training.rec_results_freq}
  rec_monitor_freq: ${training.rec_results_freq}
  rec_constraint_freq: 50000
  save_network_freq: 1000
  print_stats_freq: 100
  summary_freq: 1000
  amp: false
  amp_dtype: float16
  ntk:
    use_ntk: false
    save_name: null
    run_freq: 1000
profiler:
  profile: false
  start_step: 0
  end_step: 10

[AND SO ON]

Also, Modulus examples with tessellation will only work if we start the training afresh.
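If "afresh" here means starting without results left over from an earlier run, one cautious way to get there is to move the old output directory aside, not delete it. This is a sketch under that assumption only: the function name archive_outputs is hypothetical, and the outputs/aneurysm path in the usage line is a guess at where the run writes its files, so check the actual location first.

```shell
# Hypothetical helper: rename an existing output directory with a timestamp
# suffix so the next run finds no earlier results in its place.
archive_outputs() {
    _out_dir=${1%/}   # drop one trailing slash, if present
    if [ -d "${_out_dir}" ]; then
        mv "${_out_dir}" "${_out_dir}.$(date +%Y%m%d-%H%M%S)"
    fi
}

# Usage (the path is an assumption, adjust to your run):
# archive_outputs outputs/aneurysm
# python aneurysm.py
```

Renaming keeps the earlier results available for comparison. If the directory does not exist, the function does nothing.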