Efficiently train model
I just realised that we have to set the LD_LIBRARY_PATH environment variable every single time we want to use the tessellation geometry.
I made an sbatch job script that sets it before starting Jupyter lab:
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --cpus-per-task 8
#SBATCH --time 06:00:00
#SBATCH --ntasks=1
#SBATCH --job-name Nvidia-modulus-jupyter-lab
#SBATCH -o /scratch/s.1915438/jupyter_log/jupyter-lab-%J.log
#SBATCH -e /scratch/s.1915438/jupyter_log/jupyter-lab-%J.log
#SBATCH --gres=gpu:1
#SBATCH --account=scw1901
#SBATCH --partition=accel_ai
# get tunneling info
port=8888
node=$(hostname -s)
user=$(whoami)
# run jupyter notebook
cd /scratch/s.1915438
source modulus_pysdf/modulus_pysdf/bin/activate
# set environment variable for PySDF
cd /scratch/s.1915438/Modulus_source/Modulus/external
export LD_LIBRARY_PATH=$(pwd)/lib/:${LD_LIBRARY_PATH}
cd /scratch/s.1915438
# Run Jupyter lab
jupyter-lab --no-browser --port=${port} --ip=${node}
However, a better option is to port forward Jupyter lab over SSH, running it in a Python venv of our choice.
Create a bash file in /scratch or a partition of your choice:
cd /scratch/s.1915438
source modulus_pysdf/modulus_pysdf/bin/activate
# set environment variable for PySDF
cd /scratch/s.1915438/Modulus_source/Modulus/external
export LD_LIBRARY_PATH=$(pwd)/lib/:${LD_LIBRARY_PATH}
cd /scratch/s.1915438
# Run Jupyter lab
jupyter-lab
I named it new_modulus.sh.
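Both scripts repeat the same cd / export / cd steps. As a sketch (not something Modulus ships), the export could be wrapped in a small function so that sourcing a script twice does not keep growing LD_LIBRARY_PATH, and so that an empty variable does not leave a trailing colon (the dynamic loader treats an empty entry as the current directory). The name prepend_ld_path is my own, and the path is the one from the scripts above.

```shell
# prepend_ld_path is a hypothetical helper name, not part of Modulus.
# It prepends a directory to LD_LIBRARY_PATH unless it is already listed.
prepend_ld_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;   # already present: leave the variable unchanged
        *) LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

# Path taken from the scripts above; adjust to your own Modulus checkout.
prepend_ld_path /scratch/s.1915438/Modulus_source/Modulus/external/lib
```

The function and the call could replace the three cd / export / cd lines in new_modulus.sh.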
Before we port forward, we have to upgrade Jupyter lab inside the venv:
pip install --upgrade jupyterlab
[s.1915438@sl2 ~]$ cd /scratch/s.1915438
[s.1915438@sl2 s.1915438]$ source modulus_pysdf/modulus_pysdf/bin/activate
(modulus_pysdf) [s.1915438@sl2 s.1915438]$ pip install --upgrade jupyterlab
Now, to port forward Jupyter lab, we can type this:
ssh -L 8888:localhost:8888 -t s.1915438@sunbird.swansea.ac.uk "bash /scratch/s.1915438/new_modulus.sh"
If the local port is not free, we can change the local port in the binding:
ssh -L local_port:destination_server_ip:sunbird_port ssh_server_hostname
ssh -L 8889:localhost:8888 -t s.1915438@sunbird.swansea.ac.uk "bash /scratch/s.1915438/new_modulus.sh"
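To make the mapping in the template concrete, here is a small sketch that only prints the command for a given local port. forward_cmd is a hypothetical helper name; the user, host, and script path are the ones used above, and the remote port defaults to 8888 on the assumption that Jupyter lab got its default port.

```shell
# forward_cmd is a hypothetical helper: it prints the ssh command instead of
# running it, so the port mapping can be checked first.
# Usage: forward_cmd LOCAL_PORT [REMOTE_PORT]
forward_cmd() (
    local_port=$1
    remote_port=${2:-8888}   # assumed default port of Jupyter lab on the login node
    printf 'ssh -L %s:localhost:%s -t %s "bash %s"\n' \
        "$local_port" "$remote_port" \
        "s.1915438@sunbird.swansea.ac.uk" \
        "/scratch/s.1915438/new_modulus.sh"
)

# Local port 8889 forwarded to port 8888 on the login node.
forward_cmd 8889
```

If Jupyter lab reports that it picked another port on the login node because 8888 was taken there, pass that port as the second argument, e.g. forward_cmd 8889 8890.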
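To train from an interactive GPU session, allocate a node, open a shell on it, and run the training script:
salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:1
srun --pty bash
python FILENAME
For example, for the aneurysm example:
salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:1
srun --pty bash
python aneurysm.py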
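The interactive steps above tie up a terminal for the whole run. A minimal sketch of the same training as a batch job is below, assuming the venv and library paths from the earlier job script. write_train_job is a hypothetical helper that only prints the script, and the 06:00:00 limit is copied from that job script rather than tuned for training.

```shell
# write_train_job is a hypothetical helper: it prints a batch script that
# runs one training file. The #SBATCH values are copied from the job script
# earlier in these notes.
# Usage: write_train_job TRAINING_FILE
write_train_job() {
    cat <<EOF
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=06:00:00
#SBATCH --gres=gpu:1
#SBATCH --account=scw1901
#SBATCH --partition=accel_ai

# activate the venv and set the library path for PySDF
source /scratch/s.1915438/modulus_pysdf/modulus_pysdf/bin/activate
export LD_LIBRARY_PATH=/scratch/s.1915438/Modulus_source/Modulus/external/lib/:\${LD_LIBRARY_PATH}

python $1
EOF
}

write_train_job aneurysm.py
```

Redirect the output to a file, e.g. write_train_job aneurysm.py > train.sh, and submit it with sbatch train.sh from the directory that contains aneurysm.py, since Slurm starts the job in the submission directory by default.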
training:
  max_steps: 1500000
  grad_agg_freq: 1
  rec_results_freq: 10000
  rec_validation_freq: ${training.rec_results_freq}
  rec_inference_freq: ${training.rec_results_freq}
  rec_monitor_freq: ${training.rec_results_freq}
  rec_constraint_freq: 50000
  save_network_freq: 1000
  print_stats_freq: 100
  summary_freq: 1000
  amp: false
  amp_dtype: float16
  ntk:
    use_ntk: false
    save_name: null
    run_freq: 1000
profiler:
  profile: false
  start_step: 0
  end_step: 10
[AND SO ON]
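Modulus reads this config through Hydra, so, as far as I understand, individual keys can be overridden on the command line as key=value instead of editing the YAML. The sketch below only prints such a command (train_cmd is a hypothetical helper and the values are illustrative); the key names are taken from the listing above.

```shell
# train_cmd is a hypothetical helper: it prints the training command with
# Hydra-style key=value overrides appended. Nothing is launched.
# Usage: train_cmd SCRIPT [key=value ...]
train_cmd() (
    script=$1
    shift
    cmd="python $script"
    for override in "$@"; do
        case $override in
            *=*) cmd="$cmd $override" ;;
            *)   echo "not a key=value override: $override" >&2
                 exit 1 ;;   # leaves the subshell, so the function returns 1
        esac
    done
    echo "$cmd"
)

# Illustrative values for a shorter trial run.
train_cmd aneurysm.py training.max_steps=20000 training.rec_results_freq=1000
```

Running the printed command inside the srun shell would start a shorter trial run that records results more often.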
Also, Modulus examples with tessellation will only work if we start the training afresh.