Install and use TensorFlow on GPU nodes

The Cholesky cluster has two GPU nodes, each equipped with 4 Nvidia Tesla V100 graphics cards. The corresponding computation queue is the gpu queue.
You can simply install the TensorFlow environment you need using Anaconda.

Installation of TensorFlow

The installation must be done from a GPU node, interactively, with CUDA loaded, so that the graphics cards are correctly detected.

Get an interactive shell on a GPU node via Slurm:

$ srun --nodes=1  --gres=gpu:1 --partition=gpu --time=01:30:00 --pty bash -i

Load the CUDA and Anaconda modules, create a dedicated Conda environment, then install TensorFlow:

$ module load anaconda3/2020.11 cuda/10.2
(base) $ conda create -n tf-gpu
(base) $ conda activate tf-gpu
(tf-gpu) $ conda install tensorflow-gpu

To install a specific version of TensorFlow:

(tf-gpu) $ conda search tensorflow-gpu
(tf-gpu) $ conda install tensorflow-gpu==2.1.0
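
Before logging out, you can quickly check that the installation is functional from the same interactive session. This one-liner is a minimal sketch and assumes the tf-gpu environment is still active; the exact version printed depends on what Conda installed:

(tf-gpu) $ python3 -c "import tensorflow as tf; print(tf.__version__, tf.config.list_physical_devices('GPU'))"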

Then you can log out from the GPU node.

Use TensorFlow

We can run a Python script using TensorFlow that detects the number of available GPUs:

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Slurm script:

#!/bin/sh

#SBATCH --job-name=gpu-job
#SBATCH --time=120  # max 120 minutes
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2  # number of GPUs to be used

module load anaconda3/2020.11 cuda/10.2

conda activate tf-gpu

python3 $HOME/python/tensorflow/gpu-available.py

$ sbatch tf-job.slurm
Submitted batch job 1249
$ cat slurm-1249.out
[...]
pciBusID: 0000:1a:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-03-29 15:08:19.205278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: 
pciBusID: 0000:1c:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
[...]
Num GPUs Available:  2
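
By default, TensorFlow reserves most of the memory of every visible GPU as soon as it starts. If your job shares a node with other processes, or you simply want allocations to grow on demand, you can enable memory growth before any GPU work begins. This is a standard TensorFlow option rather than a cluster requirement; a minimal sketch:

import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of reserving it all upfront.
# This must be called before any GPU has been initialized.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))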