G4dn

Community Images

To run this benchmark, we fetch community images from the public ECR gallery at gallery.ecr.aws/hpc.

We’ll use Thread-MPI images with baked-in settings for how many OpenMP threads to spawn.
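
For orientation: the decomposition a tag bakes in maps to plain mdrun flags. A minimal sketch of what such an entrypoint roughly runs (the exact binary path and extra flags inside the image may differ):

# Assumed shape of the baked-in launch, not the image's literal entrypoint.
# -ntmpi 32 : spawn 32 thread-MPI ranks (threads standing in for MPI ranks)
# -ntomp 1  : run 1 OpenMP thread per rank
gmx mdrun -ntmpi 32 -ntomp 1 -s ${INPUT}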

Two images with different tags:

  1. g4dn_8xl_on: built for a g4dn.8xlarge with hyperthreading on. This image will use 1 OpenMP thread per rank.
sarus pull public.ecr.aws/hpc/spack/gromacs/2021.1/cuda-tmpi:g4dn_8xl_on_2021-04-29
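
After the pull, a quick check that the image landed in the local Sarus repository:

# List locally available images; the repository and tag above should show up.
sarus images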

32 Ranks

Our first job is equivalent to a run with 32 ranks and 1 OpenMP thread each, even though Thread-MPI uses process threads to mimic MPI ranks.

cat > gromacs-single-node-sarus-g4dn-cuda-tmpi-32x1.sbatch << \EOF
#!/bin/bash
#SBATCH --job-name=gromacs-single-node-sarus-g4dn-cuda-tmpi-32x1
#SBATCH --exclusive
#SBATCH --output=/fsx/logs/%x_%j.out
#SBATCH --partition=g4dn

mkdir -p /fsx/jobs/${SLURM_JOBID}
# Input run file; Sarus propagates the host environment, so the container's entrypoint can pick it up.
export INPUT=/fsx/input/gromacs/benchRIB.tpr
# Expose the GPU(s) to the container.
export CUDA_VISIBLE_DEVICES=all
sarus run --workdir=/fsx/jobs/${SLURM_JOBID} public.ecr.aws/hpc/spack/gromacs/2021.1/cuda-tmpi:g4dn_8xl_on_2021-04-29
EOF
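
The entrypoint carries the baked-in 32x1 decomposition. If you want to try a different split with the same image, Sarus accepts an explicit command after the image name; a sketch, assuming gmx is on the container's PATH:

# Hypothetical override: 16 thread-MPI ranks with 2 OpenMP threads each.
sarus run --workdir=/fsx/jobs/${SLURM_JOBID} \
  public.ecr.aws/hpc/spack/gromacs/2021.1/cuda-tmpi:g4dn_8xl_on_2021-04-29 \
  gmx mdrun -ntmpi 16 -ntomp 2 -s ${INPUT}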

Let’s submit two of those jobs.

sbatch -N1 gromacs-single-node-sarus-g4dn-cuda-tmpi-32x1.sbatch
sbatch -N1 gromacs-single-node-sarus-g4dn-cuda-tmpi-32x1.sbatch
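
While they run, keep an eye on the queue:

# Show pending and running jobs in the g4dn partition.
squeue -p g4dn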

Results

Once both runs have finished, we grep the logs for the performance results.

grep -B2 Performance /fsx/logs/gromacs-single-node-sarus-g4dn-cuda-tmpi-*
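
To reduce the same logs to one number per job, a small awk one-liner works as well (filenames follow the %x_%j.out pattern from the sbatch header):

# Print "<logfile>: <ns/day> ns/day" per run; assumes the usual GROMACS
# "Performance: <ns/day> <hour/ns>" output line.
grep -H Performance /fsx/logs/gromacs-single-node-sarus-g4dn-cuda-tmpi-* \
  | awk -F: '{split($3, v, " "); printf "%s: %s ns/day\n", $1, v[1]}'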

This extends the performance table we started in the gromacs-on-pcluster workshop with the rank/thread decomposition runs.

 #  execution  spec                                             instance  Ranks x Threads  ns/day
 1  native     gromacs@2021.1                                   c5n.18xl  18 x 4           4.7
 2  native     gromacs@2021.1                                   c5n.18xl  36 x 2           5.3
 3  native     gromacs@2021.1                                   c5n.18xl  72 x 1           5.5
 4  native     gromacs@2021.1 ^intel-mkl                        c5n.18xl  36 x 2           5.4
 5  native     gromacs@2021.1 ^intel-mkl                        c5n.18xl  72 x 1           5.5
 6  native     gromacs@2021.1 ~mpi                              c5n.18xl  36 x 2           5.5
 7  native     gromacs@2021.1 ~mpi                              c5n.18xl  72 x 1           5.7
 8  native     gromacs@2021.1 +cuda ~mpi                        g4dn.8xl  1 x 32           6.3
 9  sarus      gromacs@2021.1 ~mpi                              c5n.18xl  36 x 2           5.5
10  sarus      gromacs@2021.1 ~mpi                              c5n.18xl  72 x 1           5.7
11  sarus      gromacs@2021.1 +cuda ~mpi ^fftw precision=float  g4dn.8xl  1 x 32           6.3