Hardware Architecture
The code was run on Harvard’s Odyssey cluster, a large-scale, heterogeneous high-performance computing cluster operated by Harvard University. Refer to the FAS RC website for more information.
Hardware Specification
To replicate the benchmarking results, please use the following specifications.
- Partition: Shared
- CPUs/Node: 2 (Intel Xeon Broadwell)
- Cores/Node: 32
- CPU clock speed: 2095 MHz
- CPU cache: L1 32K, L2 256K, L3 40960K
- Memory/Node: 128GB
- Network: FDR InfiniBand
- Nodes Used: 1-8
Software
To replicate the benchmarking results, please make sure that, at a minimum, the software dependencies below are available.
Dependencies
Main Programming Language: Fortran90
Secondary Programming Language: Python 2.7.x or 3.6.x (used for I/O, plotting and automating bash scripts only)
Parallel Implementation:
Odyssey modules: intel/17.0.2-fasrc01 openmpi/2.1.0-fasrc01 fftw/3.3.5-fasrc01
Serial Implementation:
Odyssey modules: gcc/4.8.2-fasrc01 fftw/3.3.5-fasrc01
Libraries used:
- Open MPI 2.1.0
- FFTW 3.3.5
- BLAS and LAPACK
Compiler
Parallel Implementation:
- Intel Fortran compiler (`mpifort`) 17.0.2
Serial Implementation:
- GCC compiler (`gfortran`) 4.8.2
Installation
- Log into Odyssey (e.g., via SSH)
ssh username@login.rc.fas.harvard.edu
- Load required modules
module purge
module load intel/17.0.2-fasrc01
module load openmpi/2.1.0-fasrc01
module load fftw/3.3.5-fasrc01
- Clone GitHub repository
git clone https://github.com/toledy/ParallelRayleighBenardConvection.git
Note: To clone a GitHub repository on Odyssey, it is strongly encouraged to use the SSH protocol together with a personal Odyssey public SSH key added to your GitHub account. Instructions on how to do this are available here.
- Update the path variables in `global.f90` in the src directory. (Note that the length argument needs to be updated to match the length of the input string.)
vim global.f90
character(len=84) :: vtkloc="PATH"
character(len=82) :: floc="PATH"
- Compile code in src directory using make
make all
- Specify simulation parameters in `input.data` as required
vim input.data
- Ensure the vtkdata and fdata directories are empty if exporting output
- Edit the SLURM sbatch configuration in `sbatchmpi.run`
vim sbatchmpi.run
- Run SLURM sbatch command
sbatch sbatchmpi.run
- Monitor progress of the SLURM run using `sacct`

Summary information will be output to `Time_loop.out` and any error logs will be exported to `Time_loop.err`. Relevant output (VTK files and Nusselt numbers at different time steps) is exported to the vtkdata and fdata directories, respectively, as specified in `global.f90`.
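If you prefer to check job status from Python (for example, from the automation script described under Performance Analysis below), a minimal sketch using `sacct` could look like the following. The job ID and the chosen `--format` fields are only examples, not part of the repository.

```python
import subprocess

def job_state(job_id):
    """Query SLURM accounting for a job's state (e.g. RUNNING, COMPLETED)."""
    out = subprocess.check_output(
        ["sacct", "-j", str(job_id), "--format=JobID,State,Elapsed", "--noheader"],
        universal_newlines=True,
    )
    return out.strip()

# Example with a hypothetical job id:
# print(job_state(12345678))
```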
Input File Structure
Line # | Column 1 | Column 2 | Column 3 |
---|---|---|---|
1 | $Pr$ | Initial $\alpha$ | $\alpha$ step |
2 | Initial $Ra$* | # of $Ra$s to do | $Ra$ increment multiplier |
3 | Time* | Time step* | Leave blank |
4 | $y$ at bottom | $y$ at top | Leave blank |
5 | Nx* | Ny* | Nz |
6 | x-refinement | y-refinement | z-refinement |
7 | Save to VTK* | Save field output* | Leave blank |
We retained the same input file structure used by the serial code. Given that the code has been modified to act only as a time integrator, and the flowmap algorithm that finds steady solutions to the 2D Boussinesq equations has been removed, not all of the input parameters are relevant for the parallel code. The `input.data` file in the GitHub repository contains the recommended default parameter settings.
*Parameters that we modified for testing purposes.
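For reference, the sketch below writes an `input.data` file following the seven-line layout above. Every numerical value is an illustrative placeholder rather than a recommended default, and the whitespace-separated format is an assumption; use the `input.data` shipped with the repository for actual runs.

```python
# Illustrative only: placeholder values, not the repository defaults, and the
# whitespace-separated layout is an assumption about what the Fortran reader expects.
rows = [
    (1.0,    0.5,  0.1),   # line 1: Pr, initial alpha, alpha step
    (5000.0, 1,    2.0),   # line 2: initial Ra, # of Ras to do, Ra increment multiplier
    (10.0,   1e-3, ""),    # line 3: time, time step
    (-1.0,   1.0,  ""),    # line 4: y at bottom, y at top
    (128,    128,  1),     # line 5: Nx, Ny, Nz
    (1.0,    1.0,  1.0),   # line 6: x-, y-, z-refinement
    (1,      1,    ""),    # line 7: save to VTK, save field output
]
with open("input.data", "w") as f:
    for row in rows:
        f.write(" ".join(str(v) for v in row if v != "") + "\n")
```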
I/O
Separate files for each processor and timestep interval will be exported to the fdata directory, representing the field solutions for $T$ and $u_y$ at the specified timestep. The files can be combined using the `CombineFields.py` script in the scripts directory.
cd scripts
python CombineFields.py
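Conceptually, combining the per-processor output amounts to stitching each rank's slab of the field back into one global array. The sketch below only illustrates that idea under assumed conventions: the file naming pattern and the plain-text slab format are hypothetical, so use the repository's `CombineFields.py` for real data.

```python
import glob
import re
import numpy as np

def combine_field(prefix, step):
    """Illustrative only: stack per-rank slabs into one global field.
    The naming pattern fdata/<prefix>_rank<r>_step<step>.dat and the text format
    are hypothetical; CombineFields.py implements the solver's actual convention."""
    files = glob.glob("fdata/{}_rank*_step{}.dat".format(prefix, step))
    # Sort numerically by rank so that rank10 does not sort before rank2
    files.sort(key=lambda f: int(re.search(r"rank(\d+)", f).group(1)))
    slabs = [np.loadtxt(f) for f in files]
    return np.concatenate(slabs, axis=0)  # stack along the decomposed direction

# Example (hypothetical): T_full = combine_field("T", 1000)
```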
Performance Analysis
Given the extensive set of benchmarks run by varying (i) the domain size Nx/Ny, (ii) the number of nodes (`-N`), (iii) the number of processors (`-n`) and (iv) the number of threads (`-c`), a Python script was written to automate the bash/SLURM process. In pseudocode, the script roughly followed the approach below.
import os
import sys
import time

num_nodes_to_try = []      # list of node counts to test (-N)
num_processes_to_try = []  # list of MPI process counts to test (-n)
num_threads_to_try = []    # list of thread counts to test (-c)
nx_to_try = []             # list of Nx domain sizes to test
ny_to_try = []             # list of Ny domain sizes to test
partition = "shared"

for N in num_nodes_to_try:
    for n in num_processes_to_try:
        for c in num_threads_to_try:
            for nx in nx_to_try:
                for ny in ny_to_try:
                    # 1. Open the input file
                    # 2. Check whether this benchmark has been run before
                    # 3. If not run before, set up the appropriate SBATCH.run script
                    # 4. Submit it: sbatch SBATCH.run
                    # 5. Save the data into the relevant directories
                    pass
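As a hedged illustration of steps 3 and 4 in the loop body, the helper below writes a minimal SLURM script for one parameter combination and submits it with `sbatch`. It assumes Python 3.6; the walltime, output file names and executable name (`./convection`) are hypothetical placeholders, and the repository's `sbatchmpi.run` should be used as the real template.

```python
import subprocess
from pathlib import Path

def submit_benchmark(N, n, c, nx, ny, partition="shared"):
    """Write a SLURM script for one (N, n, c, nx, ny) combination and submit it."""
    # NOTE: writing nx, ny into line 5 of input.data is omitted here for brevity.
    tag = "N{}_n{}_c{}_nx{}_ny{}".format(N, n, c, nx, ny)
    script = Path("SBATCH_{}.run".format(tag))
    script.write_text("""#!/bin/bash
#SBATCH -p {partition}
#SBATCH -N {N}
#SBATCH -n {n}
#SBATCH -c {c}
#SBATCH -t 08:00:00
#SBATCH -o bench_{tag}.out
#SBATCH -e bench_{tag}.err

module purge
module load intel/17.0.2-fasrc01 openmpi/2.1.0-fasrc01 fftw/3.3.5-fasrc01

mpirun -np {n} ./convection   # hypothetical executable name
""".format(partition=partition, N=N, n=n, c=c, tag=tag))
    # Submit and return sbatch's reply (which contains the job id)
    out = subprocess.check_output(["sbatch", str(script)], universal_newlines=True)
    return out.strip()
```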