Hardware Architecture
The code was run on Harvard’s Odyssey cluster, a large-scale, heterogeneous high-performance computing cluster operated by Harvard University. Refer to the FAS RC website for more information.
Hardware Specification
To replicate the benchmarking results, please use the following specifications.
- Partition: Shared
- CPUs/Node: 2 (Intel Xeon Broadwell)
- Cores/Node: 32
- CPU clock speed: 2095 MHz
- CPU cache: L1 32K, L2 256K, L3 40960K
- Memory/Node: 128GB
- Network: FDR InfiniBand
- Nodes Used: 1-8
Software
To replicate the benchmarking results, please make sure that, at a minimum, the software dependencies below are available.
Dependencies
Main Programming Language: Fortran90
Secondary Programming Language: Python 2.7.x or 3.6.x (used for I/O, plotting and automating bash scripts only)
Parallel Implementation:
Odyssey modules: intel/17.0.2-fasrc01 openmpi/2.1.0-fasrc01 fftw/3.3.5-fasrc01
Serial Implementation:
Odyssey modules: gcc/4.8.2-fasrc01 fftw/3.3.5-fasrc01
Libraries used:
- Open MPI 2.1.0
- FFTW 3.3.5
- BLAS and LAPACK
Compiler
Parallel Implementation:
- Intel Fortran compiler (`mpifort`) 17.0.2
Serial Implementation:
- GCC compiler (`gfortran`) 4.8.2
Installation
- Log into Odyssey (e.g., via SSH)
ssh username@login.rc.fas.harvard.edu
- Load required modules
module purge
module load intel/17.0.2-fasrc01
module load openmpi/2.1.0-fasrc01
module load fftw/3.3.5-fasrc01
- Clone GitHub repository
git clone https://github.com/toledy/ParallelRayleighBenardConvection.git
Note: To clone a GitHub repository on Odyssey, it is strongly encouraged to use the SSH protocol together with a personal Odyssey public SSH key added to your GitHub account. Instructions on how to do this are available here.
- Update the path variables in `global.f90` in the src directory. (Note that the length argument needs to be updated to match the length of the input string.)
vim global.f90
character(len=84) :: vtkloc="PATH"
character(len=82) :: floc="PATH"
- Compile code in src directory using make
make all
- Specify simulation parameters in `input.data` as required
vim input.data
- Ensure the vtkdata and fdata directories are empty if exporting output
- Edit the SLURM sbatch configuration in `sbatchmpi.run`
vim sbatchmpi.run
- Run SLURM sbatch command
sbatch sbatchmpi.run
- Monitor progress of the SLURM run using `sacct`

Summary information will be output to `Time_loop.out` and any error logs will be exported to `Time_loop.err`. Relevant output (VTK files and Nusselt numbers at different time steps) is exported to the vtkdata and fdata directories, respectively, as specified in `global.f90`.
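If you prefer to check job status from Python (for example, from the automation script described under Performance Analysis below), a minimal sketch using `sacct` could look like the following. The job ID and the chosen `--format` fields are only examples, not part of the repository.

```python
import subprocess

def job_state(job_id):
    """Query SLURM accounting for a job's state (e.g. RUNNING, COMPLETED)."""
    out = subprocess.check_output(
        ["sacct", "-j", str(job_id), "--format=JobID,State,Elapsed", "--noheader"],
        universal_newlines=True,
    )
    return out.strip()

# Example with a hypothetical job id:
# print(job_state(12345678))
```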
Input File Structure
Line # | Column 1 | Column 2 | Column 3 |
---|---|---|---|
1 | $Pr$ | Initial $\alpha$ | $\alpha$ step |
2 | Initial $Ra$* | # of $Ra$s to do | $Ra$ increment multiplier |
3 | Time* | Time step* | Leave blank |
4 | $y$ at bottom | $y$ at top | Leave blank |
5 | Nx* | Ny* | Nz |
6 | x-refinement | y-refinement | z-refinement |
7 | Save to VTK* | Save field output* | Leave blank |
We retained the same input file structure used by the serial code. Given that the code has been modified to act only as a time integrator, and the flowmap algorithm that finds steady solutions to the 2D Boussinesq equations has been removed, not all of the input parameters are relevant for the parallel code. The `input.data` file in the GitHub repository contains the recommended default parameter settings.
*Parameters that we modified for testing purposes.
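For reference, the sketch below writes an `input.data` file following the seven-line layout above. Every numerical value is an illustrative placeholder rather than a recommended default, and the whitespace-separated format is an assumption; use the `input.data` shipped with the repository for actual runs.

```python
# Illustrative only: placeholder values, not the repository defaults, and the
# whitespace-separated layout is an assumption about what the Fortran reader expects.
rows = [
    (1.0,    0.5,  0.1),   # line 1: Pr, initial alpha, alpha step
    (5000.0, 1,    2.0),   # line 2: initial Ra, # of Ras to do, Ra increment multiplier
    (10.0,   1e-3, ""),    # line 3: time, time step
    (-1.0,   1.0,  ""),    # line 4: y at bottom, y at top
    (128,    128,  1),     # line 5: Nx, Ny, Nz
    (1.0,    1.0,  1.0),   # line 6: x-, y-, z-refinement
    (1,      1,    ""),    # line 7: save to VTK, save field output
]
with open("input.data", "w") as f:
    for row in rows:
        f.write(" ".join(str(v) for v in row if v != "") + "\n")
```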
I/O
Separate files for each processor and timestep interval will be exported to the fdata directory, representing the field solutions for $T$ and $u_y$ at the specified timestep. The files can be combined using the `CombineFields.py` script in the scripts directory.
cd scripts
python CombineFields.py
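Conceptually, combining the per-processor output amounts to stitching each rank's slab of the field back into one global array. The sketch below only illustrates that idea under assumed conventions: the file naming pattern and the plain-text slab format are hypothetical, so use the repository's `CombineFields.py` for real data.

```python
import glob
import re
import numpy as np

def combine_field(prefix, step):
    """Illustrative only: stack per-rank slabs into one global field.
    The naming pattern fdata/<prefix>_rank<r>_step<step>.dat and the text format
    are hypothetical; CombineFields.py implements the solver's actual convention."""
    files = glob.glob("fdata/{}_rank*_step{}.dat".format(prefix, step))
    # Sort numerically by rank so that rank10 does not sort before rank2
    files.sort(key=lambda f: int(re.search(r"rank(\d+)", f).group(1)))
    slabs = [np.loadtxt(f) for f in files]
    return np.concatenate(slabs, axis=0)  # stack along the decomposed direction

# Example (hypothetical): T_full = combine_field("T", 1000)
```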
Performance Analysis
Given the extensive set of benchmarks run by varying (i) the domain size Nx/Ny, (ii) the number of nodes (`-N`), (iii) the number of processors (`-n`) and (iv) the number of threads (`-c`), a Python script was written to automate the bash/SLURM process. In pseudocode, the script roughly followed the approach below.
import os
import sys
import time

num_nodes_to_try = []      # list of node counts to test (-N)
num_processes_to_try = []  # list of MPI process counts to test (-n)
num_threads_to_try = []    # list of thread counts to test (-c)
nx_to_try = []             # list of Nx domain sizes to test
ny_to_try = []             # list of Ny domain sizes to test
partition = "shared"

for N in num_nodes_to_try:
    for n in num_processes_to_try:
        for c in num_threads_to_try:
            for nx in nx_to_try:
                for ny in ny_to_try:
                    # 1. Open the input file
                    # 2. Check whether this benchmark has been run before
                    # 3. If not run before, set up the appropriate SBATCH.run script
                    # 4. Submit it: sbatch SBATCH.run
                    # 5. Save the data into the relevant directories
                    pass
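As a hedged illustration of steps 3 and 4 in the loop body, the helper below writes a minimal SLURM script for one parameter combination and submits it with `sbatch`. It assumes Python 3.6; the walltime, output file names and executable name (`./convection`) are hypothetical placeholders, and the repository's `sbatchmpi.run` should be used as the real template.

```python
import subprocess
from pathlib import Path

def submit_benchmark(N, n, c, nx, ny, partition="shared"):
    """Write a SLURM script for one (N, n, c, nx, ny) combination and submit it."""
    # NOTE: writing nx, ny into line 5 of input.data is omitted here for brevity.
    tag = "N{}_n{}_c{}_nx{}_ny{}".format(N, n, c, nx, ny)
    script = Path("SBATCH_{}.run".format(tag))
    script.write_text("""#!/bin/bash
#SBATCH -p {partition}
#SBATCH -N {N}
#SBATCH -n {n}
#SBATCH -c {c}
#SBATCH -t 08:00:00
#SBATCH -o bench_{tag}.out
#SBATCH -e bench_{tag}.err

module purge
module load intel/17.0.2-fasrc01 openmpi/2.1.0-fasrc01 fftw/3.3.5-fasrc01

mpirun -np {n} ./convection   # hypothetical executable name
""".format(partition=partition, N=N, n=n, c=c, tag=tag))
    # Submit and return sbatch's reply (which contains the job id)
    out = subprocess.check_output(["sbatch", str(script)], universal_newlines=True)
    return out.strip()
```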