Discussion

Speed-Up

We have shown a meaningful speed-up by converting the serial code to a parallel MPI version that uses a pencil decomposition in the horizontal direction. We have also shown that, in addition to MPI in the horizontal direction, parallelizing the vertical direction with OpenMP yields even better performance. However, the speed-up (as is typically the case) depends on the problem size, the number of nodes and the number of processors used.
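To make the hybrid structure concrete, the sketch below shows the general pattern: MPI is initialized with funneled thread support, each rank owns a horizontal pencil, and OpenMP threads share the vertical work within that pencil. All names and sizes (local_ny, nz, field) are illustrative placeholders rather than identifiers from the actual solver.

```fortran
program hybrid_sketch
  use mpi
  implicit none
  integer, parameter :: local_ny = 64, nz = 128   ! assumed local pencil sizes
  integer :: ierr, provided, rank, nprocs, j, k
  real(8) :: field(nz, local_ny)

  ! Only the master thread makes MPI calls, so FUNNELED support is enough;
  ! OpenMP threads share the work inside each MPI rank.
  call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  field = 0.0d0

  ! Each rank owns a horizontal pencil (local_ny columns); OpenMP threads
  ! split the vertical sweep within that pencil.
  !$omp parallel do private(k)
  do j = 1, local_ny
     do k = 1, nz
        field(k, j) = field(k, j) + 1.0d0   ! stand-in for the vertical update
     end do
  end do
  !$omp end parallel do

  call MPI_Finalize(ierr)
end program hybrid_sketch
```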

We show that using 8 nodes (and the optimal number of processors per node), we achieve approximately 7x speed-up with MPI alone and approximately 10x when MPI is combined with OpenMP. If this scaling held across all nodes, the speed-up would grow roughly in proportion to the number of nodes (N) at the optimal number of processors (n). That said, the speed-up is limited by the different discretizations in the x and y directions. We would expect a greater speed-up if the y-domain were discretized with Fourier transforms rather than a finite-difference stencil, since this would allow us to use FFTW with MPI in both dimensions, as sketched below.
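As an indication of what that fully spectral alternative would involve, the sketch below (closely following the Fortran MPI example in the FFTW documentation) creates a single 2-D transform that FFTW distributes across ranks, so both directions are handled spectrally. The grid sizes and variable names are illustrative only, not the parameters used in our runs.

```fortran
program fftw_mpi_2d_sketch
  use, intrinsic :: iso_c_binding
  use mpi
  implicit none
  include 'fftw3-mpi.f03'
  integer(C_INTPTR_T), parameter :: Nx = 1024, Ny = 1024   ! illustrative sizes
  type(C_PTR) :: plan, cdata
  complex(C_DOUBLE_COMPLEX), pointer :: u(:,:)
  integer(C_INTPTR_T) :: alloc_local, local_Ny, local_y_offset
  integer :: ierr

  call MPI_Init(ierr)
  call fftw_mpi_init()

  ! FFTW decides how many y-rows this rank owns (note the dimension
  ! reversal: the first argument is the slowest, distributed dimension).
  alloc_local = fftw_mpi_local_size_2d(Ny, Nx, MPI_COMM_WORLD, &
                                       local_Ny, local_y_offset)
  cdata = fftw_alloc_complex(alloc_local)
  call c_f_pointer(cdata, u, [Nx, local_Ny])

  ! In-place forward DFT over both dimensions, distributed across ranks.
  plan = fftw_mpi_plan_dft_2d(Ny, Nx, u, u, MPI_COMM_WORLD, &
                              FFTW_FORWARD, FFTW_MEASURE)

  u = (0.0d0, 0.0d0)                        ! fill with field data in practice
  call fftw_mpi_execute_dft(plan, u, u)

  call fftw_destroy_plan(plan)
  call fftw_free(cdata)
  call fftw_mpi_cleanup()
  call MPI_Finalize(ierr)
end program fftw_mpi_2d_sketch
```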

Challenges

The main challenges we faced in developing a working and accurate parallel version of the serial code are outlined below:

  1. Developing an understanding of the various data types and initialization required by the FFTW MPI library
  2. Handling the local domain assignments correctly to ensure each processor had the right part of the domain (e.g. the kx modes in Fourier space), which involved line-by-line analysis of the entire code (see the sketch after this list)
  3. Working within a complex code base of thousands of lines of Fortran code and developing an adequate understanding of the underlying mathematics to inform the application design
  4. Validating the correctness of the parallel output and painstakingly debugging the initial discrepancies between the two versions to identify their root cause
  5. Securing the necessary compute resources on Odyssey (for example, it was very difficult to be allocated 8 entire nodes, i.e. 32 processors per node)
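The sketch below illustrates the kind of bookkeeping behind items 1, 2 and 4: each rank determines its block of global kx indices, converts them to wavenumbers (positive modes first, then the wrapped negative modes), and a reduction confirms that the ranks together cover every mode exactly once. The block decomposition shown is a simplified stand-in; in the solver the slab sizes come from FFTW's MPI interface (fftw_mpi_local_size_*), and all names here are hypothetical.

```fortran
program kx_decomposition_check
  use mpi
  implicit none
  integer, parameter :: Nkx = 64            ! illustrative number of kx modes
  integer :: ierr, rank, nprocs
  integer :: base, rem, local_n, offset, i, iglobal
  integer :: local_count, total_count
  real(8) :: kx

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Simple block decomposition of the Nkx modes across ranks (stand-in for
  ! the slab sizes returned by fftw_mpi_local_size_* in the real code).
  base    = Nkx / nprocs
  rem     = mod(Nkx, nprocs)
  local_n = base + merge(1, 0, rank < rem)
  offset  = rank*base + min(rank, rem)

  local_count = 0
  do i = 1, local_n
     iglobal = offset + i - 1               ! 0-based global mode index
     if (iglobal <= Nkx/2) then
        kx = real(iglobal, 8)               ! modes 0 .. Nkx/2
     else
        kx = real(iglobal - Nkx, 8)         ! wrapped negative modes
     end if
     local_count = local_count + 1
  end do
  print *, 'rank', rank, 'owns', local_count, 'kx modes starting at', offset

  ! Every mode should be owned by exactly one rank.
  call MPI_Allreduce(local_count, total_count, 1, MPI_INTEGER, MPI_SUM, &
                     MPI_COMM_WORLD, ierr)
  if (rank == 0 .and. total_count /= Nkx) then
     print *, 'decomposition error: covered', total_count, 'of', Nkx, 'modes'
  end if

  call MPI_Finalize(ierr)
end program kx_decomposition_check
```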

Conclusions & Next Steps

We successfully implemented and validated a parallel version of Sondak et al.'s Rayleigh–Bénard convection solver, using both MPI alone and hybrid MPI/OpenMP, across single and multiple nodes and varying numbers of processors.

Using 8 nodes, Nx=65,536, Ra=5000 and 32 processors, we achieve a speed-up of approximately 7x with MPI alone and approximately 10x with MPI/OpenMP compared to the serial version with the same parameters. We also show that optimal performance depends significantly on the number of nodes and processors used, as well as on the size of the domain: larger domains tend to have a larger optimal processor count, beyond which the communication overhead becomes too large.

This analysis forms a strong foundation for extending the problem to 3-D, provided the discretizations are consistent across all dimensions (e.g., finite differences or FFTs throughout). As such, it could lend itself to interesting research and further work in this context.

Acknowledgements

We would like to sincerely thank David Sondak for providing us with a working serial version of the code as well as for his support and encouragement throughout the design and implementation process. We would also like to extend our thanks to Ignacio Llorente for his support during lab sessions. Finally, we would like to thank the FAS Research Computing members for helping us navigate SLURM jobs on Odyssey.