Tuning WRF performance for HPC cloud: Broadwell and Omni-Path
The Penguin Computing® On-Demand™ (POD™) public HPC cloud combines non-virtualized, bare-metal compute nodes, a low-latency network, and fast storage with an optimized software stack and end-user applications specifically tuned for the hardware platform.
Our latest expansion, the MT2 cluster, is powered by the B30 class of compute nodes, featuring dual Intel® Xeon® E5-2680 v4 Broadwell processors with 28 non-hyperthreaded cores per node, 256 GB of DDR4 RAM, and an Intel® Omni-Path Architecture (Intel® OPA) low-latency, non-blocking, 100 Gb/s fabric. High-speed storage is provided by a Lustre® parallel file system, delivered through the Penguin Computing FrostByte™ storage solution.
The Weather Research and Forecasting (WRF) model, a mesoscale numerical weather prediction system for atmospheric research and weather forecasting, is a widely used application in the POD cloud. In this post, we describe how we tuned the performance of WRF for the B30 cluster.
For this work, we used version 3.8.1 of the WRF code, built with the Intel® Parallel Studio XE 2017 suite of compilers, using the Intel MPI message passing library for distributed memory parallelism (dmp) and OpenMP for shared memory parallelism (smp). The same compilers and MPI libraries were used to build the external libraries needed by WRF, such as NetCDF. We built a hybrid dmp + smp binary that can leverage both the distributed and shared memory models. Testing and tuning were done with the 4dbasic WRF benchmark, a test case designed to represent a typical end-user scenario for regional weather modelling. Jobs were run on 5 of the B30 compute nodes, for a total of 140 cores.
In a parallel run, WRF uses domain decomposition to divide the computational region into tasks, with one task assigned to each MPI rank of the job. Each task can be further divided into tiles; by default the number of tiles is 1. Because WRF is sensitive to memory bandwidth, it is very important for efficiency that a tile be small enough to fit inside the processor cache. If the default tile is too large, the number of tiles can be increased (and thus the size of each tile reduced) by defining the numtiles input parameter or by setting the WRF_NUM_TILES environment variable. The optimal number of tiles depends on the characteristics of the model and on the hardware. For the present case we found the optimum to be 16, so all the jobs reported here were run with WRF_NUM_TILES=16.
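As a minimal sketch, the tile count can be set in either of the two ways mentioned above (the namelist alternative is shown as a comment):

```shell
# Increase the number of tiles per MPI task from the default of 1 to 16,
# so that each tile fits in the processor cache:
export WRF_NUM_TILES=16
# Equivalently, in the &domains section of namelist.input:
#   numtiles = 16
echo "$WRF_NUM_TILES"
```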
We started our tests by running a pure MPI job, i.e., starting 140 MPI ranks over five B30 nodes with a single thread each. We used the default Intel MPI settings for process pinning, which pin each MPI rank to a single CPU core. This job completed in 57 minutes, which establishes the baseline for our study and also provides a useful point of comparison with the previous generation of POD hardware: the T30 cluster, which features dual Intel® Xeon® E5-2600 v3 Haswell processors with 20 cores per node, 128 GB RAM, an Intel® QDR InfiniBand 40 Gb/s interconnect, and a 10 GigE data network. An equivalent job on the T30 cluster (140 MPI ranks over 7 T30 nodes, single-threaded with core pinning) completed in 66 minutes and 7 seconds, with the B30 nodes delivering a nearly 16% speedup over the T30 class.
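The speedup figure follows from the two run times quoted above (speedup = old time / new time − 1); a quick shell check, noting that integer shell arithmetic truncates:

```shell
# T30 run: 66 min 7 s; B30 baseline: 57 min 0 s.
t30=$(( 66*60 + 7 ))    # 3967 s
b30=$(( 57*60 ))        # 3420 s
speedup_pct=$(( (t30 - b30) * 100 / b30 ))
echo "$speedup_pct"     # prints 15 (15.99% before truncation, i.e. ~16%)
```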
We then started working on improving the performance of the B30 jobs. A first source of inefficiency could be contention between the WRF processes and the system processes on the compute nodes, in particular the processes used by the OPA drivers. On the B30 cluster, OPA provides not only the low-latency interconnect used by MPI but also access to the high-speed Lustre storage, so both communication and I/O generate load on the OPA subsystem. In our baseline run, all the cores are used for the WRF computation, and inevitably some of the compute processes will have to share resources with OPA. This disrupts the processor cache and, given the sensitivity of WRF to memory bandwidth, can be a source of inefficiency.
We decided to trade compute processes for a more efficient use of the cache: in a second run we started only 130 MPI ranks, leaving 2 cores per node free for the OPA workload. We modified the pinning map of the job to make sure that the first core of each of the two Broadwell chips was not used by the job (these are cores 0 and 14 in the Intel MPI pinning map). We achieved this with the following commands:
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off
export I_MPI_PIN_PROCESSOR_EXCLUDE_LIST=0,14
mpirun -perhost 26 -np 130 ./wrf.exe
The first environment variable tells Intel MPI not to use the default process placement map (this is the map created by the job scheduler), while the second variable leaves cores 0 and 14 unused by the job and thus free for the OPA load. The mpirun command starts 130 MPI ranks, 26 on each B30 node. This run completed in 57 minutes and 42 seconds, just a little slower than the baseline run.
Now that the cache is used more efficiently, we can try to use some of the cores left free in the previous run for the job I/O, through WRF's asynchronous I/O quilting facility. Since the I/O processes do not do any computation, they can share cores with the OPA processes. We set up a run with 135 MPI ranks, of which 5 (one per node) are used for I/O quilting. In this run only core 0 of each node is left free:
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off
export I_MPI_PIN_PROCESSOR_EXCLUDE_LIST=0
mpirun -perhost 27 -np 135 ./wrf.exe
I/O quilting with 5 processes is configured with the WRF input parameter nio_tasks_per_group = 5. This run completed in 52 minutes and 49 seconds, nearly 8% faster than the baseline run.
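For reference, quilting lives in the &namelist_quilt section of namelist.input; a sketch consistent with the run above (nio_groups = 1 is the WRF default, shown only for completeness):

```
&namelist_quilt
 nio_tasks_per_group = 5,
 nio_groups          = 1,
/
```

The quilting ranks are taken out of the total MPI rank count, so the 135 ranks of this run break down into 130 compute ranks plus 5 × 1 I/O ranks.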
As a final experiment, we tried a hybrid dmp + smp run, reducing the number of MPI ranks and increasing the number of threads. Here, the best results were obtained with 65 MPI ranks (13 per node), 2 OpenMP threads per MPI rank, and 5 I/O processes. In this run two cores per node are left free:
export OMP_NUM_THREADS=2
mpirun -perhost 13 -np 65 ./wrf.exe
This run completed in 52 minutes and 57 seconds, indicating that for this workload pure MPI and hybrid runs provide virtually the same performance.
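The quoted gains can be checked with a little shell arithmetic (times transcribed from the runs above; integer arithmetic truncates):

```shell
baseline=$(( 57*60 ))      # pure-MPI baseline: 57:00
quilt=$(( 52*60 + 49 ))    # 135 ranks with I/O quilting: 52:49
hybrid=$(( 52*60 + 57 ))   # hybrid dmp + smp run: 52:57
echo $(( (baseline - quilt) * 100 / quilt ))   # prints 7 (7.9% before truncation)
echo $(( hybrid - quilt ))                     # prints 8: an 8 s gap, a virtual tie
```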
In conclusion, we have tuned a WRF workload on a Broadwell cluster with an OPA fabric: under-subscribing the nodes to leave one or two cores free for the OPA load provides a more efficient use of the processor cache. Further gains can be achieved by using some of the free cores for I/O quilting. Pure dmp and hybrid dmp + smp runs provide virtually the same performance for the 4dbasic benchmark.
By Massimo Malagoli, Sr. Cloud Architect