Tuning WRF Performance for HPC Cloud Sandy Bridge Systems
In High-Performance Computing, the importance of performance cannot be stressed enough. This is even more crucial when using a pay-per-use, HPC Cloud service like Penguin Computing on Demand (POD), where better computational efficiency directly translates in cost savings for the end-user. Achieving high performance requires an optimal combination of hardware and software components.
Our POD public HPC cloud combines non-virtualized, bare-metal compute nodes, low-latency network and fast storage with an optimized software stack and end-user applications specifically tuned for the hardware platform. Here I describe how we tuned the performance of the Weather Research and Forecasting (WRF) model on our new H30 cluster based on Sandy Bridge compute nodes.
WRF is a mesoscale numerical weather prediction system that is widely used for both atmospheric research and weather forecasting. The computational core of WRF is written in Fortran, and allows parallelism using both the distributed memory and shared memory models. The H30 cluster features compute nodes with dual Intel E5-2670 Sandy Bridge processors (hyperthreading off) and 64GB of memory. An Intel True Scale QDR Fabric provides low-latency connections, and the nodes are connected to a Ceph and NFS infrastructure via 10 Gb Ethernet.
We chose to use the Intel Composer XE compiler suite to build WRF, as this compiler can provide code optimization tailored to the processor features. For instance, the flag ‘-xAVX’ can generate Advanced Vector Extensions instructions that take advantage of the Sandy Bridge architecture.
I built version 3.5.1 of WRF using the Intel 12.1.0 compilers, following the Intel hints for improving performance. For distributed memory parallelism I used our custom build of Open MPI 1.6.4 which is tailored for POD’s InfiniBand interconnect and supports the Performance Scaled Messaging (PSM) communication interface. All the other external libraries needed by WRF, for instance NetCDF, were built using the same compiler and MPI stacks. Two WRF executables were built: an MPI version leveraging distributed memory parallelism, and an hybrid MPI-OpenMP version leveraging both the distributed and shared memory models.
For testing and tuning I used the 4dbasic WRF benchmark, which is a test case designed to represent a typical end user scenario for regional weather modelling. Jobs were run using 12 of the H30 compute nodes, corresponding to 192 total processors.
A first run of the distributed memory executable completed in 75 minutes of wall clock time. This already represents a nice improvement on the WRF performance with respect to our Westmere cluster, M40, which is based on Westmere 2.9 GHz processors. A run of the 4dbasic benchmark on the M40 cluster (16 nodes, 192 processors, optimized with ‘-xSSE4.2’) takes about 110 minutes. Thus, we have a 32% improvement in performance right off the bat for switching from Westmere to Sandy Bridge processors. Even more impressive considering that our Sandy Bridge processors have a clock rate of 2.6 GHz.
Having established 75 minutes as the baseline for my tests, I worked on further tuning the WRF runs. A first way to improve the performance is to optimize memory and cache usage. This is usually achieved by pinning the computing processes to a particular CPU core or socket in order to improve the memory access and reduce the cache misses. Open MPI has a number of command line options specially designed for this purpose. I thus repeated the run using the following command line:
mpirun –bycore –bind-to-core –report-bindings ./wrf.exe
This will bind each compute process to a specific core. The last option (–report-bindings) produces a printout of the CPU affinity masks for each MPI process, allowing to check that the desired effect has been achieved. This run completed in about 65 minutes. A nice 13% improvement on the baseline.
I then went to work on the hybrid version, to see if further gains can be extracted with a careful balance of MPI and OpenMP threads. In addition to adjusting the number of MPI processes and threads, the WRF input parameters for evenly distributing the simulation domain between the compute processes can have an impact on performance. Furthermore, in the case of the OpenMP version, the domain of each process can be divided into tiles and ideally, the best performance is achieved when the size of a tile can fit inside the processor cache.
The best performance of the hybrid version was obtained using 96 MPI processes, 2 threads per process and 2 tiles, with the following command line (after setting numtiles=2 in namelist.input):
mpirun –npernode 8 –bysocket –bind-to-socket –report-bindings ./wrf.exe
Here the computing processes are pinned to a CPU socket. This run completed in 65 minutes too. No further improvement on the pure distributed memory run, but still 13% better than the baseline. The main advantage of using the hybrid WRF over the pure distributed memory version is the reduction of the number of MPI processes, and thus in the communication overhead. This probably takes jobs larger than 192 processors to become noticeable, especially if MPI runs on a fast interconnect like the POD InfiniBand fabric.
A close inspection of the WRF output revealed a further area for improvements: I/O. WRF reports detailed timing for the computational steps of the run. By comparing the time spent in computations to the total run time one can deduce the time spent during I/O. For the 65 minutes run the split turned out to be 60 minutes computation and 5 minutes I/O. This is already quite good, and is due to the efficiency of the POD Ceph storage, but WRF provides a way of further tuning the performance: I/O quilting.
With I/O quilting, some of the compute processes are dedicated to I/O. This allows the operations to proceed asynchronously, as now the pure compute processes do not have to wait anymore on the I/O completion to proceed with the computation. Of course, the drawback is that now there are less computing processes, so again a delicate balance has to be found between the number of processes devolved to I/O and those dedicated to the computation.
By assigning 2 processes to I/O I was able to further reduce the compute time to 62 minutes, a 17% improvement over the baseline. This also required a small modification of the command line, to ensure the two I/O processes were running on different compute nodes:
mpirun –bynode –cpus-per-proc 1 –bind-to-core –report-bindings ./wrf.exe
In conclusion, achieving high-performance takes a multifaceted approach: state-of-the art hardware, optimized software stacks, and fine tuning of the end user application are all important parts of the process. At Penguin Computing we strive to offer an HPC Cloud environment that includes the best of each component.