Running LS-DYNA Efficiently on HPC Cloud: MPP vs. Hybrid
One of the most widely used applications on the Penguin Computing on Demand (POD) HPC cloud service is LS-DYNA, a general-purpose multiphysics finite element analysis package that can simulate complex real-world problems. LS-DYNA is commonly used by engineers in the automotive, aerospace, construction, military, manufacturing, and bioengineering industries.
Two parallel versions of the LS-DYNA solver are available: one implementing shared memory parallelism (SMP) via multiple threads, and one exploiting message passing parallelism (MPP) using the MPI library. More recently a hybrid solver has been developed that combines the two paradigms: SMP can be used within a compute node while processes running on different compute nodes communicate with MPP.
The hybrid approach has some attractive features that can, in principle, make it preferable to a pure MPP model for large jobs. First, with a mixture of MPI processes and threads one can deploy a computational profile tailored to the NUMA architecture of modern HPC hardware: one MPI process per NUMA node (typically a CPU socket), with threads exploiting the multiple cores of the processor. Second, a hybrid job uses fewer MPI processes than a pure MPP job of the same size, which can decrease the overall communication overhead. Finally, since the MPP solver is based on the domain decomposition method, using threads makes it possible to scale the simulation without changing the number of domains, thus preserving the numerical profile of the solution.
Here I compare the performance of MPP and hybrid LS-DYNA on the POD HPC cloud platform. The jobs were run on the H30 cluster, which features compute nodes with dual Intel® E5-2670 Sandy Bridge processors (hyperthreading off) and 64 GB of memory. An Intel® True Scale QDR Fabric provides low-latency connections, and the nodes are connected to a Ceph and NFS storage infrastructure via 10 GbE. The runs were performed with LS-DYNA version 7.0.0, using the MPP and hybrid distribution binaries for Linux x64, CentOS, Intel® Fortran Compiler (IFORT), and Open MPI. At run time the jobs used our custom build of Open MPI 1.5.5, which is tailored for POD’s InfiniBand interconnect and supports the Performance Scaled Messaging (PSM) communication interface. For the test I chose the car2car benchmark, a two-car crash simulation model with 2.4 million elements.
In a previous post we discussed the importance of binding the compute processes to a particular CPU socket in order to improve memory access and reduce cache misses. This is particularly relevant for the hybrid solver, which is designed to closely mimic the physical architecture of the processor. For this reason, in both the MPP and hybrid runs I used the Open MPI command line options for process binding. The MPP runs were thus started using the following command line:
mpirun -npernode 16 -bysocket -bind-to-socket mpp971 …
while the hybrid solver was launched as follows:
mpirun -npernode 2 -bysocket -bind-to-socket hyb971 ncpu=-8 …
Thus the pure MPP run uses 16 MPI processes per compute node, while in the hybrid solver only 2 MPI processes per node are started (one for each CPU socket), each process spawning 8 threads.
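As a rough illustration, the two launch lines above can be derived from the node geometry. The sketch below is only that: the function name and constants are hypothetical, the flags and binary names are the ones shown in this post, and the solver input arguments (elided above) are deliberately left out.

```python
# Node geometry of the H30 cluster as described in this post:
# two 8-core sockets per node, hyperthreading off.
CORES_PER_NODE = 16
SOCKETS_PER_NODE = 2


def launch_args(solver):
    """Return the mpirun argument list for 'mpp' or 'hybrid'.

    Hypothetical helper for illustration; solver input arguments are omitted.
    """
    binding = ["-bysocket", "-bind-to-socket"]
    if solver == "mpp":
        # Pure MPP: one MPI process per core.
        return ["mpirun", "-npernode", str(CORES_PER_NODE)] + binding + ["mpp971"]
    if solver == "hybrid":
        # Hybrid: one MPI process per socket, one thread per core of that socket.
        threads = CORES_PER_NODE // SOCKETS_PER_NODE
        return (["mpirun", "-npernode", str(SOCKETS_PER_NODE)]
                + binding + ["hyb971", f"ncpu=-{threads}"])
    raise ValueError(f"unknown solver: {solver!r}")


for solver in ("mpp", "hybrid"):
    print(" ".join(launch_args(solver)))
```

On a node with a different core or socket count only the two constants would change, assuming the same binding options are appropriate there.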
The results of the runs are collected in the following table. Here the second column lists the total number of processes for the run. In the case of the MPP solver this corresponds to the number of MPI processes, while for the hybrid solver the number of MPI processes is obtained by dividing the total number of processes by 8.
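The relation between the total process count and the number of MPI processes can be sketched as follows. This is a minimal illustration that assumes every job occupies whole 16-core nodes; the function name is hypothetical, and 256 is used as the example size because it is one of the job sizes discussed in this post.

```python
CORES_PER_NODE = 16    # two 8-core sockets per node
THREADS_PER_RANK = 8   # threads per MPI process in the hybrid runs


def job_layout(total_processes):
    """Map a total process count to node and MPI process counts.

    Hypothetical helper; assumes the job fills whole nodes.
    """
    if total_processes % CORES_PER_NODE != 0:
        raise ValueError("total_processes must be a multiple of 16")
    return {
        "nodes": total_processes // CORES_PER_NODE,
        # MPP: every process is an MPI process.
        "mpp_ranks": total_processes,
        # Hybrid: one MPI process for every 8 threads.
        "hybrid_ranks": total_processes // THREADS_PER_RANK,
    }


print(job_layout(256))
# {'nodes': 16, 'mpp_ranks': 256, 'hybrid_ranks': 32}
```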
We can see that for this benchmark the pure MPP solver performs better than the hybrid solver at lower processor counts. Here the number of MPI processes is not large enough for the communication overhead to dominate the compute time of the pure MPP version. In this case the different type of overhead incurred by the multithreaded run, together with the different parallelization model, makes the hybrid solver less efficient.
For larger jobs the performance of the two solvers becomes virtually the same, with perhaps a slight advantage for the hybrid solver in the 256-processor run. For these jobs the communication overhead takes a larger share of the total time, which gives a relative advantage to the hybrid solver. On the other hand, the fast InfiniBand hardware and the optimized Open MPI stack used by POD ensure that the MPP LS-DYNA implementation remains fairly efficient at these job sizes too.
In conclusion, the MPP LS-DYNA solver delivers the best or near-best performance for both small and medium-large jobs on the Penguin HPC cloud. The hybrid solver may be appealing for large jobs, especially if one wants to further speed up the calculation without changing the domain decomposition properties of the simulation.