This is an application-centric experimental evaluation of six partitioning schemes (SFC, G-MISP, G-MISP+SP, pBD-ISP, SP-ISP, and ParMetis) using the RM3D compressible turbulence application, conducted on the NPACI IBM SP2 "Blue Horizon" at the San Diego Supercomputer Center.
The experiments used a base (coarse) grid of 128*32*32 with 3 levels of factor-2 space-time refinement, with dynamic regridding and redistribution at regular intervals. The application ran for 150 coarse-level steps in each case. The experiments varied the partitioner used, the number of processors (16-128), and the partitioning granularity (2*2*2 - 8*8*8). The metrics used for evaluation are the total run-time, the maximum load imbalance, and the corresponding AMR efficiency. AMR efficiency measures the effectiveness of AMR and affects the partitioning and load-balancing requirements: high AMR efficiency leads to finer-granularity refinements. The results are shown in the following graphs. Note that the absence of a bar for a partitioner in a graph indicates that the partitioner was not suitable for that combination.
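The two derived metrics can be illustrated with a short Python sketch. The text above does not give formulas, so both definitions here are assumptions: maximum load imbalance is taken as the largest percentage by which any processor's work exceeds the mean, and AMR efficiency is taken as the fraction of cells saved relative to a uniform grid at the finest resolution (time refinement is ignored). The function names and sample numbers are hypothetical.

```python
def max_load_imbalance(loads):
    """Largest percentage by which a processor's load exceeds the mean load.

    Assumed definition: max_i (W_i - W_avg) / W_avg * 100.
    """
    mean = sum(loads) / len(loads)
    return max((w - mean) / mean * 100.0 for w in loads)


def amr_efficiency(cells_per_level, base_shape, refine_factor=2):
    """Fraction of cells saved versus a uniform grid at the finest level.

    Assumed definition: 1 - (cells in the hierarchy) / (cells in the
    equivalent uniform fine grid). cells_per_level[0] is the base level.
    """
    finest = refine_factor ** (len(cells_per_level) - 1)
    uniform_cells = 1
    for n in base_shape:
        uniform_cells *= n * finest
    return 1.0 - sum(cells_per_level) / uniform_cells


if __name__ == "__main__":
    # Hypothetical per-processor work units, for illustration only.
    print(max_load_imbalance([100, 100, 100, 140]))
    # Hypothetical three-level hierarchy on a 2*2*2 base grid.
    print(amr_efficiency([8, 16, 40], (2, 2, 2)))
```

Under these assumed definitions, an efficiency close to 1 means that refinement is confined to a small part of the domain, which is the situation in which refined regions become small and harder to balance.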
The RM3D application requires rapid refinement and efficient redistribution because of the shock wave that is introduced. The pBD-ISP, G-MISP+SP, and SFC partitioning schemes are best suited to RM3D, as they are high-speed partitioners that attempt to distribute the workload as evenly as possible while maintaining good communication patterns. The pBD-ISP scheme is the fastest partitioner but produces only average load balance, which worsens at higher granularity. The G-MISP+SP and SFC techniques yield excellent load balance but are relatively slower. The G-MISP scheme favors speed over load balance and has average overall performance. The SP-ISP technique fares poorly because of partitioning overheads and high computational costs, resulting in higher partitioning time and poor load balance. All evaluated partitioning techniques scale reasonably well. The optimal partitioning granularity for an application may require a trade-off between execution speed and load imbalance. For the RM3D application, a granularity of 4 gives the lowest execution time and yields acceptable load imbalance.
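As a rough illustration of how a space-filling-curve (SFC) partitioner balances load while preserving locality, the Python sketch below orders blocks along a Morton (Z-order) curve and cuts the resulting 1-D sequence into contiguous pieces of roughly equal weight. This is a minimal sketch under stated assumptions, not the SFC scheme evaluated here: the choice of the Morton curve, the block representation, and the names `morton_key` and `sfc_partition` are all hypothetical.

```python
def morton_key(x, y, z, bits=10):
    """Interleave the bits of (x, y, z) into a single Morton (Z-order) index."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (3 * b)
        key |= ((y >> b) & 1) << (3 * b + 1)
        key |= ((z >> b) & 1) << (3 * b + 2)
    return key


def sfc_partition(blocks, nprocs):
    """Assign each block to a processor rank.

    blocks: list of ((x, y, z), weight) with non-negative integer block
    coordinates and a positive total weight.
    Returns a list of ranks aligned with the input order.
    """
    # Visit blocks in curve order so that neighbours tend to stay together.
    order = sorted(range(len(blocks)), key=lambda i: morton_key(*blocks[i][0]))
    total = sum(w for _, w in blocks)
    owners = [0] * len(blocks)
    running = 0.0
    for i in order:
        w = blocks[i][1]
        # Place the block by the midpoint of its weight interval on the curve.
        rank = min(int((running + w / 2) * nprocs / total), nprocs - 1)
        owners[i] = rank
        running += w
    return owners


if __name__ == "__main__":
    # Eight unit-weight blocks forming a 2*2*2 arrangement, split over 4 ranks.
    blocks = [((x, y, z), 1) for z in range(2) for y in range(2) for x in range(2)]
    print(sfc_partition(blocks, 4))
```

In this sketch the block size plays the role of the partitioning granularity: larger blocks mean fewer items to sort and assign, but the cuts along the curve can only fall on block boundaries, so the achievable balance becomes coarser.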
The SP-ISP, G-MISP, and G-MISP+SP partitioning schemes fail in experiments on a large number of processors at higher granularity. The SP-ISP scheme fails because of the large number of blocks created. G-MISP and its variant G-MISP+SP fail because of the effects of high granularity on the underlying partitioning mechanism. Finally, our ParMetis integration proved to be computationally expensive because of the additional effort required to adapt it to SAMR grid hierarchies. As a result, it could not compete with the dedicated SAMR partitioners for the RM3D application and is not part of the results.