Experimental Evaluation of SAMR Partitioners   

The experimental evaluation of the partitioners uses the 5-component metric - comprising of load balance (LB), communication (Comm), data movement (DM), overheads (OH), and speed - to obtain partitioner characterization. An overview of the experimental evaluation of the partitioners for 7 SAMR applications on 64 processors is presented in the table below.

For each partitioning technique, the grade ( +, -, o ) in the table indicates how well the partitioner succeeded in optimizing the particular metric. A '+' means significantly better than the other schemes, a '-' means significantly worse, and 'o' implies average or ok. The performance of each partitioner and the respective results follow.

G-MISP : The G-MISP technique is fast and receives average grades in all other metrics except load balance. The load balance generated by this scheme is poor due to the rudimentary load balancing approach.

G-MISP+SP : The G-MISP+SP scheme uses sequence partitioning to remedy G-MISP's poor load balance. The result is that load balance gets close to optimal at the expense of an additional computational cost.

pBD-ISP : Clearly, the pBD-ISP is a good overall partitioner. It excels in the speed and overhead metrics, and generates little communication and data movement. Load balance is its weakest spot. It does, however, receive average grades for load balance for about half of the applications.

SP : The SP technique is not a great success. It is computationally more intense than the other schemes, and its behavior is somewhat unpredictable. In many cases, SP generates a worse load balance than G-MISP+SP. This can be attributed to the differences in the sequences fed to the sequence partitioner.

ISP : The ISP technique is also very fast and generates low overhead. It does, however, struggle with load balance. Its behavior is similar to G-MISP, which is expected since the two schemes are very similar. However, ISP favors the overhead metric at the expense of communication.

WD : Our integration of the ParMetis WD partitioner could not compete with the other partitioners due to extremely large partitioning costs. Even though ParMetis is known to produce high-quality partitionings at a low cost, two extra steps had to be performed to adapt it to SAMR grid hierarchies. Prior to partitioning, a ParMetis graph is generated from the grid. After partitioning, clustering is used to re-generate grid blocks from graph partitions. As a consequence, the dedicated SAMR partitioners easily outperform this ParMetis integration.

RESULTS

Tests have been performed on 16, 32, and 64 processors using the suite of 2D and 3D applications. Besides partitioner evaluations, these tests also enable the evaluation of the impact of computational field population (number of computing nodes/processors) on partitioning quality for a given application. From the experiments, it has been observed that an increase in the number of processors results in a drop in load balance (i.e. imbalance increases), and a reduction in communication, data movement, and overheads. This is in conformance with the expected behavior. Click on the respective application to view the partitioning quality plots for 16, 32, and 64 processors.


Since the primary focus is on application-centric characterization, the table above illustrates the application partitioning requirements. A '+' means that it is significantly tougher to optimize this metric than for other applications, a '-' means significantly easier, and 'a' is for average.

Impact of Granularity

It is observed that as the atomic unit increases (i.e. as granularity is lowered û there being larger chunks and thus lesser number of grains), there is an influence on the partitioning quality parameters. The normalized effect is shown in the following graph, plotted for scalarwave 2D application on 16 processors. Atomic units of 1, 2, and 4 have been used to evaluate the effect of granularity on partitioning quality.

The intra-level communication reduces as atomic unit increases since a greater section of the application locality is under the purview of each processor and hence there is a lesser need for communicating data with other processors at the same level and at a particular time-step.

The data migration significantly increases as the atomic unit increases since the bounding boxes under each processor acquire larger size and hold more data/work. Hence, during refinement across time-steps, a large volume of data has to trade hands between processors in order to maintain state consistency, thus resulting in increased data migration.

Also, the load imbalance for processors increases since a large block of work has been assigned to different processors and there may not be any need for refinement and regridding for a lightly loaded processor, in which case it shares a reduced portion of the entire workload, and consequently, other processors are loaded more than their share. Lesser granularity leads to lesser balance between processors and load imbalance increases with increase in atomic unit. Moreover, there is a reduction in the overheads as atomic unit increases since there are lesser number of boxes (grains) to deal with, resulting in lesser overheads.

Hence, it is critical to arrive at a trade-off between the communication, data migration, and load imbalance parameters such that an optimal solution with respect to accuracy and speed of execution is obtained.
 

Back