next up previous
Next: 3.3 Adaptive HPA Up: 3 Hierarchical Partitioning Algorithm Previous: 3.1 General HPA

3.2 Static HPA

In the Static HPA strategy, the group size and the group topology is defined at startup based on the available processors and the size of the problem domain. The static processor group hierarchy consists of two levels. It is static in the sense that once the group configuration is setup it will be fixed for the entire execution of the application. Even though it is static, SHPA does possess the basic advantages of the general HPA strategy. It localizes the load redistribution and balancing within processor groups and enables concurrent communication among processor groups. Note that, SHPA is still a dynamic load balancing algorithm [14], as load is dynamically redistributed within and across processor groups - only the processor group hierarchy remains static. The load partitioning and assignment procedure is shown in Table 3. As described in the table, the number of groups, $N_{totalgroups}$, is defined at application startup. The load balancing phase in SHPA is similar to the steps in Table 2 with two group levels.

Table 3: Load partitioning and assignment in Static HPA
  1. [Step 1.] Use SFC to obtain the composite grid unit (CGU) list.
  2. [Step 2.] Partition the CGU list into $N_{totalgroups}$ subdomains.
  3. [Step 3.] Assign the load $L_i$ on subdomain $R_i$ to a group of processors $G_i$ such that the number of processors $NP_i$ in the group $G_i$ is proportional to the load $L_i$, i.e., $NP_i=L_i/L_{sum} \times NP_{sum}$, where $L_{sum}$ is the total size of load and $NP_{sum}$ is the total number of processors.
  4. [Step 4.] Partition the load portion $L_i$ and assign the appropriate portion to the individual processor in the group $G_i$, for $i=0, 1, ..., NP_{totalgroups}-1$.

The Static HPA is implemented as part of the GrACE library [10]. The groups are created using communicators provided by the MPI library. Communication within groups is through intracommunicators while communication between processors belonging to different groups is through intercommunicators. The Static HPA scheme is evaluated on the IBM SP2 cluster at University of California, San Diego. The application used in these experiments is a numerical relativity kernel (Wave3D) and belongs to the general class of SAMR applications. Wave3D solves a coupled set of partial differential equations. This kernel is a part of the Cactus numerical relativity toolkit [4].

Figure 5: Execution time: Static HPA versus Non-HPA scheme
\begin{figure*}
\centering\epsfig{file=figures/statichpa96.eps, scale=0.8,
clip=, bb= 35 505 580 685}
\end{figure*}

The experiments measure the total execution time of the simulations using Static HPA and Non-HPA schemes. In Figure 5, we observe that, the maximum performance gain is obtained for 128 processors with 8 groups, a $41\%$ overall execution time reduction as compared to the Non-HPA scheme. As shown by the evaluation, the benefits of SHPA depends on the appropriate selection of the number of processor groups, which in turn depends on the system and the application. The adaptive HPA scheme attempts to address the limitation by dynamically managing processor groups.
next up previous
Next: 3.3 Adaptive HPA Up: 3 Hierarchical Partitioning Algorithm Previous: 3.1 General HPA
Xiaolin Li 2002-06-15