The AMR simulator is a tool for evaluating partitioning techniques without having to run the actual application on a parallel computer. The application dumps a "trace" which is attacked by a partitioner. The trace contains the state of the grid hierarchy (without any partitioning) at each stage of the adaptive application. The simulator reads the output from the partitioner and evaluates the quality of the partitioning, as comprised by the 5-component metric defining partitioning quality.
The following three steps describe the concepts associated with the AMR simulator.
Intra-level Communication
At each time step, the intra-level communication
is the amount of communication present between processors at the same level.
This is measured as follows.
From the input list, each bounding box (bb) is
grown by a specified stencil size in all directions such that it maintains
an overlapping zone at its boundaries with its neighboring processing elements
(bb in this case) for synchronization of data during the updating process.
So, the bb can be logically divided into appropriate number of interaction_bbs
(depending on the dimension). Each of these interaction_bbs is intersected
with each of the remaining bounding boxes on the list. If there is an intersection
between the two, there is indeed intra-level communication and it is measured
by the size of the intersection_bb of the two bbs. The intra-level communication
at a particular level is the sum of intra-level communication of all bounding
boxes in the list.
Amount of intra-level communication =
à.. (1)
Now, intra-level communication of bounding box
i is the sum of intra-level communication it has with all other bounding
boxes. We can update eqn. (1) as
Amount of intra-level communication =
The amount of intra-level communication of bounding
box i with bounding box j can be found by calculating half the size of
intersection of two bounding boxes.
Inter-level Communication
This is measured for each of the time-steps, between the processors that contain bbs at different levels of refinement. When a refined grid is created, its boundaries are usually interior to some coarser grid values. After updating the function values on a fine grid, the underlying coarse grid values are updated through restriction. The measurement process is as follows for each of the lists of bbs belonging to all processors at each timestep.
Each bb for a bblist is compared to each bb of a bblist that is either one level greater (child). If there is an interaction between the 2 bb, then inter-level communication exists and it is measured by the size of the intersection bb.
Data Movement
At the end of each time-step, the partitioning scheme allocates different bb to different processors. Hence bb that belonged to one processor may or may not belong to it at the next step. Therefore, data movement is characterized by the migration of any such bbs from one processor to another at different timesteps. It is measured as follows.
Compare the list of bbs at different time-steps.
If there is an intersection between the 2 bbs, then there indeed is data
movement and it is measured by the size of the intersection bb. The redistribution
from time-step t1 to time-step t2 is given as:
Let the load on each processor be load1, load2,
..., loadn at time-step t1
Let the loads on each processor be load1a, load2a,
..., loadna at time-step t2
Then,
Data movement for processor1 = dm1 = load1 Ç
load1a ...... (intersection)
Data movement for processor2 = dm2 = load2 Ç
load2a
:
:
Data movement for processor1n = dmn = loadn Ç
loadna
And,
Total data movement = dm1 + dm2 + .... + dmn
Load Imbalance & Overheads
The load on each processor is computed as the
workload area belonging to that processor at a certain level of refinement.
Considering the refinement in space and time (with refinement factor of
2), and the size of the bounding box as the workload parameter, the work
for a processor is given by
Work
= bb.size() * (2 ^ level)
Sum of the loads on all the bounding boxes of
all the processors is given by Total_load + = Work
Ideal
load on each processor = Total_load /nproc
The load imbalance on each processor at a timestep
can then be calculated as
Load
Imbalance = ( ( Actual load û Ideal load ) / Ideal load ) * 100 %
Partitioning overheads in terms of number of boxes can be computed as the number of boxes in the bounding box list associated with each processor at a given time-step.