Objective: Profiling an application lets the user identify its bottlenecks and obtain per-function profiling information, including for templated functions. A profile shows which modules are the most time-intensive. The objective is therefore to collect and analyze performance data in order to isolate functions in GrACE that can be scheduled more efficiently, improving the overall execution time.
TAU: TAU (Tuning and Analysis Utilities) is a visual programming and performance analysis environment for parallel C++ and HPF that uses Tcl/Tk for graphics. It is currently designed to instrument parallel, multi-threaded C and C++ code. TAU collects performance data while the program runs and then provides post-mortem analysis and display of the performance information.
TAU can show the exclusive and inclusive time spent in
each function. For templated entities, it shows the breakdown of time spent for
each instantiation. Other data include how many times each function was
called, how many profiled functions each function invoked, and the mean
inclusive time per call. It shows the mean time spent in a function over all
nodes, contexts and threads. It can also show the exclusive and inclusive time
for each individual invocation of a function (as well as the sum aggregated
over all invocations).
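The relationship between inclusive and exclusive time can be illustrated with a minimal sketch. It is not TAU code: CallStackProfiler, enter and leave are hypothetical names, and timestamps are passed in explicitly so the arithmetic is easy to follow. The sketch assumes that exclusive time is inclusive time minus the inclusive time of the profiled callees.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Per-function totals, in the same units as the timestamps passed in.
struct FunctionStats {
    long calls = 0;
    double inclusive = 0.0;  // entry to exit, profiled callees included
    double exclusive = 0.0;  // inclusive time minus time in profiled callees
};

// Hypothetical helper (not part of TAU): rebuilds per-function totals
// from a stream of routine entry and exit events.
class CallStackProfiler {
public:
    void enter(const std::string& name, double now) {
        stack_.push_back(Frame{name, now, 0.0});
    }

    void leave(double now) {
        assert(!stack_.empty() && "leave() without a matching enter()");
        const Frame frame = stack_.back();
        stack_.pop_back();

        const double inclusive = now - frame.start;
        FunctionStats& s = stats_[frame.name];
        s.calls += 1;
        s.inclusive += inclusive;
        s.exclusive += inclusive - frame.child_time;

        // Charge this call's inclusive time to the caller as callee time.
        if (!stack_.empty()) {
            stack_.back().child_time += inclusive;
        }
    }

    const FunctionStats& stats(const std::string& name) const {
        return stats_.at(name);
    }

private:
    struct Frame {
        std::string name;
        double start;
        double child_time;
    };
    std::vector<Frame> stack_;
    std::map<std::string, FunctionStats> stats_;
};
```

If main runs from t=0 to t=10 and calls solve twice (t=2 to 5 and t=6 to 8), main has an inclusive time of 10 and an exclusive time of 5, while solve has 5 for both. The mean inclusive time per call is inclusive divided by calls, 2.5 for solve. Recursive calls are not treated specially in this sketch.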
Instead of time, it can use hardware performance counters and show the number of
instructions issued for each function, the cycles, loads, stores, floating point
operations, primary and secondary data cache misses, TLB misses, etc.
It can also calculate statistics such as the standard deviation of the
exclusive time (or counts) spent in each templated function.
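As a small worked example of such a statistic, the sketch below computes the mean and standard deviation of one function's exclusive time across several nodes. The names Summary and summarize are hypothetical. Using the population form (dividing by N) is an assumption, since the text does not say which form TAU reports.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Summary {
    double mean;
    double stddev;  // population standard deviation (divides by N)
};

// Hypothetical helper: summarizes one function's exclusive time (or a
// counter value) as measured on each node, context, or thread.
Summary summarize(const std::vector<double>& samples) {
    assert(!samples.empty() && "need at least one sample");

    double sum = 0.0;
    for (double x : samples) sum += x;
    const double mean = sum / samples.size();

    double squared_diff = 0.0;
    for (double x : samples) squared_diff += (x - mean) * (x - mean);

    return Summary{mean, std::sqrt(squared_diff / samples.size())};
}
```

A large standard deviation relative to the mean suggests that the function's cost is unevenly distributed across nodes.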
Instead of profiling whole functions, the user can profile at a finer
granularity with timers: all of the above quantities can be collected for
multiple user-defined timers that cover statements in the code rather than
functions.
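The idea of timing statements rather than functions can be sketched with a small start/stop timer. UserTimer and sum_of_squares are hypothetical stand-ins written with the C++ standard library so that the example compiles without TAU installed; TAU's own timer macros are not shown here.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for a user-defined timer (not TAU's API).
class UserTimer {
public:
    void start() {
        assert(!running_ && "timer started twice");
        running_ = true;
        begin_ = Clock::now();
    }

    void stop() {
        assert(running_ && "timer stopped without start");
        total_ += Clock::now() - begin_;
        running_ = false;
        ++calls_;
    }

    long calls() const { return calls_; }

    double total_seconds() const {
        return std::chrono::duration<double>(total_).count();
    }

private:
    using Clock = std::chrono::steady_clock;
    Clock::time_point begin_;
    Clock::duration total_ = Clock::duration::zero();
    bool running_ = false;
    long calls_ = 0;
};

// Times only the accumulation loop, not the whole function.
double sum_of_squares(int n, UserTimer& loop_timer) {
    double result = 0.0;
    loop_timer.start();
    for (int i = 1; i <= n; ++i) {
        result += static_cast<double>(i) * i;
    }
    loop_timer.stop();
    return result;
}
```

Several such timers can coexist in one function, each covering a different group of statements and accumulating its own call count and total time.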
Instrumenting the code using PDT: To profile a function, macros must be added to the source code to identify routine transitions. This can be done automatically with the TAU C++ instrumentor, tau_instrumentor, or by instrumenting the code at run time using the Dyninst API. We used PDT (Program Database Toolkit), provided by the University of Oregon, to instrument the GrACE source code. PDT inserts the macros into the source code during compilation, and the object files are then created from the instrumented source files. The architecture is shown in the diagram below.
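The effect of a macro inserted at routine entry, and the way a templated function can be reported per instantiation, can be sketched as follows. This is an illustration only: PROFILE_ENTRY, record_entry, call_counts, calls_for and scale are hypothetical names, not the macros TAU or PDT actually generate, and the sketch only counts calls where a real instrumentor would also record entry and exit times.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <utility>

// Hypothetical registry keyed by (routine name, template argument type).
using RoutineKey = std::pair<std::string, std::type_index>;

std::map<RoutineKey, long>& call_counts() {
    static std::map<RoutineKey, long> counts;
    return counts;
}

void record_entry(const char* routine, const std::type_info& type) {
    assert(routine != nullptr);
    ++call_counts()[RoutineKey(routine, std::type_index(type))];
}

// Stand-in for an instrumentation macro placed at the top of a routine.
#define PROFILE_ENTRY(routine, type) record_entry((routine), typeid(type))

template <typename T>
T scale(T value, T factor) {
    PROFILE_ENTRY("scale", T);  // the kind of line an instrumentor would insert
    return value * factor;
}

// Returns how often one instantiation of a routine was entered.
long calls_for(const char* routine, const std::type_info& type) {
    assert(routine != nullptr);
    const auto it =
        call_counts().find(RoutineKey(routine, std::type_index(type)));
    return it == call_counts().end() ? 0 : it->second;
}
```

Calling scale<int> twice and scale<double> once yields separate counts of 2 and 1, because the type is part of the key.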

Visualizing traces using VAMPIR: Profiling typically shows the distribution of execution time across routines. It can show the code locations associated with specific bottlenecks, but it does not show the temporal aspect of performance variations. Tracing the execution of a parallel program shows when and where an event occurred, in terms of the process that executed it and the location in the source code. In addition to PROFILE files, TAU also generates TRACE files for each node, context and thread. These TRACE files can then be converted to .pv format and viewed using VAMPIR (Visualization and Analysis of MPI Resources).

The resulting display shows exactly when each message is sent from one node to another, along with other parallelism statistics.
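The difference between a profile and a trace can be made concrete with a small sketch of timestamped events. The record layout and the names TraceEvent, Message and match_messages are hypothetical and do not reflect the TAU TRACE or .pv file formats. The sketch assumes that events are sorted by time and that a tag identifies a message uniquely between a given sender and receiver.

```cpp
#include <cassert>
#include <map>
#include <tuple>
#include <vector>

enum class EventKind { Send, Receive };

// Hypothetical trace record: one timestamped event on one node.
struct TraceEvent {
    double time;
    int node;
    EventKind kind;
    int peer;  // destination for Send, source for Receive
    int tag;   // pairs a send with its receive
};

struct Message {
    int from;
    int to;
    double sent_at;
    double received_at;
};

// Pairs each Receive with the earlier Send that has the same (from, to, tag).
std::vector<Message> match_messages(const std::vector<TraceEvent>& events) {
    std::map<std::tuple<int, int, int>, double> pending_sends;
    std::vector<Message> messages;
    double last_time = events.empty() ? 0.0 : events.front().time;

    for (const TraceEvent& e : events) {
        assert(e.time >= last_time && "events must be sorted by time");
        last_time = e.time;

        if (e.kind == EventKind::Send) {
            pending_sends[std::make_tuple(e.node, e.peer, e.tag)] = e.time;
        } else {
            auto it = pending_sends.find(std::make_tuple(e.peer, e.node, e.tag));
            if (it == pending_sends.end()) continue;  // no recorded send
            messages.push_back(Message{e.peer, e.node, it->second, e.time});
            pending_sends.erase(it);
        }
    }
    return messages;
}
```

Each matched Message carries both timestamps, so the time a message spent in transit is received_at minus sent_at. A profile, which keeps only per-routine totals, cannot provide this information.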
References:
TAU http://www.acl.lanl.gov/tau/
VAMPIR http://www.pallas.com/e/products/vampir/index.htm
PDT http://www.cs.uoregon.edu/research/paracomp/pdtoolkit/