The first step in visiometrics, identification, involves feature extraction, which has also been considered and carried out in contexts other than the process of obtaining understanding. Feature extraction can be used as a selective way to display data, which avoids the confusing visual clutter that arises when too much information is shown [10]. Besides supporting the understanding process and helping to avoid visual clutter, feature extraction may be a very effective and natural way to deal with large datasets. Many existing visualization systems make performance trade-offs that assume relatively small quantities of data (solution and grid fit in RAM) [11]. Unfortunately, no current computer can hold some of the time-dependent CFD datasets (from 5 to 162 gigabytes) that are currently being produced. The key points in dealing with these datasets are the extraction of ``scientific data'' and the use of a ``persistent object database'' [11]. The extracted objects usually correspond to ``coherent structures'' (localized objects that persist over ``characteristic times'' [3]). Feature extraction is accompanied by large reductions in storage requirements (0.3-6.7% of the solution size) [11]. The main disadvantage of the feature extraction approach is that the solution domain outside the extracted regions cannot be examined [11]. Reduced representations of those sub-domains in terms of statistical quantities are still under investigation.
The main function of feature extraction is to start the abstraction process, i.e., the selection of the portions of information that are fundamental to the physical phenomena observed. This can be accomplished in different ways. At its simplest, it can be just a thresholding operation and extrema tracking [12]; however, it can also be viewed as the process of obtaining reduced representations of the data (ellipsoidal or skeletal representations plus vector lines released from selected starting points) [13]. These representations can be used to point out causal connections between different variables [12, 9, 13] and also for model juxtaposition, i.e., detailed and quantitative comparison of experimental and/or computational images of similar or different functions at the same or different times [2]. The reduced representations correspond to ``identified objects'' in the data, and can also be used as tools for perusal, interpretation, quantification and feature tracking [3]. Other feature extraction procedures include the use of streamlines connected through nodes or critical points, which characterize the global flow topology [14]. Some of these methods have been implemented as interactive tools to extract meaning from datasets in visualization environments.
The feature extraction process suggests a natural distribution of tasks between supercomputer and workstation: solvers, solutions and extractors should reside on the supercomputer, where the large size of the dataset is dealt with more efficiently. Reduced object manipulation, feature tracking and time correlation are more properly performed on the workstation. The desired interactivity requires communication between supercomputer and workstation. Communicating at the level of the reduced object representation helps to avoid the network bottleneck.
In the particular case of isotropic turbulence simulations, different researchers have used a ``probe'' or ``window'' (fixed or Lagrangian) for searching and obtaining the full time history of vortex tubes [15, 16]. Their object identification algorithm consists of taking a local maximum in the field and tracing the skeleton of the vortex tube. Diagnostics include the tube's length, curvature, length-to-diameter ratio and circulation [17]. Algorithms for extracting ``events'' have indicated that intermittent regions may be major, if not dominant, contributors to global statistics in turbulence [18]. In our studies of vortex collapse and reconnection, our ``diagnostics box'' surrounds regions of significant physical behavior, detected through the search for maxima events [5].
The feature extraction/object identification is performed in three
steps: thresholding, object identification and ellipsoid fitting. The
objects in the dataset, or field, are defined by a scalar function f
(e.g. vorticity magnitude) and a threshold. The first stage,
thresholding, consists of finding and extracting the points in the
dataset where f is above the threshold specified by the user,
including their position in space. Object identification is performed
on the thresholded grid points based on a recursive search and a
connectivity criterion. A direct method is not efficient because of
the number of operations it requires. Our object segment algorithm
reduces the computational complexity of the problem by using an octree
data structure. Once the objects have been found, an object
quantification process starts, which we call ellipsoid fitting. This
consists of finding the ``mass'' m, the centroid, the average
orientation of the vector field, the tensor of second moments, the
maximum, and the position of the maximum inside each object [12].
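The first two stages can be sketched as follows. This is a minimal illustration and not the octree-based object segment algorithm: it assumes the field is a NumPy array, that two thresholded points belong to the same object when they share a grid face (6-connectivity, one possible connectivity criterion), and it replaces the recursive search by an explicit stack with a hash lookup. The function names are hypothetical.

```python
import numpy as np

# The six face neighbours of a grid point (assumed connectivity criterion).
FACE_NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                   (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def threshold_points(f, threshold):
    """Stage 1: positions (N, 3) and values (N,) of the points with f > threshold."""
    mask = f > threshold
    return np.argwhere(mask), f[mask]


def identify_objects(coords):
    """Stage 2: group thresholded points into connected objects.

    Returns a list of objects, each a list of row indices into coords.
    """
    points = [tuple(p) for p in coords.tolist()]
    index_of = {p: i for i, p in enumerate(points)}
    visited = [False] * len(points)
    objects = []
    for seed in range(len(points)):
        if visited[seed]:
            continue
        visited[seed] = True
        stack, members = [seed], []
        while stack:  # explicit stack instead of recursion
            i = stack.pop()
            members.append(i)
            x, y, z = points[i]
            for dx, dy, dz in FACE_NEIGHBOURS:
                j = index_of.get((x + dx, y + dy, z + dz))
                if j is not None and not visited[j]:
                    visited[j] = True
                    stack.append(j)
        objects.append(members)
    return objects


field = np.zeros((6, 6, 6))
field[1, 1, 1], field[1, 1, 2] = 5.0, 4.0   # two touching points: one object
field[4, 4, 4] = 3.0                        # an isolated point: a second object
coords, values = threshold_points(field, threshold=1.0)
objects = identify_objects(coords)
```

In this sketch each thresholded point is visited once and each neighbour test is a dictionary lookup, which avoids comparing every pair of points.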
The reduced quantities are used to produce low-order
representations, or ellipsoids, which are located at the
centroids of the objects. The axes are the square roots of the
eigenvalues of the tensor of second moments, normalized so
that the ellipsoids and the objects have the same volume. The
ellipsoids are oriented according to the eigenvectors of the
tensor. The reduced representation obtained in this manner not
only fits the shape of the object, but averages over the values of the
scalar field in the interior, making it possible to differentiate
between objects of similar shape and volume. Using one of the reduced
quantities (usually the maximum), the objects are sorted and listed for
further use by the user or by another post-processing program (such as
feature tracking).
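The ellipsoid fitting step can be illustrated with the sketch below. It is written under stated assumptions rather than taken from the original code: the moments are weighted by the scalar values (suggested by the remark that the representation averages over the field in the interior), the object volume is taken as the number of points times a cell volume, and the average orientation of the vector field is omitted. The name `fit_ellipsoid` and its parameters are hypothetical.

```python
import numpy as np


def fit_ellipsoid(coords, values, cell_volume=1.0):
    """Reduce one object to mass, centroid, second moments and an ellipsoid."""
    x = np.asarray(coords, dtype=float)   # (N, 3) positions of the object's points
    w = np.asarray(values, dtype=float)   # (N,) scalar values, used as weights (assumption)

    mass = w.sum()
    centroid = (w[:, None] * x).sum(axis=0) / mass
    d = x - centroid
    # Weighted tensor of second moments about the centroid (3 x 3, symmetric).
    moments = np.einsum("n,ni,nj->ij", w, d, d) / mass

    # Eigenvectors give the orientation; square roots of the eigenvalues
    # give the relative lengths of the axes.
    eigvals, eigvecs = np.linalg.eigh(moments)
    axes = np.sqrt(np.clip(eigvals, 0.0, None))

    # Rescale so that (4/3) pi a b c equals the object volume.
    object_volume = len(x) * cell_volume
    raw_volume = 4.0 / 3.0 * np.pi * np.prod(axes)
    if raw_volume > 0.0:
        axes = axes * (object_volume / raw_volume) ** (1.0 / 3.0)
    # Degenerate objects (a single point, a line, a plane) are left unscaled.

    i_max = int(np.argmax(w))
    return {
        "mass": mass,
        "centroid": centroid,
        "moments": moments,
        "axes": axes,                 # semi-axes, ascending
        "orientation": eigvecs,       # columns are the axis directions
        "maximum": w[i_max],
        "position_of_maximum": x[i_max],
    }


# A uniform 3 x 3 x 3 block of points should reduce to a sphere centred at (1, 1, 1).
block = np.array([(i, j, k) for i in range(3) for j in range(3) for k in range(3)])
fit = fit_ellipsoid(block, np.ones(len(block)))
```

Sorting a list of such dictionaries by the `"maximum"` entry gives one possible ordering of the objects for later use.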
As a framework for implementing these tools, we selected the commercial package AVS, which is based on a data-flow model for visualization and control [19, 20] and is built on the concepts of modularity and networking. Application units, called modules, are organized and made available to the user through a ``network editor''. The user selects modules to form networks in the working area of the network editor. The networks are therefore flexible enough to meet the particular needs of the users. The modular design has the advantage of allowing users to produce their own modules and insert them in networks of standard modules. This gives users all of the power of the commercial product in their very specific applications. Another advantage is the availability of mechanisms to share tasks among different machines via ``remote modules''. The package therefore provides a basis for interaction between the supercomputer and the workstation.
We have feature extraction algorithms in different implementations operating on both the supercomputer and the workstation. In the first approach, we perform object segmentation on the supercomputer. The extracted objects are then displayed on the workstation. In a second approach, we produce an interactive window into the large dataset by using a remote module running on the CM5. This module sends interactively selected sub-domains of data from the supercomputer to the workstation. In a third implementation, the large dataset is subjected to a thresholding post-processing operation (selective data reduction) on the supercomputer in batch mode. The resulting reduced dataset still covers (selectively) the complete domain and can then be post-processed interactively on the workstation.
In order to deal with the large number of vortex structures present in
the turbulence dataset, we classify them according to their
relationship to maxima events, not only of vorticity magnitude but
strain-rate as well. This is the objective of the ``object-segment''
program, which is an enhancement to the standard iso-surfacing
technique. On the CM5, a number of parallel functions are used to
represent data-points and for operations of connectivity and
membership. We demonstrate the use of this feature isolation code in
the accompanying figure. A threshold of 20% of the maximum was used
to detect the objects shown; the regions themselves, however, are
extracted based upon their connectivity. The dataset is a 256 cubed
scalar field (vorticity magnitude). The output of the program
consists of a list of objects sorted and colored according to the
local maxima inside them, which allows the user to select ``coherent''
regions for further quantification. After the objects in the field
have been identified, the predominant object (colored in red) is
extracted for closer examination (see the corresponding figure).
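The sorting and coloring of the object list can be sketched as below. The sketch assumes that each object is available as a list of its scalar values (positive, as for vorticity magnitude) and that the color is the local maximum normalized by the largest maximum in the dataset; both the normalization and the name `rank_objects` are illustrative assumptions.

```python
def rank_objects(object_values):
    """Sort objects by their local maximum, largest first.

    object_values[k] holds the scalar values of object k.
    Returns a list of (object_id, local_maximum, color), where color is
    the local maximum divided by the largest maximum (a value in (0, 1]).
    """
    maxima = [max(values) for values in object_values]
    global_max = max(maxima)
    order = sorted(range(len(maxima)), key=lambda k: maxima[k], reverse=True)
    return [(k, maxima[k], maxima[k] / global_max) for k in order]


ranking = rank_objects([[3.0, 2.5], [9.0, 4.0, 8.0], [5.0]])
predominant = ranking[0][0]   # identifier of the object with the largest maximum
```

The first entry of the ranking plays the role of the predominant object that is singled out for closer examination.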
The large dataset produced on the supercomputer can be accessed more
easily when it still resides in that environment. In particular, for
the CM5, parallel I/O can be used via the SDA (Scalable Disk Array),
which provides a capacity of 25-200 Gigabytes that can be accessed at
33-254 Megabytes/second. It is possible to use the CMAVS/AVS interface
to access interactively the CM5 resources through the
workstation. Using this procedure, we easily read the data and hold
it in memory. After this process is
accomplished, a data reduction process is necessary to transfer the
data through the network. Options tried by different researchers
include the computation of geometries, which are passed to the
workstation for displaying [20]. In some other cases the
post-processing is extended to the production of the rendering (the 2D
pixel map) on the supercomputer, which may have a smaller volume of
data than the actual geometric objects forming an isosurface (for
example). In our case, we extract an interactively selected cubic
sub-domain, which is transferred through the network to the
workstation. From the workstation, the user can change the
size and the position of the extracted sub-domain in order to
browse through the data. It is possible to work in this mode
interactively using the whole CM5 (1024 nodes at ACL-LANL), as
has been done by some ACL researchers in special
circumstances. Nevertheless, in practical situations, it may be
difficult to obtain more than 128 nodes to work interactively.
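The interactive window can be pictured with the following sketch, which cuts a cubic sub-domain out of a larger array. It is only an illustration on a single machine: the remote module and the network transfer are not modeled, the function name and parameters are hypothetical, and the choice to shift the window so that it stays inside the domain (rather than wrapping periodically) is an assumption.

```python
import numpy as np


def extract_subdomain(field, center, size):
    """Return a cubic sub-domain of edge `size` around `center`, plus its origin.

    The window is shifted where necessary so that it stays inside the field.
    """
    size = min(size, *field.shape)
    slices = []
    for c, n in zip(center, field.shape):
        start = int(c) - size // 2
        start = max(0, min(start, n - size))   # clamp to the domain
        slices.append(slice(start, start + size))
    origin = tuple(s.start for s in slices)
    return field[tuple(slices)], origin


field = np.arange(8 ** 3, dtype=float).reshape(8, 8, 8)
window, origin = extract_subdomain(field, center=(7, 0, 4), size=4)
fraction = window.size / field.size   # share of the data that would be transferred
```

Changing `center` or `size` and calling the function again corresponds to the user moving or resizing the window while browsing.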
Another important factor for interactivity is the time spent
transferring data between the CM5 and the workstation. For the
sub-domains we extract, we find that the transfer time on a
local Ethernet network between the CM5 and the Onyx machine (at
ACL-LANL) is very acceptable. The same case running on the CM5 at NCSA
(Illinois) and the Vizlab Onyx machine (New Jersey) takes a few more
seconds, but is still acceptable. Researchers at ACL are able to send
considerably larger amounts of data by using the HIPPI network.
In our last approach the thresholding operation is performed on the CM5
in batch mode. The thresholded points are marked, enumerated and
then transferred to the output arrays by scatter-gather
operations. The output file of this program contains position and the
scalar value of the thresholded points. The reduced datasets obtained
in this way may also store other information like strain-rate and
vortex stretching magnitude. The thresholded points have the
appearance of a set of scattered points. The modules developed to
process this type of information include object segmentation
algorithms and the diagnostics box. Non-standard data types for
data-flow between the modules are introduced to handle the new formats
of data (scatter points and lists of objects) produced by the data
reduction and object identification processes. Rendering is achieved
using standard modules. The first output of the modules is a list of
``interesting'' objects to be examined. The selection criterion is
prescribed interactively by the user, but the search is performed
automatically by the computer. The second output is a set of
``ellipsoids'' representing the objects in the list, colored according
to the local maximum inside them. It is also possible to display the
``filtered'' domain by using spheres with sizes, colors and
transparencies proportional to the magnitude of the scalar field for
each of the thresholded points (see the corresponding figure). The
user is able to visually browse through the list of objects via the
diagnostics box (see the corresponding figure). Different variables
can be examined simultaneously. In this way a field can be visualized
(e.g. vorticity magnitude) according to the important objects found in
a related field (e.g. strain-rate magnitude), which is useful for
establishing correlations. The reduced object representations or
ellipsoids are also used as pointers to local maxima. The ellipsoids
turn out to be very effective release regions for vector line tracers,
which can be selected interactively by the user.
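The mark, enumerate and scatter sequence used to build the reduced dataset can be sketched serially as follows. This is an analogue written for illustration, not the CM5 code: the enumeration is done with a prefix sum, the loop stands in for the data-parallel scatter (its iterations are independent), and the function name is hypothetical.

```python
import numpy as np


def compact_thresholded(field, threshold):
    """Return positions (N, 3) and values (N,) of the points above threshold."""
    flat = field.ravel()
    marked = flat > threshold              # 1. mark the thresholded points
    slot = np.cumsum(marked) - 1           # 2. enumerate: output slot of each marked point
    count = int(marked.sum())

    flat_index = np.empty(count, dtype=np.intp)
    values = np.empty(count, dtype=flat.dtype)
    for i in range(flat.size):             # 3. scatter into the output arrays
        if marked[i]:
            flat_index[slot[i]] = i
            values[slot[i]] = flat[i]

    positions = np.column_stack(np.unravel_index(flat_index, field.shape))
    return positions, values


field = np.array([[[0.1, 2.0], [3.0, 0.2]],
                  [[0.0, 0.0], [5.0, 0.3]]])
positions, values = compact_thresholded(field, threshold=1.0)
```

Further per-point quantities, such as strain-rate magnitude, could be gathered into additional output arrays with the same `slot` indices.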
The region containing the maximum vorticity magnitude in the
turbulence dataset is examined in the accompanying figure. The
vorticity magnitude isosurface, at a threshold of 30% of the maximum,
shows the topology of this region. The ellipsoids, fitted at a
threshold of 45% of the maximum, mark the local maxima regions inside
the objects. The vector lines trace the vorticity field associated with
the objects. The color of the lines indicates the direction of
the vorticity field (vorticity ``flows'' from blue to red). Bundles
of lines are released from the three ellipsoids. The lines show that
the tube at the center of the picture appears to be formed by two
parallel tubes winding around each other, which bifurcate in the upper
right corner region. The isosurface in the lower pictures corresponds to
the magnitude of the strain-rate. It can be observed that the
strain-rate maxima are not located at the same positions as the
vorticity maxima, a result that has been observed in other turbulence
simulations [21] and identified as a fundamental feature in
our vortex filament models. In the corresponding figure, we present
vorticity and strain-rate magnitude fields for the object ranked
12th according to vorticity magnitude in the
dataset. The vorticity in the two parallel vortex tubes is of opposing
sign; nevertheless, tracking this object in time is necessary to
determine whether this is a case of vortex collapse
[7].
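The vector lines released from the ellipsoids can be sketched as follows. The integration scheme is not specified in the text, so this example assumes a fixed-step midpoint (second-order Runge-Kutta) integration along the normalized vector field, with the field supplied as a callable `vector_at(position)`; in practice this would be an interpolation of the vorticity field. All names are hypothetical.

```python
import numpy as np


def trace_vector_line(vector_at, seed, step=0.1, n_steps=100):
    """Trace a line tangent to a vector field, starting at seed.

    Returns the points of the line and a color parameter that runs from
    0 at the seed to 1 at the end, indicating the direction of the field.
    """
    def direction(p):
        v = np.asarray(vector_at(p), dtype=float)
        norm = np.linalg.norm(v)
        return v / norm if norm > 0.0 else None

    points = [np.asarray(seed, dtype=float)]
    for _ in range(n_steps):
        p = points[-1]
        d_start = direction(p)
        if d_start is None:          # the field vanishes: stop the line
            break
        d_mid = direction(p + 0.5 * step * d_start)
        if d_mid is None:
            break
        points.append(p + step * d_mid)

    line = np.array(points)
    color = np.linspace(0.0, 1.0, len(line))
    return line, color


# Test field: rigid rotation about the z axis, whose vector lines are circles.
def rotation(p):
    return (-p[1], p[0], 0.0)


line, color = trace_vector_line(rotation, seed=(1.0, 0.0, 0.0))
radii = np.hypot(line[:, 0], line[:, 1])
```

Seeding several such lines at points inside an ellipsoid, for example at its centroid and along its axes, would produce a bundle of the kind described above.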