Associative Mining of Scientific Data


 This project is supported by the National Science Foundation, ITR program.

Summary:

The goal of this project is to develop a methodology and a set of prototype tools to enable “associative” mining of very large (archived and runtime) scientific datasets. We will use content, such as solution features, patterns, and shapes, to mine the datasets and to isolate and retrieve the required information. This is in contrast to current approaches, which use index-based coordinates (e.g., i, j, k, or some run-id) and timesteps to retrieve information. The concept is based on content-based associative memories.

Our interest is in the questions that scientists typically ask when analyzing time-dependent datasets in which evolving features are present. These questions include:

1. Have I seen this evolution before?
2. Is it “similar” to any experimental observation?
3. Can I quantify the “similarity”?

Using conventional analysis tools, these questions can only be answered (painstakingly) by walking through the entire database. With the size of datasets quickly approaching many petabytes (1000 nodes, a 1024^3 grid, 1000 timesteps, and many variables), analyzing this information can be overwhelming.
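
To give a feel for these sizes, the following is a rough back-of-the-envelope estimate for a single run on a cubic grid with 1024 points per side and 1000 timesteps. The function name `dataset_bytes`, the choice of 10 variables, and 8 bytes per value (double precision) are illustrative assumptions, not figures from the project.

```python
def dataset_bytes(grid_side, timesteps, variables, bytes_per_value=8):
    """Raw size in bytes of a time-varying dataset on a cubic grid."""
    points = grid_side ** 3  # number of grid points in one timestep
    return points * timesteps * variables * bytes_per_value


# Assumed: 10 variables stored as 8-byte floating-point values.
size = dataset_bytes(grid_side=1024, timesteps=1000, variables=10)
print(f"{size / 1024**4:.1f} TiB")  # prints "78.1 TiB"
```

Under these assumptions a single run already occupies tens of terabytes, so an archive holding many runs can plausibly reach the petabyte range.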

Our aim is to develop a set of enabling technologies to assist in the analysis of scientific datasets by providing tools to answer the questions posed above. The tools will operate on distributed time-varying data and will act as a template for other methods.

Specific research objectives of this proposal include developing distributed multi-resolution techniques:

(1) for cataloging interesting phenomena connected in space and time; and
(2) for searching both run-time and existing databases for interesting phenomena.
 
An overview of the environment is shown below:

[Figure: overview of the environment]

Visualization Requirements:

What is needed are visualization, quantification, and querying techniques to help filter and reduce the data to a form more conducive to analysis. This is complementary to standard visualization and helps explain, in more mathematical and quantitative detail, what is being seen. We also show how this quantification information can be used to improve the visualization. Our goal is to help the scientist understand the simulation in the context of all the simulations previously computed. This will enable the scientist to develop new theories or improve existing mathematical theories to explain the observed phenomena. To be useful, the visualization and cataloging techniques must be interactive.

Since the datasets are so large, the only way to compute interactively is to run on a distributed platform where the data resides and/or while the data is being computed. Therefore, the two goals of this proposal are: (1) to create a set of novel visualization-querying techniques that allow users to ask questions about how phenomena interrelate between datasets; and (2) to implement these methods on a distributed platform. Toward these goals, we have been working on the following methodologies:

(1) Distributed feature extraction and tracking: This part of the project is concerned with implementing a parallel/distributed version of the feature extraction and tracking algorithms to enable run-time querying of features. The algorithm is being implemented within GrACe, and querying is made possible through the DISCOVER portal.

(2) Algorithms to determine similarities between features: This part of the project involves skeletonizing 3D objects and computing and comparing similarities between the resulting skeletons. Some skeletonization code is available here. A paper describing this work is available on the Publications page.
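
As an illustration of the feature extraction and tracking described in item (1), the following is a minimal serial sketch. It is not the project's parallel GrACe implementation. It assumes that a feature is a 6-connected region of grid points at or above a threshold, and that features in consecutive timesteps correspond when they share voxels. Both are common choices but are assumptions here, and the names `extract_features` and `track` are hypothetical.

```python
from collections import deque


def extract_features(field, threshold):
    """Return a list of features; each feature is a set of (i, j, k) voxels.

    `field` maps (i, j, k) -> scalar value. A feature is assumed to be a
    6-connected region of voxels whose value is at or above `threshold`.
    """
    active = {p for p, v in field.items() if v >= threshold}
    features = []
    while active:
        seed = active.pop()
        region, queue = {seed}, deque([seed])
        while queue:  # breadth-first flood fill from the seed voxel
            i, j, k = queue.popleft()
            neighbors = ((i + 1, j, k), (i - 1, j, k), (i, j + 1, k),
                         (i, j - 1, k), (i, j, k + 1), (i, j, k - 1))
            for n in neighbors:
                if n in active:
                    active.remove(n)
                    region.add(n)
                    queue.append(n)
        features.append(region)
    return features


def track(prev_features, next_features):
    """Match features across two timesteps by voxel overlap (an assumption).

    Returns a list of (prev_index, next_index, shared_voxel_count).
    """
    matches = []
    for a, fa in enumerate(prev_features):
        for b, fb in enumerate(next_features):
            shared = len(fa & fb)
            if shared:
                matches.append((a, b, shared))
    return matches


# Two tiny made-up timesteps: one feature moves along i, one disappears.
t0 = {(0, 0, 0): 1.0, (1, 0, 0): 0.9, (5, 5, 5): 0.8, (3, 3, 3): 0.1}
t1 = {(1, 0, 0): 1.0, (2, 0, 0): 0.9, (3, 3, 3): 0.2}
f0 = extract_features(t0, threshold=0.5)
f1 = extract_features(t1, threshold=0.5)
print(track(f0, f1))
```

A distributed version would additionally have to merge features that cross processor boundaries, which this sketch does not attempt.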
 

More information will be available soon.