Project Summary
Objective:
The objective of this project is to develop distributed indexing schemes
for P2P information discovery systems that can support scalable keywords
searches using partial keywords and wildcards, while providing search
guarantees. Potential applications include document sharing systems,
associative search systems (e.g. a globally distributed
collection of bioinformatics databases), interest-based P2P collaboration
systems, resource and service discovery mechanisms in Computational Grids
(e.g. to enhance them with range queries), and Web Service Discovery.
Squid can also be used as a content-based routing infrastructure.
Specifically we address two key issues, storing and retrieving data based
on its content and scalable and efficient keyword searching.
Motivation:
The recent years have seen an increasing interest in the Peer-to-Peer
(P2P) computing paradigm where entities at the edges of the network can
directly interact as equals (or peers) and share information, services and
resources without centralized servers. While the P2P architecture is not
new (the original Internet was conceived as a P2P system), the
proliferation of the World Wide Web and advances in computation and
communication technologies have enabled the development and deployment of
a large number of P2P applications. These applications take advantage of
resources such as storage, computing cycles and content available at users
machines at the edges of the Internet. Properties such as
decentralization, self-organization, dynamicity and fault-tolerance make
them naturally scalable and attractive solutions for applications in all
areas of distributed computing and communication. Usenet and DNS represent
an earlier generation of successful applications with P2P characteristics.
More recent P2P applications include content sharing and distribution
systems such as Napster (hybrid P2P) and Gnutella, communication
and collaboration systems such as Jabber, Groove and ICQ,
systems for anonymity such as Freenet, embarrassingly parallel
computing such as SETI@home and P2P resource discovery
systems.
An important area where P2P technologies continue to be successfully applied
is information storage and retrieval. P2P storage systems enable seamless
sharing of large volumes of information and data located at the edges of
the network. The key strengths of the P2P architecture include the ability
to harness peer resources, self-organization and adaptation, inherent
redundancy and fault-tolerance, and increased availability, making it an
attractive solution for this class of applications.
Approach:
The overall objective of this project is to design, develop and deploy a P2P storage system that support
efficient, scalable keyword search systems with guarantees, that is, the system guarantees that all
existing data elements matching a query will be found with reasonable costs in terms of number of
messages and number of nodes involved. Queries may be complex and contain wildcards. The proposed system
primarily targets storage and archival systems that focus on persistence and availability.
In our
approach, all data elements are described using a sequence of keywords, which are common words. Keywords
do not have to be statically or globally defined. These keywords can be viewed to form a multidimensional
keyword space, where data elements are points in the space and the keywords are the coordinates. Two data
elements are local if their keywords are lexicographically close or they have common keywords. Our goal
then, is to map data elements that are local in this multi-dimensional index space to indices that are
local in the 1-dimensional index space, which is then mapped to the same peer or to peers that are close
together in the overlay network. We derive such a mapping using locality-preserving mapping called
Space
Filling Curves (SFC). SFCs are a family of recursive and self-similar mappings from N-dimensional
space to 1-dimensional space having the property that points close together in the 1- dimensional space
come from points close together in the N-dimensional space. Using this hierarchical keyword space and the
locality-preserving mapping, we can effectively support keyword searches with guarantees.
In a prototype
implementation, we have used the Hilbert Space
Filling Curve, to map data elements to the index
space. The underlying overlay network topology in this prototype was deterministic and structured and was
based on the Chord
system.
Figure 1: (a) A 2-dimensional keyword space. The data element
"Document" is described by keywords "Computer" and "Network"; (b) Mapping
the 2-dimensional space to a curve. The query (011, *) defines clusters on
the curve (segments); (c) Recursive refinement of query (011, *) viewed as
a tree. Each node is a cluster, and the bold characters are the cluster's
prefixes; (d) Solving the query: embedding the leftmost tree path (solid
arrows) and the rightmost path (dashed arrows) onto the overlay network
topology.
Unlike the
consistent hashing mechanisms, SFC does not necessarily result in a uniform distribution of data elements
in the index space certain keywords may be more popular and hence the associated index space will be more
densely populated. We will investigate load-balancing techniques to address this. For example, in our
prototype we investigated two load-balancing techniques the first applicable at node join (a node is
directed to join in a dense part of the index space) and the second at runtime (nodes exchange
information about their load and the more crowded ones pass a part of their load to the less crowded
nodes).
Optimization techniques are critical
to the performance and availability of the storage system. Caching techniques can be used to minimize
query retrieval time. For example, partial query results can be stored at intermediate nodes that are
traversed by the query response messages. As a result, similar queries can be quickly handled. As the
overall system and the stored data is extremely dynamic, cached information has limited validity. The
size and lifetime of these caches will be investigated. Furthermore, we will investigate replication
techniques to improve the system availability, as well as the system's tolerance to failure. One approach
is to use parallel universes or realities that is the same index spaces mapped in multiple ways to the
peer nodes. Other issues that will investigate include fault tolerance and system resilience, ranking
techniques to prioritize the search results based on their relevance to the user, reputation and trust to
enable detection and handling of malicious peers corrupt/tampered data.
Applications:
P2P Collaboratory for Tissue Microarray Research
Meteor -
A Programming Framework for Content-based Decoupled Interaction
in Pervasive Environments
Status:
The system was evaluated using simulations that implement the SFC-based
mapping, the underlying overlay network, the load-balancing mechanisms,
and the query engine with query optimizations. The simulations are
currently augmented with techniques to further optimize the search,
especially for high-dimension keyword spaces.
The current implementation of Squid builds on JXTA, a general-purpose P2P framework.
People:
Cristina Schmidt
Manish Parashar
Publications:
C. Schmidt and M. Parashar, Analyzing the Search Characteristics of
Space Filling Curve-based Indexing within the Squid P2P Data Discovery System,
- Technical Report TR-276, CAIP, Rutgers University, December 2004
[PDF]
C. Schmidt, M. Parashar, W. Chen and D. Foran, Engineering A Peer-to-Peer Collaboratory for Tissue Microarray Research,
- in Proceedings of CLADE Workshop, at HPDC 13th, June 2004
[HTML]
N. Jiang, C. Schmidt, V. Matossian and M. Parashar, Enabling Applications in Sensor-based Pervasive Environments,
- BaseNets 2004, San Jose, CA, USA October 25, 2004
[PDF]
C. Schmidt and M. Parashar, Enabling Flexible Queries with Guarantees in P2P Systems,
- Internet Computing Journal, Vol. 8, No. 3, May/June 2004
[HTML]
C. Schmidt and M. Parashar, A Peer-to-Peer Approach to Web Service
Discovery, - World Wide Web Journal, Vol. 7, Issue 2, June 2004.
[HTML]
M. Agarwal, V. Bhat, Z. Li, H. Liu, B. Khargharia, V.
Matossian, V. Putty, C. Schmidt, G. Zhang, S. Hariri and M.
Parashar, AutoMate: Enabling Autonomic Applications on the
Grid - in Proceedings of the Autonomic Computing Workshop (AMS2003)
[PDF]
C. Schmidt and M. Parashar, A Peer-to-Peer Approach to Web Service
Discovery, - CAIP Technical Report TR-271:
[PDF]
C. Schmidt and M. Parashar, Flexible Information Discovery in Decentralized Distributed Systems,
- In Proceedings of HPDC-12:
[PDF], [Talk]