Squid: Enabling Flexible Querying with Guarantees in P2P Storage Systems

People      Publications


Project Summary

Objective:

The objective of this project is to develop distributed indexing schemes for P2P information discovery systems that can support scalable keywords searches using partial keywords and wildcards, while providing search guarantees. Potential applications include document sharing systems, associative search systems (e.g. a globally distributed collection of bioinformatics databases), interest-based P2P collaboration systems, resource and service discovery mechanisms in Computational Grids (e.g. to enhance them with range queries), and Web Service Discovery. Squid can also be used as a content-based routing infrastructure. Specifically we address two key issues, storing and retrieving data based on its content and scalable and efficient keyword searching.

Motivation:

The recent years have seen an increasing interest in the Peer-to-Peer (P2P) computing paradigm where entities at the edges of the network can directly interact as equals (or peers) and share information, services and resources without centralized servers. While the P2P architecture is not new (the original Internet was conceived as a P2P system), the proliferation of the World Wide Web and advances in computation and communication technologies have enabled the development and deployment of a large number of P2P applications. These applications take advantage of resources such as storage, computing cycles and content available at users machines at the edges of the Internet. Properties such as decentralization, self-organization, dynamicity and fault-tolerance make them naturally scalable and attractive solutions for applications in all areas of distributed computing and communication. Usenet and DNS represent an earlier generation of successful applications with P2P characteristics. More recent P2P applications include content sharing and distribution systems such as Napster (hybrid P2P) and Gnutella, communication and collaboration systems such as Jabber, Groove and ICQ, systems for anonymity such as Freenet, embarrassingly parallel computing such as SETI@home and P2P resource discovery systems.

An important area where P2P technologies continue to be successfully applied is information storage and retrieval. P2P storage systems enable seamless sharing of large volumes of information and data located at the edges of the network. The key strengths of the P2P architecture include the ability to harness peer resources, self-organization and adaptation, inherent redundancy and fault-tolerance, and increased availability, making it an attractive solution for this class of applications.

Approach:

The overall objective of this project is to design, develop and deploy a P2P storage system that support efficient, scalable keyword search systems with guarantees, that is, the system guarantees that all existing data elements matching a query will be found with reasonable costs in terms of number of messages and number of nodes involved. Queries may be complex and contain wildcards. The proposed system primarily targets storage and archival systems that focus on persistence and availability.

In our approach, all data elements are described using a sequence of keywords, which are common words. Keywords do not have to be statically or globally defined. These keywords can be viewed to form a multidimensional keyword space, where data elements are points in the space and the keywords are the coordinates. Two data elements are local if their keywords are lexicographically close or they have common keywords. Our goal then, is to map data elements that are local in this multi-dimensional index space to indices that are local in the 1-dimensional index space, which is then mapped to the same peer or to peers that are close together in the overlay network. We derive such a mapping using locality-preserving mapping called Space Filling Curves (SFC). SFCs are a family of recursive and self-similar mappings from N-dimensional space to 1-dimensional space having the property that points close together in the 1- dimensional space come from points close together in the N-dimensional space. Using this hierarchical keyword space and the locality-preserving mapping, we can effectively support keyword searches with guarantees.

In a prototype implementation, we have used the Hilbert Space Filling Curve, to map data elements to the index space. The underlying overlay network topology in this prototype was deterministic and structured and was based on the Chord system.



Figure 1: (a) A 2-dimensional keyword space. The data element "Document" is described by keywords "Computer" and "Network"; (b) Mapping the 2-dimensional space to a curve. The query (011, *) defines clusters on the curve (segments); (c) Recursive refinement of query (011, *) viewed as a tree. Each node is a cluster, and the bold characters are the cluster's prefixes; (d) Solving the query: embedding the leftmost tree path (solid arrows) and the rightmost path (dashed arrows) onto the overlay network topology.

Unlike the consistent hashing mechanisms, SFC does not necessarily result in a uniform distribution of data elements in the index space certain keywords may be more popular and hence the associated index space will be more densely populated. We will investigate load-balancing techniques to address this. For example, in our prototype we investigated two load-balancing techniques the first applicable at node join (a node is directed to join in a dense part of the index space) and the second at runtime (nodes exchange information about their load and the more crowded ones pass a part of their load to the less crowded nodes).

Optimization techniques are critical to the performance and availability of the storage system. Caching techniques can be used to minimize query retrieval time. For example, partial query results can be stored at intermediate nodes that are traversed by the query response messages. As a result, similar queries can be quickly handled. As the overall system and the stored data is extremely dynamic, cached information has limited validity. The size and lifetime of these caches will be investigated. Furthermore, we will investigate replication techniques to improve the system availability, as well as the system's tolerance to failure. One approach is to use parallel universes or realities that is the same index spaces mapped in multiple ways to the peer nodes. Other issues that will investigate include fault tolerance and system resilience, ranking techniques to prioritize the search results based on their relevance to the user, reputation and trust to enable detection and handling of malicious peers corrupt/tampered data.

Applications:

P2P Collaboratory for Tissue Microarray Research
Meteor - A Programming Framework for Content-based Decoupled Interaction in Pervasive Environments

Status:

The system was evaluated using simulations that implement the SFC-based mapping, the underlying overlay network, the load-balancing mechanisms, and the query engine with query optimizations. The simulations are currently augmented with techniques to further optimize the search, especially for high-dimension keyword spaces.

The current implementation of Squid builds on JXTA, a general-purpose P2P framework.

People:

Cristina Schmidt
Manish Parashar


Publications:

  • C. Schmidt and M. Parashar, Analyzing the Search Characteristics of Space Filling Curve-based Indexing within the Squid P2P Data Discovery System, - Technical Report TR-276, CAIP, Rutgers University, December 2004 [PDF]
  • C. Schmidt, M. Parashar, W. Chen and D. Foran, Engineering A Peer-to-Peer Collaboratory for Tissue Microarray Research, - in Proceedings of CLADE Workshop, at HPDC 13th, June 2004 [HTML]
  • N. Jiang, C. Schmidt, V. Matossian and M. Parashar, Enabling Applications in Sensor-based Pervasive Environments, - BaseNets 2004, San Jose, CA, USA October 25, 2004 [PDF]
  • C. Schmidt and M. Parashar, Enabling Flexible Queries with Guarantees in P2P Systems, - Internet Computing Journal, Vol. 8, No. 3, May/June 2004 [HTML]
  • C. Schmidt and M. Parashar, A Peer-to-Peer Approach to Web Service Discovery, - World Wide Web Journal, Vol. 7, Issue 2, June 2004. [HTML]
  • M. Agarwal, V. Bhat, Z. Li, H. Liu, B. Khargharia, V. Matossian, V. Putty, C. Schmidt, G. Zhang, S. Hariri and M. Parashar, AutoMate: Enabling Autonomic Applications on the Grid - in Proceedings of the Autonomic Computing Workshop (AMS2003) [PDF]
  • C. Schmidt and M. Parashar, A Peer-to-Peer Approach to Web Service Discovery, - CAIP Technical Report TR-271: [PDF]
  • C. Schmidt and M. Parashar, Flexible Information Discovery in Decentralized Distributed Systems, - In Proceedings of HPDC-12: [PDF], [Talk]



  • top