A Framework for Opportunistic Cluster Computing using JavaSpaces

 

Participants

 

Manish Parashar,

TASSL, Rutgers University

parashar@caip.rutgers.edu

 

Jyoti Batheja

TASSL, Rutgers University

jbatheja@caip.rutgers.edu

 

Introduction

Objective

The objective of this work is to provide large amounts of processing capacity over extended periods of time by harnessing the idle and available resources on the network in an “opportunistic” manner. In this project we present the design, implementation and evaluation of a framework that uses JavaSpaces to support this type of opportunistic (adaptive) parallel/distributed computing in a non-intrusive manner over networked clusters.

 

The framework targets applications exhibiting coarse grained parallelism and has three key features: 1) portability across heterogeneous platforms, 2) minimized overheads for configuration of participating nodes, and 3) automated monitoring of system state (using SNMP) to ensure non-intrusive behavior.

 

Traditional High Performance Computing (HPC) is based on massively parallel processors, supercomputers or high-end workstation clusters connected by high-speed networks. Most of these resources are expensive and the processing is transactional in nature. Utilizing available idle resources on a networked cluster provides a more cost effective option. However, there are challenges that must be addressed to make it a viable option. These include:

 

       To summarize, loosely coupled cycle stealing clusters warrants an adaptive parallel-computing model to account for the heterogeneity and dynamism of the cluster environments. Using standardized technologies such as Java that provides a sandbox model and JavaSpaces that provides remote communication mechanisms facilitates a model for deployment and execution of parallel code to remote hosts that addresses the issues listed above. More importantly this option leverages existing investments in computational resources, thereby utilizing these resources to the advantage of an enterprise. The framework presented uses such an approach to enable opportunistic cluster computing. It builds on JavaSpaces technology and facilitates a global deployment of the parallel execution code across all nodes participating in the computation. The deployment task is generic and designed to minimize the costs associated with configuration and management. Furthermore, the framework provides a sustained monitoring of System State using SNMP (Simple Network Management Protocol) [16][17] to enable the parallel process to react to changing environment variables that may render the system unavailable for computation. This minimizes the intrusiveness on the individual machines.

 

Experimental results presented demonstrate that for applications that can be broken into manageable components, such an opportunistic adaptive parallel-computing framework can provide performance gains. Furthermore, the results show that monitoring and reacting to System State enables us to minimize intrusiveness to the machines in the cluster.

 

Related Work

This section focuses on related work for adaptive cluster and web based computing. Existing web based computing systems such as Charlotte [5] implement distributed shared memory over the Java Virtual Machine and provides support for fine-grained distribution of the computing task. The computing task is downloaded and run as an applet on the client machine in this system. Charlotte has been ported to work within the Knitting Factory Infrastructure [6] that provides a Jini like framework for discovering idle resources and facilitates inter-applet communication. ATLAS [7] employs Java and Cilk technology to implement a hierarchical work stealing approach. Such systems are suited to tree based computations and provide reliable fault tolerance mechanisms. ParaWeb [8] centered towards solving coarse-grained problems provides two separate models. One model provides a runtime system (Java Parallel Runtime System) that requires modifying the Java Interpreter to provide global shared memory and transparent thread management mechanism across remote machines. The other model provides sets of libraries (Java Parallel Class Library) that facilitate message passing and thread management across remote machines. Javelin [9] implements a global computing infrastructure involving three entities: a broker that matches resource requirements of the clients to availability of hosts, clients that request computing resources by registering with brokers; and hosts that perform the computation via downloadable Java applets. Other systems, such as CARMI [10], employ the master-worker paradigm and rely on the job management API provided by PVM [11]. Cluster computing systems such as Condor [12] rely on queuing mechanisms to schedule jobs over heterogeneous resources of networks of workstations. The Piranha [13][14] model most closely resembles our model in that it uses the Linda [2] technology to support tuple spaces for scheduling work among processors.  Thus the systems achieve adaptation in one of two ways: one; providing a hierarchical model for work stealing such as Javelin and two; providing a centralized distribution of work using the master worker paradigm across shared space such as Charlotte and Piranha.

 

Most of the systems that exploit adaptive parallelism require manual management – for example they present the user with a graphical interface for stopping background tasks at a worker when the machine is no longer unavailable. Such event driven interfaces at the application layer are easy to implement on the worker machines. In the framework presented in this project, we attempt to automate this decision-making by monitoring and using system parameters. A dedicated network management module identifies and monitors system state parameters (using SNMP) to track resource availability. Given the heterogeneous clients involved, we felt the need to isolate this system state abstraction into the network management module and utilize SNMP services for system monitoring. We observed that standardized services such as SNMP are mature enough and make more information available in terms of statistics, monitoring and reliable messaging.

 

Furthermore, in the presented framework, we actively address configuration management issues. In our system, the worker nodes need not be involved with the intricacies of the worker code. In order to facilitate this we provide remote node configuration mechanisms. This involves remotely loading the worker classes at runtime. Integrating this with the network management module and providing runtime signals to enable the worker to react to changing system parameters has been one of our most challenging tasks.

 

A Framework for Opportunistic Parallel Computing on Clusters

The framework presented in this project employs JavaSpaces technology to facilitate parallel computing on networked clusters. The parallel workload is distributed across the worker nodes using the bag of task model with the master inputting independent problem task into the space and the worker computing results on the tasks taken from the space. The key contributions are:

 

Targeted Applications

The presented cluster computing framework and the underlying parallel computing model supports applications having the following characteristics:

(1)    High Computational Complexity: The applications must be sufficiently complex so as to require large computational resources.

(2)   Massively Partitioned Problems: The applications must be divisible into relatively coarse-grained subtasks that can be solved individually and independently, and the final solution is built based on the results of these subtasks.

(3)   Small Input and Output Sizes: Each of the subtasks must have relatively small sizes of input and output.

 

Framework Architecture

A schematic overview of the framework architecture is shown in Figure1. It consists of 3 key components: the Client-side (Master) components, the Server-side (Worker) components and the Network Management Module.

 

 

 
 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


Figure 1. Architectural diagram

 

1. Master Module

The Master component defines the problem domain for a given application. The application domain is broken down into sub tasks that are JavaSpace enabled[1]. The JavaSpace registers as a Jini service and relies on Jini for the remote lookup during the discovery phase of the service. It also inherently handles all the low-level communication issues.

 

2. JavaSpaces for parallel computation

JavaSpaces is a Java implementation of a tuple-space system [2], and is provided as a service based on Sun's Jini technology. JavaSpaces technology provides a programming model that views applications as a collection of processes cooperating via the flow of objects into and out of one or more spaces. A space is a shared, network accessible repository for object [3]. Several aspects of distributed computing are inherently handled by the JavaSpaces technology. JavaSpaces provide associative lookup of persistent objects. It also maintains security through transactions. Using space-based implementation allows transacting executable content across the network. This decouples the semantics of distributed computing from the semantics of the problem domain. A loose coupling like this means that the two elements can be managed and developed independently [4]. This eliminates worries about multithreaded server implementation, low level synchronization issues, or network communication protocols, usual requirements of distributed application design. This design offers two key advantages: Load Balancing: The workers stay busy and compute tasks in relation to their availability and ability to do the work. Scalability: The more number of workers added to the computation improves scalability. Such distributed computing that exploits networked resources and load balances the workload leads to adaptive computing.

 

3. Worker Module

The worker component provides the solution content for the application domain. The Master and Worker implementations run as processes within the application layer. The workers need not be Jini aware in order to interact with the Master processes.  Interaction with the master process is via a virtual, shared JavaSpace. 

3.1 Remote Node Configuration Engine

The Remote Node Configuration Engine resides on the worker module and addresses the issue of configuration overhead. In an effort to minimize the configuration overhead of deploying worker code, we implement a remote node configuration mechanism. This mechanism facilitates remote loading of the worker implementation classes at runtime. The mechanism is described in greater detail in the Implementation and Operation Section with the overall operation of the framework.

3.2 Network Management Module

In order to enable non-intrusiveness while using remote networked resources, we identified the need to include an idle resource monitor that checks for idle resources and monitors their state. In the presented framework, the network-monitoring agent that runs on each of these machines performs this function. The agent is operational only when the framework is operation and is dormant at all other times. Once the worker hosts are identified using Jini, they are registered with a Inference Engine module. This module queries the system state of the identified worker and dynamically manages the scheduling of tasks to worker tasks using this state information. Task scheduling management is driven by policy defined within the Rule Base. This transactional layer on the workers forms the basis of our rule base management protocol. The two entities participating in this protocol are the worker host and the network manager. The network manager manages all the workers registered with the inference engine. It is responsible for looking up available resource on the registered worker hosts, activating the worker processes, and managing worker thread priorities on the hosts.

Implementation and Operation

The framework implements the master-worker pattern with JavaSpaces as the backbone. The entire software has been developed in Java to facilitate portability across heterogeneous platforms and to leverage the write once run anywhere feature. We have employed our own implementation of the SNMP agent to obtain system CPU usage values needed for network management. Enabling Java access to C system calls providing CPU utilization information required an additional layer of Java Native Interface (JNI). An overview of the JavaSpaces implementation and its operation is shown in Figure 2. The 3 components are explained below.

 

1 Master Module

The Master iterates through all the tasks that need to be computed, creating a task entry for each and writing it into the space. This is the task-planning phase. The master then iterates again, this time removing the result for each task from the space and combining them into some meaningful form. This is the result aggregation phase. Our experience with JavaSpaces shows that these two phases can be relatively time consuming and are dependent upon the size of data that is transferred in and out of the space.

2 Worker Module

The worker repeatedly removes a task from the space, computes it, and writes the result of the computation back into the space. The result is later read and aggregated by the master process. When taking or reading objects, processes use simple value-matching look up to find the objects of interest in the space. If a matching object isn’t found immediately the process waits until one arrives. Matchmaking in JavaSpaces is achieved by referring to each object using a unique Task ID and the space within which it resides.

Figure 2.  Master Worker Paradigm within our Framework

Text Box:
 


2.1 Remote Node Configuration Engine

The required classes for remote node configuration of the worker nodes are easily downloadable from the web server residing at the master in the form of executable jar files. The worker implementation classes are loaded at runtime from within the base configuration classes, and the appropriate method to start the worker application thread is invoked. Our modification of the network launcher [15] provides mechanisms to intercept calls from the inference engine and interpret them as signals to the executing worker code. This interaction is explained in the Section describing the operation of the rule-based protocol.

2.2 Network Management Module

The Network Management Module serves two functions, namely; monitoring the worker machines for system state parameter and devising a decision-making mechanism to facilitate non-intrusive code execution. In our current implementation, we monitor CPU usage and study the load patterns from successive runs of the JavaSpace implementation. This information is used to establish triggering thresholds. If the CPU usage exceeds the established threshold, the currently executing task is completed and its results are returned to the space, and no other task assigned to the load worker. This mechanism provides smooth clean up and synchronization.

2.3 Operation of the Rule Base Protocol

The Rule Base Protocol defines the interaction between the Network Management Module and the Worker Module (see Figure 3). The protocol is implemented using Java sockets. It operation is as follows:

The SNMP Client on the worker module initiates participation into the parallel computation by registering with the SNMP Server. The server invokes the SNMP service as well as communicates with the Inference Engine. The Inference Engine maintains a list of all machines whose system parameters need to be monitored and assigns a unique ID to the new worker. Next it adds the workers IP address to the list. The SNMP server then continues to retrieve the requested system state parameter.

The SNMP value measured is the averaged value of the worker CPU utilization. As these values are returned they are added to the respective entry in the list. Based on this return value and the threshold ranges established the Inference Engine makes a decision and passes an appropriate signal back to the worker. Threshold values for the signals are based on heuristics. The Rule Base currently encoded allows for 4 types of signals in response to the varying load conditions; namely: start, stop, pause and resume.

Text Box:

Figure 3.  Sequence Diagram for the Rule-Base Protocol

Start: This signal is sent to the worker nodes to signify that the worker node is now idle and can start the parallel processing job.  The threshold CPU load values for this state are in the range of 0% - 25%. On receiving this signal, the protocol implementation on the worker initiates a new runtime process for the actual work execution. The new thread first goes through the remote class loading phase and then starts off a worker thread for task execution.

Stop: This signal is sent to the worker nodes to indicate that the worker node can no longer be used for computations. This may be due to a sustained increase in CPU load caused by a higher priority (possibly interactive) job being executed. The cutoff threshold value for the stop state is 30%. Upon receiving the stop signal the worker gets a handle into the worker thread and sends it the stop signal. At this point the worker backs off. It first interrupts the executing worker thread and then executes the shutdown/cleanup mechanism. The shutdown mechanism ensures that the currently executing task completes and its results are written into the space. After cleanup arrives at the worker thread is killed and control returns to the parent process.

Pause: This signal is sent to the worker nodes to indicate that the worker node is experiencing increased CPU loads and is not completely idle and hence it should temporarily not be used for computation. However, the load increase might be transient and node could be reused for computation in the near future. Threshold values for the pause state are in the range of 25% - 30%.  Upon receiving this signal the worker gets a handle into the worker process and sends it the pause signal. At this point the worker is supposed to back off, but unlike the stop state the back off is temporary – until it get the resume. This minimizes overheads for transient load fluctuations at the worker. As in the stop state, the pause goes into effect only after the worker writes the results of the currently executing task into the space.

Resume: This signal is sent to the worker nodes to indicate that the worker node is once again available for computation. This state is triggered when the CPU load falls below 25%. Upon receiving this signal the worker gets a handle into the worker process notified it to resume execution. The worker thread fetches the next available task from the space and proceeds with computations.

Framework Evaluation

We evaluated our JavaSpaces-based opportunistic cluster-computing framework with a real world financial application that uses Monte Carlo (MC) simulation for Option Pricing. An option is a derivative, that is, its pricing value is derived from something else. Complications such as varying interest rates and complex contingencies can prohibit analytical computation of option and other derivative prices. Monte Carlo simulation using statistical properties of assumed random sequences is an established tool for pricing of derivative securities. An options is defined by the underlying security, the option type (call or put), the strike price and the expiration date. Additionally, there are various factors that affect the pricing of an option such as interest rate and volatility. These financial terms are explained in greater depth in [10]. For our implementation we take into account the various factors, and model the behavior of options using Monte Carlo simulations based on the Broadie and Glasserman MC algorithm.

Parallel Monte Carlo Simulation for Option Pricing

The main MC simulation based on the input parameters is the core parallel computation in our experiments. Input parameters may be defined using a GUI as provided in our implementation. The simulation domain is divided into tasks and MC simulations are performed in parallel on these tasks. The number of simulations performed can change for each task. High and low estimates are obtained over a wide range of simulations. For the experimental evaluation presented below, the simulation is divided into sub tasks of size 100 each. These tasks are then dropped into the JavaSpaces pool. Each worker process picked a sub task from the pool and performed the MC simulation on the task.

Experiments and Results

Three experiments were conduced to analyze the performance of our framework. The aim of the first experiment was to study the scalability of the application and our framework, and to demonstrate the potential advantage of using clusters for parallel computing. The second experiment measures the costs of adapting to system state. It measures the overheads of monitoring the workers, signaling, and state-transitions at the workers. We used a set of load generators to simulate dynamic load conditions at different worker nodes. Finally, the third experiment demonstrates the ability of our framework to adapt to the cluster dynamics.

 

1 Scalability Analysis

This experiment measures the overall scalability of the application and the framework using a cluster of 13 Windows NT (version 4.0) workstations. Results for this experiment are plotted in Figure 4.

 

 

 

 

 


Figure 4. Graph depicting the Scalability

As shown in the figure an initial speedup is obtained as the number of workers is increased. During this part of the curve the total parallel time closely follows the maximum worker time. As the number of workers increases the model spreads the total tasks more evenly across the available workers. Hence the maximum (Max) worker time evens out as the number of workers increases. However, after a point we notice that the total parallel time is dominated by the Task Planning time. That is the workers are able to complete the assigned task and return to the space much before the master gets a chance to plan a new task and put it into the space. Hence the workers remain starved until the task is made available. As a result the scalability deteriorates. This indicates that the framework favors coarse-grained tasks that are compute intensive. As expected the task aggregation curve closely follows the maximum worker time.

 

2 Adaptation Protocol Analysis

In this experiment, we provide a time analysis to illustrate the overhead involved in signaling worker nodes and adapting to their current CPU load.


 


Figure 5(a). Graph of CPU Usage                   Figure 5(b). Graph of Worker Reaction

Simulation Signal Triggers: Start - Stop - Restart - Pause - Resume

 

As a part of the experimental setup, we built two sets of load simulators: load simulator 1 simulated varied types of data transfers such as RTP Packets of Voice Traffic, HTTP Traffic, and Multimedia traffic over HTTP via Java sockets originating from the worker hosts. This load simulator was designed to raise the CPU usage level on the worker to 30% to 50% utilization. The second load simulator (load simulator 2) raised the CPU utilization of the worker machines to 100%.  Figure 5(a) and 5(b) depict the Worker behavior under the simulation conditions. Figure 5(a) captures the CPU usage history on the Worker host throughout the run. We identify the peaks where the worker reacts to the signals sent. The first peak at 80% CPU usage occurs when the worker is started. This sudden load increase is attributed to the remote class loading of the worker implementation. Next, load simulator 2 is started which sends the CPU usage to 100%. This causes a Stop signal to be sent to the worker node. The load simulator 2 is then stopped and load simulator 1 is started which raises the CPU load to 46%. As shown in Figure 5(b) the worker reaction times to the signal received is minimal in all cases. Further, the large overhead associated with remote class loading is avoided in the case of transient load increase at the node using the pause/resume states.

 

3 Dynamic behavior patterns under varying load conditions

Figure 6 (a). Time Measurements across 12 Workers

Figure 6 (b). Measurement of the number of tasks executed across 12 Workers

The experiment consists of three runs: In the first run none of the workers were loaded. In the second and third runs, the load simulator was run to simulate high CPU loads on 3 and 6 workers respectively. As illustrated in Figure 6(a), as the number of worker hosts being loaded increases, the total parallel computation time increases. The computational tasks that would have been executed normally are now off loaded and picked up by other executing workers. The task planning and aggregation times also increase since the master needs to wait for the worker with the maximum number of tasks to return the last task back into the space. The maximum master overhead and the maximum worker time remains the same across all three runs as expected. Figure 6b illustrates how task scheduling adapts to load current load conditions. It shows that the number of task executed by each worker depends on its current load. Load workers execute fewer tasks causing the available workers to execute larger number of tasks.

Conclusions

This project presented the design, implementation and evaluation of a framework for opportunistic parallel computing on networked clusters using JavaSpace. This framework enables coarse-grained applications to be distributed across and exploit existing heterogeneous clusters. The framework builds on Jini and JavaSpaces technologies. It provides support for global deployment, configuration management and uses an SNMP system state monitor to ensure non-intrusiveness. The experimental evaluation, using an option pricing application, shows that the framework provides good scalability for coarse-grained tasks. Further, using the system state monitor and triggering heuristics the framework can support adaptive parallelism and minimize intrusiveness.  

 

While evaluating our framework prototype, we identified a number of issues for further study. Firstly, the parallel processing problem should be free from any dependencies. The Opting Pricing model depicted that there is a significant overhead in the task planning and allocation phases. Special attention needs to be given to help reduce this overhead. Further optimization can be achieved with use of proper measurement and prioritizing tools on worker threads. We have room to facilitate this optimization in our current infrastructure. The results obtained from reaction times between the Worker and Network Management Modules are promising and prove that the overhead incurred in including adaptability features is insignificant.  As future work, we envision incorporating a distributed JavaSpaces model to avoid a single point of resource contention or failure. The Jini community is investigating this area of research.

References

[1]     Sun Microsystems. Javaspaces specification http://www.javasoft.com/products/javaspaces/specs/index.html. (1998).

[2]     The Linda Group.  http://www.cs.yale.edu/HTML/YALE/CS/Linda/linda.html

[3]     E. Freeman, S. Hupfer, K. Arnold. JavaSpaces Principles, Patterns, and Practice. (June 1999).

[4]     L. Cameron. JavaSpaces: Making Distributed Computing Easier. Byte Online Magazine. http://www.byte.com/feature/BYT19990915S0001 (20 September 1999).

[5]     A. Baratloo, M. Karaul, Z. Kedem, and P. Wyckoff. Charlotte: Metacomputing on the web. In Proceedings of the 9th conference on Parallel and Distributed Computing Systems, 1996.

[6]     A. Baratloo, M. Karaul, H. Karl, Zvi M. Kedem. An Infrastructure for Network Computing with Java Applets. in Proceedings of ACM workshop on Java for High Performance Network Computing, February 1998.

[7]     J. E. Baldeschwieler, R. D. Blumofe, and E. A. Brewer. ATLAS: An Infrastructure for Global Computing. In Proceedings of the 7th ACM SIGOPS European Workshop: Systems support for Worldwide Applications, September 1996.

[8]     T. Brecht, H. Sandhu, J. Talbot, and M. Shan. ParaWeb: Towards world-wide supercomputing. In European Symposium on Operating System Principles, October 1996.

[9]     B. Christiansen, P. Cappello, M.F. Ionescu, M. O. Neary, K. Schauser, and D. Wu. Javelin: Internet-based parallel computing using Java. In ACM 1997 Workshop on Java for Science and engineering Computation, June 1997.

[10]  J. Pruyne and M. Livny. Parallel processing on dynamic resources with CARMI. in Job Scheduling strategies for Parallel Processing-IPPS'95 Workshop Proceedings, 1995.

[11]  A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel virtual machine. MIT Press, 1994.

[12]  M. Lizkow, M. Livny, and M. Mukta. Condor: A hunter of idle workstations. In Proceedings International conference on distributed computing systems, 1998.

[13]  N. Carriero, D. Gelernter, D. Kaminsky, and J. Westbrook. Adaptive parallelism with Piranha. Technical Report 954, Yale University Department of Computer Science, 1993.

[14]  D. Gelernter and D. Kaminsky. Supercomputing out of recycled garbage: Preliminary experience with Piranha. Sixth ACM International Conference on Supercomputing, July 1991.

[15]  Michael. S. Noble. Tonic, A Java TupleSpaces Benchmark Project, http://hea-www.harvard.edu/~mnoble/tonic/doc/

[16]   SNMP Documentation, http://www.snmpinfo.com.

[17]  James Murray, Windows NT SNMP, O'Reilly Publications, January 1998.



[1] JavaSpace required the Objects being passed across the Space to be in a Serializable format. In order to transfer an entry to or from a remote space, the proxy to the remote space implementation first serializes the fields and then transmits it into the space.