vGrid Autonomic Application Infrastructure
The overarching goal of the proposed research is to enable the development of a new generation of realistic, scientific and engineering simulations on the Grid. These applications will symbiotically and opportunistically combine computations, experiments, observations, and real-time data, and will provide important insights into complex phenomena. However, the phenomena being modeled are inherently large, complex, multi-phased, multi-scale, dynamic and heterogeneous (in time, space, and state). Furthermore, their implementations involve multiple researchers with scores of models, hundreds of components and dynamic compositions and interactions between these components. The Grid infrastructure, globally aggregating large numbers of independent computing and communication resources, data stores and sensor networks, is similarly large, complex, heterogeneous and dynamic. The combination of the two results in application development, configuration and management complexities that break current paradigms based on passive components and static compositions.
Realizing these realistic simulations to harness the true power of the global Grid infrastructure and enabling revolutionary advances in science and engineering presents a new set of computational and computer science “grand challenges”. These challenges are due to (1) unprecedented scales in domain size and resolution, (2) unprecedented complexity in physical models, numerical formulations, and software implementations, (3) unprecedented heterogeneity and dynamics in time, space and state of applications and system, (4) unprecedented uncertainty and unreliability inherent in the nature of the Grid (typical execution times are much longer than the mean time between failures), and (5) unprecedented programmability, manageability and usability requirements in allocating, programming and managing very large numbers of resources. Clearly, the era of isolated groups of scientists and engineers writing monolithic simulation codes is coming to an end: the non-physics knowledge required to write the most efficient and capable codes is now too great for a single researcher to master and apply in a reasonable amount of time. In fact, we have reached a level of complexity, heterogeneity, and dynamism at which our programming environments and infrastructure are becoming brittle, unmanageable and insecure. There is a need for a fundamental change in how these applications are formulated, composed and managed.
We have selected four strategically important large scale simulations to demonstrate the impacts of our research on science and engineering: (1) Simulations of active flow control of turbulent flows and of pulsating flow in the cardio-vascular system; (2) Virtual groundwater basin model to integrate hydrological, geological, geophysical, and climatologic information for characterizing, monitoring, and forecasting movements of water and contaminants in large-scale groundwater basins (100 km in length and width, and 1000 m in depth); (3) Thermonuclear Combustion Supernovae that investigates the thermonuclear explosion of an accreting white dwarf star as a Type Ia supernova; and (4) Non-Born-Oppenheimer Molecular Quantum Mechanics calculations that describe configurational transformations involved in chemical reactions of molecules and clusters containing light nuclei and being promoted to excited electronic and ro-vibrational states.
Our multidisciplinary research team consists of physicists, mathematicians, computational scientists, computer scientists and engineers from academia (Univ. of Arizona, Rutgers Univ., Univ. of California at Santa Cruz, Univ. of Chicago, Univ. of New Mexico, and Univ. of Detroit-Mercy), industry (IBM Almaden Research, Bell Labs), a non-profit, federally funded research and development center (The Aerospace Corporation), and government research laboratories (Argonne National Lab, ASCI/ASAP Flash Center, Sandia National Lab). This team will collectively address the challenges and the applications discussed above. The key intellectual merits and contributions include:
Advanced programming and execution models, tools and environments based on autonomic, context aware, self-configuring, self-adapting and self-optimizing components, and the dynamic and opportunistic compositions of these components. These applications will be capable of effectively managing and exploiting the heterogeneity and dynamics of the applications as well as the Grid.
vGrid infrastructure that provides autonomic middleware services to support development, composition, and management of autonomic Grid applications. These will include advanced grid-services for online composition, context awareness, dynamic information injection, resource monitoring, information management, and advanced reservation that explicitly address quality of service issues and significantly enhance the Global Grid Forum Open Grid Services Architecture (OGSA). A key innovation of the vGrid is the concept of the virtual Grid infrastructure that, analogous to virtual memory, will provide the application developer with an abstraction of an execution environment that is significantly larger than the available physical environment. The vGrid autonomic runtime will reactively and proactively manage and optimize application execution using current system and application state, online predictive models for system behavior and application performance, and an agent based control network.
Broader Impact: The proposed research will enable exponentially growing Computational Grid resources to be exploited to their maximum potential to foster deeper understanding of physical phenomena needed to make progress at the cutting edge of science and engineering research. It will provide the ability to formulate, engineer and Grid-enable the models underlying today’s physics-based simulations. This will allow the components so derived to be flexibly composed in endless new configurations that can be executed with minimal user concern for how the underlying computations are being handled. Moreover, the novel concepts and tools will enable users to develop a new generation of ground-breaking scientific and engineering simulations on the Grid. The explosion in creative potential so unleashed will foster a new era in simulation-based science.
We have selected four important large scale simulations to demonstrate the profound impacts of our research on science and engineering. The active flow control simulations will make a major contribution towards understanding complex turbulent flows; turbulence is still one of the least understood areas of physics. The simulations of the cardio-vascular system will provide significant insight into pulsating flows and their dynamic interaction with flexible structures. This will enable new approaches to biomedical treatment of cardio-vascular dysfunction, which is still the biggest health concern in the nation. Active Flow Control will enhance the efficiency of power plants and reduce the drag of ground and air vehicles, and will thus contribute towards conservation of fossil fuels and reduction of air pollution (environmental impact). Advances in simulation of turbulent flame propagation will offer new insights in the study of supernovae in space as well as, on earth, in the design of highly efficient combustion processes in automobile engines. The advances in quantum mechanical simulation will provide the most accurately calculated standards ever produced in molecular physics. Large scale virtual groundwater simulations will allow coordination of information from local sources that will enable greatly improved representation and control of aquifers.
The scientific, technological and educational impact of the proposed research will extend far beyond the four scientific and engineering applications. In addition to enabling fundamental advances in the state-of-the-art of the target application domains, the proposed research will have a significant impact on computational science and engineering. The computational solutions of the proposed research will lead to fundamental innovations in the development, optimization, deployment and management of Grid applications, allowing the heterogeneity and dynamics of the applications to match those of the Grid and fully exploit its potential. These innovations will enable scientists to build high-performance, integrated end-to-end simulations that were never possible or attempted before. Furthermore, our program is based on a fundamental integration of research, instruction and outreach. As a consequence, this work will significantly impact education at all levels and will reach out to both well- and under-represented populations. We will develop an integrated educational and outreach program by replicating and extending highly successful programs developed at some of our partnering institutions. More broadly, it will provide innovative approaches for retention, recruitment, and mentorship of all students, particularly targeting undergraduates, women, and those from underrepresented communities.
The ambitious research program presented here builds on the extensive technologies and infrastructures developed by the PIs and by other researchers. The applications addressed by this proposal pose challenging computational science research problems due to their unprecedented application and system scale, complexity, heterogeneity, dynamics, unreliability, and programmability and usability requirements. Our technical approach is based on (see Figure) (1) the formulation of autonomic components that are context aware, self-configuring, self-adapting and self-optimizing, (2) the development of autonomic applications as dynamic and opportunistic compositions of autonomic components, and (3) the deployment of the vGrid infrastructure that provides autonomic middleware services on top of the emerging Grid middleware.
The proposed vGrid architecture provides application developers with a convenient abstraction of a virtual Grid that can be significantly larger and more reliable than the currently available resources. The autonomic vGrid runtime then manages physical Grid resources, allocates them “on-demand”, and spatially and temporally maps the virtual resources to these physical nodes. The mapping exploits the space, time and functional heterogeneity of the simulations and underlying numerical methods to define application “working-sets”. vGrid infrastructure services are responsible for collecting and characterizing the operational, functional and control aspects of the application and using this information to define autonomic components, decomposing the application into Natural Regions (NRs) and the NR into virtual computational units (VCUs), and applying innovative allocation and scheduling strategies to map VCUs to physical Grid resources. Together, these solutions will allow application developers to concentrate on the science and its formulations, without having to worry about explicitly addressing the number, limitations and availability of resources or targeting and tuning their implementations to specific architectures and machines – much like the convenience of virtual memory.
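The idea of mapping many virtual computational units onto fewer physical nodes can be illustrated with a minimal sketch. It uses a generic longest-first greedy heuristic, which is a stand-in and not the vGrid allocation strategy (the proposal does not specify one). The function name `map_vcus`, the load estimates and the node names are all hypothetical.

```python
import heapq
from typing import Dict, List


def map_vcus(vcu_loads: Dict[str, float], nodes: List[str]) -> Dict[str, List[str]]:
    """Assign each VCU to the currently least-loaded node (greedy, longest first).

    vcu_loads: hypothetical per-VCU cost estimates (arbitrary units).
    nodes:     names of the physical nodes that are currently available.
    """
    # Min-heap of (accumulated load, node name); ties break on the node name.
    heap = [(0.0, node) for node in nodes]
    heapq.heapify(heap)
    placement: Dict[str, List[str]] = {node: [] for node in nodes}

    # Place the most expensive VCUs first so they spread across nodes.
    for vcu, load in sorted(vcu_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, node = heapq.heappop(heap)
        placement[node].append(vcu)
        heapq.heappush(heap, (total + load, node))
    return placement


# Five virtual units multiplexed onto two physical nodes (made-up numbers).
example = map_vcus({"vcu0": 8, "vcu1": 7, "vcu2": 6, "vcu3": 5, "vcu4": 4}, ["a", "b"])
```

Several VCUs share one node here, which loosely mirrors the virtual-memory analogy: the application sees five units while only two physical nodes exist.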
To explain the impact of the vGrid architecture, consider the simulation of Active Flow Control (AFC) of turbulent flows, a target application of this proposal (see Section 5). Using so-called Direct Numerical Simulations (DNS), where all relevant scales are resolved, at Reynolds numbers Re_c ~ O(10^6) (based on chord length) is not feasible using current state of the art numerical techniques on existing massively parallel computers. In fact NASA has challenged the CFD research community to provide solutions to this problem. The computational requirements of this simulation are about 5000 × 1000 × 500 = 2.5 × 10^9 points and approximately 10^7 time steps. Assuming a processor speed of 1 GFLOP/s, this yields a total runtime of ~7 × 10^6 CPU hours, or about one month on 10,000 CPUs! Note that this assumes perfect speedup among the 10,000 processors; it does not include the synchronization, inter-processor communications and parallel execution overheads. Furthermore, the memory requirement for this simulation (with ~700 B/pt) is ~1.75 TB of run time memory and ~800 TB of storage.
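The estimate above can be checked with a few lines of arithmetic. The text does not state the cost per grid point per time step; the value of 1000 floating-point operations used below is an assumption, chosen because it reproduces the quoted ~7 × 10^6 CPU hours.

```python
# Back-of-the-envelope check of the AFC DNS estimate.
GRID_POINTS = 5000 * 1000 * 500   # 2.5e9 points (from the text)
TIME_STEPS = 10**7                # from the text
PROCESSOR_FLOPS = 1e9             # 1 GFLOP/s per CPU (from the text)
CPUS = 10_000                     # from the text
BYTES_PER_POINT = 700             # from the text
FLOP_PER_POINT_STEP = 1000        # ASSUMPTION: not given in the text


def cpu_hours(points: int, steps: int, flop_per_point_step: float, flops: float) -> float:
    """Total single-CPU time in hours, ignoring all parallel overheads."""
    seconds = points * steps * flop_per_point_step / flops
    return seconds / 3600.0


total_cpu_hours = cpu_hours(GRID_POINTS, TIME_STEPS, FLOP_PER_POINT_STEP, PROCESSOR_FLOPS)
wall_clock_days = total_cpu_hours / CPUS / 24.0        # assumes perfect speedup
memory_tb = GRID_POINTS * BYTES_PER_POINT / 1e12       # run-time memory in TB

print(f"{total_cpu_hours:.3g} CPU hours, {wall_clock_days:.1f} days, {memory_tb:.2f} TB")
```

Under that assumption the result is roughly 6.9 × 10^6 CPU hours, about 29 days on 10,000 CPUs, and 1.75 TB of memory, consistent with the figures quoted in the text.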
The spatial gradient in the AFC simulations can be used to define natural regions based on the “density of activities”. This means that each natural region will have different characteristics in terms of temporal and spatial resolutions (δt, δ(x,y,z)). The physics and the stability of the numerical algorithm might require a very small δt in one region and allow a much larger δt in another. By properly and dynamically choosing the local resolution, δ(x,y,z) and accordingly δt, and using them to define and spatially/temporally schedule natural regions, we can speed up the computations, reduce the synchronizations (the communication frequency can be adjusted based on the local state of the computations), and use resources more efficiently. The proposed research will address the five computational and computer science challenges as follows.
Scalability Challenge: The scale and resolutions of applications supported can be significantly increased by exploiting the application’s temporal, spatial and functional characteristics. In the free stream region (outer natural region) of the turbulent flow, δt will be several orders of magnitude larger than in the boundary layer and the separated region (inner natural region). Consequently, we can allocate an order of magnitude fewer resources to the outer regions and multitask their execution without serious performance impacts, because the solution evolves much more slowly in these regions and requires far fewer time steps.
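To make the scalability argument concrete, the sketch below compares the work of two natural regions that advance with different local time steps. All names (`NaturalRegion`, `region_work`, `work_shares`, `savings_vs_uniform`) and all numbers are hypothetical; the local δt is expressed as an integer multiple of the smallest δt so the step counts stay exact.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class NaturalRegion:
    name: str
    points: int        # grid points in the region
    dt_multiple: int   # local time step as a multiple of the smallest time step


def region_work(region: NaturalRegion, fine_steps: int) -> int:
    """Point-updates needed to cover `fine_steps` smallest-size time steps."""
    return region.points * (fine_steps // region.dt_multiple)


def work_shares(regions: List[NaturalRegion], fine_steps: int) -> Dict[str, float]:
    """Fraction of the total work (and hence of resources) per region."""
    work = {r.name: region_work(r, fine_steps) for r in regions}
    total = sum(work.values())
    return {name: w / total for name, w in work.items()}


def savings_vs_uniform(regions: List[NaturalRegion], fine_steps: int) -> float:
    """Work ratio: smallest time step used everywhere vs. per-region time steps."""
    uniform = sum(r.points for r in regions) * fine_steps
    local = sum(region_work(r, fine_steps) for r in regions)
    return uniform / local


# Made-up example: a small active inner region and a large, quiet outer region.
inner = NaturalRegion("inner", points=2_000_000, dt_multiple=1)
outer = NaturalRegion("outer", points=8_000_000, dt_multiple=1000)
shares = work_shares([inner, outer], fine_steps=10_000)
```

In this made-up case the outer region holds most of the grid points but accounts for well under one percent of the work, which is why it can be given few resources and multitasked.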
Complexity Challenge: The physical, numerical and software complexities are managed through separation of concerns and hierarchical abstractions. The computational domain is dynamically decomposed into natural regions with relatively stable and predictable local behavior. These natural regions are then used to define self-managing, self-adapting and self-optimizing autonomic components. The composition and interactions of these components are separately defined as policies and constraints based on application/system characteristics and state. For the AFC application, regions with similar temporal and spatial characteristics (δt, δ(x,y,z)) define the natural regions, and the behavior of these regions, their execution, interactions and synchronization are separately defined using policies based on state. For example, relatively less dynamic regions (where the flow is not changing rapidly) may synchronize less frequently, while regions with intense activity (at the location of the flow actuator) are assigned to high performance Grid resources with low latency.
Heterogeneity and Dynamics Challenge: To address this challenge the vGrid will seek to dynamically match application characteristics and the state of the Grid using context awareness, self-adaptation, self-healing and self-optimization at the application, component, service and system levels. For example, we can exploit heterogeneity in the AFC application by dynamically adjusting δt, δ(x,y,z) to track the current rate of change of the computation. In this way, the average time step in the free stream (outer natural region) can be orders of magnitude larger than δt near the actuator or in the separated region behind the hump (active natural regions), while the accuracy level remains unaffected or at least controlled to a desired level. The application physics, its current state (spatial locality of computation fronts) and available resources will dynamically determine the appropriate algorithms, configurations and parameters to meet these requirements.
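One simple way to let the time step track the local rate of change is sketched below. This is an illustrative controller only, not the adaptation scheme of the proposal; the function `adapt_dt`, its parameters and the growth limit of 2x per step are all assumptions.

```python
def adapt_dt(dt: float, rate_of_change: float, tolerance: float,
             dt_min: float, dt_max: float) -> float:
    """Choose the next time step so that the change per step stays near `tolerance`.

    rate_of_change: estimated magnitude of d(solution)/dt in the region.
    tolerance:      acceptable change of the solution per step (accuracy control).
    """
    if rate_of_change <= 0.0:
        # Nothing is changing: take the largest step that is allowed.
        return dt_max
    target = tolerance / rate_of_change
    # ASSUMPTION: limit growth to 2x per step to avoid abrupt jumps.
    target = min(target, 2.0 * dt)
    # Clamp to the range permitted by stability and efficiency.
    return max(dt_min, min(dt_max, target))


# A rapidly changing region shrinks its step; a quiet region grows it gradually.
active_dt = adapt_dt(1e-3, rate_of_change=100.0, tolerance=1e-2, dt_min=1e-6, dt_max=1.0)
quiet_dt = adapt_dt(1e-3, rate_of_change=1e-3, tolerance=1e-2, dt_min=1e-6, dt_max=1.0)
```

The tolerance plays the role of the "desired level" of accuracy mentioned above: a tighter tolerance yields smaller steps in active regions.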
Reliability Challenge: The targeted scientific and engineering simulations will take weeks and months to complete on very large collections of Grid resources. Consequently, the probability of failure during an application’s lifetime will be very high. To address this challenge we propose that simulations be developed as dynamic compositions of autonomic components that will behave and operate in an autonomic manner (self-configuring, self-healing, self-optimizing). The vGrid middleware will continuously monitor current state, and components that experience failures and/or severe degradation in performance will be seamlessly reassigned to other Grid resources.
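The claim that failures become nearly certain can be illustrated with a standard reliability model. The sketch assumes independent nodes with exponentially distributed times to failure; this model, the function names and the per-node MTBF of 50,000 hours are assumptions for illustration and do not come from the proposal.

```python
import math


def expected_failures(nodes: int, run_hours: float, node_mtbf_hours: float) -> float:
    """Expected number of node failures during the run (failed nodes replaced)."""
    return nodes * run_hours / node_mtbf_hours


def prob_any_failure(nodes: int, run_hours: float, node_mtbf_hours: float) -> float:
    """Probability that at least one node fails during the run."""
    return 1.0 - math.exp(-expected_failures(nodes, run_hours, node_mtbf_hours))


def system_mtbf_hours(nodes: int, node_mtbf_hours: float) -> float:
    """Mean time between failures of the whole collection of nodes."""
    return node_mtbf_hours / nodes


# Roughly one month (700 h) on 10,000 nodes with a hypothetical 50,000 h node MTBF.
failures = expected_failures(10_000, 700.0, 50_000.0)
p_fail = prob_any_failure(10_000, 700.0, 50_000.0)
mtbf = system_mtbf_hours(10_000, 50_000.0)
```

With these made-up numbers the collection as a whole fails about every five hours, so a month-long run would see on the order of a hundred failures, which is the regime in which continuous monitoring and reassignment become necessary.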
Programmability and Usability Challenge: It is widely believed that developing large Grid applications is extremely complex and requires significant expertise in computational and computer science and systems in addition to the application domain; as one computational scientist puts it “it still needs heroes to effectively run challenging Grid applications”. The convenient abstraction of a virtual Grid provided by the proposed research will free scientists and engineers from worrying about tailoring their algorithms and solutions to satisfy the current configurations and capacities of Grid resources. The vGrid infrastructure will allocate, schedule and manage the application execution in an autonomic way, analogous to how the operating system allocates, schedules and manages the execution of its processes. The vGrid will thus revolutionize how applications are developed and executed to exploit the true potential of the Grid.