Reliable Execution on the Grid

Distributed cycle sharing systems are a particular kind of grid system in which compute cycles are shared among processes submitted by different users. They promise to harvest the abundant idle cycles in clusters of machines. Widely known applications include SETI@home, Distributed.net, and RSA key cracking. Users voluntarily share the compute cycles of their own host machines and allow their own processes (host processes) to coexist with processes submitted to the system by other users (guest processes).

A common model in cycle sharing systems is for jobs to be submitted to a centralized scheduler, which assigns them to the machines in the cluster. The scheduler, though centralized, typically works in tandem with monitors resident on the host machines, which supply the machine characteristics to the scheduler. A natural goal of the scheduler is to minimize the execution time of a guest process by picking an appropriate machine to execute it on. Along with other measures of suitability, such as the operating system and the CPU, memory, and IO resources of the machine, it is useful for the scheduler to be able to predict whether a machine will remain available for the execution duration of the job. It can then make intelligent scheduling decisions that minimize the work lost when a machine fails. If the job is long-running (a common characteristic of jobs in such systems), a prediction over the entire length of its execution may be inaccurate. It may then be possible to predict incrementally, estimating the availability of the host machine over a certain chunk of time in the future. If a failure is found to be impending, a fault-tolerant scheduler may checkpoint the guest process's state and migrate it to another host.
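The scheduling idea described above can be illustrated with a minimal sketch. It is not the project's scheduler: the `Host` fields, the function names, and the cost model (a failed chunk is redone from scratch, so the expected number of attempts is 1/p) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    name: str
    speed: float        # relative CPU speed, 1.0 = baseline (hypothetical metric)
    p_available: float  # predicted probability of staying up for the next chunk


def expected_chunk_cost(host: Host, chunk_work: float) -> float:
    """Expected time to finish one chunk of work on this host.

    Assumed model: a failure loses the whole chunk and the chunk is retried,
    so the expected number of attempts is 1 / p_available.
    """
    run_time = chunk_work / host.speed
    return run_time / host.p_available


def pick_host(hosts: List[Host], chunk_work: float,
              min_availability: float = 0.5) -> Optional[Host]:
    """Pick the host with the lowest expected cost among sufficiently reliable ones."""
    candidates = [h for h in hosts if h.p_available >= min_availability]
    if not candidates:
        return None
    return min(candidates, key=lambda h: expected_chunk_cost(h, chunk_work))


def migration_target(current: Host, hosts: List[Host], chunk_work: float,
                     threshold: float = 0.5) -> Optional[Host]:
    """Return a host to migrate to when failure looks likely, else None.

    In a real system a checkpoint of the guest process would be taken
    before the migration; that step is outside this sketch.
    """
    if current.p_available >= threshold:
        return None
    others = [h for h in hosts if h.name != current.name]
    return pick_host(others, chunk_work, min_availability=threshold)
```

Re-running `migration_target` at the start of each chunk, with a freshly predicted `p_available`, corresponds to the incremental prediction described above: the scheduler never has to trust a single forecast over the whole lifetime of a long-running job.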

The goal of the proposed project is to build a system for failure prediction in cycle sharing systems and to integrate the predictor with an existing scheduler. The basic premise of the work is that impending machine failure has symptoms that are manifested in the pattern of resource consumption at the host machine. Many researchers have found this dependence: for example, exhaustion of main memory indicates a memory leak, and when it happens for kernel memory it is taken as an indicator of impending failure [1, 2]. However, existing work falls short on several fronts. First, no failure prediction system is general enough to handle a wide variety of workloads, which may be CPU, memory, or IO intensive, with multiple types coexisting on a machine. Most have looked at memory-exhaustion failures, and all have to be fitted with an underlying failure model (for example, that failure results from a failing disk, so failed IO calls are taken as a symptom [3]). Second, the key parameters needed to make a reliable online predictor are missing from the design and the evaluation. These parameters include the length of time for (accurate) prediction, the amount of history to be used for prediction, and the amount of lookahead possible in the prediction. The popular NWS system [4] does not scale to the grid and does not explicitly target failure prediction.
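To make the roles of the history and lookahead parameters concrete, here is a small sketch of a trend-based exhaustion predictor. It is a deliberately simple stand-in (a least-squares line fitted to recent samples), not the predictor built in this project; the function names, the sampling scheme, and the `limit` threshold are assumptions.

```python
from typing import Sequence, Tuple


def linear_trend(samples: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for samples taken at steps 0, 1, 2, ..."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_exhaustion(samples: Sequence[float], history: int,
                       lookahead: int, limit: float) -> bool:
    """Predict whether usage reaches `limit` within `lookahead` future steps.

    history   -- number of most recent samples used to fit the trend
    lookahead -- how many sampling steps ahead the trend is extrapolated
    limit     -- usage level treated as exhaustion (assumed threshold)
    """
    window = list(samples)[-history:]
    if len(window) < 2:
        return False  # not enough history to estimate a trend
    slope, intercept = linear_trend(window)
    last_step = len(window) - 1
    projected = intercept + slope * (last_step + lookahead)
    return projected >= limit
```

A short history reacts quickly but is sensitive to noise, while a long lookahead gives the scheduler more time to checkpoint at the price of less reliable extrapolation. Evaluating a predictor therefore means reporting accuracy as a function of both parameters rather than at a single setting.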

Here is a schematic of our system that is incorporated within iShare.

Through our work we have demonstrated that:

•  The semi-Markov process model can be used to predict unavailability due to resource contention, with CPU and memory being the constrained resources

•  A neural network model with system state parameters as input can predict software failures, where the failure model is that high and fluctuating resource usage is indicative of failure

•  Integrating the failure prediction into a proactive scheduler yields improvements over a failure-oblivious scheduler such as that in Condor
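As an illustration of the first result, the sketch below computes, for a discrete-time semi-Markov process, the probability of entering an unavailable state within a given horizon. The state names, the kernel values, and the discrete-time formulation are assumptions chosen for a compact example; they are not taken from the project's model.

```python
from typing import Dict, List

# kernel[i][j][k-1] = P(next state is j and the sojourn in i lasts k steps,
#                       given that the process has just entered state i)
Kernel = Dict[str, Dict[str, List[float]]]


def failure_probability(kernel: Kernel, start: str, failed: str,
                        horizon: int) -> float:
    """P(the process enters `failed` within `horizon` steps | just entered `start`).

    Dynamic programming over the time budget t:
      reach[t][i] = sum over j, k <= t of kernel[i][j][k-1] * reach[t-k][j]
    with reach[t][failed] = 1 for every t.
    """
    states = list(kernel)
    reach = [{s: (1.0 if s == failed else 0.0) for s in states}
             for _ in range(horizon + 1)]
    for t in range(1, horizon + 1):
        for i in states:
            if i == failed:
                continue
            total = 0.0
            for j, sojourn_probs in kernel[i].items():
                for k, q in enumerate(sojourn_probs, start=1):
                    if k > t:
                        break  # this transition cannot complete within t steps
                    total += q * reach[t - k][j]
            reach[t][i] = total
    return reach[horizon][start]


# Hypothetical three-state host model: light load, heavy load, unavailable.
example_kernel: Kernel = {
    "light": {"heavy": [0.5, 0.5]},                 # leaves after 1 or 2 steps
    "heavy": {"light": [0.6], "down": [0.0, 0.4]},  # recovers fast or fails after 2
    "down": {},
}
```

With `example_kernel`, a host that has just entered the `light` state cannot reach `down` in fewer than three steps, so the probability is zero for shorter horizons and grows as the horizon lengthens. Unlike a plain Markov chain, the sojourn-time lists let the time spent in a state influence what happens next, which is what makes the semi-Markov form suitable for load patterns that persist for a while before changing.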

Collaborators:
Rudi Eigenmann (ECE), Hugh Hillhouse (Chemical Engineering).

Papers:

See Publications.

References:

[1] M. Shereshevsky, J. Crowell, B. Cukic, V. Gandikota, and L. Yan, "Software aging and multifractality of memory resources," in Proceedings of the International Conference on Dependable Systems and Networks (DSN), pp. 721-730, 2003.

[2] K. Vaidyanathan and K. S. Trivedi, "A measurement-based model for estimation of resource exhaustion in operational software systems," in Proceedings of the 10th International Symposium on Software Reliability Engineering (ISSRE), pp. 84-93, 1999.

[3] A. Thakur and R. K. Iyer, "Analyze-NOW-an environment for collection and analysis of failures in a network of workstations," IEEE Transactions on Reliability, vol. 45 (4), pp. 561-570, 1996.

[4] R. Wolski, N. Spring, and J. Hayes, "The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing," Journal of Future Generation Computing Systems, vol. 15 (5-6), pp. 757-768, 1999.

 

 

Last Modified: March 8, 2007