Reliable Execution on the Grid
A particular kind of grid system is the distributed cycle sharing system,
in which compute cycles are shared among processes submitted by
different users. Such systems promise to harvest the abundant
idle cycles in clusters of machines. Several applications are widely known,
including SETI@home, Distributed.net, and RSA key
cracking. Users voluntarily share the compute cycles on their own host machines
and allow their own processes (host processes) to coexist with processes
submitted to the system by other users (guest processes). A common model in
cycle sharing systems is for jobs to be submitted to a centralized
scheduler, which schedules them onto the different machines in the cluster.
The scheduler, though centralized, typically works in tandem with monitors resident on the host machines, which supply the
machine characteristics to the scheduler. A natural goal of the scheduler is to
minimize the execution time of a guest process by picking the appropriate
machine to execute it on. Along with other suitability metrics, such as the
operating system and the CPU, memory, and IO resources of the machine, it is
useful for the scheduler to be able to predict whether a machine will remain
available for the execution duration of the job. It can then make intelligent
scheduling decisions that minimize the work lost when a machine fails. If the
job is long-running (a common characteristic of jobs in such systems), a
prediction over the entire length of its execution may be inaccurate. In that
case the scheduler can instead rely on incremental predictions of the host
machine's availability over a bounded chunk of time in the future. If
a failure is found to be impending, a fault-tolerant scheduler may checkpoint
the guest process's state and migrate it to another host.
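The scheduling idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's scheduler: the `Host` fields, the `expected_progress` score, and the 0.5 migration threshold are all hypothetical, and the score assumes that a failure during a chunk loses all of that chunk's work.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    name: str
    speed: float        # hypothetical relative CPU speed, 1.0 = baseline
    p_available: float  # predicted probability of staying available for the next chunk


def expected_progress(host: Host, chunk_seconds: float) -> float:
    """Expected useful work (in baseline-seconds) from one chunk on this host.

    Simplifying assumption: if the host fails during the chunk, the whole
    chunk's work is lost.
    """
    return host.speed * chunk_seconds * host.p_available


def pick_host(hosts: List[Host], chunk_seconds: float) -> Host:
    """Choose the host with the highest expected progress for the next chunk."""
    return max(hosts, key=lambda h: expected_progress(h, chunk_seconds))


def should_migrate(current: Host, hosts: List[Host], chunk_seconds: float,
                   threshold: float = 0.5) -> Optional[Host]:
    """Return a migration target if the current host looks likely to fail.

    Migration is suggested only when the current host's predicted availability
    is below `threshold` and some other host has a better prediction.
    """
    if current.p_available >= threshold:
        return None
    candidates = [h for h in hosts if h.name != current.name]
    if not candidates:
        return None
    best = pick_host(candidates, chunk_seconds)
    return best if best.p_available > current.p_available else None


hosts = [Host("a", 1.0, 0.9), Host("b", 2.0, 0.4), Host("c", 1.2, 0.8)]
target = should_migrate(hosts[1], hosts, chunk_seconds=100.0)
```

In a real system the scheduler would checkpoint the guest process before moving it to `target`, and the availability figures would be refreshed by the predictor before each chunk.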
The goal of the proposed project is to
build a system for failure prediction in cycle sharing systems and integrate
the predictor with an existing scheduler. The basic premise behind the work is
that there exist some symptoms of impending machine failure, which are
manifested in the pattern of resource consumption at the host machine. Many
researchers have observed this dependence. For example, exhaustion of main memory
indicates a memory leak, and when it occurs in kernel memory it is
taken as an indicator of impending failure [1, 2]. However, existing work falls short on
several fronts. First, no failure prediction system is general enough to handle
a wide variety of workloads, which may be CPU, memory, or IO intensive, with
multiple types coexisting on a machine. Most have studied failures caused by
memory exhaustion, and all must be fitted with an underlying failure
model (for example, failure results from a failing disk, and hence failed IO calls
are taken as a symptom [3]). Second, the key parameters needed to
make a reliable online predictor are missing from both the design and the
evaluation. These parameters include the length of time for (accurate)
prediction, the amount of history to be used for prediction, and the amount of lookahead possible in the prediction. The popular Network Weather Service (NWS)
[4] does not scale to grid systems and
does not explicitly target failure prediction.
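To make the roles of the history and lookahead parameters concrete, here is a small sketch that fits a least-squares line to the most recent samples of a resource-usage series and checks whether the extrapolated usage reaches capacity by the end of the lookahead window. Linear extrapolation is a stand-in chosen for brevity, not the predictor used in this project; the function name and its arguments are hypothetical.

```python
from typing import Sequence


def predict_exhaustion(samples: Sequence[float], capacity: float,
                       history: int, lookahead: int) -> bool:
    """Return True if the fitted trend is at or above `capacity`
    `lookahead` steps after the most recent sample.

    Only the last `history` samples are used to fit the line.
    """
    window = list(samples[-history:])
    n = len(window)
    if n < 2:
        return False  # not enough history to estimate a trend

    # Ordinary least-squares fit of usage against the sample index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(window) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(window))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Extrapolate to `lookahead` steps past the last observed sample.
    projected = intercept + slope * (n - 1 + lookahead)
    return projected >= capacity


usage = [10.0, 20.0, 30.0, 90.0]  # made-up memory usage samples
long_history = predict_exhaustion(usage, capacity=200.0, history=4, lookahead=1)
short_history = predict_exhaustion(usage, capacity=100.0, history=2, lookahead=1)
```

A short history reacts quickly to a sudden jump but is easily misled by noise, while a long lookahead gives more time to checkpoint at the cost of less reliable extrapolation.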
[Figure: Schematic of our system, which is incorporated within iShare.]
Through our work we have demonstrated that:
· A semi-Markov process model can predict unavailability due to resource contention, with CPU and memory being the constrained resources.
· A neural network model with system state parameters as input can predict software failures, under a failure model in which high and fluctuating resource usage is indicative of failure.
· Integrating failure prediction into a proactive scheduler yields improvements over a failure-oblivious scheduler such as that in Condor.
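As an illustration of the first point, the following sketch computes, for a discrete-time semi-Markov model, the probability that a host becomes unavailable within a given horizon. It is not necessarily the model used in this project: the state names, transition probabilities, and sojourn-time distributions are made up, and the computation assumes the host has just entered its current state.

```python
from functools import lru_cache

FAIL = "unavailable"  # absorbing state: the guest process can no longer run

# Hypothetical embedded-chain transition probabilities: P[state][next_state].
P = {
    "light": {"heavy": 1.0},
    "heavy": {"light": 0.7, FAIL: 0.3},
}

# Hypothetical sojourn-time distributions: SOJOURN[state][t] is the probability
# of staying exactly t time steps in `state` before jumping. Because these need
# not be geometric, the process is semi-Markov rather than Markov.
SOJOURN = {
    "light": {1: 0.5, 2: 0.5},
    "heavy": {1: 1.0},
}


def failure_probability(state: str, horizon: int) -> float:
    """Probability of reaching FAIL within `horizon` steps,
    given that `state` has just been entered."""

    @lru_cache(maxsize=None)
    def first_passage(s: str, steps_left: int) -> float:
        if s == FAIL:
            return 1.0
        total = 0.0
        for dwell, p_dwell in SOJOURN[s].items():
            if dwell > steps_left:
                continue  # the next jump would happen after the horizon
            for nxt, p_next in P[s].items():
                total += p_dwell * p_next * first_passage(nxt, steps_left - dwell)
        return total

    return first_passage(state, horizon)


p_short = failure_probability("light", 2)
p_long = failure_probability("light", 10)
```

A scheduler could compare `1 - failure_probability(state, horizon)` across hosts to obtain the per-chunk availability estimate it needs, with `horizon` set to the chunk length.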
Collaborators:
Rudi Eigenmann (ECE), Hugh Hillhouse (Chemical Engineering).
Papers:
See Publications.
References:
[1] M. Shereshevsky, J. Crowell, B. Cukic, V. Gandikota, and L. Yan, "Software aging and multifractality of memory resources," in Proceedings of the International Conference on Dependable Systems and Networks (DSN), pp. 721-730, 2003.
[2] K. Vaidyanathan and K. S. Trivedi, "A measurement-based model for estimation of resource exhaustion in operational software systems," in Proceedings of the 10th International Symposium on Software Reliability Engineering (ISSRE), pp. 84-93, 1999.
[3] A. Thakur and R. K. Iyer, "Analyze-NOW - an environment for collection and analysis of failures in a network of workstations," IEEE Transactions on Reliability, vol. 45 (4), pp. 561-570, 1996.
[4] R. Wolski, N. Spring, and J. Hayes, "The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing," Journal of Future Generation Computing Systems, vol. 15 (5-6), pp. 757-768, 1999.
Last Modified: March 8, 2007