Five Significant Publications

[3] Xiaojuan Ren, Seyong Lee, Rudolf Eigenmann, and Saurabh Bagchi, “Resource Failure Prediction in Fine-Grained Cycle Sharing Systems,” at the 15th IEEE International Symposium on High Performance Distributed Computing (HPDC), pp. 93-104, June 19-23, 2006, Paris, France. (Acceptance rate: 24/157 ≈ 15.3%) (Runner-up for best paper award)

[ Paper in pdf ]

Problem Statement: Volunteer computing is an arrangement in which people (volunteers) provide computing resources to projects, which use those resources for distributed computing and/or storage. Because of the huge number of personal computing devices in the world, volunteer computing can supply more computing power to science than any other type of computing. This computing power enables scientific research that could not be done otherwise, as demonstrated by projects such as Protein Folding and Einstein@Home, both of which were executed through the volunteer computing infrastructure called BOINC. This advantage is projected to increase over time, because the laws of economics dictate that consumer products such as PCs and game consoles will advance faster than more specialized products, and that there will be more of them. Volunteer computing also encourages public interest in science and gives the public a voice in determining the directions of scientific research. A more localized infrastructure that shares some of the features of volunteer computing is HTCondor, a workload management system for compute-intensive jobs that enables one to effectively harness wasted CPU power from otherwise idle desktop workstations, say within a campus. However, at the time of our work, this concept was threatened by the fact that these systems allowed a guest process to run concurrently with local jobs (host processes) only while the guest process did not noticeably degrade the performance of the latter. For guest users, the free compute resources therefore came at the cost of highly fluctuating availability, and the resulting resource failures led to undesirably long completion times for guest jobs. The primary victims of such resource failures were large compute-bound guest jobs, often the exact kinds of jobs meant to run opportunistically in these BOINC or HTCondor systems.

Contribution of Paper: The main contributions of this paper were the design and evaluation of an approach for predicting resource failures in resource-sharing systems. We developed a multi-state failure model and applied a semi-Markov process (SMP) to predict the temporal reliability, which is the probability that no resource failure will occur on a machine in a future time window. The failure model integrated the two classes of failures: a machine becoming unavailable to the opportunistic jobs to prevent unacceptable impact on the machine owner’s jobs, and the churn of the machines in the volunteer computing system. To compute the temporal reliability for a given time window, the parameters of the SMP were calculated from the host resource usage during the same time window on previous days. A key observation leading to our approach was that the daily patterns of host users’ workloads were comparable to those of the most recent days, after accounting for the difference between weekdays and weekends. Deviations from the regular patterns were accommodated by the statistical method that calculated the SMP parameters. We showed how the prediction was implemented and utilized in the university-wide production HTCondor system. Operationally, our implementation had to combine low computational overhead with accurate failure prediction. Host users on these machines generated highly diverse workloads, and yet our prediction accuracy was 86.5% on average and 73.3% in the worst case, which outperformed the prediction accuracy of linear time series models.
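To make the notion of temporal reliability concrete, the following is a minimal sketch of how such a probability could be computed for a generic discrete-time semi-Markov process. It is not the implementation from the paper. The state numbering, the function name temporal_reliability, the dictionary layout of the parameters, and the example numbers are all assumptions chosen for illustration, and the step that estimates the parameters from host resource usage on previous days is not shown.

```python
from typing import Dict, Set

# Hypothetical parameter layout (an assumption, not the paper's data format):
#   q[i][j]    = probability that the next state after i is j
#   h[i][j][k] = probability that the process stays k >= 1 time steps in i
#                before moving to j, given that the next state is j
TransitionProbs = Dict[int, Dict[int, float]]
HoldingPmfs = Dict[int, Dict[int, Dict[int, float]]]


def temporal_reliability(
    q: TransitionProbs,
    h: HoldingPmfs,
    failure_states: Set[int],
    start: int,
    window: int,
) -> float:
    """Probability that no failure state is entered within `window` steps,
    given that the process has just entered state `start` at time 0."""
    # fail[t][i] = probability of reaching a failure state within t steps
    # when the process has just entered the non-failure state i.
    fail = [{i: 0.0 for i in q} for _ in range(window + 1)]
    for t in range(1, window + 1):
        for i in q:
            total = 0.0
            for j, p_ij in q[i].items():
                for k, p_k in h[i][j].items():
                    if k < 1:
                        raise ValueError("holding times must be >= 1 step")
                    if k > t:
                        continue  # this transition happens after the window
                    if j in failure_states:
                        total += p_ij * p_k
                    else:
                        # Continue from j with the remaining t - k steps.
                        # States without outgoing transitions never fail.
                        total += p_ij * p_k * fail[t - k].get(j, 0.0)
            fail[t][i] = total
    return 1.0 - fail[window].get(start, 0.0)


# Illustrative three-state model (made-up numbers):
#   0 = machine lightly loaded, 1 = machine heavily loaded, 2 = resource failure
example_q = {0: {1: 0.8, 2: 0.2}, 1: {0: 1.0}}
example_h = {0: {1: {1: 1.0}, 2: {1: 1.0}}, 1: {0: {1: 1.0}}}

print(round(temporal_reliability(example_q, example_h, {2}, start=0, window=3), 4))
# prints 0.64
```

In this sketch, a scheduler could compare the returned value against a threshold before placing a long guest job on a machine. The recursion only illustrates the definition of temporal reliability; an implementation with the low overhead described above would need its own choice of states and data structures.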