GridCop

GridCop: Monitoring Remotely Executing Programs for Correctness and Progress

Overview of GridCop
Papers
Funding
People

Computational workloads for many academic groups, small businesses and consumers are bursty. That is, they are characterized by long periods of little or no processing punctuated by periods of intense computation and insufficient computational resources. By aggregating large numbers of computers and users, the resource demands are ``smoothed out" across sub-groups even as demand remains bursty within sub-groups. Centrally managed software projects (e.g. Distributed.net, GENOME, NILE, SETI@Home) have sprung up to access idle machines to perform computations that would be economically infeasible to solve on committed hardware. Centrally managed systems like Condor and LoadLeveler have been developed to allow resources to be aggregated within permanent or ad-hoc organizations.

Centralized administration of resources exists because it allows a trusted entity -- system administrators -- to verify and track the trustworthiness of users given access to the resources, and it allows users to deal with a known, trusted entity. Certification of the user of a machine is almost always contingent on being an employee of the machine owner, or being certified by another organization which the user belongs to, and which is, in turn, trusted by the machine owner. This certification requires legal contracts that carefully delineate risks and responsibilities, staff to maintain accounts, accountants to monitor funding streams and tax consequences, and generally increases the overhead and real cost of acquiring and using computational resources. This in turn restricts the domain of applications that can be run on shared resources. Elimination of these overheads would allow automatic intermediation between consumers and providers of resources, allowing shared resources to blend seamlessly with locally owned resources.

The promise of increasing available processing powers during peak periods by trading cycles during idle periods, yet without significant additional cost, would motivate people to join a cycle-sharing system, but only one in which the following significant technical challenges have been solved.

How can resource consumers discover computational resources capable of hosting their job?
How can an application be run on a variety of resource provider machines without change (portability)?
How can a resource provider machine safely execute an application from an untrusted source (safety)?
How can resource providers be compensated (and cheaters punished), to enforce the fairness of the system resource usage?
How does the resource consumer know its job is being faithfully executed? How does the resource consumer know its job is making progress?

Such a system will unleash the potential of the massive computational resources that are going unused. In a cycle-sharing system, a submitter machine, i.e. a machine submitting a job request to the system, plays the role of a system resource consumer; a host machine, i.e. a machine accepting and executing a job, plays the role of a system resource provider.

Peer-to-peer (p2p) networks (e.g., CAN, Chord, Pastry, and Tapestry) have achieved widespread use as a content discovery mechanism. We propose using the same mechanisms for resource discovery and job assignment to solve the first challenge. Moreover, because of the self-organizing feature of p2p networks, it is easy for nodes to join, and leave, without the necessity of a central administrative organization and human intervention, which in turn obviates the need for a central organization and human intervention, and lowers administrative overhead.

The use of Java is extremely convenient, if not essential, for overcoming the second and third of these challenges (portability and safety). We utilize Java's universal virtual machine execution environment feature to enable applications to run on a wide variety of physical machines without any change (portability); We utilize Java virtual machine's sandboxing and security feature to provide resource provider a safe execution environment to host untrusted applications (host safety). Both of these attributes significantly lower the cost and risk for producers and consumers of cycles to join a network of shared resources. Moreover, research have shown that there are no inherent technical reasons for not using Java for high performance computing.

To solve the fourth challenge, we have designed a credit accounting system. A community of pooled resources will survive only as long as members are treated with a high (but not necessarily perfect) degree of fairness. Moreover, just as the larger economy can function well with a certain amount of fraud and noise in transactions and accounting, so should economies involved in sharing of computational resources. Thus our goal is not to produce a perfectly fair system, but instead to produce a sufficiently good system to enable wide scale sharing of computational resources. An incremental payment scheme is used in many modern economic activities to bound the amount of risk for both providers and consumers of a transaction to with the size of the incremental payment. Our credit system, along with the GridCop system, enables the risk for both submitters and hosts in our cycle-sharing system to be bounded to the size of an incremental payment.

To solve the last challenge, we have developed a remote job progress and correctness monitoring system, GridCop, to solve the last challenge. The GridCop system allows a computation on a remote, and potentially fraudulent, host system to be monitored for progress and execution correctness. Monitoring progress and correctness is especially important to the submitter during a long running application because it provides submitters confidence that their jobs will be executed by a remote, otherwise untrusted host. Moreover, along with the above credit system, GridCop enables the use of the incremental payment scheme.

GridCop Architecture Overview

Papers

CycleMeter: Detecting Fraudulent Peers in Internet Cycle Sharing.
Zheng Zhang, Y. Charlie Hu and Samuel P. Midkiff.
To appear in Proceedings of IEEE/ACM SC2006, November 11-17, 2006, Tampa, FL.
PDF
A Fair, Secure and Trustworthy Peer-to-Peer Based Cycle-Sharing System.
Shuo Yang, Ali R. Butt, Xing Fang, Y. Charlie Hu, and Samuel P. Midkiff.
To appear in Journal of Grid Computing, Special Issue on Peer-to-Peer and Grid Computing, Volume 4 (3), pp. 265-286, September 2006.
Monitoring Remotely Executing Shared Memory Programs in Software DSMs.
Long Fei, Xing Fang, Y. Charlie Hu, and Samuel P. Midkiff.
Proceedings of the 20th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rhodes Island, Greece. April 25-29, 2006.
PDF
Lightweight Monitoring of the Progress of Remotely Executing Computations.
Shuo Yang, Ali R. Butt, Y. Charlie Hu and Samuel P. Midkiff.
Proceedings of the 18th International Workshop on Languages and Compilers for Parallel Computing (LCPC), Hawthorne, New York, October 20-22, 2005.
PDF
Lightweight Monitoring of the Progress of Remotely Executing Computations.
Shuo Yang, Ali R. Butt, Y. Charlie Hu and Samuel P. Midkiff.
Poster, IEEE/ACM SC2005, Seattle, WA, November 12-18, 2005.
PDF
Trust but Verify: Monitoring Remotely Executing Programs for Progress and Correctness.
Shuo Yang, Ali R. Butt, Y. Charlie Hu, and Samuel P. Midkiff.
Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Chicago, IL, June 15-17, 2005.
PDF
Java, Peer-to-Peer, and Accountability: Building Blocks for Distributed Cycle Sharing.
Ali R. Butt, Xing Fang, Y. Charlie Hu, and Samuel Midkiff.
Proceedings of the 3rd USENIX Virtual Machine Research and Technology Symposium (VM), San Jose, CA, May 6-7, 2004.
PDF
An Open Peer-to-Peer Infrastructure for Cycle-Sharing.
Ali R. Butt, Xing Fang, Y. Charlie Hu, Sam Midkiff and Jan Vitek.
Work-in-Progress, 19th ACM Symposium on Operating Systems Principles (SOSP), Bolton Landing (Lake George), NY, October 19-22, 2003.
PDF

Funding

This project is supported in part by NSF CAREER award grant ACI-0238379 and NSF grants CCR-0313026 and CCR-0313033.

People

Faculty: Y. Charlie Hu, Samuel Midkiff
Students: Xing Fang, Zheng Zhang
Former members: Ali R. Butt, Long Fei, Shuo Yang