Aggarwal receives NSF grant

IE Assistant Professor Vaneet Aggarwal
Assistant Professor Vaneet Aggarwal received a National Science Foundation (NSF) grant for a new project researching erasure codes for online storage.

The project is entitled "NeTS: Small: Collaborative Research: Rethinking Erasure Codes for Cloud Storage: A Quantitative Framework for Latency, Reliability, and Cost Optimization". Prof. Tian Lan of George Washington University will collaborate with Prof. Aggarwal, and the project will run from Oct. 1, 2016 to Sept. 30, 2019.

SUMMARY:

This project aims to develop an analytical framework that quantifies tail latency and reliability of erasure-coded storage through investigation of novel scheduling and repair strategies, mandating rethinking of erasure codes for online storage. As erasure coding is increasingly adopted by large-scale storage systems such as Microsoft Azure and Facebook, conventional approaches that primarily rely on system design heuristics have become inadequate in pushing the performance boundaries in terms of latency and reliability optimization. Quantifying tail latency and reliability of an erasure-coded storage system that employs dynamic workload scheduling and online repair is an open problem. There exists little work illuminating the design space via mathematical crystallization of key tradeoffs and associated engineering "control knobs".

This project will enable a joint optimization of latency, reliability, and storage cost, which mandates rethinking of erasure-coded storage optimization and service pricing models. We plan to concentrate on the following specific aspects: (1) We will investigate a family of new probabilistic scheduling policies and extend order statistic analysis to derive closed-form bounds on tail latency for erasure-coded storage with arbitrary configurations, general service-time distributions, and potentially differentiated service classes. (2) Through a novel reliability metric, Time to Data Loss, we will investigate online repair strategies that significantly improve reliability using a class of bandwidth-efficient codes, enabling a tradeoff between repair timeliness and reliability optimization. (3) We will employ the theoretical analysis in this research to pursue a holistic solution that jointly optimizes reliability, latency, and storage costs over seven key control dimensions: choice of erasure codes, chunk placement, network resource allocation, cache management, dynamic scheduling, pricing, and online repair strategy. (4) We will integrate the proposed framework with current cloud systems to bridge the gap between the theoretical results and practical systems. By developing new analytical models and algorithms for joint optimization of latency, reliability, and storage cost, the project will mandate rethinking of erasure-coded storage design and service pricing models. It will also produce novel, interdisciplinary curriculum modules for teaching both these theories and systems.