Resource Efficient Large Scale ML: Plan Before You Run

Abstract

As ML on structured data becomes prevalent across enterprises, improving resource efficiency is crucial to lowering costs and energy consumption. Designing systems for learning on structured data is challenging because of the large number of models, parameters, and data access patterns involved. We identify that current systems are bottlenecked by data movement, which results in poor resource utilization and inefficient training.

In this talk, I will describe our work on developing systems that plan data access ahead of time to yield drastic improvements in resource efficiency. I will first describe Marius, a system for training ML models on billion-edge graphs using a single machine. Marius is designed as an out-of-core, pipelined training system and includes new buffer-aware data orderings that minimize disk accesses. I will then describe BagPipe, a recently developed system that lowers remote data access overheads for distributed training of recommendation models while maintaining synchronous training semantics. Finally, I will discuss how our design approach can also be extended cluster-wide to improve resource utilization across ML training jobs.

Biography

Shivaram Venkataraman is an Assistant Professor in the Computer Science Department at the University of Wisconsin-Madison. His research interests are in designing systems and algorithms for large-scale data analysis and machine learning. Previously, he completed his PhD at UC Berkeley, where he was advised by Ion Stoica and Mike Franklin. His work has been recognized with an NSF CAREER award, a SIGMOD Systems award, and a SACM Student Choice Professor of the Year award.