ECE 69500 - Optimization for Deep Learning

Course Details

Lecture Hours: 3 Credits: 3

Areas of Specialization:

  • Communications, Networking, Signal & Image Processing

Counts as:

Normally Offered:

Fall - even years

Requisites by Topic:

Undergraduate probability, calculus, and linear algebra. Basic knowledge of computer vision, NLP, machine learning, and statistics is helpful but not required.

Catalog Description:

This course discusses the optimization algorithms that have been the engine that powered the recent rise of machine learning (ML) and deep learning (DL). The "learning" in ML and DL typically boils down to non-convex optimization problems with high-dimensional parameter spaces and objective functions involving millions of terms. Given that the success of ML models is reliant on finding solutions to these problems efficiently, the needs of ML are different than those of other fields in terms of optimization. This course introduces students to the theoretical principles behind stochastic, gradient-based algorithms for DL as well as considerations such as adaptivity, generalization, distributed learning, and non-convex loss surfaces typically present in modern DL problems.

Required Text(s):


Recommended Text(s):

  1. Multiple books and articles available freely online will be referenced; details provided in Brightspace.

Lecture Outline:

Week Lecture Topics
1 Introduction, motivation, and probability overview: Motivating examples and optimization challenges in deep learning; Probability overview: Sub-gaussian random variables, concentration inequalities
2 Landscape of deep learning and relevant optimization concepts: Convex analysis and optimization, Lipschitzness, smoothness; Beyond convexity: weakly convex function, Polyak-Lojasiewicz condition, interpolation and overparameterization in deep learning
3 A modern take on stochastic gradient descent I: Stochastic optimization, empirical risk minimization, proximal operators and composite problems; SGD and its (non-convex) analysis I
4 A modern take on stochastic gradient descent II: SGD and its (non-convex) analysis II; High probability convergence bounds
5 SGD without the gradient: Zeroth-order optimization via gradient estimation; Applications to black-box adversarial attacks
6 Generalization error in DL: Rademacher complexity and empirical processes; Generalization error of neural networks trained by SGD
7 Is SGD optimal?: Lower bounds for nonconvex SGD: Le Cam's and Assouad's methods; Mean-square smoothness and suboptimality of SGD
8 Variance reduction for accelerating SGD: IFO methods including SVRG and SAGA; Convergence of non-convex SVRG
9 Momentum for accelerating SGD: SFO methods including STORM, SPIDER, and Adam; Convergence of STORM
10 Handling constraints in stochastic optimization: Handling constraints and stochastic mirror descent; Projection-free methods: Stochastic Frank-Wolfe
11 Beyond first order guarantees: Second order guarantees, conditions, and assumptions; Escaping from saddle points via noisy SGD
12 SGD for Multi-task learning and Min-max problems: SGD-based Meta-Learning; Distributionally-robust optimization, saddle-point and min-max problems
13 Parallel and distributed DL: Master-worker setup, federated, and decentralized learning; Models of heterogeneity and non-i.i.d. data, addressing heterogeneity
14 Efficiency and Resiliency in DL: Byzantine attacks, fault tolerance, and robust aggregation; Quantization and sparsification in DL, Error feedback for SGD
15 Socially responsible optimization for DL: Mathematical models of privacy; DP-SGD, normalization and clipping

Assessment Method:

Homework, midterm exam, final project. 4/22