2026-06-03 11:30:00 2026-06-03 12:30:00 America/Indiana/Indianapolis 2026 YESS Seminar A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies Phalguni Nanda. Ph.D. Student GRIS 134

June 3, 2026

2026 YESS Seminar
A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies

Event Date:
June 3, 2026
Time:
11:30am EDT
Location:
GRIS 134
Priority:
No
School or Program:
Industrial Engineering
College Calendar:
Show
Phalguni Nanda. Ph.D. Student

ABSTRACT 

In this work, we present the first finite-time analysis of Q-learning with time-varying learning policies (i.e., on-policy sampling) for discounted Markov decision processes under minimal assumptions, requiring only the existence of a policy that induces an irreducible Markov chain over the state space. We establish a last-iterate convergence rate for the mean squared sup-norm error of the Q-function iterates, which implies a sample complexity of order O(1/ξ2) for achieving an accuracy level of ξ. This rate matches that of off-policy Q-learning, but with a worse dependence on exploration-related parameters. We also derive an explicit finite-time rate for the mean squared sup-norm error of the Q-function corresponding to the learning policy at iteration k. Together, these results highlight the exploration-exploitation trade-off in on-policy Q-learning. While exploration is weaker than in off-policy methods, on-policy learning enjoys an exploitation advantage since the learning policy itself converges to an optimal one. Numerical experiments corroborate our theoretical findings.


From a technical perspective, the combination of rapidly time-varying learning policies, which induce time-inhomogeneous Markovian noise, and minimal exploration assumptions presents significant analytical chall-enges. To address these challenges, we develop a Poisson-equation-based decomposition of the Markovian noise associated with a lazy transition matrix, separating it into a martingale-difference term and residual terms. We then control the residual terms through a sensitivity analysis of the Poisson equation solution with respect to both the Q-function estimate and the learning policy. These techniques may facilitate the analysis of other reinforcement learning algorithms with rapidly time-varying learning policies, such as single-timescale actor--critic methods and learning-in-games algorithms.

BIOGRAPHY

Phalguni Nanda is a PhD student in the Edwardson School of Industrial Engineering, Purdue University, where he is advised by Dr. Zaiwei Chen. His research develops rigorous non-asymptotic analyses that characterize sample complexity and learning dynamics in reinforcement learning, with the goal of designing algorithms that are practically effective and provably efficient in high-dimensional stochastic settings for sequential decision making. His work has been selected as a Best Paper Award Finalist at the ACM SIGMETRICS 2026. Phalguni also holds a MS in Mathematical Statistics from Purdue University. Prior to Purdue, he was a Project Assistant in the Department of Mathematics of Birla Institute of Technology and Science Pilani - Hyderabad Campus, India. Earlier, he obtained his B.Sc. and M.Sc. degrees in Mathematics from the National Institute of Technology, Rourkela, India.