Deep Learning

Student-Teacher Learning from Clean Inputs to Noisy Inputs
Computer Vision and Pattern Recognition (CVPR), 2021. (Acceptance rate = 27%)

Feature-based student-teacher learning, a training method that encourages the student's hidden features to mimic those of the teacher network, is empirically successful in transferring knowledge from a pretrained teacher network to a student network. Furthermore, recent empirical results demonstrate that the teacher's features can boost the student network's generalization even when the student's input sample is corrupted by noise.
However, there is a lack of theoretical insights into why and when this
method of transferring knowledge can be successful between such heterogeneous
tasks. We analyze this method theoretically using deep linear networks, and
experimentally using nonlinear networks. We identify three vital factors to
the success of the method: (1) whether the student is trained to zero
training loss; (2) how knowledgeable the teacher is on the clean-input
problem; (3) how the teacher decomposes its knowledge in its hidden features.
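As an illustration of the setting analyzed here, the following is a minimal sketch (not the paper's implementation) of feature-based student-teacher training with two-layer deep linear networks: the teacher sees clean inputs, the student sees noisy inputs, and a feature-mimicking penalty pulls the student's hidden features toward the teacher's. All dimensions, the noise level, and the weight `lam` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-layer linear networks (all sizes are assumptions,
# not taken from the paper).
d, h, n = 8, 4, 256
W1_t = rng.normal(size=(h, d))                 # "pretrained" teacher, layer 1
W2_t = rng.normal(size=(1, h))                 # teacher, layer 2
X = rng.normal(size=(d, n))                    # clean inputs (teacher's input)
y = W2_t @ (W1_t @ X)                          # teacher's clean-input outputs
X_noisy = X + 0.1 * rng.normal(size=X.shape)   # corrupted inputs (student's input)

W1_s = 0.1 * rng.normal(size=(h, d))           # student, layer 1
W2_s = 0.1 * rng.normal(size=(1, h))           # student, layer 2
lam, lr = 1.0, 5e-3                            # feature-mimicking weight, step size

def pred_loss():
    """Student's error at reproducing the teacher's outputs from noisy inputs."""
    return 0.5 * np.mean((W2_s @ (W1_s @ X_noisy) - y) ** 2)

init_loss = pred_loss()
for _ in range(2000):
    H_t = W1_t @ X                             # teacher's hidden features (clean)
    H_s = W1_s @ X_noisy                       # student's hidden features (noisy)
    # gradients of (1/2n)||W2_s H_s - y||^2 + (lam/2n)||H_s - H_t||^2
    g_out = (W2_s @ H_s - y) / n
    g_feat = W2_s.T @ g_out + lam * (H_s - H_t) / n
    W2_s -= lr * g_out @ H_s.T
    W1_s -= lr * g_feat @ X_noisy.T

final_loss = pred_loss()
print(f"prediction loss: {init_loss:.3f} -> {final_loss:.3f}")
```

Setting `lam = 0` removes the feature-mimicking term and recovers plain output-matching training, which is one natural baseline for isolating the effect of the teacher's hidden features.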
Lack of proper control in any of the three factors leads to failure of the student-teacher learning method.

One Size Fits All: Can We Train One Denoiser for All Noise Levels?
International Conference on Machine Learning (ICML), 2020. (Acceptance rate = 21%)

When training an estimator such as a neural network for tasks like image denoising, it is generally preferred to train one estimator and apply it to all noise levels. The de facto training protocol to achieve this
goal is to train the estimator with noisy samples whose noise levels are
uniformly distributed across the range of interest. However, why should we
allocate the samples uniformly? Can we have more training samples that are
less noisy, and fewer samples that are more noisy? What is the optimal
distribution? How do we obtain such a distribution? The goal of this paper is
to address this training sample distribution problem from a minimax risk
optimization perspective. We derive a dual ascent algorithm to determine the
optimal sampling distribution; the algorithm's convergence is guaranteed as long
as the set of admissible estimators is closed and convex. For estimators with
nonconvex admissible sets such as deep neural networks, our dual formulation
converges to a solution of the convex relaxation. We discuss how the
algorithm can be implemented in practice. We evaluate the algorithm on linear
estimators and deep networks.

ConsensusNet: Optimal Combination of Image Denoisers
