Math6380p: Deep Learning

PKU

MATH 6380P. Advanced Topics in Deep Learning
Fall 2020

Course Information

Synopsis

This course is a continuition of Math 6380o, Spring 2018, inspired by Stanford Stats 385, Theories of Deep Learning, taught by Prof. Dave Donoho, Dr. Hatef Monajemi, and Dr. Vardan Papyan, as well as the Simons Institute program on Foundations of Deep Learning in the summer of 2019 and IAS@HKUST workshop on Mathematics of Deep Learning during Jan 8-12, 2018. The aim of this course is to provide graduate students who are interested in deep learning a variety of understandings on neural networks that are currently available to foster future research.
Prerequisite: there is no prerequisite, though mathematical maturity on approximation theory, harmonic analysis, optimization, and statistics will be helpful. Do-it-yourself (DIY) and critical thinking (CT) are the most important things in this course. Enrolled students should have some programming experience with modern neural networks, such as PyTorch, Tensorflow, MXNet, Theano, and Keras, etc. Otherwise, it is recommended to take some courses on Statistical Learning (Math 4432 or 5470), and Deep learning such as Stanford CS231n with assignments, or a similar course COMP4901J by Prof. CK TANG at HKUST.

Reference

Theories of Deep Learning, Stanford STATS385 by Dave Donoho, Hatef Monajemi, and Vardan Papyan

Foundations of Deep Learning, by Simons Institute for the Theory of Computing, UC Berkeley

On the Mathematical Theory of Deep Learning, by Gitta Kutyniok

Tutorials: preparation for beginners

Python-Numpy Tutorials by Justin Johnson

scikit-learn Tutorials: An Introduction of Machine Learning in Python

Jupyter Notebook Tutorials

PyTorch Tutorials

Deep Learning: Do-it-yourself with PyTorch, A course at ENS

Tensorflow Tutorials

MXNet Tutorials

Theano Tutorials

Manning: Deep Learning with Python, by Francois Chollet [GitHub source in Python 3.6 and Keras 2.0.8]

MIT: Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville

Instructors:

Yuan Yao

Time and Place:

Wed 3:00PM - 5:50PM, Zoom

Homework and Projects:

No exams, but extensive discussions and projects will be expected.

Schedule

Date	Topic	Instructor	Scriber
09/09/2020, Wednesday	Lecture 01: Overview I [ slides ]	Y.Y.
09/16/2020, Wednesday	Lecture 02: Symmetry and Network Architectures: Wavelet Scattering Net, DCFnet, Frame Scattering, and Permutation Invariant/Equivariant Nets [ slides ] and Project 1. [Reference]: Vardan Papyan, X.Y. Han, David L. Donoho, Prevalence of Neural Collapse during the terminal phase of deep learning training, arXiv:2008.08186. Stephane Mallat's short course on Mathematical Mysteries of Deep Neural Networks: [ Part I video ], [ Part II video ], [ slides ] Stephane Mallat, Group Invariant Scattering, Communications on Pure and Applied Mathematics, Vol. LXV, 1331–1398 (2012) Joan Bruna and Stephane Mallat, Invariant Scattering Convolution Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012 Thomas Wiatowski and Helmut Bolcskei, A Mathematical Theory of Deep Convolutional Neural Networks for Feature Extraction, 2016. Qiang Qiu, Xiuyuan Cheng, Robert Calderbank, Guillermo Sapiro, DCFNet: Deep Neural Network with Decomposed Convolutional Filters, ICML 2018. arXiv:1802.04145. Taco S. Cohen, Max Welling, Group Equivariant Convolutional Networks, ICML 2016. arXiv:1602.07576. Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, Alexander Smola. Deep Sets , NIPS, 2017. arXiv:1703.06114. Akiyoshi Sannai, Yuuki Takai, Matthieu Cordonnier. Universal approximations of permutation invariant/equivariant functions by deep neural networks , NIPS, 2017. arXiv:1903.01939. Haggai Maron, Heli Ben-Hamu, Nadav Shamir, Yaron Lipman. Invariant and Equivariant Graph Networks . ICLR 2019. arXiv:1812.09902 [Public codes]: Scattering Net Matlab codes pyscatwave: Scattering Transform in Python Deep Hybrid Transform in Python DCFNet Data from "Prevalence of Neural Collapse during the terminal phase of deep learning training"	Y.Y.
09/23/2020, Wednesday	Lecture 03: Generalization of Deep Learning. [ slides ]. [Title]: From Classical Statistics to Modern Machine Learning [ slide ] [Speaker]: Misha Belkin (OSU) [Abstract]: A model with zero training error is overfit to the training data and will typically generalize poorly" goes statistical textbook wisdom. Yet, in modern practice, over-parametrized deep networks with near perfect fit on training data still show excellent test performance. As I will discuss in the talk, this apparent contradiction is key to understanding the practice of modern machine learning. While classical methods rely on a trade-off balancing the complexity of predictors with training error, modern models are best described by interpolation, where a predictor is chosen among functions that fit the training data exactly, according to a certain (implicit or explicit) inductive bias. Furthermore, classical and modern models can be unified within a single "double descent" risk curve, which extends the classical U-shaped bias-variance curve beyond the point of interpolation. This understanding of model performance delineates the limits of the usual ''what you see is what you get" generalization bounds in machine learning and points to new analyses required to understand computational, statistical, and mathematical properties of modern models. I will proceed to discuss some important implications of interpolation for optimization, both in terms of "easy" optimization (due to the scarcity of non-global minima), and to fast convergence of small mini-batch SGD with fixed step size. [Video] [ Simons link ] [ Bilibili link ] [ MIT CBMM: Beyond Empirical Risk Minimization: the lessons of deep learning ] and an [ interview with Tommy Poggio ] [Reference] Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal. Reconciling modern machine learning practice and the bias-variance trade-off. PNAS, 2019, 116 (32). [ arXiv:1812.11118 ] Mikhail Belkin, Alexander Rakhlin, Alexandre B Tsybakov. Does data interpolation contradict statistical optimality? AISTATS, 2019. [ arXiv:1806.09471 ] Mikhail Belkin, Daniel Hsu, Partha Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. Neural Inf. Proc. Systems (NeurIPS) 2018. [ arXiv:1806.05161 ] [Title]: Generalization of linearized neural networks: staircase decay and double descent [ slide ] [Speaker]: MEI, Song (UC Berkeley) [Abstract]: Deep learning methods operate in regimes that defy the traditional statistical mindset. Despite the non-convexity of empirical risks and the huge complexity of neural network architectures, stochastic gradient algorithms can often find the global minimizer of the training loss and achieve small generalization error on test data. As one possible explanation to the training efficiency of neural networks, tangent kernel theory shows that a multi-layers neural network — in a proper large width limit — can be well approximated by its linearization. As a consequence, the gradient flow of the empirical risk turns into a linear dynamics and converges to a global minimizer. Since last year, linearization has become a popular approach in analyzing training dynamics of neural networks. However, this naturally raises the question of whether the linearization perspective can also explain the observed generalization efficacy. In this talk, I will discuss the generalization error of linearized neural networks, which reveals two interesting phenomena: the staircase decay and the double-descent curve. Through the lens of these phenomena, I will also address the benefits and limitations of the linearization approach for neural networks. [Video] [ HKUST Zoom ] [Reference] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. [ arXiv:1908.05355 ] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension. [ arXiv:1904.12191 ]	Y.Y.
09/30/2020, Wednesday	Lecture 04: Generalization in Deep Learning: I and II. Introduction and Uniform Law of Large Numbers. [ slides ] [Speaker]: Sasha Rakhlin (Massachusetts Institute of Technology) and Peter Bartlett (UC Berkeley) [Abstract]: We review tools useful for the analysis of the generalization performance of deep neural networks on classification and regression problems. We review uniform convergence properties, which show how this performance depends on notions of complexity, such as Rademacher averages, covering numbers, and combinatorial dimensions, and how these quantities can be bounded for neural networks. We also review the analysis of the performance of nonparametric estimation methods such as nearest-neighbor rules and kernel smoothing. Deep networks raise some novel challenges, since they have been observed to perform well even with a perfect fit to the training data. We review some recent efforts to understand the performance of interpolating prediction rules, and highlight the questions raised for deep learning. [Video] Deep Learning Bootcamp, Simons Institute for the Theory of Computing at UC Berkeley, 2019 [ Weibo collection ] [ Part I ] [ Bilibili link ] [ Part II ] [ Bilibili link ]	Y.Y.
10/07/2020, Wednesday	Lecture 05: Generalization in Deep Learning: III. Classification and Rademacher Complexity [ slides ] [Speaker]: Sasha Rakhlin (Massachusetts Institute of Technology) [Abstract]: We review tools useful for the analysis of the generalization performance of deep neural networks on classification and regression problems. We review uniform convergence properties, which show how this performance depends on notions of complexity, such as Rademacher averages, covering numbers, and combinatorial dimensions, and how these quantities can be bounded for neural networks. We also review the analysis of the performance of nonparametric estimation methods such as nearest-neighbor rules and kernel smoothing. Deep networks raise some novel challenges, since they have been observed to perform well even with a perfect fit to the training data. We review some recent efforts to understand the performance of interpolating prediction rules, and highlight the questions raised for deep learning. [Video] Deep Learning Bootcamp, Simons Institute for the Theory of Computing at UC Berkeley, 2019 [ Part III ] [ Bilibili link ] [ Title ]: Rethinking Breiman's Dilemma in Neural Networks: Phase Transitions of Margin Dynamics. [ slide ] [Speaker]: ZHU, Weizhi (HKUST) [Abstract]: Margin enlargement over training data has been an important strategy since perceptrons in machine learning for the purpose of boosting the confidence of training toward a good generalization ability. Yet Breiman shows a dilemma (Breiman, 1999) that a uniform improvement on margin distribution does not necessarily reduces generalization errors. In this paper, we revisit Breiman’s dilemma in deep neural networks with recently proposed spectrally normalized margins, from a novel perspective based on phase transitions of normalized margin distributions in training dynamics. Normalized margin distribution of a classifier over the data, can be divided into two parts: low/small margins such as some negative margins for misclassified samples vs. high/large margins for high confident correctly classified samples, that often behave differently during the training process. Low margins for training and test datasets are often effectively reduced in training, along with reductions of training and test errors; while high margins may exhibit different dynamics, reflecting the trade-off between expressive power of models and complexity of data. When data complexity is comparable to the model expressiveness, high margin distributions for both training and test data undergo similar decrease-increase phase transitions during training. In such cases, one can predict the trend of generalization or test error by margin-based generalization bounds with restricted Rademacher complexities, shown in two ways in this paper with early stopping time exploiting such phase transitions. On the other hand, over-expressive models may have both low and high training margins undergoing uniform improvements, with a distinct phase transition in test margin dynamics. This reconfirms the Breiman’s dilemma associated with overparameterized neural networks where margins fail to predict overfitting. Experiments are conducted with some basic convolutional networks, AlexNet, VGG-16, and ResNet-18, on several datasets including Cifar10/100 and mini-ImageNet. [Reference] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals, Understanding deep learning requires rethinking generalization. ICLR 2017. [Chiyuan Zhang's codes] Peter L. Bartlett, Dylan J. Foster, Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. [ arXiv:1706.08498 ]. NIPS 2017. Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. [ arXiv:1707.09564 ]. International Conference on Learning Representations (ICLR), 2018. Noah Golowich, Alexander (Sasha) Rakhlin, Ohad Shamir. Size-Independent Sample Complexity of Neural Networks. [ arXiv:1712.06541 ]. COLT 2018. Weizhi Zhu, Yifei Huang, Yuan Yao. Rethinking Breiman's Dilemma in Neural Networks: Phase Transitions of Margin Dynamics. [ arXiv: 1810.03389 ]. Vaishnavh Nagarajan, J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. [ arXiv:1902.04742 ]. NIPS 2019. [ Github ]. (It argues that all the generalization bounds above might fail to explain generalization in deep learning)	Y.Y.
10/14/2019, Wednesday	Lecture 06: Generalization in Deep Learning: IV. Interpolation. [ slides ] [Speaker]: Peter Bartlett (UC Berkeley) [Abstract]: We review tools useful for the analysis of the generalization performance of deep neural networks on classification and regression problems. We review uniform convergence properties, which show how this performance depends on notions of complexity, such as Rademacher averages, covering numbers, and combinatorial dimensions, and how these quantities can be bounded for neural networks. We also review the analysis of the performance of nonparametric estimation methods such as nearest-neighbor rules and kernel smoothing. Deep networks raise some novel challenges, since they have been observed to perform well even with a perfect fit to the training data. We review some recent efforts to understand the performance of interpolating prediction rules, and highlight the questions raised for deep learning. [Video] Deep Learning Bootcamp, Simons Institute for the Theory of Computing at UC Berkeley, 2019 [ Part IV ] [ Bilibili link ] [Title]: Benign Overfitting in Linear Prediction [Speaker]: Peter Bartlett (UC Berkeley) [Abstract]: Classical theory that guides the design of nonparametric prediction methods like deep neural networks involves a tradeoff between the fit to the training data and the complexity of the prediction rule. Deep learning seems to operate outside the regime where these results are informative, since deep networks can perform well even with a perfect fit to noisytraining data. We investigate this phenomenon of 'benign overfitting' in the simplest setting, that of linear prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of effective rank of the data covariance. It shows that overparameterization is essential: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size. We discuss implications for deep networks and for robustness to adversarial examples. Joint work with Phil Long, Gábor Lugosi, and Alex Tsigler. [Video] [ Simons link ] [ Bilibili link ] [Reference] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, Alexander Tsigler. Benign Overfitting in Linear Regression. arXiv:1906.11300 [ Gallery of Project 1 ]: Description of Project 1 FAN, Ganghua . [ report ] [ source ] FANG Linjiajie (20382284), Liu Yiyuan (20568864), Wang Qiyue (20672641), Wang Ya (20549569). [ report (pdf) ] [ source ] Hao HE, He CAO, Yue GUO, Haoyi CHENG. [ report (html) ] [ source ] WU Huimin, HE Changxiang. [ report (pdf) ] [ source ] VU Tuan-Anh . [ report (html) ] [ source ] CAO Yang, WU Jiamin . [ report (pptx) ] [ source ] DU, Yipai, Yongquan QU . [ report (pdf) ] [ source ] Zheyue Fang, Chutian Huang, Yue WU, and Lu YANG . [ report (pdf) ] [ source ] Shizhe Diao, Jincheng Yu, Duo Li, Yimin Zheng . [ report (pdf) ] Kai WANG, Weizhen DING . [ report (pdf) ] Tony C. W. Mok, Jierong Wang . [ report (pdf) ] Rongrong GAO (20619663), Junming CHEN (20750649), Zifan SHI (20619455) . [ report (pdf) ] Samuel Cahyawijaya, Etsuko Ishii, Ziwei Ji, Ye Jin Bang . [ report (pdf) ] Hanli Huang . [ report (ipynb) ] ABDULLAH, Murad . [ report (pdf) ] PANG Hong Wing and WONG, Yik Ben . [ report (pdf) ] [ source ]	Y.Y.
10/21/2020, Wednesday	Lecture 07: Is Optimization a Sufficient Language for Understanding Deep Learning? [ link ] [Speaker]: Sanjeev Arora (Princeton University) [Abstract]: In this Deep Learning era, machine learning usually boils down to defining a suitable objective/cost function for the learning task at hand, and then optimizing this function using some variant of gradient descent (implemented via backpropagation). Little wonder that hundreds of ML papers each year are devoted to various aspects of optimization. Today I will suggest that if our goal is mathematical understanding of deep learning, then the optimization viewpoint is potentially insufficient — at least in the conventional view. [Video] [ IAS Princeton ] [Seminars]: FANG Linjiajie (20382284), Liu Yiyuan (20568864), Wang Qiyue (20672641), Wang Ya (20549569) WU Huimin, HE Changxiang CAO Yang, WU Jiamin	Y.Y.
10/28/2020, Wednesday	Lecture 08: Final Project [ pdf ] [Title]: Compression and Acceleration of Pre-trained Language Models [ slide ] [Speaker]: Dr. Lu HOU, Huawei Noah’s Ark Lab [Abstract]: Recently, pre-trained language models based on the Transformer structure like BERT and RoBERTa have achieved remarkable results on various natural language processing tasks and even some computer vision tasks. However, these models have many parameters, hindering their deployment on edge devices with limited storage. In this talk, I will first introduce some basics about pre-trained language modeling and our proposed pre-trained language model NEZHA. Then I will elaborate on how we alleviate the concerns in various deployment scenarios during the inference and training period. Specifically, compression and acceleration methods using knowledge distillation, dynamic networks, and network quantization will be discussed. Finally, I will also discuss some recent progress about training deep networks on edge through quantization. [Bio]: Dr. Lu HOU is a researcher at the Speech and Semantics Lab in Huawei Noah's Ark Lab. She obtained Ph.D. from Hong Kong University of Science and Technology in 2019, under the supervision of Prof. James T. Kwok. Her current research interests include compression and acceleration of deep neural networks, natural language processing, and deep learning optimization.	Y.Y.
11/04/2020, Wednesday	Lecture 09: Overparameterization and Optimization [ slides ] [Speaker]: Prof. Jason Lee, Princeton University [Abstract]: We survey recent developments in the optimization and learning of deep neural networks. The three focus topics are on: 1) geometric results for the optimization of neural networks, 2) Overparametrized neural networks in the kernel regime (Neural Tangent Kernel) and its implications and limitations, 3) potential strategies to prove SGD improves on kernel predictors. [Video] Deep Learning Bootcamp, Simons Institute for the Theory of Computing at UC Berkeley, 2019. [ Weibo collection ] [ Part I ] [ Bilibili link] [ Part II ] [ Bilibili link]	Y.Y.
11/11/2020, Wedneday	Lecture 10: Implicit Regularization [Speaker]: Nati Srebro (TTI at University of Chicago) [Abstract]: We review the implicit regularization of gradient descent type algorithms in machine learning. [Video] Deep Learning Bootcamp, Simons Institute for the Theory of Computing at UC Berkeley, 2019. [ Weibo link ] [ Part I ] [ Bilibili link ] [ Part II ] [ Bilibili link ] [Reference] Inductive Bias and Optimization in Deep Learning at Stanford Stats385 Matus Telgarsky. A Primal-dual Analysis of Margin Maximization by Steepest Descent Methods Simons Institute Behnam Neyshabur, Ryota Tomioka, Nathan Srebro. In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning. [ arXiv:1412.6614 ] Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, Nathan Srebro. Geometry of Optimization and Implicit Regularization in Deep Learning. [ arXiv: 1705.03071] An older paper that takes a higher level view of what might be going on and what we want to try to achieve. Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, Nathan Srebro. The Implicit Bias of Gradient Descent on Separable Data. [ arXiv:1710.10345 ]. ICLR 2018. Gradient descent on logistic regression leads to max margin. Matus Telgarsky. Margins, Shrinkage, and Boosting. [ arXiv:1303.4172 ]. ICML 2013. An older paper on gradient descent on exponential/logistic loss leads to max margin. Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, Nathan Srebro. Implicit Regularization in Matrix Factorization. [ arXiv:1705.09280 ] Yuanzhi Li, Tengyu Ma, Hongyang Zhang. Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations. [ arXiv:1712.09203 ] Blake Woodworth, Suriya Gunasekar, Pedro Savarese, Edward Moroshko, Itay Golan, Jason Lee, Daniel Soudry, Nathan Srebro. Kernel and Rich Regimes in Overparametrized Models. [ arXiv:1906.05827 ] Suriya Gunasekar, Jason Lee, Daniel Soudry, Nathan Srebro. Characterizing Implicit Bias in Terms of Optimization Geometry [ arXiv:1802.08246 ] Mor Shpigel Nacson, Suriya Gunasekar, Jason D. Lee, Nathan Srebro, Daniel Soudry. Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models [ arXiv:1905.07325 ] a generalization of Implicit regularization in linear conv nets: https://arxiv.org/abs/1806.00468 Greg Ongie, Rebecca Willett, Daniel Soudry, Nathan Srebro. A Function Space View of Bounded Norm Infinite Width ReLU Nets: The Multivariate Case [ arXiv:1910.01635 ] Inductive bias in infinite-width ReLU networks of high dimensionality	Y.Y.
11/18/2020, Wednesday	Lecture 11: seminars. [Title]: Theory of Deep Convolutional Neural Networks. [ slides ] [Speaker]: Ding-Xuan ZHOU, City University of Hong Kong. [Time]: 3:00-4:20pm [Abstract]: Deep learning has been widely applied and brought breakthroughs in speech recognition, computer vision, and many other domains. The involved deep neural network architectures and computational issues have been well studied in machine learning. But there lacks a theoretical foundation for understanding the modelling, approximation or generalization ability of deep learning models with network architectures. Here we are interested in deep convolutional neural networks (CNNs) with convolutional structures. The convolutional architecture gives essential differences between the deep CNNs and fully-connected deep neural networks, and the classical theory for fully-connected networks developed around 30 years ago does not apply. This talk describes a mathematical theory of deep CNNs associated with the rectified linear unit (ReLU) activation function. In particular, we give the first proof for the universality of deep CNNs, meaning that a deep CNN can be used to approximate any continuous function to an arbitrary accuracy when the depth of the neural network is large enough. We also give explicit rates of approximation, and show that the approximation ability of deep CNNs is at least as good as that of fully-connected multi-layer neural networks for general functions, and is better for radial functions. Our quantitative estimate, given tightly in terms of the number of free parameters to be computed, verifies the efficiency of deep CNNs in dealing with big data. [Bio]: Ding-Xuan Zhou is a Chair Professor in School of Data Science and Department of Mathematics at City University of Hong Kong, serving also as Associate Dean of School of Data Science, and Director of the Liu Bie Ju Centre for Mathematical Sciences. His recent research interest is deep learning theory. He is an Editor-in-Chief of the journals ''Analysis and Application'' and ''Mathematical Foundations of Computing'', and serves editorial boards of more than ten journals. He received a Fund for Distinguished Young Scholars from NSF of China in 2005, and was rated in 2014-2017 by Thomson Reuters/Clarivate Analytics as a Highly-cited Researcher. [Reference] [ video ] [Title]: Analyzing Optimization and Generalization in Deep Learning via Dynamics of Gradient Descent [ slides ] [Speaker]: Nadav Cohen, Tel Aviv University. [Time]: 4:30-5:50pm [Abstract]: Understanding deep learning calls for addressing the questions of: (i) optimization --- the effectiveness of simple gradient-based algorithms in solving neural network training programs that are non-convex and thus seemingly difficult; and (ii) generalization --- the phenomenon of deep learning models not overfitting despite having many more parameters than examples to learn from. Existing analyses of optimization and/or generalization typically adopt the language of classical learning theory, abstracting away many details on the setting at hand. In this talk I will argue that a more refined perspective is in order, one that accounts for the dynamics of the optimizer. I will then demonstrate a manifestation of this approach, analyzing the dynamics of gradient descent over linear neural networks. We will derive what is, to the best of my knowledge, the most general guarantee to date for efficient convergence to global minimum of a gradient-based algorithm training a deep network. Moreover, in stark contrast to conventional wisdom, we will see that sometimes, adding (redundant) linear layers to a classic linear model significantly accelerates gradient descent, despite the introduction of non-convexity. Finally, we will show that such addition of layers induces an implicit bias towards low rank (different from any type of norm regularization), and by this explain generalization of deep linear neural networks for the classic problem of low rank matrix completion. Works covered in this talk were in collaboration with Sanjeev Arora, Noah Golowich, Elad Hazan, Wei Hu, Yuping Luo and Noam Razin. [Reference] [ video ] Noam Razin and Nadav Cohen. Implicit Regularization in Deep Learning May Not Be Explainable by Norms. Conference on Neural Information Processing Systems (NeurIPS) 2020. Sanjeev Arora, Nadav Cohen, Wei Hu and Yuping Luo (alphabetical order). Implicit Regularization in Deep Matrix Factorization. Conference on Neural Information Processing Systems (NeurIPS) 2019. Sanjeev Arora, Nadav Cohen, Noah Golowich and Wei Hu (alphabetical order). A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks. International Conference on Learning Representations (ICLR) 2019. Sanjeev Arora, Nadav Cohen and Elad Hazan (alphabetical order). On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization. International Conference on Machine Learning (ICML) 2018.
11/25/2020, Wednesday	Lecture 12: Mean Field Theory for Neural Networks. [Title]: Mean Field Theory and Tangent Kernel Theory in Neural Networks. [ slides ] [Speaker]: Song Mei, University of California at Berkeley. [Time]: 3:00-4:20pm [Abstract]: Deep neural networks trained with stochastic gradient algorithms often achieve near vanishing training error, and generalize well on test data. Such empirical success of optimization and generalization, however, is quite surprising from a theoretical point of view, mainly due to non-convexity and overparameterization of deep neural networks. In this lecture, I will talk about the mean field theory and the tangent kernel theory on the training dynamics of neural networks, and discuss about their benefits and shortcomings in terms of both optimization and generalization.Then I will analyze the generalization error of linearized neural networks with two interesting phenomena: staircase and double-descent. Finally, I will propose challenges and open problems in analyzing deep neural networks. [Bio]: [Reference] [ video ] Mei, Montanari, and Nguyen. A mean field view of the landscape of two-layers neural networks. Proceedings of the National Academy of Sciences 115, E7665-E7671. Rotskoff and Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv:1805.00915. Chizat and Bach. On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport. Advances in neural information processing systems, 2018, pp. 3036–3046. Jacot, Gabriel, and Hongler. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. Advances in neural information processing systems, 2018, pp. 8571–8580. Belkin, Hsu, Ma, and Mandal. Reconciling modern machine learning practice and the bias-variance trade-off. Proceedings of the National Academy of Sciences 116.32 (2019): 15849-15854. Bach. Breaking the Curse of Dimensionality with Convex Neural Networks. The Journal of Machine Learning Research 18 (2017), no. 1, 629–681. Ghorbani, Mei, Misiakiewicz, and Montanari. Linearized two-layers neural networks in high dimension. arXiv:1904.12191. Hastie, Montanari, Rosset, and Tibshirani. Surprises in High-Dimensional Ridgeless Least Squares Interpolation. arXiv:1903.08560. Mei and Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv:1904.12191. [Title]: A mean-field theory for certain deep neural networks [Speaker]: Roberto I. Oliveira, IMPA. [Time]: 4:30-5:50pm [Abstract]: A natural approach to understand overparameterized deep neural networks is to ask if there is some kind of natural limiting behavior when the number of neurons diverges. We present a rigorous limit result of this kind for for networks with complete connections and "random-feature-style" first and last layers. Specifically, we show that network weights are approximated by certain "ideal particles" whose distribution and dependencies are described by McKean-Vlasov mean-field model. We will present the intuition behind our approach; sketch some of the key technical challenges along the way; and connect our results to some of the recent literature on the topic. [Reference] [ video ] Dyego Araújo, Roberto I. Oliveira, Daniel Yukimura. A mean-field limit for certain deep neural networks. arXiv:1906.00193. Justin Sirignano, Konstantinos Spiliopoulos. "Mean field analysis of deep neural networks", 2020, Mathematics of Operations Research, [ArXiv:1903.04440], to appear. Jean-François Jabir, David Šiška, Łukasz Szpruch. Mean-Field Neural ODEs via Relaxed Optimal Control arXiv:1912.05475. Phan-Minh Nguyen, Huy Tuan Pham. A Rigorous Framework for the Mean Field Limit of Multilayer Neural Networks. arXiv:2001.11443. Weinan E, Stephan Wojtowytsch. On the Banach spaces associated with multi-layer ReLU networks: Function representation, approximation theory and gradient descent dynamics. arXiv:2007.15263.
12/02/2020, Wednesday	Lecture 13: seminars. [Title]: Learning assisted modeling of molecules and materials. [ slides ] [Speaker]: Linfeng ZHANG, Beijing Institute of Big Data Research and Princeton University. [Time]: 3:00-4:00pm [Abstract]: In recent years, machine learning (ML) has emerged as a promising tool for dealing with the difficulty of representing high dimensional functions. This gives us an unprecedented opportunity to revisit theoretical foundations of various scientific fields and solve problems that were too complicated for conventional approaches to address. Here we identify a list of such problems in the context of multi-scale molecular and materials modeling and review ML-based strategies that boost simulations with ab initio accuracy to much larger scales than conventional approaches. Using examples at scales of many-electron Schrödinger equation, density functional theory, and molecular dynamics, we present two equally important principles: 1) ML-based models should respect important physical constraints in a faithful and adaptive way; 2) to build truly reliable models, efficient algorithms are needed to explore relevant physical space and construct optimal training data sets. Finally, we present our efforts on developing related software packages and high-performance computing schemes, which have now been widely used worldwide by experts and practitioners in the molecular and materials simulation community. [Bio]: Linfeng Zhang is temporarily working as a research scientist at the Beijing Institute of Big Data Research. In the May of 2020, he graduated from the Program in Applied and Computational Mathematics (PACM), Princeton University, working with Profs. Roberto Car and Weinan E. Linfeng has been focusing on developing machine learning based physical models for electronic structures, molecular dynamics, as well as enhanced sampling. He is one of the main developers of DeePMD-kit, a very popular deep learning based open-source software for molecular simulation in physics, chemistry, and materials science. He is a recipient of the 2020 ACM Gordon Bell Prize for their project, “Pushing the limit of molecular dynamics with ab initio accuracy to 100 million atoms with machine learning.” [Reference] Weile Jia, Han Wang, Mohan Chen, Denghui Lu, Lin Lin, Roberto Car, Weinan E, Linfeng Zhang. Pushing the limit of molecular dynamics with ab initio accuracy to 100 million atoms with machine learning. arXiv:2005.00223 [Title]: Robust Estimation via Generative Adversarial Networks. [ slides ] [Speaker]: Weizhi ZHU, HKUST. [Time]: 4:00-5:00pm [Abstract]: Robust estimation under Huber's -contamination model has become an important topic in statistics and theoretical computer science. Rate-optimal procedures such as Tukey's median and other estimators based on statistical depth functions are impractical because of their computational intractability. In this talk, we establish an intriguing connection between f-GAN, various depth functions and proper scoring rules. Similar to the derivation of f-GAN, we show that these depth functions that lead to rateoptimal robust estimators can all be viewed as variational lower bounds of the total variation distance in the framework of f-Learning. [Reference] GAO, Chao, Jiyu LIU, Yuan YAO, and Weizhi ZHU. Robust Estimation and Generative Adversarial Nets. [ arXiv:1810.02030 ] [ GitHub ] [ GAO, Chao's Simons Talk ] GAO, Chao, Yuan YAO, and Weizhi ZHU. Generative Adversarial Nets for Robust Scatter Estimation: A Proper Scoring Rule Perspective. Journal of Machine Learning Research, 21(160):1-48, 2020. [ arXiv:1903.01944 ] [ GitHub ] [Title]: Towards a mathematical understanding of supervised learning: What we know and what we don't know [ slides ] [Speaker]: Weinan E, Princeton University. [Time]: 4:30-5:50pm [Abstract]: Two of the biggest puzzles in machine learning are: Why is it so successful and why is it quite fragile? This talk will present a framework for unraveling these puzzles from the perspective of approximating functions in high dimensions. We will discuss what's known and what's not known about the approximation generalization properties of neural network type of hypothesis space as well as the dynamics and generalization properties of the training process. We will also discuss the relative merits of shallow vs. deep neural network models and suggest ways to formulate more robust machine learning models. This is joint work with Chao Ma, Stephan Wojtowytsch and Lei Wu. [Reference] [ video ] [ Gallery of Project 2 ]: Description of Project 2 (Final Project) Kaggle in-class contest on semi-conductor image classification 2 mini: [ link ] PANG, Hong Wing and Wong, Yik Ben . 1. Can Object Dectectors Generalize? [ poster ] [ video ] Ye Jin Bang, Etsuko Ishii, Samuel Cahyawijaya, and Ziwei Ji . 2. Model Generalization on COVID19 Fake News Detection [ report ] [ slides ] [ source ] [ video ] Zheyue FANG, Chutian HUANG, Yue WU, and Lu YANG . 3. Home Credit Default Risk Project [ report ] [ source ] [ video ] Yipai Du and Yongquan Qu . 4. Interpretability of Deep Learning on Home Credit Default Risk Dataset [ poster ] [ slides ] [ source ] [ video ] Shizhe Diao, Jincheng Yu, Duo Li, and Yimin Zheng . 5. Improving Batch Normalization via Scaling and Shifting Relay [ poster ] [ slides ] [ video ] ABDULLAH, Murad . 6. Classification of Nexperia Image Dataset: An Averaging Ensemble Approach [ report ] [ source ] [ video ] HE, Changxiang and XU, Yan . 7. Nexperia Image Classification [ report ] [ slides ] [ video ] Ganghua Fan . 8. Kaggle in-class Contest: Nexperia Image Classication II [ poster ] [ slides ] [ codes ] [ video ] Hanli Huang . 9.Semi-conductor defect images classification [ poster (pptx) ] [ slides (pptx) ] [ codes ] [ video ] Huimin Wu . 10. Nexperia Image Classification II with Noise Handling [ poster ] [ video ] FANG Linjiajie, Liu Yiyuan, Wang Qiyue, and Wang Ya . 11. Solving Semi-Conductor Classification Problem by Light-weighted Model with Stratified Convolutions [ report ] [ slides ] [ codes ] [ video ] Rongrong GAO, Junming CHEN, and Zifan SHI . 12. Nexperia Image Classification [ report ] [ slides ] [ codes ] [ video ] Tony C.W. Mok and Jierong Wang . 13. Toward Fast and Accurate Semi-conductor Image Classification using Deep Convolutional Neural Networks [ report ] [ codes ] Tuan Anh VU . 14. Anomaly Detection using Transfer Learning in Semiconductors [ report ] [ slides ] [ codes ] [ video ] Yang Cao and Jiamin Wu . 15. Nexperia Image Classification [ report ] [ slides ] [ codes ] [ video ] Yue Guo, Hao He, He Cao, and Haoyi Cheng (DreamDragon). 16. Image Classification of Semiconductors [ report ] [ slides (pptx) ] [ codes ] [ video ] Kai Wang and Weizhen Ding . 17. Defect Detection in Semi conductor Images [ report ] [ slides (pptx) ] [ codes ] [ video ]

by YAO, Yuan.