Date | Topic | Instructor | Scriber |
09/05/2019, Thursday |
Lecture 01: Overview I [ slides ]
[Reference]:
- Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix Wichmann, Wieland Brendel,
ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness, ICLR 2019
[ video ]
- Aleksander Madry (MIT),
A New Perspective on Adversarial Perturbation, Simons Institute for the Theory of Computing, 2019.
[Adversarial Examples Are Not Bugs, They Are Features]
|
Y.Y. |
|
09/12/2019, Thursday |
Lecture 02: Symmetry and Network Architectures: Wavelet Scattering Net, Frame Scattering, DCFnet, and Permutation Invariant/Equivariant Nets [ slides ] and Project 1.
[Reference]:
- Stephane Mallat's short course on Mathematical Mysteries of Deep Neural Networks: [ Part I video ], [ Part II video ],
[ slides ]
- Stephane Mallat, Group Invariant Scattering, Communications on Pure and Applied Mathematics, Vol. LXV, 1331–1398 (2012)
- Joan Bruna and Stephane Mallat, Invariant Scattering Convolution Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012
- Thomas Wiatowski and Helmut Bolcskei, A Mathematical Theory of Deep Convolutional Neural Networks for Feature Extraction, 2016.
- Qiang Qiu, Xiuyuan Cheng, Robert Calderbank, Guillermo Sapiro, DCFNet: Deep Neural Network with Decomposed Convolutional Filters, ICML 2018. arXiv:1802.04145.
- Taco S. Cohen, Max Welling, Group Equivariant Convolutional Networks, ICML 2016. arXiv:1602.07576.
- Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, Alexander Smola. Deep Sets, NIPS 2017. arXiv:1703.06114.
- Akiyoshi Sannai, Yuuki Takai, Matthieu Cordonnier. Universal approximations of permutation invariant/equivariant functions by deep neural networks, 2019. arXiv:1903.01939.
- Haggai Maron, Heli Ben-Hamu, Nadav Shamir, Yaron Lipman. Invariant and Equivariant Graph Networks. ICLR 2019. arXiv:1812.09902.
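As a concrete illustration of the permutation invariant architectures in the references above, the following minimal numpy sketch (a toy example, not code from any of the cited papers; all weights and dimensions are arbitrary) implements the Deep Sets form f(X) = rho( sum_i phi(x_i) ), which is invariant to any reordering of the set elements:

    import numpy as np

    rng = np.random.default_rng(0)
    d, h, out = 3, 16, 1                        # input, hidden, and output dimensions (arbitrary)
    W1, b1 = rng.normal(size=(h, d)), np.zeros(h)
    W2, b2 = rng.normal(size=(out, h)), np.zeros(out)

    def phi(x):                                 # per-element embedding (one random ReLU layer)
        return np.maximum(W1 @ x + b1, 0.0)

    def rho(s):                                 # readout applied to the pooled representation
        return W2 @ s + b2

    def deep_set(X):                            # f(X) = rho( sum_i phi(x_i) )
        return rho(sum(phi(x) for x in X))

    X = rng.normal(size=(5, d))                 # a "set" of 5 elements, stored as rows
    assert np.allclose(deep_set(X), deep_set(X[::-1]))   # permuting the set leaves f unchanged

Replacing the sum by a mean or max pooling preserves the invariance; the universality results cited above concern which invariant (or equivariant) functions such forms can approximate.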
|
Y.Y. |
|
09/18/2019, Wednesday |
Seminar: Asymptotic Behavior of the Robust Wasserstein Profile Inference (RWPI) Function: Selecting \delta for Distributionally Robust Optimization (DRO) Problems.
[ slides ]
[Speaker]: XIE, Jin, Stanford University.
[Abstract]:
Recently, [1] showed that several machine learning algorithms, such as the Lasso, support vector machines, regularized
logistic regression, and many others, can be represented exactly as distributionally robust optimization (DRO)
problems, where the uncertainty set is a neighborhood centered at the empirical distribution. A key element in the
study of this uncertainty set is the Robust Wasserstein Profile (RWP) function. In [1], the authors study the
asymptotic behavior of the RWP function under the true parameter for L^p costs. We consider costs of more general
forms, namely Bregman distances and the more general symmetric form d(x-y), and analyze the asymptotic behavior of
the RWP function in these cases. For statistical applications, we then study the RWP function with plug-in estimators.
This is joint work with Yue Hui, Jose Blanchet and Peter Glynn.
- [1] Blanchet, J., Kang, Y., & Murthy, K. Robust Wasserstein Profile Inference and Applications to Machine Learning, arXiv:1610.05627, 2016.
[ tutorial slides ]
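For orientation, a schematic form of the objects in this abstract (notation assumed here, not taken verbatim from [1]): the DRO problem and the RWP function can be written as

    \min_{\theta} \; \sup_{P : D_c(P, P_n) \le \delta} E_P[ \ell(X; \theta) ],
    \qquad
    R_n(\theta) = \inf \{ D_c(P, P_n) : E_P[ h(X, \theta) ] = 0 \},

where P_n is the empirical distribution, D_c is an optimal-transport (Wasserstein-type) discrepancy with cost c, and h is the estimating equation (e.g. h = \nabla_\theta \ell). The radius \delta is then calibrated from the asymptotic distribution of the suitably rescaled RWP function at the true parameter, which is the asymptotic analysis the talk extends to Bregman and symmetric d(x-y) costs.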
|
|
|
09/19/2019, Thursday |
Lecture 03: Robust Statistics and Generative Adversarial Networks [ slides ]
[Reference]
- GAO, Chao, Jiyu LIU, Yuan YAO, and Weizhi ZHU.
Robust Estimation and Generative Adversarial Nets.
[ arXiv:1810.02030 ] [ GitHub ] [ GAO, Chao's Simons Talk ]
- GAO, Chao, Yuan YAO, and Weizhi ZHU.
Generative Adversarial Nets for Robust Scatter Estimation: A Proper Scoring Rule Perspective.
[ arXiv:1903.01944 ]
|
Y.Y. |
|
09/26/2019, Thursday |
Lecture 04: Convolutional Neural Network on Graphs [ slides ]
[Seminar]: Multi-Scale and Multi-Representation Learning on Graphs and Manifolds [ slides ]
[Speaker]: Prof. ZHAO, Zhizhen, Department of ECE, UIUC
[Abstract]:
- The analysis of geometric (graph- and manifold-structured) data has recently gained prominence in the machine learning community. In the first part of the talk, I will introduce the Lanczos network (LanczosNet), which uses the Lanczos algorithm to construct low-rank approximations of the graph Laplacian for graph convolution. Relying on the tridiagonal decomposition of the Lanczos algorithm, we efficiently exploit multi-scale information via fast approximate computation of matrix powers, and design learnable spectral filters. Being fully differentiable, LanczosNet facilitates both graph kernel learning and the learning of node embeddings. I will show applications of LanczosNet to citation networks and the QM8 quantum chemistry dataset.
For the second part of the talk, I will introduce a novel multi-representation learning paradigm for manifolds naturally equipped with a group action. Utilizing a representation-theoretic mechanism, multiple associated vector bundles can be constructed over the orbit space, providing multiple views for learning the geometry of the underlying manifold. The consistency across these associated vector bundles forms a common basis for unsupervised manifold learning, through the redundancy inherent in the algebraic relations across irreducible representations of the transformation group. I will demonstrate the efficacy of the proposed algorithmic paradigm through dramatically improved robust nearest-neighbor search in cryo-electron microscopy image analysis.
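To make the low-rank, multi-scale filtering step of the first part concrete, here is a minimal numpy sketch (an illustration only; it omits LanczosNet's learnable spectral filters and all training, and the graph below is synthetic):

    import numpy as np

    def lanczos(A, k, seed=0):
        # k-step Lanczos tridiagonalization of a symmetric matrix A, giving A ~ Q T Q^T
        n = A.shape[0]
        rng = np.random.default_rng(seed)
        q = rng.normal(size=n); q /= np.linalg.norm(q)
        Q = np.zeros((n, k)); Q[:, 0] = q
        alpha, beta = np.zeros(k), np.zeros(k)            # beta[0] is unused
        for j in range(k):
            w = A @ Q[:, j]
            if j > 0:
                w -= beta[j] * Q[:, j - 1]
            alpha[j] = Q[:, j] @ w
            w -= alpha[j] * Q[:, j]
            w -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)      # full reorthogonalization for stability
            if j + 1 < k:
                beta[j + 1] = np.linalg.norm(w)
                Q[:, j + 1] = w / (beta[j + 1] + 1e-12)   # crude guard against breakdown
        T = np.diag(alpha) + np.diag(beta[1:], 1) + np.diag(beta[1:], -1)
        return Q, T

    def multiscale_features(Lap, x, scales=(1, 2, 4, 8), k=20):
        # Approximate Lap^s @ x for several scales s from a single low-rank factorization
        Q, T = lanczos(Lap, k)
        z = Q.T @ x
        return np.stack([Q @ np.linalg.matrix_power(T, s) @ z for s in scales], axis=-1)

    # Toy usage on a random undirected graph
    rng = np.random.default_rng(1)
    n = 50
    A = np.triu((rng.random((n, n)) < 0.1).astype(float), 1); A = A + A.T
    Lap = np.diag(A.sum(1)) - A                           # combinatorial graph Laplacian
    feats = multiscale_features(Lap, rng.normal(size=n))  # shape (n, 4): one column per scale

In LanczosNet these multi-scale features are combined by learnable spectral filters and stacked into layers; the sketch only shows why the tridiagonal factorization makes the matrix powers cheap.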
[Reference]:
- Xavier Bresson, Convolutional Neural Networks on Graphs, IPAM, UCLA, 2017. [ video ][ slides ]
|
Y.Y. |
|
10/10/2019, Thursday |
Lecture 05: An Introduction to Optimization and Regularization Methods in Deep Learning [ slides ]
[ Gallery of Project 1 ]:
- Description of Project 1
- Peer Review requirement: Peer Review and Report Assignment
- Rebuttal Guideline: Rebuttal
- Doodle Vote for Top 3 Reports: vote link
- Group 1: XIAO Jiashun, LIU Yiyuan, WANG Ya, and YU Tingyu. [ report ] [ review ]
- Group 2: Abhinav PANDEY. [ report ][ review ]
- Group 3: LEI Chenyang, Yazhou XING, Yue WU, and XIE Jiaxin. [ report ] [ review ]
- Group 4: Oscar Bergqvist, Martin Studer, Cyril de Lavergne. [ report ] [ review ] [ rebuttal ]
- Group 5: Lanqing XUE, Feng HAN, Jianyue WANG, Zhiliang TIAN. [ report ] [ review ] [ rebuttal ]
- Group 6: CHEN Zhixian, QIAN Yueqi, and ZHANG Shunkang. [ report ] [ review ]
- Group 7: Zhenghui CHEN and Lei KANG. [ report ][ review ] [ rebuttal ]
- Group 8: Boyu JIANG. [ report ] [ review ]
- Group 9: LI Donghao, WU Jiamin, ZENG Wenqi and CAO Yang. [ report ][ review ] [ rebuttal ]
- Group 10: Shichao LI, Ziyu WANG and Zhenzhen HUANG. [ report ][ review ] [ rebuttal ]
- Group 11: NG Yui Hong. [ report ][ review ]
- Group 12: Luyu Cen, Jingyang Li, Zhongyuan Lyu and Shifan Zhao. [ report ] [ review ]
- Group 13: Mutian He, Qing Yang, Yuxin Tong, Ruoyang Hou. [ report ] [ review ] [ rebuttal ]
- Group 14: WANG, Qicheng. [ report ] [ review ]
|
Y.Y. |
|
10/17/2019, Thursday |
Lecture 06: The Landscape of Empirical Risk of Neural Networks [ slides ]
[Reference]:
- C. Daniel Freeman and Joan Bruna. Topology and Geometry of Half-Rectified Network Optimization, ICLR 2017.
[ arXiv:1611.01540 ]
[ Stanford talk video ][ slides ]
- Luca Venturi, Afonso Bandeira, and Joan Bruna. Neural Networks with Finite Intrinsic Dimension Have no Spurious Valleys. [ arXiv:1802.06384 ]
- Rohith Kuditipudi, Xiang Wang, Holden Lee, Yi Zhang, Zhiyuan Li, Wei Hu, Sanjeev Arora and Rong Ge. Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets.
[ arXiv:1906.06247 ][ Simons talk video ][ Bilibili video ][ slides ]
|
Y.Y. |
|
10/24/2019, Thursday |
Lecture 07: Overparameterization and Optimization [ slides ]
[Speaker]: Prof. Jason Lee, Princeton University
[Abstract]: We survey recent developments in the optimization and learning of deep neural networks. The three focus topics are:
- 1) geometric results for the optimization of neural networks;
- 2) overparametrized neural networks in the kernel regime (Neural Tangent Kernel), with its implications and limitations (a schematic statement is sketched after this list);
- 3) potential strategies to prove that SGD improves on kernel predictors.
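For reference, a schematic statement of the kernel regime in topic 2 (notation assumed): for a network f(x; \theta), the Neural Tangent Kernel at parameters \theta is

    \Theta_\theta(x, x') = \langle \nabla_\theta f(x; \theta), \nabla_\theta f(x'; \theta) \rangle,

and under gradient flow on the squared loss \frac{1}{2}\sum_i (f(x_i; \theta_t) - y_i)^2 the training-set predictions evolve as

    \frac{d}{dt} f_t(X) = - \Theta_{\theta_t}(X, X) \, (f_t(X) - y).

In the infinite-width limit with appropriate scaling, \Theta_{\theta_t} stays close to its value at initialization, so training behaves like kernel regression with a fixed kernel; the limitations in topic 2 concern what this linearized regime cannot explain.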
|
Y.Y. |
|
10/31/2019, Thursday |
Lecture 08: Generalization in Deep Learning [ slides ]
[Abstract]: We review tools useful for the analysis of the generalization performance of deep neural networks on classification and regression problems. We review uniform convergence properties, which show how this performance depends on notions of complexity, such as Rademacher averages, covering numbers, and combinatorial dimensions, and how these quantities can be bounded for neural networks. We also review the analysis of the performance of nonparametric estimation methods such as nearest-neighbor rules and kernel smoothing. Deep networks raise some novel challenges, since they have been observed to perform well even with a perfect fit to the training data. We review some recent efforts to understand the performance of interpolating prediction rules, and highlight the questions raised for deep learning.
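A schematic example of the uniform convergence bounds reviewed here (a standard textbook statement, stated for a loss class G with values in [0,1]): with probability at least 1-\delta over an i.i.d. sample of size n, for all g \in G,

    E[g(Z)] \le \frac{1}{n}\sum_{i=1}^n g(Z_i) + 2\,\mathrm{Rad}_n(G) + \sqrt{\frac{\log(1/\delta)}{2n}},

where Rad_n(G) denotes the (expected) Rademacher complexity of G. The neural-network-specific work then lies in bounding Rad_n for the relevant function classes, e.g. via covering numbers or norm-based arguments.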
|
Y.Y. |
|
11/07/2019, Thursday |
Lecture 09: Generalization in Deep Learning (continued) [ slides ]
[Abstract]: We review tools useful for the analysis of the generalization performance of deep neural networks on classification and regression problems. We review uniform convergence properties, which show how this performance depends on notions of complexity, such as Rademacher averages, covering numbers, and combinatorial dimensions, and how these quantities can be bounded for neural networks. We also review the analysis of the performance of nonparametric estimation methods such as nearest-neighbor rules and kernel smoothing. Deep networks raise some novel challenges, since they have been observed to perform well even with a perfect fit to the training data. We review some recent efforts to understand the performance of interpolating prediction rules, and highlight the questions raised for deep learning.
[Reference]
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals,
Understanding deep learning requires rethinking generalization.
ICLR 2017.
[Chiyuan Zhang's codes]
- Peter L. Bartlett, Dylan J. Foster, Matus Telgarsky. Spectrally-normalized margin bounds for neural networks.
[ arXiv:1706.08498 ]. NIPS 2017.
- Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. A PAC-Bayesian Approach to Spectrally-Normalized
Margin Bounds for Neural Networks. [ arXiv:1707.09564 ]. International Conference on Learning Representations (ICLR), 2018.
- Noah Golowich, Alexander (Sasha) Rakhlin, Ohad Shamir. Size-Independent Sample Complexity of Neural Networks.
[ arXiv:1712.06541 ]. COLT 2018.
- Weizhi Zhu, Yifei Huang, Yuan Yao. On Breiman's Dilemma in Neural Networks: Phase Transitions of Margin Dynamics.
[ arXiv:1810.03389 ].
(This paper shows when Rademacher-complexity-based generalization bounds can be informative for early stopping,
and when such bounds fail with extremely over-parameterized models.)
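As an example of the norm-based bounds in the references above, the Bartlett–Foster–Telgarsky result can be summarized schematically (up to logarithmic factors, with notation compressed): for an L-layer network f_A with weight matrices A_1, ..., A_L and 1-Lipschitz activations, with probability at least 1-\delta,

    P[\arg\max_j f_A(x)_j \ne y] \;\lesssim\; \widehat{R}_\gamma(f_A) + \frac{\|X\|_F \, R_A}{\gamma n} + \sqrt{\frac{\log(1/\delta)}{n}},
    \qquad
    R_A = \Big(\prod_{i=1}^{L} \|A_i\|_\sigma\Big)\Big(\sum_{i=1}^{L} \frac{\|A_i^\top - M_i^\top\|_{2,1}^{2/3}}{\|A_i\|_\sigma^{2/3}}\Big)^{3/2},

where \widehat{R}_\gamma is the empirical margin error at margin \gamma, \|X\|_F^2 = \sum_i \|x_i\|_2^2, \|\cdot\|_\sigma is the spectral norm, and the M_i are fixed reference matrices. The PAC-Bayesian and size-independent bounds above replace R_A with related norm-based quantities.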
|
Y.Y. |
|
11/21/2019, Thursday |
Lecture 10: Implicit Regularization
[Abstract]: We review the implicit regularization of gradient descent type algorithms in machine learning.
[Reference]
- Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, Nathan Srebro. The Implicit Bias of Gradient Descent on Separable Data. [ arXiv:1710.10345 ]. ICLR 2018.
- Matus Telgarsky. Margins, Shrinkage, and Boosting. [ arXiv:1303.4172 ]. ICML 2013.
- Vaishnavh Nagarajan, J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. [ arXiv:1902.04742 ]. NeurIPS 2019. [ GitHub ].
(It argues that uniform-convergence-based generalization bounds, including those above, might fail to explain generalization in deep learning.)
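To make "implicit regularization" concrete, a schematic statement of the Soudry et al. result above: for linearly separable data {(x_i, y_i)} and the logistic (or exponential) loss, gradient descent with a small constant step size drives \|w_t\| \to \infty while the direction converges to the hard-margin SVM solution,

    \frac{w_t}{\|w_t\|_2} \;\longrightarrow\; \frac{\hat{w}}{\|\hat{w}\|_2},
    \qquad
    \hat{w} = \arg\min_{w} \|w\|_2 \ \text{ s.t. } \ y_i\, w^\top x_i \ge 1 \ \ \forall i,

with the convergence being only logarithmically fast in t. Telgarsky's boosting paper gives an analogous margin-maximization statement for coordinate descent with shrinkage.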
|
Y.Y. |
|
11/28/2019, Thursday |
Lecture 11: Seminars
[Title]: From Classical Statistics to Modern Machine Learning
[ slide ]
[Abstract]:
"A model with zero training error is overfit to the training data and will typically generalize poorly," goes statistical textbook wisdom. Yet, in modern practice, over-parametrized deep networks with a near-perfect fit on training data still show excellent test performance. As I will discuss in the talk, this apparent contradiction is key to understanding the practice of modern machine learning.
While classical methods rely on a trade-off balancing the complexity of predictors with training error, modern models are best described by interpolation, where a predictor is chosen among functions that fit the training data exactly, according to a certain (implicit or explicit) inductive bias. Furthermore, classical and modern models can be unified within a single "double descent" risk curve, which extends the classical U-shaped bias-variance curve beyond the point of interpolation. This understanding of model performance delineates the limits of the usual "what you see is what you get" generalization bounds in machine learning and points to new analyses required to understand computational, statistical, and mathematical properties of modern models.
I will proceed to discuss some important implications of interpolation for optimization, both in terms of "easy" optimization (due to the scarcity of non-global minima) and of fast convergence of small mini-batch SGD with a fixed step size.
[Reference]
- Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal. Reconciling modern machine learning practice and the bias-variance trade-off.
PNAS, 2019, 116 (32). [ arXiv:1812.11118 ]
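The double descent curve discussed in this talk can be reproduced in a few lines of numpy. The sketch below (a self-contained toy, not code from the referenced paper; all sizes and the noise level are arbitrary) fits least squares on the first p features, which via the pseudoinverse becomes the minimum-norm interpolant once p exceeds the sample size, and typically shows the test error peaking near the interpolation threshold p = n before descending again:

    import numpy as np

    rng = np.random.default_rng(0)
    n_train, n_test, D = 40, 2000, 120              # sample sizes and ambient dimension (arbitrary)
    beta = rng.normal(size=D) / np.sqrt(D)          # true signal spread over all D features

    def sample(n):
        X = rng.normal(size=(n, D))
        return X, X @ beta + 0.5 * rng.normal(size=n)   # noisy linear responses

    Xtr, ytr = sample(n_train)
    Xte, yte = sample(n_test)

    for p in [5, 10, 20, 35, 40, 45, 60, 90, 120]:  # "model size" = number of features used
        # pinv gives ordinary least squares for p <= n_train and
        # the minimum-norm interpolant for p > n_train
        theta = np.linalg.pinv(Xtr[:, :p]) @ ytr
        print(p, np.mean((Xte[:, :p] @ theta - yte) ** 2))

The classical U-shape lives on the p <= n_train side; the second descent, driven by the implicit bias toward minimum norm, is the interpolating regime described in the abstract.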
[Title]:
Benign Overfitting in Linear Prediction
[Abstract]:
Classical theory that guides the design of nonparametric prediction methods like deep neural networks involves a tradeoff between the fit to the training data and the complexity of the prediction rule. Deep learning seems to operate outside the regime where these results are informative, since deep networks can perform well even with a perfect fit to noisy training data. We investigate this phenomenon of 'benign overfitting' in the simplest setting, that of linear prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of effective rank of the data covariance. It shows that overparameterization is essential: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size. We discuss implications for deep networks and for robustness to adversarial examples.
Joint work with Phil Long, Gábor Lugosi, and Alex Tsigler.
[Reference]
- Peter L. Bartlett, Philip M. Long, Gábor Lugosi, Alexander Tsigler. Benign Overfitting in Linear Regression. arXiv:1906.11300
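The two notions of effective rank referred to in the abstract can be written down schematically (notation assumed): for a covariance \Sigma with eigenvalues \lambda_1 \ge \lambda_2 \ge \cdots,

    r_k(\Sigma) = \frac{\sum_{i>k} \lambda_i}{\lambda_{k+1}},
    \qquad
    R_k(\Sigma) = \frac{\big(\sum_{i>k} \lambda_i\big)^2}{\sum_{i>k} \lambda_i^2},

and the interpolating rule being analyzed is the minimum-norm least squares estimator \hat\theta = X^\top (X X^\top)^{-1} y. Roughly, the characterization says overfitting is benign when, beyond a small number of dominant directions (relative to n), the tail of the spectrum is long and flat enough that these effective ranks are large compared with the sample size, which is the sense in which "unimportant directions must significantly exceed the sample size".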
|
Y.Y. |
|
10/28/2019, Thursday |
Lecture 12: Final Project [ PDF ]
[ Gallery of Final Project ]:
- Description of Final Project
- Group 1: XIAO Jiashun, LIU Yiyuan, WANG Ya, and YU Tingyu. Reproducible Study of Training and Generalization Performance. [ report ] [ video ]
- Group 2: Abhinav PANDEY. Anomaly Detection in Semiconductors. [ report ] [ video ]
- Group 3: LEI Chenyang, Yazhou XING, Yue WU, and XIE Jiaxin. Colorizing Black-White Movies Fastly and Automatically. [ report ] [ video ]
- Group 4: Oscar Bergqvist, Martin Studer, Cyril de Lavergne. China Equity Index Prediction Contest. [ report ] [ video ]
- Group 5: HAN, Feng, Lanqing XUE, Zhiliang Tian, and Jianyue WANG. Contextual Information Based Market Prediction using Dynamic Graph Neural Networks. [ report ] [ video ] [ Kaggle link]
- Group 6: CHEN Zhixian, QIAN Yueqi, and ZHANG Shunkang. Semi-conductor Image Classification. [ report ] [ video ]
- Group 7: Zhenghui CHEN and Lei KANG. On Raphael Painting Authentication. [ report ][ video ]
- Group 8: Boyu JIANG. Final project report on Nexperia Image Classification. [ report ]
- Group 9: LI Donghao, WU Jiamin, ZENG Wenqi and CAO Yang. On teacher-student network learning. [ report ][ video ]
- Group 10: LI Shichao, Ziyu WANG and Zhenzhen HUANG. Semiconductor Classification by Making Decisions with Deep Features. [ report ][ video ]
- Group 11: NG Yui Hong. Reproducible Study of Training and Generalization Performance. [ report ][ video ]
- Group 12: Luyu Cen, Jingyang Li, Zhongyuan Lyu and Shifan Zhao. Nexperia Kaggle in-class contest. [ report ] [ video ] [ source ]
- Group 13: Mutian He, Qing Yang, Yuxin Tong, Ruoyang Hou. Defects Recognition on Nexperia's Semi-Conductors. [ report ] [ video ]
- Group 14: WANG, Qicheng. Great Challenges of Reproducible Training of CNNs. [ report ]
|
Y.Y. |
|