Date 
Topic 
Instructor 
Scriber 
09/09/2020, Wednesday 
Lecture 01: Overview I [ slides ]

Y.Y. 

09/16/2020, Wednesday 
Lecture 02: Symmetry and Network Architectures: Wavelet Scattering Net, DCFnet, Frame Scattering, and Permutation Invariant/Equivariant Nets [ slides ] and Project 1.
[Reference]:
 Vardan Papyan, X.Y. Han, David L. Donoho,
Prevalence of Neural Collapse during the terminal phase of deep learning training, arXiv:2008.08186.
 Stephane Mallat's short course on Mathematical Mysteries of Deep Neural Networks: [ Part I video ], [ Part II video ],
[ slides ]
 Stephane Mallat, Group Invariant Scattering, Communications on Pure and Applied Mathematics, Vol. LXV, 1331–1398 (2012)
 Joan Bruna and Stephane Mallat, Invariant Scattering Convolution Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012
 Thomas Wiatowski and Helmut Bolcskei, A Mathematical Theory of Deep Convolutional Neural Networks for Feature Extraction, 2016.
 Qiang Qiu, Xiuyuan Cheng, Robert Calderbank, Guillermo Sapiro, DCFNet: Deep Neural Network with Decomposed Convolutional Filters, ICML 2018. arXiv:1802.04145.
 Taco S. Cohen, Max Welling, Group Equivariant Convolutional Networks, ICML 2016. arXiv:1602.07576.
 Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, Alexander Smola. Deep Sets , NIPS, 2017. arXiv:1703.06114.
 Akiyoshi Sannai, Yuuki Takai, Matthieu Cordonnier. Universal approximations of permutation invariant/equivariant functions by deep neural networks. arXiv:1903.01939, 2019.
 Haggai Maron, Heli BenHamu, Nadav Shamir, Yaron Lipman. Invariant and Equivariant Graph Networks . ICLR 2019. arXiv:1812.09902
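As a rough illustration of the permutation-invariant architectures in the references above (in the spirit of Deep Sets), the sketch below builds a function of the form rho(sum_i phi(x_i)) and checks that reordering the set elements leaves the output unchanged. The weights `W_phi` and `W_rho`, the layer sizes, and the function name `deep_set` are hypothetical choices made for this example and are not taken from any of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical random weights, for illustration only.
W_phi = rng.normal(size=(3, 8))   # per-element feature map phi: R^3 -> R^8
W_rho = rng.normal(size=(8, 1))   # readout rho: R^8 -> R

def deep_set(X):
    """Permutation-invariant function rho(sum_i phi(x_i)) on the rows of X."""
    H = np.maximum(X @ W_phi, 0.0)    # phi (linear map + ReLU) applied to each element
    pooled = H.sum(axis=0)            # sum pooling discards the order of the elements
    return float((pooled @ W_rho)[0])

X = rng.normal(size=(5, 3))           # a "set" of 5 elements in R^3
perm = rng.permutation(5)
out = deep_set(X)
out_perm = deep_set(X[perm])          # same set, different ordering
```

Because the only interaction between elements is the sum, `out` and `out_perm` agree up to floating-point rounding.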

Y.Y. 

09/23/2020, Wednesday 
Lecture 03: Generalization of Deep Learning. [ slides ].
[Title]: From Classical Statistics to Modern Machine Learning
[ slide ]
[Abstract]:
"A model with zero training error is overfit to the training data and will typically generalize poorly" goes statistical textbook wisdom. Yet, in modern practice, overparametrized deep networks with near perfect fit on training data still show excellent test performance. As I will discuss in the talk, this apparent contradiction is key to understanding the practice of modern machine learning.
While classical methods rely on a tradeoff balancing the complexity of predictors with training error, modern models are best described by interpolation, where a predictor is chosen among functions that fit the training data exactly, according to a certain (implicit or explicit) inductive bias. Furthermore, classical and modern models can be unified within a single "double descent" risk curve, which extends the classical U-shaped bias-variance curve beyond the point of interpolation. This understanding of model performance delineates the limits of the usual "what you see is what you get" generalization bounds in machine learning and points to new analyses required to understand computational, statistical, and mathematical properties of modern models.
I will proceed to discuss some important implications of interpolation for optimization, both in terms of "easy" optimization (due to the scarcity of non-global minima) and in terms of fast convergence of small mini-batch SGD with fixed step size.
[Reference]
 Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal. Reconciling modern machine learning practice and the bias-variance trade-off.
PNAS, 2019, 116 (32). [ arXiv:1812.11118 ]
 Mikhail Belkin, Alexander Rakhlin, Alexandre B Tsybakov.
Does data interpolation contradict statistical optimality?
AISTATS, 2019.
[ arXiv:1806.09471 ]
 Mikhail Belkin, Daniel Hsu, Partha Mitra.
Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate.
Neural Inf. Proc. Systems (NeurIPS) 2018.
[ arXiv:1806.05161 ]
[Title]: Generalization of linearized neural networks: staircase decay and double descent
[ slide ]
[Abstract]:
Deep learning methods operate in regimes that defy the traditional statistical mindset. Despite the nonconvexity of empirical risks and the huge complexity of neural network
architectures, stochastic gradient algorithms can often find the global minimizer of the training loss and achieve small generalization error on test data.
As one possible explanation for the training efficiency of neural networks, tangent kernel theory shows that a multi-layer neural network, in a proper large
width limit, can be well approximated by its linearization. As a consequence, the gradient flow of the empirical risk turns into a linear dynamics and
converges to a global minimizer. Since last year, linearization has become a popular approach in analyzing training dynamics of neural networks. However,
this naturally raises the question of whether the linearization perspective can also explain the observed generalization efficacy. In this talk, I will
discuss the generalization error of linearized neural networks, which reveals two interesting phenomena: the staircase decay and the double-descent curve.
Through the lens of these phenomena, I will also address the benefits and limitations of the linearization approach for neural networks.
[Reference]
 Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve.
[ arXiv:1908.05355 ]
 Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension.
[ arXiv:1904.12191 ]

Y.Y. 

09/30/2020, Wednesday 
Lecture 04: Generalization in Deep Learning: I and II. Introduction and Uniform Law of Large Numbers. [ slides ]
[Abstract]: We review tools useful for the analysis of the generalization performance of deep neural networks on classification and regression problems. We review uniform convergence properties, which show how this performance depends on notions of complexity, such as Rademacher averages, covering numbers, and combinatorial dimensions, and how these quantities can be bounded for neural networks. We also review the analysis of the performance of nonparametric estimation methods such as nearest-neighbor rules and kernel smoothing. Deep networks raise some novel challenges, since they have been observed to perform well even with a perfect fit to the training data. We review some recent efforts to understand the performance of interpolating prediction rules, and highlight the questions raised for deep learning.

Y.Y. 

10/07/2020, Wednesday 
Lecture 05: Generalization in Deep Learning: III. Classification and Rademacher Complexity [ slides ]
[Abstract]: We review tools useful for the analysis of the generalization performance of deep neural networks on classification and regression problems. We review uniform convergence properties, which show how this performance depends on notions of complexity, such as Rademacher averages, covering numbers, and combinatorial dimensions, and how these quantities can be bounded for neural networks. We also review the analysis of the performance of nonparametric estimation methods such as nearest-neighbor rules and kernel smoothing. Deep networks raise some novel challenges, since they have been observed to perform well even with a perfect fit to the training data. We review some recent efforts to understand the performance of interpolating prediction rules, and highlight the questions raised for deep learning.
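For orientation, the Rademacher-average bounds mentioned in this abstract are usually stated in the following standard textbook form. This is a sketch under stated assumptions, not necessarily the exact statement used in the slides; the symbols $\mathcal{F}$ (a function class with values in $[0,1]$), $z_1,\dots,z_n$ (an i.i.d. sample $S$), $\sigma_i$ (independent uniform $\pm 1$ signs), and $\delta$ are notation introduced here for illustration.

```latex
% Empirical Rademacher complexity of the class F on the sample S = (z_1, ..., z_n):
\hat{\mathfrak{R}}_S(\mathcal{F})
  = \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i f(z_i)\right],
\qquad
\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_{S}\left[\hat{\mathfrak{R}}_S(\mathcal{F})\right].

% Uniform law of large numbers: with probability at least 1 - \delta over S,
% simultaneously for all f in F,
\mathbb{E}[f(z)]
  \le \frac{1}{n}\sum_{i=1}^{n} f(z_i)
  + 2\,\mathfrak{R}_n(\mathcal{F})
  + \sqrt{\frac{\log(1/\delta)}{2n}}.
```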
[ Title ]: Rethinking Breiman's Dilemma in Neural Networks: Phase Transitions of Margin Dynamics.
[ slide ]
[Speaker]: ZHU, Weizhi (HKUST)
[Abstract]:
Margin enlargement over training data has been an important strategy since perceptrons in machine learning for the purpose of boosting the confidence
of training toward a good generalization ability. Yet Breiman shows a dilemma (Breiman, 1999) that a uniform improvement on margin distribution does not
necessarily reduce generalization errors. In this paper, we revisit Breiman’s dilemma in deep neural networks with recently proposed spectrally normalized
margins, from a novel perspective based on phase transitions of normalized margin distributions in training dynamics. The normalized margin distribution of a
classifier over the data can be divided into two parts: low/small margins, such as negative margins for misclassified samples, vs. high/large margins
for highly confident, correctly classified samples, which often behave differently during the training process. Low margins for training and test datasets are
often effectively reduced in training, along with reductions of training and test errors; while high margins may exhibit different dynamics, reflecting the
tradeoff between expressive power of models and complexity of data. When data complexity is comparable to the model expressiveness, high margin distributions
for both training and test data undergo similar decrease-increase phase transitions during training. In such cases, one can predict the trend of generalization
or test error by margin-based generalization bounds with restricted Rademacher complexities, shown in two ways in this paper with early stopping time exploiting
such phase transitions. On the other hand, over-expressive models may have both low and high training margins undergoing uniform improvements, with a distinct
phase transition in test margin dynamics. This reconfirms Breiman’s dilemma associated with overparameterized neural networks, where margins fail to predict
overfitting. Experiments are conducted with some basic convolutional networks, AlexNet, VGG-16, and ResNet-18, on several datasets including CIFAR-10/100 and
mini-ImageNet.
[Reference]
 Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals,
Understanding deep learning requires rethinking generalization.
ICLR 2017.
[Chiyuan Zhang's codes]
 Peter L. Bartlett, Dylan J. Foster, Matus Telgarsky. Spectrally-normalized margin bounds for neural networks.
[ arXiv:1706.08498 ]. NIPS 2017.
 Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. A PAC-Bayesian approach to spectrally-normalized
margin bounds for neural networks. [ arXiv:1707.09564 ]. International Conference on Learning Representations (ICLR), 2018.
 Noah Golowich, Alexander (Sasha) Rakhlin, Ohad Shamir. Size-Independent Sample Complexity of Neural Networks.
[ arXiv:1712.06541 ]. COLT 2018.
 Weizhi Zhu, Yifei Huang, Yuan Yao. Rethinking Breiman's Dilemma in Neural Networks: Phase Transitions of Margin Dynamics.
[ arXiv: 1810.03389 ].
 Vaishnavh Nagarajan, J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning.
[ arXiv:1902.04742 ]. NIPS 2019.
[ Github ].
(It argues that all the generalization bounds above might fail to explain generalization in deep learning)

Y.Y. 

10/14/2020, Wednesday 
Lecture 06: Generalization in Deep Learning: IV. Interpolation. [ slides ]
[Abstract]: We review tools useful for the analysis of the generalization performance of deep neural networks on classification and regression problems. We review uniform convergence properties, which show how this performance depends on notions of complexity, such as Rademacher averages, covering numbers, and combinatorial dimensions, and how these quantities can be bounded for neural networks. We also review the analysis of the performance of nonparametric estimation methods such as nearest-neighbor rules and kernel smoothing. Deep networks raise some novel challenges, since they have been observed to perform well even with a perfect fit to the training data. We review some recent efforts to understand the performance of interpolating prediction rules, and highlight the questions raised for deep learning.
[Title]:
Benign Overfitting in Linear Prediction
[Abstract]:
Classical theory that guides the design of nonparametric prediction methods like deep neural networks involves a tradeoff between the fit to the training data and the complexity of the prediction rule. Deep learning seems to operate outside the regime where these results are informative, since deep networks can perform well even with a perfect fit to noisy training data. We investigate this phenomenon of 'benign overfitting' in the simplest setting, that of linear prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of effective rank of the data covariance. It shows that overparameterization is essential: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size. We discuss implications for deep networks and for robustness to adversarial examples.
Joint work with Phil Long, Gábor Lugosi, and Alex Tsigler.
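To make the "minimum norm interpolating prediction rule" in the abstract concrete, here is a minimal NumPy sketch. It only illustrates what interpolation and minimum norm mean in an overparameterized linear model; it does not reproduce the paper's effective-rank analysis. The dimensions, the Gaussian data, and all variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                      # fewer samples than parameters (assumed sizes)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)              # pure-noise labels can still be fit exactly

# Minimum-norm solution of X w = y via the Moore-Penrose pseudoinverse.
w_min = np.linalg.pinv(X) @ y
train_residual = np.linalg.norm(X @ w_min - y)

# Any other interpolator differs from w_min by a vector in the null space of X.
v = rng.normal(size=d)
v_null = v - np.linalg.pinv(X) @ (X @ v)   # remove the row-space component of v
w_other = w_min + v_null
other_residual = np.linalg.norm(X @ w_other - y)

norm_min = np.linalg.norm(w_min)
norm_other = np.linalg.norm(w_other)
```

Both `w_min` and `w_other` fit the training data to numerical precision, but `w_min` has the smaller Euclidean norm, since `v_null` is orthogonal to it.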
[Reference]
 Peter L. Bartlett, Philip M. Long, Gábor Lugosi, Alexander Tsigler. Benign Overfitting in Linear Regression. arXiv:1906.11300
[ Gallery of Project 1 ]:
 Description of Project 1
 FAN, Ganghua .
[ report ]
[ source ]
 FANG Linjiajie (20382284), Liu Yiyuan (20568864), Wang Qiyue (20672641), Wang Ya (20549569).
[ report (pdf) ]
[ source ]
 Hao HE, He CAO, Yue GUO, Haoyi CHENG.
[ report (html) ]
[ source ]
 WU Huimin, HE Changxiang.
[ report (pdf) ]
[ source ]
 VU TuanAnh .
[ report (html) ]
[ source ]
 CAO Yang, WU Jiamin .
[ report (pptx) ]
[ source ]
 DU, Yipai, Yongquan QU .
[ report (pdf) ]
[ source ]
 Zheyue Fang, Chutian Huang, Yue WU, and Lu YANG .
[ report (pdf) ]
[ source ]
 Shizhe Diao, Jincheng Yu, Duo Li, Yimin Zheng .
[ report (pdf) ]
 Kai WANG, Weizhen DING .
[ report (pdf) ]
 Tony C. W. Mok, Jierong Wang .
[ report (pdf) ]
 Rongrong GAO (20619663), Junming CHEN (20750649), Zifan SHI (20619455) .
[ report (pdf) ]
 Samuel Cahyawijaya, Etsuko Ishii, Ziwei Ji, Ye Jin Bang .
[ report (pdf) ]
 Hanli Huang .
[ report (ipynb) ]
 ABDULLAH, Murad .
[ report (pdf) ]
 PANG Hong Wing and WONG, Yik Ben .
[ report (pdf) ]
[ source ]

Y.Y. 

10/21/2020, Wednesday 
Lecture 07: Is Optimization a Sufficient Language for Understanding Deep Learning? [ link ]
[Abstract]: In this Deep Learning era, machine learning usually boils down to defining a suitable objective/cost function for the learning task at hand,
and then optimizing this function using some variant of gradient descent (implemented via backpropagation). Little wonder that hundreds of ML papers
each year are devoted to various aspects of optimization. Today I will suggest that if our goal is mathematical understanding of deep learning, then
the optimization viewpoint is potentially insufficient — at least in the conventional view.
[Seminars]:
 FANG Linjiajie (20382284), Liu Yiyuan (20568864), Wang Qiyue (20672641), Wang Ya (20549569)
 WU Huimin, HE Changxiang
 CAO Yang, WU Jiamin

Y.Y. 

10/28/2020, Wednesday 
Lecture 08: Final Project [ pdf ]
[Title]: Compression and Acceleration of Pretrained Language Models
[ slide ]
[Speaker]: Dr. Lu HOU, Huawei Noah’s Ark Lab
[Abstract]:
Recently, pretrained language models based on the Transformer structure like BERT and RoBERTa have achieved remarkable results on various natural language processing tasks and even some computer vision tasks. However, these models have many parameters, hindering their deployment on edge devices with limited storage. In this talk, I will first introduce some basics about pretrained language modeling and our proposed pretrained language model NEZHA. Then I will elaborate on how we alleviate the concerns in various deployment scenarios during the inference and training period. Specifically, compression and acceleration methods using knowledge distillation, dynamic networks, and network quantization will be discussed.
Finally, I will also discuss some recent progress about training deep networks on edge through quantization.
[Bio]:
Dr. Lu HOU is a researcher at the Speech and Semantics Lab in Huawei Noah's Ark Lab. She obtained her Ph.D. from the Hong Kong University of Science and Technology in 2019, under the supervision of Prof. James T. Kwok. Her current research interests include compression and acceleration of deep neural networks, natural language processing, and deep learning optimization.

Y.Y. 

11/04/2020, Wednesday 
Lecture 09: Overparameterization and Optimization [ slides ]
[Speaker]: Prof. Jason Lee, Princeton University
[Abstract]: We survey recent developments in the optimization and learning of deep neural networks. The three focus topics are on:
 1) geometric results for the optimization of neural networks,
 2) Overparametrized neural networks in the kernel regime (Neural Tangent Kernel) and its implications and limitations,
 3) potential strategies to prove SGD improves on kernel predictors.

Y.Y. 

11/11/2020, Wednesday 
Lecture 10: Implicit Regularization
[Abstract]: We review the implicit regularization of gradient descent type algorithms in machine learning.
[Reference]
 Inductive Bias and Optimization in Deep Learning at
Stanford Stats385
 Matus Telgarsky. A Primal-dual Analysis of Margin Maximization by Steepest Descent Methods
Simons Institute
 Behnam Neyshabur, Ryota Tomioka, Nathan Srebro. In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning.
[ arXiv:1412.6614 ]
 Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, Nathan Srebro. Geometry of Optimization and Implicit Regularization in Deep Learning.
[ arXiv: 1705.03071] An older paper that takes a higher level view of what might be going on and what we want to try to achieve.
 Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, Nathan Srebro. The Implicit Bias of Gradient Descent on Separable Data.
[ arXiv:1710.10345 ]. ICLR 2018. Gradient descent on logistic regression leads to max margin.
 Matus Telgarsky. Margins, Shrinkage, and Boosting. [ arXiv:1303.4172 ]. ICML 2013. An older paper on gradient descent on exponential/logistic loss
leads to max margin.
 Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, Nathan Srebro. Implicit Regularization in Matrix Factorization.
[ arXiv:1705.09280 ]
 Yuanzhi Li, Tengyu Ma, Hongyang Zhang. Algorithmic Regularization in Overparameterized Matrix Sensing and Neural Networks with Quadratic Activations.
[ arXiv:1712.09203 ]
 Blake Woodworth, Suriya Gunasekar, Pedro Savarese, Edward Moroshko, Itay Golan, Jason Lee, Daniel Soudry, Nathan Srebro. Kernel and Rich Regimes in Overparametrized Models.
[ arXiv:1906.05827 ]
 Suriya Gunasekar, Jason Lee, Daniel Soudry, Nathan Srebro. Characterizing Implicit Bias in Terms of Optimization Geometry
[ arXiv:1802.08246 ]
 Mor Shpigel Nacson, Suriya Gunasekar, Jason D. Lee, Nathan Srebro, Daniel Soudry. Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models
[ arXiv:1905.07325 ] a generalization of Implicit regularization in linear conv nets: https://arxiv.org/abs/1806.00468
 Greg Ongie, Rebecca Willett, Daniel Soudry, Nathan Srebro. A Function Space View of Bounded Norm Infinite Width ReLU Nets: The Multivariate Case
[ arXiv:1910.01635 ] Inductive bias in infinite-width ReLU networks of high dimensionality
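Several of the references above (for example Soudry et al.) note that gradient descent on the logistic loss over linearly separable data converges in direction to the max-margin separator. The sketch below illustrates this on a toy problem. The three data points, step size, and iteration count are assumptions chosen for the example; for this data the max-margin direction through the origin is (1, 0), because the constraints w1 + w2 >= 1 and w1 - w2 >= 1 force w1 >= 1.

```python
import numpy as np

# Toy separable data with labels folded in: z_i = y_i * x_i (assumed example).
# The first two points are support vectors; the third lies far from the margin.
Z = np.array([[1.0, 1.0], [1.0, -1.0], [3.0, 1.0]])

w = np.zeros(2)
lr = 0.1
for _ in range(5000):
    margins = Z @ w
    # Gradient of sum_i log(1 + exp(-margin_i)) with respect to w.
    grad = -(Z * (1.0 / (1.0 + np.exp(margins)))[:, None]).sum(axis=0)
    w -= lr * grad

w_norm = float(np.linalg.norm(w))
direction = w / w_norm
max_margin_dir = np.array([1.0, 0.0])   # hard-margin solution for this toy data
cosine = float(direction @ max_margin_dir)
```

The loss has no finite minimizer, so the norm of `w` keeps growing, while `direction` aligns with `max_margin_dir`; the non-support point does not change the limiting direction.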

Y.Y. 

11/18/2020, Wednesday 
Lecture 11: seminars.
[Title]: Theory of Deep Convolutional Neural Networks.
[ slides ]
[Speaker]: Ding-Xuan ZHOU, City University of Hong Kong.
[Abstract]:
Deep learning has been widely applied and brought breakthroughs in speech recognition, computer vision, and many other domains. The involved deep neural network architectures and computational issues have been well studied in machine learning. However, a theoretical foundation is still lacking for understanding the modelling, approximation, or generalization ability of deep learning models with network architectures. Here we are interested in deep convolutional neural networks (CNNs) with convolutional structures. The convolutional architecture gives essential differences between the deep CNNs and fully-connected deep neural networks, and the classical theory for fully-connected networks developed around 30 years ago does not apply. This talk describes a mathematical theory of deep CNNs associated with the rectified linear unit (ReLU) activation function.
In particular, we give the first proof for the universality of deep CNNs, meaning that a deep CNN can be used to approximate any continuous function to an arbitrary accuracy when the depth of the neural network is large enough. We also give explicit rates of approximation, and show that the approximation ability of deep CNNs is at least as good as that of fully-connected multi-layer neural networks for general functions, and is better for radial functions. Our quantitative estimate, given tightly in terms of the number of free parameters to be computed, verifies the efficiency of deep CNNs in dealing with big data.
[Bio]:
Ding-Xuan Zhou is a Chair Professor in the School of Data Science and the Department of Mathematics at City University of Hong Kong, serving also as Associate Dean of the School of Data Science and Director of the Liu Bie Ju Centre for Mathematical Sciences. His recent research interest is deep learning theory.
He is an Editor-in-Chief of the journals "Analysis and Applications" and "Mathematical Foundations of Computing", and serves on the editorial boards of more than ten journals. He received a Fund for Distinguished Young Scholars from the NSF of China in 2005, and was rated in 2014-2017 by Thomson Reuters/Clarivate Analytics as a Highly Cited Researcher.
[Title]: Analyzing Optimization and Generalization in Deep Learning via Dynamics of Gradient Descent
[ slides ]
[Abstract]:
Understanding deep learning calls for addressing the questions of: (i) optimization: the effectiveness of simple gradient-based algorithms in solving neural network training programs that are non-convex and thus seemingly difficult; and (ii) generalization: the phenomenon of deep learning models not overfitting despite having many more parameters than examples to learn from. Existing analyses of optimization and/or generalization typically adopt the language of classical learning theory, abstracting away many details on the setting at hand. In this talk I will argue that a more refined perspective is in order, one that accounts for the dynamics of the optimizer. I will then demonstrate a manifestation of this approach, analyzing the dynamics of gradient descent over linear neural networks. We will derive what is, to the best of my knowledge, the most general guarantee to date for efficient convergence to global minimum of a gradient-based algorithm training a deep network. Moreover, in stark contrast to conventional wisdom, we will see that sometimes, adding (redundant) linear layers to a classic linear model significantly accelerates gradient descent, despite the introduction of non-convexity. Finally, we will show that such addition of layers induces an implicit bias towards low rank (different from any type of norm regularization), and by this explain generalization of deep linear neural networks for the classic problem of low rank matrix completion.
Works covered in this talk were in collaboration with Sanjeev Arora, Noah Golowich, Elad Hazan, Wei Hu, Yuping Luo and Noam Razin.



11/25/2020, Wednesday 
Lecture 12: Mean Field Theory for Neural Networks.
[Title]: Mean Field Theory and Tangent Kernel Theory in Neural Networks.
[ slides ]
[Speaker]: Song Mei, University of California at Berkeley.
[Abstract]:
Deep neural networks trained with stochastic gradient algorithms often achieve near vanishing training error, and generalize well on test data. Such empirical success of optimization and generalization, however, is quite surprising from a theoretical point of view, mainly due to nonconvexity and overparameterization of deep neural networks.
In this lecture, I will talk about the mean field theory and the tangent kernel theory on the training dynamics of neural networks, and discuss their benefits and shortcomings in terms of both optimization and generalization. Then I will analyze the generalization error of linearized neural networks with two interesting phenomena: staircase and double-descent. Finally, I will propose challenges and open problems in analyzing deep neural networks.
[Reference]
 [ video ]
 Mei, Montanari, and Nguyen. A mean field view of the landscape of two-layers neural networks. Proceedings of the National Academy of Sciences 115, E7665-E7671.
 Rotskoff and Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv:1805.00915.
 Chizat and Bach. On the Global Convergence of Gradient Descent for Overparameterized Models using Optimal Transport. Advances in neural information processing systems, 2018, pp. 3036–3046.
 Jacot, Gabriel, and Hongler. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. Advances in neural information processing systems, 2018, pp. 8571–8580.
 Belkin, Hsu, Ma, and Mandal. Reconciling modern machine learning practice and the bias-variance trade-off. Proceedings of the National Academy of Sciences 116.32 (2019): 15849-15854.
 Bach. Breaking the Curse of Dimensionality with Convex Neural Networks. The Journal of Machine Learning Research 18 (2017), no. 1, 629–681.
 Ghorbani, Mei, Misiakiewicz, and Montanari. Linearized two-layers neural networks in high dimension. arXiv:1904.12191.
 Hastie, Montanari, Rosset, and Tibshirani. Surprises in High-Dimensional Ridgeless Least Squares Interpolation. arXiv:1903.08560.
 Mei and Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv:1908.05355.
[Title]: A mean-field theory for certain deep neural networks
[Abstract]:
A natural approach to understand overparameterized deep neural networks is to ask if there is some kind of natural limiting behavior when the number of neurons diverges. We present a rigorous limit result of this kind for networks with complete connections and "random-feature-style" first and last layers. Specifically, we show that network weights are approximated by certain "ideal particles" whose distribution and dependencies are described by a McKean-Vlasov mean-field model. We will present the intuition behind our approach; sketch some of the key technical challenges along the way; and connect our results to some of the recent literature on the topic.
[Reference]
 [ video ]
 Dyego Araújo, Roberto I. Oliveira, Daniel Yukimura.
A mean-field limit for certain deep neural networks.
arXiv:1906.00193.
 Justin Sirignano, Konstantinos Spiliopoulos.
"Mean field analysis of deep neural networks", 2020, Mathematics of Operations Research,
[ArXiv:1903.04440], to appear.
 Jean-François Jabir, David Šiška, Łukasz Szpruch.
Mean-Field Neural ODEs via Relaxed Optimal Control
arXiv:1912.05475.
 Phan-Minh Nguyen, Huy Tuan Pham.
A Rigorous Framework for the Mean Field Limit of Multilayer Neural Networks.
arXiv:2001.11443.
 Weinan E, Stephan Wojtowytsch.
On the Banach spaces associated with multilayer ReLU networks: Function representation, approximation theory and gradient descent dynamics.
arXiv:2007.15263.



12/02/2020, Wednesday 
Lecture 13: seminars.
[Title]: Learning assisted modeling of molecules and materials.
[ slides ]
[Speaker]: Linfeng ZHANG, Beijing Institute of Big Data Research and Princeton University.
[Abstract]:
In recent years, machine learning (ML) has emerged as a promising tool for dealing with the difficulty of representing high dimensional functions. This gives us an unprecedented opportunity to revisit theoretical foundations of various scientific fields and solve problems that were too complicated for conventional approaches to address. Here we identify a list of such problems in the context of multiscale molecular and materials modeling and review ML-based strategies that boost simulations with ab initio accuracy to much larger scales than conventional approaches. Using examples at scales of many-electron Schrödinger equation, density functional theory, and molecular dynamics, we present two equally important principles: 1) ML-based models should respect important physical constraints in a faithful and adaptive way; 2) to build truly reliable models, efficient algorithms are needed to explore relevant physical space and construct optimal training data sets. Finally, we present our efforts on developing related software packages and high-performance computing schemes, which have now been widely used worldwide by experts and practitioners in the molecular and materials simulation community.
[Bio]:
Linfeng Zhang is temporarily working as a research scientist at the Beijing Institute of Big Data Research. In May 2020, he graduated from the Program in Applied and Computational Mathematics (PACM), Princeton University, working with Profs. Roberto Car and Weinan E. Linfeng has been focusing on developing machine learning based physical models for electronic structures, molecular dynamics, as well as enhanced sampling. He is one of the main developers of DeePMD-kit, a very popular deep learning based open-source software for molecular simulation in physics, chemistry, and materials science. He is a
recipient of the 2020 ACM Gordon Bell Prize for the project
“Pushing the limit of molecular dynamics with ab initio accuracy to 100 million atoms with machine learning.”
[Reference]
 Weile Jia, Han Wang, Mohan Chen, Denghui Lu, Lin Lin, Roberto Car, Weinan E, Linfeng Zhang.
Pushing the limit of molecular dynamics with ab initio accuracy to 100 million atoms with machine learning.
arXiv:2005.00223
[Title]: Robust Estimation via Generative Adversarial Networks.
[ slides ]
[Speaker]: Weizhi ZHU, HKUST.
[Abstract]:
Robust estimation under Huber's contamination model has become an important topic in statistics and theoretical computer science. Rate-optimal procedures such as Tukey's median and other estimators based on statistical depth functions are impractical because of their computational intractability. In this talk, we establish an intriguing connection between f-GAN, various depth functions and proper scoring rules. Similar to the derivation of f-GAN, we show that these depth functions that lead to rate-optimal robust estimators can all be viewed as variational lower bounds of the total variation distance in the framework of f-Learning.
[Reference]
 GAO, Chao, Jiyu LIU, Yuan YAO, and Weizhi ZHU.
Robust Estimation and Generative Adversarial Nets.
[ arXiv:1810.02030 ] [ GitHub ] [ GAO, Chao's Simons Talk ]
 GAO, Chao, Yuan YAO, and Weizhi ZHU.
Generative Adversarial Nets for Robust Scatter Estimation: A Proper Scoring Rule Perspective. Journal of Machine Learning Research, 21(160):1-48, 2020.
[ arXiv:1903.01944 ] [ GitHub ]
[Title]: Towards a mathematical understanding of supervised learning: What we know and what we don't know
[ slides ]
[Speaker]: Weinan E, Princeton University.
[Abstract]:
Two of the biggest puzzles in machine learning are: Why is it so successful and why is it quite fragile?
This talk will present a framework for unraveling these puzzles from the perspective of approximating functions in high dimensions. We will discuss what's known and what's not known about the approximation and generalization properties of neural network type hypothesis spaces, as well as the dynamics and generalization properties of the training process. We will also discuss the relative merits of shallow vs. deep neural network models and suggest ways to formulate more robust machine learning models.
This is joint work with Chao Ma, Stephan Wojtowytsch and Lei Wu.
[ Gallery of Project 2 ]:
 Description of Project 2 (Final Project)
 Kaggle in-class contest on semiconductor image classification 2 mini: [ link ]
 PANG, Hong Wing and Wong, Yik Ben .
1. Can Object Detectors Generalize?
[ poster ]
[ video ]
 Ye Jin Bang, Etsuko Ishii, Samuel Cahyawijaya, and Ziwei Ji .
2. Model Generalization on COVID-19 Fake News Detection
[ report ]
[ slides ]
[ source ]
[ video ]
 Zheyue FANG, Chutian HUANG, Yue WU, and Lu YANG .
3. Home Credit Default Risk Project
[ report ]
[ source ]
[ video ]
 Yipai Du and Yongquan Qu .
4. Interpretability of Deep Learning on Home Credit Default Risk Dataset
[ poster ]
[ slides ]
[ source ]
[ video ]
 Shizhe Diao, Jincheng Yu, Duo Li, and Yimin Zheng .
5. Improving Batch Normalization via Scaling and Shifting Relay
[ poster ]
[ slides ]
[ video ]
 ABDULLAH, Murad .
6. Classification of Nexperia Image Dataset: An Averaging Ensemble Approach
[ report ]
[ source ]
[ video ]
 HE, Changxiang and XU, Yan .
7. Nexperia Image Classification
[ report ]
[ slides ]
[ video ]
 Ganghua Fan .
8. Kaggle in-class Contest: Nexperia Image Classification II
[ poster ]
[ slides ]
[ codes ]
[ video ]
 Hanli Huang .
9. Semiconductor defect images classification
[ poster (pptx) ]
[ slides (pptx) ]
[ codes ]
[ video ]
 Huimin Wu .
10. Nexperia Image Classification II with Noise Handling
[ poster ]
[ video ]
 FANG Linjiajie, Liu Yiyuan, Wang Qiyue, and Wang Ya .
11. Solving SemiConductor Classification Problem by Lightweighted Model with Stratified Convolutions
[ report ]
[ slides ]
[ codes ]
[ video ]
 Rongrong GAO, Junming CHEN, and Zifan SHI .
12. Nexperia Image Classification
[ report ]
[ slides ]
[ codes ]
[ video ]
 Tony C.W. Mok and Jierong Wang .
13. Toward Fast and Accurate Semiconductor Image Classification using Deep Convolutional Neural Networks
[ report ]
[ codes ]
 Tuan Anh VU .
14. Anomaly Detection using Transfer Learning in Semiconductors
[ report ]
[ slides ]
[ codes ]
[ video ]
 Yang Cao and Jiamin Wu .
15. Nexperia Image Classification
[ report ]
[ slides ]
[ codes ]
[ video ]
 Yue Guo, Hao He, He Cao, and Haoyi Cheng (DreamDragon).
16. Image Classification of Semiconductors
[ report ]
[ slides (pptx) ]
[ codes ]
[ video ]
 Kai Wang and Weizhen Ding .
17. Defect Detection in Semi conductor Images
[ report ]
[ slides (pptx) ]
[ codes ]
[ video ]


