Adaptive Computation and Machine Learning

Machine Learning Algorithms

MIT Press

2016

775

**About the Book:**
If you are interested in implementing Deep Learning algorithms, then you must get this book! It is currently the best available documentation of the basic principles of Deep Learning algorithms. In fact, you can view this book for free at: www.deeplearningbook.org.

Chapters 1 through 4 provide background on Deep Learning, Linear Algebra, Probability Theory, and iterative deterministic optimization algorithms.

Chapter 5 is a basic introduction to Machine Learning algorithms. It covers topics such as the importance of having separate training and test data sets, hyperparameters, maximum likelihood estimation, maximum a posteriori (MAP) estimation, and stochastic gradient descent algorithms.
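To make the difference between maximum likelihood and MAP estimation concrete, here is a minimal sketch for a coin-flip (Bernoulli) model. The Beta(2, 2) prior, the example counts, and the function names are illustrative assumptions and are not taken from the book.

```python
def bernoulli_mle(successes, trials):
    """Maximum likelihood estimate of a Bernoulli success probability."""
    return successes / trials


def bernoulli_map(successes, trials, alpha=2.0, beta=2.0):
    """MAP estimate under an assumed Beta(alpha, beta) prior.

    This is the mode of the Beta posterior, which is valid when both
    posterior parameters exceed 1.
    """
    return (successes + alpha - 1.0) / (trials + alpha + beta - 2.0)


# Hypothetical data: 7 successes in 10 trials.
mle = bernoulli_mle(7, 10)
map_estimate = bernoulli_map(7, 10)
print(mle, map_estimate)
```

With these numbers the prior pulls the MAP estimate toward 0.5, so it falls between 0.5 and the maximum likelihood estimate.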

Chapter 6 introduces feedforward multilayer perceptrons, different types of hidden units, and different types of output units. Each type of output unit corresponds to a different probabilistic modeling assumption about the distribution of the target given the input pattern. Chapter 6 also discusses differentiation algorithms based on computational graphs. The computational graph methodology is extremely powerful for deriving the derivatives of objective functions for complicated network architectures.
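The computational graph idea can be sketched in a few lines. The following is a toy reverse-mode differentiation example for scalar values, assuming a hypothetical `Node` class and helper functions of my own invention; it is not the book's notation or any library's API.

```python
import math


class Node:
    """A scalar node in a computational graph (hypothetical helper class)."""

    def __init__(self, value, parents=()):
        self.value = value
        # Each entry is (parent_node, derivative of this node w.r.t. that parent).
        self.parents = parents
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def tanh(x):
    """Hyperbolic tangent unit; its local derivative is 1 - tanh(x)^2."""
    t = math.tanh(x.value)
    return Node(t, ((x, 1.0 - t * t),))


def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every node in the graph."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # parents are appended before their consumers

    visit(output)
    output.grad = 1.0
    # Walk from the output back to the inputs, applying the chain rule.
    for node in reversed(order):
        for parent, local_derivative in node.parents:
            parent.grad += node.grad * local_derivative


# A single hidden unit: y = tanh(w * x + b)
w, x, b = Node(0.5), Node(2.0), Node(-1.0)
y = tanh(w * x + b)
backward(y)
print(w.grad, x.grad, b.grad)
```

Because each operation records only its own local derivatives, the same `backward` routine works unchanged for any architecture built from these operations.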

Chapter 7 discusses the all-important concept of regularization. In the mid-1980s, the guiding philosophy was to start with smaller, simpler network architectures and “grow” those architectures. Later on, however, in the early 21st century, it was found that for certain types of complicated problems a much more effective approach was to start with a very complicated network architecture and “shrink” that architecture without sacrificing predictive performance. The underlying idea is that rather than finding the parameters which merely reduce the learning machine’s predictive error, the learning rule is designed to minimize an objective function which takes on large values either when the predictive error is large or when the learning machine has an excessive number of free parameters. Thus, the learning machine essentially “designs itself” during the learning process! This type of methodology is called “regularization”. The chapter covers standard L1 regularization (penalize if the sum of the absolute values of the parameters is large) and standard L2 regularization (penalize if the sum of the squares of the parameters is large). Adversarial training is also introduced as a regularization technique.
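As a rough illustration of such an objective function, the sketch below adds L1 and L2 penalty terms to a mean squared prediction error. The function name, the choice of mean squared error, and the penalty weights `l1` and `l2` are assumptions made for this example.

```python
def penalized_loss(errors, params, l1=0.0, l2=0.0):
    """Mean squared prediction error plus L1 and L2 parameter penalties."""
    mean_squared_error = sum(e * e for e in errors) / len(errors)
    l1_penalty = l1 * sum(abs(p) for p in params)   # sum of absolute values
    l2_penalty = l2 * sum(p * p for p in params)    # sum of squares
    return mean_squared_error + l1_penalty + l2_penalty


# Hypothetical prediction errors and parameter values.
errors = [1.0, -1.0]
params = [3.0, -4.0]

unpenalized = penalized_loss(errors, params)
penalized = penalized_loss(errors, params, l1=0.1, l2=0.01)
print(unpenalized, penalized)
```

The penalized value is larger than the unpenalized one even though the predictions are identical, so a learning rule that minimizes it is pushed toward smaller parameter values.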

Chapter 8 introduces optimization methods for training deep learning network architectures. Such methods include: batch, minibatch, and stochastic gradient descent (SGD), SGD with momentum, parameter initialization strategies, AdaGrad, RMSProp and their variants, limited-memory BFGS algorithms, updating groups of parameters at a time (block coordinate descent), averaging the last few parameter estimates in an on-line adaptive learning algorithm to improve the estimate (Polyak averaging), supervised pretraining, continuation methods, and curriculum learning. Anyone who is using a deep learning algorithm should be familiar with all of these methods and when they are applicable, because these techniques were specifically designed to address problems commonly encountered in specific deep learning applications.
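Two of these ideas, momentum and averaging the last few parameter estimates, can be sketched together. This example uses an exact gradient of a one-dimensional quadratic rather than a noisy minibatch gradient so that the result is reproducible; the function name, step size, momentum value, and averaging window are all assumptions chosen for illustration.

```python
def momentum_descent(grad_fn, theta, lr=0.1, momentum=0.9, steps=300, avg_last=10):
    """Gradient descent with momentum, plus an average of the last few iterates."""
    velocity = 0.0
    history = []
    for _ in range(steps):
        # The velocity is a decaying accumulation of past gradient steps.
        velocity = momentum * velocity - lr * grad_fn(theta)
        theta = theta + velocity
        history.append(theta)
    tail = history[-avg_last:]
    averaged = sum(tail) / len(tail)  # average of the last `avg_last` estimates
    return theta, averaged


def quadratic_gradient(theta):
    """Gradient of the toy objective f(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


final_theta, averaged_theta = momentum_descent(quadratic_gradient, theta=0.0)
print(final_theta, averaged_theta)
```

With a genuinely stochastic gradient, the individual iterates keep jittering around the minimum, which is where averaging the trailing estimates becomes useful.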

Chapter 9 discusses convolutional neural networks. Analogous to the convolution operator in the time domain, a convolutional neural network implements a convolution operator in the spatial domain. The basic idea is to apply to the image a collection of spatial convolution operators, each of which has free parameters and processes very small regions of the image. The outputs of this collection of operators then form the input pattern for the next layer of spatial convolution operators, which look for statistical regularities in the outputs of the first layer.
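A minimal sketch of this layered structure follows, using plain Python lists. Note that it computes cross-correlation (the kernel is not flipped), which is what most deep learning libraries call convolution. The function name, the tiny image, and the two hand-picked kernels are assumptions for illustration; in a real network the kernel values are the free parameters that get learned.

```python
def conv2d_valid(image, kernel):
    """Slide a small kernel over a 2D image without padding.

    Each output value depends only on a kernel-sized region of the input.
    """
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kernel_h) for b in range(kernel_w))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# A 3x4 image with a vertical edge between the second and third columns.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# First layer: a 1x2 kernel that responds to left-to-right intensity increases.
edge_kernel = [[-1, 1]]
layer1 = conv2d_valid(image, edge_kernel)

# Second layer: a 3x1 kernel that sums first-layer responses down a column.
column_kernel = [[1], [1], [1]]
layer2 = conv2d_valid(layer1, column_kernel)
print(layer1, layer2)
```

The second layer never sees the raw pixels. It only sees the first layer's edge responses, which is the sense in which later layers look for regularities in the outputs of earlier ones.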

Chapter 10 discusses recurrent and recursive neural networks and how to derive learning algorithms for them using computational graphs. Chapter 11 provides a practical methodology for applying deep learning methods. Chapter 12 discusses applications in the areas of computer vision, speech recognition, natural language processing, machine translation, and recommender systems.

The remaining third of the book constitutes Part 3, which discusses important deep learning topics that are areas of active research.

**Target Audience:**

Someone without any knowledge of machine learning would probably benefit from studying other introductory machine learning texts before tackling this one. Still, the book is relatively self-contained, with relevant reviews of linear algebra, probability theory, and numerical optimization. Some prior background in linear algebra, calculus, probability theory, and machine learning is helpful. Readers without this math background should still find the book reasonably accessible, but may struggle a bit with the few sections that use mathematical notation to specify or explain the behavior of specific algorithms.

**About the Authors:**
Dr. Ian Goodfellow (https://www.linkedin.com/in/ian-goodfellow-b7187213/) received his Ph.D. in Machine Learning from the University of Montreal in 2014 and has made substantial contributions to the field of Deep Learning.

Professor Yoshua Bengio (https://www.linkedin.com/in/yoshuabengio/) and Dr. Aaron Courville (https://aaroncourville.wordpress.com/) are also leaders in the field of Deep Learning and pursue research and teaching activities at the University of Montreal.

All three are elite experts actively pursuing research in the field of Deep Learning, with strong practical and theoretical experience with these algorithms.