LM101-029: How to Modernize Deep Learning with Rectilinear units, Convolutional Nets, and Max-Pooling

May 25, 2015

Episode Summary

This podcast discusses the topics of rectilinear units, convolutional nets, and max-pooling relevant to deep learning. It was inspired by my recent visit to the 3rd International Conference on Learning Representations (May 7-9, 2015) in San Diego. Specifically, we discuss commonly used techniques shared by many successful deep learning algorithms: rectilinear units, convolutional filters, and max-pooling.

Show Notes

Hello everyone! Welcome to the twenty-ninth podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in hopefully an entertaining and educational manner.

Recently, I attended two machine learning conferences: the “3rd International Conference on Learning Representations”, also known as ICLR, and the “18th International Conference on Artificial Intelligence and Statistics”, also known as AISTATS. Both of these conferences are relatively small, with no more than about 200 or 300 participants each. The attendees at these conferences consist of researchers from the top machine learning groups at Google, Microsoft, and Amazon as well as professors, postdoctoral scholars, and doctoral students pursuing machine learning research. The ICLR conference was held May 7 through 9 at the Hilton San Diego Resort and Spa and was immediately followed by the AISTATS conference, which was held May 9 through 12 at the Hilton as well. The links to these conferences may be found on the website: www.learningmachines101.com. The ICLR conference, in particular, was sponsored by the Computational and Biological Learning Society.

In the next sequence of podcasts, I will discuss and review aspects of these conferences which have influenced my thinking in profound ways. I will also provide background information relevant to understanding this review. In this episode we will focus on three commonly used strategies for training deep learning machines. The first strategy for building deep learning networks is to consider an alternative type of computing unit which is different from the function approximation units presented in Episode 14. The second strategy for building deep learning networks is to impose considerable structure on the network architecture despite the fact that the network has millions of free parameters. We will refer to this second strategy as the convolutional network strategy. The third strategy is called “max-pooling” which is often used in conjunction with convolutional networks. Actually, all three of these techniques have appeared in various forms in the biological cybernetics and neural network modeling literature for many decades but only within the past five years have these techniques been used together to enhance the computational performance of deep learning machines. In Episode 23 of Learning Machines 101, I introduced the basic concept of “deep learning strategies” but in this episode we discuss strategies for deep learning which extend and complement the ideas introduced in Episode 23.

The ICLR conference might be described as a conference with a strong focus on the problem of “deep learning” using a primarily empirical methodology. An important characteristic of “deep learning”, and perhaps its defining characteristic, is the ability of a system to create and extract its own “features” from a large data set. The problem of identifying the correct features to facilitate learning and inference has existed since the beginning of the field of Artificial Intelligence. Every introductory textbook in Artificial Intelligence emphasizes the fundamental importance of the “representation problem”. The exact same learning problem can be either practically impossible to solve or incredibly easy to solve, depending simply upon how the problem is represented. Smart biological systems such as humans can quickly learn to solve many complex problems whose solutions cannot be learned by state-of-the-art artificially intelligent learning machines. Undoubtedly, this advantage is due in part to millions of years of evolutionary pressure from the environment, which has provided biological systems with “smart” representations of their environments intended to optimize learning efficiency within specific ecological niches.

Traditionally, engineers have focused their attention on designing the correct features to support learning in a learning machine. However, researchers in the field of Machine Learning such as Dr. Yann LeCun and Dr. Geoffrey Hinton have emphasized that an important alternative to the design of smart representations is to learn “smart features” and “smart representations”. Recent advances in machine learning algorithms combined with the availability of large data sets as well as enhanced computing capabilities have demonstrated the possibility that such smart features can be learned from experience without human intervention.  So this idea is essentially the core philosophy of the ICLR conference.  Many of the talks at the ICLR conference involved identifying a particular machine learning problem, using a multi-layer feedforward convolutional network with rectified linear units to solve the problem by training a network with millions of free parameters for days or weeks on a supercomputer using a gigantic data set consisting of possibly thousands or even millions of examples, and then evaluating the performance of the system using training data and test data.

There was also a lot of interest in understanding cases where the generalization performance of such systems varied abruptly from spectacular to horrible. Such cases were referred to as “adversarial examples”. Ms. Ran Bi is a master’s student in the Data Science program at New York University. Ms. Bi has a very nice blog which has some great illustrations of the adversarial example problem. If you go to the transcript of this episode at: www.learningmachines101.com I have a link to Ms. Bi’s blog. She shows several pairs of images which to the human eye look identical, yet many deep learning machines can recognize one member of each pair but not the other! So the big point here is that we still don’t truly understand the issues associated with generalization performance in deep learning networks.

However, it is important to understand that the accomplishments of empirical work in the area of deep learning are fundamentally important. I view such empirical studies as proving mathematical theorems of the type called “existence theorems” in mathematics. If we construct a deep learning machine with millions of parameters in a particular way, then there exists a set of parameter values for that learning machine such that the learning machine is capable of solving a specific problem in artificial intelligence. The proof of such an existence theorem, of course, is accomplished through computer experiments such as those reported at the ICLR conference.

There were several striking commonalities shared among many of the deep learning machines discussed at the ICLR conference. First, many of these deep learning machines used a new type of unit which is different from the traditional linear, sigmoidal, or radial basis function units discussed in the Function Approximation Episode 14 at: www.learningmachines101.com. This new type of unit is called a rectified linear unit or rectilinear unit and apparently is now very widely used. In Episode 14, we described the linear unit, which is a computing unit that computes a weighted sum of its inputs and returns that weighted sum as its output. Like the linear computing unit, the rectified linear unit computes a weighted sum of its inputs, but it only returns that weighted sum as an output if that weighted sum is positive. If the rectified linear unit computes a weighted sum of its inputs and that weighted sum is negative, then the rectified linear unit returns the value of zero.
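To make this concrete, here is a minimal sketch in Python (assuming NumPy, with placeholder input values and weights of my own choosing) of a single rectified linear unit computing a weighted sum of its inputs:

```python
import numpy as np

# A minimal sketch of a single rectified linear (rectilinear) unit: it computes
# a weighted sum of its inputs and returns that sum only when the sum is
# positive. The input values and weights below are placeholders for illustration.

def rectified_linear_unit(inputs, weights):
    netinput = np.dot(weights, inputs)      # weighted sum of the unit's inputs
    return netinput if netinput > 0 else 0.0

inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.8, 0.1, -0.4])
print(rectified_linear_unit(inputs, weights))   # prints 0.0; the weighted sum is negative
```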

One technical problem associated with “rectified linear units” is that they are not differentiable everywhere so this is potentially a problem in applying gradient descent methods. To address this problem, one can use subgradient projection optimization methods which essentially are a generalization of the standard gradient descent method described in the Gradient Descent Episode 16 of this podcast series. Such methods permit gradient descent to take place even though the rectified linear units are not differentiable everywhere.

Alternatively, one can simply use a “smoothed version” of a rectified linear unit in the same way that we can interpret a sigmoidal function as a “smoothed version” of a McCulloch-Pitts formal neuron as described in the Function Approximation Episode 14. The formula for a smoothed version of a rectified linear unit is simply given as follows. The output of the unit is the natural logarithm of one plus e raised to the netinput of the unit where the “netinput” is defined as the weighted sum of the unit’s inputs. When the netinput is large, then e raised to the netinput is large compared to one and the output of the smoothed rectified linear unit is approximately equal to the rectified linear unit’s netinput. When the netinput of the rectified linear unit is a large negative number, then e raised to the large negative number is close to zero and so the output of the smoothed rectified linear unit is close to zero.
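The smoothed rectified linear unit described above (often called the “softplus” function) is easy to compute directly. Here is a minimal sketch, assuming NumPy and a few netinput values chosen only for illustration:

```python
import numpy as np

# A minimal sketch comparing the rectified linear unit with its smoothed version
# (the natural logarithm of one plus e raised to the netinput, often called
# "softplus"). The netinput values below are chosen only for illustration.

def rectified_linear(netinput):
    return np.maximum(netinput, 0.0)

def smoothed_rectified_linear(netinput):
    return np.log(1.0 + np.exp(netinput))

for net in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(net, rectified_linear(net), round(smoothed_rectified_linear(net), 5))
# For a large positive netinput both outputs are nearly equal to the netinput;
# for a large negative netinput both outputs are close to zero.
```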

This simple nonlinearity turns out to be very powerful for learning nonlinear mappings. Suppose we have a complex nonlinear function we want to approximate. For simplicity, assume that the function takes as input a number and returns a new number. This function could be very complicated, but in many cases it should be possible to approximate it using a set of linear functions or linear units. For example, suppose we have the function x squared, which takes as input a number and returns that number multiplied by itself. If a large positive number or a large negative number is input to this function, then the function returns a large positive number. If a number close to zero is input to this function, then the function returns a number close to zero. One can approximate this function using a set of linear functions. When a positive number is the input, one returns a number using a linear function defined by multiplying the input by a positive number and adding another number. When a negative number is the input, one returns a number using a different linear function defined by multiplying the input by a NEGATIVE number instead of a positive number and then adding another number. The above operations can naturally be represented as a sum of two rectilinear units, assuming the input pattern is a number. One of these two rectilinear units is defined by taking the input number, multiplying it by a positive weight, and then returning the netinput if the netinput is positive and returning zero otherwise. The other rectilinear unit has its netinput defined as the product of the input number and a negative weight. When the input pattern number is positive, one rectilinear unit responds, while when the input pattern number is negative, the other rectilinear unit responds. This illustrates the idea that a weighted sum of rectilinear units can be used to approximate a nonlinear function which takes a number as input and returns another number as output.
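The following sketch (assuming NumPy, a handful of hand-picked rectilinear units, and output weights fit by least squares rather than by gradient descent) illustrates how a weighted sum of rectilinear units can approximate the x squared function:

```python
import numpy as np

# A minimal sketch of approximating f(x) = x squared on the interval [-3, 3]
# with a weighted sum of rectilinear units. The hidden-unit weights and biases
# are hand-picked for illustration, and the output weights are fit here by
# least squares rather than by gradient descent.

def relu(z):
    return np.maximum(z, 0.0)

x = np.linspace(-3.0, 3.0, 200)
target = x ** 2

# Six rectilinear units: weights of +1 and -1 paired with different biases.
# The first two units alone produce the V-shaped approximation described in
# the text; the extra units add more "kinks" and tighten the fit.
weights = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
biases = np.array([0.0, 0.0, -1.0, -1.0, -2.0, -2.0])
hidden = relu(np.outer(x, weights) + biases)     # shape (200, 6)

output_weights, *_ = np.linalg.lstsq(hidden, target, rcond=None)
approximation = hidden @ output_weights

print("maximum absolute error:", np.max(np.abs(approximation - target)))
```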

Let’s now consider the more typical case where the input pattern is not a single unit with a numerical state but a large collection of input units where each unit is associated with a different numerical state. Additionally assume one has a “first layer” of hidden units where each hidden unit is a rectilinear unit which computes a weighted sum of the states of a subset of the input units. Then, one may have a “second layer” of hidden units where each hidden unit in the second layer is a rectilinear unit which computes a weighted sum of the states of a subset of the hidden units in the first layer. Then, one could have a “third layer” of hidden units where each hidden unit in the third layer is a rectilinear unit which computes a weighted sum of the states of a subset of the hidden units in the second layer. The output of the network consists of one or more units which compute weighted sums of the states of the hidden units in the third layer. Many of the systems presented at ICLR had many layers including some systems which had over 15 layers of hidden units! Conceptually, the resulting input to output function may be interpreted as a piecewise linear multidimensional function which has the capability of closely approximating a wide range of important functions.
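As a rough illustration of this layered architecture, here is a minimal sketch assuming NumPy, hypothetical layer sizes, and randomly initialized weights (a real system would of course learn these weights from data):

```python
import numpy as np

# A minimal sketch of the layered architecture described above: an input layer,
# three hidden layers of rectilinear units, and a linear output layer. The layer
# sizes are hypothetical and the weights are random placeholders; a real system
# would learn the weights from data.

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
layer_sizes = [100, 50, 50, 50, 10]      # input, three hidden layers, output

params = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
          for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x):
    """Propagate an input pattern through the network."""
    h = x
    for i, (W, b) in enumerate(params):
        net = h @ W + b                                 # weighted sum of the previous layer's states
        h = net if i == len(params) - 1 else relu(net)  # rectilinear hidden layers, linear output layer
    return h

x = rng.normal(size=(1, 100))     # one input pattern consisting of 100 numerical states
print(forward(x).shape)           # prints (1, 10)
```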

Unfortunately I have not had an opportunity to think deeply about this particular choice of unit so I don’t currently have any theoretical results regarding its function approximation performance. However, it seems clear that a weighted combination of rectilinear units could be used to approximate either the response of a radial basis function unit or approximate the response of a sigmoidal unit. From Episode 14, we learned that a feedforward network with a single layer of radial basis function or sigmoidal hidden units can approximate any arbitrary smooth function.

These observations give rise to the conjecture that an arbitrary nonlinear smooth function can be approximated closely by two hidden layers of rectilinear units in a feedforward network consisting of one layer of input units feeding into one layer of rectilinear units and then the outputs of that first layer of rectilinear units feeding into a second layer of rectilinear units and then the outputs of the second layer of rectilinear units feeding into the output layer. I suspect that the theorems cited in Episode 14 can be used to establish this but I have not had time to see if a rectilinear unit satisfies the assumptions of the function approximation theorems cited in Episode 14. The take-home message, at least for now, is one should generally use at least two layers of rectilinear units. One layer of rectilinear units may not be adequate. This message has the status of a mathematical conjecture or a heuristic engineering argument.

Note that the rectified linear unit is not a panacea, and it is still important to consider the use of linear, sigmoidal, and radial basis function units as discussed in Episode 14, but clearly the rectified linear unit has come to play a dominant role in the development of deep learning machines within the past decade! Ironically, the rectified linear unit was discussed decades earlier, in the early 1970s and before, in the context of mathematical models intended to characterize information processing systems in the brain, such as von der Malsburg’s (1973) mathematical model of the self-organization of neurons in the brain cortex.

But what is so great about rectified linear units? Why is everyone using these units? Why don’t we just stick to sigmoidal units or radial basis function units for the purpose of approximating nonlinear mappings as described in the Function Approximation Episode 14? To understand this, we need to understand a problem called the “vanishing gradient problem”. If we have a network with many layers of sigmoidal or radial basis function units and we do not apply a greedy layer-by-layer unsupervised learning strategy but rather try to learn all layers simultaneously, the derivative of the prediction error with respect to the parameters of the learning machine becomes vanishingly small for parameters associated with the early layers of processing. This problem becomes more pronounced in the final stages of learning when the sigmoidal units have responses close to zero or one. The problem can also occur for radial basis function units when the radial basis function is detecting an unfamiliar pattern. A greedy layer-by-layer unsupervised learning strategy solves this problem to some extent, but ideally it would be nice to have a strategy for reducing or avoiding the vanishing gradient problem. The rectilinear unit largely avoids these difficulties. A type of vanishing gradient problem is present in a rectilinear unit only when the rectilinear unit is entirely unresponsive to an input pattern; a partially responsive rectilinear unit, however, is relatively immune to the vanishing gradient problem.
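A quick numerical comparison illustrates the point. The sketch below (assuming NumPy and a few netinput values chosen only for illustration) compares the derivative of a sigmoidal unit, which nearly vanishes when the unit saturates, with the derivative of a rectilinear unit, which equals one whenever the unit responds:

```python
import numpy as np

# A minimal numerical illustration of the vanishing gradient issue. The netinput
# values are chosen only for illustration: a sigmoidal unit's derivative nearly
# vanishes when the unit saturates, while a rectilinear unit's derivative equals
# one whenever the unit is responding.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

net = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

sigmoid_derivative = sigmoid(net) * (1.0 - sigmoid(net))
relu_derivative = (net > 0).astype(float)

print("sigmoid derivative:    ", np.round(sigmoid_derivative, 5))    # nearly 0 at net = -10 and net = 10
print("rectilinear derivative:", relu_derivative)                    # 0 when unresponsive, 1 when responsive
```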

Another key ingredient common to many of the deep learning machines was the use of convolutional neural networks. The convolutional neural network has also been around for a long time, both as a mathematical theory of neural processing in brains and as a tool in signal processing, but it is now being exploited as a key idea in the recent successes of deep learning. The concept of a convolutional neural network is best understood by first considering the concept of convolution in signal processing. Wikipedia (http://en.wikipedia.org/wiki/Convolution) reports one of the earliest uses of the convolution operation in the time domain in a paper by the French mathematician Lacroix at the end of the 18th century. Here is the essential idea.

Suppose that we want to build a system which can solve the signal detection problem. The system is provided with a sequence of signal amplitudes generated by a microphone. At time slice 1, the microphone generates a voltage or equivalently a signal amplitude. At time slice 2, the microphone generates another signal amplitude. And so on. To keep things simple, let’s assume that a particular sequence of 5 different signal amplitudes represents a particular “wind chime sound” while another particular sequence of 5 different signal amplitudes represents a “doorbell chime sound”.

The system works as follows. Assume that we have a feature detector (also known as a “filter” or “kernel”) which looks at a sequence of 5 time slices and generates the number 1 if the signal amplitudes associated with those 5 time slices correspond to either a “wind chime sound” or a “doorbell chime sound”.  On the other hand, if the signal amplitudes associated with those 5 time slices do not correspond to either a “wind chime sound” or a “doorbell chime sound” then the response of the feature detector will be equal to zero. Now suppose a sequence of 100 signal amplitudes associated with 100 distinct sequential time slices is presented to the system. We can apply this feature detector to time slices: 1, 2, 3, 4, and 5 then we can apply this feature detector to time slices: 2, 3, 4, 5, and 6  and so on until we finally apply this feature detector to time slices: 96, 97, 98, 99, 100. For each of these 96 applications of the feature detector we generate an output from the feature detector resulting in a sequence of 96 numbers generated in response to applying the 5-input feature detector to a sequence of 100 time slices. This response pattern of 96 numbers generated in response to the 100 signal amplitudes is called a “feature map”. If one examines this feature map which is a sequence of 96 zeros and ones, one will find a number 1 appears at the point in time where either a “doorbell chime sound” or “wind chime sound” was detected by the system. We will refer to this operation of constructing a feature map as “scanning” the input signal.
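Here is a minimal sketch of this scanning operation, assuming NumPy, a random placeholder signal, and a placeholder 5-input detector that simply thresholds its weighted sum (rather than a trained wind chime or doorbell detector):

```python
import numpy as np

# A minimal sketch of the scanning operation described above: a 5-input feature
# detector is slid across 100 signal amplitudes, producing a feature map of 96
# responses. The signal and the detector weights are random placeholders (a
# trained wind chime / doorbell detector would have learned weights), and the
# detector here simply thresholds its weighted sum to produce a 0 or a 1.

rng = np.random.default_rng(0)
signal = rng.normal(size=100)      # 100 signal amplitudes, one per time slice
detector = rng.normal(size=5)      # placeholder weights for the 5-input feature detector

feature_map = np.array([
    float(signal[t:t + 5] @ detector > 0)    # 1 if the detector "fires" on this window of 5 time slices
    for t in range(len(signal) - 4)          # windows beginning at time slices 1 through 96
])
print(feature_map.shape)    # prints (96,)
```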

To train this system, we can imagine presenting the system with a long sequence of time slices where each time slice is associated with a signal amplitude. To be concrete, suppose that we train the system with a sequence of 10000 time slices. We now construct a linear or nonlinear learning machine which has five inputs and one output. This learning machine will become our feature detector. We train the system with the signal amplitudes from time slices 1, 2, 3, 4, and 5 to produce the number “1”, then we train the system with signal amplitudes from time slices 2, 3, 4, 5, and 6 to produce the number “1”. Then we train the system with signal amplitudes from time slices 3, 4, 5, 6, and 7 to produce the number “1”. Thus, the feature detector learns to respond to patterns presented over 5 consecutive time slices. In this example, there will be 10000 minus 4 different input patterns which can be used for training and the desired response will always be equal to “1”. We can use a gradient descent method as described in Episode 16 to train up this unsupervised learning machine, but we will probably have to include a regularization constraint in the objective function which biases the learning process to generate a “0” when the magnitude of the pattern of 5 signal amplitudes at the learning machine’s input is close to zero.
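The following sketch gives one way such a detector might be trained, under several assumptions of my own: a random placeholder signal stands in for real recordings, the detector is a linear 5-input unit, the objective is squared error with desired response 1 on every window, and a simple penalty on the size of the weights stands in for the regularization constraint mentioned above:

```python
import numpy as np

# A minimal training sketch under several assumptions of my own: a random
# placeholder signal stands in for real recordings, the detector is a linear
# 5-input unit, the objective is squared error with desired response 1 on every
# window, and an L2 penalty on the weights stands in for the regularization
# constraint mentioned in the text.

rng = np.random.default_rng(0)
signal = rng.normal(size=10000)                                          # 10000 training time slices
windows = np.stack([signal[t:t + 5] for t in range(len(signal) - 4)])    # 9996 input patterns of 5 amplitudes
targets = np.ones(len(windows))                                          # desired response is always 1

w = np.zeros(5)
step_size, penalty = 0.01, 0.1        # hypothetical learning rate and regularization weight
for _ in range(200):
    predictions = windows @ w
    gradient = windows.T @ (predictions - targets) / len(windows) + penalty * w
    w -= step_size * gradient

print(np.round(w, 3))                 # the learned 5-input feature detector weights
```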

A very important consequence of this learning process is that if the system is trained with a sequence of two wind chimes followed by four doorbell chimes, this is effectively equivalent to training the feature detector to recognize both wind chimes and doorbell chimes. Thus, novel sequences of wind chimes and doorbell chimes can be identified by using the feature detector to generate a feature map for a given input sequence of auditory signals.

In practice, we often want to detect multiple features or patterns of feature values so instead of just using one feature we might use a collection of features. If we have two feature detectors, this generates two feature maps. If we have 10 feature detectors, this generates 10 feature maps.

Note that this methodology for constructing feature maps not only dramatically improves generalization performance but also uses far fewer parameters compared to a system where all time slices of the input pattern of microphone signal amplitudes are processed by each hidden unit. To illustrate this point, suppose we had a sequence of microphone signal amplitudes whose length is 1000 time slices. If we also assume we have 10 hidden units and each hidden unit is connected to the 1000 input units associated with these 1000 time slices, then this would involve approximately 10,000 free parameters.

On the other hand, suppose each of the 10 hidden units is scanning only a small region of the 1000 time slice input pattern in order to detect the presence or absence of a pattern. For example, each of the 10 hidden units might be a rectilinear unit which computes a weighted sum of 20 consecutive time slices from the input pattern of 1000 time slices. The number of free parameters for each of these 10 hidden units is only 20 free parameters! So if we have 10 hidden units and each unit has only 20 inputs, we only have 200 free parameters! So we have replaced a system with 10,000 free parameters with a system consisting of only 200 free parameters!
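The parameter counts above are easy to verify (bias parameters are ignored here, as in the text):

```python
# A quick check of the parameter counts described above (bias parameters are
# ignored, as in the text).

number_of_inputs = 1000     # time slices in the input pattern
number_of_hidden = 10       # hidden units
window_size = 20            # consecutive time slices scanned by each hidden unit

fully_connected_parameters = number_of_hidden * number_of_inputs   # 10 * 1000
scanning_parameters = number_of_hidden * window_size               # 10 * 20

print(fully_connected_parameters)   # prints 10000
print(scanning_parameters)          # prints 200
```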

These same ideas are naturally extended to the case of image processing. In an image processing scenario, the input pattern is not a sequence of numbers corresponding to the output of a microphone over time but rather a two-dimensional array of numbers where the number in row i and column j of the array specifies the intensity level or gray-scale level of the pixel located in the ith row and jth column of an image. If the image is a color image, then the input pattern is a two-dimensional array of number triplets where each number triplet is a set of three numbers specifying the magnitude of the red color intensity, the magnitude of the green color intensity, and the magnitude of the blue color intensity at a particular pixel. Thus, one can visualize the input pattern as a three-dimensional array of numbers where the first two indices of the array specify the location of a pixel in the image and the third index specifies the “color” channel (red, green, or blue) of that pixel.

A feature detector then “scans” this three-dimensional array of numbers in a manner analogous to the scanning process in the acoustic signal processing case. However, unlike the one-dimensional acoustic signal case, the inputs to the feature detector now typically correspond to a rectangular array of numbers. For example, a feature detector might have 25 inputs corresponding to a 5 by 5 patch of numbers which are adjacent to one another on the original two-dimensional array of numbers. For example, if the original color image has 1000 rows of pixels and 1000 columns of pixels, then it is represented by an array of color triplets where the triplet in row i and column j specifies the color shade of the pixel in row i and column j of the original color image.

We now want to use our feature detector to “scan” the input pattern and generate a feature map. So first the feature detector looks at the first 5 rows and first 5 columns of number triplets which define the color image and generates a single numerical value which is recorded on the “feature map”. Next, the feature detector is moved so that it looks at the first 5 rows and columns 6, 7, 8, 9, and 10 of the image. The response of the feature detector to this new input is then recorded on the “feature map”. And so on. In a scientific paper, the standard terminology for describing this type of network architecture would be to say that a convolutional neural network (CNN) has been used with a single square filter with an edge size of 5. Typically one has multiple feature detectors and thus multiple feature maps generated for a given two-dimensional input pattern. For example, the terminology: “…in forward order, the convolutional layers have 64, 128, 256 square filters with edge sizes 3, 4, and 5 respectively” would imply that 64 feature maps are generated from the original input pattern where the feature detectors for each of the 64 feature maps each have 9 inputs arranged in a 3 by 3 array. This statement would also imply that 128 feature maps are generated by applying 128 feature detectors where each feature detector in the second convolutional layer has 16 inputs arranged in a 4 by 4 square array. Each of these inputs is specified by 64 numbers associated with the 64 feature maps generated from the first convolutional layer.
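Here is a minimal sketch of this two-dimensional scanning operation, assuming NumPy and random placeholder values for the image and the filter. Following the example in the text, the detector is moved a full filter width at a time; an overlapping scan, which is also common, would move it one pixel at a time:

```python
import numpy as np

# A minimal sketch of scanning a 1000 by 1000 color image with a single square
# filter of edge size 5. The image and filter values are random placeholders.
# Following the example in the text, the detector is moved a full filter width
# at a time; an overlapping scan would move one pixel at a time.

rng = np.random.default_rng(0)
image = rng.random(size=(1000, 1000, 3))    # rows, columns, and red/green/blue intensities
kernel = rng.normal(size=(5, 5, 3))         # one feature detector spanning a 5 by 5 patch of color triplets

edge, step = 5, 5
row_starts = range(0, image.shape[0] - edge + 1, step)
column_starts = range(0, image.shape[1] - edge + 1, step)

feature_map = np.array([
    [np.sum(image[i:i + edge, j:j + edge, :] * kernel) for j in column_starts]
    for i in row_starts
])
print(feature_map.shape)    # prints (200, 200)
```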

Note that all of the issues discussed in the one-dimensional acoustic signal case transfer to the multi-dimensional image processing case in a straightforward manner. One obtains improved generalization performance because the system can learn about a particular feature in one part of the image and then use that knowledge to recognize that same feature in a different part of the image. The number of free parameters of the system is dramatically decreased by orders of magnitude.

Also note that despite the vast reduction in the number of free parameters obtained using convolutional neural networks, large input patterns and multiple layers rapidly increase the number of parameters and units to astronomical sizes. A common strategy to reduce dimensionality and possibly improve generalization performance is to divide a given feature map into M regions and then transform that feature map into a feature map of lower dimensionality by selecting the unit which is most active in each of the M regions. This generates a new feature map of reduced dimensionality which only has M units. This operation is called “max-pooling”.
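Here is a minimal sketch of max-pooling, assuming NumPy and a small feature map divided into non-overlapping 2 by 2 regions (any rows or columns that do not fit evenly into a region are simply cropped in this sketch):

```python
import numpy as np

# A minimal sketch of max-pooling: the feature map is divided into
# non-overlapping 2 by 2 regions and only the most active unit in each region
# is kept. Rows or columns that do not fit evenly into a region are cropped in
# this sketch.

def max_pool(feature_map, region=2):
    rows, cols = feature_map.shape
    trimmed = feature_map[:rows - rows % region, :cols - cols % region]
    blocks = trimmed.reshape(rows // region, region, cols // region, region)
    return blocks.max(axis=(1, 3))

feature_map = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(feature_map))    # a 2 by 2 map holding the maximum of each 2 by 2 region
```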

A good introduction to the concept of convolutional neural networks may be found at Tim Dettmers’ blog page on convolutional deep learning. Tim provides a very clear description of the key ideas with some very nice graphical illustrations. Another good blog describing convolutional networks for deep learning may be found at the website: deeplearning.net/tutorial/lenet.html. Another good blog describing convolutional neural networks for deep learning is the blog by Andrew Gibiansky at the website: andrew.gibiansky.com/blog/machine-learning/convolutional-neural-networks/. The links to these pages may be found by visiting the website: www.learningmachines101.com and then checking out the transcript for this episode.

Thank you again for listening to this episode of Learning Machines 101! I would like to remind you also that if you are a member of the Learning Machines 101 community, please update your user profile.

You can update your user profile when you receive the email newsletter by simply clicking on the: “Let us know what you want to hear” link!

Or if you are not a member of the Learning Machines 101 community, when you join the community by visiting our website at: www.learningmachines101.com you will have the opportunity to update your user profile at that time.

From time to time, I will review the profiles of members of the Learning Machines 101 community and do my best to talk about topics of interest to the members of this group!

Also consider joining the “Statistical Machine Learning Forum” group on LinkedIn! You are encouraged to post comments about episodes at: www.learningmachines101.com or at the “Statistical Machine Learning Forum” group on LinkedIn. In addition to responding to your comments, those comments might be used as the basis for a future episode of “Learning Machines 101”!!

And finally, I noticed I have been getting some nice reviews on iTunes. Thank you so much. Your feedback and encouragement are greatly valued!

Keywords:  Deep Learning, Perceptrons, Max-Pooling, Convolutional Neural Networks, Rectified Linear Units, Rectilinear Units

