## Introduction

It’s been a while since I’ve been able to sit down and write a post. Classes started up, I’m grading for another course, and while I’ve kept up with at least a surface-level reading of papers, their density means a post would be difficult to write in one sitting. However, the Deep Learning area has piqued my interest recently, and a lot of the papers are relatively easy to digest, so here we go…

Deep Learning, a subfield of Machine Learning, is a red hot area right now, with record breaking results being shown every year by these deep neural architectures. If you look at the leaderboard from the ImageNet competition since the introduction of CNNs in 2012, when a single team used a CNN, now essentially all the competing teams utilize CNNs. The team that first introduced the architecture placed first by a large margin, with a top-5 error rate of ~15%, while the runner-up was closer to ~26%.

Now, while the performance of these deep architectures is phenomenal, there still does not seem to be much mathematical reasoning to back it up [2,3]. Intuitively, one can imagine that since we train with an enormous volume of data, each class we desire to learn will have been seen under all its different linear and nonlinear distortions, and that if our architecture is deep enough, we can accurately represent the function that maps all these permutations of a class correctly.

I’ve introduced the topic of neural networks, but I haven’t explained what they are yet; see Figure 1.

Fig 1. Example Neural Network: Note that we can add a bias to the hidden layer by introducing a 1-node with no inputs

Basically, we take our input/training data, $latex x$, multiply by a set of weights, $latex W^{(1)}$, throw that result into a nonlinear activation function, then take that output, multiply by a new set of weights, $latex W^{(2)}$, and get our outputs $latex y$. So, for a given layer we have,

$latex a^{(l+1)} = f\left(W^{(l)} a^{(l)} + b^{(l)}\right)$

where $latex f$ is some nonlinear function, like a sigmoid, $latex f(x) = \frac{1}{1 + e^{-x}}$, or a rectified linear unit (ReLU), $latex f(x) = \max(0, x)$. Now, given some labeled training set, $latex \{(x_i, y_i)\}$, in the supervised setting, our goal is to learn the weights that minimize our error with respect to some cost function, $latex L$. But we need an algorithm to learn these weights, and it so happens that around the late 80’s, the legendary Yann LeCun (NYU, FAIR) showed that a convolutional network in the spirit of Fukushima’s neocognitron could be trained with the backpropagation algorithm, utilizing the rule we all know and love from Calc I, the chain rule (see [4] for details).
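To make the layer computation concrete, here is a toy forward pass in NumPy (my own illustration, not from any paper — the shapes and weights are arbitrary):

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, b1, W2, b2):
    # hidden layer: weights applied to the input, then the nonlinearity
    h = sigmoid(W1 @ x + b1)
    # output layer: a new set of weights applied to the hidden activations
    return W2 @ h + b2

rng = np.random.default_rng(0)
x = rng.normal(size=3)           # 3 input features
W1 = rng.normal(size=(4, 3))     # 4 hidden units
b1 = np.zeros(4)                 # the "1-node" bias from Fig 1
W2 = rng.normal(size=(2, 4))     # 2 outputs
b2 = np.zeros(2)

y = forward(x, W1, b1, W2, b2)
print(y.shape)  # (2,)
```

In a real network, a learning algorithm like backpropagation would adjust `W1`, `b1`, `W2`, `b2` to minimize the cost on the training set.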

## The Paper

All this introduction and no talk about the actual paper, which some say started it all, although Yann’s paper for classifying MNIST [5] came well before this one [1]. CNNs have shown themselves to be extremely powerful in image recognition tasks. Every year, the ImageNet competition, built on a dataset with millions of images across ~22k categories, beckons some of the best and brightest in the biz to test their algorithms.

### Preprocessing

The data was preprocessed by downsampling all images so the shorter side was 256 pixels and center-cropping nonsquare images to 256 x 256. Interestingly, they kept the RGB values for the pixels, which in my mind would increase the dimensionality without adding much information beyond grayscale.
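A rough sketch of that preprocessing step (my own reconstruction; nearest-neighbor resizing is used here only for brevity):

```python
import numpy as np

def preprocess(img, size=256):
    """Resize the shorter side to `size` (nearest neighbor), then center-crop size x size."""
    h, w, _ = img.shape
    scale = size / min(h, w)
    nh, nw = round(h * scale), round(w * scale)
    # nearest-neighbor resize via index arrays
    rows = (np.arange(nh) * h / nh).astype(int)
    cols = (np.arange(nw) * w / nw).astype(int)
    resized = img[rows][:, cols]
    # crop the central size x size window
    top = (nh - size) // 2
    left = (nw - size) // 2
    return resized[top:top + size, left:left + size]

img = np.zeros((480, 640, 3), dtype=np.uint8)  # a landscape-shaped dummy image
out = preprocess(img)
print(out.shape)  # (256, 256, 3)
```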

### Architecture

The net used 8 learned layers, 5 convolutional and 3 fully-connected. The nonlinearity the researchers used was the ReLU, as they found it reached a 25% training error rate about 6x faster than the tanh function. Even though the ReLU doesn’t require input normalization to prevent saturation, they still found that a local normalization scheme aided in reducing the error rate, as follows,

$latex b^{i}_{x,y} = a^{i}_{x,y} \Big/ \left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^{j}_{x,y}\right)^2\right)^{\beta}$

where the sum is over the $latex n$ adjacent kernel maps and $latex a^{i}_{x,y}$ is the output of the nonlinear function at a given layer. The pooling layers for this CNN used a stride of 2 with 3 x 3 pools, which increased performance over the non-overlapping method. The final layer used a softmax,

$latex p_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$

for a 1000-node fully-connected final output layer. The result of the softmax is that a distribution is fit over the 1k class labels. Ultimately, the network had ~60 million parameters.
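The softmax itself is a one-liner; a numerically stable version (a standard sketch, not the paper’s code) looks like:

```python
import numpy as np

def softmax(z):
    # subtract the max logit before exponentiating; the result is unchanged
    # because softmax is invariant to shifting all logits by a constant,
    # but this avoids overflow for large logits
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p.sum())  # 1.0 up to floating point: a proper distribution over the classes
```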

Fig 2. The overall architecture from the paper [1]

### Overfitting

A major problem in fitting models to data is overfitting: tailoring your weights so closely to the training set that your test error increases. In other words, the model fits the noise in the training set, such that small fluctuations at the input cause large perturbations at the output. To reduce this effect, the input data was augmented with random translations and horizontal mirroring. A noisy perturbation was also introduced to the RGB channels by randomly magnifying the principal components as follows,

$latex I_{x,y} = \left[I^{R}_{x,y}, I^{G}_{x,y}, I^{B}_{x,y}\right]^{T} + \left[\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3\right]\left[\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3\right]^{T}$

where $latex \mathbf{p}_i$ and $latex \lambda_i$ are the eigenvectors and eigenvalues of the 3 x 3 covariance matrix of the RGB pixel values, and each $latex \alpha_i$ is drawn from a Gaussian with mean 0 and standard deviation 0.1.

This random weighting of the eigenvalues supposedly builds invariance to lighting/brightness into the model.
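My own sketch of that PCA color augmentation, following the description in [1] (variable names are mine):

```python
import numpy as np

def pca_color_augment(img, alpha_std=0.1, rng=None):
    """Add a random multiple of the RGB principal components to every pixel."""
    if rng is None:
        rng = np.random.default_rng()
    pixels = img.reshape(-1, 3).astype(float)
    pixels -= pixels.mean(axis=0)
    cov = np.cov(pixels, rowvar=False)           # 3x3 covariance of RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)       # lambda_i (ascending), p_i as columns
    alphas = rng.normal(0.0, alpha_std, size=3)  # alpha_i ~ N(0, 0.1)
    shift = eigvecs @ (alphas * eigvals)         # [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T
    return img.astype(float) + shift             # same shift added to every pixel

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(8, 8, 3))
out = pca_color_augment(img, rng=rng)
print(out.shape)  # (8, 8, 3)
```

Note that the shift is computed once per image (not per pixel), so the whole image is nudged along the same direction in color space — which is why it mimics a change in lighting.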

### Dropouts

A good way to build models is to train a large variety of architectures and then combine their outputs into a single final answer. However, because training these networks can take days, building multiples is too time consuming. One way to approximate this is to randomly, with p = 0.5, set the output of each hidden neuron to 0. A dropped neuron then has no effect on that pass of learning the weights. At test time, all the neurons are used, but their outputs are multiplied by 0.5 to approximate the geometric mean of the predictions of all the trained sub-networks.
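A minimal sketch of that train/test asymmetry (this follows the paper’s original formulation; modern libraries usually use “inverted” dropout, which scales at training time instead):

```python
import numpy as np

def dropout(h, p=0.5, train=True, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    if train:
        # each unit is zeroed independently with probability p
        mask = rng.random(h.shape) >= p
        return h * mask
    # at test time every unit fires, scaled by (1 - p) so the expected
    # activation matches what the next layer saw during training
    return h * (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(10)
print(dropout(h, train=True, rng=rng))  # roughly half the entries zeroed
print(dropout(h, train=False))          # all entries scaled by 0.5
```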

### The Learning Algorithm & Results

Using stochastic gradient descent, the weight update rule was defined as,

$latex v_{i+1} = 0.9\, v_i - 0.0005\, \epsilon\, w_i - \epsilon \left\langle \frac{\partial L}{\partial w}\Big|_{w_i} \right\rangle_{D_i}$

$latex w_{i+1} = w_i + v_{i+1}$

where $latex v$ is the momentum variable, $latex \epsilon$ is the learning rate, and $latex \left\langle \frac{\partial L}{\partial w}|_{w_i} \right\rangle_{D_i}$ is the gradient averaged over the ith batch $latex D_i$.
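A sketch of that update rule applied to a toy quadratic loss (the momentum of 0.9 and weight decay of 0.0005 are the paper’s values; the learning rate and loss here are arbitrary):

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    # v_{i+1} = 0.9 v_i - 0.0005 * lr * w_i - lr * <dL/dw>
    v = momentum * v - weight_decay * lr * w - lr * grad
    # w_{i+1} = w_i + v_{i+1}
    return w + v, v

# minimize L(w) = 0.5 * ||w||^2, whose gradient is simply w
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=w, lr=0.1)
print(np.linalg.norm(w))  # close to 0 after 200 steps
```

In the real setting, `grad` would be the backpropagated gradient averaged over a mini-batch rather than an analytic derivative.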

The training took five to six days for 90 cycles through the 1.2e6 training images, using two GTX 580 GPUs.

Ultimately, this architecture led to the error rates seen in Figure 3.

Fig 3. Performance results [1]

## Take Away

This was the first time this type of learning was used at the ImageNet competition, and they blew the competition away. Since then, almost all entries have used CNNs, and the machine error rate now appears to be below the human error rate (~5.1%).

Hopefully, this post introduced the topic of Deep Learning and showed the usefulness of the area by briefly summarizing one of the papers that started it all. This field is extremely exciting, as it not only provides record-breaking performance on many tasks, but also has huge engineering and theoretical components. Any time you see the word ‘architecture’ the mind immediately thinks of engineering, and clearly the building, training, and coding of these neural networks provide an exciting playground for engineers. There even seems to be a hardware component, with companies like NVIDIA developing a new DL-specific super-computer. On the theoretical side, DL has many open problems and draws from so many areas, which I feel makes the field super complex and exciting. How do we compress architectures and maintain performance? Why do these networks work so well? Can we come up with better training algorithms and optimization strategies? Can these techniques be used on even more complex tasks? To be honest, it’s hard for me to even put it all down; there are just so many angles you can look at this stuff from.

In the future, hopefully next week, I will move more towards the theoretical side; I’m thinking something on stochastic gradient descent might be nice. There are many papers like this one that briefly introduce the problem, then describe the architecture and show some results. But these are the types of papers where you’re better off trying to implement something yourself to get a feel for the engineering aspects.

## References

[1] ImageNet Classification with Deep Convolutional Neural Networks – Krizhevsky, et al.

[2] Understanding Deep Convolutional Networks – Mallat

[3] Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity – Daniely, et al.

[4] Andrew Ng’s Excellent Tutorial on Neural Networks

[5] Gradient-Based Learning Applied to Document Recognition – LeCun, et al.