Just preordered this, New DL Book Awesome

It’s by Goodfellow, Bengio, and Courville, and it is the first DL book I’ve seen. It’s also free online, but I like to have the physical copy and am kind of a book fiend, so I bought it. I really hope it comes in next week, so I can read it after my last class ends this December 😦  I’d really like to go for the PhD and study Applied Harmonic Analysis and how it relates to high dimensional statistics, machine learning, etc. But alas, life happens and bills have to be paid, what a bunch of dream crushers.

Anyways, I have some things to talk about but I haven’t found the time to really write a good post, but here is a preview.

• I have a bunch of random unproved thoughts about DL I’d like to jot down here
• I read the Google Brain paper about the expressivity of networks vs their depth, which was pretty good
• I’m going to be doing some reinforcement learning project using the OpenAI gym and hopefully examining the natural gradient’s usefulness. This reminds me that I watched a LeCun lecture recently where he talks about what the loss surface actually looks like for these networks.
• I also have the natural gradient paper that I should summarize soon.

Finally, I found this conference about a year ago and thought it would be a dream come true if I could go. A bunch of the big names were there, Candes, Bruna, Recht, Maggioni, LeCun, Mallat, etc. They posted all their videos online and I’ve watched 1.5 so far, and they are INCREDIBLE! It’s really nice to listen to their thoughts on things that I’ve read and probably didn’t fully understand at the time.

DL Math Vids

# Two youtube lectures on quantum algorithms and deep learning I watched this week.

Here’s the first one by Seth Lloyd that has to do with Quantum Machine Learning Algorithms. Did you know he was a Mech E by degree, like me? So he might be my new hero.

To summarize, he basically introduces quantum computing by defining a qubit, which is basically a bit defined by the superposition of two basis vectors.

$x = \alpha_0\left|0\right> + \alpha_1 \left|1\right>$

I apologize for the poor looking bra-kets, I don’t think I have the package on here. So each one of those, $\left|-\right>$, is just a vector in a 2D Hilbert space, and the $\alpha_i \in \mathbb{C}$, where the magnitude squared is equal to the probability of measuring the qubit in that state, so $|\alpha_0|^2+|\alpha_1|^2=1$. So with a qubit you have more information per bit than a traditional computer, and when we have $n$ qubits, you actually get the tensor product of Hilbert spaces such that $\mathbf{x} \in \mathbb{C}^{2^n}$. Then we can imagine that if we want to represent a length $n$ vector in a traditional computer, we require on the order of $n$ bits, but using a quantum computer we need only about $\log_2 n$ qubits. Thus, we may argue that since Machine Learning is about manipulating vectors to create a model, we can potentially get an exponential speed up on many machine learning tasks like KNN, SVM, etc.
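To make the counting argument concrete, here is a minimal Python sketch. The helper names `qubits_needed` and `amplitude_encode` are my own for illustration, and the sketch only does the bookkeeping on a classical machine: it pads and normalizes a vector so it could serve as a set of amplitudes. It says nothing about how one would actually load the data into a quantum state.

```python
import math

def qubits_needed(n):
    """Smallest q such that 2**q amplitudes can hold a length-n vector."""
    return max(1, (n - 1).bit_length())

def amplitude_encode(vec):
    """Zero-pad to length 2**q and normalize so squared magnitudes sum to 1."""
    q = qubits_needed(len(vec))
    norm = math.sqrt(sum(abs(a) ** 2 for a in vec))
    padded = list(vec) + [0.0] * (2 ** q - len(vec))
    return [a / norm for a in padded]

# A length-2 vector fits in a single qubit.
state = amplitude_encode([3.0, 4.0])
probs = [abs(a) ** 2 for a in state]  # measurement probabilities

# A million-entry vector needs only about 20 qubits.
big = qubits_needed(1_000_000)
```

The squared magnitudes in `probs` sum to one, which is the same normalization the $\alpha_i$ have to satisfy.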

The second video was by Joan Bruna and was all about Convolutional Neural Networks (CNNs). He talks about scattering networks, sparse recovery, and a whole bunch of other goodies. I’m not going to go into too much detail here, you should just watch the video.

# Classic Paper from the ImageNet Competition

## Introduction

It’s been a while since I’ve been able to sit down and write a post. Classes started up, and I’m grading for another course, and while I’ve been keeping up with at least a surface level reading of papers, the density of the material in said papers means a post would be difficult to write in one sitting. However, the Deep Learning area has piqued my interest recently and a lot of the papers are relatively easy to digest, so here we go…

Deep Learning, a subfield of Machine Learning, is a red hot area right now, with record breaking results being shown every year utilizing these deep neural architectures. If you look at the leaderboard from the ImageNet competition, only one team used a CNN when the architecture was introduced into the competition in 2012, and now all the competing teams utilize CNNs. The team that first introduced the architecture into the contest placed first by a large margin, with a top-5 error of roughly 15%, while the next-best entry was closer to 26%.

Now, while the performance of the deep architectures is phenomenal, it still seems that there is not a ton of mathematical reasoning to back up the performance [2,3]. Intuitively, one can imagine that since we are training with an enormous volume of data, that each class we desire to learn will have seen all the different linear and nonlinear distortions of that class, and that if our architecture is deep enough we can accurately represent the function that can map all these permutations of a class correctly.

I’ve introduced the topic of neural networks, but I haven’t explained what they are yet, see Figure 1.

Fig 1. Example Neural Network: Note that we can add a bias to the hidden layer by introducing a 1-node with no inputs

Basically, we take our input/training data, $\mathbf{x}$, multiply by a set of weights, $\mathbf{w}$, throw that result into a nonlinear activation function, then take that output, multiply by a new set of weights and get our outputs $\mathbf{y}$. So, for a given layer we have,

$\hat{x}_i=f(\mathbf{w}_i^T\mathbf{x}+b_i)$

where $\mathbf{w}_i$ and $b_i$ are the weights and bias of the $i$th node in the layer, and $f:\mathbb{R} \rightarrow \mathbb{R}$ is some nonlinear function, like a sigmoid, $(1+\exp(-x))^{-1}$, or a rectified linear unit (ReLU), $f(x)=\max\{0,x\}$. Now, our goal in the supervised setting is, given some labeled training set $\{x_i,y_i \}$, to learn the weights $\mathbf{w}$ that minimize our error with respect to some cost function, such as $J(\hat{\mathbf{y}})=\frac{1}{2}||\hat{\mathbf{y}}-\mathbf{y}||^2$. But, we need an algorithm to learn these weights, and it so happens that around the late 80’s, the legendary Yann LeCun (NYU, FAIR) showed that Fukushima’s neocognitron could be trained with the backpropagation algorithm, utilizing the rule we all know and love from Calc I, the chain rule (see [4] for details).
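The forward pass and cost described above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the weights, biases, and input below are made-up numbers, the helper names are hypothetical, and no learning happens here (that would be the backpropagation step).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def layer(W, b, x, f):
    """One layer: apply f to w_i^T x + b_i for each row w_i of W."""
    return [f(sum(w_ij * x_j for w_ij, x_j in zip(w_i, x)) + b_i)
            for w_i, b_i in zip(W, b)]

def cost(y_hat, y):
    """Squared-error cost J = 0.5 * ||y_hat - y||^2."""
    return 0.5 * sum((a - t) ** 2 for a, t in zip(y_hat, y))

# Made-up example: 2 inputs -> 2 hidden ReLU nodes -> 1 sigmoid output.
x = [1.0, 2.0]
W1, b1 = [[0.5, 0.25], [-1.0, 0.5]], [0.0, 0.5]
W2, b2 = [[1.0, -1.0]], [0.0]

hidden = layer(W1, b1, x, relu)
y_hat = layer(W2, b2, hidden, sigmoid)
J = cost(y_hat, [1.0])
```

Training would then adjust `W1`, `b1`, `W2`, and `b2` to push `J` down, using the chain rule to get the gradient of `J` with respect to each weight.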

## The Paper

All this introduction and no talk about the actual paper, which according to some people started it all, although Yann’s paper for classifying MNIST came well before this one [1]. CNNs have been shown to be extremely powerful in image recognition tasks. Every year, the ImageNet competition, built on a dataset with millions of images and ~22k categories, beckons some of the best and brightest in the biz to test their algorithms.

### Preprocessing

The data was preprocessed by downsampling all images to 256 x 256 and cropping any nonsquare images. Interestingly, they kept the RGB values for the pixels, which in my mind would increase the dimensionality without adding any information beyond grayscale.

### Architecture

The net used 8 learned layers, with 5 being convolutional and 3 being fully-connected. The nonlinearity the researchers used was the ReLU, as they found it converged to a 25% training error rate about 6x faster than the tanh function. Even though the ReLU doesn’t require normalization to prevent saturation, they still found that a normalization scheme aided in reducing the error rate, as follows,

$b^i_{x,y}=a^i_{x,y}/\big( k+\alpha \sum\limits_{j=\max(0,i-n/2)}^{\min(N-1,i+n/2)} (a^j_{x,y})^2 \big) ^\beta$

where the sum is over the $n$ adjacent kernel maps at the same spatial position, $N$ is the total number of kernel maps in the layer, and $a^i_{x,y}$ is the output of the nonlinear function for kernel map $i$ at position $(x,y)$. The pooling layers for this CNN used 3 x 3 pools with a stride of 2, so that neighboring pools overlap, which increased performance over the non-overlapping method. The final layer used softmax,

$\sigma(\mathbf{z})_j=\frac{e^{z_j}}{\sum_k e^{z_k}}$

for a 1000 node fully connected final output layer. The result of the softmax is that a distribution is fit over the 1k class labels. Ultimately, the network had 60e6 parameters.
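Here is a small Python sketch of the two formulas in this section, the normalization across kernel maps and the softmax. It works on a plain list holding the activations of the $N$ kernel maps at one $(x,y)$ position. The function names are mine, and the default constants ($k=2$, $n=5$, $\alpha=10^{-4}$, $\beta=0.75$) are the values I recall the paper reporting, so treat them as assumptions and check against [1].

```python
import math

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize a[i] by the squared activity of its n neighboring kernel maps."""
    N = len(a)
    out = []
    for i in range(N):
        lo = max(0, i - n // 2)
        hi = min(N - 1, i + n // 2)
        energy = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * energy) ** beta)
    return out

def softmax(z):
    """Map scores z to a probability distribution over the classes."""
    m = max(z)  # subtracting the max avoids overflow, result is unchanged
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

activations = [1.0, 2.0, 3.0]
normalized = local_response_norm(activations)
probs = softmax([1.0, 2.0, 3.0])
```

Subtracting the maximum inside `softmax` is a standard numerical trick and not something the formula itself requires.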

Fig 2. The overall architecture from the paper [1]

### Overfitting

A major problem in fitting models to data is overfitting: tailoring your weights so closely to the training data set that your test error increases. In other words, the model fits the noise in the training set such that small fluctuations at the input cause large perturbations at the output. To reduce this effect, the input data was perturbed by introducing random translations and horizontal mirroring. A noisy perturbation was also introduced to each RGB layer by randomly magnifying the principal components as follows,

$R=\frac{1}{N}\sum_i x_ix_i^T,\qquad x_i \in \mathbb{R}^3$

$R =U\Lambda U^T$

where the $x_i$ are the (mean-subtracted) RGB pixel values over the training set and $\mathbf{u}_j$, $\lambda_j$ are the eigenvectors and eigenvalues of $R$. The quantity added to each pixel is then

$[\mathbf{u}_1, \mathbf{u}_2, \mathbf{u}_3][\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3]^T$ where, $\alpha_i \sim \mathcal{N}(0,0.1)$, with 0.1 being the standard deviation.

This random weighting of the eigenvalues supposedly builds invariance to lighting/brightness into the model.
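The two augmentations above can be sketched roughly as follows in plain Python. Everything here is illustrative: the function names are mine, the tiny 4 x 4 "image" is made up, and the eigenvectors and eigenvalues passed to `pca_color_shift` are placeholder values rather than ones computed from real image statistics.

```python
import random

def random_crop_and_flip(img, size, rng):
    """Take a random size x size patch from img (a list of rows), maybe mirrored."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - size)
    left = rng.randint(0, w - size)
    patch = [row[left:left + size] for row in img[top:top + size]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # horizontal mirror
    return patch

def pca_color_shift(U, lam, rng, sigma=0.1):
    """RGB offset [u1 u2 u3][a1*l1, a2*l2, a3*l3]^T with a_i ~ N(0, sigma)."""
    alphas = [rng.gauss(0.0, sigma) for _ in lam]
    scaled = [a * l for a, l in zip(alphas, lam)]
    # U[r][c] is component r of eigenvector c
    return [sum(U[r][c] * scaled[c] for c in range(3)) for r in range(3)]

rng = random.Random(0)
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patch = random_crop_and_flip(image, 3, rng)

# Placeholder eigenpairs, not computed from data.
U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
lam = [1.0, 0.5, 0.1]
shift = pca_color_shift(U, lam, rng)
```

The same `shift` would be added to the RGB value of every pixel in one image, with new $\alpha_i$ drawn each time the image is used for training.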

### Dropouts

A good way to build models is to train a large variety of architectures and then combine their outputs into a single final answer. However, because training these networks can take days, building multiples is too time consuming. One way to approximately do this is to randomly, with probability $p=0.5$, set the output of a neuron to 0 during training. This causes that neuron to have no effect on the learning of the weights for that training example. Then when testing, all the neurons are used but with their outputs multiplied by 0.5 to approximate the geometric mean of all the trained networks.
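A minimal sketch of the train-time and test-time behavior described above, in plain Python. The function names `dropout_train` and `dropout_test` are hypothetical, and the activations are made-up numbers.

```python
import random

def dropout_train(activations, rng, p=0.5):
    """During training, zero each activation independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]

def dropout_test(activations, p=0.5):
    """At test time keep every neuron but scale its output by (1 - p)."""
    return [a * (1.0 - p) for a in activations]

rng = random.Random(0)
acts = [1.0, 2.0, 3.0, 4.0]
dropped = dropout_train(acts, rng)   # some entries zeroed, the rest unchanged
scaled = dropout_test(acts)          # every entry halved
```

Each call to `dropout_train` draws a fresh mask, so every training example effectively sees a different thinned network that shares weights with all the others.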

### The Learning Algorithm & Results

Using stochastic gradient descent, the weight update rule was defined as,

$v_{i+1}:=0.9v_i-0.0005\epsilon w_i-\epsilon\left\langle\left.\frac{\partial L}{\partial w}\right|_{w_i}\right\rangle_{D_i}$

$w_{i+1}:=w_i+v_{i+1}$

where, $v$ is the momentum variable, $\epsilon$ is the learning rate, 0.9 is the momentum coefficient, 0.0005 is the weight decay, and $\left\langle\left.\frac{\partial L}{\partial w}\right|_{w_i}\right\rangle_{D_i}$ is the gradient averaged over the $i$th batch, $D_i$.
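To get a feel for the update rule, here is a sketch of it on a toy quadratic loss in plain Python. The momentum coefficient and weight decay are left as arguments; the demo uses 0.9 and 0.0005, which are the values reported in the paper. The loss, the target, and the function names are made up for illustration, and the "batch average" is just the exact gradient of the toy loss.

```python
def sgd_momentum_step(w, v, grad, lr, momentum, weight_decay):
    """One update: v <- momentum*v - weight_decay*lr*w - lr*grad, then w <- w + v."""
    v_next = [momentum * vi - weight_decay * lr * wi - lr * gi
              for vi, wi, gi in zip(v, w, grad)]
    w_next = [wi + vn for wi, vn in zip(w, v_next)]
    return w_next, v_next

def toy_grad(w):
    """Gradient of the toy loss L(w) = 0.5 * ||w - target||^2."""
    target = [1.0, -2.0]
    return [wi - ti for wi, ti in zip(w, target)]

w, v = [0.0, 0.0], [0.0, 0.0]
for _ in range(300):
    w, v = sgd_momentum_step(w, v, toy_grad(w),
                             lr=0.01, momentum=0.9, weight_decay=0.0005)
```

After the loop `w` sits very close to the target. The small weight decay term pulls it slightly toward zero, so it does not land exactly on `[1.0, -2.0]`.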

The training took place over 5 to 6 days for 90 cycles over 1.2e6 images, using GTX 580 GPUs.

Ultimately, this architecture led to the following error rates, as seen in Figure 3.

Fig 3. Performance results [1]

## Take Away

This was the first time this type of learning was used at the ImageNet competition, and they blew the competition away. Since then, it seems almost all entries use CNNs, and it appears that the error rate for the machine is now lower than the human error rate (~5.1%).

Hopefully, this post introduced the topic of Deep Learning, and showed the usefulness of the area by briefly summarizing one of the papers that started it all. This field is extremely exciting as it not only provides record breaking performance on many tasks, but also has a huge engineering and theoretical component. Anytime you see the word ‘architecture’ the mind immediately thinks of engineering, and clearly the building, training, and coding of these neural networks provides an exciting playground for engineers. There even seems to be a hardware component, with companies like NVIDIA developing a new DL specific super-computer. On the theoretical side, DL has many open problems and draws from so many areas, which I feel makes the field super complex and exciting. How do we compress architectures and maintain performance? Why do these networks work so well? Can we come up with better training algorithms and optimization strategies? Can these techniques be used on even more complex tasks? To be honest, it’s hard for me to even put it all down, there’s just so many angles you can look at this stuff from.

In the future, hopefully next week, I will move more towards the theoretical side. I’m thinking something on stochastic gradient descent might be nice. There are many papers like this, that briefly introduce the problem, then describe the architecture and show some results. But, these are the types of papers where you’re better off trying to implement something yourself to get a feel for the engineering aspects.

# Introduction

Hi, my name is Kevin and I’m a perpetual student. I’m currently pursuing a professional MS in ECE part time while I work full time as an engineer. I will be done with the MS this December 2016, after I finish up my coursework in Advanced Signal Processing, and Multivariate Analysis. At work I mostly try to design statistical signal processing algorithms with a side of some basic machine learning.

In my intellectual pursuits, I constantly try to understand the mathematical theory that current trendy topics are built on. Where we define trendy to be anything involving machine learning (especially deep learning), data science, computer vision, AI, quantum computing, etc. In my opinion, this involves trying to be competent in harmonic analysis, optimization theory, and statistics.

What little free time I have is spent exercising, biking, hiking, visiting family, playing guitar, hanging out with my wife:), drinking craft beer, and trying to keep up with house projects (they never end).

Ultimately, my goal for this blog is to document the vast array of interesting technical papers and books that I read. The hope being that I reinforce my learning, understand the details better, and maybe help out the few readers who happen to stumble on this blog. I’m also hoping to demonstrate some of the algorithms I discover along the way, and in this way force myself to implement them not only in MATLAB, but also Python and maybe C if I’m feeling really adventurous. Hopefully, this blog will be the catalyst that forces me to go beyond just MATLAB (although it’s so easy to prototype in, how could I not start with it?).

The interested reader may also wonder about the name of this blog. Well, as an electrical engineer, Euler’s formula is extremely important as we come in contact with the strange beast that is the complex field. However, I can’t take all the credit as this name is a common handle that my father uses (thanks Dad!).

Finally, a test demonstrating why I chose to use WordPress over Jekyll: ease of LaTeX. A word of warning, I might just be an idiot because I tried to use GitHub Pages and Jekyll but failed repeatedly to get going. And without further ado, I give you one of Euler’s formulas…

$e^{j\omega}=\cos(\omega)+j\sin(\omega)$
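And since I promised to go beyond MATLAB, here is a quick numerical sanity check of the formula in Python, using only the standard library. The function names and the sample values of $\omega$ are arbitrary choices of mine.

```python
import cmath
import math

def euler_lhs(omega):
    """Left side: e^{j*omega}."""
    return cmath.exp(1j * omega)

def euler_rhs(omega):
    """Right side: cos(omega) + j*sin(omega)."""
    return complex(math.cos(omega), math.sin(omega))

# Largest disagreement between the two sides over a few sample angles.
omegas = [0.0, 0.5, math.pi / 2, math.pi, 4.0]
max_gap = max(abs(euler_lhs(w) - euler_rhs(w)) for w in omegas)
```

The two sides agree to within floating point error, and the magnitude of $e^{j\omega}$ stays at one for every $\omega$.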