Orthogonal Initialization in Convolutional Layers

In Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe, McClelland, and Ganguli investigate the question of how to initialize the weights in deep neural networks by studying the learning dynamics of deep linear neural networks. In particular, they suggest that the weight matrix should be chosen as a random orthogonal matrix, i.e., a square matrix $W$ for which $W^TW = I$.

In practice, initializing the weight matrix of a dense layer to a random orthogonal matrix is fairly straightforward. For the convolutional layer, where the weight matrix isn’t strictly a matrix, we need to think more carefully about what this means.

In this post we briefly describe some properties of orthogonal matrices that make them useful for training deep networks, before discussing how this can be realized in the convolutional layers in a deep convolutional neural network.

Introduction

Two properties of orthogonal matrices that are useful for training deep neural networks are

  1. they are norm-preserving, i.e., $||Wx||_2 = ||x||_2$, and
  2. their columns (and rows) are all orthonormal to one another, i.e., $w_{i}^Tw_{j} = \delta_{ij}$, where $w_{i}$ refers to the $i$th column of $W$.

At least at the start of training, the first of these should help to keep the norm of the input constant throughout the network, which can help with the problem of exploding/vanishing gradients. Similarly, an intuitive understanding of the second is that having orthonormal weight vectors encourages the weights to learn different input features.
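
To make these properties concrete, here is a small numpy check (a sketch that builds a random orthogonal matrix from a QR decomposition, one common way to obtain one):

>>> import numpy as np
>>> Q, _ = np.linalg.qr(np.random.randn(128, 128))  # random orthogonal matrix
>>> x = np.random.randn(128)
>>> np.allclose(Q.T.dot(Q), np.eye(128))  # columns are orthonormal
True
>>> np.allclose(np.linalg.norm(Q.dot(x)), np.linalg.norm(x))  # norm is preserved
True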

Dense Layers

Before we discuss how the orthogonal weight matrix should be chosen in the convolutional layers, we first revisit the concept of dense layers in neural networks. Each dense layer contains a fixed number of neurons. To keep things simple, we assume that the two layers $l$ and $l+1$ each have $m$ neurons. Let $W$ be an $m \times m$ matrix where the $i$th row contains the weights incoming to neuron $i$ in layer $l+1$, $\mathbf{x}$ be an $m \times 1$ vector containing the $m$ outputs from the neurons in layer $l$, and $\mathbf{b}$ be the vector of biases. Then we can describe the pre-activation of layer $l+1$ as $\mathbf{z} = W\mathbf{x} + \mathbf{b}$,

where the activation function $f(\mathbf{z})$ is applied to produce the final output of layer $l+1$. In the following figure we show how the outputs from the previous layer, $x_1, \ldots, x_j, \ldots, x_m$, are multiplied by the $i$th neuron’s weights, $w_1, \ldots, w_j, \ldots, w_m$, the bias is added, and the activation function applied.

In this formulation, the idea behind orthogonal initialization is to choose the weight vectors associated with the neurons in layer $l+1$ to be orthogonal to each other. In other words, we want the rows of $W$ to be orthogonal to each other.
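
For a square weight matrix this is straightforward; a minimal sketch (assuming, for illustration, $m = 256$ neurons in both layers) again uses a QR decomposition and checks that the rows of $W$ are orthonormal:

>>> import numpy as np
>>> m = 256
>>> W, _ = np.linalg.qr(np.random.randn(m, m))
>>> np.allclose(W.dot(W.T), np.eye(m))  # rows are orthonormal
True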

Convolutional Layers

In a convolutional layer, each neuron is sparsely connected to several small groups of neurons in the previous layer. Even though each convolutional kernel is technically a $k \times k$ matrix, in practice the convolution can be thought of as an inner product between two $k^2$-dimensional vectors, i.e., $\mathbf{x} \ast \mathbf{w} = \sum\limits_{i=1}^{k^2}x_i \cdot w_i$.

Below we show how a different convolutional kernel is applied to each channel in layer $l$, the results summed, the bias added, and the activation function applied.

Now it becomes clear that we should choose each channel in layer $l+1$ to have a “weight vector” that is orthogonal to the weight vectors of the other channels. When we say “weight vector” in the convolutional layer, we really mean the kernel-as-vector for each of the channels stacked next to each other. This is similar to how it was done in the dense layer, but now the weight vector is no longer connected to every neuron in layer $l$.

Implementation

We can check this intuition by examining how the orthogonal weight initialization is implemented in a popular neural network library such as Lasagne. Suppose we want a matrix with $m$ rows, with $m$ the number of channels in layer $l+1$, where the rows are all orthogonal to one another. Each row is a vector of dimension $nk^2$, where $n$ is the number of channels in layer $l$, and $k$ is the dimension of the convolutional kernel. For practical purposes, let us choose $m = 64$, $n = 32$, and $k = 3$, i.e., layer $l + 1$ has $64$ channels, each learning a $3 \times 3$ kernel for each of the $32$ channels in layer $l$.

Similar to how it’s implemented in Lasagne, we can use the Singular Value Decomposition (SVD). Given a random matrix $X$, we compute the reduced SVD, $X = \hat{U}\hat{\Sigma}\hat{V}^T$, and then use the rows of the matrix $\hat{V}^T$ as our orthogonal weight vectors (in this case the weight vectors are also orthonormal). So, in Python:

>>> import numpy as np
>>> X = np.random.random((64, 32 * 3 * 3))
>>> U, _, Vt = np.linalg.svd(X, full_matrices=False)
>>> Vt.shape
(64, 288)
>>> np.allclose(np.dot(Vt, Vt.T), np.eye(Vt.shape[0]))
True
>>> W = Vt.reshape((64, 32, 3, 3))

Finally, we can use the matrix $W$ to initialize the weights of the convolutional layer. Keep in mind, however, that in practice the elements of $W$ are often scaled to compensate for the change in norm brought about by the activation function.
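
For example, a common choice with ReLU activations is to multiply the orthonormal weights by a gain of $\sqrt{2}$; Lasagne's Orthogonal initializer exposes this as a gain parameter. Continuing from the matrix $W$ computed above:

>>> W_scaled = np.sqrt(2) * W  # gain commonly used for ReLU activations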

Conclusion

Orthogonal initialization has been shown to provide numerous benefits for training deep neural networks. It is easy to see which vectors should be orthogonal to one another in a dense layer, but less straightforward to see where this orthogonality should come into play in a convolutional layer, because the weight matrix is no longer really a matrix. By considering that neurons in a convolutional layer serve exactly the same purpose as neurons in a dense layer, only with sparse connectivity, the analogy becomes clear.

Further Reading

  1. Great discussion on Google+ about orthogonal initialization.
  2. Discussion on /r/machinelearning about initialization.
  3. Very interesting paper where the idea of using unitary matrices, $W^*W = I$ with $*$ indicating the conjugate transpose, in recurrent neural networks is investigated to avoid the problem of vanishing and exploding gradients.

Kaggle's Grasp and Lift EEG Detection Competition

I recently participated in Kaggle’s Grasp-and-Lift EEG Detection, as part of team Tokoloshe (Hendrik Weideman and Julienne LaChance). None of the team members had ever used deep learning for EEG data, and so we were eager to see how well techniques that are generally applied to problems in computer vision and natural language processing would generalize to this new domain. Overall, it was a fun challenge to work on and gave us a renewed appreciation for the wide range of problem domains that could potentially benefit from the incredible progress recently made in deep learning research.

Those wishing to skip ahead might be interested in the following key sections:

  1. The Data
  2. System Architecture
  3. Results
  4. Code

Background

Patients who have suffered from amputation or other neurological disabilities often have trouble performing tasks that are an essential part of everyday life. Research in devices like brain-computer interfaces aims to provide these people with prosthetic limbs that may be controlled by means of an interface to the brain. Ideally, this would enable these people to regain abilities that are often taken for granted, thereby providing them with greater mobility or independence.

Problem Statement

The goal of the challenge is to predict when a hand is performing each of six different actions given electroencephalography (EEG) signals. The EEG signals are obtained from sensors placed on a subject’s head, and the subject is then instructed to perform each of the six actions in sequence.

The Data

We are provided with EEG signals for 12 different subjects, each consisting of 10 series of trials. Each series consists of a variable number of trials, but typically around 30. One trial is defined as the progressive sequence of actions from the first to the sixth action. The six actions that we wish to predict are

  1. HandStart
  2. FirstDigitTouch
  3. BothStartLoadPhase
  4. LiftOff
  5. Replace
  6. BothReleased

In particular, the EEG signal for each trial consists of a real value for each of the 32 channels at every time step in the signal. The subject’s EEG responses are sampled at 500 Hz, so consecutive time steps are 2 ms apart. For each time step, we are provided with six labels, describing which of the six actions are active at that time step. Note that an action is labeled as active if it occurred within 150 ms of the current time step (future or past). The implication of this is that multiple actions may be labeled as active simultaneously. Below we show the eight training series from sensor 13 for subject 1. The colors indicate the different actions; gray indicates that no action is active. Note the amount of variation between the signals, even for a single sensor and a single subject.


Because we are working with time data, it is critical to note that we are not allowed to use data from the future when making predictions. It is simpler to think of this in terms of the practical usage of such a system - when predicting the action that a user is performing, the system will not have access to EEG signal responses that have not occurred yet. In practice, this means that we are free to train on any of the training data that we want. However, when we make a prediction for a particular time step in the test set, we may only use EEG responses from time steps that occurred at or before that time step. This is important to keep in mind, because we are provided with the EEG responses for all time steps in the test set. Care should be taken that these are in no way used when making submissions to the competition (such as when centering or scaling the signals).

More detailed information about the data may be found on Kaggle or in the original paper, Multi-channel EEG recordings during 3,936 grasp and lift trials with varying weight and friction.

Challenges

As with all machine learning problems, there are some challenges with this data set. Primarily, EEG signals are notoriously noisy, and we are given 32 channels, several of which likely do not correlate well with which of the actions are active. Additionally, the signals vary considerably from person to person and even series to series.

Our Team’s Solution

Given that our primary goal in participating in this challenge was to explore a new problem domain using deep learning, we wanted to build a system that is neither subject- nor action-specific. Thus, our system is trained to predict actions given EEG signals, with no regard for which subject or action it is working with. We also experimented with subject-specific models, but achieved worse performance. We suspect that the additional data from other subjects helps to regularize the very large capacity of our deep learning models, thereby improving generalization across all subjects and actions.

Data Preparation

To build our training set, we chunk the EEG signals into fixed-length sequences that we refer to as time windows. We label each time window with six binary values indicating which of the six actions are active at the last time step in the window. To normalize the data, we subtract the mean computed over all series from all subjects, and divide by the standard deviation computed similarly. For validation, we keep two series from each subject separate and train on the remaining six. We do this so that we can monitor that our model is not fitting certain subjects while neglecting others.
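
A rough numpy sketch of this preparation (the window length matches the 2000 time steps described below; the variable names and exact bookkeeping are illustrative rather than taken from our code):

import numpy as np

WINDOW = 2000  # time steps per window (4 seconds at 500 Hz)

def make_windows(signal, labels):
    # signal: (time steps, 32 channels), labels: (time steps, 6 actions)
    X, y = [], []
    for t in range(WINDOW, signal.shape[0] + 1):
        X.append(signal[t - WINDOW:t].T)  # window as (channels, time steps)
        y.append(labels[t - 1])           # labels at the window's last time step
    return np.array(X), np.array(y)

def normalize(signal, mean, std):
    # mean and std are computed over all series from all subjects
    return (signal - mean) / std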

System Architecture

Below we describe the implementation of our solution.

Deep Convolutional Neural Network

Our deep convolutional neural network has eight one-dimensional convolutional layers, four max-pooling layers, and three dense layers. The final dense layer has six output neurons, each with a sigmoid activation function that predicts the probability that a given action is active. We use the rectified linear unit (ReLU) as the activation function in all layers except for the output layer. Dropout with p = 0.5 is applied in the first two dense layers.
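
A minimal Lasagne sketch of such an architecture is shown below; the filter counts, filter sizes, and dense layer widths are hypothetical stand-ins, since only the layer types and counts are described here:

from lasagne.layers import (InputLayer, Conv1DLayer, MaxPool1DLayer,
                            DenseLayer, DropoutLayer)
from lasagne.nonlinearities import rectify, sigmoid

# input: (batch, 32 EEG channels, 200 subsampled time steps)
net = InputLayer(shape=(None, 32, 200))
for _ in range(4):
    # two convolutional layers followed by a max-pooling layer, repeated four times
    net = Conv1DLayer(net, num_filters=64, filter_size=3, nonlinearity=rectify)
    net = Conv1DLayer(net, num_filters=64, filter_size=3, nonlinearity=rectify)
    net = MaxPool1DLayer(net, pool_size=2)
net = DenseLayer(DropoutLayer(net, p=0.5), num_units=512, nonlinearity=rectify)
net = DenseLayer(DropoutLayer(net, p=0.5), num_units=512, nonlinearity=rectify)
net = DenseLayer(net, num_units=6, nonlinearity=sigmoid)  # one probability per action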

Input

We use time windows of 2000 time steps, that is, each time window observes four seconds of EEG signal. As described below, we subsample this time window so that only 200 time steps are actually used when feeding the example to the deep convolutional neural network.

Loss

Because the challenge’s evaluation metric, the mean column-wise area under the ROC curve, is not differentiable, we instead minimize the binary cross-entropy loss, taking the mean across the loss for each of the six actions. Even though these two are not exactly equivalent, minimizing this loss function should generally lead to better performance on the challenge’s evaluation metric.
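
In Theano/Lasagne terms this amounts to averaging the element-wise binary cross-entropy over the six outputs and the batch; a sketch, with predictions and targets standing in for the symbolic network outputs and labels:

import theano.tensor as T
from lasagne.objectives import binary_crossentropy

predictions = T.matrix('predictions')  # sigmoid outputs, shape (batch, 6)
targets = T.matrix('targets')          # binary action labels, shape (batch, 6)
loss = binary_crossentropy(predictions, targets).mean()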

Implementation

Our solution uses Lasagne and Theano for the implementation of the convolutional neural network. We also use scikit-learn, pandas, and matplotlib for various utilities. We developed our solution using Ubuntu 15.04. The code is available on GitHub.

Hardware

We trained our deep convolutional neural network on a computer with an NVIDIA Quadro K4200 and 16GB of RAM.

Performance Tricks

While designing our model, there were a few simple tricks that we came up with that improved our model’s performance on the held-out validation data. These are briefly described below.

Subsampling Layer

When feeding each time window into the deep convolutional neural network, we first subsample every Nth point. Not only does this greatly reduce the computational burden, but it also helps to reduce overfitting. Our initial idea was to use this as a form of data augmentation, where we would sample every Nth point starting from different time steps in the time window, but this did not seem to have any effect.
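
In code, the subsampling is just strided slicing; with N = 10, a 2000-step window reduces to the 200 steps mentioned earlier (dummy data shown here):

import numpy as np

window = np.random.randn(32, 2000)  # (channels, time steps)
subsampled = window[:, ::10]        # keep every 10th point -> shape (32, 200)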

Window Normalization Layer

We found that normalizing the values in each time window to be between 0 and 1 greatly improved generalization. This was done by simply using the minimum and maximum values in each time window to compute a transformation to the desired range. We suspect that this is because the relative difference between points in the time window is more important than the actual amplitudes, and so by normalizing for this difference we encourage the model to fit the actual signal, rather than the noise.
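
A sketch of this normalization (applied here over the whole window; whether it is done per channel is an implementation detail not covered above):

def normalize_window(window):
    # rescale the values in a time window to the range [0, 1]
    lo, hi = window.min(), window.max()
    return (window - lo) / (hi - lo)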

Results

Our final solution earned us a spot in the Top 10% on the challenge’s private leaderboard.

Other Ideas

Throughout this challenge, we had numerous other ideas that we could never quite get to work. One in particular was the concept of data augmentation, which is the generation of more training examples by transforming existing training examples in such a way that the relationship to the corresponding label is preserved. There were two key ideas that we tried, namely

  1. subsample each time window from a random starting position, so that the network rarely sees exactly the same time window twice, and
  2. increase the number of positive training examples (training examples where an action is active) by duplicating existing training examples, where each duplicated example gets a small amount of Gaussian noise added to it.

Unfortunately, neither of these ideas enabled our model to generalize better to unseen test data.
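
For reference, the second idea amounts to something like the following sketch, where the noise scale is a hypothetical value:

import numpy as np

def augment_with_noise(positive_windows, sigma=0.01):
    # duplicate positive examples with a small amount of additive Gaussian noise
    noise = np.random.normal(0.0, sigma, size=positive_windows.shape)
    return positive_windows + noise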

Final Words

We are very excited to see the scores achieved on the challenge’s leaderboard, as this certainly indicates that deep learning has the potential to contribute to further progress in the field of brain-computer interfaces and the analysis of EEG signals. We would also like to congratulate the top three teams:

  1. Cat & Dog
  2. daheimao
  3. HEDJ

Anyone interested in their solutions may follow the links above to their own write-ups. Finally, we would also like to give special thanks to Alexandre Barachant from team Cat & Dog for publicly sharing so much of his knowledge related to brain-computer interfaces and EEG signals.

A Minimalist Image Viewer

Often when I want to use Ubuntu’s Eye of GNOME image viewer to view image files, I find that it becomes very slow and unresponsive when opening an image in a directory containing many other images. Recently I decided to look for an alternative. I wanted something minimal that would open images quickly and without fuss, and this is when I discovered feh.

feh

feh is an incredibly lightweight image viewer that can be used from the terminal. Simply pass the filename of an image or a directory as an argument, or even a text file containing a list of filenames, and it will open the images one at a time for viewing. Installing is as simple as

sudo apt-get install feh

Montages

Despite its minimalism, feh has some nice features. Creating a montage is as simple as running

feh -m images/ -b trans -W 200

where an optional background file (or trans, for transparent) can be specified with the -b flag, and the width of the montage can be limited with the -W flag. As shown below, feh nicely tiles the images in a grid, even when they are of different shapes and sizes.

Labeling Images

More related to the topic of machine learning, it is also possible to sort images into categories using feh. This is especially useful for quickly creating a dataset of class-labeled images when developing a learning model. feh allows you to specify an action, which is executed when the corresponding number key is pressed. So using this simple bash script, it is possible to iterate over all images in a directory and copy them to the appropriate directory by pressing the number keys.

#!/usr/bin/env bash
# %f expands to the image path, %n to the image filename
feh \
  --cycle-once \
  --action1 \
  "cp '%f' ~/data/cats/%n" \
  --action2 \
  "cp '%f' ~/data/dogs/%n" \
  --action3 \
  "cp '%f' ~/data/birds/%n" \
  --action4 \
  "cp '%f' ~/data/bears/%n" \
  "$1"  # this is the directory from which to read images

This script may be run as ./feh_labeling.sh ~/data/all-images/, and by pressing keys 1 through 4, the currently shown image may be copied to the corresponding directory. More sophisticated tasks are also possible using feh’s actions; for more information, see the Ubuntu manual entry.

Conclusion

Thanks to its quick response time and flexible features, feh has now replaced Eye of GNOME as my image viewer of choice. Despite its apparent minimalism, it provides some very useful features for boosting productivity when working with image files.

Efficient Image Loading for Deep Learning

When doing any kind of machine learning with visual data, it is almost always necessary first to transform the images from raw files on disk to data structures that can be efficiently iterated over during learning. Python’s numpy arrays are perfect for this.

The Lasagne neural network library, of which I’ve grown very fond, expects the data to be in four-dimensional arrays, where the axes are, in order, batch, channel, height, and width.

The Naïve Way

Previously, I have always simply loaded the images into a list of numpy arrays, stacked them, and reshaped the resulting array accordingly. Something like this:

import cv2
import numpy as np

image_list = []
for fpath in fpaths:  # fpaths is a list of image file paths
  img = cv2.imread(fpath, cv2.CV_LOAD_IMAGE_COLOR)
  image_list.append(img)

# stack into one big array, then rearrange to (batch, channel, height, width)
data = np.vstack(image_list)
data = data.reshape(-1, 256, 256, 3)
data = data.transpose(0, 3, 1, 2)

However, this comes with a higher-than-necessary memory usage, for the simple reason that numpy’s vstack needs to make a copy of the data. For a small dataset, this might be fine, but for something like the images from Kaggle’s Diabetic Retinopathy Detection challenge, we need to be much more conservative with memory. In fact, even after I resized all 35126 images to 256x256, they still used 587MB of hard drive space. While this might not seem like much, consider that jpeg images are compressed. When we actually load these color images into memory we will need to allocate 3 bytes per pixel - one for each color channel. Thus, we need 35126x3x256x256 = 6.43 GB to store them in numpy arrays. Suddenly I realized that my workstation would not be able to apply a vstack to the image data without using swap memory.

The Better Way

Fortunately, we can avoid this by pre-allocating the data array, and then loading the images directly into it:

N = len(fpaths)  # number of images
data = np.empty((N, 3, 256, 256), dtype=np.uint8)
for i, fpath in enumerate(fpaths):
  img = cv2.imread(fpath, cv2.CV_LOAD_IMAGE_COLOR)
  data[i, ...] = img.transpose(2, 0, 1)  # (height, width, channel) -> (channel, height, width)

Evaluation

We can demonstrate the improved memory usage by inspecting the memory usage of the two functions. Below we load 10000 RGB images, each of dimensions 256x256, into memory. This should require approximately 10000x256x256x3 = 1.83 GB of memory using our improved technique. Note how the memory usage for the first snippet roughly doubles when the copy is made.
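
One way to produce such a comparison is the memory_profiler package (a sketch; this is not necessarily the tooling used for the measurement described above):

# run with: python -m memory_profiler load_images.py
from memory_profiler import profile

@profile
def load_naive(fpaths):
    ...  # the np.vstack-based loader from the first snippet

@profile
def load_preallocated(fpaths):
    ...  # the pre-allocated loader from the second snippet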

Conclusion

It is not every day that such a simple trick halves memory usage. This solution, in its simplicity, helps to free some valuable resources for learning, which is already quite computationally intensive by itself.