CortexNet

CortexNet: a robust predictive deep neural network trained on videos

Most recent feed-forward deep neural networks for artificial vision are trained with supervision on data and labels from large collections of static images. These networks are missing the time variable present in video streams, and are not used to see the smooth transformation of scenes in videos. As a result, when applied to video streams, standard feed-forward networks provide poor output stability. This issue is direct consequence of their feed-forward architecture and the training framework. This project addresses both architectural and training shortcomings of standard feed-forward deep neural networks by proposing a novel network model and two training schemes. Inspired by the human visual system, CortexNet provides robust visual temporal representations by adding top-down feedback and lateral connections to bottom-up feed-forward connections, all of which are present in our visual cortex.

In the figure above we see (a) the full CortexNet architecture, which is made of several (b) discriminative and (c) generative blocks. The logits are a linear transformation of the embedding, which is obtained by (d) spatial averaging the output of the last discriminative block.

CortexNet can be trained in two ways to provide MatchNet and TempoNet. Details below.

TempoNet live example

CortexNet, in the form of TempoNet, can provide a much more stable output representation, as can be see in the animation below.

In the two charts above we can see how the full CortexNet architecture (middle) compares to a classical convolutional net (top) in terms of temporal stability. We can notice how CortexNet, trained as TempoNet, is able to predict the correct target class, even when its discriminative branch would not. TempoNet automatically learns how to track and attend to an object over time (bottom), and therefore provides a more stable temporal prediction.

MatchNet live example

MatchNet implements future predictions in CortexNet, and is trained to reproduce the next frames in a video stream. Here below is an example of MatchNet predictive ability in the input plane.

The μ-matching loss tells us how far the model output h[ t ] is from perfectly matching the next input frame x_v[t + 1] for video v. We can keep an eye on the ρ-replica loss to see if the model is simply replicating its input frame x_v[ t ]. Finally, you can check and compare these losses with the temporal signal, i.e. the difference between next and current frame.

Here MatchNet was trained to reproduce future input frames in a video. A more interesting approach is to be able to forecast the representation of higher layers. We ask your help and ideas in this exciting and active area of research. Please send us your ideas and contribute to our GitHub project.

MatchNet and TempoNet training schemes

Let's peek into the proposed training schemes.

In order to train our model we use a combination of several loss functions. While we feed our video stream, we use a μ-matching and ρ-replica mean squared error losses, τ-temporal and π-periodic cross entropy losses to tune the model parameters.

In MatchNet mode the network will try to generate the next frame in a video clip in a completely unsupervised learning framework. In TempoNet mode we ask the net to learn to see by tracking moving objects with some weak supervision.

The training procedure is a slightly altered back-propagation through time (BPTT) with selective state resetting. Feel free to muck around with the code by forking this project on GitHub and read more technicalities on the arXiv-ed paper. A red and blue linking buttons are provided at the top of the website. For more information about CortexNet, also see this Medium post.