CortexNet: a robust predictive deep neural network trained on videos

Most recent feed-forward deep neural networks for artificial vision are trained with supervision on data and labels from large collections of static images. These networks miss the time variable present in video streams and never experience the smooth transformation of scenes that videos provide. As a result, when applied to video streams, standard feed-forward networks provide poor output stability. This issue is a direct consequence of their feed-forward architecture and training framework. This project addresses both the architectural and training shortcomings of standard feed-forward deep neural networks by proposing a novel network model and two training schemes. Inspired by the human visual system, CortexNet provides robust visual temporal representations by adding top-down feedback and lateral connections to the bottom-up feed-forward connections, all of which are present in our visual cortex.

In the figure above we see (a) the full CortexNet architecture, which is made of several (b) discriminative and (c) generative blocks. The logits are a linear transformation of the embedding, which is obtained by (d) spatially averaging the output of the last discriminative block.
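To make the block structure above concrete, here is a minimal, hypothetical PyTorch sketch of a two-level CortexNet-style model. It is not the official implementation: the block sizes, the use of channel-wise concatenation for feedback and lateral connections, and the tanh output are illustrative assumptions only.

```python
# Minimal, hypothetical sketch of a two-level CortexNet-style model in PyTorch.
# Assumptions (not taken from the official code): feedback and lateral signals
# are concatenated along the channel dimension, discriminative blocks halve the
# resolution with strided convolutions, generative blocks double it with
# transposed convolutions, and frames are normalised to [-1, 1].
import torch
import torch.nn as nn


class DiscriminativeBlock(nn.Module):
    """Bottom-up block: (input ++ top-down feedback) -> strided conv -> BN -> ReLU."""

    def __init__(self, in_ch, fb_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + fb_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x, feedback):
        return torch.relu(self.bn(self.conv(torch.cat((x, feedback), dim=1))))


class CortexNetSketch(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # (b) discriminative blocks: bottom-up, receive last step's feedback
        self.d1 = DiscriminativeBlock(in_ch=3, fb_ch=3, out_ch=32)
        self.d2 = DiscriminativeBlock(in_ch=32, fb_ch=32, out_ch=64)
        # (c) generative blocks: top-down, g1 also receives a lateral from d1
        self.g2 = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)
        self.g1 = nn.ConvTranspose2d(32 + 32, 3, kernel_size=4, stride=2, padding=1)
        # (d) logits are a linear map of the spatially averaged top activation
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, frame, state=None):
        b, _, h, w = frame.shape
        if state is None:  # zero feedback on the first frame of a clip
            state = (torch.zeros_like(frame),
                     frame.new_zeros(b, 32, h // 2, w // 2))
        fb1, fb2 = state                      # generative outputs from time t-1
        h1 = self.d1(frame, fb1)              # B x 32 x H/2 x W/2
        h2 = self.d2(h1, fb2)                 # B x 64 x H/4 x W/4
        g2 = torch.relu(self.g2(h2))          # B x 32 x H/2 x W/2
        x_hat = torch.tanh(self.g1(torch.cat((g2, h1), dim=1)))  # next-frame guess
        embedding = h2.mean(dim=(2, 3))       # spatial average -> B x 64
        logits = self.classifier(embedding)
        return logits, x_hat, (x_hat, g2)     # feedback for the next time step
```

At every time step this sketch returns the logits, a next-frame prediction, and a state tuple that carries the generative activations forward as top-down feedback for the following frame.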

CortexNet can be trained in two ways, yielding either MatchNet or TempoNet. Details below.

TempoNet live example

CortexNet, in the form of TempoNet, can provide a much more stable output representation, as can be seen in the animation below.

In the two charts above we can see how the full CortexNet architecture (middle) compares to a classical convolutional net (top) in terms of temporal stability. Notice how CortexNet, trained as TempoNet, is able to predict the correct target class even when its discriminative branch alone would not. TempoNet automatically learns how to track and attend to an object over time (bottom), and therefore provides a more stable temporal prediction; a sketch of this stateful, per-frame inference is shown below.
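To illustrate the difference, here is how per-frame class predictions could be collected from a stateful model such as the hypothetical `CortexNetSketch` above, carrying the feedback state across frames; a standard feed-forward network would instead classify every frame independently. The helper function and its arguments are illustrative assumptions.

```python
# Hypothetical helper: run a stateful CortexNet-style model (e.g. the
# CortexNetSketch above) over consecutive frames of one clip, reusing the
# feedback state so that every prediction is conditioned on the recent past.
import torch


@torch.no_grad()
def per_frame_predictions(model, frames):
    """frames: iterable of 1 x 3 x H x W tensors from a single video."""
    model.eval()
    state, predictions = None, []
    for frame in frames:
        logits, _, state = model(frame, state)      # state feeds back into t+1
        predictions.append(logits.argmax(dim=1).item())
    return predictions
```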

MatchNet live example

MatchNet implements future predictions in CortexNet, and is trained to reproduce the next frames of a video stream. Below is an example of MatchNet's predictive ability in the input plane.

The μ-matching loss tells us how far the model output h[t] is from perfectly matching the next input frame x_v[t+1] for video v. We can keep an eye on the ρ-replica loss to see whether the model is simply replicating its current input frame x_v[t]. Finally, we can check and compare these losses with the temporal signal, i.e. the difference between the next and current frames.
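As a rough illustration, these three quantities can be computed as pairwise distances between the prediction, the current frame, and the next frame. Mean-squared error is assumed here; the exact loss formulation used in the project may differ.

```python
# Hypothetical monitoring utilities for MatchNet-style training, assuming a
# mean-squared-error distance (the project's exact loss formulation may differ).
import torch
import torch.nn.functional as F


def matchnet_diagnostics(h_t, x_t, x_next):
    """h_t: model output for frame t; x_t: current frame x_v[t]; x_next: x_v[t+1]."""
    matching = F.mse_loss(h_t, x_next)   # μ-matching: prediction vs. next frame
    replica = F.mse_loss(h_t, x_t)       # ρ-replica: is the model copying its input?
    temporal = F.mse_loss(x_next, x_t)   # temporal signal: next vs. current frame
    return matching, replica, temporal
```

Roughly speaking, if the μ-matching loss falls below the temporal signal, the model is doing better than simply copying its input, and a ρ-replica loss that stays high confirms it is not just echoing x_v[t].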

Here MatchNet was trained to reproduce future input frames of a video. A more interesting approach would be to forecast the representations of higher layers. We ask for your help and ideas in this exciting and active area of research: please reach out and contribute to our GitHub project.