Online appendix for the whitepaper prepared for the DARPA Mind's Eye Kickoff

Video Action Recognition as Lossy Data Compression

Purdue-Sarnoff-South Carolina-Toronto Team


We have recorded four video datasets. The first dataset, Enter-Exit, depicts ten people entering and exiting five doorways. One of the doorways was at the right of the field of view while the remaining doorways were at the left of the field of view.

The second dataset, Plays, depicts five people performing two different sequences (A and B) of six different event classes in the hallway outside of one office: entering the field of view from the doorway on the left, picking up the suitcase and walking rightward, putting down the suitcase and continuing to walk rightward, sitting down in a chair, standing up, and walking leftward, exiting the field of view through the doorway on the left. Sequence A consists of enter, pick up, put down, sit down, stand up, and exit, while sequence B consists of stand up, exit, enter, pick up, put down, and sit down.

The third dataset, Desktop, depicts three people performing two different sequences (A and B) of eleven different event classes from a bird's-eye view above a single desktop: placing a piece of paper on the desktop, writing on the paper, removing the paper from the desktop, typing on the keyboard, moving the mouse closer to the person, moving the mouse away from the person, placing a book on the desktop, opening and closing the book, removing the book from the desktop, pressing buttons on a cell phone, and opening a wallet to remove and return a credit card. Sequence A consists of get paper, write, remove paper, type, move mouse closer, move mouse away, get book, read, remove book, use cellphone, and check wallet, while sequence B consists of get book, read, remove book, use cellphone, check wallet, get paper, write, remove paper, type, move mouse closer, and move mouse away.

The fourth dataset, Office, depicts five people performing two different sequences (A and B) of sixteen different event classes from a sagittal view at three different desks: standing up, sitting down, picking a book up off the bookshelf and placing it on the desk, opening and closing the book, putting the book back on the bookshelf, answering the phone, hanging up the phone, inserting a CD into the computer, removing the CD from the computer, moving a Coke bottle closer to the person, picking up the Coke bottle, putting down the Coke bottle, returning the Coke bottle to its original location, moving a stapler closer to the paper, stapling some paper, and returning the stapler to its original location. Sequence A consists of stand up, sit down, pick up book, read book, put down book, pick up phone, put down phone, insert CD, remove CD, move drink closer, pick up drink, put down drink, move drink away, move stapler closer, staple papers, and move stapler away, while sequence B consists of move drink closer, pick up drink, put down drink, move drink away, move stapler closer, staple papers, move stapler away, stand up, sit down, pick up book, read book, put down book, pick up phone, put down phone, insert CD, and remove CD.

We collectively refer to doorway and desk as location. These datasets exhibit variation along subsets of the variables: sequence, event class, person, and location.

All four datasets were recorded at 320×240 as MJPEG AVI movies with the camera mounted on a tripod. Enter-Exit was recorded with a Casio Exilim EX-Z3 while the remaining datasets were recorded with a Canon S1 IS. Enter-Exit was recorded at 12.471 fps, Plays was recorded at 30 fps, and Desktop and Office were recorded at 15 fps. Enter-Exit contains 1160 movies total: seventy movies of one person outside one doorway for each class, thirty movies of each of the other nine people outside that doorway for each class, and six movies of each of the ten people at the other four locations for each class. Plays contains 150 movies total: fifteen movies for each combination of person and sequence. Desktop contains 90 movies total: fifteen movies for each combination of person and sequence. Office contains 150 movies total: five movies for each combination of person, location, and sequence.
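As a quick sanity check, the totals above can be recomputed from the stated breakdowns. The following minimal Python sketch does so; the factor of two for Enter-Exit assumes its two event classes are enter and exit.

    # Movie counts recomputed from the breakdowns stated above.
    enter_exit = 70 * 2 + 30 * 9 * 2 + 6 * 10 * 4 * 2  # 140 + 540 + 480 = 1160
    plays = 15 * 5 * 2      # 15 movies x 5 people x 2 sequences = 150
    desktop = 15 * 3 * 2    # 15 movies x 3 people x 2 sequences = 90
    office = 5 * 5 * 3 * 2  # 5 movies x 5 people x 3 desks x 2 sequences = 150
    print(enter_exit, plays, desktop, office)  # 1160 150 90 150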

The following constitutes a small sample of our datasets, containing one movie for each combination of location and sequence for each dataset.

Enter-Exit


Plays


Desktop


Office


Web Figures

Web Figure 1: Camera

This video was compressed by the camera to a total of 16,252,932 bytes, or approximately 2,880,874 bps, with no loss of semantic information.
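For reference, a bit rate is simply the file size in bits divided by the clip duration; the duration is not stated here, but the reported size and rate imply roughly 45 seconds. A minimal Python sketch of the conversion (the duration below is inferred from those two reported numbers, not given in the text):

    # Bits per second from a file size in bytes and a clip duration in seconds.
    def bit_rate_bps(size_bytes, duration_seconds):
        return 8 * size_bytes / duration_seconds

    duration = 8 * 16252932 / 2880874                # roughly 45.1 s (inferred)
    print(round(bit_rate_bps(16252932, duration)))   # ~2,880,874 bps (Web Figure 1)
    print(round(bit_rate_bps(1306952, duration)))    # ~231,661 bps (cf. Web Figure 2)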

Web Figure 2: Greyscale H.264

Extracting the frames from this video, converting them to 8-bit greyscale, and re-encoding them as H.264 with ffmpeg's default quality settings results in a file of 1,306,952 bytes, or approximately 231,661 bps, with no loss of semantic information.
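One way to perform this re-encoding is to invoke ffmpeg directly; the following Python sketch is illustrative only (the file names are placeholders, and the exact settings used to produce the figure are not stated beyond ffmpeg's defaults):

    # Re-encode the camera's MJPEG video as greyscale H.264 with ffmpeg's defaults.
    import subprocess

    subprocess.run([
        "ffmpeg", "-i", "camera.avi",  # original MJPEG AVI from the camera (placeholder name)
        "-vf", "format=gray",          # convert frames to 8-bit greyscale
        "-c:v", "libx264",             # H.264 at ffmpeg's default quality
        "grey_h264.mp4",
    ], check=True)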

Web Figure 3: 160×120, 5 fps, H.264

One can reduce the spatial resolution of this greyscale video to 160×120 and the temporal resolution to 5 fps and re-encode as H.264, allowing ffmpeg to compress as tightly as it can. This yields a file of 19,800 bytes, or approximately 3,510 bps, that humans can barely interpret, indicating an apparent limit to this approach.
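A corresponding ffmpeg invocation might look like the sketch below; CRF 51 is the lowest quality libx264 accepts, and the file names and exact settings are assumptions rather than the ones used to produce the figure:

    # Downsample to 160x120 at 5 fps and compress as aggressively as libx264 allows.
    import subprocess

    subprocess.run([
        "ffmpeg", "-i", "grey_h264.mp4",
        "-vf", "scale=160:120,fps=5",     # reduce spatial and temporal resolution
        "-c:v", "libx264", "-crf", "51",  # maximum constant rate factor = tightest compression
        "tiny_h264.mp4",
    ], check=True)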

Web Figure 4: Thresholded Berkeley edge maps

We then apply the Berkeley edge detector (PB; Maire et al. 2008) to the 8-bit greyscale images extracted from the original video captured by the camera, yielding 8-bit graded edge maps, and then threshold these graded edge maps at pixel value 1 to yield binary edge maps. If one encodes these binary edge maps as video in a lossless fashion, it is easy to see that there is no loss of semantic information.
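The thresholding step amounts to marking every pixel whose graded edge value is at least 1 as an edge. A minimal Python sketch, assuming the PB detector (which is external to this appendix) has already produced one 8-bit graded edge map per frame:

    import numpy as np

    def binarize_edge_map(graded_edge_map, threshold=1):
        """Threshold an 8-bit graded edge map at the given pixel value (here 1),
        producing a binary edge map with values 0 and 255."""
        graded = np.asarray(graded_edge_map, dtype=np.uint8)
        return np.where(graded >= threshold, 255, 0).astype(np.uint8)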

Web Figure 5: Traces with intensity proportional to saliency

If we render the traces c at intensity proportional to S(c), quantized as 8-bit greyscale images, and encode these images as video in a lossless fashion, it is easy to see that we have largely eliminated background edges while preserving foreground edges, so that there is no loss of semantic information.
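A minimal Python sketch of this rendering step follows; it assumes each trace is available as a list of (row, column) pixels together with its saliency S(c), which is a representational assumption rather than the data structure used in our implementation:

    import numpy as np

    def render_traces(traces, height, width):
        """Draw each trace at a grey level proportional to its saliency,
        quantized to 8 bits; traces is a list of (pixels, saliency) pairs."""
        image = np.zeros((height, width), dtype=np.uint8)
        if not traces:
            return image
        max_saliency = max(saliency for _, saliency in traces) or 1.0
        for pixels, saliency in traces:
            level = int(round(255 * saliency / max_saliency))
            for row, col in pixels:
                image[row, col] = max(image[row, col], level)
        return image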

Web Figure 6: Five most-salient contours

If we render the traces corresponding to the solid edges of these contours as binary images (each containing multiple contours) and encode these images as video in a lossless fashion, we observe that we have largely removed the edge fragments that correspond to texture interior to the object boundaries, yet the video still contains sufficient information to support action recognition.
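A minimal Python sketch of selecting and rasterizing the five most-salient contours in a frame; here a contour is assumed to be a collection of traces with a single contour-level saliency score, and the solid/dashed edge distinction is not modelled:

    import numpy as np

    def render_top_contours(contours, height, width, k=5):
        """contours: list of (traces, saliency) pairs, where each trace is a
        list of (row, column) pixels. Draw the k most-salient contours."""
        image = np.zeros((height, width), dtype=np.uint8)
        for traces, _ in sorted(contours, key=lambda c: c[1], reverse=True)[:k]:
            for pixels in traces:
                for row, col in pixels:
                    image[row, col] = 255
        return image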

Web Figure 7: Most salient contours

Blindly extracting a fixed number of contours, as we do above, often extracts more contours than necessary, yielding some that do not correspond to the agent. We thus discard contours in which the total motion saliency of every trace falls below a threshold, computed as a specified fraction (currently 0.25) of the maximum motion saliency for that frame. Rendering this smaller set of contours as a video still preserves the desired semantic information.
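A minimal Python sketch of this filter; per the description above, a contour is discarded only when every one of its traces falls below the threshold, i.e. it is kept if at least one trace reaches the specified fraction of the frame's maximum motion saliency. The data layout is an illustrative assumption:

    def filter_contours(contours, trace_motion_saliency, frame_max_saliency, fraction=0.25):
        """contours: list of contours, each a list of traces; trace_motion_saliency(trace)
        returns the total motion saliency of a trace in the current frame. Keep a
        contour if at least one of its traces meets the threshold."""
        threshold = fraction * frame_max_saliency
        return [contour for contour in contours
                if any(trace_motion_saliency(trace) >= threshold for trace in contour)]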

Web Figure 8: Closed splines

We then fit a closed piecewise-cubic B-spline to each contour in each frame and quantize the knot coordinates to integer pixel locations within the image boundaries. Rendering these splines as binary images encoded as video in a lossless fashion illustrates that the signal still contains sufficient information to support action recognition.
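A minimal Python sketch of this step using scipy's periodic spline fitting as a stand-in for our fitter; interpreting the quantized "knot coordinates" as the spline's control-point coordinates is an assumption, as are the smoothing parameter and sample count:

    import numpy as np
    from scipy.interpolate import splprep, splev

    def fit_closed_spline(contour_pixels, height, width, smoothing=2.0):
        """Fit a closed (periodic) cubic B-spline to a contour, given as an
        (N, 2) array of (row, column) pixels, and quantize its control-point
        coordinates to integer pixel locations inside the image."""
        rows, cols = np.asarray(contour_pixels, dtype=float).T
        tck, _ = splprep([rows, cols], s=smoothing, k=3, per=1)  # per=1 closes the curve
        knots, coefficients, degree = tck
        coefficients = [np.clip(np.rint(c), 0, limit - 1)
                        for c, limit in zip(coefficients, (height, width))]
        return knots, coefficients, degree

    def render_spline(tck, height, width, samples=500):
        """Rasterize the spline as a binary image by sampling it densely."""
        image = np.zeros((height, width), dtype=np.uint8)
        rows, cols = splev(np.linspace(0.0, 1.0, samples), tck)
        for row, col in zip(np.rint(rows).astype(int), np.rint(cols).astype(int)):
            if 0 <= row < height and 0 <= col < width:
                image[row, col] = 255
        return image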