Online appendix for Whitepaper prepared for DARPA Mind's Eye Kickoff
Video Action Recognition as Lossy Data Compression
Purdue-Sarnoff-South Carolina-Toronto Team
We have recorded four video datasets. The first dataset, Enter-Exit, depicts ten people entering and exiting five doorways. One of the doorways was at the right of the field of view while the remaining doorways were at the left of the field of view.
The second dataset, Plays, depicts five people performing two different sequences (A and B) of six different event classes in the hallway outside one office: entering the field of view from the doorway on the left, picking up a suitcase and walking rightward, putting down the suitcase and continuing to walk rightward, sitting down in a chair, standing up, and walking leftward to exit the field of view through the doorway on the left. Sequence A consists of enter, pick up, put down, sit down, stand up, and exit while sequence B consists of stand up, exit, enter, pick up, put down, and sit down.
The third dataset, Desktop, depicts three people performing two different sequences (A and B) of eleven different event classes from a bird's eye view above a single desktop: placing a piece of paper on the desktop, writing on the paper, removing the paper from the desktop, typing on the keyboard, moving the mouse closer to the person, moving the mouse away from the person, placing a book on the desktop, opening and closing the book, removing the book from the desktop, pressing buttons on a cell phone, and opening a wallet to remove and return a credit card. Sequence A consists of get paper, write, remove paper, type, move mouse closer, move mouse away, get book, read, remove book, use cellphone, and check wallet while sequence B consists of get book, read, remove book, use cellphone, check wallet, get paper, write, remove paper, type, move mouse closer, and move mouse away.
The fourth dataset, Office, depicts five people performing two different sequences (A and B) of sixteen different event classes from a sagittal view at three different desks: standing up, sitting down, picking a book up off the bookshelf and placing it on the desk, opening and closing the book, putting the book back on the bookshelf, answering the phone, hanging up the phone, inserting a CD into the computer, removing the CD from the computer, moving a Coke bottle closer to the person, picking up the Coke bottle, putting down the Coke bottle, returning the Coke bottle to its original location, moving a stapler closer to the paper, stapling some paper, and returning the stapler to its original location. Sequence A consists of stand up, sit down, pick up book, read book, put down book, pick up phone, put down phone, insert CD, remove CD, move drink closer, pick up drink, put down drink, move drink away, move stapler closer, staple papers, and move stapler away while sequence B consists of move drink closer, pick up drink, put down drink, move drink away, move stapler closer, staple papers, move stapler away, stand up, sit down, pick up book, read book, put down book, pick up phone, put down phone, insert CD, and remove CD.
We collectively refer to doorway and desk as location. These datasets exhibit variation along subsets of the variables: sequence, event class, person, and location.
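In each of the three datasets with sequences (Plays, Desktop, and Office), sequence B as listed above contains the same event classes as sequence A in a cyclically shifted order. The following Python sketch encodes the sequences using the short event names from the text and checks this property. The helper name rotation_offset and the list names are illustrative assumptions, not part of the original datasets or any accompanying software.

```python
def rotation_offset(a, b):
    """Return k such that b == a[k:] + a[:k], or None if b is not a rotation of a."""
    if len(a) != len(b):
        return None
    for k in range(len(a)):
        if a[k:] + a[:k] == b:
            return k
    return None


# Event orderings copied from the dataset descriptions.
PLAYS_A = ["enter", "pick up", "put down", "sit down", "stand up", "exit"]
PLAYS_B = ["stand up", "exit", "enter", "pick up", "put down", "sit down"]

DESKTOP_A = [
    "get paper", "write", "remove paper", "type",
    "move mouse closer", "move mouse away",
    "get book", "read", "remove book", "use cellphone", "check wallet",
]
DESKTOP_B = [
    "get book", "read", "remove book", "use cellphone", "check wallet",
    "get paper", "write", "remove paper", "type",
    "move mouse closer", "move mouse away",
]

OFFICE_A = [
    "stand up", "sit down", "pick up book", "read book", "put down book",
    "pick up phone", "put down phone", "insert CD", "remove CD",
    "move drink closer", "pick up drink", "put down drink", "move drink away",
    "move stapler closer", "staple papers", "move stapler away",
]
OFFICE_B = [
    "move drink closer", "pick up drink", "put down drink", "move drink away",
    "move stapler closer", "staple papers", "move stapler away",
    "stand up", "sit down", "pick up book", "read book", "put down book",
    "pick up phone", "put down phone", "insert CD", "remove CD",
]

# Offset at which sequence B starts within sequence A, per dataset.
OFFSETS = {
    "Plays": rotation_offset(PLAYS_A, PLAYS_B),
    "Desktop": rotation_offset(DESKTOP_A, DESKTOP_B),
    "Office": rotation_offset(OFFICE_A, OFFICE_B),
}

if __name__ == "__main__":
    for name, k in OFFSETS.items():
        print(f"{name}: sequence B is sequence A rotated by {k} events")
```

Under this encoding, sequence B begins at the fifth event of sequence A in Plays, the seventh in Desktop, and the tenth in Office, so the two sequences differ in event order but not in event content.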
All four datasets were recorded at 320x240 as MJPEG AVI movies with the camera mounted on a tripod. Enter-Exit was recorded with a Casio Exilim EX-Z3 while the remaining datasets were recorded with a Canon S1 IS. Enter-Exit was recorded at 12.471 fps, Plays was recorded at 30 fps, and Desktop and Office were recorded at 15 fps. Enter-Exit contains 1160 movies total: seventy movies of one person outside one doorway for each class, thirty movies of each of the other nine people outside that doorway for each class, and six movies of each of the ten people at the other four locations for each class. Plays contains 150 movies total: fifteen movies for each combination of person and sequence. Desktop contains 90 movies total: fifteen movies for each combination of person and sequence. Office contains 150 movies total: five movies for each combination of person, location, and sequence.
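The movie totals stated above follow from the per-combination counts. The following Python sketch recomputes them as a consistency check. It assumes that Enter-Exit has exactly two event classes (enter and exit), which the description implies but does not state as a number; the constant and variable names are illustrative.

```python
# Assumption: Enter-Exit has two event classes, enter and exit.
ENTER_EXIT_CLASSES = 2

# Enter-Exit: movies per (person, doorway, class), summed over three groups.
enter_exit = (
    70 * 1 * 1 * ENTER_EXIT_CLASSES     # one person at the main doorway
    + 30 * 9 * 1 * ENTER_EXIT_CLASSES   # the other nine people at that doorway
    + 6 * 10 * 4 * ENTER_EXIT_CLASSES   # all ten people at the other four doorways
)

# Plays: fifteen movies per (person, sequence); five people, two sequences.
plays = 15 * 5 * 2

# Desktop: fifteen movies per (person, sequence); three people, two sequences.
desktop = 15 * 3 * 2

# Office: five movies per (person, location, sequence);
# five people, three desks, two sequences.
office = 5 * 5 * 3 * 2

TOTALS = {
    "Enter-Exit": enter_exit,
    "Plays": plays,
    "Desktop": desktop,
    "Office": office,
}

if __name__ == "__main__":
    for name, count in TOTALS.items():
        print(f"{name}: {count} movies")
```

The computed values agree with the stated totals of 1160, 150, 90, and 150 movies.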
The following is a small sample of our datasets, containing one movie for each combination of location and sequence in each dataset.