The team has developed a system that searches videos for specific types of objects and activities without any special labeling or annotations. “Think of it like Googling for video,” Siskind says.
Searching 10 Hollywood Westerns, his team retrieved a variety of horse-related images and video segments.
“Our system has an object detector for people and for horses, and it has semantic definitions of concepts like riding a horse, leading a horse, approaching a horse,” Siskind says. “It has concepts like toward and away from, leftward and rightward, and quickly and slowly.”
The algorithms first isolate objects of interest by drawing bounding boxes, then analyze the content. “I can search for, ‘the person rode the horse,’ and I get hits — pictures and video clips from our 10 Hollywood movies — of people riding horses. I can search for, ‘the person rode the horse quickly,’ and you get hits, or ‘slowly,’ or ‘the person led the horse,’ or, ‘the person approached the horse.’ And there is no annotation in this video. This is just straight stock video, Hollywood movies.”
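The pipeline described here, detect objects first, then test semantic relations between their bounding boxes, can be sketched in a few lines. This is an illustrative reconstruction, not Siskind's actual code: the predicates, box format, and speed threshold are all invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Track:
    label: str   # e.g. "person" or "horse"
    boxes: list  # per-frame (x, y, w, h) bounding boxes from a detector

def center(box):
    x, y, w, h = box
    return (x + w / 2, y + h / 2)

def overlaps(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def riding(person: Track, horse: Track) -> bool:
    """'Ride': the person's box overlaps the horse's box and stays above
    its center in every frame (image y grows downward)."""
    pairs = zip(person.boxes, horse.boxes)
    return all(overlaps(p, h) and center(p)[1] < center(h)[1] for p, h in pairs)

def quickly(track: Track, min_speed=15.0) -> bool:
    """'Quickly': mean horizontal displacement per frame exceeds a threshold."""
    xs = [center(b)[0] for b in track.boxes]
    speeds = [abs(b - a) for a, b in zip(xs, xs[1:])]
    return sum(speeds) / len(speeds) > min_speed

# "The person rode the horse quickly" = detector output + both predicates.
```

A query like "the person rode the horse quickly" then reduces to running the detector, tracking the boxes, and keeping only the clips where both predicates hold, with no captions or annotations involved.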
Siskind compares the new capability to searching for a video on the Internet. “Current systems allow you to search for a video of something you want to see online. But the search engine is not actually searching for the video. It’s searching for the captions under it, the words used to describe it,” he says. “What we’re doing with our research is actually recognizing what’s going on in the video.”
The work was conducted with $3.6 million in funding from DARPA under the Mind’s Eye program.
His team is also creating a system to make sense of the dynamic scene of a parking lot from surveillance camera data. The work is funded through the Deep Intermodal Video Analytics, or DIVA, program of the Intelligence Advanced Research Projects Activity, or IARPA, part of the Office of the Director of National Intelligence.
“People are getting into and out of cars, opening and closing the doors and trunks, and loading and unloading things from the cars,” Siskind says. “We want to have a system that can recognize when various different activities happen and detect people carrying objects and taking objects in and out of cars and so forth.”
One major challenge is being able to discern relatively small objects, such as people and cars, within the large field of view.
“The resolution is pretty high, but the field of view covers an entire parking lot,” he says. “There are many cars, and each car is very small in the field of view, and people are even smaller. They may be, in fact, only five or 10 pixels wide, and it’s just very difficult to reliably detect and track people and particularly the small objects they carry.”
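One common way to cope with objects only a few pixels wide in a very wide frame is to split the frame into overlapping tiles before running a detector, so each person fills a larger fraction of the detector's input. The article does not say how Siskind's team handles this; the sketch below, with invented tile and overlap sizes, just illustrates the tiling idea.

```python
def tile_frame(width, height, tile=512, overlap=64):
    """Split a wide surveillance frame into overlapping square crops.
    Tile size and overlap are illustrative, not values from the project."""
    step = tile - overlap
    xs = list(range(0, max(width - tile, 0) + 1, step))
    ys = list(range(0, max(height - tile, 0) + 1, step))
    # Make sure the right and bottom edges are always covered.
    if xs[-1] + tile < width:
        xs.append(width - tile)
    if ys[-1] + tile < height:
        ys.append(height - tile)
    return [(x, y, tile, tile) for y in ys for x in xs]
```

Each tile is then fed to the detector separately, and the resulting boxes are shifted back into full-frame coordinates; the overlap keeps objects that straddle a tile boundary from being missed.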
The DIVA work has received $2.35 million in funding under a subcontract from IBM.
The ‘Force’ of Algorithms
Siskind’s team is also giving machines the ability to understand and use language, allowing his robot, named Darth Vader, to create its own path of movement based on previous experiences.
Algorithms enable the robot to learn the meanings of words from example sentences, to use those words to generate a sentence describing a path it has driven, and to comprehend such a sentence well enough to produce a new path of movement. “It’s our hope that this technology can be applied to a host of applications in the future, potentially including autonomous vehicles,” Siskind says.
The robot was outfitted with several cameras and ran numerous trials on an enclosed course containing several objects, such as a chair, a traffic cone and a table. Using the algorithms, the robot was able to recognize words associated with objects within the course and words associated with directions of travel based on its sensory data.
“It was able to generate its own sentences to describe the paths it had taken. It was also able to generate its own sentences to describe a separate path of travel on the same course,” says Siskind, whose autonomous vehicles research is supported with $650,000 in funding from a National Science Foundation National Robotics Initiative award. “The robot aggregated its sensory data over numerous experiences.”
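Generating a sentence from a path amounts to grounding words in sensory data: checking, segment by segment, which word's definition the motion satisfies. The sketch below is a loose illustration of that idea, not the team's algorithm; the landmark names and the toward/away test are invented for the example.

```python
import math

def describe_path(waypoints, landmarks):
    """Turn a sequence of (x, y) positions into phrases, one per segment:
    'toward the <object>' if the segment closes distance to a landmark,
    otherwise 'away'. A deliberately simplified grounding of two words."""
    words = []
    for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:]):
        for name, (lx, ly) in landmarks.items():
            before = math.hypot(lx - x0, ly - y0)
            after = math.hypot(lx - x1, ly - y1)
            if after < before:
                words.append(f"toward the {name}")
                break
        else:
            words.append("away")
    return words
```

Running the same definitions in reverse, searching for a path whose segments satisfy the words of a given sentence, is what lets the robot comprehend a sentence and drive a new path from it.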
By learning the meaning of various words, the robot took a step beyond conventional autonomous vehicles that work by using a vehicle’s electrical system to control driving based on a computerized map of existing roads. Conventional autonomous vehicles are equipped with cameras and other sensors to detect potential hazards, such as stoplights, pedestrians and the edges of the road. However, they cannot recognize everyday landmarks off the road based on the sensory data. Nor can they associate words with the objects.
An important goal of this research is to create voice-controlled robots that can recognize a variety of objects and words and phrases even if they’ve only heard them once. This required development of new machine-learning algorithms.
“We have developed a number of different systems,” Siskind says. “Some learn verbs, some learn nouns, some learn prepositions, adjectives. And different kinds of words have different properties.”
All of these word properties are grounded in sensory data in one form or another. The sensory data might describe the navigational path taken by the robot, derived by combining information from an odometer with readings from a device called an inertial measurement unit, or IMU.
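Combining odometer and IMU readings into a path is classic dead reckoning: each step's traveled distance is projected along the current heading. The sensor format below is assumed for illustration, not taken from the article.

```python
import math

def dead_reckon(odometry, headings, start=(0.0, 0.0)):
    """Reconstruct a path from per-step sensor readings.
    odometry: distance traveled in each step (e.g. from wheel encoders);
    headings: heading in radians for each step (e.g. from the IMU).
    Returns the (x, y) position after each step, starting at `start`."""
    x, y = start
    path = [start]
    for dist, theta in zip(odometry, headings):
        x += dist * math.cos(theta)
        y += dist * math.sin(theta)
        path.append((x, y))
    return path
```

For example, driving one unit east and then one unit north corresponds to `dead_reckon([1.0, 1.0], [0.0, math.pi / 2])`, which ends near (1, 1). In practice both sensors drift, which is one reason the two are fused rather than used alone.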
The research sidesteps a limitation inherent in more conventional object-detection methods. “The conventional way this is usually done is let’s say you want to detect a chair. You get lots of training images for chair. Hundreds or thousands of images, and these chairs are each surrounded by a ‘bounding box,’ or annotated,” Siskind says. “This is a very labor-intensive task involving perhaps millions of images spanning several classes of objects. So, there are hundreds if not thousands of samples of objects of each class. We’d like to be able to construct models without having to annotate large quantities of images.”
These models would make it possible for a machine to autonomously carry out a task from a verbal or written description.