Multimedia Edges: Finding Hierarchy in all Dimensions

Malcolm Slaney
IBM Almaden Research Center
650 Harry Road
San Jose, CA 95120
Dulce Ponceleon
IBM Almaden Research Center
650 Harry Road
San Jose, CA 95120
James Kaufman
IBM Almaden Research Center
650 Harry Road
San Jose, CA 95120


This paper describes a new unified representation for the information in a video. We reduce the dimensionality of the signal with either a singular-value decomposition (on the semantic and image data) or mel-frequency cepstral coefficients (on the audio data) and then concatenate the vectors to form a multi-dimensional representation of the video. Using scale-space techniques we find large jumps in the video's path, which we call edges. We use these techniques to analyze the temporal properties of the audio and image data in a video. This analysis creates a hierarchical segmentation of the video, or a table-of-contents, from the audio, semantic and image data.


Automatic segmentation, temporal properties, multimedia, video, audio, images, hierarchy, scale space, latent semantic indexing, singular-value decomposition, color space, semantic space


Here are two movies that illustrate our approach:

Semantic Position Movie

The first movie shows a two-dimensional semantic representation of the sentences in Chapter 4 of a book on tomography. Using the algorithms presented in this paper, we found the two most important semantic dimensions. The sentences of the text were then projected onto these dimensions and the resulting location is shown as a point in a two-dimensional plane.

Semantic Position Movie: AVI format (250 kBytes) or QuickTime format (1055 kBytes)

Scale-Space Segmentation Movie

Our approach represents each position in the text as a multi-dimensional semantic vector (10-D in this paper). We then low-pass filter the vector, using longer and longer windows, to remove the high-frequency detail in the semantic position. This movie shows the resulting 10-dimensional semantic position: the vertical axis corresponds to the 10 dimensions of the semantic space, and the horizontal axis shows the sentence number. As the movie plays, it shows longer and longer scales (more smoothing). We have placed a green line in the movie at the location of the largest segmentation boundary (as found at the largest scale, and then traced back to the finest scale at the start of the movie).
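As a minimal sketch of this repeated smoothing, assuming a discrete Gaussian window and NumPy (the window lengths, boundary handling, and the synthetic random-walk path below are illustrative assumptions, not the paper's exact parameters):

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """Discrete Gaussian kernel, normalized to sum to 1."""
    if radius is None:
        radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def scale_space(signal, sigmas):
    """Smooth each dimension of an (N, D) signal at every scale.

    Returns an array of shape (len(sigmas), N, D): one progressively
    smoother copy of the semantic path per scale.
    """
    signal = np.asarray(signal, dtype=float)
    stack = []
    for sigma in sigmas:
        k = gaussian_kernel(sigma)
        smoothed = np.column_stack([
            np.convolve(signal[:, d], k, mode='same')
            for d in range(signal.shape[1])
        ])
        stack.append(smoothed)
    return np.array(stack)

# Example: a 200-sentence path in a 10-D semantic space.
rng = np.random.default_rng(0)
path = rng.standard_normal((200, 10)).cumsum(axis=0)
stack = scale_space(path, sigmas=[1, 2, 4, 8, 16])
print(stack.shape)  # (5, 200, 10)
```

Each slice of the stack is what one frame of the movie displays; as sigma grows, only the coarse movements of the semantic path survive.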

Scale-Space Segmentation Movie: AVI format (1267 kBytes) or QuickTime format (3808 kBytes)


Browsing videotapes of image and sound (hereafter referred to as 'videos') is difficult. Often there is an hour or more of material, and there is no roadmap to help viewers find their way through the medium. It would be tremendously helpful to have an automated way to create a hierarchical table of contents that lists major topic changes at the highest level, with subsegments down to individual shots. DVDs provide chapter indices; we would like to find the positions of the sub-chapter boundaries. Realizing such an automated analysis requires algorithms that can detect changes in the visual or semantic content of a video as a function of time. We propose a technology that performs this indexing task by combining the three major sources of data--images, words and sounds--from the video into one unified representation.

With regard to analysis of image data, shot detection is nearly a solved problem. A shot is a temporally contiguous set of frames taken at the same time from the same camera. Current algorithms are not perfect, but they offer many successful ways to process the information in a video signal to determine shot changes (either cuts or other transitions) [19]. We used these techniques, and added information from the audio signal, to find changes in the content or tone that indicated higher-level structures within a video. These techniques allow us to create a table of contents for a video using all the information in the signal.

With regard to the words in the sound track of a video, the information-retrieval world has used, with great success, statistical techniques to model the meaning, or semantic content, of a document. These techniques, such as latent semantic indexing (LSI), allow us to cluster related documents, or to pose a question and find the document that most closely resembles the query. We can apply the same techniques within a document or, in the present case, the transcript of a video. These techniques allow us to describe the semantic path of a video's transcript as a signal, from the initial sentence to the conclusions. Thinking about this signal in a scale space allows us to find the semantic discontinuities in the audio signal and to create a semantic table of contents for a video.

Our technique is analogous to one that detects edges in an image. Instead of trying to find similar regions of the video, called segments, we think of the audio-visual content as a signal and look for "large" changes in this signal or peaks in its derivative. The locations of these changes are edges; they represent the entries in a table of contents.

1.1 Temporal Properties of Video

The techniques we describe in this paper allow us to characterize the temporal properties of both the audio and image data in the video. The color information in the image signal and the semantic information in the audio signal provide different information about the content.

Color provides robust evidence for a shot change in a video signal. An easy way to convert the color data into a signal that indicates scene changes is to compute each frame's color histogram and note the frame-by-frame differences [19]. In general, however, we do not expect the colors of the images to tell us anything about the global structure of the video: the color balance does not typically change systematically over the length of the film, so over the long term the video's overall color says little about its overall structure.
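A minimal sketch of such a histogram-difference signal, assuming NumPy, coarse 8-bin-per-channel histograms, and an L1 distance (these particular choices are illustrative, not necessarily those of [19]):

```python
import numpy as np

def histogram_difference(frames, bins=8):
    """Per-frame color histograms and their frame-to-frame L1 distances.

    frames: array of shape (T, H, W, 3) with values in [0, 256).
    Returns a length T-1 signal; peaks suggest shot changes.
    """
    hists = []
    for frame in frames:
        # One coarse histogram per channel, concatenated and normalized.
        h = np.concatenate([
            np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
            for c in range(3)
        ]).astype(float)
        hists.append(h / h.sum())
    hists = np.array(hists)
    return np.abs(np.diff(hists, axis=0)).sum(axis=1)

# Example: ten dark frames, then a hard cut to ten bright frames.
dark = np.full((10, 24, 32, 3), 30, dtype=np.uint8)
bright = np.full((10, 24, 32, 3), 220, dtype=np.uint8)
d = histogram_difference(np.concatenate([dark, bright]))
print(int(np.argmax(d)))  # 9: the cut between frame 9 and frame 10
```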

Random words from a transcript, on the other hand, do not reveal much about the low-level features of the video. Given just a few words from the audio signal, it is difficult to define the current topic. But the words indicate a lot about the overall structure of the story. A documentary script may, for instance, progress through topic 1, then topic 2, and finally topic 3.

The audio--what is left after removing the words--tells us other things about a video. The music sets a tone for the video and environmental sounds fill in the details.

We describe any time point in the video by its position in an acoustic-color-semantic vector space. We represent the audio, color and semantic information in the video as three separate vectors as a function of time, and concatenate these three vectors to create a single vector that encodes the acoustic, color and semantic data. Using scale-space techniques we can then talk about the changes that the acoustic-color-semantic vector undergoes as the video unwinds over time. We label large jumps in the combined acoustic-color-semantic vector as segment boundaries. "Large jumps" are defined by a scale-space algorithm that we describe in Section 3.
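The concatenation and jump measure can be sketched as follows (a simplified illustration: the feature dimensions, the synthetic data, and the omission of per-modality rescaling are our assumptions, and "large" is really defined by the scale-space algorithm of Section 3 rather than a bare argmax):

```python
import numpy as np

def combined_path(audio, color, semantic):
    """Concatenate per-frame audio, color and semantic feature vectors.

    Each input has shape (T, d_i); the result is (T, sum of d_i).
    In practice the modalities may need rescaling so that no single
    one dominates the distance measure; we skip that here.
    """
    return np.hstack([np.asarray(x, dtype=float)
                      for x in (audio, color, semantic)])

def jump_magnitude(path):
    """Euclidean distance between consecutive points on the path.

    Large values are candidate segment boundaries ("edges").
    """
    return np.linalg.norm(np.diff(path, axis=0), axis=1)

# Synthetic example: 100 time steps, with an abrupt semantic shift at t=60.
rng = np.random.default_rng(1)
audio = rng.standard_normal((100, 13))     # e.g. MFCC-like features
color = rng.standard_normal((100, 10))
semantic = rng.standard_normal((100, 10))
semantic[60:] += 5.0                       # sudden topic change
jumps = jump_magnitude(combined_path(audio, color, semantic))
print(int(np.argmax(jumps)))  # 59: the step from t=59 to t=60
```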

1.2 Literature Review

Our work extends previous work on text and video analysis and segmentation in several different ways.

LSI has a long history, starting with Deerwester's paper [5], as a powerful means to summarize the semantic content of a document and to measure the similarity of two documents. We use LSI because it captures synonymy and polysemy, but, more importantly, it allows us to quantify the position of a portion of the document in a multi-dimensional semantic space.
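As a toy illustration of how an SVD places sentences in a low-dimensional semantic space (the vocabulary, the raw-count weighting, and k=2 are illustrative simplifications; real LSI systems typically apply tf-idf-style weighting first):

```python
import numpy as np

# Toy term-by-sentence matrix: rows are terms, columns are sentences.
sentences = [
    "the scanner rotates around the patient",
    "xray projections are measured by the scanner",
    "the fourier transform inverts the projections",
    "music and speech fill the sound track",
    "the sound track carries speech and music",
]
vocab = sorted({w for s in sentences for w in s.split()})
A = np.array([[s.split().count(w) for s in sentences] for w in vocab],
             dtype=float)

# Keep the k strongest singular vectors; each sentence becomes a point
# in a k-dimensional semantic space.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
positions = (S[:k, None] * Vt[:k]).T   # shape (num_sentences, k)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The two sound-track sentences land close together, far from the
# tomography sentences.
print(cosine(positions[3], positions[4]) > cosine(positions[0], positions[3]))
```

The sequence of rows of `positions` is exactly the kind of semantic path that the scale-space analysis then smooths and differentiates.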

Hearst [10] proposes to use the dips in a similarity measure of adjacent sentences in a document to identify topic changes. Her method is powerful because the size of the dip is a good indication of the relative amount of change in the document. We extend this idea using scale-space techniques to allow us to talk about similarity or dissimilarity over larger portions of the document.

Miller and her colleagues proposed Topic Islands [15], a visualization and segmentation algorithm based on a wavelet analysis of text documents. Their wavelets are localized in both time (document position) and frequency (spectral content) and allow them to find and visualize topic changes at many different scales. The localized nature of their wavelets makes it difficult to isolate and track segmentation boundaries through all scales. We propose to summarize the text with LSI and analyze the signal with smooth Gaussians, which are localized in time but preserve the long-term correlations of the semantic path.

Choi [4], for text, and Foote [8], for audio, represent a document in terms of its self-similarity matrix. Their task is then to search for and identify the square regions of this matrix that are self-similar. Using scale-space methods, we instead find the edges of these regions and characterize their strength.
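For contrast, a self-similarity matrix in the spirit of Choi and Foote can be sketched as follows (the cosine measure and the synthetic two-segment data are our illustrative assumptions):

```python
import numpy as np

def self_similarity(features):
    """Cosine self-similarity matrix of an (N, D) feature sequence.

    Self-similar segments show up as bright square blocks on the
    diagonal; our approach looks instead for the edges of those blocks.
    """
    X = np.asarray(features, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.maximum(norms, 1e-12)
    return X @ X.T

# Two homogeneous segments: similarity is high inside each block
# and low across the boundary between them.
rng = np.random.default_rng(2)
a = rng.standard_normal(8)
b = rng.standard_normal(8)
feats = np.vstack([a + 0.1 * rng.standard_normal((6, 8)),
                   b + 0.1 * rng.standard_normal((6, 8))])
S = self_similarity(feats)
print(S.shape)  # (12, 12)
```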

Segmentation is a popular topic in the signal and image processing worlds. Witkin [20] introduced scale-space ideas to the segmentation problem and Lyon [14] extended Witkin's approach to multi-dimensional signals. A more theoretical discussion of the scale-space segmentation ideas was published by Leung [12]. The work described here extends the scale-space approach by using LSI as a basic feature and changing the distance metric to fit semantic data.

Current video shot detectors [19] look at local changes in the color histogram and luminance patterns to detect shot boundaries. We use the same color information but extend their techniques by analyzing the changes over many different time scales.

The key concept in this paper is to think about the video's path through space, and to detect the jumps at multiple scales. The signal processing analysis proposed in this paper is just one part of a complete system. We use a singular-value decomposition (SVD) to do the basic analysis, but more sophisticated techniques are also applicable. Any method that allows us to summarize the image and semantic content of the document can be used in conjunction with the techniques described here.

1.3 Overview of paper

This paper proposes a unified representation for the audio-visual information in a video. We use this representation to compare and contrast the temporal properties of the audio and images in a video. We form a hierarchical segmentation with this representation and compare the hierarchical segmentation to other forms of segmentation. By unifying the representations we have a simpler description of the video's content and can more easily compare the temporal information content in the different signals.

As we have explained, we combine two well-known techniques to find the edges or boundaries in a video. We reduce the dimensionality of the data and put them all into the same format. The SVD and its application to acoustic, color and word data are described in Section 2.

Scale-space techniques give us a way to analyze temporal regions of the video that span a time range from a few seconds to tens of minutes. Properties of scale spaces and their application to segmentation are described in Section 3.
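One way the coarse-to-fine tracing mentioned above might look (a hypothetical sketch only: the +/-2 refinement window and the argmax seeding at the coarsest scale are our illustrative choices, not the published algorithm of Section 3):

```python
import numpy as np

def trace_boundary(scale_stack):
    """Seed at the biggest jump on the coarsest scale, refine on finer ones.

    scale_stack: array of shape (num_scales, N, D), ordered fine to
    coarse.  At each finer scale the boundary is re-located to the
    largest jump within a small window, so a coarse boundary is traced
    back to a precise position.
    """
    def jumps(sig):
        return np.linalg.norm(np.diff(sig, axis=0), axis=1)

    pos = int(np.argmax(jumps(scale_stack[-1])))   # coarsest scale
    for sig in scale_stack[-2::-1]:                # step toward finer scales
        j = jumps(sig)
        lo, hi = max(pos - 2, 0), min(pos + 3, len(j))
        pos = lo + int(np.argmax(j[lo:hi]))
    return pos

# Example: a 1-D step at t=50, smoothed at three scales (fine to coarse).
signal = (np.arange(100) >= 50).astype(float)
stack = []
for sigma in (1, 4, 16):
    x = np.arange(-3 * sigma, 3 * sigma + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    k /= k.sum()
    stack.append(np.convolve(signal, k, mode='same')[:, None])
print(trace_boundary(np.array(stack)))  # 49: the jump from t=49 to t=50
```

The windowed refinement is what lets a boundary found at a ten-minute scale be reported at the precision of a single sentence or frame.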

In Section 4, we describe our algorithm, which combines these two approaches.

We discuss several temporal properties of video, and present simple segmentation results, in Section 5. Our representation of video allows us to measure and compare the temporal properties of the color and words. We perform a hierarchical segmentation of the video, automatically creating a table of contents for the video.

We conclude in Section 6 with some observations about this representation.


Because of the mathematics in our paper, please see the PDF for the rest of the paper.

PDF of ACM Multimedia Paper, 268 kBytes

Copyright 2001 ACM.