A Critique of Pure Audition

Malcolm Slaney

Interval Research Corporation
1801 Page Mill Road; Building C; Palo Alto, CA 94304 USA

This chapter will be published in the book Computational Auditory Scene Analysis, Dave Rosenthal and Hiroshi Okuno, editors, Erlbaum, 1998. PostScript and PDF reprints of this chapter are available, but I recommend the whole book; it is good.


All sound-separation systems based on perception assume a bottom-up or Marr-like view of the world. Sound is processed by a cochlear model, passed to an analysis system, grouped into objects, and then passed to higher-level processing systems. The information flow is strictly bottom up, with no information flowing down from higher-level expectations. Is this approach correct? In this chapter, I first summarize existing bottom-up perceptual models. Then, I examine evidence for top-down processing, describing many of the auditory and visual effects that indicate top-down information flow. I hope that this chapter generates discussion about the role of top-down processing, whether this information should be included in sound-separation models, and how we can build testable architectures.


Several of the stimuli described in this chapter are available for your viewing and listening pleasure. Audio examples are AIFF files, while the movies are in QuickTime format.
Figure 3.3
Alternating white and black dots that create an illusion. Subjects see one uniform motion, either up and down or left and right, and never a combination of the two directions. A QuickTime movie (77k) is available. (Source: Adapted with permission from Churchland et al., 1994.)
Figure 3.5
An auditory illusion proposed by Peter Ladefoged. "What Vowel is This" has been translated into a sequence of HTML pages by Malcolm Slaney.
Figure 3.6
A sine-wave speech (40k AIFF) example by Richard Remez and the original (natural) speech (40k AIFF).
Figure 3.6
Miriam Makeba's Click Song (1.5M AIFF) illustrates how clicks are perceived differently in speech and in music, at least by the author's native English ears.
Figure 3.7
Three experiments demonstrating illusory motion. The first movie (66k QuickTime) appears to be three dots moving to the right, with the middle dot occluded by the square. In the second movie (61k QuickTime), the outer dots are removed and there is no longer a sense of motion. Finally, in the third movie (862k QuickTime), tones alternate in the left and right speakers and the illusion of motion (and occlusion) returns. (Source: Adapted with permission from Churchland et al., 1994).
Figure 3.8
The McGurk effect (188k QuickTime). Listen to how the sound changes as you open and close your eyes while this movie is playing. (Source: Movie courtesy of Michael Cohen, University of California, Santa Cruz.)