Mach1 for Nonuniform Time-Scale Modification of Speech: Theory, Technique, and Comparisons

The audio samples provided here were created as described in our technical report, IRC-TR 1997-061, and as summarized in Covell, Withgott, Slaney, "Mach1: Nonuniform Time-Scale Modification of Speech," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Seattle, WA, May 12-15, 1998.

Michele Covell, Margaret Withgott, and Malcolm Slaney
Interval Research Corporation, 1801-C Page Mill Road, Palo Alto, CA 94304

Abstract

We propose a new approach to nonuniform time compression, called Mach1, designed to mimic the natural timing of fast speech. At identical overall compression rates, listener comprehension of Mach1-compressed speech was between 5 and 31 percentage points higher than that of linearly compressed speech, and response times dropped by 15%. For rates between 2.5 and 4.2 times real time, there was no significant comprehension loss with increasing Mach1 compression rates. In A-B preference tests, Mach1-compressed speech was chosen 95% of the time. This technical report describes the Mach1 technique and our listener-test results. Audio examples are given below.

Examples of Mach1 and of linear time compression, at the same compression rates

The audio examples here were compressed by Mach1 and by a linear technique, both to the same overall compression rate. Mach1 compression was completed "open-loop" (that is, without a feedback loop to enforce a specific global compression rate). The compression rate achieved by Mach1 was measured after compression, and the same utterance was then compressed linearly to that same overall rate. A more detailed description of our method is provided in Section 3 of our technical report. The examples given here are sorted by discourse type (short dialog, long dialog, or monolog) and by compression rate. The final speaking rate, in words per minute (wpm), is also given.

Linear compression	Mach1 compression	Compression rate (x faster than real time)	Speaking rate (wpm)
Short dialogs
LC_S18_2	M1_S18_2	3.97	481
LC_S09_3	M1_S09_3	3.95	490
LC_S21_1	M1_S21_1	3.66	521
LC_S04_1	M1_S04_1	3.59	495
LC_S19_2	M1_S19_2	3.48	572
LC_S09_1	M1_S09_1	3.40	450
LC_S22_1	M1_S22_1	3.35	472
LC_S10_1	M1_S10_1	2.96	546
Long dialogs
LC_L21	M1_L21	2.94	591
LC_L29	M1_L29	2.87	545
LC_L09	M1_L09	2.73	566
LC_L37	M1_L37	2.65	572
LC_L05	M1_L05	2.61	551
Monologs
LC_M09	M1_M09	2.86	544
LC_M05	M1_M05	2.80	430
LC_M25	M1_M25	2.77	464
LC_M13	M1_M13	2.56	391
The examples provided in this table are based on audio from the compact discs in the Kaplan TOEFL review materials. See: M. Rymniak, G. Kurlandski, et al., 1997. The Essential Review: TOEFL (Test of English as a Foreign Language), Kaplan Educational Centers and Simon & Schuster, New York. We thank Kaplan Educational Centers and Simon & Schuster for permission to use these excerpts in this manner.
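
The rate-matching arithmetic behind the table can be sketched in a few lines of Python. This is only an illustration, not the code used in the report: the function names are hypothetical, the three rows are copied from the table, and the last helper assumes that the speaking rate is measured over the whole clip (pauses included), so that the final rate equals the original rate multiplied by the compression rate.

```python
# Illustrative sketch only; names are hypothetical, not from the technical report.

# (clip id, compression rate, final speaking rate in wpm), copied from the table.
ROWS = [
    ("S18_2", 3.97, 481),
    ("L21", 2.94, 591),
    ("M13", 2.56, 391),
]


def achieved_rate(original_seconds, compressed_seconds):
    """Overall compression rate, measured after open-loop compression."""
    if original_seconds <= 0 or compressed_seconds <= 0:
        raise ValueError("durations must be positive")
    return original_seconds / compressed_seconds


def linear_target_seconds(original_seconds, rate):
    """Duration a linear compressor must produce to match the measured rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return original_seconds / rate


def implied_original_wpm(final_wpm, rate):
    """Speaking rate before compression (assumption: whole-clip word rate)."""
    return final_wpm / rate


if __name__ == "__main__":
    for clip, rate, wpm in ROWS:
        before = implied_original_wpm(wpm, rate)
        print(f"{clip}: about {before:.0f} wpm before compression")
```

Under that assumption, the implied original speaking rates vary considerably from clip to clip, which is why the compression rate and the final speaking rate are listed separately.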

Why is this problem important?

Voice mail makes it easy and attractive to leave impromptu messages. In contrast, listening to voice mail messages is often painful. While we can time-compress the messages, current techniques typically are viable only up to 2 times real time. More specifically, human comprehension of linearly time-compressed speech typically degrades at compression rates around 2.0 to 2.5 times real time. These degradations are not due to the speech rate per se: comprehension of linearly compressed speech often breaks down above 225 to 270 wpm, which is well below the rates at which long passages of natural speech are comprehensible.

Instead, the incomprehensibility of linearly time-compressed speech is due to its unnatural timing. Our new nonuniform time-compression technique, called Mach1, compresses the components of an utterance so that the result closely resembles the natural timing of fast speech. The resulting compressed speech remains comprehensible at much higher rates: 2.56 to 4.15 times real time, or 390 to 673 wpm.

How does Mach1 compare with previous approaches?

Most previous work in compressed playback of speech concentrated on linear time compression at rates below 2.5 times real time. Previous efforts in nonuniform time compression have not attempted rates above 3 times real time, nor have they described any formal comparisons of comprehensibility between their proposed methods and linear compression.

Mach1 offers statistically significant improvements in comprehensibility over linear time compression: at compression rates between 2.5 and 4.2 times real time, comprehension of Mach1-compressed speech is 17 percentage points better than that of linearly compressed speech. This difference in comprehension increased with increasing compression rate. Short dialogs showed the greatest improvement in comprehension: these improvements averaged 23 percentage points and ranged as high as 31 percentage points for naive listeners. The improvements were smaller for the longer clips: 10 percentage points for monologs and 5 percentage points for long dialogs.

This research is the first to maintain comprehensibility with time-scale modification at such high compression rates. It is also the first report of statistically significant improvements in comprehensibility over linear time compression.

Copies of our technical report are available in HTML, PostScript (583k), and Adobe PDF (117k).