Mark Dolson

Computer Audio Research Laboratory

Center for Music Experiment, Q-037

University of California, San Diego

La Jolla, California 92093

The phase vocoder is a digital signal processing technique of potentially great musical significance. It can be used to perform very high fidelity time scaling, pitch transposition, and myriad other modifications of recorded sounds. In this tutorial, I attempt to explain the operation of the phase vocoder in terms that musicians can understand.

Figure 1. The Filter Bank Interpretation

The filter bank itself has only three constraints. First, the frequency response characteristics of the individual bandpass filters are identical except that each filter has its passband centered at a different frequency. Second, these center frequencies are equally spaced across the entire spectrum from 0 Hz to half the sampling rate. Third, the individual bandpass frequency response is such that the combined frequency response of all the filters in parallel is essentially flat across the entire spectrum. This ensures that no frequency component is given disproportionate weight in the analysis, and that the phase vocoder is in fact an analysis-synthesis identity.

As a consequence of these constraints, the only issues in the design of the filter bank are the number of filters and the individual bandpass frequency response. The number of filters must be sufficiently large so that there is never more than one partial within the passband of any single filter. For harmonic sounds, this amounts to saying that the number of filters must be greater than the sampling rate divided by the pitch. For inharmonic and polyphonic sounds, the number of filters may need to be much greater. If this condition is not satisfied, then the phase vocoder will not function as intended because the partials within a single filter will constructively and destructively interfere with each other, and the information about their individual frequencies will be coded as an unintended temporal variation in a single composite signal.

The design of the representative bandpass filter is dominated by a single consideration: the sharper the filter frequency response cuts off at the band edges (i.e., the less overlap between adjacent bandpass filters), the longer its impulse response will be (i.e., the longer the filter will ``ring''). Thus, to get sharp cut-offs with minimal overlap, one must use filters whose time response is very sluggish.
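The rule of thumb for harmonic sounds given above (more filters than the sampling rate divided by the pitch) can be made concrete with a little arithmetic. The following is only a sketch: the function names, the 44100 Hz sampling rate, and the 220 Hz pitch are assumptions chosen for illustration, and the spacing between filters is taken to be the sampling rate divided by the number of filters, which is the reading consistent with that rule.

```python
import math

def min_filters_for_harmonic_sound(sample_rate_hz, pitch_hz):
    """Smallest whole number of filters strictly greater than sample_rate / pitch."""
    return math.floor(sample_rate_hz / pitch_hz) + 1

def filter_spacing_hz(sample_rate_hz, num_filters):
    """Spacing between adjacent center frequencies (assumed: sample_rate / N)."""
    return sample_rate_hz / num_filters

SAMPLE_RATE_HZ = 44100.0  # assumed for illustration
PITCH_HZ = 220.0          # assumed for illustration

n_filters = min_filters_for_harmonic_sound(SAMPLE_RATE_HZ, PITCH_HZ)
spacing = filter_spacing_hz(SAMPLE_RATE_HZ, n_filters)

print(n_filters)          # number of filters required by the rule of thumb
print(round(spacing, 2))  # spacing in Hz; it is smaller than the pitch
```

Because the partials of a harmonic sound are separated by the pitch, a filter spacing smaller than the pitch means that no two partials are likely to share one passband. Inharmonic or polyphonic material can have partials much closer together, which is why the text warns that many more filters may be needed.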
In the phase vocoder, this tradeoff between sharp frequency cutoff and rapid time response is ever-present, and the best solution is generally discovered experimentally by simply trying different filter settings for the sound in question.

A Closer Look at the Filter Bank

The above paragraphs provide an adequate description of the phase vocoder from the standpoint of the user, but they leave unanswered the question of how it actually works. In this section, I show in detail how the output of a single bandpass filter is expressed as a time-varying amplitude and a time-varying frequency. The actual operation of a single phase vocoder bandpass filter is shown in Figure 2. This diagram may appear complicated, but it can easily be broken down into a series of fairly simple mathematical steps.

Figure 2. An Individual Bandpass Filter

In the first step, the incoming signal is routed into two parallel
paths. In one path, the signal is multiplied by a sine wave with an
amplitude of 1.0 and a frequency equal to the center frequency of the
bandpass filter; in the other path, the signal is multiplied by a
cosine wave of the same amplitude and frequency. Thus, the two parallel
paths are identical except for the phase of the multiplying
waveform. Then, in each path, the result of the multiplication is fed
into a lowpass filter. The multiplication operation itself should be
familiar to musicians as simple ring modulation. Multiplying any signal
by a sine (or cosine) wave of constant frequency has the effect
of simultaneously shifting all the frequency components in the original
signal by both plus and minus the frequency of the sine wave. An
example of this is shown in Figure 3 in which a 100 Hz sine wave
multiplies an input signal of 101 Hz. The result is a sine wave at 1
Hz (i.e., 101 Hz - 100 Hz) and a sine wave at 201 Hz (i.e., 101 Hz +
100 Hz). Furthermore, if this result is now passed through an
appropriate lowpass filter, only the 1 Hz sine wave will remain.
This sequence of operations (i.e., multiplying by a sine wave of
frequency *f* and then lowpass filtering) is useful in a variety
of signal processing applications and is known as *heterodyning*.
Any input frequency components in the vicinity of frequency *f*
are shifted down to the vicinity of 0 Hz and allowed to pass; input
frequency components not in the vicinity of frequency *f* are similarly
shifted but not by enough to get through the lowpass filter. The result
is a type of bandpass filtering in which the passband is
frequency-shifted down to very low frequencies.
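The 101 Hz example can be tried numerically. The sketch below is not the filter used in any particular phase vocoder: the 1000 Hz sampling rate, the function names, and the choice of a plain moving average as the "appropriate lowpass filter" are all assumptions made to keep the example short.

```python
import math

SAMPLE_RATE = 1000.0  # Hz, assumed for this sketch
INPUT_FREQ = 101.0    # Hz, the input signal from the example
CENTER_FREQ = 100.0   # Hz, the multiplying sine wave

def moving_average(signal, length):
    """Crude lowpass filter: average of the most recent `length` samples."""
    out = []
    running = 0.0
    for i, value in enumerate(signal):
        running += value
        if i >= length:
            running -= signal[i - length]
        out.append(running / min(i + 1, length))
    return out

def heterodyne(signal, center_freq, sample_rate, lowpass_length):
    """Multiply by a sine wave at center_freq, then lowpass filter."""
    shifted = [
        s * math.sin(2 * math.pi * center_freq * n / sample_rate)
        for n, s in enumerate(signal)
    ]
    return moving_average(shifted, lowpass_length)

# Two seconds of a 101 Hz sine wave with amplitude 1.0.
x = [math.sin(2 * math.pi * INPUT_FREQ * n / SAMPLE_RATE) for n in range(2000)]

# The product contains 1 Hz and 201 Hz components, each of amplitude 0.5;
# a 100-sample (0.1 s) average passes the 1 Hz part and suppresses 201 Hz.
difference_tone = heterodyne(x, CENTER_FREQ, SAMPLE_RATE, lowpass_length=100)

print(round(max(difference_tone[100:]), 2))  # peak of the remaining 1 Hz wave
```

The surviving wave has an amplitude of roughly one half because the product of two unit-amplitude sinusoids splits its energy equally between the difference and sum frequencies. A moving average leaves a small ripple at 201 Hz; a smoother lowpass filter would reduce it further.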

Figure 3. Multiplying Two Sine Waves

In the phase vocoder, heterodyning is performed in each of the two parallel paths. But since one path heterodynes with a sine wave while the other path uses a cosine wave, the resulting heterodyned signals in the two paths are out of phase by 90 degrees. Thus, in the above example, both paths will produce a 1 Hz sinusoidal wave at the outputs of their respective lowpass filters, but the two sinusoids will be 90 degrees out of phase with respect to each other. To understand what the phase vocoder does next with these signals, we can consider the rotating wheel illustrated in Figure 4.

Figure 4. Rectangular and Polar Coordinates

Suppose that we wish to plot the position of some point on the wheel
as a function of time. We have a choice of using ``rectangular''
coordinates (e.g., horizontal position and vertical position) or
``polar'' coordinates (e.g., radial position and angular position—also
known as magnitude and phase). With rectangular coordinates we find
that both the horizontal position and the vertical position are
varying sinusoidally, but the maximum vertical displacement occurs one
quarter cycle later than the maximum horizontal displacement. With
polar coordinates we simply have a linearly increasing angular
position and a constant radius. Clearly, the latter description is
simpler. The situation within the phase vocoder is very much
analogous. The two heterodyned signals can be viewed as the
horizontal and vertical signals of the rectangular representation,
whereas the desired representation is in terms of a time-varying
amplitude (i.e., radius) and a time-varying frequency (i.e., rate of
angular rotation). Happily, the translation between the two different
representations is easily accomplished. As shown in Figure 4, the
amplitude at each point in time is simply the square root of the sum of
the squares of the two rectangular coordinates. The frequency cannot
be calculated directly, but it can be very well approximated by taking
the difference in successive values of angular position and then
dividing by the time between these successive values. To see this, we
can note that the difference between two successive values of
angular position is some fraction of an entire cycle (i.e., a complete
revolution), and that ``frequency'' is simply the number of cycles
which occur during some unit time interval. As a result, we need only
worry about how to calculate the angular position. Figure 4 also
gives a formula for the angular position, but it produces answers only
in the range of 0 to 360 degrees. Thus, if we examine successive
values of angular position, we may find a sequence such as 180, 225,
270, 315, 0, 45, 90. This suggests that the instantaneous frequency
(i.e., rate of angular rotation) is given by the sequence: (225
- 180)/T = 45/T, (270 - 225)/T = 45/T, (315 - 270)/T = 45/T, (0 -
315)/T = -315/T, (45 - 0)/T = 45/T, (90 - 45)/T = 45/T, where T is
the time between successive values. But the -315/T element is clearly not
quite right. What has actually happened is that we have gone
through more than a single cycle. Therefore, if we want our frequency
calculation to work properly, we should really write the sequence as
180, 225, 270, 315, 360, 405, 450. Now the result of the frequency
calculation is, as it should be, a sequence of (45/T)'s. This process
of adding in 360 degrees whenever a full cycle has been completed is
known as *phase unwrapping* (see Figure 5). It is the final
necessary step in the sequence of operations which makes the phase
vocoder work.
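The unwrapping of the example sequence can be sketched in a few lines. The function names here are hypothetical, and the code assumes that the true angle never changes by more than half a cycle between successive values; it also subtracts 360 degrees for backward wraps, which goes slightly beyond the forward-only case described in the text.

```python
def unwrap_degrees(angles):
    """Add (or subtract) 360 degrees whenever the angle jumps by more than
    half a cycle, so that successive values vary smoothly."""
    unwrapped = [angles[0]]
    offset = 0.0
    for prev, cur in zip(angles, angles[1:]):
        step = cur - prev
        if step < -180.0:    # wrapped forward past 360 degrees
            offset += 360.0
        elif step > 180.0:   # wrapped backward past 0 degrees
            offset -= 360.0
        unwrapped.append(cur + offset)
    return unwrapped

def rate_of_rotation(unwrapped, t):
    """Difference of successive angles divided by the time between them."""
    return [(b - a) / t for a, b in zip(unwrapped, unwrapped[1:])]

wrapped = [180, 225, 270, 315, 0, 45, 90]
unwrapped = unwrap_degrees(wrapped)
rates = rate_of_rotation(unwrapped, t=1.0)

print(unwrapped)  # the sequence continues past 360 instead of falling to 0
print(rates)      # every element is 45 degrees per unit time
```

With T set to 1, each rate comes out as 45 degrees per time step, and the spurious -315/T element no longer appears.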

Figure 5. Phase Unwrapping

Thus, the internal operation of a single phase vocoder bandpass filter consists of (1) heterodyning the input with both a sine wave and a cosine wave in parallel, (2) lowpass filtering each result, (3) converting the two parallel lowpass filtered signals to radius and angular-position signals, (4) unwrapping the angular-position values, and (5) subtracting successive unwrapped angular-position values and dividing by the time to obtain a rate-of-angular-rotation signal. But it should be noted that this rate-of-rotation signal (i.e., the instantaneous frequency) actually refers only to the difference frequency between the heterodyning sinusoid (i.e., the filter center frequency) and the input signal. Therefore the final step is simply to add the filter center frequency back in.
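The steps just listed can be strung together into a small working sketch of one analysis channel. Several details are assumptions made for illustration rather than part of the description above: the 1000 Hz sampling rate, the function names, the Hann-weighted lowpass filter, and the minus sign on the sine path (a convention chosen so that the angle increases when the input lies above the center frequency). Angles are kept in radians rather than degrees.

```python
import math

def hann_lowpass(signal, length):
    """FIR lowpass: Hann-weighted average of the most recent `length` samples.
    Only outputs computed from a full window are returned."""
    weights = [0.5 - 0.5 * math.cos(2 * math.pi * (k + 0.5) / length)
               for k in range(length)]
    total = sum(weights)
    weights = [w / total for w in weights]
    out = []
    for n in range(length - 1, len(signal)):
        out.append(sum(w * signal[n - k] for k, w in enumerate(weights)))
    return out

def analyze_channel(signal, center_freq, sample_rate, lowpass_length):
    """Return (radius, frequency in Hz) signals for one bandpass channel."""
    dt = 1.0 / sample_rate
    # (1) Heterodyne with a cosine and a sine in parallel.
    cos_path = [s * math.cos(2 * math.pi * center_freq * n * dt)
                for n, s in enumerate(signal)]
    sin_path = [-s * math.sin(2 * math.pi * center_freq * n * dt)
                for n, s in enumerate(signal)]
    # (2) Lowpass filter each path.
    horiz = hann_lowpass(cos_path, lowpass_length)
    vert = hann_lowpass(sin_path, lowpass_length)
    # (3) Convert to radius and angular position.
    radius = [math.hypot(h, v) for h, v in zip(horiz, vert)]
    angle = [math.atan2(v, h) for h, v in zip(horiz, vert)]
    # (4) Unwrap the angular position.
    unwrapped = [angle[0]]
    offset = 0.0
    for prev, cur in zip(angle, angle[1:]):
        if cur - prev < -math.pi:
            offset += 2 * math.pi
        elif cur - prev > math.pi:
            offset -= 2 * math.pi
        unwrapped.append(cur + offset)
    # (5) Difference over time gives the offset from the center frequency;
    #     the final step adds the center frequency back in.
    freq = [center_freq + (b - a) / (2 * math.pi * dt)
            for a, b in zip(unwrapped, unwrapped[1:])]
    return radius, freq

SAMPLE_RATE = 1000.0  # Hz, assumed for this sketch
x = [math.sin(2 * math.pi * 101.0 * n / SAMPLE_RATE) for n in range(2000)]
radius, freq = analyze_channel(x, center_freq=100.0,
                               sample_rate=SAMPLE_RATE, lowpass_length=100)

print(round(freq[0], 1))    # estimated frequency, close to the 101 Hz input
print(round(radius[0], 2))  # radius, close to half the input amplitude
```

A Hann-weighted average is used here instead of a plain moving average because any leftover ripple at the sum frequency (201 Hz in this example) disturbs the angle from one sample to the next and therefore shows up as jitter in the frequency estimate. The radius comes out near 0.5 rather than 1.0 because heterodyning divides the input amplitude between the difference and sum components.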

Figure 6. Filter Bank Interpretation vs. Fourier Transform Interpretation

These two differing views of the phase-vocoder analysis suggest two equally divergent interpretations of the resynthesis. In the Filter Bank interpretation (as noted above), the resynthesis can be viewed as a classic example of additive synthesis with time-varying amplitude and frequency controls for each oscillator. In the Fourier view, the synthesis is accomplished by converting back to real-and-imaginary form and overlap-adding the successive inverse Fourier transforms. This is a first indication that the phase vocoder representation may actually be more generally applicable than would be expected of an additive-synthesis technique.

In the Fourier interpretation, the number of filter bands in the phase vocoder is simply the number of points in the Fourier transform. Similarly, the equal spacing in frequency of the individual filters can be recognized as a fundamental feature of the Fourier transform. On the other hand, the shape of the filter passbands (e.g., the steepness of the cutoff at the band edges) is determined by the shape of the window function which is applied prior to calculating the transform. For a particular characteristic shape (e.g., a Hamming window), the steepness of the filter cutoff increases in direct proportion to the duration of the window. Thus, again, we see the fundamental tradeoff between rapid time response and narrow frequency response.

It is important to understand that the two different interpretations of the phase vocoder analysis apply only to the implementation of the bank of bandpass filters. The operation (described in the previous section) by which the outputs of these filters are expressed as time-varying amplitudes and frequencies is the same for each. However, a particular advantage of the Fourier interpretation is that it leads to the implementation of the filter bank via the more efficient Fast Fourier Transform (FFT) technique.
The FFT produces an output value for each of N filters with (on the order of) N log₂ N multiplications, while the direct implementation of the filter bank requires N² multiplications. Thus, the Fourier interpretation can lead to a substantial increase in computational efficiency when the number of filters is large (e.g., N = 1024).

The Fourier interpretation has also been the key to much of the recent progress in phase-vocoder-like techniques. Mathematically, these techniques are described as Short-Time Fourier-Transform techniques (Rabiner & Schafer, 1978; Crochiere, 1980; Portnoff, 1980; Portnoff, 1981a,b; Griffin & Lim, 1984). Such algorithms may also be referred to as Multirate Digital Signal Processing techniques (for reasons which will be made clear below) (Crochiere & Rabiner, 1983).
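To put the multiplication counts quoted above for N = 1024 in perspective, the two figures can simply be evaluated. This is order-of-magnitude arithmetic only; the function names are hypothetical, and constant factors that depend on the particular FFT algorithm are ignored.

```python
import math

def direct_filter_bank_multiplies(n):
    """Order-of-magnitude count for the direct filter bank: N squared."""
    return n * n

def fft_multiplies(n):
    """Order-of-magnitude count for the FFT: N times log base 2 of N."""
    return n * math.log2(n)

N = 1024
direct = direct_filter_bank_multiplies(N)
fft = fft_multiplies(N)
ratio = direct / fft

print(direct)  # 1048576
print(fft)     # 10240.0
print(ratio)   # 102.4
```

For 1024 filters, the FFT route therefore needs roughly one hundredth of the multiplications of the direct implementation, and the advantage grows as N increases.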

Figure 7. Spectral Envelope Correction

Crochiere, R. E. (1980). *A weighted overlap-add method of Fourier analysis-synthesis. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-28(1), 99-102.*

Crochiere, R. E. & Rabiner, L. R. (1983). *Multirate Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall.*

Flanagan, J. L. & Golden, R. M. (1966). *Phase vocoder. Bell System Technical Journal, 45, 1493-1509.*

Grey, J. M., & Moorer, J. A. (1977). *Perceptual evaluations of synthesized musical instrument tones. Journal of the Acoustical Society of America, 62, 454-462.*

Grey, J. M. (1977). *Multidimensional perceptual scaling of musical timbres. Journal of the Acoustical Society of America, 61, 1270-1277.*

Grey, J. M., & Gordon, J. W. (1978). *Perceptual effects of spectral modifications on musical timbres. Journal of the Acoustical Society of America, 63, 1493-1500.*

Griffin, D. W. & Lim, J. S. (1984). *Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-32(2), 236-243.*

Moorer, J. A. (1978). *The use of the phase vocoder in computer music applications. Journal of the Audio Engineering Society, 26(1/2), 42-45.*

Portnoff, M. R. (1980). *Time-frequency representation of digital signals and systems based on short-time Fourier analysis. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-28(1), 55-69.*

Portnoff, M. R. (1981a). *Short-time Fourier analysis of sampled speech. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-29(3), 364-373.*

Portnoff, M. R. (1981b). *Time-scale modification of speech based on short-time Fourier analysis. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-29(3), 374-390.*