Here are some typical examples of what happens when you try to stretch audio to twice the original length:
By the way, if you want to print this page out you may want to switch to wide margins and white text on black background.
0. Install PVOC.EXE somewhere in your path, or put it in the directory where you keep and process your wav files. (Let's say this is c:\windows\desktop\wavfiles)
1. Create a Windows wav file with the sound you want to stretch/compress. Let's say it's called "guitarriff.wav"
2. Open a DOS box. CD to the directory where the wav file above is.
cd c:\windows\desktop\wavfiles
3a. If you want to stretch the file to twice the length, type
pvoc -N2048 -T2.0 -Yguitarriff.wav -0stretched-guitarriff.wav
3b. If you want to compress the file to half the length, type
pvoc -N2048 -T0.5 -Yguitarriff.wav -0compressed-guitarriff.wav
The program is a bit slow and has no progress indicator, so be patient.
The name after the "-Y" is the sound file to process.
The name after the "-0" is where the resulting sound file will be written.
(Note, that's "dash zero", not "dash O!" The two characters look a bit alike
if the font doesn't provide a programmer's "slashed zero")
The number after the "-N" is the number of bins.
The number after the "-T" is the time factor.
Time factor = 2.0 => the file will be twice as slow.
Time factor = 0.5 => the file will be twice as fast!
Time factor = 10.0 => the file will be ten times as slow.
You can experiment with the flags, for instance try
pvoc -N32 -T10 -Yguitarriff.wav -0absurd-guitarriff.wav
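If you want to try many flag combinations, the command lines above can be generated rather than typed. The following is only a sketch in Python: the helper names pvoc_args and stretched_seconds are made up for illustration, and it uses nothing beyond the four flags shown above (-N, -T, -Y and -0), with each value glued directly to its flag as in the examples.

```python
def pvoc_args(infile, outfile, time_factor, window=2048):
    """Build the argument list for one pvoc run.

    Hypothetical helper: it only reproduces the flag layout of the
    examples above, where each value follows its flag without a space.
    """
    return [
        "pvoc",
        "-N%d" % window,        # window size / number of bins
        "-T%s" % time_factor,   # time factor: 2.0 = twice as slow
        "-Y%s" % infile,        # sound file to process
        "-0%s" % outfile,       # output file (dash zero, not dash O)
    ]


def stretched_seconds(original_seconds, time_factor):
    """Expected length of the result: original length times the time factor."""
    return original_seconds * time_factor


print(" ".join(pvoc_args("guitarriff.wav", "stretched-guitarriff.wav", 2.0)))
print(stretched_seconds(3.0, 2.0), "seconds expected from a 3 second file")
```

On a machine where PVOC.EXE is in the path, the returned list could presumably be handed to subprocess.run, but that part is untested here.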
Normally, time stretching/compression is done by chopping up the audio in chunks of a few milliseconds. These chunks of audio waveform are then played back either quicker, by dropping parts of the chopped waveform, or slower, by repeating parts of the chopped waveform. The chopping points are difficult to choose for a stupid, "mindless" algorithm, and this gives rise to ugly artifacts.
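To make the "chopping" approach concrete, here is a deliberately mindless sketch of it in Python. The function name chop_stretch and the fixed chunk size are assumptions for illustration; real chopping stretchers choose their cut points far more carefully.

```python
def chop_stretch(samples, chunk, factor):
    """Naive time stretch: cut the audio into fixed-size chunks, then
    repeat chunks (factor > 1) or drop chunks (factor < 1)."""
    chunks = [samples[i:i + chunk] for i in range(0, len(samples), chunk)]
    out = []
    for j in range(round(len(chunks) * factor)):
        # Pick the source chunk that corresponds to this output position.
        src = min(int(j / factor), len(chunks) - 1)
        out.extend(chunks[src])
    return out


audio = list(range(12))                 # stand-in for 12 audio samples
print(chop_stretch(audio, 3, 2.0))      # every chunk is played twice
print(chop_stretch(audio, 3, 0.5))      # every second chunk is dropped
```

With factor 2.0 the output jumps from sample 2 straight back to sample 0 at the first join. Jumps like that at every chunk boundary are where the ugly artifacts come from.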
The Phase Vocoder has a totally different approach. It converts the audio to a series of time-varying "spectral snapshots". By "spectrum" in this context I mean "frequency spectrum". These "snapshots" of the frequency content of the sound can be played back at any speed internally, somewhat like the frames of a film, and then converted back to wave audio. The "spectral power density" representation is a lot closer to how the brain and auditory complex "see" sound, which in my opinion is one of the reasons that makes it so useful. (This "spectral snapshot" concept is also the basis for "perception-equivalent" compressed audio formats, like mp3.)
A very simplified way of explaining it would be to consider an audio file that has two notes in it: first one note with the frequency 700 Hz, and then one note with 1900 Hz, each for one second. The wave form looks like one second of 700 squiggles per second, then one second of 1900 squiggles per second.
The program internally converts this wave to "one second worth of snapshots with energy at 700 Hz" followed by "one second worth of snapshots with energy at 1900 Hz". So if we play back these two seconds worth of snapshot sequences at half speed, the result will be "two seconds worth of snapshots with energy at 700 Hz" followed by "two seconds worth of snapshots with energy at 1900 Hz". These four seconds worth of snapshots are then converted back to wave audio ("squiggles"), which then will be exactly twice as long, but with no chopping noises or other glitches evident.
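The two-note example can be played through in a few lines of Python. This is a toy sketch, not what pvoc does internally: it keeps only the single strongest frequency of each snapshot, it only handles whole-number time factors, and it uses a slow textbook DFT. The sample rate (8000 Hz), the frame size (80 samples) and all function names are assumptions chosen so that 700 Hz and 1900 Hz land exactly on a bin.

```python
import cmath
import math

RATE = 8000   # samples per second (low, so the pure-Python DFT stays quick)
FRAME = 80    # samples per snapshot: 10 ms frames, bins 100 Hz apart


def tone(freq, seconds):
    """A plain sine wave: 'freq' squiggles per second."""
    return [math.sin(2 * math.pi * freq * i / RATE)
            for i in range(int(RATE * seconds))]


def snapshot(frame):
    """Frequency in Hz of the strongest DFT bin in one frame."""
    n = len(frame)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):
        acc = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * RATE / n


def analyse(wave):
    """Turn the wave into a sequence of spectral snapshots."""
    return [snapshot(wave[i:i + FRAME])
            for i in range(0, len(wave) - FRAME + 1, FRAME)]


def stretch(snapshots, factor):
    """Play the snapshots back 'factor' times slower (whole numbers only)."""
    return [s for s in snapshots for _ in range(factor)]


def resynthesise(snapshots):
    """Convert snapshots back to squiggles, keeping the phase continuous."""
    out, phase = [], 0.0
    for freq in snapshots:
        for _ in range(FRAME):
            out.append(math.sin(phase))
            phase += 2 * math.pi * freq / RATE
    return out


wave = tone(700, 1.0) + tone(1900, 1.0)   # two notes, one second each
snaps = analyse(wave)
slow = resynthesise(stretch(snaps, 2))
print(len(wave) / RATE, "seconds in,", len(slow) / RATE, "seconds out")
```

Because the oscillator's phase simply carries on from one snapshot to the next, there are no joins in the output for a chopping noise to appear at.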
Alas, this process is not completely artifact-free either. This is due to the mathematical impossibility of defining exactly what a "snapshot" of the spectrum is. (But it does do a lot better than chopping time stretchers on a wide variety of audio material.)
The first problem is that one has to look at a certain amount of time of the wave for each spectral frame. If you look at a large amount of time, "smear" will result.
Think of an extreme case: a one-second wide "window" moving along a ten second file. The exact middle of the window is the point in the wave file we're "looking at" for instantaneous spectral content, but to gather the data for the spectral frame at that midpoint, we look at the whole second of audio under the "window."
If there is an abrupt change in the audio at 5 seconds, this change will make itself heard in spectral frames starting at 4.5 seconds, when the right side of the "window" hits the abrupt change; and ending at 5.5 seconds, when the left side of the window finally stops touching the abrupt change. The abrupt change will "smear out" over all frames between 4.5 and 5.5 seconds.
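The arithmetic of the smear is simple enough to write down. A small Python sketch (the function name smear_interval is made up for illustration), assuming the window is centred on the point being looked at, as described above:

```python
def smear_interval(event_time, window_seconds):
    """Return (first, last) frame times in seconds whose window touches
    an abrupt change at event_time, for a window centred on each frame."""
    half = window_seconds / 2.0
    return (event_time - half, event_time + half)


print(smear_interval(5.0, 1.0))   # the one-second window from the text
print(smear_interval(5.0, 0.1))   # a 100 ms window smears far less
```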
So why not make the window size very small?
The second problem is that the window has to be a certain size before the frequency content of the window can be determined at all. Think of an extreme case: the window is just one sample. How do you determine what frequencies are present in just one sample!? So that wouldn't work. Well, let's consider 16 samples then. At 44.1 kHz, 16 samples is 0.36 milliseconds, so the lowest frequency that can be detected will then be about 2.7 kHz!
Another side effect of the window size is that it determines the "number of frequency bins". The bins are spaced evenly. With a large window, for example, there is roughly one bin for 0-10 Hz, one bin for 10-20 Hz, one for 20-30, etc., up to the last bin for 22040 to 22050 Hz. Yes, you guessed it: if two strong frequencies in the input occupy the same bin for extended periods of time, it sounds a bit strange. With -N16, all frequencies between 0 and about 2.7 kHz share that first bin! Frequencies between about 2.7 and 5.5 kHz go in the second bin. Yes, pvoc -N16 sounds a bit odd.
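To check which frequencies end up sharing a bin, the simplified picture above (bin number k covers everything from k bin-widths up to k+1 bin-widths) can be written as a short Python sketch. The function names are made up, and the bin width is assumed to be the sample rate divided by the window size:

```python
def bin_width_hz(window_samples, rate=44100):
    """Spacing between neighbouring bins, in Hz."""
    return rate / window_samples


def bin_index(freq_hz, window_samples, rate=44100):
    """Which bin a frequency falls into, under the simplified
    'bin k covers k*width up to (k+1)*width' picture used above."""
    return int(freq_hz // bin_width_hz(window_samples, rate))


print(bin_width_hz(16))                              # 2756.25 Hz per bin
print(bin_index(440, 16), bin_index(2000, 16))       # both land in bin 0
print(bin_index(3000, 16))                           # second bin (index 1)
print(bin_index(440, 4096), bin_index(2000, 4096))   # far apart at -N4096
```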
So you basically have to make a tradeoff between windows that are too big, which give lots of smear, and windows that are too small, which give not enough frequency resolution and bass.
Window size is controlled with the "-N" flag. For a 44.1 kHz file, a size 4096 window is pretty OK for most purposes. This will detect frequencies down to about 10 Hz, and each frame will have a "spectral smear" of about 50 milliseconds "future" and 50 milliseconds "past". For a complex sample (i.e., a whole band playing) 10 Hz is usually enough bass for the human ear, and the smear is not so bad that you can't make out individual notes and chords. If you have a sample with simpler content and less bass (say, a voice speaking, or a single instrument playing) go right ahead and use fewer bins!
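The tradeoff can be tabulated for a few "-N" values. This Python sketch only restates the arithmetic used above (window length = samples / sample rate, smear = half a window each way, lowest frequency = sample rate / samples); the function name window_stats is made up for illustration.

```python
def window_stats(window_samples, rate=44100):
    """Window length, smear on each side, and lowest detectable frequency."""
    window_ms = 1000.0 * window_samples / rate
    return {
        "window_ms": window_ms,
        "smear_each_side_ms": window_ms / 2.0,
        "lowest_hz": rate / window_samples,
    }


for n in (16, 256, 1024, 2048, 4096):
    s = window_stats(n)
    print("-N%-5d window %7.2f ms   smear +/- %6.2f ms   lowest %8.2f Hz"
          % (n, s["window_ms"], s["smear_each_side_ms"], s["lowest_hz"]))
```

For -N4096 this gives a window of about 93 ms, so the "about 50 milliseconds" of smear each way quoted above is a rounded-up 46 ms, and the "10 Hz" is really closer to 10.8 Hz.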
Here are some examples of how different window sizes sound on a complex file at 44.1 kHz:
And a simple file (one instrument, an electric guitar):
There is one type of audio stretching/compression which is vastly superior: the McAulay/Quatieri (MQ) algorithm. FFT is used to obtain spectral snapshots like in the phase vocoder, but the frequency bin data (except the "phase" component, which is simply thrown away) is then processed further. First, parabolic interpolation is used to accurately find frequency peaks that possibly lie between bins. (Amongst other things, this means the effects of smear are reduced.) Then corresponding peak information from the previous snapshot is considered heuristically with threshold and hysteresis criteria, to discard the weakest peaks. The remaining peaks are tied together frame-by-frame in "tracks" with very accurate frequency. These can be compressed and expanded in any fashion.
Basically, MQ is asymmetric: the resynthesis can not be done with IFFT. Even at unity timescale the actual waveform output will be very different from the input. But the important aspect of it is that it will _sound_ the same!
MQ is what the Mac program "Lemur" uses. There is no Windoze executable, here or anywhere else. Please let me know if you make one or bump into one.
Update! Mikko Haapanen tipped me off about more modern versions of pvoc-related software. He gave me a few links, such as one about new formats (has better pvoc executables) and the Bath University sound research homepage. From the latter link I found SNDAN, which incorporates MQ analysis and synthesis programs! I'm sure the newer versions are more efficiently coded: no old FORTRAN code or reading one sample per function call.
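Going back to the MQ description above, the parabolic interpolation step can be sketched in Python. This is the standard three-point parabola fit, not code taken from Lemur or SNDAN; the function names and the simple threshold test are assumptions, and the hysteresis and track-building stages are left out entirely.

```python
def parabolic_peak(left, centre, right):
    """Fit a parabola through three neighbouring bin magnitudes.

    Returns (offset, height): offset is the peak position in bins,
    relative to the centre bin; height is the interpolated peak value.
    """
    denom = left - 2.0 * centre + right
    if denom == 0.0:                 # flat: no curvature to interpolate
        return 0.0, centre
    offset = 0.5 * (left - right) / denom
    height = centre - 0.25 * (left - right) * offset
    return offset, height


def find_peaks(magnitudes, bin_hz, threshold):
    """Return (frequency_hz, height) for every local maximum above threshold."""
    peaks = []
    for k in range(1, len(magnitudes) - 1):
        a, b, c = magnitudes[k - 1], magnitudes[k], magnitudes[k + 1]
        if b > threshold and b > a and b >= c:
            offset, height = parabolic_peak(a, b, c)
            peaks.append(((k + offset) * bin_hz, height))
    return peaks


# A made-up spectrum whose true peak sits between bins 10 and 11.
spectrum = [5.0 - (k - 10.3) ** 2 for k in range(21)]
print(find_peaks(spectrum, bin_hz=10.0, threshold=0.0))
```

With 10 Hz bins, the strongest bin alone would report 100 Hz for this made-up spectrum, while the interpolated peak comes out at about 103 Hz.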
pvoc -h