♫Jens Johansson · The pvoc (phase vocoder) page


The pvoc (phase vocoder) page

If you want to time stretch wave files on a Windows machine to transcribe or learn music, get the pvoc executable. It's only about 75 kbyte in size!

Why?

Because most commercial stretch routines don't lend themselves very well to transcription. They distort the pitches of chords, especially in the bass, or create other artifacts that make it hard to hear what's going on.

Here are some typical examples of what happens when you try to stretch audio to twice the original length:

(The music is from the Frank Zappa composition "Envelopes", excerpted here as "fair use for educational purposes". Incidentally, it's on the wonderful CD "Ship Arriving Too Late to Save a Drowning Witch", which everyone should own at least one copy of. Also check out internet search links for 'zappa'.)


Basic HOWTO

Basic use of pvoc for lay people:

0. Install PVOC.EXE somewhere in your path, or put it in the directory where you keep and process your wav files. (Let's say this is c:\windows\desktop\wavfiles)

1. Create a Windows wav file with the sound you want to stretch/compress. Let's say it's called "guitarriff.wav".

2. Open a DOS box. CD to the directory where the wav file above is.

	cd c:\windows\desktop\wavfiles

3a. If you want to stretch the file to twice the length, type

	pvoc -N2048 -T2.0 -Yguitarriff.wav -0stretched-guitarriff.wav

3b. If you want to compress the file to half the length, type

	pvoc -N2048 -T0.5 -Yguitarriff.wav -0compressed-guitarriff.wav

The program is a bit slow and has no progress indicator, so be patient.

The name after the "-Y" is the sound file to process.
The name after the "-0" is where the resulting sound file will be written. (Note: that's "dash zero", not "dash O"! The two characters look a bit alike if the font doesn't provide a programmer's "slashed zero".)
The number after the "-N" is the number of bins.

2048 bins is a good start value.
The number of bins should be a power of two. Allowed: 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768
A higher number of bins gives more "time smear". If drum or chord attacks are too indistinct, decrease N.
A lower number of bins gives more "frequency smear". If complex chords sound muddy or bass disappears, increase N.

The number after the "-T" is the time factor.

Time factor = 2.0 => the file will be twice as slow.
Time factor = 0.5 => the file will be twice as fast!
Time factor = 10.0 => the file will be ten times as slow.

You can experiment with the flags, for instance try

	pvoc -N32 -T10 -Yguitarriff.wav -0absurd-guitarriff.wav
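If you'd rather script these runs than type them, here is a minimal Python sketch that only builds the command line. It uses just the four flags shown above (-N, -T, -Y, -0). The function names pvoc_args and stretched_seconds are made up for this example, and nothing here actually runs pvoc.

```python
def pvoc_args(infile, outfile, time_factor=2.0, bins=2048):
    """Build the argument list for one pvoc run (does not execute anything)."""
    # Bin counts the page lists as allowed: powers of two from 16 to 32768.
    allowed = [2 ** k for k in range(4, 16)]
    if bins not in allowed:
        raise ValueError("bins must be one of %s" % allowed)
    if time_factor <= 0:
        raise ValueError("time factor must be positive")
    # Flag and value are written together, as in the examples above.
    return ["pvoc", "-N%d" % bins, "-T%g" % time_factor,
            "-Y" + infile, "-0" + outfile]


def stretched_seconds(original_seconds, time_factor):
    """Expected length of the result: T=2.0 doubles it, T=0.5 halves it."""
    return original_seconds * time_factor


if __name__ == "__main__":
    print(" ".join(pvoc_args("guitarriff.wav", "stretched-guitarriff.wav")))
    print(stretched_seconds(3.0, 2.0), "seconds")
```

The returned list could be handed to Python's subprocess.run, assuming pvoc is somewhere in your path.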

Details

Here's a somewhat more detailed explanation.

Normally, time stretching/compression is done by chopping up the audio in chunks of a few milliseconds. These chunks of audio waveform are then played back — either quicker, by dropping parts of the chopped waveform, or slower, by repeating parts of the chopped waveform. The chopping points are difficult to choose for a stupid, "mindless" algorithm, and this gives rise to ugly artifacts.
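To make the "chopping" idea concrete, here is a deliberately crude Python sketch of it. This is not the code of any particular commercial product. The function name chop_stretch and the tiny chunk size are invented for the illustration (a real chunk of a few milliseconds would be hundreds of samples at 44.1 kHz).

```python
def chop_stretch(samples, factor, chunk=4):
    """Naive time stretch: cut into fixed chunks, then repeat or drop chunks.

    There is no cross-fading and no clever choice of cut points, so the
    joins between chunks are where audible glitches would come from.
    """
    chunks = [samples[i:i + chunk] for i in range(0, len(samples), chunk)]
    out = []
    n_out = int(round(len(chunks) * factor))
    for k in range(n_out):
        # Map each output chunk back to an input chunk.
        src = min(int(k / factor), len(chunks) - 1)
        out.extend(chunks[src])
    return out


if __name__ == "__main__":
    wave = list(range(8))
    print(chop_stretch(wave, 2.0))   # every chunk is played twice
    print(chop_stretch(wave, 0.5))   # every other chunk is dropped
```

Wherever a chunk is repeated or skipped, the waveform jumps from the end of one chunk to the start of another, and nothing guarantees that the two ends line up.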

The Phase Vocoder has a totally different approach. It converts the audio to a series of time-varying "spectral snapshots". By "spectrum" in this context I mean "frequency spectrum". These "snapshots" of the frequency content of the sound can be played back at any speed internally, somewhat like the frames of a film, and then converted back to wave audio. The "spectral power density" representation is a lot closer to how the brain and auditory complex "see" sound, which in my opinion is one of the reasons it is so useful. (This "spectral snapshot" concept is also the basis for "perception-equivalent" compressed audio formats, like mp3.)

A very simplified way of explaining it would be to consider an audio file that has two notes in it: first one note with the frequency 700 Hz, and then one note with 1900 Hz, each for one second. The wave form looks like one second of 700 squiggles per second, then one second of 1900 squiggles per second.

The program internally converts this wave to "one second worth of snapshots with energy at 700 Hz" followed by "one second worth of snapshots with energy at 1900 Hz". So if we play back these two seconds worth of snapshot sequences at half speed, the result will be "two seconds worth of snapshots with energy at 700 Hz" followed by "two seconds worth of snapshots with energy at 1900 Hz". These four seconds worth of snapshots are then converted back to wave audio ("squiggles"), which then will be exactly twice as long, but with no chopping noises or other glitches evident.
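The two-note example can be imitated with a toy in Python. To be clear, this is not a phase vocoder and not how pvoc works inside. Each "snapshot" here is cheated down to a single frequency, and the names RATE, HOP, analyse_toy and play_snapshots are all invented for the sketch. It only shows the "play the snapshots back slower, then turn them into squiggles again" idea.

```python
import math

RATE = 8000   # samples per second (kept low so the toy runs quickly)
HOP = 80      # one snapshot per 80 samples, i.e. 100 snapshots per second


def analyse_toy():
    """Pretend analysis: one second of 700 Hz, then one second of 1900 Hz.

    Each snapshot is just the single frequency that carries energy.
    """
    per_second = RATE // HOP
    return [700.0] * per_second + [1900.0] * per_second


def play_snapshots(snapshots, time_factor):
    """Step through the snapshots at 1/time_factor speed and resynthesize."""
    n_frames = int(round(len(snapshots) * time_factor))
    out = []
    phase = 0.0
    for k in range(n_frames):
        # Which original snapshot is "on screen" for this output frame.
        src = min(int(k / time_factor), len(snapshots) - 1)
        freq = snapshots[src]
        for _ in range(HOP):
            out.append(math.sin(phase))
            # Carry the phase across frames so there is no click at the joins.
            phase += 2.0 * math.pi * freq / RATE
    return out


if __name__ == "__main__":
    slow = play_snapshots(analyse_toy(), 2.0)
    print(len(slow) / RATE, "seconds of output")
```

The pitch of each note is untouched because the frequencies stored in the snapshots never change. Only the number of snapshots played per note does.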

Alas, this process is not completely artifact-free either. This is due to the mathematical impossibility of defining exactly what a "snapshot" of the spectrum is. (But it does do a lot better than chopping time stretchers on a wide variety of audio material.)

The first problem is that one has to look at a certain amount of time of the wave for each spectral frame. If you look at a large amount of time, "smear" will result.

Think of an extreme case: a one-second wide "window" moving along a ten second file. The exact middle of the window is the point in the wave file we're "looking at" for instantaneous spectral content — to gather the data for the spectral frame at the midpoint, we look at the whole second of audio under the "window."

If there is an abrupt change in the audio at 5 seconds, this change will make itself heard in spectral frames starting at 4.5 seconds, when the right side of the "window" hits the abrupt change; and ending at 5.5 seconds, when the left side of the window finally stops touching the abrupt change. The abrupt change will "smear out" over all frames between 4.5 and 5.5 seconds.
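The arithmetic of that example, as a small Python sketch. The function names are invented here, and the window is assumed to be centred on the frame time, as described above.

```python
def smear_interval(event_time, window_seconds):
    """Return (first, last) frame times whose window touches the event."""
    half = window_seconds / 2.0
    return (event_time - half, event_time + half)


def frames_touched(event_time, window_seconds, frame_times):
    """List the frame times that will 'hear' an abrupt change at event_time."""
    start, end = smear_interval(event_time, window_seconds)
    return [t for t in frame_times if start <= t <= end]


if __name__ == "__main__":
    print(smear_interval(5.0, 1.0))
    print(frames_touched(5.0, 1.0, [4.0, 4.5, 5.0, 5.5, 6.0]))
```

A shorter window shrinks the interval in direct proportion, which is why a smaller window gives less smear.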

So why not make the window size very small?

The second problem is that the window has to be a certain size for its frequency content to be determined at all. Think of an extreme case: the window is just one sample. How do you determine what frequencies are present in just one sample!? So that wouldn't work. Well, let's consider 16 samples then. At 44.1 kHz, 16 samples is 0.36 milliseconds, so the lowest frequency that can be detected will then be 2.7 kHz!

Another side effect of the window size is that it corresponds to the "number of frequency bins". The bins are spaced evenly. With bins 10 Hz wide, for example, there is one bin for 0-10 Hz, one bin for 10-20 Hz, one for 20-30 Hz, and so on, up to the last bin for 22040 to 22050 Hz. Yes, you guessed it: if two strong frequencies in the input occupy the same bin for extended periods of time, it sounds a bit strange. With 16 bins, all frequencies between 0 and 2.7 kHz share that first bin! Frequencies between 2.7 and 5.5 kHz go in the second bin. Yes, pvoc -N16 sounds a bit odd.
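Here is a small Python sketch for checking whether two notes would land in the same bin. It follows the simplified picture used above (bin 0 runs from 0 Hz up to one bin width, and so on) and assumes the bin width is the sample rate divided by N. The function names are made up, and the real program's internal bookkeeping may well differ.

```python
def bin_width_hz(n_bins, rate=44100):
    """Width of one bin in Hz, assuming evenly spaced bins of rate / N."""
    return rate / n_bins


def bin_index(freq_hz, n_bins, rate=44100):
    """Which bin a frequency falls into, counting from bin 0."""
    return int(freq_hz // bin_width_hz(n_bins, rate))


def share_a_bin(f1, f2, n_bins, rate=44100):
    """True if both frequencies end up in the same bin."""
    return bin_index(f1, n_bins, rate) == bin_index(f2, n_bins, rate)


if __name__ == "__main__":
    # Low E (41.2 Hz) and low A (55 Hz) on a bass guitar.
    for n in (512, 4096):
        print(n, "bins:", share_a_bin(41.2, 55.0, n))
```

With 512 bins at 44.1 kHz both bass notes sit in the first bin. With 4096 bins they are a couple of bins apart.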

So you basically have to make a tradeoff between windows that are too big, with lots of smear, and windows that are too small, with not enough frequency resolution and bass.

Window size is controlled with the "-N" flag. For a 44.1 kHz file, a size 4096 window is pretty OK for most purposes. This will detect frequencies down to about 10 Hz, and each frame will have a "spectral smear" of about 50 milliseconds "future" and 50 milliseconds "past". For a complex sample (i.e., a whole band playing) 10 Hz is usually enough bass for the human ear, and the smear is not so bad that you can't make out individual notes and chords. If you have a sample with simpler content and less bass (say, a voice speaking, or a single instrument playing) go right ahead and use fewer bins!
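The tradeoff can be tabulated for all the allowed -N values. This Python sketch assumes, as in the text, that the lowest detectable frequency is roughly the sample rate divided by N and that the smear reaches half a window into the "past" and "future". The function name window_tradeoff is made up for the example.

```python
def window_tradeoff(n, rate=44100):
    """Rough numbers for one window size: lowest frequency and smear."""
    window_ms = 1000.0 * n / rate
    return {
        "bins": n,
        "lowest_hz": rate / n,
        "smear_ms_each_side": window_ms / 2.0,
    }


if __name__ == "__main__":
    print("  bins   lowest freq   smear each side")
    for k in range(4, 16):          # 16, 32, ... 32768
        row = window_tradeoff(2 ** k)
        print("%6d  %9.1f Hz   +/- %7.2f ms"
              % (row["bins"], row["lowest_hz"], row["smear_ms_each_side"]))
```

For 4096 bins this gives about 10.8 Hz and about 46 ms each way, which is where the "10 Hz" and "about 50 milliseconds" above come from.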

Here are some examples of how different window sizes sound on a complex file at 44.1 kHz:

And a simple file (one instrument — electric guitar):

There is one type of audio stretching/compression which is vastly superior: the McAulay/Quatieri (MQ) algorithm. FFT is used to obtain spectral snapshots like in the phase vocoder, but the frequency bin data is then processed further (except for the imaginary ("phase") component, which is simply thrown away). First, parabolic interpolation is used to accurately find frequency peaks that may lie between bins. (Amongst other things, this means the effects of smear are reduced.) Then corresponding peak information from the previous snapshot is considered heuristically with threshold and hysteresis criteria, to discard the weakest peaks. The remaining peaks are tied together frame-by-frame in "tracks" with very accurate frequency. These can be compressed and expanded in any fashion.

Basically, MQ is asymmetric: the resynthesis cannot be done with IFFT. Even at unity timescale the actual waveform output will be very different from the input. But the important aspect of it is that it will _sound_ the same! MQ is what the Mac program "Lemur" uses. There is no Windoze executable, here or anywhere else. Please let me know if you make one or bump into one.

Update! Mikko Haapanen tipped me off about more modern versions of pvoc-related software. He gave me a few links, such as one about new formats (has better pvoc executables) and the Bath University sound research homepage. From the latter link I found SNDAN, which incorporates MQ analysis and synthesis programs! I'm sure the newer versions are more efficiently coded, with no old FORTRAN code or reading one sample per function call.
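The parabolic interpolation step mentioned in the MQ description above can be sketched in a few lines of Python. This is the standard three-point parabola fit, not code taken from Lemur or SNDAN. The function names are invented here, and real implementations often apply the fit to log magnitudes rather than raw ones.

```python
def parabolic_peak(left, centre, right):
    """Fit a parabola through three neighbouring bin magnitudes.

    Returns (offset, height): offset is where the true peak lies relative
    to the centre bin, in bins (between -0.5 and +0.5 when the centre bin
    is a local maximum), and height is the estimated peak magnitude.
    """
    denom = left - 2.0 * centre + right
    if denom == 0.0:
        return 0.0, centre          # flat: nothing to interpolate
    offset = 0.5 * (left - right) / denom
    height = centre - 0.25 * (left - right) * offset
    return offset, height


def peak_frequency(mags, k, n_bins, rate=44100):
    """Refined frequency in Hz for a peak found at bin k (0 < k < len - 1)."""
    offset, _ = parabolic_peak(mags[k - 1], mags[k], mags[k + 1])
    return (k + offset) * rate / n_bins


if __name__ == "__main__":
    # Three samples of a parabola whose true peak is 0.3 bins right of centre.
    print(parabolic_peak(3.31, 4.91, 4.51))
```

So even though the spectrum is only known at whole bin positions, a peak that really lies between two bins can be placed more accurately than one bin width.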

The flags

I've gotten a few questions regarding the meanings of the command line flags. Most of them are esoteric or relate more directly to the inner workings of the phase vocoder. Many of them I haven't even tried, some of them don't work, some of them are just buggy. To see a short list of the flags, type

	pvoc -h

R = input sample rate (automatically read from infile): You don't need to muck about with this. In this version the sample rate is always read from the wav file, and the resulting file is set to the same rate.
F = fundamental frequency (R/256) DON'T USE -F AND -N: This is basically just another way to specify the number of bins in a sample-rate-independent way. For instance, try -F10: it should instruct pvoc to see frequencies down to 10-ish Hz with a 100-ish msec smear.
N = # of bandpass filters (256 unless -F is specified): Described at length above. "Bandpass filters" has the same meaning as "bins".
W = filter overlap factor: 0,1,(2),3 DON'T USE -W AND -M: This can be fine-tuned to reduce or increase smear audibility.
M = analysis window length (N-1 unless -W is specified): This is sort of a more counter-intuitive version of the -W flag above.
L = synthesis window length (M): Can be reduced in certain cases.
D = decimation factor (min((M/(8*T)),(M/8))): This determines the "internal sampling rate" of the spectral data. You shouldn't have to muck with this either; the default value seems to be OK. See Dolson's article for details.
I = interpolation factor (=T*D): Relates to -D above and ideally shouldn't have to be mucked with either.
T = time-scale factor (1.): Time stretch factor.
P = pitch-scale factor (1.) DON'T USE -T AND -P: Pitch shift factor.
C = resynthesize odd (1) or even (2) channels only: To do only left or right channel.
i = resynthesize bandpass filters i thru j only: -i indicates the first bin to synthesize. See -j below.
j = resynthesize bandpass filters i thru j only: -j indicates the last bin to synthesize. For instance, -N16384 -i8192 -j12285 only resynthesizes bins 8192 to 12285. The result is a band pass type effect on the output.
b = starting sample (0): Use to only process parts of a file.
e = final sample (end of input): Use to only process parts of a file.
w = warp factor for spectral envelope (1.): This has to do with pitch-shifting while attempting to preserve the general shape of the spectrum, to make it sound more natural. Used with -E option, below, which doesn't work. See Dolson's article for details on this.
A: analysis only: output will be analysis data: I haven't tested this, but I'm sure it's broken, as the win32 program reads and writes 16-bit linear Windows wave files exclusively. This option normally wants IEEE floats on the input. I haven't had time to check how the truncation to 16 bit integers affects the analysis data, but I suspect it screws it up badly.
E: analysis only: output will be spectral envelope: Same as above. Normally wants to write IEEE floats. It's an interesting idea, so one day I'll fix this.
S: synthesis only: input must be analysis data: Same as above. Normally wants to read IEEE floats.
K: use Kaiser filter instead of Hamming: This crashes the program. Will look into it one of these days.
V filename: verbose (summarize on pvoc.sta or file) if filename is specified, it must follow all flags: Never tried this flag either, believe it or not.
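Several of the defaults in the list above are defined in terms of each other, so here is a Python sketch that just evaluates the formulas as printed in the help text. I haven't checked it against the pvoc source. In particular, the program presumably rounds these to whole numbers somewhere, which this sketch doesn't attempt, and the function names are made up.

```python
def derived_defaults(rate, n=256, t=1.0):
    """Evaluate the default formulas from the flag list for given R, N, T."""
    m = n - 1                              # M: analysis window length (N-1)
    d = min(m / (8.0 * t), m / 8.0)        # D: decimation factor
    return {
        "F": rate / n,                     # fundamental frequency (R/N)
        "M": m,
        "L": m,                            # L: synthesis window length (M)
        "D": d,
        "I": t * d,                        # I: interpolation factor (T*D)
    }


def bins_for_fundamental(rate, f):
    """The N that corresponds to a given -F value, before any rounding."""
    return rate / f


if __name__ == "__main__":
    print(derived_defaults(44100, n=2048, t=2.0))
    print(bins_for_fundamental(44100, 10.0), "bins for -F10 at 44.1 kHz")
```

One thing the formulas show: whether you stretch (T above 1) or compress (T below 1), neither D nor I ever goes above M/8.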

Even more details

I have scanned Dolson's 1986 article from Computer Music Journal here. It is a bit longer (14 pages), but it's an excellent tutorial for musicians, in fairly easy to understand layman's terms! Check it out! I also later happened upon an earlier version of this document (it seems a little less detailed) in the documentation for some CARL Next software. It was in troff format, which I converted to html in case it's helpful to someone.

Source, etc

I didn't write this myself, rather, ported it from the classic pvoc sources which I found at ftp://ftp.bath.ac.uk/pub/jpff/mdpvoc/. I also serve up the hacked source (which compiles under MSVC 4.0, at least) here.

Page updated Oct 27, 2001 at 06:31 • Email: jens@panix.com

All content copyright © Jens Johansson 2024. No unauthorized duplication, copying, mirroring, archival, or redistribution/retransmission allowed! Any offensively categorical statements passed off as facts herein should only be construed as my very opinionated opinions.