Phase Vocoder Technical Documentation
This page provides in-depth documentation for the enhanced phase vocoder implementation.
The Phase Problem
What is Phase?
In the frequency domain, each frequency component has two properties:
\[X[k] = |X[k]| \cdot e^{j\phi[k]}\]Magnitude $|X[k]|$
How loud (amplitude) — determines volume of each frequency
Phase $\phi[k]$
Where in the cycle (timing) — determines waveform shape
Why Phase Gets Broken
When we shift frequencies, we change bin assignments:
440 Hz
φ = 0.5π"] end subgraph After ["After +100 Hz"] B["Bin 109
540 Hz
φ = ???"] end A -->|shift| B style A fill:#3d2963,stroke:#5b4180,color:#f5f0e6 style B fill:#cd8b32,stroke:#b8860b,color:#fff
If we just copy the original phase, it’s wrong for the new frequency!
The Artifacts
Incorrect phase causes:
- Metallic/robotic sound — phase incoherence across bins
- Phasiness — loss of presence
- Smearing — transients lose sharpness
- Pre-echo — artifacts before transients
Why Phase Matters
Horizontal Coherence (Temporal)
Phase must evolve correctly across frames in time.
For a sinusoid at frequency $f$:
\[\phi_{\text{expected}} = \frac{2\pi f H}{f_s}\]If phase doesn’t advance correctly → clicks, discontinuities
Vertical Coherence (Spectral)
Phase relationships between frequency bins must be preserved.
For a harmonic sound (e.g., 440 Hz fundamental + 880 Hz harmonic):
\[\phi_{\text{harmonic}} \approx 2 \times \phi_{\text{fundamental}}\]If this relationship breaks → metallic sound
Our Enhanced Implementation
Based on Laroche & Dolson (1999): “Improved phase vocoder time-scale modification of audio”
1. Peak Detection
Spectral peaks represent important content:
- Harmonics of musical notes
- Formants in vocals
- Resonances in instruments
Algorithm: Find local maxima above threshold:
\[\text{peak}[k] = \begin{cases} 1 & \text{if } |X[k]| > |X[k-1]| \text{ and } |X[k]| > |X[k+1]| \text{ and } |X[k]|_{\text{dB}} > \text{threshold} \\ 0 & \text{otherwise} \end{cases}\]2. Identity Phase Locking
Laroche & Dolson’s Key Contribution
Bins near a peak belong to the same partial. Their phases should maintain their relationships.
For each peak at bin $p$ with region of influence $r$:
\[\phi_{\text{locked}}[k] = \phi[p] + (\phi_{\text{original}}[k] - \phi[p])\]for all $k \in [p-r, p+r]$
Why it works:
- Peak represents the “center” of a partial
- Nearby bins contribute to the same partial
- Locking phases maintains the partial’s shape
3. Instantaneous Frequency Estimation
Standard FFT assumes frequencies are exact multiples of $f_s / N$.
Real audio frequencies fall between bins:
440 Hz might fall in bin 40.8 (not integer!)
We compute the true frequency using phase deviation:
\[\phi_{\text{expected}}[k] = \frac{2\pi k H}{N}\] \[\Delta\phi = \phi_{\text{curr}}[k] - \phi_{\text{prev}}[k] - \phi_{\text{expected}}[k]\] \[f_{\text{inst}}[k] = \frac{k \cdot f_s}{N} + \frac{\text{wrap}(\Delta\phi) \cdot f_s}{2\pi H}\]4. Phase Synthesis
Generate coherent phases for the modified spectrum:
\[\phi_{\text{advance}} = \frac{2\pi \cdot f_{\text{shifted}} \cdot H}{f_s}\] \[\phi_{\text{synth}} = \phi_{\text{prev}} + \phi_{\text{advance}}\] \[\phi_{\text{synth}} = \text{wrap}(\phi_{\text{synth}}) \quad \text{to } [-\pi, \pi]\]Complete Processing Flow
Parameter Tuning
Peak Detection Threshold
Default: $\text{threshold} = -40 \text{ dB}$
| Value | Effect |
|---|---|
| Too high (-20 dB) | Misses weak harmonics, more artifacts |
| Too low (-60 dB) | Detects noise as peaks, over-locking |
| Recommended (-40 dB) | Good balance for most music |
Region of Influence
Default: $r = 4$ bins (±4 bins around each peak)
At $f_s = 44100$ Hz with $N = 4096$:
- Bin width: $\Delta f \approx 10.77$ Hz
- Region: $\pm 43$ Hz around each peak
| Value | Effect |
|---|---|
| Too small ($r=2$) | Incomplete phase locking |
| Too large ($r=8$) | Locks bins from different partials |
| Recommended ($r=4$) | Good for 4096 FFT |
Performance
Quality Improvement
Comparison vs. naive phase copying:
| Metric | Naive | Enhanced |
|---|---|---|
| Metallic Artifacts | High | Low (~80% reduction) |
| Transient Preservation | Poor | Good |
| Harmonic Clarity | Low | High |
| Pre-echo | Noticeable | Minimal |
Computational Cost
\(O(N) \text{ for peak detection}\) \(O(\text{peaks} \times r) \approx O(80) \text{ for phase locking}\)
Total overhead: ~15-20% — well worth it for the quality improvement!
Troubleshooting
- Reduce shift amount (keep under ±250 Hz)
- Increase FFT size (4096 → 8192)
- Reduce hop size (1024 → 512)
Cause: Over-aggressive phase locking
- Reduce region_size (4 → 2)
- Raise threshold (-40 → -30 dB)
Material-Specific Issues
| Material | Issue | Solution |
|---|---|---|
| Vocals | Usually works great | Default settings |
| Percussion | May smear transients | Smaller FFT (2048) |
| Bass | Coarse quantization | Larger FFT (8192) |
| Noise | May sound worse | Mix with dry signal |
Research References
-
Laroche, J., & Dolson, M. (1999) “Improved phase vocoder time-scale modification of audio” IEEE Transactions on Speech and Audio Processing, 7(3), 323-332
-
Flanagan, J. L., & Golden, R. M. (1966) “Phase vocoder” Bell System Technical Journal, 45(9), 1493-1509
-
Dolson, M. (1986) “The phase vocoder: A tutorial” Computer Music Journal, 10(4), 14-27
-
Průša, Z., & Holighaus, N. (2022) “Phase Vocoder Done Right” arXiv preprint
| Back to Home | Algorithm Details |