Some kinds of audio restoration are on pretty firm ground these days: declicking and decrackling, for instance, or noise reduction. But speed variance correction—the removal of wow and flutter—remains an extraordinarily daunting challenge. It has a reputation for being the high-hanging fruit on the audio restoration tree, akin to rocket science or brain surgery. The only practical solutions put forward thus far have been limited to particular categories of musical content or to media with ultrasonic pilot signals, and they require costly software or specially configured playback equipment.
I’ve recently come up with an alternative method of my own that works surprisingly well. This method, which I’ve implemented in MATLAB and GNU Octave, can remove wow from off-center discs, out-of-round cylinders, and other media with periodic or cyclical speed fluctuations. It doesn’t require a pilot tone (ultrasonic or otherwise), and it can accommodate a wide range of content, including vocals with vibrato. And it’s already quite functional, although I still want to fine-tune some aspects of the process and the user interface.
So how does it work?
Let’s begin by considering a specific practical problem faced by the First Sounds initiative. Whenever we first convert one of the phonautograms recorded by Édouard-Léon Scott de Martinville from an image into a playable sound file, it comes out marred by extreme speed fluctuations. That’s because the cylinder of the phonautograph on which it was recorded was rotated irregularly by hand rather than by a well-governed motor. Fortunately, the phonautograms Scott recorded in 1860—including his iconic April 9, 1860 recording of “Au Clair de la Lune”—display the vibrations of a 250 Hz tuning fork recorded alongside the trace of the voice. By readjusting that trace to a consistent 250 Hz, and the voice trace along with it, we can restore the recording as a whole to its original consistent speed.
But there’s no off-the-shelf commercial software capable of speed-correcting a phonautogram automatically based on its tuning fork trace. Celemony’s Capstan comes closest, but it will only “autotune” pitches to the closest fifty cents, and Scott’s tuning fork traces vary more than that. For my book Pictures of Sound, I speed-corrected the tuning fork traces by hand, selecting and adjusting five cycles at a time, but there are several problems with that approach. First, it’s maddeningly time-consuming. Second, resampling tends to create a tiny click at each segment join; phonautograms are noisy enough that this isn’t very noticeable, but if you try this on a sine wave, the problem becomes glaringly evident. Third, the manual approach doesn’t actually do all that great a job of stabilizing the tuning fork pitch, which invariably remains a bit wobbly. David Giovannoni has applied Capstan to phonautograms that had already been speed-corrected by hand to within fifty cents of the target frequency, substantially improving the results, although all those tiny clicks remain. And even then, Capstan’s algorithm for detecting and fixing wow and flutter—which centers on inferring desirable pitch ratios from a complex analysis of musical content—strikes me as overkill when it comes to the comparatively simple task of locking to a pilot tone. The same could be said of Plangent Processes, a system that’s impressive more for the precision with which it captures ultrasonic pilot tones such as tape bias than for the computations it carries out on those tones afterwards—and the associated Clarity software isn’t available for public use anyway.
The method I use assumes the pilot tone is consistently the strongest frequency at each point throughout the data being analyzed. To help ensure that it is, the source WAV is first filtered to a particular frequency range—I’ve set the default to 100-1000 Hz to exclude typical rumble and high-frequency noise, but this is adjustable. I then use a Fast Fourier Transform (FFT) to determine the peak frequency for a window of a given size centered on each sample (or, if I’m in a hurry, every nth sample) and save the results as a frequency curve. Next, I “expand” each sample by a given ratio—typically 100:1—multiplied by the ratio of the corresponding value in the frequency curve to a target frequency, plotting a spline curve in between the resulting data points. Finally, I resample the result by the inverse of the expansion ratio—e.g., 1:100—to get my speed-corrected WAV.
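For readers who want to experiment, here is a minimal sketch of that pipeline in Python/NumPy rather than MATLAB/Octave. The function names and defaults are my own, and this sketch substitutes plain linear interpolation for the spline step and hop-based analysis for the sample-by-sample version:

```python
import numpy as np

def peak_frequency_curve(x, fs, win=4096, hop=256, fmin=100.0, fmax=1000.0):
    """Peak FFT frequency within [fmin, fmax] for a window centered on every hop-th sample."""
    freqs = np.fft.rfftfreq(win, 1.0 / fs)
    band = np.where((freqs >= fmin) & (freqs <= fmax))[0]
    padded = np.pad(x, win // 2)      # so windows can be centered on the first/last samples
    centers = np.arange(0, len(x), hop)
    taper = np.hanning(win)
    curve = np.empty(len(centers))
    for i, c in enumerate(centers):
        spectrum = np.abs(np.fft.rfft(padded[c:c + win] * taper))
        curve[i] = freqs[band[np.argmax(spectrum[band])]]
    return centers, curve

def speed_correct(x, fs, centers, curve, f_target, expand=100):
    """Time-warp x so that the detected peak frequency lands on f_target throughout."""
    # per-sample expansion factor: expand * (detected / target)
    ratio = np.interp(np.arange(len(x)), centers, curve) / f_target
    pos = np.cumsum(expand * ratio)   # each input sample's position on the "expanded" axis
    out_pos = np.arange(int(pos[-1] // expand)) * expand   # the 1:expand resampling points
    t_in = np.interp(out_pos, pos, np.arange(len(x), dtype=float))
    return np.interp(t_in, np.arange(len(x)), x)
```

Feeding a wobbly sine through speed_correct and then re-running peak_frequency_curve on the output makes a quick sanity check: the detected curve flattens toward the target frequency.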
I’ve already shared some of my results in a previous post, but in case you missed them, here’s the uncorrected version of the April 20, 1860 phonautogram of “Au Clair de la Lune”:
And the corrected version:
The technique I’ve just described isn’t necessarily limited to working with pilot tones, since it can be used to “autotune” any recording at all by readjusting whatever my algorithm detects as the peak frequency at each point in time to the same target frequency. I’ve already taken early records of vowel sounds with distracting speed fluctuations and locked them to a single pitch so that it becomes easier to discern distinctive timbres in them. And I’ve also provided for editing auto-detected curves that don’t initially come out as desired (e.g., if the algorithm gets temporarily distracted by a lower-frequency signal component or an upper harmonic).
But we don’t always have a pilot tone available to guide speed adjustment, which leads me to another development that builds on the previous one. Consider cyclical pitch fluctuations whose form is identical each time or changes only very gradually, such as those associated with an off-center gramophone disc or an out-of-round cylinder. My strategy for handling situations of this kind is based on two hypotheses:
- The effects of speed fluctuations on frequency curves will be consistent by cycle—that is, across each rotation or revolution of the carrier—whereas “real” changes in pitch won’t show any such correlation. If we average changes in frequency by cycle, the speed fluctuations should reinforce each other while “real” changes in pitch cancel themselves out.
- If the frequency changes imposed by speed fluctuation are sufficiently great, averaging should expose them clearly even if the number of cycles is limited.
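Here’s a toy demonstration of the first hypothesis in Python/NumPy. I stand in for the “real” pitch content with random noise, which is admittedly crude, but it captures the key property of being uncorrelated with the rotation cycle:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 1000                                  # samples per rotation
cycles = 20
# periodic speed fluctuation, identical in every rotation (log-frequency units)
wow = 0.05 * np.sin(2 * np.pi * np.arange(L) / L)
# "real" pitch motion, modeled crudely as noise uncorrelated with the cycle
melody = rng.normal(0.0, 0.05, cycles * L)
curve = np.tile(wow, cycles) + melody
# fold by cycle and average: the wow reinforces itself,
# while the uncorrelated part shrinks by roughly sqrt(cycles)
avg = curve.reshape(cycles, L).mean(axis=0)
```

With twenty rotations, the recovered cycle already tracks the injected wow closely; the residual of the uncorrelated component is down by a factor of about four and a half.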
This line of reasoning differs fundamentally from the logic behind Capstan, which takes as its point of departure the hypothesis that musical notes will be steadily held, as well as from the logic of Plangent Processes, which centers on harnessing “hidden” ultrasonic pilot tones. To the best of my knowledge, those are the only two systems for mitigating wow that have been made commercially available at any price (and the prices aren’t low). There have also been a number of scholarly papers written about digital methods of wow mitigation, including work by Simon Godsill in the 1990s and more recent contributions by Andrzej Czyżewski; but as far as I’ve read, the strategies outlined in past literature never take advantage of cyclicality to infer and counteract the specific shape of the fluctuation. I’m not sure whether this means that nobody else has thought of trying a cyclical averaging approach, or only that nobody else has been rash enough to imagine it might work. It does require a certain leap of faith in the power of averaging.
The application I came up with is called cyclify. It starts by loading an auto-detected peak frequency curve and converting the frequency values expressed in Hertz into their base-2 logarithms so that a given speed change will always result in the same numerical difference regardless of what frequency it affects. Next, it smooths the frequency curve so that the value of each point becomes the mean average of (say) 1000 points and the difference between each pair of points comes to represent the larger-scale local pitch gradient more reliably. It then calculates the difference between every pair of values in the smoothed frequency curve, compares all points on the resulting curve that are exactly one cycle length apart, and takes the median values, constructing an averaged curve out of them one cycle in length. Finally, it re-integrates the differences, evens out the gradient so that the loop connects at the seam, adjusts the median value of the result to the median value of the initial logarithmic curve, reconverts from logarithms back into Hertz, and loops the result as needed to fill out the duration of the recording. Then we can use the looped average to adjust the speed of the source recording just as we’d use a curve auto-detected from a pilot tone.
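Here’s the same sequence of steps sketched in Python/NumPy. It’s a simplification: the variable names are mine, the smoothing is a plain moving average, and real-world details like edge handling are glossed over.

```python
import numpy as np

def cyclify_curve(freq_curve, cycle_len, smooth=1000):
    """Build a looped, cycle-averaged frequency curve from a peak frequency curve."""
    logf = np.log2(freq_curve)            # equal speed changes -> equal differences
    sm = np.convolve(logf, np.ones(smooth) / smooth, mode='same')   # local mean
    d = np.diff(sm)                       # larger-scale local pitch gradient
    n = len(d) // cycle_len * cycle_len
    folded = d[:n].reshape(-1, cycle_len) # one row per rotation
    avg_d = np.median(folded, axis=0)     # median gradient at each cycle phase
    avg_d -= avg_d.mean()                 # even out the gradient so the loop closes at the seam
    cycle = np.concatenate(([0.0], np.cumsum(avg_d)))[:cycle_len]   # re-integrate
    cycle += np.median(logf) - np.median(cycle)   # re-anchor to the original median
    reps = -(-len(freq_curve) // cycle_len)       # ceiling division
    return 2.0 ** np.tile(cycle, reps)[:len(freq_curve)]   # back to Hertz, looped
```

On a synthetic curve whose wobble really is identical every rotation, this reconstructs the input almost exactly.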
So how well does cyclify work in practice? Let’s check out some results.
Here’s a link to a transfer of a copy at UCSB of Edison Blue Amberol 1628, “Non e Ver” by R. Festyn Davies, which David Giovannoni shared with me as an example of an out-of-round cylinder with wow he hadn’t been able to tame using Capstan while working on the Archeophone Archives series. Capstan had been able to handle the instrumental passages, but not the voice with its pronounced vibrato. More specifically, David emailed me this clip:
The nominal speed for Blue Amberol cylinders is 160 rpm, and this one had been digitized at a sampling rate of 44.1 kHz. A phonograph cylinder played at exactly 160 rpm and flawlessly digitized at 44.1 kHz should have a cycle length of exactly 16537.5 samples ((44100*60)/160), which we can round to 16537 or 16538. Here’s the result I got with cyclify by processing the clip at a cycle length of 16538 samples:
David wrote back: “This would have saved our bacon!” But not so fast—applying the same approach to the whole cylinder didn’t work nearly as well. What gives? Well, so far we’ve been assuming that the speed fluctuation will be consistent in form across all rotations, but of course that won’t ordinarily be true of something like a misshapen cylinder: the form of fluctuation will tend instead to change gradually over the course of the recording. Fortunately, all we need to do in order to handle such cases is implement a moving averaging window. Thus, rotations 1-10 could be adjusted to the average of rotations 1-20; rotation 11 to the average of rotations 2-21; rotation 12 to the average of rotations 3-22, and so on. Here’s a link to the whole “Non e Ver” recording processed by cyclify with a cycle length of 16538 samples and a moving 100-cycle window, which you can compare to your heart’s content with the original transfer.
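In sketch form, the moving window amounts to replacing the single median over all rotations with a per-rotation median over its neighbors. Here `folded` is an n_rotations by cycle_len array of per-rotation gradient curves, as in the averaging step described earlier; the clamping at the ends is my own guess at sensible behavior:

```python
import numpy as np

def moving_cycle_medians(folded, window):
    """Per-phase median over a moving window of rotations, clamped at the ends."""
    n = folded.shape[0]
    out = np.empty(folded.shape)
    for r in range(n):
        lo = max(0, min(r - window // 2, n - window))  # keep the window inside the recording
        out[r] = np.median(folded[lo:lo + window], axis=0)
    return out
```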
For another experiment, I set my turntable (a belt-drive KAB Transcriber II) to 45 rpm by stroboscope and digitized a 78 rpm disc I had intentionally positioned way off center. This was a Leeds and Catlin record of the mid-nineteen-oughts with an unbranded plain white label and two extra holes drilled in it near the spindle hole. I centered it on one of the two “extra” holes to introduce extreme wow—so bad, in fact, that I couldn’t even get the stylus to track the groove at 78 rpm, whence the choice of 45 rpm as the transfer speed.
A disc played at exactly 45 rpm and flawlessly digitized at 44.1 kHz should have a cycle length of exactly 58800 samples ((44100*60)/45). However, when I carried out some initial experiments on a brief excerpt taken from the middle of the recording, I found through trial and error that a setting of 59000 samples gave a smoother average curve, suggesting that my turntable had actually been running closer to 44.85 rpm. Later, I came up with an “autofocus” algorithm that iteratively tests different cycle lengths to find the one that yields the average frequency curve with the largest spread in its logarithmic values, and that enabled me to fine-tune things even further to a cycle length of 59059 (44.8027 rpm). Here’s the result I got at that cycle length with a 25-cycle moving window. It’s not bad, considering what I started out with—right up until the very end, when some conspicuous wow suddenly intrudes (as opposed to physical effects of extreme eccentricity on the stylus itself, which my technique can’t do anything about).
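The “autofocus” idea reduces to a short loop. This is a sketch, with the spread measured as the peak-to-peak range of the re-integrated median cycle; my actual implementation differs in some details:

```python
import numpy as np

def autofocus(logf, candidates, smooth=200):
    """Return the candidate cycle length whose folded median cycle
    shows the largest peak-to-peak spread in log-frequency."""
    sm = np.convolve(logf, np.ones(smooth) / smooth, mode='same')
    d = np.diff(sm)
    best_len, best_spread = None, -np.inf
    for L in candidates:
        n = len(d) // L * L
        if n < 2 * L:
            continue                      # need at least two full rotations to fold
        avg_d = np.median(d[:n].reshape(-1, L), axis=0)
        cycle = np.cumsum(avg_d - avg_d.mean())
        spread = cycle.max() - cycle.min()
        if spread > best_spread:
            best_len, best_spread = L, spread
    return best_len
```

When the candidate length is wrong, the fold misaligns the phases and the median flattens toward zero; only near the true length does the cycle snap into focus with a large spread.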
The rate of change in wow as we approach the center of the disc—and the end of the content, with nothing after it to counterbalance what comes before—is probably exceeding the capacity of our averaging window, but a contributing factor may also be speed drift. After all, typical belt-drive turntable speed drift could easily throw off a cyclical analysis by hundreds of samples per revolution due to changes in cycle length. Thus, my turntable could have been running close to 44.8 rpm towards the middle of the recording, but a bit faster or slower than that towards the very beginning and end. A moving average window will mitigate the effects of minor speed drift, much as it helps compensate for inaccuracy in cycle lengths more generally. But it can only do so much.
To fix problems of this kind, we need reliable data about our variations in cycle length, and we have multiple options for obtaining that. One, suggested by my beautiful and brilliant wife Ronda L. Sewald, would be to capture the data as part of the transfer process. Her idea was to inscribe a sine wave around the periphery of a turntable platter which an extra pickup could transduce for digitization as an extra track, but even a sensor that would generate a simple click once per rotation would suffice. The other alternative is to use my “autofocus” algorithm—or some other type of autocorrelation—to infer cycle lengths post-digitization, not for entire recordings but for successive overlapping segments of them. With this data in hand, we could resample each rotation to a consistent cycle length before analysis and then resample the averaged curve(s) to each original length afterwards during the looping phase.
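Given rotation boundaries from either source (hypothetical sample indices marking where each rotation starts), the normalization step might look like this sketch:

```python
import numpy as np

def normalize_cycle_lengths(curve, boundaries, target_len):
    """Resample each rotation to target_len samples so that cycles
    line up phase-for-phase before folding and averaging."""
    rows = []
    for a, b in zip(boundaries[:-1], boundaries[1:]):
        src = np.arange(a, b, dtype=float)        # original sample positions
        dst = np.linspace(a, b - 1, target_len)   # target_len evenly spaced positions
        rows.append(np.interp(dst, src, curve[a:b]))
    return np.vstack(rows)                        # one aligned row per rotation
```

The inverse mapping, back to each rotation’s original length, is the same interpolation run in the other direction during the looping phase.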
I’ve developed a handy graphic display for cyclical analysis by superimposing the peak frequency curves for all rotations on the same axes with their color changing gradually from blue (at the start of the recording) to red (at the end of the recording), and optionally showing the average in light green. Here’s what the data for the off-center Leeds and Catlin disc looks like analyzed with a cycle length of 59059 samples, a smoothing factor of 1000, and the y axis set to the base-2 logarithm of the frequency expressed in Hertz:
In this case, the form of the speed fluctuations was fairly consistent throughout the recording, so its shape is easy to see in both the raw data and the averaged result. On the other hand, it’s also plain that the broad blue and red “bands” don’t quite line up with each other along the x axis, or with the light green average line either, which I read as evidence of gradual drift.
Next, here’s the data for the “Non e Ver” cylinder, analyzed with a cycle length of 16538 samples (and, again, a smoothing factor of 1000):
This time, the form of the speed fluctuations varies a great deal from beginning to end, so the average for the recording as a whole comes out as nearly a flat line. However, the warping of the cylinder surface can still be seen where individual red and blue lines have coalesced into broader patterns. In cases like this one, I’m not sure how far to trust “autofocus” results inferred from averages calculated for the recording as a whole. However, when I applied “autofocus” to the whole of “Non e Ver,” it came up with a cycle length of 16556 (159.8212 rpm) as opposed to the 16538 (159.9952 rpm) which I used for the speed adjustment shared above; and if we view the results graphically, the red and blue lines do appear to cluster together more tightly with that alternative setting.
Maybe my “autofocus” algorithm could even infer the speed of recordings without appreciable wow, based on cyclical frequency fluctuations that are imperceptible (or barely perceptible) but still statistically significant. I suppose it might also have applications in quality control for digitization work, in the sense that it shouldn’t be possible to infer the speed of a perfectly centered disc.
It should be a simple matter to create a command-line binary (EXE file) to enable batch speed-adjustment of files to a given set of specifications. This could then be applied to the output of large-scale digitization projects. Future projects might be able to forgo the labor-intensive step of centering discs, boosting throughput and saving money; while past projects that weren’t meticulous about centering could have their results retroactively “fixed.” Of course, there would need to be some tests beforehand to assess whether the results of my method compare favorably to the results of pre-digitization centering (in terms of both speed stability and noise modulation). I think they might—fingers crossed—but we’d want some empirical evidence of that before building my method into any industrial workflow.
I haven’t yet tried any experiments with flutter. However, flutter is nothing but speed fluctuation at a faster rate than wow, so if it’s similarly patterned and periodic, my technique ought to be able to mitigate it as well.
I believe this method has considerable commercial potential, but it was actually an unintended byproduct of an academic research project that’s still ongoing. I didn’t set out with the goal of devising a wow-removal algorithm. Instead, I wanted to try to speed-correct some phonautograms recorded before 1859 that don’t have tuning-fork traces. From studying the tuning-fork traces on later phonautograms, we can see that there’s some regularity to the cranking patterns: a repeating rotational cycle in which speed changes will predictably occur during particular phases but not others. I’ve created some animations to display these patterns, as well as other types of visualization, such as this one:
The above images show the peak frequencies of the tuning fork traces for two phonautograms of “Au Clair de la Lune” superimposed by rotational cycle, using a linear Hertz scale, with lower values towards the center and higher values towards the periphery. There’s a readily apparent shape here, a cross between an egg and a teardrop; let’s call it an “eggdrop.” My plan has been to analyze earlier phonautograms without tuning fork traces in a similar fashion and to speed-adjust them based on the “eggdrop” pattern. In order to do this, I wanted to be able to generate average single-cycle curves as a point of reference, and that’s what motivated me to create cyclify. The results I shared earlier prove that we can use these average curves to fix speed fluctuations that are regular in form. But the speed fluctuations in phonautograms are irregular in form: they may follow predictable patterns, but the specifics vary from rotation to rotation. The tools I’ve described so far are necessary, but not sufficient, for doing what I want to do. Fortunately, I have some further strategies in mind—but that’s a subject for another time.