I always find it hard to put into words what’s uniquely special about the phonautogram of “Au Clair de la Lune” recorded by Édouard-Léon Scott de Martinville on April 9, 1860.
On one hand, it stands out for the central role it’s played in the recent history of the playback of the world’s oldest recorded sounds. In March 2008, it was the first of Scott’s voice phonautograms to seize the public ear—the first compelling audible proof of his status as the inventor of sound recording. And one of the main reasons it could be made so compelling was that it had a pilot tone we were able to exploit. Scott rotated the drum of his phonautograph by hand while recording the vibrations of a membrane as a wiggly line on soot-blackened paper, and all his recordings, including this one, display severe speed irregularities as a result. If we play them “as is,” mapping a particular length of trace to a particular span of time, the content generally comes out unrecognizable, its melodies and pitch contours garbled by constant accelerations and decelerations. Fortunately, Scott often recorded the vibrations of a tuning fork alongside the vibrations of the voice, as he did for “Au Clair de la Lune” on April 9, 1860, and in such cases the tuning-fork trace gives us a pilot tone we can use to correct speed fluctuations in the accompanying voice track. (For convenience, I’ll refer to the traces themselves as “pilot tones,” even though this feels a little anachronistic given that they become “tones” only in playback.) Of course, correcting for speed fluctuations isn’t the same thing as restoring the correct absolute speed. At first, my colleagues in the First Sounds initiative and I thought the recorded voice we were hearing belonged to a young girl, but in the spring of 2009, once we’d had a chance to tinker around with more of Scott’s work, we corrected the record: it was actually a deep male voice singing to us across the ages, almost certainly that of Scott himself. 
Our error had lain in our interpretation of Scott’s reference to a tuning-fork frequency of “500 simple vibrations per second,” which corresponds to 250 Hz, and not to 500 Hz as we’d initially thought. Scott had been confused on this front too—according to his own notes, “500 simple vibrations per second” ought to have meant 125 Hz—but presumably he just copied the number he found marked on the tuning fork by its maker. It may actually be fortunate that we were misled into unveiling “Au Clair de la Lune” played back at twice the correct speed, since it sounds a lot more charming that way. If we’d played it correctly the first time around, it might not have grabbed listeners the way it did.
Today, the “Au Clair de la Lune” phonautogram of April 9, 1860—Scott #36 in my discography—remains the single most iconic of all Scott’s recordings. But what continued significance does it have, over and above that of Scott’s other phonautograms, in terms of nineteenth-century accomplishments rather than twenty-first-century ones? It wasn’t the oldest Scott recording we knew about even back in March 2008; by then, we already had scans of a flat-plate phonautogram of isolated vocal sounds from 1857 and a record made in 1859 of tuning-fork vibrations picked up via membrane. However, it was the oldest vocal recording we’d found that had tuning-fork time code, which meant it was the oldest recording of the human voice with a time base we knew how to stabilize. Many more Scott phonautograms have come to light since then, but nothing has yet unseated Scott #36. We now have lots of nice, long vocal recordings from 1857, but none of them have companion tuning-fork traces, which only entered the picture through Scott’s collaboration with instrument-maker Rudolph Koenig in 1859. We’ve also turned up more vocal recordings from 1860 with pilot tones, but we don’t possess any other complete recordings made in this way on or before April 9th. True, there are a few fragmentary snippets of Scott voice phonautograms with pilot tones attached to a backing sheet dated March 28, 1860, in the Regnault Papers (Scott #34), but none of them is more than a split second in duration. Ultimately, then, Scott #36 is not:
- The oldest known recording of airborne sound, or the oldest known recording of the voice (Scott #1, 1853 or 1854)
- The oldest known voice recording lasting more than a split second (Scott #4, July 1857)
- The oldest known recording of airborne sounds that can be speed-corrected with reference to a pilot tone (Scott #28, record of sound of a 222 Hz tuning fork picked up from the air by a membrane, traced alongside a record made directly by another tuning fork, in or before January 1858)
- The oldest known recording of the human voice that can be speed-corrected with reference to a pilot tone (Scott #34, split-second fragments mounted on paper dated March 28, 1860)
Instead, it’s the oldest known recording of the human voice lasting more than a split second that can be speed-corrected with reference to a pilot tone. That’s quite a mouthful, and not the sort of thing that makes for—say—a good sound bite in a radio interview. For that reason, we usually like to characterize it in some other way; for example, the FirstSounds website frames it as “the earliest clearly recognizable record of the human voice yet recovered.” But its significance really boils down to three points: it’s a record of something interesting to listen to, unlike the drone of a tuning fork or a split-second voice blip; it’s clearly recognizable in playback; and its time base has been objectively and definitively restored. It’s the oldest sound recording in the world about which we can say all of those things.
That may not be true for much longer—except perhaps for the definitively part, depending on your standards of definitiveness.
Many of Scott’s phonautograms from 1857 document subject matter that would be just as interesting to hear, judging from their inscriptions. Some of these can also be transduced into raw audio just about as well as phonautograms from 1860. However, speed fluctuations render them unrecognizable taken “as is”: you might sometimes think you hear something in one of them, but you can never be sure this isn’t just the audio equivalent of recognizing something in an inkblot. And this time there’s no pilot tone to come to the rescue.
That’s not to say we can’t go ahead and adjust the speed of one of these recordings anyway to turn it into something audibly coherent. Back in 2009, for example, I noticed that Scott #29 (a phonautogram of the “timbre of the cornet” from late 1857) consists of successive groups of eight blats with pauses in between. I thought the repeated eight-note pattern was likely to have been a scale, and there’s even a published reference to Scott exhibiting a phonautogram of a cornet scale (“toute une gamme de cornet à piston”) about that same time, so I speed-adjusted one of the groups to create an ascending major scale. The result is clearly recognizable as a scale, but that’s because I intentionally shaped it into one based on an educated guess about the content. More recently, David Giovannoni came up with an alternative interpretation of the same cornet recording, hypothesizing that all the notes but the final one were originally of about the same length and adjusting their pitches accordingly. The result is a nice little melody. Both interpretations are presented on track one of Edouard-Léon Scott de Martinville, Inventor of Sound Recording: A Bicentennial Tribute. They’re both plausible, but neither is definitive, and at least one of them must be wrong. Except for records of tuning forks, any Scott recordings from before 1860 that have thus far been made audibly intelligible have involved guesswork of this kind.
But what if we could do away with the guesswork—or at least most of it?
In this blog post, I’d like to introduce a new set of algorithms we can use to try to improve the speed stability of any phonautogram without reference to an original pilot tone. These algorithms aren’t as infallible as locking to a tuning-fork track, but at the same time they don’t involve any subjective speculation about the content of specific phonautograms (except perhaps for one optional step that’s only suitable for music). Instead, they aim to exploit a set of consistent, transparent, and defensible assumptions about the format itself. These assumptions will occasionally be wrong, but I believe they’ll usually be right. By the same token, I’m sure the algorithms won’t always nudge garbled phonautograms far enough in the right direction to make their content audibly recognizable, but I believe they often will. And if we can achieve speed adjustment in the face of uncertainty with accuracy much above 50%, I think it’s well worth the effort.
Instead of diving straight into source material from 1857, I’m going to be experimenting here on some voice tracks from 1860 that have tuning-fork pilot tones and have already been speed-corrected based on them. My idea is to compare the output of my algorithms with definitively correct results obtained from the same phonautograms. This is a cruelly unforgiving beast of a test. But if my algorithms don’t work on these later recordings, where we already know what they should sound like, then there would be little point in applying them to phonautograms from 1857—and no grounds for giving credence to the results.
When I say this post is about “Speed-Correcting Phonautograms Without Pilot Tones,” then, what I mean is that I’ll be attempting to carry out speed-correction without making use of original pilot tones, and not that the phonautograms in question don’t have any. In some cases, I’ll also be generating artificial pilot tones to use in lieu of “real” ones. But hey, there’s only so much nuance you can cram into the title of a blog post.
My first guinea pig will be Scott #44, a phonautogram of “Au Clair de la Lune” recorded on April 20, 1860. Here, for reference, is the uncorrected original:
And here’s the same recording with its speed definitively corrected based on the tuning-fork pilot tone:
Below are two frequency curves generated from Scott #44. The blue curve was calculated from the voice track, while the red curve was calculated from the tuning-fork track. The change in distance between the curves over time represents the melody.
The shapes of the two curves are nearly identical, which is typical of Scott phonautograms more broadly and reveals that most of the pitch fluctuation is due to speed variance. For that reason, simply adjusting a voice track to a fixed reference pitch will actually go a long way towards restoring its original time base in purely statistical terms. I’ll refer here to the practice of resampling a source by the ratio of its original frequency curve to a target frequency in order to remove changes in pitch as “pitch-leveling” to distinguish it from “auto-tuning,” which aims to preserve the original duration and is typically used for other purposes. Unfortunately, pitch-leveling also flattens out any “real” pitch changes that may be present. Here’s Scott #44 pitch-leveled to its median frequency using a pitch curve generated from it at relatively low precision.
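To make the idea of pitch-leveling concrete, here is a minimal Python sketch of it. It is not the code used for the audio examples in this post; the function name `pitch_level`, the per-sample frequency list, and the use of linear interpolation are all assumptions made for illustration. Each input sample is given an output duration equal to the ratio of its detected frequency to the target, and the warped signal is then read back on an even grid.

```python
import math


def pitch_level(samples, freqs, target):
    """Resample `samples` so a detected pitch of freqs[i] comes out at `target`.

    samples -- audio samples (floats)
    freqs   -- detected frequency at each sample, in Hz, all > 0 (hypothetical input)
    target  -- the constant frequency to level to, in Hz
    """
    if len(samples) != len(freqs) or len(samples) < 2:
        raise ValueError("need at least two samples and one frequency per sample")
    # Each input sample gets an output duration of freqs[i] / target samples:
    # stretches that sound sharp are lengthened, flat ones are shortened.
    positions = [0.0]
    for f in freqs[:-1]:
        positions.append(positions[-1] + f / target)
    # Read the warped signal back on a uniform grid by linear interpolation.
    out = []
    j = 0
    for t in range(int(positions[-1]) + 1):
        while j < len(positions) - 2 and positions[j + 1] < t:
            j += 1
        frac = (t - positions[j]) / (positions[j + 1] - positions[j])
        out.append(samples[j] + frac * (samples[j + 1] - samples[j]))
    return out


# Demo: a test tone that jumps from 200 Hz to 400 Hz is leveled back to 200 Hz.
RATE = 8000
freqs = [200.0] * 500 + [400.0] * 500
phase, warped = 0.0, []
for f in freqs:
    warped.append(math.sin(phase))
    phase += 2 * math.pi * f / RATE
leveled = pitch_level(warped, freqs, target=200.0)
```

In the demo the second half of the tone is stretched to twice its length, which is exactly the side effect described above: any "real" pitch change in the source is flattened along with the speed wobble.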
Such extreme pitch-leveling can be more useful in practice than you might think. The work David Giovannoni and I have both done on Scott #29—that cornet recording I mentioned from late 1857—took pitch-leveling as its starting point, and the same approach can make vowel sounds and other timbral features in heavily distorted recordings easier for the ear to grasp.
But we can also opt to reduce pitch variance without eliminating it entirely.
On one hand, we can pitch-level to a chosen target frequency with an intensity less than 100%. That is, whatever the resampling ratio would ordinarily be at any given point, we can reduce its difference from 1:1 (no change) by any desired amount. This will reduce all pitch variance indiscriminately, but if most pitch variance is due to speed variance, the result should still be a net positive. An intensity of around 50 or 60% seems to work best, while 75% starts to level out the “good” pitch information, throwing out too much baby with the bathwater. The following example was adjusted at 50% intensity.
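As a worked example of what a reduced intensity means, the sketch below scales the difference between the full resampling ratio and 1:1 by the chosen intensity. The linear blend is an assumption on my part (the same idea could be applied to the logarithm of the ratio), and the name `leveling_ratio` is hypothetical.

```python
def leveling_ratio(detected, target, intensity=1.0):
    """Resampling ratio for one point, pulled back towards 1:1 (no change).

    intensity = 1.0 gives full pitch-leveling; 0.0 leaves the source untouched.
    """
    full = detected / target
    return 1.0 + intensity * (full - 1.0)


# A passage detected at 440 Hz, with a 400 Hz target.
detected, target = 440.0, 400.0
ratio_full = leveling_ratio(detected, target, 1.0)  # about 1.1: lands on 400 Hz
ratio_half = leveling_ratio(detected, target, 0.5)  # about 1.05
pitch_after_half = detected / ratio_half            # about 419 Hz, not 400 Hz
```

At 50% intensity the 40 Hz discrepancy shrinks to roughly 19 Hz, whether it came from the crank or from the singer.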
On the other hand, we can vary our target frequency to reflect changes in the source recording over time instead of adjusting to a single constant frequency throughout. For example, we can generate a variable target curve based on a moving mean average of detected frequency values across some number of samples—on the order of magnitude of a whole rotation—and adjust to that. My hypothesis here is that changes in average pitch on the scale of entire rotations will tend to reflect changes in “real” pitch more than they do longer-term changes in speed. If that’s true, then the challenge is to choose an averaging window that’s large enough to factor out bigger frequency jumps but small enough to preserve the most rapid “real” pitch changes. This won’t always be possible to do, but there’s a fair chance of success if each individual note is held across most of a rotation. Here’s Scott #44 pitch-leveled to a moving average of the detected frequency curve across the length of one rotation (47607 samples), at 100%.
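Here is a minimal sketch of how such a variable target could be computed. The centered window and the shrinking of the window at the two ends of the recording are assumptions; the post only specifies a moving average over roughly one rotation. The function name `moving_average` is hypothetical.

```python
def moving_average(values, window):
    """Centered moving mean; the window shrinks near the ends of the list."""
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)  # running totals for fast window sums
    half = window // 2
    n = len(values)
    out = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        out.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return out


# Toy frequency curve: a wobble that repeats every 5 samples (the "rotation"),
# riding on a pitch that steps from 200 Hz up to 250 Hz halfway through.
wobble = [0.0, 8.0, 3.0, -6.0, -5.0]  # sums to zero over one rotation
curve = [(200.0 if i < 50 else 250.0) + wobble[i % 5] for i in range(100)]
target = moving_average(curve, window=5)
```

With the window matched to the rotation length, the periodic wobble cancels out of the target while the step survives as a short ramp, which is the behavior the hypothesis relies on.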
That sounds awfully artificial, but we can combine these last two techniques and pitch-level to a variable target at a lower intensity than 100%. Here’s a repetition of the last experiment with pitch-leveling intensity set to 75%, which isn’t quite so objectionable.
Another technique I’d like to throw in at this point is only appropriate for musical phonautograms and shouldn’t be applied to phonautograms of speech. My hypothesis in this case is that the individual held notes of a melody are likely to be separated from each other by greater or lesser drops in amplitude. If that’s so, then we ought to be able to improve the pitch stability of these notes by leveling out adjacent target frequency values at a rate proportional to local peak amplitude. The method I came up with for implementing this idea involves running through a range of threshold amplitude values from low to high (50% from median towards maximum and minimum), averaging target frequency values across all segments with peak amplitudes continuously above the threshold, and feeding the result back into the variable target in a gradually increasing proportion (starting at 1%, and raising this 0.4% per iteration). Here’s a variant on the last two experiments with this new strategy incorporated at a pitch-leveling intensity of 90%.
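The following is only one possible reading of that procedure, sketched in Python. The list of thresholds is left to the caller because the exact range and number of steps aren't spelled out here, and the names `stabilize_notes` and `envelope` (a per-sample peak-amplitude curve) are hypothetical. The starting proportion of 1% and the increment of 0.4% per iteration are taken from the description above.

```python
def stabilize_notes(target, envelope, thresholds, start_mix=0.01, mix_step=0.004):
    """Pull target frequencies towards the mean of each loud, unbroken segment.

    target     -- variable target frequency curve (one value per sample)
    envelope   -- peak amplitude per sample (hypothetical input)
    thresholds -- amplitude thresholds to step through, lowest first
    """
    result = list(target)
    mix = start_mix
    n = len(result)
    for threshold in thresholds:
        i = 0
        while i < n:
            if envelope[i] > threshold:
                j = i
                while j < n and envelope[j] > threshold:
                    j += 1  # extend the segment while amplitude stays above threshold
                mean = sum(result[i:j]) / (j - i)
                for k in range(i, j):
                    result[k] += mix * (mean - result[k])
                i = j
            else:
                i += 1
        mix = min(1.0, mix + mix_step)  # feed the averages back a little more each pass
    return result


# Two "notes" separated by a dip in amplitude at index 3.
target = [98.0, 100.0, 102.0, 110.0, 118.0, 120.0, 122.0]
envelope = [0.9, 1.0, 0.9, 0.1, 0.9, 1.0, 0.9]
steadier = stabilize_notes(target, envelope, thresholds=[0.2, 0.5])
```

Each pass nudges the values inside a loud segment towards that segment's mean and leaves the quiet gap alone, so pitches within a presumed note converge while the boundary between notes is preserved.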
One pair of notes comes out conspicuously wrong here: instead of F-F-F-G-A-G-F–A-G-G-F, we get something more like F-F-F-G-A-G-E–G♭-G-G-F. But that’s still an 82% success rate, and by now I think it’s safe to say that “Au Clair de la Lune” has become easily recognizable, even though we’ve made no use whatsoever of the tuning fork trace for speed adjustment.
Meanwhile, the two frequency curves shown above display another typical and potentially important feature: periodicity. The peaks, representing the slowest point in the cranking cycle, are pretty evenly spaced. The troughs, representing the fastest point in the cycle, are somewhat less regular, but the sharper they are, the more reliably they appear a consistent distance before the peaks. I’ve produced animations of the rotational patterns corresponding to actual tuning-fork traces in an effort to visualize them more effectively, and I believe their characteristic features can be chalked up to the physiology of cranking, with its muscular flexions and extensions. When I pretend that I’m using a handle to rotate the drum of a phonautograph, I find that I generally speed up on the downstroke and slow down at the end of the upstroke, and that the acceleration is more regular and more pronounced than the deceleration. In the animation shown below, based on Scott #44, can you guess which direction was originally “up”? I think I can.
In short, there’s some definite regularity to the irregularity, and I’ve long hoped to find some way of exploiting it as a source of information for speed-correcting phonautograms without tuning-fork traces.
The first tactic I tried was a big disappointment.
My plan was to average the changes in pitch for all the rotations of a phonautogram at each sample point within the rotational cycle and then to use a curve constructed from those averages for speed correction in lieu of a “real” pilot tone. Around ten months ago, I wrote a program called Cyclify for this purpose. As I described here, it turned out to work decently for speed-correcting off-center discs and out-of-round phonograph cylinders, so I knew that it was successfully doing what I’d designed it to do. But it didn’t work with phonautograms.
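The averaging step can be sketched as follows. This is not Cyclify itself: it assumes a rotation length that is a whole number of samples, and it reads "change in pitch" as each frequency value divided by the mean of its own rotation, which is my assumption. The names `cyclicurve` and `loop_curve` are hypothetical.

```python
def cyclicurve(freqs, period):
    """Average relative pitch deviation at each position in the rotation.

    freqs  -- detected frequency curve, one value per sample
    period -- samples per rotation (assumed constant and integral)
    """
    rotations = len(freqs) // period
    if rotations == 0:
        raise ValueError("need at least one full rotation")
    sums = [0.0] * period
    for r in range(rotations):
        chunk = freqs[r * period:(r + 1) * period]
        mean = sum(chunk) / period
        for p, f in enumerate(chunk):
            sums[p] += f / mean  # deviation relative to this rotation's average
    return [s / rotations for s in sums]


def loop_curve(curve, length):
    """Repeat one cycle end to end until it covers `length` samples."""
    return [curve[i % len(curve)] for i in range(length)]


# Two rotations of four samples: the same wobble at 200 Hz and at 300 Hz.
freqs = [220.0, 200.0, 180.0, 200.0, 330.0, 300.0, 270.0, 300.0]
cycle = cyclicurve(freqs, period=4)  # roughly [1.1, 1.0, 0.9, 1.0]
looped = loop_curve(cycle, len(freqs))
```

Dividing by each rotation's mean keeps a change of note between rotations from leaking into the averaged curve, so only the part of the pattern that repeats with the crank is retained.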
Sure, I could come up with a plausible average speed-variance curve for a phonautogram, like the one shown above for Scott #44 (note that the green curve represents the average change in pitch, and not the average pitch). But whenever I tested my results against known speed variance curves—by “cyclifying” the voice track on a phonautogram that had a corresponding tuning-fork track available for comparison—they were never close enough to the actual rotation-by-rotation speed changes to make much of a difference. The peaks and troughs had different amplitudes and shapes and turned up in slightly different locations, with seemingly random wiggles in between. There was just too much variation within the variation for the approach I had in mind to do any good.
But it’s since dawned on me that we can understand the “cyclicurve,” as I’ll call it, in a different way: namely, as an indicator of the probability that the pitch at any given point within the rotational cycle is skewed due to periodic speed variance, irrespective of the form the skewing takes during any particular rotation. Wherever the cyclicurve frequency is highest or lowest, the corresponding segment of the detected frequency curve is most likely to embody a periodic speed change. Conversely, wherever the cyclicurve approaches its median value, the corresponding segment of the detected frequency curve is least likely to reflect a speed change; if the pitch does in fact change there, it’s more likely to reflect a “real” pitch change which we’d want to preserve. There will be exceptions to the rule, to be sure, but I believe things ought to work as I’ve described most of the time.
So how can we put this information about periodic patterns to practical use?
First, we can consider the cyclicurve when selecting ranges of frequencies for averaging into the variable target, excluding any frequencies from our calculation which correspond to cyclicurve values that differ by more than a certain amount from the median. Thus, in the illustration below, we might calculate a moving average from just the points on the blue line (a raw frequency curve) that correspond to points on the red line (a cyclicurve) that fall within the pink band. The setting I’ve devised for factoring periodic peaks out of the variable target runs from 0% (no values excluded) via 50% (values excluded corresponding to cyclicurve values more than 50% of maximum distant from median) to 100% (all values excluded except those corresponding to the cyclicurve median itself).
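Here is a sketch of how that setting might translate into code: a mask marks which frequency values are allowed into the average, and the moving average then skips the rest. The names are hypothetical, and falling back to the raw value when a window contains no usable points is my own assumption.

```python
from statistics import median


def inlier_mask(cyc, setting):
    """True where the cyclicurve is close enough to its median to be trusted.

    setting -- 0.0 excludes nothing; 0.5 excludes points more than 50% of the
               maximum distance from the median; 1.0 keeps only the median itself.
    """
    mid = median(cyc)
    dev = [abs(c - mid) for c in cyc]
    limit = (1.0 - setting) * max(dev)
    return [d <= limit for d in dev]


def masked_moving_average(values, mask, window):
    """Centered moving mean over only the values whose mask entry is True."""
    psum, pcount = [0.0], [0]
    for v, keep in zip(values, mask):
        psum.append(psum[-1] + (v if keep else 0.0))
        pcount.append(pcount[-1] + (1 if keep else 0))
    half = window // 2
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        count = pcount[hi] - pcount[lo]
        # Assumption: if every point in the window is excluded, keep the raw value.
        out.append((psum[hi] - psum[lo]) / count if count else values[i])
    return out


cyc = [1.0, 1.2, 1.0, 0.8, 1.0]            # looped cyclicurve, one value per sample
raw = [100.0, 130.0, 100.0, 70.0, 100.0]   # detected frequencies
mask = inlier_mask(cyc, setting=0.5)
target = masked_moving_average(raw, mask, window=3)
```

In this toy case the two excursions at the cyclicurve's extremes are left out, so the target stays flat at 100 Hz instead of being dragged up and down by them.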
Another strategy we can try is variable-intensity pitch-leveling: adjusting more or less aggressively to a detected frequency curve based on the corresponding cyclicurve values. The goal here is to pitch-level more wherever there’s a lot of change across rotations, and less where there’s not as much change across rotations. My resampling algorithm ordinarily determines how much to change the duration at each point in time based on the ratio of a source to a target, so the closer these values approach one another at any given moment, the less change there will be. Thus, to vary the intensity of pitch-leveling, I first generate a variable target as described above, and then I generate an artificial pilot tone that’s an average of the detected frequency curve—weighted proportionally to the difference from the median of the corresponding cyclicurve, or to its nth root—and the variable target. In this way, we lock down on the local average wherever the likelihood of distortion is greatest and loosen up wherever the likelihood of distortion is least.
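The weighting can be sketched like this. Normalizing the cyclicurve's distance from its median by the largest such distance, so that the weight runs from 0 to 1, is an assumption, as is the name `artificial_pilot`; the nth-root option follows the description above.

```python
from statistics import median


def artificial_pilot(freqs, target, cyc, root=3):
    """Mix the detected curve and the variable target into a stand-in pilot tone.

    freqs  -- detected frequency curve
    target -- variable target curve (for example, a moving average of freqs)
    cyc    -- looped cyclicurve, one value per sample
    root   -- take the nth root of the weight (1 = plain proportional weighting)
    """
    mid = median(cyc)
    dev = [abs(c - mid) for c in cyc]
    peak = max(dev)
    pilot = []
    for f, t, d in zip(freqs, target, dev):
        weight = (d / peak) ** (1.0 / root) if peak > 0 else 0.0
        # weight 1: pilot equals the detected curve, so leveling is at full strength.
        # weight 0: pilot equals the target, so the resampling ratio is 1 (no change).
        pilot.append(weight * f + (1.0 - weight) * t)
    return pilot


freqs = [110.0] * 5
target = [100.0] * 5
cyc = [1.0, 1.8, 1.0, 1.1, 1.0]
pilot = artificial_pilot(freqs, target, cyc, root=3)
ratios = [p / t for p, t in zip(pilot, target)]  # about [1.0, 1.1, 1.0, 1.05, 1.0]
```

Taking a root of the weight raises small weights faster than large ones: in the example, a point whose cyclicurve deviation is only an eighth of the maximum still gets leveled at about half strength.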
Here’s Scott #44 once again, with the same adjustments made as in our last audio example, but now also with:
- presumed outlier values excluded at a setting of 80% from the averages comprising the variable frequency target, and the averaging window size increased from 47607 to 55000 to help compensate for the reduction in data points; and
- variable intensity pitch-leveling based on the cube root of the absolute value of the difference between the cyclicurve and its median.
The result is admittedly wobblier than that of Experiment #5, but the wobble largely affects those passages which that other experiment got wrong, including the anomalous E and G♭, which now at least bend towards and away from their correct pitches. Once again, the audio is easily recognizable as “Au Clair de la Lune,” and we still haven’t made any use of the tuning-fork trace; as far as our methods have been concerned, it might just as well not have existed. The only assumptions we’ve made up to this point are:
- Changes in average pitch on the scale of entire rotations will tend to reflect “real” pitch changes.
- Most short-term pitch fluctuation will instead reflect speed variance, but it’s more likely to do so the more closely its position coincides with a peak of periodic change within the rotational cycle.
- In a musical recording, segments of continuous higher amplitude are likely to correspond to held notes of relatively constant pitch.
If we didn’t have the tuning-fork-corrected version of Scott #44 available to confirm this “interpretation,” would you consider it definitive? Or would you argue that there was still room here for reasonable doubt about it?
I played around a bit with settings while carrying out the above experiments on Scott #44, but not enough to think I necessarily hit upon optimal ones. That said, the big question at this point is whether these same techniques and settings will work equally well on other phonautograms, or whether Scott #44 just happens to be a special case. With that question in mind, I’ve processed three more phonautograms from 1860 in exactly the same ways, using precisely the same settings (except for some slight variation in length of rotation, since this isn’t identical in the sources). Each of the next three audio files contains four different segments: (1) an unaltered original voice track; (2) the same track adjusted as in Experiment #5 above; (3) the same track adjusted as in Experiment #6 above; (4) the same track corrected based on its accompanying tuning-fork pilot tone.
Scott #47: “Et Incarnatus Est,” recorded September 1, 1860
This phonautogram yields results remarkably similar to the ones we obtained from Scott #44, so we know that our earlier results weren’t a mere fluke. The first result comes out with nicely stable pitches, and the majority of them are correct, with a couple of conspicuous exceptions. The penultimate note, which should drop down to the tonic, instead falls only to the third, and the note before that is flat as well. The second result is less stable, but its wobble veers towards the correct pitches for the notes that came out wrong in the first result.
Scott #49: “Vole Petite Abeille,” recorded September 15, 1860
The chanson de l’abeille stands out for the rapidity and breadth of its changes in pitch, both of which make it more challenging than usual for my algorithm to distinguish “good” pitch changes from “bad” ones. The basic contour of the tune still comes out more or less right, but most of its brilliant detail gets leveled out, somewhat more in the first result than in the second. For what it’s worth, I’ve found that reducing the intensity of pitch-leveling from 90% to, say, 50% preserves more of the detail of the melody at the expense of pitch stability.
Scott #36: “Au Clair de la Lune,” recorded April 9, 1860
These results are the weakest yet. The tune isn’t nearly as recognizable here as it was with Scott #44, although we might still have been able to guess its identity from the opening sequence of five notes in the first result. Pitches tend to go up and down in correct places; the biggest problem is that they often don’t move when they should. My algorithm for stabilizing musical notes has also introduced at least one conspicuous pitch transition where there shouldn’t be one. Skipping that optional step eliminates those artifacts, but they’re not my greatest worry here. I’ve experimented with various settings, but nothing I’ve tried so far has delivered a really close resemblance to “Au Clair de la Lune.” It would be an interesting exercise to try to figure out which of my various assumptions this recording fails to satisfy.
Judging from the four phonautograms I’ve experimented with so far, it looks as though the algorithms I’ve been describing will speed-adjust some phonautograms remarkably well but won’t deliver equally good results for all phonautograms. It also seems that cases of failure are likely to be due more to “good” pitch changes getting lost than to “bad” pitch changes slipping through.
To summarize, the method I’ve introduced here for speed-adjusting phonautograms without original pilot tones is this:
- Detect a peak frequency curve a.
- Calculate a second curve b from the average change in a across one rotational cycle, and loop it.
- Calculate a third curve c that is the moving average of a across approximately one rotational cycle (or, optionally, the moving average of only those values of a corresponding to values of b that differ less than a designated amount from the median value of b). For musical recordings, optionally re-average the resulting values with groupings and intensities proportional to amplitude to stabilize the pitches of discrete notes.
- Calculate a fourth curve d that mixes a and c, varying the proportion of a relative to c based on the distance of the corresponding values of b from the median value of b.
- Use curve d in lieu of a “real” pilot tone for speed adjustment with curve c as the target.
Put more succinctly, it entails leveling out pitch fluctuations proportionally to their closeness to points of periodic variance, their divergence from local averages, and their location within continuous segments of higher amplitude.
The results from Scott #44 and Scott #47 prove that we can sometimes use this method to restore the time base of Scott’s phonautograms of the human voice accurately enough that they become clearly recognizable regardless of whether or not they have tuning-fork pilot tones. The presence of an original pilot tone therefore no longer seems quite as essential for valid speed correction as it did before.
At the same time, the results from Scott #49 and Scott #36 show that this new method won’t always work—or at least that it won’t always work very well. But Scott recorded over thirty phonautograms in 1857, and chances seem good that at least one of them will yield up its secrets to this new tool. Stay tuned.