Speech Averaging: “A Visit From St. Nicholas”

Consider the performance art of reciting poetry.  We recognize that it can be done well or poorly, that everybody will do it a little differently, and that no two performances will ever be quite alike.  But what if we were to average a number of recitations?  Would the result approach an ideal “performance” that sounds more impressively proficient in the conventions of the genre—however we might feel about those conventions—than any of its component sources?  Or would it only strip away the essence of artistic interpretation, yielding something that seems coldly robotic and inhuman?

[Image: average of portraits of every Miss America winner from 1921 through 2017]

Analogous questions have long been explored in the realm of visual attractiveness.  Averaging a number of faces, for example, has been shown—it’s claimed—to produce composites that are more attractive than any of the individual sources.  This so-called “averageness effect” is typically explained in evolutionary terms, the idea being that averaging reinforces the most widely shared traits of a population while factoring out anomalies that might correspond to harmful mutations.  There have also been some attempts to extend this principle to the aural attractiveness of human voices, which could invite a similar evolutionary interpretation.  But poetry recitation, as a species of verbal art, draws us more clearly into questions of culture—specifically, the cultural expectations we have of a “good” recitation, not that we all necessarily agree about what those are.  Is a “good” recitation good because its formal patterning matches some culturally established prototype (in which case an average could be even “better”) or because it expresses some unique sensibility on the part of the reciter (in which case an average should fall flat)?  Are the “best” recitations the most typical or the most individualistic?  In short, does the attractiveness of recitations work in the same way that the attractiveness of faces is alleged to work?

Let’s see if we can find out.  I’ve published a number of posts here at Griffonage-Dot-Com about image averaging (see this one, for example), but lately I’ve been giving more thought to analogous strategies for averaging sounds, and this is the first of several posts in that direction I have in mind.  The assignment I set myself was this: Take a number of recordings of different people reciting “‘Twas the Night Before Christmas (A Visit From St. Nicholas)” and create a defensible “average” from them that sounds like a single person speaking.  (I’d thought about holding this post back until “A Visit From St. Nicholas” would be more seasonally appropriate, but I decided I’d rather get it out there sooner rather than later, so just think of it as a little celebration of Christmas in July.)

Of course, recitations of the “same” poem can differ starkly from one another, and in many different ways simultaneously (pitch, rhythm, intensity, pronunciation, even text).  Let’s take just one couplet by way of demonstration:

The moon on the breast of the new-fallen snow,
Gave [a/the] lustre of midday to objects below….

Here’s a compilation of readings of those lines by a number of different people.


And here’s an average of the prosody and pronunciation from all those varied readings, contrived using a technique I’ll describe below.


The voice quality admittedly sounds artificial and distorted, but as an average of the performances—that is, not the voices themselves, but what people were doing with their voices—I think this works pretty well.

I’ve scoured the Internet looking for previous attempts to average extended speech performances in this spirit, but so far I haven’t found any (which, of course, doesn’t necessarily mean they don’t exist).  It’s true that spectrographic averaging has been applied to the study of animal vocalizations since the 1970s, but as far as I can tell this approach has been limited to visual analysis, with no “playback” of the averages as sound.  The same is true of ProsodyPro, which can apparently generate time-normalized averaged pitch contours of human speech.  The promising-sounding “speech chorus method” for the acoustic analysis of speech, introduced in 1970, turns out upon further investigation to involve a technique where “several persons speak different texts simultaneously and the resulting sound is recorded on a single magnetic tape.”  One study entitled “Vocal Attractiveness Increases by Averaging” involved generating averaged single syllables through successive morphing in quantities of 2, 4, 8, 16, and 32 sources (using STRAIGHT), but it relied on a labor-intensive process of manually morphing pairs of spectrograms that would be prohibitively time-consuming for something on the scale of a whole poem.  Another study  (“The Role of Femininity and Averageness of Voice Pitch in Aesthetic Judgments of Women’s Voices”) used audio manipulation, though not actual audio averaging, to test hypotheses about “averageness” in voices.  However, I haven’t found any other audio out there that purports to be an average of multiple recitations of the same literary text, or even any discussion of how this might be done.  If I’m reinventing the wheel, the “wheel” in this case would at least seem to be a fairly obscure bit of technology.

So let me explain what I did.

I began by downloading a set of just nine public-domain recordings of different recitations of “A Visit From St. Nicholas” from Librivox.  One of them features two different voices—a child’s and an adult’s—but each of the other eight was recorded in a single adult voice, so I took those eight as my first experimental data set for averaging.  To give credit where credit is due, the eight speakers are Annie Coleman Rothenburg, Betsie Bush, Chris Goringe, Kara Shallenberg, Peter Yearsley, Mark Bradford, Sam Lipten, and Sean Randall.

To prepare the recordings for analysis and processing, I trimmed away the spoken introductions and conclusions, leaving nothing but the recitations of the poem itself, and converted the mp3s into mono 44.1 kHz 16 bit WAVs, normalized to 90% (not the best way to match levels, but better than nothing).  I then ran each of the eight WAVs through a piece of software called Gentle.  This was created by Robert M. Ochshorn and Max Hawkins to locate beginning and end times for each word in a speech recording with a precision of one hundredth of a second, associated either with an auto-generated transcript or with a user-supplied text.  Gentle offers a few different export options for its results, but I went with a CSV.  This contains one row per word and four columns: (1) the target word; (2) the matched word, if any, or <unk> if Gentle found what it thought was a word not in its reference database; (3) the detected beginning time; and (4) the detected end time.  Gentle is designed to operate on entire words rather than on syllables, but I was able to trick it into generating syllable-level measurements by feeding it the following transcript (which also includes substitutes for a few words that didn’t seem to be in Gentle’s vocabulary, such as “’twas”):

was the night before chris mess when all through the house not a greet sure was stir ring not even a mouse
the stock kings were hung by the chin knee with care in hopes that saint nick a less soon would be there
the chill drawn were nest old all snug in their beds while fish yawns of shook or plums danced in their heads
and mom maw in her curt chief and I in my cap had just set old our brains for a long win tars nap
when out on the lawn there a rose such a clad her I sprang from my bed to see what was the mat her
a way to the win doe I flew like a flash tore oh pen the shut hers and threw up the sash
the moon on the breast of the new fall an snow gave a lust or of mid day to hop checks be low
when what to my won the ring eyes did a peer but a men at your sleigh and eight tie knee rein deer
with a lit ill old dry fur so live lee and quick I knew in a mow meant he must be saint nick
more rap hid than he gulls his core sirs they came and he west old and shout head and called them by name
now dash her now dance her now prance her and vex son on caw mitt on cue pit on don her and blitz son
to the top of the porch to the top of the wall now dash a way dash way dash a way all
as dry leaves that be fore the wild her a cane fly when they meet with an hop stuck ill mount to the sky
so up to the house top the core sirs they flew with the sleigh full of toys and saint nick a less too
and then in a twin cling I heard on the roof the prance sing and paw wing of each lit ill hoof
as I drew in my head and was turn hung a round down the chin knee saint nick a less came with a bound
he was dressed all in fur from his head to his foot and his clothes were all tar nest with ash his and soot
a bun dull of toys he had flung on his back and he looked like a bed lure just oh pen hung his pack
his eyes how they twink old his dim pulls how mare he his cheeks were like row says his nose like a chair he
his droll lit ill mouth was drawn up like a bow and the beard on his chin was as white as the snow
the stump of a pipe he held tight in his teeth and the smoke it in sir culled his head like a wreath
he had a broad face and a lit ill round bell he that shook when he laughed like a bowl full of jell he
he was chub he and plump a right jaw lee old elf and I laughed when I saw him in spite of my self
a wink of his eye and a twist of his head soon gave me to know I had no thing to dread
he spoke not a word but went straight to his work and filled all the stock kings then turned with a jerk
and lay hung his fin gear a side of his nose and give hung a nod up the chin knee he rose
he sprang to his sleigh to his team gave a west ill and a way they all flew like the down of a this ill
but I heard him its claim ere he drove out of sight hap he chris mess to all and to all a good night

Gentle did a pretty good job, but it occasionally failed to detect any times at all for particular “words,” in which case it left the last three columns blank.
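
By way of illustration, here’s a minimal sketch of how one of those CSVs might be read into MATLAB; the filename is hypothetical, and readtable simply turns the blank time cells into NaN.

    % Read one of Gentle's CSV exports: one row per "word," four columns
    % (target word, matched word or <unk>, start time, end time).
    % The filename here is hypothetical; Gentle's CSV has no header row.
    T = readtable('recitation01.csv', 'FileType', 'text', ...
        'Delimiter', ',', 'ReadVariableNames', false);
    targetWords  = T{:, 1};   % the syllable-level "words" from the transcript
    matchedWords = T{:, 2};   % what Gentle matched, or <unk>
    startTimes   = T{:, 3};   % seconds; NaN where Gentle detected nothing
    endTimes     = T{:, 4};   % seconds; NaN where Gentle detected nothing

Stacked up across the eight recordings, those start and end times become the raw material for the averaging described next.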

I wrote some code in MATLAB to cover the next few steps (speechaligner.m).  First came the task of averaging the timings across all eight recordings.  Through trial and error, I found I needed to interpolate approximate times for any empty cells so that the fact that Gentle hadn’t been able to detect a syllable in any particular recording wouldn’t unduly skew the average time calculated for that syllable.  Then I took the mean averages of all the beginning and end points for all syllables in the poem.
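
In sketch form (this is an illustration of the logic rather than the actual contents of speechaligner.m), suppose the Gentle times have been collected into two matrices with one row per syllable and one column per recording, with NaN marking the syllables Gentle missed:

    % starts, ends: [numSyllables x numRecordings] matrices of Gentle times,
    % with NaN wherever Gentle failed to detect a syllable.
    for k = 1:size(starts, 2)
        % Interpolate approximate times for the missing syllables so that
        % gaps in any one recording don't skew the group average.
        starts(:, k) = fillmissing(starts(:, k), 'linear');
        ends(:, k)   = fillmissing(ends(:, k),   'linear');
    end
    avgStarts = mean(starts, 2);   % mean beginning time for each syllable
    avgEnds   = mean(ends, 2);     % mean end time for each syllable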

The next step was to adjust the timings of each recording to match the group averages so that all the recordings would line up consistently with each other in the time domain.  This step is analogous to warping a group of images of faces so that the eyes, noses, mouths, and so forth all line up with each other in their average locations across the whole data set.  First, I discarded the times I’d interpolated for empty cells while calculating the average timings, opting instead to treat any group of one or more “zeros” as a block to be adjusted all at once.  Then I borrowed Dan Ellis’s handy phase vocoder to adjust duration without affecting pitch, but I set things up so that if this ran into an error—as it sometimes does, depending on how manageable the time ratio turns out to be—my code would pass through the source audio unaltered.  The durations of the time-adjusted words weren’t always correct down to the sample level, resulting in slight audible glitches in the output file.  To mitigate this effect, I applied linear interpolation to a twenty-sample segment across each join, which is admittedly a rather crude way of handling things, but hey, what I’m after here is a proof of concept, not perfection.
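
The per-syllable stretching at the heart of that step can be sketched roughly like this, calling pvoc.m from Dan Ellis’s phase-vocoder package; the variable names are mine, and the real speechaligner.m handles the joins and the blocks of undetected syllables on top of this:

    % x: the audio for one syllable, cut out at Gentle's detected times.
    % srcDur: its duration in this recording; avgDur: the group-average duration.
    rate = srcDur / avgDur;    % pvoc's time-scale factor: >1 shortens, <1 lengthens
    try
        y = pvoc(x, rate);     % Dan Ellis's phase vocoder: new duration, same pitch
    catch
        y = x;                 % if pvoc balks at the ratio, pass the audio through unaltered
    end

The twenty-sample linear interpolation across each join is then applied as the stretched syllables are strung back together.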

Finally it was time to average the time-adjusted recordings.  This step is equivalent to averaging pixel values across a set of warped source images of faces.  If we average the WAV sample values according to the same logic, it’s equivalent to playing the aligned recordings all together at the same time.  I arranged to stagger the output files randomly just enough so that the joins wouldn’t coincide, always keeping the offset within the hundredth-second range of uncertainty in the Gentle data.
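
A minimal sketch of that mixing stage might look like the following, assuming the time-adjusted recordings have already been padded to a common length and gathered into a matrix (the variable names are illustrative):

    % aligned: [numSamples x numRecordings] matrix of time-adjusted recordings;
    % fs: sample rate (44100 here).
    fs = 44100;
    maxOffset = round(0.01 * fs);           % stay within Gentle's 0.01-second precision
    mix = zeros(size(aligned, 1) + maxOffset, 1);
    for k = 1:size(aligned, 2)
        offset = randi([0 maxOffset]);      % random stagger so the joins don't coincide
        idx = offset + (1:size(aligned, 1)).';
        mix(idx) = mix(idx) + aligned(:, k);
    end
    mix = mix / max(abs(mix));              % scale the sum back down to avoid clipping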


This sounds something like a group of people reciting in unison, though an unusually close unison: less like a congregation than like a choir guided by a director.  Yes, there are obviously a handful of glitches.

The formants and the noises associated with plosives, sibilants, etc. are relatively consistent from voice to voice, so they reinforce each other once the words have been aligned.  By contrast, the fundamental pitches of the voices and their associated overtone sequences vary more widely, both by speaker and by intonation.  When we combine just eight sources as in this case, they tend to remain individually distinguishable.

But what happens if we expand our source base?

By searching Librivox a little more thoroughly, I turned up forty-three different public-domain readings of “A Visit From St. Nicholas,” all listed below.  I chose to use thirty-nine of these for my next averaging experiment, skipping the four I’ve marked with asterisks, either because they featured multiple voices (#4) or because the reader departed in some way from the usual text.

  1. Annie Coleman Rothenburg
  2. Betsie Bush
  3. Chris Goringe
  4. Brad Bush and Grace Bush*
  5. Kara Shallenberg
  6. Peter Yearsley
  7. Mark Bradford
  8. Sam Lipten
  9. Sean Randall
  10. Rosemarie DeSapio
  11. mjbrichant
  12. Lynne T
  13. Verity Kendall*
  14. JemmaBlythe
  15. Fernashes
  16. Reynard T. Fox
  17. Sean Lynott*
  18. David Federman
  19. Alan Davis Drake
  20. Douglas D. Anderson
  21. Anne Cheng
  22. Abigail Bartels
  23. Anna Roberts
  24. Bill Mosley
  25. Caliban
  26. Clara Snyder
  27. icyjumbo
  28. Clarica
  29. David Lawrence
  30. Ernst Pattynama
  31. Elli
  32. Ezwa
  33. Lucy Perry
  34. Linda Lee Paquet
  35. John W. Michaels
  36. Mark F. Smith
  37. Neeru Iyer
  38. Ruth Golding*
  39. ravenotation
  40. Utek
  41. Sean McGaughey
  42. TriciaG
  43. Tasleem Khan

Here’s the result I got from time-normalizing and mixing thirty-nine recitations using the methods described above:


The fundamental pitches of the thirty-nine different voices have now smeared into a relatively unpitched mass of noise—something almost like a whisper, except that the spectrum is different.

It seems that the method I’ve described so far can effectively average phonemes and rhythms across a set of recordings of readings of the same text, and that—on that front—all that remains is for us to perfect our implementation.  I could see taking that in several directions.  For one thing, we could increase our source base from thirty-nine up to, say, a hundred recordings or more.  There are also some textual variants of “A Visit From St. Nicholas” that would be worth factoring in somehow; for example, about half of the Librivox readers say “the lustre of midday,” while the other half say “a lustre of midday.”

But what about pitch?  So far we haven’t really averaged that in any meaningful way; we’ve only blurred it.  Can we somehow take an audio mix of this kind and filter it such that it gives the impression of a single voice speaking with an averaged pitch contour?

Well, let’s consider some formal features of the voice.  Below are three sound spectrograms from a once-classified World War Two military publication, showing (A) normal speech, (B) a monotone, and (C) a whisper.

Our thirty-nine-recording mix resembles the whisper shown in (C), which differs from (A) and (B) mainly in its lack of horizontal bands corresponding to a fundamental frequency and its overtones.  The monotone shown in (B), in turn, differs from (A) mainly in that its bands lack the curvature of a natural speech pitch contour.

So if we want to try to transform our “whispery” thirty-nine-recording mix (similar to C) into something that sounds more like natural speech (shown in A), we might identify an appropriate fundamental frequency for each point in time and then filter out everything but that frequency and its harmonics, or integer multiples—for example, at one particular moment we might pass the series 150 Hz (F0), 300 Hz (2×F0), 450 Hz (3×F0), 600 Hz (4×F0), 750 Hz (5×F0), etc.  And we can calculate an averaged fundamental-frequency curve across all the source recordings to use for that filtering.

To get my filtering curve, I opened each time-normalized WAV file in Praat, generated a pitch object with default settings, and then ran a simple script modified from the one posted here.  This creates a TXT file in which each line represents one hundredth of a second and consists either of a frequency presented in the form “196.63,” (if Praat identified a value for the fundamental frequency) or “--undefined--,” (if it didn’t).  I manually numbered each out.txt after it was generated, in this case from 1 to 39.  The process was a bit tedious, but I haven’t looked into options for batch processing here yet.
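
Reading those text files back out is simple enough; here’s roughly how they can be gathered into a single MATLAB matrix, with NaN standing in for the undefined frames (the file names and helper functions here are my own illustration, not a quotation from my scripts):

    % Gather the Praat pitch tracks into one matrix: one row per hundredth of
    % a second, one column per recording; NaN wherever Praat reported
    % "--undefined--".  Files are assumed to be named 1.txt through 39.txt.
    numFiles = 39;
    tracks = cell(1, numFiles);
    for k = 1:numFiles
        lines = splitlines(strtrim(fileread(sprintf('%d.txt', k))));
        vals  = str2double(strrep(lines, ',', ''));  % "--undefined--" becomes NaN
        tracks{k} = vals(:);
    end
    numFrames = max(cellfun(@numel, tracks));
    f0 = NaN(numFrames, numFiles);      % pad shorter tracks with NaN
    for k = 1:numFiles
        f0(1:numel(tracks{k}), k) = tracks{k};
    end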

Then I wrote another MATLAB script (praaterage.m) to take the median average of the fundamental frequencies reported by Praat (calculated using the base-two logarithms, not the values in Hertz).  The same script also generates a “certainty” vector which reports the number of source files in which a frequency was detected in each time window, divided by the maximum to yield a range of 0 to 1.  Using another script (applyavgcontour.m), I throw out any median fundamental frequency values that correspond to certainty values below 0.5, applying a linear interpolation across the gaps, and then smooth the result by taking a moving mean average with a width corresponding to one tenth of a second.  Then, for each hundredth of a second, I run 15-Hz-wide second-order Butterworth bandpass filters that pass the cleaned-up averaged fundamental curve and fifty of its harmonics, cross-fading the first and last thirds of adjacent segments.  The raw result of that process sounds like this:


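In sketch form, the contour-averaging and filtering just described work out to something like the following; this paraphrases praaterage.m and applyavgcontour.m rather than reproducing them, and the per-frame filter bank is shown for a single frame without the cross-fading:

    % f0: [numFrames x numRecordings] matrix of Praat pitch values (NaN = undefined);
    % mix: the time-normalized thirty-nine-voice mixdown at fs = 44100.
    fs = 44100;
    framesPerSec = 100;                           % one pitch value per hundredth of a second

    medLog2   = median(log2(f0), 2, 'omitnan');   % median of the log2 frequencies per frame
    avgF0     = 2 .^ medLog2;                     % back to Hertz
    certainty = sum(~isnan(f0), 2);               % how many recordings had a pitch here
    certainty = certainty / max(certainty);       % scale to the range 0 to 1
    avgF0(certainty < 0.5) = NaN;                 % throw out poorly attested frames...
    avgF0 = fillmissing(avgF0, 'linear');         % ...and bridge the gaps linearly
    avgF0 = movmean(avgF0, framesPerSec / 10);    % smooth over a tenth of a second

    % For a single frame t, keep only 15-Hz-wide bands around the averaged
    % fundamental and fifty of its harmonics (one Butterworth bandpass each).
    % (t is a frame index assumed to come from an enclosing loop over the mix.)
    frameLen = round(fs / framesPerSec);
    frame = mix((t - 1) * frameLen + (1:frameLen).');
    y = zeros(size(frame));
    for h = 1:50
        fc = h * avgF0(t);
        if fc + 7.5 < fs / 2                      % skip harmonics above the Nyquist limit
            [b, a] = butter(1, [fc - 7.5, fc + 7.5] / (fs / 2), 'bandpass');
            y = y + filter(b, a, frame);          % second-order bandpass around this harmonic
        end
    end
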
Then a couple final cosmetic steps are in order: I apply some noise reduction to mitigate the “tuning” of background noise and add a little echo to help mask the blip-and-bloop artifacts of that filtering.


The artificiality of the voice may admittedly be distracting, but try to listen through it to the underlying prosody and pronunciation.  Does this come off as an impressively proficient performance (albeit perhaps of the maligned conventions of “poet voice”), or does it sound objectionably cold, passionless, inhuman?

You be the judge.

In closing, I’ll touch briefly on a few related ideas I haven’t yet tried out but would like to, eventually, time permitting.  First, I’d like to apply the same processing technique to prose, and the Speech Accent Archive at the Center for History and New Media at George Mason University looks like an ideal source with its much-recorded elicitation paragraph:

Please call Stella.  Ask her to bring these things with her from the store:  Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.  We also need a small plastic snake and a big toy frog for the kids.  She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.

The metadata collected by that project would also let us average recitations selectively by location, background, gender, age, and so forth.  I imagine the same technique could also be applied successfully to a cappella singing—say, a hundred different renditions of “Happy Birthday to You.”  And another fun trick, in the spirit of what Marit MacArthur and Lee Miller call “deformance,” might be the creation of exaggerated versions of source recitations: first taking the timing differences between a source and the group average and increasing them by a particular factor, and then synthetically boosting the differences in pitch.  Other ideas?
