New Software for Playing Pictures of Sound Waves

After years of trying to get existing software to do things it was never intended to do, I’ve finally written some code of my own for converting pictures of sound waves into playable audio.  It doesn’t have the friendliest of interfaces, but it’s relatively easy to get up and running, and it’s free (as in free beer, not free puppy), so the price is right.

Let’s say you have a digitized image of an audio waveform—a graph of the amplitude of sound vibrations as a function of time.  Maybe it’s a phonautogram from the 1850s or 1860s: a record of sound traced in soot on a moving paper sheet for visual analysis at a time when playback wasn’t yet on the table.  Or maybe it’s an ink print on paper made from a gramophone disc a few decades later (converted, in this case, from a spiral into parallel lines by a polar-to-rectangular-coordinates transform).  Or maybe it’s just a random squiggly line you want to treat as an audio waveform to find out what happens.  Who am I to judge?

[Image: sample-trace]

The question is how to convert that image into a playable audio file so we can listen to it.

[Image: sample-trace-2]

I’ve previously described a circuitous method of doing that here and here, as well as in my book Pictures of Sound.  That method works, but a point raised by one reader is well taken:

It seems you used an extraordinary lengthy and convoluted process to generate barely audible sound. Wouldn’t writing some software to do this be far easier and produce greater fidelity?

The Technique

In the past, I’ve relied on Andrew Jaremko’s ImageToSound, a piece of freeware designed to convert any 24-bit BMP into an 8-bit WAV file as though it were an image of an optical film sound track.  In optical film sound tracks, the modulation of the audio signal is tied to how much light passes through a translucent strip that varies either in opacity (“variable density,” below left) or width (“variable area,” below right, with thanks to Iainf for the illustrations).

[Image: film-soundtrack-formats2]

ImageToSound converts the sums of pixel luminance in each successive column in a source image file into successive audio samples in a target WAV file.  At first glance, this might not seem applicable to playing sources such as phonautograms.  But eight years ago, I realized that I could convert a wavy line into a bright band of varying width by filling the area above or below it with white using an ordinary Photoshop paintbucket tool, and that ImageToSound could then be used to convert it into audio.  More recently, it occurred to me that we could convert that result into a band of varying brightness in turn by reducing it to one pixel in height and then (optionally) re-expanding it.  That isn’t particularly useful, I guess, but ImageToSound can handle it in that form just as well.

[Image: wavy-line-conversion]

This procedure has its drawbacks.  It can require a lot of time-consuming manual editing to clean up the area to either side of the trace and connect any breaks in it so that the paintbucket tool won’t “spill through” to the other side.  Meanwhile, the ImageToSound software has some unfortunate quirks and limitations and is becoming increasingly challenging to run: I can still execute it successfully on my current Windows 10 laptop in Windows 98 compatibility mode, but others have told me they can’t get it to work.  Unfortunately, there haven’t been any convenient alternatives.  AEO-Light is designed to convert digital images of optical film sound tracks into audio on roughly the same principle as ImageToSound, but it’s set up to assume there will be accompanying motion picture frames as a point of reference.  Nothing else of which I’m aware along the same lines is readily available for public use (PRISM, the software attached to IRENE, is not).  There are programs aplenty out there for converting pictures into audio, but virtually all of them treat source images spectrographically, as graphs of time against frequency (Photosounder is a prime example), rather than oscillographically, as graphs of time against amplitude.

However, I’ve since learned about some numerical computing environments and programming languages that can flexibly input and output both image and audio files while carrying out elaborate computations on them in between, and I decided to see if I could harness one of these as a substitute for ImageToSound.  Out of the available options—including MATLAB—I ended up going with GNU Octave, which is free and open-source.  It’s also convenient: I downloaded and ran the latest official Windows installer and had things up and running within moments.  Here’s what you see when you start the Windows GUI:

[Image: octave-screenshot]

You can now enter commands at the cursor in the Command Window.  And here’s a sequence of them that accomplishes the same thing I’ve been using ImageToSound to do, presuming the source image is grayscale with a horizontal time axis:
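The sequence can be sketched as follows.  This is a minimal reconstruction from the line-by-line description given next, not a verbatim listing.  The stand-in image, the temporary file names, and the use of audiowrite (the WAV writer in current Octave releases) are all assumptions; with a real scan, the first block would be dropped and imread pointed at your own file.

```octave
% Stand-in source image so the sketch runs on its own (hypothetical data):
% a bright band whose height varies sinusoidally, 100 rows by 1000 columns.
t = 0:999;
bandheight = round(50 + 40*sin(2*pi*t/100));
img = uint8(255 * (repmat((1:100)', 1, 1000) <= repmat(bandheight, 100, 1)));
srcfile = [tempname() '.png'];
imwrite(img, srcfile);

% The five steps described in the text.
F = imread(srcfile);                        % pixel intensities as a matrix
F = double(F);                              % lift integer pixel constraints
S = sum(F);                                 % row vector of column sums
U = 2*(S - min(S))/(max(S) - min(S)) - 1;   % rescale to the range -1..+1
outfile = [tempname() '.wav'];
audiowrite(outfile, U(:), 44100, 'BitsPerSample', 24);   % 44.1 kHz, 24-bit
```

U(:) hands the samples over as a column, which is the layout audiowrite documents for mono data.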


The first line imports the image file as a matrix F of numbers representing pixel intensities.  Think of this as a grid in which columns represent pixel columns and rows represent pixel rows, and in which we can call up the value of a particular cell, column, or row and carry out whatever calculations on it we like.  The next line converts all of the numbers in the matrix to double precision format to remove any constraints imposed on them as pixel data, such as restriction to integers (which I’ve found can carry over into subsequent calculations if we don’t lift them).  The third line creates a row vector S—which is what a matrix is called if it has only one row—from the sums of pixel intensities for each column.  The fourth line rescales the result to the -1 to +1 range of values required by the WAV format, assigning the letter U to the rescaled vector; and the fifth line writes vector U to a 44.1k 24-bit WAV file with the specified name.

Depending on what you choose for a source image, you might get a message like this:

warning: your version of GraphicsMagick limits images to 16 bits per pixel

Octave will process your image anyway, but apparently it won’t take advantage of any available resolution beyond sixteen bits per pixel.  To exploit higher resolutions, it’s possible to rebuild GraphicsMagick with a quantum depth of 32 bits—see the discussion here—but I haven’t yet tried this myself.  Sixteen bits per pixel really isn’t too shabby.

For my purposes, what I’ve just described is already an improvement on ImageToSound, which insists on a 24-bit BMP source image, can only output an 8-bit WAV file, and reduces DC offset in some mysteriously unspecified way.  Now we can process pretty much any common image format, output a file at whatever sample rate and bit depth we like, and handle DC offset according to some transparent procedure of our own choosing.  So if you’re looking for a way to do what I’ve described using ImageToSound to do in the past, the five lines of code shown above will let you do it more flexibly and more transparently in Octave.

But now that we’ve bitten the bullet and started devising image-to-audio algorithms of our own, there’s no reason for us to stop there.  I’ve converted wavy lines into bright bands of varying width for conversion to WAV in the past mainly because that was the only kind of input I could find ready-made software to handle.  However, I’ve now come up with a reasonably simple way for us to sidestep that graphics editing step and convert wavy lines directly into audio using Octave.  Below are a few lines of code that will translate a grayscale image of a light trace on a dark background with a horizontal time axis into a WAV file.
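A minimal sketch of those steps, pieced together from the description that follows rather than copied from an original listing.  The synthetic trace, the temporary output name, and the use of automatic broadcasting and audiowrite are assumptions.

```octave
% Stand-in image (hypothetical data): a one-pixel-wide light trace on black.
% With a real scan these lines would be replaced by F = imread('source.tif'),
% where 'source.tif' is a placeholder name.
nrows = 100;  ncols = 1000;
tracerow = round(50 + 30*sin(2*pi*(0:ncols-1)/100));   % row index of the trace
F = zeros(nrows, ncols);
F(sub2ind(size(F), tracerow, 1:ncols)) = 255;

F = double(F);              % work in double precision
S = sum(F);                 % total intensity per column
Q = F ./ S;                 % each pixel's share of its column total
Q(isnan(Q)) = 0;            % empty columns give 0/0 = NaN; treat as zero
Z = nrows:-1:1;             % descending row numbers...
Z = Z';                     % ...as a column vector
Y = Q .* Z;                 % share of intensity times row number
W = sum(Y);                 % center of brightness for each column
V = 2*(W - min(W))/(max(W) - min(W)) - 1;   % rescale to -1..+1
outfile = [tempname() '.wav'];
audiowrite(outfile, V(:), 44100, 'BitsPerSample', 24);
```

Because the test trace is a single bright pixel per column, W simply returns the (descending) row number of that pixel, which makes the result easy to check by eye.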


The first three lines are the same as before.  But the fourth line creates a matrix Q of individual pixel intensities (in matrix F) divided by total column intensity (in vector S), while the fifth line cleans up any cases where this would entail division by zero by substituting a zero for each “NaN” (“Not a Number”) value.  Lines six and seven then create a column vector Z of descending row numbers; line eight calculates a matrix Y by multiplying the fractional pixel intensities in matrix Q by the row numbers in vector Z; and line nine creates a row vector W from the sums of each column in matrix Y, which we proceed to rescale to the range -1 to +1 as row vector V and write to a WAV file as before.

To illustrate what this sequence of steps accomplishes, let’s consider how it would handle an exceedingly simple case: a row of five columns (presuming a vertical time axis) in which 0% of total pixel intensity is in Column 1, 20% in Column 2, 40% in Column 3, 30% in Column 4, and 10% in Column 5.

[Image: column-number-calculation-grid1]

When we multiply the column numbers by the fractions of total pixel intensity, the results are 0, 0.4, 1.2, 1.2, and 0.5; and if we sum those we get 3.3.  Relative to the positions of the five columns (marked above with red lines), 3.3 (marked above with a green line) represents the center of total pixel intensity (represented above in yellow); thus, there should be an equal area of yellow on either side of the green line.  Or, if we want to visualize things relative to some actual pixel intensity values:

[Image: tinysample3]

For want of greater precision in our data, we’re inferring that the source trace would have looked something vaguely like what I’ve shown below before pixelation (the light orange marks the positions of the five scan lines, while the green line once again marks the center):

[Image: hypothetical-trace-structure]

When we apply this same principle to an image of the wavy line of a phonautogram, or something similar, it will return the center location of quantified brightness at each point along the time axis, which should ideally correspond to the center of the trace (although in practice visual background noise and imperfections or irregularities within the trace itself can complicate things).
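The five-column example can be checked numerically.  The variable names below are chosen for illustration only.

```octave
frac = [0 0.2 0.4 0.3 0.1];     % share of total intensity in columns 1..5
colnum = 1:5;                   % column numbers
products = frac .* colnum;      % share times column number
center = sum(products);         % center of total pixel intensity
printf('%.1f\n', center);       % prints 3.3
```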

We shouldn’t necessarily evaluate numerical pixel intensity values in the linear fashion I’ve described so far.  I’ve found through trial and error that results can often be dramatically improved by raising the original pixel intensity values to a certain power before carrying out the calculations.  Here’s how we would raise them to the power of ten, which can yield good results:


[and so on]
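As a sketch of why this can help, the following compares the linear case with the tenth power on a made-up column set: a bright trace over a dim, uniform background.  The pixel values and dimensions are hypothetical, and the only change from the earlier steps is the element-wise power applied before the sums are taken.

```octave
nrows = 100;  ncols = 50;
F = 20 * ones(nrows, ncols);      % dim, uniform background
F(30, :) = 255;                   % bright trace in row 30
Z = (nrows:-1:1)';                % descending row numbers; Z(30) is 71

powers = [1 10];
centers = zeros(size(powers));
for k = 1:numel(powers)
  G = F .^ powers(k);             % raise intensities to the chosen power
  W = sum((G ./ sum(G)) .* Z);    % center of brightness per column
  centers(k) = W(1);
  printf('power %2d: center of brightness at %.3f\n', powers(k), centers(k));
end
```

In the linear case the background drags the result well away from 71, toward the middle of the column; at the tenth power the background contributes almost nothing and the result lands very nearly on the trace.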

The optimal power value will vary from case to case, but it looks like there will be a sweet spot each time where noise is most fully attenuated, and beyond which raising the power further will only add noise.  Here’s an animation that shows some typical effects of raising the power of source pixel intensities incrementally by values up to 100:

[Image: picky-increasing-powers1]

And here’s the source image for comparison, excerpted from one of Eli Whitney Blake, Jr.’s 1878 photographic sound recordings of the phrase “Brown University” (with height and width adjusted here to match the scale of the waveform view above):

[Image: picky-increasing-powers-source]

Larger flecks of visual background noise sometimes produce deflections in the WAV output which could be got rid of by manually cleaning up the source image beforehand.  But the single most consistent and conspicuous glitch in the audio actually comes from an azimuth problem where the trace itself runs slightly backwards in time:

[Image: picky-azimuth-glitch]

Which is to say that the method itself appears to be working as intended here: it is identifying the center of brightness in each pixel column.  There are just some problems with this particular source trace which we haven’t tried to correct, and which my earlier technique using ImageToSound wouldn’t have handled any better.  With a clean, well-formed trace this new approach should output solid results; and when we’re not so fortunate in our source material, it will still be able to handle anything we can throw at it in a logically consistent way, no matter how messed up, just like my earlier technique.

I’ve been focusing here on the crucial step of converting image files into sound files, but Octave can be used to do other kinds of related audio processing as well.  We can read in a WAV file just as easily as we can an image file:
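A sketch of three ways to do this in current Octave releases, where the reading function is audioread and the bit depth is reported by audioinfo.  Those function choices, the test tone, and the temporary file names are assumptions; older releases offered different functions.

```octave
% Write a short stand-in WAV first so the sketch runs on its own
% (hypothetical data: one second of a 440 Hz tone, 16-bit, 8 kHz).
wavfile = [tempname() '.wav'];
tone = 0.5 * sin(2*pi*440*(0:7999)'/8000);
audiowrite(wavfile, tone, 8000, 'BitsPerSample', 16);

% Variant 1: samples only.
A = audioread(wavfile);

% Variant 2: samples plus sample rate; bit depth comes from audioinfo.
[A, samplerate] = audioread(wavfile);
info = audioinfo(wavfile);
bitdepth = info.BitsPerSample;

% Variant 3: read a file and write a copy with the same rate and depth.
dupfile = [tempname() '.wav'];
audiowrite(dupfile, audioread(wavfile), samplerate, 'BitsPerSample', bitdepth);
```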




The first variant imports the WAV file as a column vector A (or, if it’s a multi-channel file, as a matrix A with channels assigned to different columns and successive samples assigned to successive rows; for a stereo file, the first column is the left channel and the second column is the right channel).  The second variant does the same thing while also setting samplerate equal to the sample rate and bitdepth equal to the bit depth.  The third variant imports a WAV file and then exports a duplicate of it, as I’ve confirmed through experiment.  Then, once we’ve read in a WAV file, we can process it in ways that may be less convenient, or even impossible, using ordinary audio editing software.  For example, to convert displacement data into velocity data (as I discussed doing in a more roundabout way here a few months ago), the command for taking the derivative of A is simply:
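In Octave the sample-to-sample derivative is available as diff.  A small sketch with made-up sample values:

```octave
A = [0; 0.1; 0.3; 0.2; -0.1];   % hypothetical displacement samples
D = diff(A);                    % D(k) = A(k+1) - A(k)
disp(D');
```

Note that D has one fewer element than A, since each value is the difference between two neighboring samples.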


To go in the opposite direction, from velocity to displacement—which I was never able to figure out how to accomplish sample-by-sample using standard audio editing software—we can do this:

B=zeros(length(D)+1,1);  % B is an assumed name for the displacement vector
for i=2:(length(D)+1); B(i)=B(i-1)+D(i-1); end
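A quick way to convince ourselves that the two operations undo one another is to take the derivative of some made-up samples and then sum it back up.  The sketch below uses cumsum as a loop-free running sum; the sample values and variable names are hypothetical.

```octave
A = [0.2; 0.1; 0.3; 0.2; -0.1];   % hypothetical displacement samples
D = diff(A);                      % velocity
B = [0; cumsum(D)];               % running sum of D, starting from zero
disp([A, B]);                     % B tracks A, shifted by a constant
```

The reintegrated vector B matches A except for the starting offset A(1), which differentiation discards.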

I’m not really digressing, by the way, since these are both steps we’ll sometimes want to carry out on the raw results we get from converting a picture of a sound wave into audio.

Picture Kymophone

It’s possible to convert images into audio using Octave by entering commands one by one at the command prompt, as I’ve been describing.  But it’s tedious and inefficient to key these sequences in by hand over and over again, so I started piecing together a reusable “function” to take care of them for me automatically.  Over time, as I’ve added in more features and customizable options, this has grown into several hundred lines of code which I call Picture Kymophone 1.0.  You can download it in a zip file here.  (It’s furnished with no warranty of any kind: use at your own risk.)

To run Picture Kymophone in its simplest form, place picky.m in a directory somewhere agreeable to you, navigate to it (e.g., in the Octave GUI File Browser window), and then enter picky at the Octave command prompt.  You’ll then be asked to select an image file to process.  Keep in mind that the function assumes the smaller dimension represents amplitude, the larger dimension represents time, the time scale runs from left to right or top to bottom, and the trace is light against a dark background (if it’s not, invert it before processing).  Also, if the image file contains multiple channels, such as RGB, they’ll all be summed indiscriminately to grayscale.  Anyhow, once you’ve selected your source image, it will get processed and you’ll receive a notification like this:

sourcefile.tif processed mode_0_w_all ID=1478341446

There will now be four WAV files in the same directory as picky.m, with filenames of this form:


“1478341446” is simply a time reference that indicates when the WAVs were generated—handy for keeping track of the results of successive experiments—while “sourcefile” is the name of the source image truncated at the first period.  The final element in the filename shows what the samples in the file represent:

  • pdp = position displacement (raw positional values)
  • pvl = position velocity (derivative/rate of change of positional values)
  • idp = intensity displacement (raw intensity values)
  • ivl = intensity velocity (derivative/rate of change of intensity values)

The idp file is equivalent to the output of ImageToSound.

That’s how Picture Kymophone runs by default, but you can also customize a number of variables at the command prompt, like this:

>> picky w 0 1 d 1 1 0.95 all 44100 24 c:/sourcefile.tif

I wish I could offer a nice little graphical user interface with radio buttons and so forth instead, but for the moment, this is what I’ve got.  A handy one-page pdf listing of commands is available here.  The eleven arguments need to appear in the exact order shown, and they specify:

  1. Type of output requested, default w (for WAV).
  2. Processing mode, default 0 (independent processing of a single source image).
  3. Power to which source pixel intensity values will be raised before calculations are performed (1=linear, 2=values squared, 3=values cubed, etc.), default 1 (linear).
  4. Numerical format setting to use for calculations, default d (standard processing with double precision).
  5. Adjustment of slope and DC offset, default 1 (no adjustment).
  6. Decimal amplitude threshold for sample-to-sample impulse rejection (e.g., a setting of 0.1 would detect and remove any values greater than 10% of total amplitude range from the derivative velocity files, and displacement files would then be reintegrated from the results); default 1 (no impulse rejection).
  7. Decimal value (e.g., 0.1=10%) to which amplitude values will be normalized for WAV output (alternatively, a value greater than 1 will force clipping); default 0.95 (normalization to 95%).
  8. The type(s) of output WAV to be generated (all, pdp, pvl, idp, ivl, pos = position, vel = velocity, int = intensity, dis = displacement); to suppress all WAV file output, enter any other group of three characters.  Default is all.
  9. Sample rate of output WAV file, default 44100.
  10. Bit depth of output WAV file, default 24.
  11. File path of source image (otherwise user is prompted to navigate to file(s)).
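The impulse rejection described for argument 6 can be sketched as follows.  This is an illustration of the idea under stated assumptions, not the code in picky.m: it measures the threshold against the range of the displacement values and replaces rejected differences with zero, either of which may differ from what the function actually does.

```octave
W = [0 1 2 3 53 54 55 56 57];          % hypothetical positions with one jump
threshold = 0.1;                       % 10% of total amplitude range
T = diff(W);                           % velocity (sample-to-sample change)
limit = threshold * (max(W) - min(W)); % assumed reference range
T(abs(T) > limit) = 0;                 % assumed handling: zero out impulses
Wfixed = [W(1), W(1) + cumsum(T)];     % displacement reintegrated from T
disp(Wfixed);
```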

If you enter values for only some of the variables, the function will assign default values to all subsequent variables.  So, for example:

>> picky w 0 10

would cause the function to raise pixel intensity values to the power of ten before carrying out its calculations while reverting to defaults for all other parameters.  Or to limit WAV output to the “pdp” file while changing nothing else, you’d enter:

>> picky w 0 1 d 1 1 0.95 pdp
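For anyone curious how a function can fill in trailing defaults like this, here is one way it might be done.  The helper name fill_defaults and its field names are hypothetical and not taken from picky.m; the default values are the ones listed above.

```octave
1;  % lets this file define a function and then call it as a script

% Hypothetical helper: supplied arguments replace the leading defaults.
function opts = fill_defaults(varargin)
  defaults = {'w', '0', '1', 'd', '1', '1', '0.95', 'all', '44100', '24', ''};
  args = defaults;
  args(1:numel(varargin)) = varargin;
  opts.output     = args{1};
  opts.mode       = str2double(args{2});
  opts.power      = str2double(args{3});
  opts.format     = args{4};
  opts.adjust     = str2double(args{5});
  opts.reject     = str2double(args{6});
  opts.normalize  = str2double(args{7});
  opts.wavtypes   = args{8};
  opts.samplerate = str2double(args{9});
  opts.bitdepth   = str2double(args{10});
  opts.path       = args{11};
end

% Command syntax such as "picky w 0 10" hands every argument over as text,
% which is why the numeric ones are converted with str2double.
opts = fill_defaults('w', '0', '10');
printf('power=%d samplerate=%d types=%s\n', ...
       opts.power, opts.samplerate, opts.wavtypes);
```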

The numerical format setting (argument #4) has implications for how much memory will be required, and hence whether there will be enough of it to do whatever you’re trying to do.  The default is lower-case d for double precision format.  However, if you’re trying to process an image with an unusually large amplitude dimension, Octave may give you an out-of-memory error because you’ve exceeded the default 2 GB limit for array size.  In that case, you can try to reduce the amount of memory being used by selecting lower-case s for single precision format.  The downside of that approach is that the single precision format can’t accommodate as wide a range of values as double precision format can.  Thus, if you’re simultaneously raising pixel intensity values to a relatively high power such as ten, you’re likely to exceed the maximum for a single-precision value, and Octave seems to handle this situation by returning a matrix or vector consisting of nothing but zeroes, which means the output WAV will be a silent “flat line.”  I’ve set Picture Kymophone up to issue an alert if this happens:

>> picky w 0 10 s
sourcefile.tif processed mode_0_w_all ID=1478958141
Output value range for (sourcefile.tif) is zero; recommend retry with different numerical format setting
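One plausible route to the all-zero result, offered as an assumption about the mechanism rather than something confirmed in picky.m: values that overflow single precision become Inf, dividing Inf by Inf yields NaN, and a NaN cleanup step like the one described earlier then turns every such cell into zero.  The pixel values below are hypothetical.

```octave
F = single([65535 0; 65535 65535]);   % hypothetical 16-bit pixel values
G = F .^ 10;                          % 65535^10 is roughly 1.5e48
disp(isinf(G(1,1)));                  % beyond realmax('single'), about 3.4e38
Q = G ./ sum(G);                      % Inf/Inf gives NaN, 0/Inf gives 0
Q(isnan(Q)) = 0;                      % NaN cleanup leaves nothing but zeros
disp(Q);

Gd = double(F) .^ 10;                 % the same values stay finite in double
disp(all(isfinite(Gd(:))));
```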

If you don’t want to downsample your source image, one option would be to build Octave to allow for arrays larger than 2 GB, but this looks like a lot of work and I haven’t tried it.  Instead, I created two additional numerical format settings you can invoke with upper-case D or S.  Instead of running calculations on a whole array at once, which is where the memory problems often arise, these options loop through each of the columns separately in turn.  The trade-off is that the processing takes a bit longer.  But if a memory error occurs when Octave is initially reading the image file, then you’re out of luck and may just need to scale down your image somehow.
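The column-by-column idea can be sketched like this.  It is an illustration of the general approach, not the loop in picky.m, and the small integer matrix merely stands in for a large image.

```octave
F = uint8(magic(6));            % hypothetical stand-in for a large image
nrows = size(F, 1);
Z = (nrows:-1:1)';              % descending row numbers
W = zeros(1, size(F, 2));
for c = 1:size(F, 2)
  col = double(F(:, c));        % only one column is converted at a time
  s = sum(col);
  if s > 0
    W(c) = sum(col .* Z) / s;   % center of brightness for this column
  end
end
disp(W);
```

Only one column is ever held in double precision, so the working memory no longer scales with the full size of the image.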

Now let’s consider the processing mode (argument #2).  Sometimes we simply want to convert a single image file into audio, but sometimes we want to tackle a whole group of images at once (such as separate rotations of a phonautogram or revolutions of a disc after a polar-to-rectangular-coordinates transform), and we may also want to process them in a mutually consistent way.  With that in mind, I’ve provided for a few different processing options:

  • 0 = Images are processed independently from one another, and auto-generated range values aren’t saved.
  • 1 = Images are processed independently from one another, but auto-detected range values are saved for future reference.  This creates four tiny working files in the same directory as picky.m, called ndif.mat, sdif.mat, tdif.mat, and wdif.mat.  These store the maximum group range values for the ivl, idp, pvl, and pdp transductions respectively.
  • 2 = Images are processed independently from one another, and auto-detected range values overwrite previously saved ones if and when they exceed them.
  • 3 = Images are processed using saved range values instead of auto-detected ones, with headroom added at top and bottom as needed.

So let’s say we have five separate image files representing successive parts of a given recording.  First we process the images once without generating any WAVs, the first in mode 1 and the remaining four in mode 2; and then we generate the output WAV files while processing all five images a second time in mode 3, applying the data we collected about the cumulative maximum amplitude range during the first pass.  This would be cumbersome to do manually, but I’ve created some additional processing modes to support different kinds of batch processing:

  • 50 = Processes a group of source images independently from one another.
  • 51 = Processes a group of source images at mutually consistent amplitude levels according to the method described above (using processing modes 1+2 for a “test” pass and then 3 for a second pass).
  • 52 = Processes a group of source images using previously saved amplitude level ranges.
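The two-pass idea behind mode 51 can be sketched as follows.  The segment values are made up, and the way each segment is centered before scaling is an assumption about how headroom might be distributed, not a description of picky.m.

```octave
segs = {[0 2 4], [1 9 5], [3 3 6]};      % hypothetical per-image results

% Pass 1 (modes 1 and 2): keep the largest range seen across the group.
grouprange = 0;
for k = 1:numel(segs)
  grouprange = max(grouprange, max(segs{k}) - min(segs{k}));
end

% Pass 2 (mode 3): scale every segment by the shared range, so that
% relative amplitudes between segments are preserved.
scaled = cell(size(segs));
for k = 1:numel(segs)
  centered = segs{k} - (max(segs{k}) + min(segs{k})) / 2;
  scaled{k} = 0.95 * centered / (grouprange / 2);
end
disp(scaled{2});
```

Only the segment with the widest range reaches the full 95% level; the others come out proportionally quieter instead of each being normalized on its own.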

Here are the options for output type (argument #1):

  • w = Outputs a separate mono WAV file for each processed image.
  • v = Outputs a separate vector for each processed image (pdp = vector W stored in w.mat; idp = vector S stored in s.mat; pvl = vector T stored in t.mat; ivl = vector N stored in n.mat)—intended to support further analysis and processing, and used for intermediate steps during batch processing.
  • x = Outputs a single concatenated mono WAV file from all processed images, with a filename of the form Batch_1478341446_pdp.wav, plus a mono “click” file documenting the join points (Batch_1478341446_clk.wav).
  • b (not yet tested) = Outputs a single concatenated vector for each processed image, plus a “click” vector for reference (vector clickfile stored in clk.mat)—intended to support further analysis and processing.  Filename takes a form like wconc.mat, sconc.mat, etc., storing vector Wconc, Sconc, etc.
  • s = Outputs a single concatenated stereo WAV file from all processed images with the results in the left channel and clicks in the right channel documenting the join points.  Filename takes the form Batch_1478341446_pdpst.wav.
  • m (not yet tested) = Outputs a matrix in which each successive row corresponds to a successively processed source image and each successive column corresponds to a successive sample (pdp = matrix Wmx stored in wmatrix.mat; idp = matrix Smx stored in smatrix.mat; pvl = matrix Tmx stored in tmatrix.mat; ivl = matrix Nmx stored in nmatrix.mat)—intended to support further analysis and processing.
  • h = Outputs a list of available command-line arguments; a, o, t, p then output more details about specific arguments.

Concatenated WAV files and vectors are padded with 3000 samples of silence at the beginning and end.  The reference “click file” marks each concatenated segment with a value of -1 for the last sample from one source image and +1 for the first sample from the next one, like this:
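A sketch of how such a pair of vectors could be assembled, using made-up segments.  Marking only the joins between segments (and not the outer ends) is an assumption.

```octave
segs = {[0.1 0.2 0.3], [0.4 0.5], [0.6 0.7 0.8]};   % hypothetical segments
pad = 3000;                                          % silence at each end
audio = [zeros(1, pad), [segs{:}], zeros(1, pad)];
clicks = zeros(size(audio));
pos = pad;
for k = 1:numel(segs) - 1
  pos = pos + numel(segs{k});
  clicks(pos)     = -1;     % last sample from segment k
  clicks(pos + 1) = +1;     % first sample from segment k+1
end
disp(find(clicks ~= 0));
```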

[Image: clickfile-interpretation]

Finally, let’s consider the options for adjusting slope and DC offset (argument #5).  The default value is 1, which means no adjustments will be made, but the other values associated with specific adjustments are all prime numbers, and to combine multiple options we just multiply their numbers (which then get sorted back out through prime factorization).  Some of the available adjustments operate on the single-image level.  Option 2 applies a 20 Hz second order Butterworth high pass filter to a single image result, while option 3 sets the beginning and end points from a single image equal to zero.  To do both at once, we enter the product of 2 times 3 = 6; this will first align the endpoints and then apply the high pass filter to them.
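Octave’s built-in factor function shows how a combined setting can be sorted back out into its component options.  The endpoint alignment in the second half is only a guess at the method: it assumes a straight ramp between the first and last samples is subtracted, and the sample values are made up.

```octave
% Decoding a combined setting into its prime options.
disp(factor(6));                 % 2 3
disp(factor(330));               % 2 3 5 11
wants_filter = any(factor(6) == 2);

% Assumed form of endpoint alignment: subtract a linear ramp so that
% the first and last samples both become zero.
x = [5 6 8 7 9];                 % hypothetical single-image result
aligned = x - linspace(x(1), x(end), numel(x));
disp(aligned);
```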

Other adjustments instead operate on results at the level of whole batches.  Option 5 is the batch result equivalent of option 2, and option 7 is the batch result equivalent of option 3.  On the other hand, option 11 (“joining”) is unique to batch results: the first sample from each new image starting with #2 is set equal to the last sample from the preceding image.
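The joining step can be sketched as follows.  The assumption here is that each whole segment is shifted by a constant so that its first sample lands on the previous segment’s last sample; the segment values are made up.

```octave
segs = {[0 1 2], [10 11 12], [-5 -4]};    % hypothetical per-image results
joined = segs{1};
for k = 2:numel(segs)
  offset = joined(end) - segs{k}(1);      % shift needed to close the gap
  joined = [joined, segs{k} + offset];    % append the shifted segment
end
disp(joined);
```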

[Image: join-adjustment]

Below is an actual example that shows the “joining” feature in action: this is a sequence of rotations from a paper print of a gramophone disc that has undergone a polar-to-rectangular-coordinates transform which left a significant curvature in each rotation.

[Image: picky-join-example]

The waveform on top shows the rotations concatenated without adjustment, while the waveform below it shows the adjusted results, with a “click track” below that to show where segments were stitched together.  The exported amplitude range of the meaningful signal is minuscule here in relation to the rotational curvature, which isn’t so good for our levels even at 24 or 32 bit resolution, but we could improve things by implementing a 20 Hz filter prior to export, which is precisely the scenario I built that functionality in to address.  The downward clicks represent breaks in the original trace and can easily be removed with standard impulse noise attenuation tools.

Sample Results

I’m still experimenting myself with how to get the best results out of Picture Kymophone, but I’d be remiss not to offer at least a few examples of sounds I’ve used it to process.  A few months ago, I blogged about a record made in 1878 by Étienne-Jules Marey of the electrical discharge of a torpedo fish, which I’d converted into audio using my old technique:


To see how Picture Kymophone would handle it, I joined the three numbered segments by hand into a single long, narrow strip, erased just the numerals and the dozen or so biggest flecks of visual noise, and processed the edited result using this command:

picky w 0 10 d 1 1 0.95 pdp 25770

This instructs Picture Kymophone to generate:

  • a single WAV file (w)
  • from a single image processed independently (0)
  • with pixel intensity values raised to the tenth power (10)
  • using the double precision number format, since the image file is small enough not to pose any memory problems (d)
  • with no DC offset adjustment (1)
  • and with no impulse rejection (1)
  • normalized to 95% (0.95)
  • returning samples that represent positional displacement (pdp)
  • at a nonstandard sample rate of 25,770 samples per second, based on a scale in the source image equating a line 2,577 pixels long to 1/10 second (25770)
  • at the default bit depth of 24 bits
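The sample rate in that command follows directly from the scale in the image, since each pixel column becomes one audio sample.  As a small worked check (variable names are for illustration only):

```octave
pixels  = 2577;                           % length of the scale line in pixels
seconds = 0.1;                            % duration that line represents
samplerate = round(pixels / seconds);     % pixel columns per second
printf('%d\n', samplerate);               % prints 25770
```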

Next, I spent another fifteen minutes cleaning up the image a bit more meticulously in Photoshop and then processed the results again using the same command.  Here’s a sound file with (1) the audio from the first pass, (2) the audio from the second pass after manual cleaning, and then (3) the audio from the second pass after a standard impulse noise removal tool has been applied to it:

Cleaning up images in graphics editing software before processing them still improves our results—I haven’t rendered that step obsolete, much as I’d like to.  However, getting to hear something out of an image takes a lot less work now than it did before with my earlier technique.  That makes it much more feasible to test-play a waveform picture and get a sense for what it will sound like before investing serious time in it.

Next, here’s a source that didn’t lend itself well to my earlier technique: a recording Dayton Miller made with his phonodeik on May 18, 1909, of himself saying “Lord Rayleigh,” reproduced as a halftone illustration in his book The Science of Musical Sounds (published 1922, on page 238).

[Image: dayton-miller-lord-rayleigh]

The crests and troughs are clearly marked here, but there’s not much of a connecting trace between them.  Playing this recording using my old technique would require joining every crest and trough by hand in Photoshop to create a clear boundary capable of containing the action of the paintbucket tool.  But Picture Kymophone doesn’t need a boundary to find the center of brightness in each pixel column.  So I was able to join the two images up (there’s a slight overlap between them) and then process them into audio in a couple of seconds (marks representing hundredths of a second are spaced at approximately 157.2 pixel intervals, hence the sample rate):

picky w 0 20 d 1 1 0.95 pos 15720

Here’s the result in a sound file with (1) samples based on displacement and (2) samples based on velocity.

Next, let’s try some batch processing.  With my new tool in hand, I returned to the preparation work I’d already done a few years ago on a paper print of a gramophone disc in a scrapbook in the Emile Berliner papers at the Library of Congress, the “Schalldruck” of November 11, 1889:


I’d previously done some basic cleanup on the trace while it was still a spiral, converted the spiral into parallel lines using a polar-to-rectangular coordinates transform, and then isolated each of the 124 rotations into a separate image file.  The last time around, using ImageToSound, I’d also needed to spend a good deal of time searching the trace for breaks and joining them as well as joining all the separate rotations together by hand.  But this time I just put the 124 separate images into a folder as TIFs and executed this command:

picky x 51 10 d 330 0.008 0.95 pdp 26460

That command instructs Picture Kymophone to create:

  • a single concatenated mono WAV file, plus a separate “click” file for reference (x)
  • from a batch of images processed with a consistent auto-detected amplitude range (51)
  • with pixel intensity values raised to the tenth power (10)
  • using the double precision number format, since the image files are small enough not to pose any memory problems (d)
  • with a 20 Hz high pass filter applied at both the single-image (2) and batch (5) levels, and with single-image endpoints adjusted to zero beforehand (3) and joined at the batch level afterwards (11) in case they’ve gotten off in the meantime (2*3*5*11=330)
  • ignoring any jump between samples greater than 0.8% of the total amplitude range, a value settled upon through trial and error (0.008)
  • normalized to 95% (0.95)
  • returning samples that represent positional displacement (pdp)
  • at a nonstandard sample rate of 26,460 samples per second, since each revolution occupies 30,000 pixels and our target speed is 50 rpm (26460)
  • at the default bit depth of 24 bits

Here’s the result:

Processing time on my laptop was approximately 26 minutes.

Finally, I reprocessed some cleaned-up images I already had on hand for the famous April 9, 1860 phonautogram of “Au Clair de la Lune,” using this command (note the capital D for accommodating unusually large image files):

picky s 51 10 D 2

I haven’t speed-corrected the results because I’m optimistic I’ll be able to implement a speed correction algorithm in Octave as well, and I’d rather devote any time I have free for such things to that than to correcting individual files by hand as I have in the past.  The following sound files therefore display extreme irregularities in recording speed.  They also combine a record of the voice in the left channel with a record traced simultaneously by a tuning-fork in the right channel.  With those caveats, here’s a displacement transduction (which has become the traditional default for phonautograms):

And here’s a velocity transduction (roughly what we’d get if we could somehow trace the phonautogram physically with a stylus attached to a modern electromagnetic pickup):
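For readers curious how the two transductions relate: velocity is the rate of change of displacement, so one crude way to approximate it is to take the difference between neighboring samples. The Python sketch below rests on that assumption and is not how Picture Kymophone necessarily does it; the function names and the triangle-wave data are made up for illustration.

```python
def velocity_from_displacement(displacement, sample_rate):
    """Approximate velocity as the first difference of displacement,
    scaled by the sample rate (change in position per second)."""
    return [(b - a) * sample_rate
            for a, b in zip(displacement, displacement[1:])]


def normalize(samples, peak=0.95):
    """Scale samples so the largest absolute value equals `peak`."""
    largest = max(abs(s) for s in samples)
    if largest == 0:
        return list(samples)  # silence stays silence
    return [s * peak / largest for s in samples]


# A made-up triangle wave standing in for a traced displacement curve.
displacement = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5, 0.0]
velocity = normalize(velocity_from_displacement(displacement, 44100))
print(velocity)
```

A triangle wave in displacement comes out as a square wave in velocity, which hints at why the two versions of the same phonautogram sound so different: differentiation emphasizes higher frequencies.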

Want to try out the software yourself but don’t have any suitable pictures handy to experiment with?  You’re welcome to practice on the sample image below, which is a 1200 dpi scan of the vowel “o” as recorded by William Preece and Augustus Stroh sometime before February 1879, inverted from dark-on-light to light-on-dark but otherwise unaltered.  (See the source in context here.)

If you decide to give Picture Kymophone a whirl, I’d appreciate a report of your experiences and results, whether they’re positive or negative.  And if anyone out there can figure out a way to build a more convenient user interface, that would be great too.

21 thoughts on “New Software for Playing Pictures of Sound Waves”

  1. Pingback: How to “play back” a picture of a sound wave | Griffonage-Dot-Com

  2. Hey,
    both your blogs about this topic are tremendously interesting! Thank you so much for sharing your knowledge. It must have taken many, many hours to gain it!
    Right now I am downloading Octave to play around with the “sample o” sound. Yesterday I had an idea and maybe I can realize it with your technique: Imagine a tattoo on my skin of sound waves photographed and processed with “picky”.
    Do you think the quality of a tattoo could be high enough to “store” simple wave sounds like one or two spoken words?

  3. Hey, this is an extremely interesting post. Would you mind posting your “Schalldruck” .tifs?
    I’d actually love to try scanning a vinyl and then extracting the music from the resulting image. Have you tried something like this already?


  4. This technique has enormous potential. At present, many people have old record collections in their attic, but few modern hi-fi systems allow you to play a record (or tape). If it was possible to take an old 33 rpm LP or a 45 rpm single, and put it on a flatbed scanner (which almost all the cheaper computer printers now include), then play the resulting image in the computer, there would be potential to revive interest in the huge number of old records still in circulation, by doing away with the need to find a working turntable. In particular, a 7-inch 45 rpm single is easily small enough to be scanned in high definition by even the smallest of current flatbed scanners.

  5. By the way, in GNU Octave, I get this: “warning: wavwrite is obsolete and will be removed from a future version of Octave, please use audiowrite instead
    warning: called from
    wavwrite at line 51 column 5
    picky at line 450 column 7”
    How can I fix it? If I replace every wavwrite with audiowrite, I get:
    “error: wrong type argument ‘matrix’
    error: audiowrite: FILENAME must be a string
    error: called from
    picky at line 450 column 7”

  6. Pingback: All Griffonage That On Earth Doth Dwell | Griffonage-Dot-Com

  7. I’m sure I’m just very dense about all this since I’m not a programmer, but I can’t figure out how to process ‘longer’ waveforms. I love the tool and it’s very useful in prototyping my project, but it seems like a short sound (0.3 seconds, a .tif file 2000px across) and a longer one (2 seconds, 18000px across) both give me the same short wav file that sounds ‘sped up.’ How can I tell picky how long my sound file is without lowering the sampling rate (which makes the sound quality go way down)? Thanks!

    • Thanks for your inquiry! Picky is designed to translate each column in your source image into a sample, so the duration of the file is determined wholly by the combination of a width in pixels and a sample rate. For example, 18000 pixels at 44100 Hz should yield a file about 0.4 seconds long (18000/44100). This differs from software designed to convert images into sound by interpreting them as spectrograms, which often includes a duration setting. If you want to increase the duration of the audio you’re getting from the images you describe (which is equivalent to reducing playback speed), you might get better results by stretching your image horizontally before processing, so that its width in pixels equals your sample rate in hertz multiplied by your desired duration in seconds. That approach would make better use of the data you have than processing an image in Picky and then adjusting the audio afterwards. But bear in mind that playing 18000 pixel columns over two seconds is equivalent to an audio sample rate of 9 kHz, so you shouldn’t expect higher effective resolution than that regardless of what you do. Good luck!

      • Ah, that makes sense. I just stretched out the image horizontally and as you said it now makes a longer (and pitched correctly) wav file, thanks.

        Off topic: I’m prototyping a new project now and I need help with the software programming side of turning waveforms printed on paper into audio for a specific use. If you or someone you know is open to paid work of this type let me know.

    • I’m guessing that by “solve” you mean “play back.” The bad news is that the resolution of your image is too low for us to be able to retrieve decent audio from it. I wrote about this kind of sound wave art in another blog post — — which you might check out if you’d like to understand the nature of the problem. But I hate to leave any waveform image unheard, so I created a sound file from this one for you with its horizontal dimension upsampled to produce one second’s worth of audio at 44.1 kHz (the sample rate used on audio CDs). Here’s a link: — is this what you were expecting?

      • I did think the resolution wasn’t quite enough, I actually arrived at quite a similar result to you, so I’m feeling quite proud of myself, thanks for the excellent documentation. I initially thought it was going to be a spoken voice – perhaps it is! Thanks Patrick.
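The width-and-sample-rate arithmetic from the reply about longer waveforms above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, with hypothetical function names; it assumes one image column per audio sample, as that reply describes, and has nothing to do with Picky's internals.

```python
def duration_seconds(width_px, sample_rate):
    """One pixel column becomes one sample, so duration = width / rate."""
    return width_px / sample_rate


def width_for_duration(sample_rate, seconds):
    """Image width in pixels needed for a given duration at a given rate."""
    return round(sample_rate * seconds)


def effective_sample_rate(width_px, seconds):
    """Real resolution of the source: original columns per second of audio."""
    return width_px / seconds


print(round(duration_seconds(18000, 44100), 3))  # 0.408
print(width_for_duration(44100, 2))              # 88200
print(effective_sample_rate(18000, 2))           # 9000.0
```

So an image 18000 pixels wide would need stretching to 88200 pixels to last two seconds at 44.1 kHz, and the stretch adds no detail beyond the original 9 kHz.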
