Creating Synesthetic Sound-Pictures with Generative AI

Pictures can be turned into sound recordings, and sound recordings can be turned into pictures.  That much is old news: transformations in both directions have long been used for creative and artistic purposes.  Some musicians like to hide images in audio tracks that become visible when they’re displayed as sound spectrograms, and some visual artists have presented sound spectrograms of existing sound recordings as wall art.

And yet nearly all past efforts of this kind have shared the same frustrating limitation.  Each case feels meaningful either as an image or as a sequence of sounds, but not as both at once.  Audio generated from random pictures tends not to sound pleasant or musical, and it doesn’t even have much appeal as “noise music,” since it all tends to sound alike, with surprisingly little novelty to its pitches or timbres or rhythms.  Meanwhile, sound spectrograms generated from random audio selections can be informative if you know how to read them, but their patterns just aren’t all that visually engaging.  And so we typically find ourselves faced either with unappealing sounds that correspond to appealing images or with unappealing pictures that correspond to appealing sounds.

But I don’t think it needs to be that way.  Check out the following examples.

Prompt waltz “beautiful princess with long hair wearing an elaborate colorful dress” (Lykon/DreamShaper, guidance 11) and 2× “steady beat” (riffusion/riffusion-model-v1, guidance 7), seed 293918, 20 steps, 768×512, saturation 1.  NB: All examples in this post use EulerDiscreteScheduler.


[Download]


Prompt waltz “ship sailing on a stormy sea with tall waves” (Lykon/DreamShaper, guidance 11) and 2× “steady beat” (riffusion/riffusion-model-v1, guidance 7), seed 3908, 20 steps, 768×512, saturation 1.5


[Download]


Prompt waltz “spectacular castle” (Lykon/DreamShaper, guidance 11) and 2× “steady beat” (riffusion/riffusion-model-v1, guidance 7), seed 3907, 20 steps, 512×512, saturation 1.5


[Download]


Prompt waltz “close-up portrait of a female face” (Lykon/DreamShaper, guidance 11) and 2× “steady beat” (riffusion/riffusion-model-v1, guidance 7), seed 3906, 20 steps, 512×512, saturation 1


[Download]


Prompt waltz “highly detailed illustration of a beautiful insect” (Lykon/DreamShaper, guidance 11) and 2× “steady beat” (riffusion/riffusion-model-v1, guidance 7), seed 3908, 20 steps, 512×512, saturation 1


[Download]


Prompt waltz “ancient tree with gnarled roots and leafy branches” (Lykon/DreamShaper, guidance 11) and 2× “steady beat” (riffusion/riffusion-model-v1, guidance 7), seed 3910, 20 steps, 512×512, saturation 1


[Download]


Prompt waltz “beautiful princess with long hair wearing an elaborate colorful dress” (Lykon/DreamShaper, guidance 11) and 2× “steady beat” (riffusion/riffusion-model-v1, guidance 7), seed 293198, 20 steps, 512×512, saturation 1.


[Download]


Prompt waltz “a group of people dancing merrily to lively music” (Lykon/DreamShaper, guidance 11) and 2× “steady beat” (riffusion/riffusion-model-v1, guidance 7), seed 73916, 20 steps, 512×512, saturation 1.


[Download]


Prompt waltz “pixel art of dragon flying in the sky” (dreamlike-art/dreamlike-diffusion-1.0, guidance 11) and 2× “steady beat” (riffusion/riffusion-model-v1, guidance 7), seed 905, 20 steps, 512×512, saturation 1


[Download]


Prompt waltz “close-up portrait of a female face” (SG161222/Realistic_Vision_V1.4, guidance 11) and 2× “steady beat” (riffusion/riffusion-model-v1, guidance 7), seed 1003, 20 steps, 512×512, saturation 1


[Download]


Prompt waltz “pixel art of a beautiful insect” (SG161222/Realistic_Vision_V1.4, guidance 11) and 2× “steady beat” (riffusion/riffusion-model-v1, guidance 7), seed 905, 20 steps, 512×512, saturation 1.


[Download]


Prompt waltz “beautiful princess with long hair wearing an elaborate colorful dress” (Lykon/DreamShaper, guidance 11) and 2× “steady beat” (riffusion/riffusion-model-v1, guidance 7), seed 77138497, 20 steps, 768×512, saturation 1.


[Download]


If you download the audio files linked above and open them in grayscale spectrographic view using the sound editing program of your choice, you should be able to see the corresponding pictures, although not in color.  The images below are screenshots taken from Audacity.


How It’s Done

I’d been toying for a while with the idea of creating sound-pictures that would engage the senses of sight and hearing equally.  My goal wasn’t to insert a picture into an audio track that remained primarily a picture, or to make a picture from an audio track that remained primarily an audio track, but instead to produce a hybrid artwork that was truly both things at once, thoroughly interwoven and inseparable: a synesthetic sound-picture.

At first I had only a vague sense of what such sound-pictures might look like and sound like, with no idea how anyone could go about actually making one.  Then along came Stable Diffusion: a form of generative AI that’s designed to take a natural-language prompt, such as “cat riding a unicycle,” and create an image to match it.  And out of the many models contrived for use with Stable Diffusion, one branch—associated with the name Riffusion—has been trained to generate sound spectrograms that, when played as audio, sound like whatever you specify in the prompt rather than looking like it as usual.  Those two developments gave me the raw ingredients I needed to make my idea a reality, but the result I was looking for still required a bit of creative cooking.

By themselves, Riffusion models—including the Riffusion v1 model on which I’ll be focusing here—aren’t equipped to create sound spectrograms that can also be steered in some direction as visual images.  But we’re not limited to using a Riffusion model on its own.  We can also combine it with an “ordinary” model which—for want of a better term—I’ll refer to here as “pictorial,” meaning that it aims to produce pictures that look like what’s specified in a prompt.

In the world of Stable Diffusion, it’s become a common practice to merge two or more pictorial models into new, composite models.  However, I was pretty sure that merging Riffusion with a pictorial model in that way would only create confusion, so I decided to try a different approach.  Stable Diffusion splits up its processing into multiple inference steps, and another fairly widespread technique involves alternating between two prompts from step to step to produce a result that’s responsive to both of them, often morphing the subjects together in some interesting or surprising way.  Here, for example, is an image generated using the prompt “spaceship” for odd-numbered steps and the prompt “castle” for even-numbered steps across twenty total steps.

Prompts “castle” (even steps) and “spaceship” (odd steps), seed 2154752413, 20 steps total, guidance 9, Lykon/DreamShaper, DPMSolverMultistepScheduler.

I couldn’t find any evidence that anyone had ever tried alternating from step to step between different models in the same way, but I was able to write some code to pull this off, and it ended up working perfectly well.  I blogged about the technique of model-switching back in July 2023, but what I didn’t mention at the time was that my main motive for working it out had been to juxtapose Riffusion with other models in pursuit of my quest to generate hybrid sound-pictures.
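For what it’s worth, here is a minimal sketch of how such model-switching can be wired up with the diffusers library: a hand-rolled denoising loop in which two Stable Diffusion v1-style pipelines share one scheduler and one latent tensor, with the UNet and prompt embedding chosen per step.  The prompts, seed, and settings are placeholders borrowed from the captions above, and this isn’t a verbatim excerpt from my own code, just a stripped-down illustration of the idea.

```python
import torch
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler

device = "cuda"
pic = StableDiffusionPipeline.from_pretrained("Lykon/DreamShaper",
                                              torch_dtype=torch.float16).to(device)
rif = StableDiffusionPipeline.from_pretrained("riffusion/riffusion-model-v1",
                                              torch_dtype=torch.float16).to(device)

# Both checkpoints are Stable Diffusion v1-style models, so they can share a
# single scheduler and a single latent tensor throughout the denoising run.
scheduler = EulerDiscreteScheduler.from_config(pic.scheduler.config)

def embed(pipe, prompt):
    """Return [unconditional, conditional] text embeddings for classifier-free guidance."""
    ids = pipe.tokenizer(["", prompt], padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         truncation=True, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        return pipe.text_encoder(ids)[0]

pic_emb = embed(pic, "spectacular castle")    # pictorial prompt
rif_emb = embed(rif, "steady beat")           # Riffusion prompt

steps, height, width, guidance = 20, 512, 512, 9.0
scheduler.set_timesteps(steps, device=device)
latents = torch.randn((1, pic.unet.config.in_channels, height // 8, width // 8),
                      generator=torch.manual_seed(3907))
latents = latents.to(device, torch.float16) * scheduler.init_noise_sigma

for i, t in enumerate(scheduler.timesteps):
    use_pic = (i % 2 == 0)                    # simple 1:1 alternation between models
    unet = pic.unet if use_pic else rif.unet
    emb = pic_emb if use_pic else rif_emb

    latent_in = scheduler.scale_model_input(torch.cat([latents] * 2), t)
    with torch.no_grad():
        noise = unet(latent_in, t, encoder_hidden_states=emb).sample
    uncond, cond = noise.chunk(2)
    noise = uncond + guidance * (cond - uncond)
    latents = scheduler.step(noise, t, latents).prev_sample

with torch.no_grad():
    image = pic.vae.decode(latents / pic.vae.config.scaling_factor).sample  # RGB tensor in [-1, 1]
```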

Since then I’ve been experimenting further, off and on, in an effort to figure out which specific arrangements work for sound-picture generation and which don’t.

First, I found that alternating between Riffusion and a pictorial model one step at a time produced results that were too pictorial and not spectrogram-like enough.  So I decided to try alternating between one step of a pictorial model and two steps of Riffusion, and that turned out to work significantly better.  You might think of the process as a waltz: one, two, three, one, two, three, one, two, three.  Other rhythms show potential too, such as one, two, three, one, two—but for the moment I’ll stick with the waltz, if only to avoid juggling too many variables at once.  Sometimes I use the same guidance scale for all steps, typically 9, but it appears that raising the guidance scale for the pictorial steps (say, to 11) and lowering it for the Riffusion steps (say, to 7) can draw out the pictorial aspect at the expense of the spectrographic aspect, while adjusting the settings in the opposite direction has the opposite effect.
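In terms of the sketch above, only the per-step selection logic needs to change to get the waltz and the split guidance scales—something like the following (again just a sketch, not my exact code):

```python
def step_config(i):
    """Waltz pattern: one pictorial step, then two Riffusion steps, repeating.
    Pictorial steps get a higher guidance scale to draw out the imagery."""
    if i % 3 == 0:
        return pic.unet, pic_emb, 11.0
    return rif.unet, rif_emb, 7.0

# Inside the denoising loop:
#     unet, emb, guidance = step_config(i)
```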

I started out using a single prompt for both Riffusion and the pictorial model, just because that was easier, but I soon switched to submitting different prompts to the two models, steering the spectrogram in one direction and the visual image in another, unrelated one.  For Riffusion, one of my go-to prompts (and the one I’ve used for most of the examples here) has been “steady beat”—a phrase that’s meant to be fairly neutral while still nudging results towards something rhythmic and loopable.  Meanwhile, anything is fair game for the pictorial prompt, although I’ve hit upon a few measures that seem to make a difference—not necessarily always a desirable difference, mind you, but perhaps at least an informative one.  For example, curved lines tend to sound comparatively non-musical, manifesting themselves to the ear as higher-frequency pitch sweeps.  That led me to try adding “pixel art” to my prompts, which not only helped straighten out horizontal lines, and hence pitches (without necessarily making them more musically appropriate), but also affected other facets of the images in ways I wouldn’t have expected, such as boosting contrasts of coloration.

The algorithm the creators of Riffusion—Seth Forsgren and Hayk Martiros—have put forward for playing Riffusion images is based, I assume, on the principle of converting them into audio using the same parameters that were used to encode the spectrograms on which the model was originally trained.  That’s correct in theory, I suppose, but I think there’s some room for improvement in practice.  Consider the treatment of the three different color channels: red, green, and blue.  As far as I’m aware, the original training dataset for Riffusion consisted entirely of grayscale spectrograms, each of which should have contained three identical channels.  Forsgren and Martiros seem to have assumed there would be no meaningful differences among color channels in generated spectrogram images either, so their approach has been to take only the first color channel (red) and to generate monophonic audio from it.  But my hybrid sound-pictures often show traces of color introduced by whatever pictorial model I’m throwing into the mix, and I wanted to find some way of factoring those colors into the audio as well.  The most obvious solution I could think of was to generate separate audio tracks from each color channel and then to combine these tracks in stereo: specifically, mapping red to the left channel and green and/or blue to the right channel.  To my pleasant surprise, I found that this produced a nicely immersive stereo effect.  Moreover, I found that playing ordinary Riffusion images in the same way produces the same effect: it seems there are differences among the three color channels even if these aren’t apparent to the eye.  Allow me to demonstrate.  At the end of an earlier blog post, I presented the following extended piece of monophonic Riffusion-generated audio.


[Download]


Now here’s that same piece rendered in stereo from the same set of spectrographic images.


[Download]


Of course we’re dealing here with entirely “fake” stereo, in the sense that Riffusion wasn’t trained on stereo source material (although Forsgren and Martiros since seem to have looked into this).  But hey—if it works, it works!  And if we can get stereo out of Riffusion with scarcely any extra work, why not take advantage of the fact?  My preference, really, would be to play RGB images in three-track stereo, perhaps from three speakers arranged in a triangle—but I’ll leave that prospect for another time.
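For anyone who wants to try this channel-splitting playback at home, here’s a rough sketch of the red-to-left, green-to-right mapping.  The mel parameters and dB scaling below are stand-ins rather than Riffusion’s actual encoding settings, so treat it as an illustration of the idea, not a drop-in replacement for the official playback code.

```python
import numpy as np
import librosa
import soundfile as sf
from PIL import Image

SR, N_FFT, HOP = 44100, 4096, 512     # assumed parameters, not Riffusion's exact settings

def channel_to_audio(channel):
    """Treat one 8-bit image channel as a mel spectrogram and invert it with Griffin-Lim."""
    mel_db = np.flipud(channel.astype(np.float32)) / 255.0 * 80.0 - 80.0  # 0..255 -> -80..0 dB (assumed)
    mel = librosa.db_to_power(mel_db)
    return librosa.feature.inverse.mel_to_audio(mel, sr=SR, n_fft=N_FFT, hop_length=HOP)

img = np.array(Image.open("sound_picture.png").convert("RGB"))
left = channel_to_audio(img[:, :, 0])                  # red channel -> left
right = channel_to_audio(img[:, :, 1])                 # green channel -> right (or average green and blue)
stereo = np.stack([left, right], axis=1)               # (samples, 2)
sf.write("sound_picture_stereo.wav", stereo, SR)
```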

Now that color has taken on new importance in my image-to-sound scheme, I need to acknowledge one shortcoming of the generative method I’ve described so far: namely, the raw results look awfully washed-out.

An unaltered original.

I’ve tried a few experiments with an eye towards mitigating this problem as part of the Stable Diffusion processing itself, but without any very promising results as yet.  Fortunately, it’s easy to boost color saturation after the images have been generated.  As a working expedient, I’ve been using PhotoGIMP with the saturation scale set to 10 (as high as it wants to go) and opacity set to 50.  Sometimes that setting still isn’t enough, so I run a second adjustment with the saturation scale set to 5 (noted in my captions as “saturation 1.5” as opposed to “saturation 1”).  The “saturated” results look more colorful, of course, and as an added bonus the extra color also enhances the stereo effect we experience when listening to them.  Even so, the range of colors remains limited, and I’m hoping to find a better solution in the fullness of time.
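For a scriptable alternative to the GIMP adjustment, Pillow’s ImageEnhance module can apply a comparable saturation boost; the factor below is just a guess, not a calibrated match for the “saturation 1” and “saturation 1.5” settings in my captions.

```python
from PIL import Image, ImageEnhance

img = Image.open("sound_picture.png").convert("RGB")
boosted = ImageEnhance.Color(img).enhance(2.5)   # 1.0 = unchanged; higher = more saturated
boosted.save("sound_picture_saturated.png")
```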


Further Developments

Riffusion generates segments of audio that are each just a few seconds long, so to get a piece of music that’s good for something besides a ringtone, we need to join multiple segments together.  But there’s a problem, as you may have noticed while listening to the examples presented above: segments sometimes come out with an even number of bars, but much of the time they don’t, such that if we loop them we’ll hear a jarring hiccup at the join.  And even when a segment loops satisfactorily, a single endlessly repeating loop can get monotonous pretty quickly, so ideally we also want to find some way of causing music to vary while still keeping it coherent.

Forsgren and Martiros came up with working solutions to these two challenges, but I don’t find them particularly satisfying.  Here I’m going to consider only the looping issue, saving the variation issue for a future post.  Their solution in that case was to start with a set of existing “seed” spectrograms corresponding to musical loops and to generate new spectrograms from these in Image2Image mode—the mode where you start with an existing image rather than with random noise.  I don’t like that approach because it necessarily reduces the diversity of musical outcomes the model could otherwise contrive.  I’d rather find a method that’s less constraining.

The looping strategies I’ve been exploring myself fall into two categories: (1) those that center on taking segments that don’t loop naturally and excerpting loops from them and (2) those that center on getting Stable Diffusion to generate segments that loop from the start.

My current favored strategy in category one runs as follows (a rough code sketch appears after the list):

  1. Starting with n=150 (or so) and continuing through all ascending values possible for a given image width, search for the best-correlated pair of segments excerpted respectively from 0:n and n:2*n along the image’s horizontal axis.  The winning value of n gives us the optimal looping width.  (I’ve more recently started doing this with latents rather than with output images, starting with n=20, which seems to work even better.)
  2. Take the best matching pair of excerpts and find the best correlated pair of corresponding columns within them.  This gives us the optimal looping point, which can be handled as an offset.
  3. Create a loop between the two columns identified in step 2—let’s call their horizontal positions a and b.  Then, for purposes of demonstration and evaluation, concatenate segments 0:b,a:b,a:w, where w is the original width.
  4. If the loop seems too short, maybe leaving out part of a bar, try raising the starting value of n.  Or if it seems too long, instead try lowering the starting value of n.
  5. If the looping point seems wrong—maybe it feels like it should come a beat earlier or later—measure the number of columns in the image needed to nudge it forward or backward to a better looping point and add or subtract that amount from both a and b.
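Here’s a rough numpy sketch of steps 1 through 3, working directly on image columns (as noted above, doing the same thing on latents seems to work even better); the function and variable names are illustrative rather than taken from my actual code.

```python
import numpy as np

def find_loop(img, n_min=150):
    """Steps 1-2: find the loop width n and column offset a.
    `img` is an array whose last axis is the horizontal (time) axis,
    e.g. a grayscale image (H, W) or a channels-first image (C, H, W)."""
    cols = img.reshape(-1, img.shape[-1]).astype(np.float32)   # (features, width)
    width = cols.shape[1]

    def corr(x, y):
        x, y = x - x.mean(), y - y.mean()
        return float((x * y).sum() / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-8))

    # Step 1: the n whose segments 0:n and n:2n correlate best is the loop width.
    best_n = max(range(n_min, width // 2 + 1),
                 key=lambda n: corr(cols[:, :n].ravel(), cols[:, n:2 * n].ravel()))

    # Step 2: the best-correlated pair of corresponding columns gives the offset.
    best_a = max(range(best_n), key=lambda j: corr(cols[:, j], cols[:, j + best_n]))
    return best_n, best_a

def demo_concat(img, n, a):
    """Step 3: concatenate columns 0:b, a:b, a:w (with b = a + n) for evaluation."""
    b = a + n
    return np.concatenate([img[..., :b], img[..., a:b], img[..., a:]], axis=-1)
```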

Here are a few examples prepared in this way:

Prompt waltz “kittens playing with a ball of string” (Lykon/DreamShaper, guidance 9) and 2× “steady beat” (riffusion/riffusion-model-v1, guidance 9), seed 5001, 20 steps, 512×512, saturation 1.  Cycle width 414, cycle offset 26.


[Download]


Prompt waltz “portrait of santa claus smiling” (Lykon/DreamShaper, guidance 9) and 2× “steady beat” (riffusion/riffusion-model-v1, guidance 9), seed 5002, 20 steps, 512×512, saturation 1.  Cycle width 400, cycle offset 34.


[Download]


Prompt waltz “galloping horses” (Lykon/DreamShaper, guidance 9) and 2× “steady beat” (riffusion/riffusion-model-v1, guidance 9), seed 5002, 20 steps, 512×512, saturation 1.  Cycle width 398, cycle offset 87.


[Download]


Prompt waltz “galloping horses” (Lykon/DreamShaper, guidance 9) and 2× “steady beat” (riffusion/riffusion-model-v1, guidance 9), seed 5004, 20 steps, 512×512, saturation 1.  Cycle width 392, cycle offset 102.


[Download]


Prompt waltz “colorful patchwork quilt” (Lykon/DreamShaper, guidance 9) and 2× “steady beat” (riffusion/riffusion-model-v1, guidance 9), seed 5008, 20 steps, 512×512, saturation 1.  Cycle width 404, cycle offset 52.


[Download]


Prompt waltz “colorful patchwork quilt” (Lykon/DreamShaper, guidance 11) and 2× “steady beat” (riffusion/riffusion-model-v1, guidance 7), seed 5000, 20 steps, 512×512, saturation 1.  Cycle width 418, cycle offset 51.


[Download]


Prompt waltz “snowy scene of house with colorful christmas lights and decorations” (Lykon/DreamShaper, guidance 9) and 2× “steady beat” (riffusion/riffusion-model-v1, guidance 9), seed 5000, 20 steps, 512×512, saturation 1.  Cycle width 41, cycle offset 53.


[Download]


My current favored method in category two, although I have no idea why it works as it does, is to shift the latent horizontally between inference steps.  That is, after each round of denoising, we shift all the columns in the latent n places to the right or left—relocating columns to the opposite side as needed—before beginning the next round.  Then, at the end, we shift the columns back to their original positions, just to keep things tidy.  For what it’s worth, the same technique can be used to produce pictorial images for tiling as well.  Here are some specimens of sound-pictures generated in this other way:

Prompt waltz “snowy scene of house with colorful christmas lights and decorations” (Lykon/DreamShaper, guidance 11) and 2× “fast steady beat exotic ethnic hard rock folk metal with haunting vocal harmonies” (riffusion/riffusion-model-v1, guidance 7), seed 56, 20 steps, 512×512, saturation 1, latent shifted 5 columns to the left per step.


[Download]


Prompt waltz “bottle of wine, loaf of bread, and vase of flowers, on a patterned tablecloth, with a dark background” (Lykon/DreamShaper, guidance 9) and 2× “midi synthesizer chords and melody” (riffusion/riffusion-model-v1, guidance 9), seed 19712, 20 steps, 512×512, saturation 1, latent shifted 5 columns to the left per step.


[Download]


Prompt waltz “galloping horses” (Lykon/DreamShaper, guidance 11) and 2× “fast steady beat exotic ethnic hard rock folk metal with haunting vocal harmonies” (riffusion/riffusion-model-v1, guidance 7), seed 51, 20 steps, 512×512, saturation 1, latent shifted 5 columns to the left per step.


[Download]


The horizontal-latent-shifting method I’ve just described doesn’t guarantee a steady 4/4 beat or anything like that.  A stumbly extra beat frequently gets thrown in somewhere.  But the music adapts “intelligently” to this, working around the irregularity, and as far as my own musical tastes go, the unusual rhythms that often result are really more of a feature than a bug.
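As for how the shift itself might be implemented: spliced into the manual denoising loop sketched earlier, it could look roughly like this (the five-column shift matches the captions above; exactly where in the loop the roll belongs is a judgment call).

```python
SHIFT = 5          # latent columns to shift per step
total_shift = 0

for i, t in enumerate(scheduler.timesteps):
    # ... model selection, CFG, and latents = scheduler.step(...).prev_sample,
    #     exactly as in the earlier sketch ...

    # Roll the latent columns to the left, wrapping the displaced columns
    # around to the right-hand side, before the next round of denoising.
    latents = torch.roll(latents, shifts=-SHIFT, dims=-1)
    total_shift += SHIFT

# Afterwards, undo the accumulated shift so the columns return to their
# original positions before decoding.
latents = torch.roll(latents, shifts=total_shift, dims=-1)
```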

With both of the strategies I’ve described, in order to produce absolutely seamless joins between segments, I’ve found that it’s necessary to concatenate the latents before VAE decoding, as opposed to concatenating the decoded images afterwards.  But for audio we then run into another problem: if we apply the Griffin-Lim algorithm to spectrographic images that are wide enough to yield, say, a minute’s worth of audio all at once, the quality of the audio suffers, becoming increasingly warbly and tinny.  I think I’ve come up with a decent working solution, but I’m going to save it for another, future post.
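The latent-side concatenation itself is the easy part—something like the following, where the segment latents and the pipeline object are stand-ins for whatever the generation code produced:

```python
import torch

# latents_a, latents_b, latents_c: (1, 4, H/8, W/8) latents of individual segments.
# Concatenating along the width axis and decoding once keeps every join inside
# a single VAE decode rather than at an image boundary.
joined = torch.cat([latents_a, latents_b, latents_c], dim=-1)
with torch.no_grad():
    image = pipe.vae.decode(joined / pipe.vae.config.scaling_factor).sample
```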


Reflection

It was back in June that I first figured out how to create the kind of synesthetic sound images I’ve been describing here.  For whatever historic interest it might have, here’s the result of my first reasonably successful experiment—timestamped June 29, 2023, at 1:01 PM—in its original state as to both image and sound:

Prompt waltz “portrait of a girl nvinkpunk” (Envvi/Inkpunk-Diffusion, guidance 9) and 2× “portrait of a girl” (riffusion/riffusion-model-v1, guidance 9), seed 375611258, 20 steps.


[Download]


I set the idea aside for a while to work on other things but returned to it in late November, and all the other examples shared above have been created over the past month (as of the time of posting).

While I was preparing to share my results, I decided I ought to check to see whether anyone else had been doing anything similar, and that’s how I discovered the wonderful work of Gil Assayas, a.k.a. GLASYS, which it would be remiss of me not to mention—and which it would be remiss of you not to check out.  For some time past, GLASYS has been composing clever pieces of music that look like things when viewed in MIDI piano-roll format, and in February 2023 he expanded on this background to create a spectrogram in the same spirit, which he calls “Spectro Dragon”:

I know some artists have hidden images in their music before using tools that convert photos to frequencies, but those just sound like random noises and I wanted this to sound like music. In other words, I couldn’t rely on any “tools” and I created this from scratch (just like my MIDI art) using my experience in composition, counterpoint and music theory – as well as a healthy dose of trial and error to get the desired shapes.

I hope the examples I’ve shared here have demonstrated that “tools” can in fact be used to produce results that sound musical, at least to a point.  Still, the connection between pictorial and musical form is probably easier to recognize in GLASYS’s piece than it is in my AI-generated sound-pictures: the dragon’s head sounds like this, its tail sounds like this, and so on.  If so, I’d say that’s partly because of the human ingenuity that went into making it, but partly also because it’s rooted more in the world of musical notation—to which MIDI belongs—than in the world of audio per se with all its spectral nuances.  As GLASYS writes:

The types of sounds I used were very important – they had to be fairly dark, with few harmonics (especially on the low end) because brighter sounds have many more harmonics that show up in the spectrogram and obscure the drawing. The biggest challenge here was finding sounds that worked (without limiting myself only to boring sine waves).

The choice of display parameters plays a role here as well.  If we convert the audio of “Spectro Dragon” into the same spectrographic format used by Riffusion, the dragon ends up squeezed into the bottom half of the image, while the upper half contains a confusion of multiple harmonic echoes of its shape.

This is an area in which I think my technique might have an edge, since it can incorporate harmonics meaningfully into its sound-pictures rather than simply avoiding them.

That’s not to say that the elements in my images always cohere audio-visually as much as I’d like, though.  Steady rhythms reliably correspond to regular vertical patterns, but otherwise it seems the most musically significant details tend to cluster near the bottom of images and not to contribute all that much to the visual aspect of things, while the most visually significant details tend to appear towards the top and not to contribute all that much to the music.  When I judge how well a result has come out, one of my main criteria is how much crossover there is between the musically and visually significant details.

Eventually I’d like to put together a longer piece of program music that tells an extended story through interwoven images and sounds like these—maybe a fantasy adventure story with dragons and princesses and castles and so forth, which I think would lend itself nicely to the technique.  But for now I hope your eyes and ears have enjoyed these introductory examples.  It’s a fun little trick, and one that actually takes advantage of Riffusion’s connection to images rather than treating that connection as a mere technical curiosity.


Thanks for stopping by!  If you’ve enjoyed this post as much as you’d enjoy a cup of coffee from your coffeehouse of choice, please consider supporting the work of Griffonage-Dot-Com by leaving a proportionate tip.

