One Strategy for Generating Extended Pieces of Music with AI

Let’s ring in the new year with some more AI-generated music—a whole album’s worth this time (72:54).  Each of the thirty selections presented below begins and ends rather abruptly, and the music often has little obvious connection with the text prompt I used to generate it.  But neither of those issues gets in the way of what I’ve been trying to experiment with here: namely, a new strategy for generating extended pieces of music.  AI tends to be able to mimic short-term musical structure decently well, but not longer-term musical structure, resulting in what one composer has called “unstructured musical babbling.”  That’s not necessarily a fatal flaw: the music of wind chimes lacks structure too, and my own feeling is that AI-generated music is worthwhile only if and when it differs somehow from music a human being would or could otherwise have composed.  But I’m still interested in getting it to babble more engagingly and hold the attention better over time, much as someone might try to design an improved wind chime.  I’ll spell out the particulars of my strategy further below; for now, I just invite you to listen.


1. Prompt “medieval celtic harp flute fiddle”, seed 17, width 512, guidance 9, sequence B; consistent parameters for all tracks: 10 inference steps, EulerDiscreteScheduler [Download]



2. Prompt “fast steady beat exotic ethnic hard rock folk metal with haunting vocal melodies”, seed 0, width 496, guidance 6, sequence B [Download]



3. Prompt “uptempo disco dance music”, seed 54312, width 512, guidance 7, sequence A [Download]



4. Prompt “fast steady beat exotic ethnic hard rock folk metal with haunting vocal melodies”, seed 51, width 512, guidance 6, sequence B [Download]



5. Prompt “fast rock guitar chords classic melody harmony”, seed 1004, width 512, guidance 9, sequence A [Download]



6. Prompt “guitar harmonies enchanting female vocal”, seed 111, width 512, guidance 7, sequence A [Download]



7. Prompt “fast steady beat exotic ethnic hard rock folk metal with haunting vocal melodies”, seed 56, width 512, guidance 6, sequence B [Download]



8. Prompt “fast rock guitar chords classic melody harmony”, seed 1018, width 512, guidance 9, sequence A [Download]



9. Prompt “guitar harmonies enchanting female vocal”, seed 104, width 512, guidance 7, sequence A [Download]



10. Prompt “catchy bass groove with percussion and melodic chimes”, seed 33302, width 512, guidance 9, sequence A [Download]



11. Prompt “fast rock guitar chords classic melody harmony”, seed 1043, width 512, guidance 9, sequence A [Download]



12. Prompt “guitar harmonies enchanting female vocal”, seed 114, width 512, guidance 7, sequence A [Download]



13. Prompt “fast steady beat exotic ethnic hard rock folk metal with haunting vocal melodies”, seed 38, width 512, guidance 6, sequence B [Download]



14. Prompt “lively folk song played on concertina” (note: Riffusion v1 clearly doesn’t know what a concertina is!), seed 54306, width 512, guidance 7, sequence A [Download]



15. Prompt “fast rock guitar chords classic melody harmony”, seed 1016, width 512, guidance 9, sequence A [Download]



16. Prompt “medieval celtic harp flute fiddle”, seed 0, width 512, guidance 9, sequence B [Download]



17. Prompt “catchy bass groove with percussion and melodic chimes”, seed 33306, width 512, guidance 9, sequence A [Download]



18. Prompt “fast rock guitar chords classic melody harmony”, seed 1117, width 512, guidance 9, sequence A [Download]



19. Prompt “guitar harmonies enchanting female vocal”, seed 103, width 512, guidance 7, sequence A [Download]



20. Prompt “fast rock guitar chords classic melody harmony”, seed 1114, width 512, guidance 9, sequence A [Download]



21. Prompt “fast steady beat exotic ethnic hard rock folk metal with haunting vocal melodies”, seed 0, width 512, guidance 6, sequence B [Download]



22. Prompt “synthesizer organ chords with percussion and bells”, seed 33306, width 512, guidance 7, sequence A [Download]



23. Prompt “celtic traditional music”, seed 93313, width 512, guidance 7, sequence A [Download]



24. Prompt “fast steady beat exotic ethnic hard rock folk metal with haunting vocal melodies”, seed 27, width 512, guidance 6, sequence B [Download]



25. Prompt “disco funk rock”, seed 93310, width 768, guidance 8, sequence A [Download]



26. Prompt “fast rock guitar chords classic melody harmony”, seed 1062, width 512, guidance 9, sequence A [Download]



27. Prompt “guitar harmonies enchanting female vocal”, seed 101, width 512, guidance 7, sequence A [Download]



28. Prompt “medieval celtic harp flute fiddle”, seed 16, width 512, guidance 9, sequence B [Download]



29. Prompt “catchy bass groove with percussion and melodic chimes”, seed 33301, width 512, guidance 9, sequence A [Download]



30. Prompt “disco funk rock”, seed 93315, width 768, guidance 8, sequence A [Download]


The strategy I used to generate these examples should be applicable to any and all Riffusion-type models: that is to say, Stable Diffusion models that have been trained to generate sound-spectrographic images rather than “pictorial” ones.  I see that “Riffusion” has now been registered as a trademark, so I suppose we’ll eventually need some other generic name for this category of model—image-domain generative spectrophony, or whatever.
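
For anyone curious about what the generation step itself looks like, here is a minimal sketch using the diffusers library and the publicly posted riffusion/riffusion-model-v1 checkpoint, plugging in the per-track settings listed above (ten inference steps, the EulerDiscreteScheduler, and a given prompt, seed, width, and guidance scale).  It yields a spectrogram image that still has to be rendered into audio; take it as an illustration of the kind of call involved rather than as my exact working code.

```python
import torch
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler

# Load a Riffusion-type checkpoint (Stable Diffusion trained on spectrograms).
pipe = StableDiffusionPipeline.from_pretrained(
    "riffusion/riffusion-model-v1", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)

# Settings corresponding to, e.g., track 3 above: seed 54312, width 512,
# guidance 7, plus the shared 10 inference steps.
generator = torch.Generator(device="cuda").manual_seed(54312)
image = pipe(
    prompt="uptempo disco dance music",
    height=512,                  # 512 pixels tall -> a 64-row latent
    width=512,
    num_inference_steps=10,
    guidance_scale=7,
    generator=generator,
).images[0]
image.save("clip_spectrogram.png")   # this image still needs to be turned into audio
```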

One problem with models of this type (whatever we end up calling them in the long run) is that they can only generate a few seconds’ worth of audio at a time.  In order to create a piece of music that’s good for something other than a ringtone, it’s necessary to join together a sequence of multiple clips that are both consistent enough among themselves to be musically coherent and varied enough to be interesting.  The question is how to prepare clips that fit those criteria.

The method initially proposed by the creators of Riffusion was to generate clips based on incremental positions along the path in latent space that lies between two text prompts or two seeds.  But the audio examples they provided at the time have since vanished from their site (although I managed to salvage one from YouTube), and as of this writing, Riffusion.com is limited to generating unconnected twelve-second clips like this one:


[Download]
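
For reference, the path-based approach they proposed boils down to something like the sketch below: walk along a path between two latents (or two prompt embeddings) and denoise each intermediate point into its own clip.  This is my own reconstruction of the seed-to-seed case using the standard spherical-interpolation trick, not the Riffusion team's code, and the function name is mine.

```python
import torch

def slerp(t: float, v0: torch.Tensor, v1: torch.Tensor) -> torch.Tensor:
    """Spherical interpolation between two latent tensors -- the usual way of
    moving smoothly between two seeds in Stable Diffusion's latent space.
    (Assumes the two latents aren't identical.)"""
    a, b = v0.flatten(), v1.flatten()
    dot = torch.sum(a * b) / (a.norm() * b.norm())
    theta = torch.acos(dot.clamp(-1.0, 1.0))
    return (torch.sin((1 - t) * theta) * v0 + torch.sin(t * theta) * v1) / torch.sin(theta)

# Two starting latents (64 x 64 for a 512 x 512 output), then ten intermediate
# latents to denoise one by one into consecutive clips.
latent_a = torch.randn((1, 4, 64, 64), generator=torch.Generator().manual_seed(0))
latent_b = torch.randn((1, 4, 64, 64), generator=torch.Generator().manual_seed(1))
path = [slerp(i / 9, latent_a, latent_b) for i in range(10)]
```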

In the meantime, I’ve been experimenting myself with varying clips in other ways to produce long-form audio, for example by making incremental numerical adjustments to prompt embeddings, as in the following example—originally from this post, as modified here:


[Download]
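
In code terms, that kind of incremental adjustment amounts to handing the pipeline a slightly perturbed set of prompt embeddings for each successive clip.  Here is a rough sketch, reusing the pipe object from the snippet further above; the helper name and the particular blend schedule are only illustrative, and I keep the seed fixed so that only the embedding changes from clip to clip.

```python
import torch

def encode_prompt(pipe, prompt: str) -> torch.Tensor:
    """Turn a text prompt into the embedding tensor the pipeline conditions on."""
    tokens = pipe.tokenizer(prompt, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            truncation=True, return_tensors="pt")
    with torch.no_grad():
        return pipe.text_encoder(tokens.input_ids.to(pipe.device))[0]

emb_a = encode_prompt(pipe, "medieval celtic harp flute fiddle")
emb_b = encode_prompt(pipe, "uptempo disco dance music")

clips = []
for step in range(8):
    t = step / 7                               # creep from 0.0 to 1.0 in small increments
    blended = (1 - t) * emb_a + t * emb_b      # nudge the embedding a little further each time
    clips.append(pipe(prompt_embeds=blended, width=512,
                      num_inference_steps=10, guidance_scale=7,
                      generator=torch.Generator(device="cuda").manual_seed(17)
                      ).images[0])             # same seed every time: only the prompt side moves
```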

Morphing gradually between reference points in this way can produce decent results, and I like that last example quite a lot, even if it gets awfully repetitive at points.  But this approach comes with a built-in limitation: the resulting audio will necessarily evolve from one distinct thing into another.  It won’t work through a sequence of variations on a given theme, as we typically find in “real” human-composed music.  And so I set out to find some way of varying musical clips such that they’d circle continuously around a starting point without ever fully leaving it behind.

The Riffusion v1 model was trained on images that are 512 pixels high, which means that the latents it uses are 64 rows high, with those 64 rows corresponding to the spectrograms’ frequency axis.  I reasoned that if we were to swap around the positions of just a few of the rows in a latent before we denoised it to generate an output image, we ought to get a result that was just a little different from what we’d have gotten otherwise.  And I reasoned further that if we were to choose rows to reorder that correspond to those parts of the spectrogram that carry the melody or the bass line, we should be able to vary those aspects of the music selectively—if randomly—while leaving the rest mostly unchanged.  My hope was that the results would strike an appealing balance between continuity and variation in their melodies, harmonies, and rhythms, similar to what we tend to find in “real” human-composed music.
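
In tensor terms, the row-flipping looks something like this minimal sketch, assuming the usual (1, 4, 64, 64) latent shape for a 512×512 output; the function name is mine, and the (top, bottom) arguments follow the block-labelling convention I explain just below.

```python
import torch

def flip_rows(latent: torch.Tensor, top: int, bottom: int) -> torch.Tensor:
    """Return a copy of the latent with one horizontal band flipped upside down.

    `top` counts rows down from the top of the 64-row latent to the top of the
    band; `bottom` counts rows up from the bottom of the latent to the bottom
    of the band (the labelling convention described below)."""
    flipped = latent.clone()
    band = flipped[:, :, top : 64 - bottom, :]
    flipped[:, :, top : 64 - bottom, :] = torch.flip(band, dims=[2])
    return flipped

# e.g. flip the lowest ten rows (rows 54 through 63, the block labelled "54,0"):
base_latent = torch.randn((1, 4, 64, 64))
varied_latent = flip_rows(base_latent, top=54, bottom=0)
```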

After some trial and error, I settled on a process of selecting successive n-row blocks, starting at the bottom of the latent and working my way upward, and flipping these blocks upside down.  I start with a first clip generated from an unaltered latent, and then I move on to a second clip generated from that same latent with its lowest n rows flipped upside down, and then on again to a third clip generated from the same original latent with the same thing done to an n-row block positioned one row higher than before, proceeding from there to shift the block upward one row at a time for some arbitrary number of steps and clips.  For practical purposes, I measure blocks in terms of rows counted downward from the top of the latent to the top of the block and rows counted upward from the bottom of the latent to the bottom of the block, with rows numbered starting at 0.  For n=10, which I’ve been using as a default value, the first flipped block will be 54,0 (rows 54 through 63); the next will be 53,1 (rows 53 through 62); the next will be 52,2 (rows 52 through 61); and so on.  Each of the examples presented above faithfully follows one of two patterns:

  • Sequence A (consistent ten-row blocks): 54,0;53,1;52,2;51,3;50,4;49,5;48,6;47,7;46,8;45,9;44,10;43,11;42,12;41,13;40,14;39,15;38,16;37,17;36,18;35,19;34,20;33,21;32,22;31,23;30,24;29,25;28,26;27,27;26,28;25,29;24,30
  • Sequence B (shorter, and switches to twelve-row blocks for the last three clips): 54,0;53,1;52,2;51,3;50,4;49,5;48,6;47,7;46,8;45,9;44,10;43,11;42,12;41,11;40,12;39,13
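
Expressed as (top, bottom) pairs for the flip_rows() sketch above, the two schedules come out as follows; each pair flips a band of 64 − top − bottom rows (ten rows throughout Sequence A, widening to twelve for the last three clips of Sequence B).  The list names here are mine.

```python
# Sequence A: thirty-one ten-row blocks, 54,0 down to 24,30.
seq_a = [(54 - i, i) for i in range(31)]

# Sequence B: thirteen ten-row blocks (54,0 ... 42,12), then three twelve-row
# blocks (41,11; 40,12; 39,13).
seq_b = [(54 - i, i) for i in range(13)] + [(41 - j, 11 + j) for j in range(3)]

# One clip from the unaltered latent first, then one clip per flipped block,
# reusing base_latent and flip_rows() from the sketch above:
track_latents = [base_latent] + [flip_rows(base_latent, t, b) for t, b in seq_a]
```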

Sometimes you’ll notice that one clip diverges markedly from adjacent clips in its rhythm—often the bottom-most block (54,0)—causing it to stick out conspicuously and disrupt the flow of the music.  And sometimes we run into several adjacent clips with particularly weird harmonies.  In both cases, though, the music soon “recovers.”  And while I’ve presented all the sequentially generated clips in the main set of examples I’ve shared above, it would also be possible to edit any of these sequences by removing “bad” clips, repeating “good” clips, or rearranging things in any other desired way.  Here, for example, is an edited version of Track 20, whittled down to just ten select clips:


[Download]

I should mention that my examples also incorporate a few other changes I’ve made to the standard Riffusion algorithm.  To ensure that I get loopable clips, I shift the columns of the latent n places to the left at each inference step, and then I shift them back to their original positions at the end.  To generate stereo, I extract my left stereo channel from the red color channel and my right stereo channel from the green color channel.  Those two techniques are described in more detail here.  Finally, I redesigned the process of rendering latents into images, and images into audio, in an effort to minimize phase anomalies without overtaxing the Griffin-Lim algorithm.  (For anyone who might want the details: Running the Griffin-Lim algorithm on a longer spectrogram image produces tinny-sounding results—apparently increasing the length decreases the algorithm’s ability to resolve lower frequencies. But rendering the component latents and images into audio separately can produce audible “tics” at the joins because of discrepancies in phase.  So the process I’ve settled on for the moment entails concatenating the latents for successive clips together as latents, running the Griffin-Lim algorithm for phase estimation on images generated from successive 60-column chunks of latent with a 10-column overlap, transitioning between these image-mode clips wherever in the overlap their magnitudes are most similar, and then converting the whole concatenated complex-valued spectrogram into audio in one pass.  Most of the time this works pretty well.  The main artifact I’ve encountered so far is “ringing,” which tends to affect one clip but not adjacent ones, and which manifests itself visibly as an area of heightened contrast in the spectrogram.  It looks as though increasing the chunking width from its 60-column default might be able to lessen this, or at least shift it into different positions, but I haven’t done enough experimenting yet to work out any general rules.)  The results of this last method aren’t perfect, but they’ve been coming out better on average than anything else I’ve tried.
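
One piece of that chunked rendering process is simple enough to sketch on its own: choosing where inside the overlap to transition from one independently rendered chunk to the next.  The helpers below are only an illustration of the idea (the names are mine, and the arrays stand for whatever frequency-by-time magnitudes the overlapping renders produce); they pick the overlap column where the two renders agree most closely and splice there.  The same transition point can then be carried over to the corresponding phase estimates, so that the concatenated complex-valued spectrogram stays internally consistent before the final one-pass conversion to audio.

```python
import numpy as np

def best_transition(chunk_a: np.ndarray, chunk_b: np.ndarray, overlap: int) -> int:
    """Return the column within the overlap where two independently rendered
    spectrogram chunks (frequency x time arrays) agree most closely."""
    tail = chunk_a[:, -overlap:]     # final `overlap` columns of the earlier chunk
    head = chunk_b[:, :overlap]      # first `overlap` columns of the later chunk
    return int(np.argmin(np.abs(tail - head).mean(axis=0)))

def splice(chunks: list[np.ndarray], overlap: int) -> np.ndarray:
    """Concatenate overlapping chunks, switching from each one to the next at
    its best-matching overlap column."""
    out = chunks[0]
    for nxt in chunks[1:]:
        k = best_transition(out, nxt, overlap)
        out = np.concatenate([out[:, : out.shape[1] - overlap + k], nxt[:, k:]], axis=1)
    return out
```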

Anyhow, that’s how it works—hope you’ve enjoyed the concert!

P.S. (January 4, 2024): I changed the title of this post (originally “A Strategy for Generating Longer Pieces of Music with AI”) after it occurred to me that the earlier wording could be taken to mean that my goal here is simply to increase the length of AI-generated music, which it isn’t.  This is the first time I’ve ever revised the title of a Griffonage-Dot-Com post after it’s been published!  I’ll try not to make a habit of it.



2 thoughts on “One Strategy for Generating Extended Pieces of Music with AI”

  1. Your scientific articles, such as this one, seem to demonstrate a greater ability to use A.I. creatively than that of anyone else with an expressed interest in this field. And you’ve been writing for ages about Sound, and not solely in relation to A.I.

    I wonder whether there is any scope for using A.I. to improve the quality of recorded sound?

    You blogged, ages ago, about early gramophone recordings from the 19th Century. They were comprehensible, as pieces of speech or music, but they suffered a great deal of analogue distortion, hiss, and other unwanted damage. Some were not even recorded on wax or shellac, but as photographs reproduced on paper in a magazine or scientific journal.

    A.I. usually gets a bad press. But not here. And its processing power is certainly impressive. It might be used constructively to improve old recordings, perhaps by isolating and removing hiss, or by correcting distortion due to speed variations, maybe even by using SBR techniques to rebuild more of the original sound than acoustic or early electrical amplification was capable of capturing.

    A.I. as a means for composing new sounds is perhaps not as practical an application for the technology as repairing or restoring existing recordings, which seemed to be the original focus of this blog. A.I. as a composer of music has a good deal of competition, from human composers with a big head start on it. But the field of restoration of sound is pretty much wide open.

    Training an A.I. program to distinguish between a musical note and a high-frequency pattern of hiss would be a significant achievement. A pure musical note must express a specific frequency, whereas ‘noise’ or ‘hiss’ will just spread uniformly across all frequencies.

    But making any effective progress would require someone with your level of ability. A.I. research seems to be mostly conducted by people with no interest in the history or mechanics of recorded sound. It seems to me that the new technology is not proving especially useful when applied to text-to-speech processing, which appears to be the closest anyone has yet come to addressing Sound recording with it.

  2. Pingback: Introducing SyDNEy (Stable Diffusion Numerical Explorer) | Griffonage-Dot-Com

