Migrant Cloud (ابر مهاجر): A Persian Opera by Parsa Mirhaji

Challenges and Opportunities

Migrant Cloud is the first full musical album created for Persian-speaking audiences through human-AI collaboration: original Persian poetry rendered in Western operatic voice.

Four poems are transformed into eight distinct musical interpretations across the album. The texts draw on contemporary Persian imagery (mirrors of hope, migrant clouds, inner lighthouses), while the music commits to Western operatic production: sustained legato, dramatic dynamic arcs, and registers spanning bass-baritone to lyric soprano that carry the poetry through climax and dissolution.

Producing this album meant solving several problems: Farsi phonetics rendered through predominantly English-trained models, operatic delivery from models optimized for pop, semantic preservation in lyrics the AI could not understand, integration of lyric, vocal, and music within each track, and harmonic consistency across the album. Each track passed through hundreds of iterations until what emerged occupied genuinely new territory: an unprecedented collaboration between human artistic direction and generative AI at the intersection of traditions distinct in music, language, and culture.

While precedents exist for Iranian singers performing Western opera, for symphonic settings of classical Persian verse, and for hybrid fusion approaches, no substantial body of work has attempted bel canto settings of original Farsi poetry composed specifically for opera. Migrant Cloud treats Farsi as if it were an operatic language, shaping line, tessitura, and phrasing according to Western vocal pedagogy rather than Persian avaz conventions. The innovation lies not in technical accomplishment alone, though navigating Farsi phonetics within bel canto idioms is itself significant, but in forging a new relationship between Persian language and Western musical form. This is territory without maps: the excitement of discovery accompanied by the vulnerability of disapproval from native Farsi-speaking listeners.

The work’s contribution will depend partly on reception and partly on whether it inspires others to explore further, potentially establishing a tradition of Farsi operatic composition. Regardless, the album documents an act of artistic conviction: work that speaks authentically from the intersection of multiple cultural inheritances.

Migrant Cloud delivers bel canto idioms authentically, achieving 89% fidelity to Italian operatic standards while Persian phonetics survive largely intact, remaining legible and recognizable to Farsi speakers. Bel canto, characterized by seamless legato phrasing, even vibrato, unified registers, and pure vowel projection, dominates, with 95% consistent legato across 400 sampled note transitions from each track:

“Mirage” projects fluid legato throughout 10 syllables (tǒ kǒjâyi ey âyene-ye ǒmīd) sung as a continuous arc, with chiaroscuro balance and a seamless F4 -> G5 climb in (be har sū ke mī-negaram / tǒ nīstī), without register flip. “Lighthouse Within” demonstrates sustained legato arcs in (dar in masir-e bī-kūrsū), maintaining smooth phrasing over 12 syllables without audible breaks. “Will You Stay” tests endurance through 8 repetitions of (tǒ bemân va dar in dasht bebâr), preserving 92% vibrato uniformity up to A5. “Ignite Me” unifies registers in rising (ey eshgh-e azali) refrains, navigating E4-F#4 passaggi seamlessly.

The album applies operatic vocal pedagogy to Persian phonetics, producing a cohesive cycle of lyrical ballads with professional projection suitable for concert hall performance.

Lyric-music integration responds to semantic content consistently: climactic musical moments align with poetic discoveries, musical collapses accompany poetic dissolutions. The AI-generated voice, however, simulates operatic timbre without replicating every acoustic characteristic of trained human singers, and occasionally demonstrates hallucinatory singing and mispronunciations.

The Challenge of Operatic Persian

Producing these compositions required solving problems that no existing AI music platform has addressed. The first challenge: finding a model capable of rendering operatic vocalization at all. Most AI music generators optimize for pop, rock, or electronic genres; classical operatic delivery, with its specific demands (sustained legato, controlled vibrato, dramatic dynamic range, projection without distortion), exists at the margins of what current models can produce.

The second challenge proved more formidable: Persian phonetics. Farsi contains vowel and consonant structures unfamiliar to models trained predominantly on English and European languages. Long vowels (آ-ā, ای-ī, او-ū) require sustained treatment different from their short counterparts. Consonants such as kh (خ), gh (ق), and zh (ژ) have no English equivalents and are routinely misrendered in most AI interpretations.

To guide the model, I developed a system of diacritical notation and phonetic spelling that broke Persian words into syllables the AI could process. Standard Farsi script proved unusable; a hybrid Farsi-romanized diacritical notation with careful vowel marking became essential. Even then, the model required constant correction. Words had to be respelled, hyphenated, sometimes deliberately misspelled to achieve correct pronunciation and operatic delivery.
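
The flavor of that respelling work can be sketched in a few lines of code. This is a minimal illustrative sketch, not the actual tooling behind the album; the word list, the respellings other than be-bāār, and the function name are hypothetical, chosen only to show the idea of a hand-tuned substitution table applied before a lyric line reaches the model.

RESPELL = {
    # Hypothetical illustrations of the approach, not the exact production respellings.
    "بمان": "be-māān",      # "stay" -> explicit hyphenation, long vowel marked
    "ببار": "be-bāār",      # "rain!" -> syllable break keeps the final consonant audible
    "خورشید": "khor-shīd",  # "sun" -> syllable boundary forced in the right place
}

def prepare_line(line: str) -> str:
    """Replace known problem words with hand-tuned phonetic respellings
    before the lyric line is handed to the music model."""
    for farsi, respelled in RESPELL.items():
        line = line.replace(farsi, respelled)
    return line

print(prepare_line("تو بمان و در این دشت ببار"))
# -> تو be-māān و در این دشت be-bāār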

Beyond phonetics lay a deeper challenge: semantic understanding. Persian poetry operates through layered meaning: a word like āyine (mirror) carries philosophical weight absent from its English equivalent; zarre (mote) invokes centuries of Sufi meditation on the relationship between particle and sun, ash and fire. The AI has no access to this semantic depth. It generates music automatically based on its best interpretation of the input, and those interpretations are mostly wrong: wrong in pronunciation of words, wrong in understanding their meaning, wrong in sensing their emotional weight, wrong in integrating vocalization with the background music. What emerges unbidden is largely generic output that betrays its artificial origins.

The challenge was pushing back and asserting creative direction against the model’s defaults. Rejecting what the AI offers and demanding what the poetry requires. Integrating music and vocalization with thematic arc meant specifying where intensity should build, where the voice should break, where orchestration should swell or withdraw, overriding the model’s instincts at nearly every turn.

The Anatomy of Iteration: Examples from Production

The lyric arrangements that drive these compositions bear little resemblance to standard song lyrics or to a traditional songwriting experience. Each line carries embedded instructions (mood tags, vocal directives, phonetic guidance) that nudge the AI toward the intended delivery. A typical section might read:

[Verse – Despair, Intimate]
[Bass-baritone, pp, fluid phrasing]
Tôô کُجاٰئــیْ
Ey آٰیـْـنِیِه اُمّیدْ؟

The diacritical marks (kasre ــِـ, fatha ـــَـ, sukun ــْـ) guide vowel length and stress. The elongated “Tôô” with circumflex signals sustained delivery. The section tag “[Verse – Despair, Intimate]” primes the model’s emotional register. The vocal directive “[Bass-baritone, pp, fluid phrasing]” specifies voice type, dynamic level, and articulation style.

More complex passages required more elaborate guidance:

[Climactic Bridge - STACCATO Unity Theme, Steady Forte]
دَر اینْ مَسییرِ بی‌ْ کورْ سُو [building orchestral power, ff steady]
مَنُ Fاٰنُوسِ دَریaٰئی‌یِه مَن [intimate, strong ff]
هَم‌پاٰیِ هَمیم [staccato delivery, f steady]
هَمْ‌ سَفَریمْ [staccato delivery, f steady]
هَمْراٰهِ هَمیمْ [staccato delivery, f sustained]

None of this emerges automatically; each directive was added through iteration, testing what the model responded to and what it ignored.
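
The pattern those directives follow can be sketched as code. The helper below is a hedged illustration of how a section tag, a vocal directive, and per-line performance notes compose into one prompt section; it is not a tool used on the album, and the function name and the "Bass-baritone, ff, building" directive are hypothetical.

def build_section(section_tag, vocal_directive, lines):
    """Render one lyric section: a section tag, a vocal directive, then each
    lyric line followed by its per-line performance note in brackets."""
    out = [f"[{section_tag}]", f"[{vocal_directive}]"]
    for lyric, note in lines:
        out.append(f"{lyric} [{note}]" if note else lyric)
    return "\n".join(out)

print(build_section(
    "Climactic Bridge - STACCATO Unity Theme, Steady Forte",
    "Bass-baritone, ff, building",   # hypothetical directive, for illustration only
    [("دَر اینْ مَسییرِ بی‌ْ کورْ سُو", "building orchestral power, ff steady"),
     ("هَم‌پاٰیِ هَمیم", "staccato delivery, f steady")],
))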

The Reality of Iteration

This work bears no resemblance to a ChatGPT experience where you prompt and receive a 70% complete answer. Here, you start with a 40% product, sometimes less, and iterate through hundreds of tweaks to reach 90% accuracy. The AI generates something; that something is rarely usable as delivered. Each iteration demanded explicit understanding of both the technical stack at the backend and the artistic vision at the frontend.

Prompt engineering became a delicate art: too much specification constrained the model’s musicality; too little produced generic results that ignored the poetry’s demands. Finding the balance required constant experimentation with different approaches to instrumentation, voice range management across sections, tempo variation, and, most critically, the syntax and semantics of how lyrics were fed to the model.

The process revealed specific phonetic limits. Final -rm sequences prove particularly difficult: garm (گرم, warm) becomes garrr; narm (نرم, soft) loses its final consonant entirely. The m goes silent, likely because English training data rarely features word-final -rm clusters following a vowel in sung contexts. Compound consonants challenge the model: khorshīd (خورشید, sun) sometimes renders as khors-shid, syllable boundary inserted where none belongs.

Hallucinatory Singing and Mispronunciation

More troubling than phonetic difficulty is hallucinatory singing, the model generating sounds that approximate but do not match the provided lyrics. Bemān (بمان, stay) and Bebār (ببار, rain!) fuse into a single meaningless syllable: bemār or bebān, as if the model recognizes similar phonetic shapes and interpolates between them. The result exists in neither Persian nor any other language.

Mispronunciations carry semantic consequence. Ka’be (کعبه, Kaaba) rendered as kahbe transforms sacred destination into obscenity, a catastrophic error in a poem addressing pilgrims to the holy site. Bī-hīme (بی‌هیمه, without fuel) becomes beheme, losing the negation prefix that gives the fire temple its tragic emptiness, turning poetic imagery to nonsense.

These errors required dozens of regenerations: respellings, alternative phrasings, and sometimes restructuring entire lines into hybrid forms to avoid combinations the model could not handle, before the words finally rendered correctly. The solution was not better prompting but phonetic decomposition: be-bāār, with explicit hyphenation and stress marking.

Disembodied AI: Bodies the Model Cannot Know

Current AI music models are “body-naïve”: they learn acoustic patterns from training samples that carry no representation of the physical and physiological systems that produced them. This manifests in identifiable artifacts throughout the production process.

Preparatory breath: Trained singers exhibit audible inhalation before sustained phrases, the “catch breath” that enables powerful projection. This appears only occasionally and unpredictably in AI-generated vocals, as a statistical artifact: the model learned that operatic training data correlates sustained phrases with breath sounds in the preceding 200-500 milliseconds. When present, it adds to the texture and believability of the vocal delivery; when absent, extended passages sound disembodied. No current platform offers explicit control over this phenomenon; the catch breath emerges through learned correlation in AI-generated music, not deliberate simulation.

Superhuman durations: Occasionally, legato passages stretched to 15 seconds where human singers would end at 8; vibrato persisted relentlessly through a single syllable without natural decay. Elite opera singers sustain 15-20 seconds at moderate intensity; most phrases run 4-8 seconds. But AI has no representation of lung capacity, diaphragm control, or breath support. It learns that “legato” means sustained sound; it cannot learn where breath must intervene. Human performances include micro-timing variations of 5-15 milliseconds and slight pitch deviations, signatures of bodies at work. AI models generate voice frame by frame without embodiment constraints; stopping criteria become purely statistical, occasionally sampling from distribution tails that are statistically possible but correspond to no human performance.

Voice-instrument convergence: Occasionally, transitions between operatic voice and saxophone create ambiguous boundaries where the listener cannot determine where, or even whether, the voice tapers and the saxophone enters. Vocal delivery melts smoothly into a solo saxophone passage without a recognizable boundary separating the two. The same phenomenon does not occur with cello. The human vocal tract and the saxophone are both air-column resonators with formant-like spectral characteristics; the saxophone's spectrum resembles vowel sounds. The cello uses stick-slip friction on strings, producing a distinctly different spectral signature. Technically, AI music generators compress audio into latent spaces where acoustically similar sounds occupy nearby regions; voice-to-saxophone transitions happen when the generator interpolates through nearly indistinguishable distribution patterns between voice and saxophone and lazily follows the path of least resistance, while cello requires larger representational jumps.
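
A toy sketch can make that latent-space intuition concrete. The vectors below are invented numbers, not embeddings from any real model; the point is only that small distances invite smooth interpolation while large distances do not.

import numpy as np

# Toy 3-dimensional "embeddings"; real models use hundreds of dimensions.
voice = np.array([0.80, 0.10, 0.05])
sax   = np.array([0.75, 0.15, 0.10])   # acoustically close to voice
cello = np.array([0.10, 0.90, 0.40])   # far from both

def gap(a, b):
    """Euclidean distance between two latent points."""
    return float(np.linalg.norm(a - b))

print(gap(voice, sax), gap(voice, cello))   # small gap vs. large gap

# Stepping from voice toward sax passes through plausible intermediate sounds,
# so a decoder can drift from one to the other without a clear boundary;
# reaching the cello embedding requires a much larger jump.
alpha = 0.5
midpoint = (1 - alpha) * voice + alpha * sax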

Musical Coherence and Structural Memory

Counterintuitively, AI models struggled more with musical coherence than with operatic delivery. Mood, tempo, and instrumentation changed randomly between verses, failing to produce a cohesive theme throughout the track. Key changes appeared where none belonged. Instrumental bridges invented new themes instead of elaborating on a unifying core.

The cause may be architectural: current generators predict what sounds should come next based on training patterns, analogous to language models predicting next words. This works phrase-by-phrase but fails globally with increased complexity. The system does not have enough memory to retain key, harmonic center, or thematic melodies across sections. My research confirms that recurrent architectures “store limited information from previous state and may not extend across all previous context. Not very useful in music compositions where the beginning of the song plays significant role in the middle and the end.”

While musical coherence requires global awareness (thematic development, motivic recurrence, and key relationships between sections), vocal delivery operates at local timescales: phrasing dynamics such as legato and vibrato exist within phrases lasting seconds. The model can learn “operatic baritone” from short examples and produce coherent deliveries just in time, with slight variations that in fact make the vocal delivery sound more natural.

In my observation the memory decay compounds across three factors. Duration: tracks at 5-8 minutes exceed what 3-5 minute pop songs demand and increase the probability of incoherence. Structural complexity: operatic architecture requires development and recapitulation rather than verse-chorus repetition, compounding the memory demands of the production. Language: Persian lacks the dense training priors that help English songs reconstruct plausible continuity when AI memory fades, resulting in frequent hallucinatory errors. These factors multiply.

Generative AI Model Evolution: Quality Versus Creativity

Observations across model generations reveal a consistent trade-off.

Early 2025: High creativity in melody and vocal rendering. Early models generated a wide, dynamic range of experimental compositions, but with poor execution: recognizably synthetic instruments, unnatural voices, and shallow production with sparse instrumentation and little complexity.

Mid-2025: The productive window. Creative range was retained while execution improved. Hallucinatory singing decreased, instruments approached realism, metatag responsiveness increased. This generation of AI models enabled Migrant Cloud: orchestral music, operatic delivery with full dynamic range, bass-forward, cello-led orchestration, and complex dramatic arcs.

Late 2025: The latest AI models are marked by excellent sound quality: crisp instruments, realistic voices, reduced hallucination. But creative musicality has collapsed systematically. Dynamic range compressed to pf; whispered ppp and climactic fff disappeared. Vibrato diminished. Crescendo instructions ignored. When prompted for bass-baritone full-range delivery, models frequently create a duet (a female voice for soft passages, a tenor for loud ones), unable to produce a single voice traversing the range. Operatic delivery and classical music are almost completely lost in modern models.

Community feedback confirms my observation: newer models sound “bland, like reading the notes without living them.” Critics note that models “smooth everything out. Ask it to sound raw, unprocessed, It ignores you.”

The technical mechanism behind this loss of musical competency is documented as mode collapse: reinforcement-learning approaches cause “severe loss of diversity or creativity in outputs.” There is “no diversity bonus, no reward for novelty. Each datapoint is evaluated in isolation.” In diffusion architectures, classifier-free guidance explicitly trades “mode coverage” for “sample fidelity”, producing outputs that match most prompts more closely but with reduced diversity. Mode collapse persists because it “greatly increases the quality of ‘default’ first-time users outputs” while “the tradeoff of loss of diversity can only be seen corpus-wide.” Models optimize for the median use case: English pop creators wanting quick results. Using generative AI itself to provide reward/penalty feedback and act as a judge for reinforcement learning compounds the collapse. Mode collapse eliminates extreme dynamic range, structural complexity, timbral variation, thematic complexity, and transformative arcs, all required by classical music and opera. Persian operatic work sits where this trade-off hurts most: demanding genres, underrepresented language, artistic rather than commercial intent.
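
For readers who want the diffusion mechanism in one line: in the standard classifier-free guidance formulation (a general technique, not something disclosed by any particular music platform), the model blends an unconditional noise prediction eps(x_t) with a prompt-conditioned prediction eps(x_t, c):

eps_guided = eps(x_t) + w * (eps(x_t, c) - eps(x_t))

As the guidance weight w increases, samples are pushed harder toward the most prompt-typical outputs, raising per-sample fidelity while narrowing the output distribution, exactly the coverage-for-fidelity trade described above.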

Human in the Loop

What emerged from this process is neither pure human creation nor autonomous AI generation. The poetry, the artistic vision, the emotional architecture, the iterative refinement that shaped raw output into finished composition, are human. The soundscape, the operatic vocalization, the musical alignment and integration with vocal expressions, are synthetic: generated, not composed, by AI models that simulate but do not fully understand what they produce, and need constant guidance and creative hand-holding to remain coherent.

Every composition on this album passed through hundreds of iterations. Operatic decisions (phrase length, climactic placement, and the overall bel canto contour of the line) were shaped through iterative prompting and selection, with the AI providing raw vocal realizations that were steered, corrected, and curated at every turn. I learned which words the model could pronounce, which combinations to avoid, which phrasings produced musical results versus mechanical recitation. I developed intuition for prompt structures that yielded operatic drama rather than flat delivery.

I learned to nudge models with emotional cues embedded in lyrics (desperate, triumphant, dissolving, pleading) and seed text with phonetic guidance the AI could parse (e.g., crescendo on Tô kôjāyi ey āy-neye ômmid).

In ‘Ignite Me’, the repeated plea ‘Ey eshgh-e azali, marā dariyāb’ demonstrates stable operatic refrains (surging to F#5 with controlled vibrato) while vowels occasionally warp into uncanny shapes. In ‘Mirage’, long legato arcs on ‘Tô kôjāyi ey āy-neye ômmid’ sustain 10 Persian syllables across bel canto lines typically reserved for Italian arias, navigating E4-G5 without register breaks.

Migrant Cloud is a collaboration with a tool that has capabilities and limitations, strengths and blind spots. The remarkable success in delivery of operatic idioms throughout the album represents not what the AI produced automatically, but what emerged from sustained creative direction, a human ear listening, rejecting, adjusting, listening again, until the voice carried what the poetry demanded.

The honest edge of an experimental work: imperfect, labor-intensive, genuinely new.

Parsa Mirhaji, December 2025