Technical Deep-Dive
About This Project
An exploration of neural sequence generation applied to symbolic music, progressing from simple LSTM baselines to Transformer architectures with learned tokenisation.
Motivation
Consumer AI music tools (Suno, Udio, Google's MusicLM) generate audio directly from text prompts. This project works at a different level of abstraction: generating symbolic MIDI sequences that musicians can edit, transpose, and arrange in a DAW. The distinction matters because symbolic representations preserve musical structure (individual notes, timing, velocity) in a format that's both human-readable and machine-processable.
This approach builds on a line of research into symbolic music generation with neural networks. The Music Transformer (Huang et al., ICLR 2019) demonstrated that self-attention could capture long-range structure in piano performances. The Pop Music Transformer (Huang & Yang, ACM MM 2020) introduced the REMI tokenisation scheme used in this project, showing that encoding beat-relative timing explicitly improves rhythmic coherence. Fradet et al. (EMNLP 2023) then showed that applying Byte Pair Encoding to symbolic music tokens reduces sequence length while improving generation quality, an approach implemented in the MidiTok library that this project uses. The V8 model also trains on the MAESTRO dataset (Hawthorne et al., ICLR 2019), a standard benchmark of 1,276 competition piano performances with aligned MIDI.
Most of this research exists as static papers with cherry-picked audio samples. The goal here was to bridge the gap between research and engineering: build a live, interactive system where each technique can be tested, compared, and tuned in real time. Rather than a paper that claims results, this is a working implementation that demonstrates them. The full pipeline (data processing, tokenisation, model training, inference API, and interactive frontend) ships as a single deployable stack, progressing from basic LSTMs with raw pitch encoding through to LLaMA-style Transformers with BPE tokenisation across seven model versions.
Architecture
The system is a three-tier stack: a Next.js frontend with an in-browser MIDI player and Web Audio synthesiser, a FastAPI backend serving PyTorch models for real-time inference, and PostgreSQL for generation history and metadata. The frontend communicates with the API to request melody generation with configurable parameters (model selection, temperature, sequence length), then renders and plays the resulting MIDI in-browser.
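The configurable parameters the frontend sends could be modelled as a small validated structure. This is an illustrative sketch only; the field names and bounds are assumptions, not the project's actual API schema.

```python
from dataclasses import dataclass


@dataclass
class GenerationRequest:
    """Shape of a melody-generation request (hypothetical field names).

    Mirrors the parameters described above: model selection,
    sampling temperature, and sequence length.
    """
    model: str = "v7"
    temperature: float = 1.0
    max_tokens: int = 256

    def __post_init__(self):
        # Reject degenerate sampling settings before they reach the model.
        if not 0.0 < self.temperature <= 2.0:
            raise ValueError("temperature must be in (0, 2]")
        if self.max_tokens < 1:
            raise ValueError("max_tokens must be positive")
```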
Model Evolution
Seven model versions track the progression from naive baselines to production-quality generation:
V2: LSTM Baseline (11MB, 59 tokens)
Three-layer LSTM ([256, 512, 256] units) trained on 24 R&B and 90s hip-hop MIDI files with pitch transposition augmentation (~3x effective data). Character-level pitch encoding with fixed vocabulary. The small, genre-focused dataset produced a tight token distribution (59 unique tokens).
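The pitch-transposition augmentation amounts to replaying each sequence at a few semitone offsets, which is where the ~3x effective data comes from. A minimal sketch; the shift values below are illustrative, not the exact offsets used in training.

```python
def transpose_augment(pitches, shifts=(-2, 0, 2)):
    """Pitch-transposition augmentation: each semitone shift yields one
    copy of the sequence, so 3 shifts give ~3x effective data.
    Pitches are clamped to the valid MIDI range [0, 127]."""
    return [[min(127, max(0, p + s)) for p in pitches] for s in shifts]
```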
V3: Genre Scaling (11MB, 635 tokens)
Same LSTM architecture and augmentation pipeline, scaled to ~180 dance/electronic tracks. The vocabulary jumped to 635 tokens, reflecting the more harmonically complex source material.
V4: Jazz Corpus (12MB, 1,279 tokens)
~120 jazz standards with augmentation. Vocabulary exploded to 1,279 tokens due to complex chord voicings: chords are encoded as dot-joined normalOrder integers (e.g. "0.4.7" for a major triad), and jazz voicings with extensions and alterations created a long tail of rare token combinations.
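The dot-joined encoding can be illustrated with a simplified normal-order computation: take the sorted pitch classes and pick the rotation with the smallest outer span. music21's normalOrder applies further tie-breaking rules that this sketch omits.

```python
def normal_order_token(midi_pitches):
    """Encode a chord as dot-joined pitch classes in (simplified)
    normal order. E.g. a C major triad becomes "0.4.7"."""
    pcs = sorted(set(p % 12 for p in midi_pitches))
    # Try every rotation of the sorted pitch classes and keep the most
    # compact one (smallest interval from first to last, mod 12).
    rotations = [pcs[i:] + pcs[:i] for i in range(len(pcs))]
    best = min(rotations, key=lambda r: (r[-1] - r[0]) % 12)
    return ".".join(str(pc) for pc in best)
```

Every distinct voicing produces a distinct token, which is why extended and altered jazz chords blow the vocabulary up.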
V5: Mixed Genre, Wider Network (22MB, 629 tokens)
~275 songs across genres with augmentation. Widened all LSTM layers to 512 units ([512, 512, 512], up from [256, 512, 256] in V2-V4). Added temperature-controlled sampling for generation diversity.
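Temperature-controlled sampling amounts to dividing the logits by T before the softmax: T < 1 sharpens the distribution toward the argmax, T > 1 flattens it toward uniform. A minimal sketch:

```python
import math
import random


def sample(logits, temperature=1.0, rng=None):
    """Sample a token index from temperature-scaled logits.

    Returns (index, probabilities) so the effect of T is inspectable.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = (rng or random).choices(range(len(probs)), weights=probs)[0]
    return idx, probs
```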
V6: REMI Tokenisation (23MB, 362 tokens)
Same 275-song dataset as V5 but without augmentation. Switched from raw pitch encoding to REMI (REvamped MIDI-derived events) tokenisation via MidiTok, and added a learned embedding layer (dim 128). REMI encodes note onset, duration, velocity, and tempo as discrete tokens, giving the model explicit timing information instead of forcing it to learn temporal structure implicitly. Vocabulary dropped from 629 to 362 despite the richer representation. Trained for 91 epochs on an H100 before early stopping (best val loss: 0.349).
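A toy REMI-style encoder makes the representation concrete. Token names follow REMI conventions, but this is a sketch: MidiTok's real vocabulary also bins velocities and durations and adds tempo and time-signature tokens.

```python
def to_remi(notes):
    """Emit a REMI-style token stream from a list of
    (bar, position, pitch, velocity, duration) tuples.

    A Bar token marks each new bar; each note then gets explicit
    Position, Pitch, Velocity, and Duration tokens, so timing is
    encoded directly rather than learned implicitly.
    """
    tokens, current_bar = [], None
    for bar, pos, pitch, vel, dur in notes:
        if bar != current_bar:
            tokens.append("Bar_None")
            current_bar = bar
        tokens += [f"Position_{pos}", f"Pitch_{pitch}",
                   f"Velocity_{vel}", f"Duration_{dur}"]
    return tokens
```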
V7: Transformer (130MB, 512 tokens)
Replaced the LSTM with a LLaMA-style Transformer decoder: 8 layers, 8 attention heads, d_model=512, SwiGLU feed-forward (d_ff=2048), RoPE positional encoding, and RMSNorm. Multi-head self-attention allows the model to capture long-range dependencies across the full 512-token context window. Uses REMI tokenisation with BPE (vocab 512) on the same 275-song dataset. Trained for 45 epochs on an H100 before early stopping (best val loss: 0.095).
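Two of these architectural choices, RMSNorm and the SwiGLU gate, are simple enough to sketch element-wise in plain Python (real implementations operate on batched tensors):

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale x by the reciprocal of its root-mean-square.

    Unlike LayerNorm there is no mean subtraction and no bias, just a
    learned per-dimension gain -- cheaper, and standard in LLaMA-style
    models.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu(gate, up):
    """Element-wise SwiGLU gating: silu(gate) * up. In the feed-forward
    block, `gate` and `up` are the two d_ff=2048 projections of the
    input, and the gated product is projected back to d_model."""
    return [silu(g) * u for g, u in zip(gate, up)]
```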
V8: BPE Tokenisation (131MB, 1,024 tokens)
Same LLaMA-style Transformer as V7, trained on a significantly larger dataset: 1,551 MIDI files combining the original 275 mixed-genre songs with 1,276 performances from the MAESTRO dataset (classical piano competition recordings from the International Piano-e-Competition). BPE vocabulary doubled to 1,024 to handle the richer data. Trained for the full 100 epochs on an H100 without early stopping (best val loss: 0.920, higher than V7 due to the larger vocabulary and more diverse data).
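The core BPE operation — repeatedly fusing the most frequent adjacent token pair into a new vocabulary entry — can be sketched in a few lines. This is an illustrative single merge step, not MidiTok's implementation:

```python
from collections import Counter


def bpe_merge_step(tokens):
    """One BPE merge: find the most frequent adjacent pair and fuse
    every occurrence into a single new token. Repeating this grows the
    vocabulary while shortening the sequence."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + "+" + b)   # fused token joins the vocabulary
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

Shorter sequences mean more musical material fits in the 512-token context window, which is the trade-off behind growing the vocabulary from 512 to 1,024.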
Observations
- Same architecture, different data, different vocabularies. V2, V3, and V4 share identical LSTM configs ([256, 512, 256]) but vocabulary ranged from 59 to 1,279 tokens depending on genre complexity. Jazz chord voicings alone drove a 21x vocab increase over R&B.
- Tokenisation changed more than widening the network. V5 widened to [512, 512, 512] but kept raw pitch encoding (629 tokens). V6 kept the same layer width, switched to REMI tokenisation, and the vocabulary dropped to 362 while encoding timing, velocity, and dynamics that raw pitch encoding discards.
- V7 and V8 share the same Transformer config (8 layers, 8 heads, 512 dims, RoPE, SwiGLU, RMSNorm, weight-tied embeddings) but differ in two ways: V8 doubled the BPE vocabulary from 512 to 1,024 and trained on 6x more data (1,551 files vs 275). V7 early-stopped at epoch 45 with val loss 0.095; V8 ran all 100 epochs with val loss 0.920. The losses aren't directly comparable due to different vocabularies and data distributions.
Stack
Frontend
Next.js 15, React, Web Audio API, Tailwind CSS
Backend
Python, FastAPI, PyTorch, MidiTok
Training
PyTorch, music21, MidiTok, NVIDIA H100 80GB
Infrastructure
Docker Compose, PostgreSQL, Traefik, GitHub Actions
Source
The full source, including the frontend, backend, model training pipeline, and all seven model checkpoints, is available on GitHub.