Eight models, 34 million parameters

An exploration of neural sequence generation applied to symbolic music. The project has evolved over the past few years from simple LSTM baselines to a Transformer architecture with BPE tokenisation.

Why

Symbolic, not audio

Tools like Suno and Udio generate finished audio: a complete waveform, like a rendered image. You can listen to it, but you can't reach in and move a note, change a chord, or rewrite the bridge. The audio is the final product.

This project generates symbolic music: MIDI sequences where every note is a discrete event with a pitch, a start time, a duration, and a velocity. Think of it as the difference between a screenshot and an SVG. One is flat pixels. The other is individual shapes you can select, move, and restyle. You can select individual notes, transpose sections, swap instruments, adjust timing, and drop the output straight into a DAW like Ableton or Logic as a starting point for real composition.
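As a rough illustration of what "discrete event" means here, a note can be modelled as a small record. The sketch below is hypothetical: the `Note` class and the `transpose` helper are illustrative names, not the project's own code, and times are assumed to be measured in beats.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Note:
    pitch: int       # MIDI note number, 0-127 (60 = middle C)
    start: float     # onset time in beats (assumed unit)
    duration: float  # length in beats (assumed unit)
    velocity: int    # how hard the key is struck, 1-127


def transpose(notes, semitones):
    """Return a copy of the notes shifted by the given number of semitones."""
    return [replace(n, pitch=n.pitch + semitones) for n in notes]


phrase = [Note(60, 0.0, 1.0, 80), Note(64, 1.0, 1.0, 72), Note(67, 2.0, 2.0, 90)]
up_a_tone = transpose(phrase, 2)  # timing and velocity are left untouched
```

Because every note is an independent record, an edit such as transposition touches only the field it needs, which is the property the screenshot-versus-SVG comparison is pointing at.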

The model doesn't know what music sounds like. It has never heard audio. It learns structure (which notes tend to follow which, how phrases build, where tension resolves) entirely from the symbolic representation. The audio you hear on this site is just a MIDI-to-WAV render of what the model actually outputs: a sequence of tokens.

The project builds on research from the Music Transformer (Huang et al., 2019), the Pop Music Transformer (Huang & Yang, 2020), which introduced REMI tokenisation, and Fradet et al. (2023) on BPE compression for symbolic music. V8 trains on the MAESTRO dataset (v3): 1,276 competition piano performances.

Representation

How music becomes tokens

Three strategies, each building on the last.

V1 was the first experiment, a simple proof of concept that's since been lost to the ether. Everything here starts from V2.

V2–V5

Pitch Strings

Raw pitch names and chord integers. No timing, no velocity. The model learns temporal structure purely from position.

Vocabulary: 59–1,279 tokens

V6

REMI

Typed tokens: Bar, Position, Pitch, Velocity, Duration, Tempo, Chord, Rest. Musical structure is explicit rather than implied by position. The vocabulary shrank even though each sequence now encodes more information.
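To make the typed-token idea concrete, here is a minimal sketch of how one bar might be flattened into a REMI-style sequence. The token spellings (`Position_0`, `Pitch_60`, and so on), the 16-step bar grid, and the `encode_bar` helper are assumptions for illustration; the project's actual tokeniser may differ, and Tempo, Chord, and Rest tokens are omitted for brevity.

```python
def encode_bar(notes, positions_per_bar=16):
    """Flatten one bar of (position, pitch, velocity, duration) tuples into typed tokens."""
    tokens = ["Bar"]
    for position, pitch, velocity, duration in sorted(notes):
        if not 0 <= position < positions_per_bar:
            raise ValueError("position outside the bar grid")
        tokens += [
            f"Position_{position}",  # where in the bar the note starts
            f"Pitch_{pitch}",        # MIDI note number
            f"Velocity_{velocity}",  # loudness
            f"Duration_{duration}",  # length in grid steps
        ]
    return tokens


bar = [(0, 60, 80, 4), (8, 67, 72, 8)]
tokens = encode_bar(bar)
```

Each token type draws from a small fixed set of values, which is one plausible reason a typed scheme can describe more while needing fewer distinct tokens than free-form pitch strings.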

Vocabulary: 362 tokens

V7–V8

REMI + BPE

Byte Pair Encoding compresses multi-token note events into single merged tokens. More musical context per window.
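The merging step can be sketched in a few lines. This is a toy version of Byte Pair Encoding over token strings, not the project's implementation: the helper names and the `+` separator used for merged tokens are illustrative assumptions.

```python
from collections import Counter


def most_common_pair(seq):
    """Return the most frequent adjacent token pair, or None if seq is too short."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + "+" + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_bpe(seq, num_merges):
    """Greedily apply up to `num_merges` merges and return the compressed sequence."""
    for _ in range(num_merges):
        pair = most_common_pair(seq)
        if pair is None:
            break
        seq = merge_pair(seq, pair)
    return seq


note = ["Pitch_60", "Velocity_80", "Duration_4"]
sequence = note * 3                      # the same note event three times
compressed = learn_bpe(sequence, num_merges=2)
```

After two merges the nine-token sequence collapses to three tokens, one per note, so a fixed-length context window covers three times as much music in this toy case.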

Vocabulary: 512–1,024 tokens

Architecture

LLaMA-style Transformer

8 layers. 8 attention heads. 512 dimensions. Causal masking. Weight-tied embeddings. ~34M parameters.
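The ~34M figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes a feed-forward width of 4 × 512 = 2,048, bias-free projections, and the V8 vocabulary of 1,024 tokens; the feed-forward width in particular is not stated on this page, so treat the result as a plausibility check rather than an exact count.

```python
d_model, n_layers, vocab_size = 512, 8, 1024
ffn_hidden = 4 * d_model  # assumed; the feed-forward width is not stated

attention = 4 * d_model * d_model        # Q, K, V and output projections, no biases
feed_forward = 3 * d_model * ffn_hidden  # the three SwiGLU projections
norms = 2 * d_model                      # two RMSNorm gain vectors per layer

per_layer = attention + feed_forward + norms
embedding = vocab_size * d_model         # counted once because the weights are tied
total = n_layers * per_layer + embedding + d_model  # plus the final norm's gain
```

Under these assumptions `total` comes to 34,087,424, which is consistent with the quoted ~34M. Note that the attention head count does not affect the total, since the heads split the same 512-dimensional projections.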

RoPE

Rotary positional encoding. Encodes position by rotating query/key vectors, so attention scores depend on the relative distance between tokens. Attention tends to decay naturally with distance.
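A small numerical sketch shows why rotation gives relative-position behaviour. It rotates adjacent pairs of dimensions, as in the original RoFormer formulation; the pairing convention and the base of 10,000 are assumptions here, and real implementations (including LLaMA-style ones) often pair dimensions differently.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower pairs rotate faster
        x, y = vec[i], vec[i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]
score_near_start = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2
score_far_along = dot(rope(q, 105), rope(k, 102))   # same offset of 3, later in the sequence
```

The two scores agree up to floating-point error, because the dot product of two rotated vectors depends only on the difference between their rotation angles.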

SwiGLU

Gated feed-forward: (SiLU(xW₁) ⊙ xW₃)W₂. The W₁ branch acts as a gate that decides, per hidden unit, how much of the W₃ branch passes through before W₂ projects back to model width.
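The gated feed-forward formula can be written out directly. The following is a dependency-free sketch with tiny hand-picked matrices, using the row-vector convention (x multiplies each matrix from the left); the helper names and the example weights are illustrative only.

```python
import math


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(x, w):
    """Row vector x (length n) times matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def swiglu(x, w1, w2, w3):
    gate = [silu(g) for g in matvec(x, w1)]         # SiLU(x W1)
    value = matvec(x, w3)                           # x W3
    hidden = [g * v for g, v in zip(gate, value)]   # elementwise product
    return matvec(hidden, w2)                       # project back to model width


x = [1.0, -2.0]
w1 = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]     # 2 -> 3 (gate projection)
w3 = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]     # 2 -> 3 (value projection)
w2 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # 3 -> 2 (output projection)
y = swiglu(x, w1, w2, w3)
```

In this example only the first hidden unit has a non-zero gate, so only that unit's value reaches the output, even though all three value entries are non-zero.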

RMSNorm

Normalises by root mean square, without mean centering, and is applied before each sub-layer (pre-norm). Simpler than LayerNorm, empirically as effective for Transformer training.
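For reference, the normalisation itself is only a couple of lines. This is a minimal sketch; the `eps` value and the function name are assumptions.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Scale x by its root mean square; unlike LayerNorm, no mean is subtracted."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


x = [3.0, 4.0]
y = rms_norm(x, gain=[1.0, 1.0])
```

The output has a mean square of approximately 1 but keeps its non-zero mean, which is the practical difference from LayerNorm.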

Weight Tying

Input embeddings and output projection share weights. Fewer parameters, built-in regularisation.
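In code, weight tying simply means one matrix is used in two places. The sketch below uses a toy three-token vocabulary with two dimensions; the values and function names are made up for illustration.

```python
# One shared matrix: a row per token in the vocabulary (toy values).
embedding = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.5]]


def embed(token_id):
    """Input side: look up the token's row."""
    return embedding[token_id]


def output_logits(hidden):
    """Output side: score the hidden state against every row of the same matrix."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding]


logits = output_logits([1.0, 2.0])
```

Because both directions read the same rows, any update to a token's embedding also changes how readily that token is predicted.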

Training

How the models learn to compose

Data Pipeline

MIDI files are split into train/val sets before augmentation, so transposed copies of the same piece never land on both sides of the split. Pitch shifting (±6 semitones) approximately triples effective dataset size.
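The ordering matters more than the details, so here is a sketch of split-then-augment. Several things are assumptions: pieces are reduced to plain lists of MIDI pitches, the shift set is (-6, 0, +6) as one reading of "approximately triples", the validation set is left unaugmented, and the function name is hypothetical.

```python
import random


def split_then_augment(pieces, val_fraction=0.1, shifts=(-6, 0, 6), seed=0):
    """Split pieces into train/val first, then pitch-shift only the training set."""
    pieces = list(pieces)
    random.Random(seed).shuffle(pieces)
    n_val = max(1, int(len(pieces) * val_fraction))
    val, train = pieces[:n_val], pieces[n_val:]
    augmented = [
        [pitch + shift for pitch in piece]
        for piece in train
        for shift in shifts
        # skip any shift that would leave the valid MIDI pitch range
        if all(0 <= pitch + shift <= 127 for pitch in piece)
    ]
    return augmented, val


# Ten toy "pieces", each a two-note figure with a different interval.
pieces = [[60, 61 + i] for i in range(10)]
train_set, val_set = split_then_augment(pieces)
```

Had the augmentation run first, a piece's transposed copy could end up in validation while the original sat in training, and validation loss would look better than it should.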

Sequence Prep

Sliding window over token sequences. Sequence length 256 for Transformer (512 max context at inference), 100 for LSTM. Configurable stride controls overlap.
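A sliding window with a configurable stride might look like the following. The stride of 128 and the shifted-by-one target are assumptions (the latter is the usual next-token setup), and `sliding_windows` is an illustrative name.

```python
def sliding_windows(tokens, seq_len=256, stride=128):
    """Yield (input, target) pairs; the target is the input shifted by one token."""
    for start in range(0, len(tokens) - seq_len, stride):
        window = tokens[start:start + seq_len + 1]
        yield window[:-1], window[1:]


tokens = list(range(1000))           # stand-in for a tokenised piece
pairs = list(sliding_windows(tokens))
```

A stride equal to `seq_len` gives non-overlapping windows; a smaller stride produces more training examples from the same piece at the cost of redundancy between them.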

Optimiser

AdamW with cosine annealing (LSTM) or linear warmup + cosine decay (Transformer). Gradient clipping at norm 1.0.
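The Transformer schedule and the clipping rule can both be expressed as small pure functions. The step counts and peak learning rate below are placeholder values, not the project's settings, and the clipping helper mirrors the common global-norm formulation rather than any specific library call.

```python
import math


def lr_at(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay towards min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale the gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads]


# Placeholder hyperparameters for illustration only.
schedule = [lr_at(s, total_steps=1000, warmup_steps=100, peak_lr=3e-4) for s in range(1000)]
clipped = clip_by_global_norm([3.0, 4.0])   # original norm is 5
untouched = clip_by_global_norm([0.1, 0.2]) # already below the limit
```

Warmup keeps early updates small while the optimiser's moment estimates are still noisy, and clipping caps the size of any single update without changing its direction.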

Early Stopping

Best checkpoint saved by validation loss. Training halts after 10–15 epochs without improvement. V7 stopped at epoch 45, V8 ran all 100.
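The stopping rule can be simulated on a list of validation losses. This is a sketch of patience-based early stopping in general; the function name and the loss values are invented for the example.

```python
def train_with_early_stopping(val_losses, patience=10):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # a checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch             # patience exhausted
    return best_epoch, len(val_losses)           # ran every epoch


plateau = [1.0, 0.9, 0.8, 0.7, 0.6] + [0.65] * 20
stopped = train_with_early_stopping(plateau, patience=10)

always_improving = [1.0 / (i + 1) for i in range(100)]
full_run = train_with_early_stopping(always_improving, patience=10)
```

In the first run the loss stalls after epoch 5 and training halts ten epochs later; in the second it keeps improving, so all 100 epochs run.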

Inference

Controlling randomness

Temperature: 0.8 (range 0.1 – 2.0)

Divides logits before softmax. Below 1.0 sharpens the distribution (safer, more predictable). Above 1.0 flattens it (more surprising, more diverse).
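The effect is easy to see on a three-token example. This sketch uses a numerically stable softmax; the function name is illustrative.

```python
import math


def softmax_with_temperature(logits, temperature=0.8):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max to avoid overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)
neutral = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)
```

The top token's probability is highest at temperature 0.5 and lowest at 2.0, while the ranking of the tokens never changes.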

Top-K: 50 (range 0 – 500)

Keeps only the k highest-probability tokens. Every other token has its logit set to negative infinity, so it can never be sampled. This prevents sampling from the incoherent long tail.
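A minimal sketch of the filter follows. Treating k = 0 as "filter disabled" is an assumption about how the lower end of the range behaves, and `top_k_filter` is an illustrative name.

```python
def top_k_filter(logits, k=50):
    """Keep the k largest logits and set the rest to negative infinity."""
    if k <= 0 or k >= len(logits):
        return list(logits)  # assumption: k = 0 means no filtering
    threshold = sorted(logits, reverse=True)[k - 1]
    # Logits tied with the threshold are all kept, so slightly more than k can survive.
    return [x if x >= threshold else float("-inf") for x in logits]


filtered = top_k_filter([1.5, 0.2, 3.0, -1.0, 2.1], k=2)
```

After softmax, a logit of negative infinity becomes a probability of exactly zero, which is why masking at the logit level is enough.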

Top-P (Nucleus): 0.95 (range 0.01 – 1.0)

Includes tokens in descending order of probability until their cumulative probability exceeds p, then renormalises over that set. Adapts dynamically: fewer candidates when the model is confident, more when it is uncertain.
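The adaptive behaviour shows up clearly when the same threshold is applied to a peaked and a flat distribution. The sketch below works on probabilities rather than logits for readability; the function name and the example distributions are made up.

```python
def top_p_filter(probs, p=0.95):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalise the survivors so they sum to 1; everything else becomes 0.
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


confident = top_p_filter([0.90, 0.06, 0.03, 0.01], p=0.95)
uncertain = top_p_filter([0.30, 0.28, 0.22, 0.20], p=0.95)
```

With p = 0.95 the peaked distribution keeps two candidates while the flat one keeps all four, which a fixed top-k cutoff cannot do.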

Key Insights

01

Same architecture, different data, different vocabularies. V2–V4 share identical LSTM configs, but their vocabularies range from 59 to 1,279 tokens. Jazz chord voicings alone drove a 21x increase over R&B.

02

Changing the tokenisation mattered more than widening the network. V5 widened to [512, 512, 512] but kept float input and pitch-string encoding. V6 switched to PyTorch, added learned embeddings, and adopted REMI tokenisation. Vocabulary dropped from 629 to 362 while encoding timing, velocity, and dynamics.

03

V7 and V8 share the same architecture but differ in scale: V8 doubled BPE vocabulary to 1,024 and trained on 6x more data. V7 early-stopped at epoch 45 with best val loss 0.0952; V8 ran all 100 epochs.

Try it yourself

Generate a melody, or explore the full source on GitHub.