Eight models, 34 million parameters
An exploration of neural sequence generation applied to symbolic music, evolving over the past few years from simple LSTM baselines to a Transformer architecture with BPE tokenisation.
Why
Symbolic, not audio
Tools like Suno and Udio generate finished audio: a complete waveform, like a rendered image. You can listen to it, but you can't reach in and move a note, change a chord, or rewrite the bridge. The audio is the final product.
This project generates symbolic music: MIDI sequences where every note is a discrete event with a pitch, a start time, a duration, and a velocity. Think of it as the difference between a screenshot and an SVG. One is flat pixels. The other is individual shapes you can select, move, and restyle. You can select individual notes, transpose sections, swap instruments, adjust timing, and drop the output straight into a DAW like Ableton or Logic as a starting point for real composition.
The model doesn't know what music sounds like. It has never heard audio. It learns structure (which notes tend to follow which, how phrases build, where tension resolves) entirely from the symbolic representation. The audio you hear on this site is just a MIDI-to-WAV render of what the model actually outputs: a sequence of tokens.
The project builds on the Music Transformer (Huang et al., 2019), the Pop Music Transformer (Huang & Yang, 2020), which introduced REMI tokenisation, and Fradet et al. (2023) on BPE compression for symbolic music. V8 trains on the MAESTRO dataset (v3): 1,276 competition piano performances.
Representation
How music becomes tokens
Three strategies, each building on the last.
V1 was the first experiment, a simple proof of concept that's since been lost to the ether. Everything here starts from V2.
Pitch Strings
Raw pitch names and chord integers. No timing, no velocity. The model learns temporal structure purely from position.
59–1,279 tokens
REMI
Typed tokens: Bar, Position, Pitch, Velocity, Duration, Tempo, Chord, Rest. Explicit musical structure. Vocabulary shrank from 629 to 362 tokens despite encoding timing, velocity, and dynamics.
362 tokens
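As a rough illustration of the typed-token idea, the sketch below turns a few notes into REMI-style tokens. The token spellings, the 16-positions-per-bar grid, and the `Note` class are assumptions made for this example, not the project's actual vocabulary.

```python
from dataclasses import dataclass

POSITIONS_PER_BAR = 16  # assumed grid: sixteenth notes in 4/4


@dataclass
class Note:
    pitch: int      # MIDI pitch number, 0-127
    start: int      # onset in grid steps from the start of the piece
    duration: int   # length in grid steps
    velocity: int   # MIDI velocity, 0-127


def to_remi(notes):
    """Emit Bar / Position / Pitch / Velocity / Duration tokens for each note."""
    tokens = []
    current_bar = -1
    for note in sorted(notes, key=lambda n: (n.start, n.pitch)):
        bar, position = divmod(note.start, POSITIONS_PER_BAR)
        # Emit one Bar token for every bar boundary crossed, including empty bars.
        while current_bar < bar:
            tokens.append("Bar")
            current_bar += 1
        tokens += [
            f"Position_{position}",
            f"Pitch_{note.pitch}",
            f"Velocity_{note.velocity}",
            f"Duration_{note.duration}",
        ]
    return tokens


example_tokens = to_remi([Note(60, 0, 4, 80), Note(64, 4, 4, 72), Note(67, 16, 8, 90)])
```

Because every token carries its type, the vocabulary is bounded by the number of distinct positions, pitches, velocities, and durations rather than by every pitch-string combination seen in the data.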
REMI + BPE
Byte Pair Encoding compresses multi-token note events into single merged tokens. More musical context per window.
512–1,024 tokens
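A minimal sketch of how BPE can collapse a repeated note event into one token. The function names and the `+` joiner are hypothetical, and a real tokeniser would learn merges over a whole corpus rather than a single short sequence.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + "+" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(tokens, num_merges):
    """Greedily apply up to `num_merges` merges; return the new sequence and merge list."""
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges


note = ["Pitch_60", "Velocity_80", "Duration_4"]
sequence = note * 3 + ["Pitch_64", "Velocity_80", "Duration_4"]
compressed, merges = learn_bpe(sequence, num_merges=2)
```

After two merges the 12-token toy sequence shrinks to 5 tokens, which is the sense in which a fixed-length window can hold more musical context.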
Architecture
LLaMA-style Transformer
8 layers. 8 attention heads. 512 dimensions. Causal masking. Weight-tied embeddings. ~34M parameters.
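A back-of-envelope count shows how these settings can land near 34M parameters. The feed-forward hidden size of 2,048, the 1,024-token vocabulary (V8's BPE size), and the absence of bias terms are assumptions for this sketch; the page states only the layer count, head count, model width, and total.

```python
def count_parameters(n_layers=8, d_model=512, d_ff=2048, vocab_size=1024):
    """Estimate parameters for a bias-free, weight-tied decoder-only Transformer."""
    attention = 4 * d_model * d_model      # Q, K, V and output projections
    feed_forward = 3 * d_model * d_ff      # SwiGLU uses three projections
    norms = 2 * d_model                    # two RMSNorm gain vectors per layer
    per_layer = attention + feed_forward + norms
    embeddings = vocab_size * d_model      # counted once because weights are tied
    final_norm = d_model
    return n_layers * per_layer + embeddings + final_norm


total = count_parameters()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the total is 34,087,424. The number of attention heads does not change the count, and RoPE adds no learned parameters.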
RoPE
Rotary positional encoding. Encodes position by rotating query/key vectors, so attention scores depend on the relative offset between tokens. Attention tends to decay with distance.
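The rotation idea can be sketched in a few lines of plain Python. The base of 10,000 and the interleaved pairing of dimensions are common conventions assumed here, not details confirmed by this page.

```python
import math


def rope(vector, position, base=10000.0):
    """Rotate consecutive pairs of `vector` by position-dependent angles."""
    dim = len(vector)
    rotated = []
    for i in range(0, dim, 2):
        # Each pair gets its own frequency; later pairs rotate more slowly.
        angle = position * base ** (-i / dim)
        cos, sin = math.cos(angle), math.sin(angle)
        x, y = vector[i], vector[i + 1]
        rotated += [x * cos - y * sin, x * sin + y * cos]
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [0.3, -1.2, 0.8, 0.5]
key = [1.0, 0.4, -0.7, 0.2]

# Same offset (3 positions apart) at two different absolute positions.
score_near_start = dot(rope(query, 5), rope(key, 2))
score_later = dot(rope(query, 105), rope(key, 102))
```

The two scores agree up to floating-point error, because the dot product of two rotated vectors depends only on the difference between their positions.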
SwiGLU
Gated feed-forward: (SiLU(xW₁) ⊙ xW₃)W₂. Three projections give finer control over information flow.
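A toy version of the gated feed-forward block, written in row-vector form as (SiLU(x W1) * (x W3)) W2. The weights and sizes are made-up values for illustration, and the helper names are hypothetical.

```python
import math


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(x, w):
    """Row vector times matrix: x (length n) @ w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def swiglu(x, w1, w3, w2):
    """Compute (SiLU(x W1) * (x W3)) W2 with an elementwise gate."""
    gate = [silu(v) for v in matvec(x, w1)]
    value = matvec(x, w3)
    hidden = [g * v for g, v in zip(gate, value)]
    return matvec(hidden, w2)


# Toy sizes: model width 2, hidden width 3.
x = [1.0, -2.0]
w1 = [[0.5, -1.0, 0.0], [0.25, 0.5, 1.0]]
w3 = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = swiglu(x, w1, w3, w2)
```

The gate branch (W1) decides how much of the value branch (W3) passes through each hidden unit before W2 projects back to the model width.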
RMSNorm
RMS normalisation without mean centering, applied before each sub-layer (pre-norm). Simpler than LayerNorm and empirically as effective for Transformer training.
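For comparison, here is a small sketch of RMSNorm next to LayerNorm. The epsilon value and function names are assumptions for the example.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def layer_norm(x, gain, bias, eps=1e-6):
    """LayerNorm for comparison: subtracts the mean before scaling."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b for g, v, b in zip(gain, x, bias)]


x = [3.0, 4.0, 5.0, 6.0]
ones, zeros = [1.0] * 4, [0.0] * 4
rms_out = rms_norm(x, ones)          # mean stays non-zero
ln_out = layer_norm(x, ones, zeros)  # mean is removed
```

RMSNorm skips the mean and variance computation and has no bias vector, which is where the "simpler" comes from.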
Weight Tying
Input embeddings and output projection share weights. Fewer parameters, built-in regularisation.
Training
How the models learn to compose
Data Pipeline
MIDI files are split into train/val sets before augmentation to prevent leakage. Pitch shifting (±6 semitones) approximately triples the effective dataset size.
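The ordering matters: if a piece were augmented first, a transposed copy could end up in validation while the original sits in training. The sketch below splits first and augments only the training side. The two shifts per piece, the 10% validation fraction, and the data layout are assumptions chosen so the training set triples; the page does not say which shifts within ±6 semitones are used.

```python
import random


def split_then_augment(pieces, val_fraction=0.1, shifts=(-3, 3), seed=0):
    """Split pieces into train/val first, then pitch-shift the training set only.

    `pieces` maps a piece id to its list of MIDI pitch numbers.
    """
    ids = sorted(pieces)
    random.Random(seed).shuffle(ids)
    n_val = max(1, int(len(ids) * val_fraction))
    val_ids, train_ids = ids[:n_val], ids[n_val:]

    train = []
    for pid in train_ids:
        train.append((pid, pieces[pid]))  # original, unshifted
        for shift in shifts:
            shifted = [p + shift for p in pieces[pid]]
            # Skip shifts that would leave the valid MIDI pitch range.
            if all(0 <= p <= 127 for p in shifted):
                train.append((pid, shifted))
    val = [(pid, pieces[pid]) for pid in val_ids]
    return train, val


pieces = {f"piece_{i}": [60 + i, 64 + i, 67 + i] for i in range(10)}
train, val = split_then_augment(pieces)
```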
Sequence Prep
Sliding window over token sequences. Sequence length 256 for Transformer (512 max context at inference), 100 for LSTM. Configurable stride controls overlap.
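A sliding window over a token list might look like the following. The stride of 128 and the next-token target layout are assumptions; only the sequence length of 256 comes from the text above.

```python
def sliding_windows(tokens, seq_len=256, stride=128):
    """Yield (input, target) pairs where the target is the input shifted by one token."""
    # Each window needs seq_len inputs plus one extra token for the final target.
    for start in range(0, len(tokens) - seq_len, stride):
        window = tokens[start : start + seq_len + 1]
        yield window[:-1], window[1:]


tokens = list(range(1000))  # stand-in for a tokenised piece
pairs = list(sliding_windows(tokens))
```

A smaller stride produces more, heavily overlapping training examples from the same piece; a stride equal to the sequence length produces none that overlap.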
Optimiser
AdamW with cosine annealing (LSTM) or linear warmup + cosine decay (Transformer). Gradient clipping at norm 1.0.
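The Transformer schedule and the clipping rule can be sketched without any framework. The peak learning rate, step counts, and function names below are hypothetical; only the warmup-then-cosine shape and the clipping norm of 1.0 come from the text.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to `peak_lr`, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr + (peak_lr - min_lr) * cosine


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so their L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads]


schedule = [lr_at_step(s, total_steps=1000, warmup_steps=100, peak_lr=3e-4) for s in range(1000)]
clipped = clip_by_global_norm([3.0, 4.0])
```

Warmup keeps early updates small while the optimiser's moment estimates are still noisy, and clipping caps the size of any single update.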
Early Stopping
Best checkpoint saved by validation loss. Training halts after 10–15 epochs without improvement. V7 stopped at epoch 45, V8 ran all 100.
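The stopping rule reduces to a counter that resets whenever validation loss improves. This sketch replays a synthetic loss curve instead of training a model; the function name and the loss values are made up for illustration.

```python
def train_with_early_stopping(val_losses, patience=10):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss, best_epoch, epochs_without_improvement = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # a real loop would save a checkpoint here
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)


# Synthetic curve: improves for 5 epochs, then plateaus.
losses = [1.0, 0.8, 0.6, 0.5, 0.4] + [0.45] * 20
best_epoch, stop_epoch = train_with_early_stopping(losses, patience=10)
```

A run whose validation loss keeps improving never triggers the counter and uses every scheduled epoch.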
Inference
Controlling randomness
Temperature: 0.8 (range 0.1 – 2.0)
Divides logits by the temperature before softmax. Values below 1.0 sharpen the distribution (safer, more predictable); values above 1.0 flatten it (more surprising, more diverse).
Top-K: 50 (range 0 – 500)
Keeps only the k highest-probability tokens; the logits of all others are set to negative infinity. Prevents sampling from the incoherent long tail.
Top-P (Nucleus): 0.95 (range 0.01 – 1.0)
Adds tokens in descending order of probability until the cumulative probability exceeds p. Adapts dynamically: fewer candidates when the model is confident, more when it is uncertain.
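The three controls compose naturally into one sampling step. The sketch below is one plausible ordering (temperature, then top-k, then top-p), using the values shown above as defaults; treating `top_k == 0` as "disabled" is an assumption, and dropping tokens from the candidate list is equivalent to setting their logits to negative infinity.

```python
import math
import random


def sample_token(logits, temperature=0.8, top_k=50, top_p=0.95, rng=random):
    """Sample a token index after temperature scaling, top-k, and top-p filtering."""
    scaled = [l / temperature for l in logits]
    # Sort token indices from most to least likely.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]  # top_k == 0 is treated as "disabled" here

    # Softmax over the surviving tokens (subtract the max for numerical stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    cumulative, cutoff = 0.0, len(order)
    for n, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= top_p:
            cutoff = n
            break
    order, probs = order[:cutoff], probs[:cutoff]

    # random.choices renormalises the remaining weights.
    return rng.choices(order, weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0, -3.0]
rng = random.Random(0)
samples = [sample_token(logits, rng=rng) for _ in range(1000)]
```

With these toy logits the nucleus closes after the three most likely tokens, so the two least likely ones are never sampled.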
Key Insights
Same architecture, different data, different vocabularies. V2–V4 share identical LSTM configs, but their vocabularies range from 59 to 1,279 tokens. Jazz chord voicings alone drove a 21x increase over R&B.
Switching tokenisation changed more than widening the network did. V5 widened to [512, 512, 512] but kept float input and pitch-string encoding. V6 switched to PyTorch, added learned embeddings, and adopted REMI tokenisation. Vocabulary dropped from 629 to 362 while encoding timing, velocity, and dynamics.
V7 and V8 share the same architecture but differ in scale: V8 doubled BPE vocabulary to 1,024 and trained on 6x more data. V7 early-stopped at epoch 45 with best val loss 0.0952; V8 ran all 100 epochs.