Autoregressive Symbolic Music Generation
with LSTM and Transformer Architectures
This document describes the design, training, and evaluation of eight model versions developed for monophonic melody generation from MIDI data. It covers the evolution from a baseline LSTM to a 34M-parameter Transformer with REMI tokenisation and BPE compression, and reports observations from each iteration.
1. Problem statement
The task is next-token prediction over symbolic music sequences. Given a sequence of tokens representing musical events, the model predicts a probability distribution over the vocabulary for the next token. At inference time, tokens are sampled autoregressively from this distribution to produce novel sequences.
The input representation is symbolic MIDI, not audio. Each note is a discrete event with pitch, onset time, duration, and velocity. The output is a MIDI file that can be rendered to audio, edited in a DAW, or transposed to a different key. The model has no access to audio signals and learns structure entirely from the symbolic representation.
This differs from waveform generation systems (Suno, Udio, MusicGen) which produce finished audio. Symbolic generation preserves editability at the cost of requiring a separate synthesis step.
2. Related work
2.1 Attention mechanisms for music
Huang et al. (2019) demonstrated that self-attention captures long-range musical structure more effectively than recurrent networks. Their Music Transformer used relative positional encoding to learn interval relationships rather than absolute positions, which is musically motivated: a perfect fifth is a perfect fifth regardless of the starting pitch.
This work motivated the transition from LSTM to Transformer in V7. However, the implementation here uses Rotary Position Embeddings (RoPE) [Su et al., 2021] instead of the relative attention scheme from the original paper. RoPE encodes position by rotating query and key vectors, producing a natural attention decay with distance at lower computational cost.
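The core property of RoPE can be shown in a few lines. The following is a minimal numpy sketch (not the project's implementation): adjacent dimension pairs of a query or key vector are treated as points in the complex plane and rotated by a position-dependent angle, so the attention dot product depends only on the relative offset between positions.

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate a query/key vector by its position, RoPE-style.

    Pairs (x[2i], x[2i+1]) are rotated by pos * theta_i, with per-pair
    frequencies theta_i as in Su et al. (2021).
    """
    d = x.shape[-1]
    assert d % 2 == 0
    theta = base ** (-np.arange(0, d, 2) / d)
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# Key property: the dot product of rotated q and k depends only on the
# relative offset (5 - 3 == 12 - 10), not on absolute positions.
q = np.random.default_rng(0).normal(size=8)
k = np.random.default_rng(1).normal(size=8)
assert np.isclose(rope_rotate(q, 5) @ rope_rotate(k, 3),
                  rope_rotate(q, 12) @ rope_rotate(k, 10))
```

This relative-offset property is what makes RoPE musically attractive here: the same interval pattern scores the same attention weight wherever it occurs in the sequence.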
2.2 REMI tokenisation
Huang and Yang (2020) introduced REMI (REvamped MIDI-derived Events), a tokenisation scheme that encodes musical events as typed tokens: Bar, Position, Pitch, Velocity, Duration, Tempo, and Chord. This replaced earlier approaches that treated MIDI as a flat sequence of note-on/note-off events or pitch strings.
REMI makes musical structure explicit in the token stream. A note is no longer a single pitch token but a structured group of tokens encoding when it occurs, how loud it is, and how long it lasts. This was adopted in V6 using the MidiTok library with 32 velocity levels, chord detection, tempo tokens, and rest tokens (a MidiTok extension not in the original REMI specification).
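To make the typed-token grouping concrete, here is a toy encoder for the scheme described above. It is purely illustrative: the grid resolution, bin widths, and token names are assumptions, and the project itself uses the MidiTok library rather than anything like this.

```python
from dataclasses import dataclass

TICKS_PER_BAR = 16  # toy grid: 16 positions per bar

@dataclass
class Note:
    start: int      # onset in grid ticks
    pitch: int      # MIDI pitch number
    velocity: int   # 0-127
    duration: int   # length in grid ticks

def remi_tokens(notes: list[Note]) -> list[str]:
    """Toy REMI-style encoding: Bar/Position markers plus one
    Pitch/Velocity/Duration group per note (velocity quantised
    into 32 bins, i.e. 128 / 4)."""
    tokens, current_bar = [], -1
    for n in sorted(notes, key=lambda n: (n.start, n.pitch)):
        bar, pos = divmod(n.start, TICKS_PER_BAR)
        if bar != current_bar:
            tokens.append("Bar")
            current_bar = bar
        tokens.append(f"Position_{pos}")
        tokens.append(f"Pitch_{n.pitch}")
        tokens.append(f"Velocity_{n.velocity // 4}")
        tokens.append(f"Duration_{n.duration}")
    return tokens

# A C4 then an E4, both in bar 0: one Bar token, then two 4-token groups.
print(remi_tokens([Note(0, 60, 100, 4), Note(4, 64, 96, 4)]))
```

Each note becomes a group of typed tokens rather than a single symbol, which is exactly what lets the model learn timing and dynamics explicitly.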
2.3 Subword compression for music
Fradet et al. (2023) applied Byte Pair Encoding (BPE) [Sennrich et al., 2016] to REMI token sequences. BPE iteratively merges the most frequent token pairs into single tokens, compressing common multi-token note events (e.g., Position + Pitch + Velocity + Duration) into atomic units. This increases the musical context visible within a fixed sequence length.
BPE was adopted in V7 with a learned vocabulary of 512 tokens, and expanded to 1024 in V8 to accommodate patterns from a larger training corpus.
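The merge procedure itself is simple. The sketch below is a toy pure-Python version of the idea (MidiTok's actual BPE implementation differs): repeatedly find the most frequent adjacent token pair and fuse it into a single token.

```python
from collections import Counter

def learn_bpe(seq: list[str], num_merges: int):
    """Toy BPE: repeatedly fuse the most frequent adjacent token pair.
    Returns the learned merges and the compressed sequence."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing left worth merging
        merges.append((a, b))
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(a + "+" + b)  # fused token
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges, seq

# A recurring 4-token note event is fused into a single atomic token,
# compressing 12 tokens to 3.
note = ["Pos_0", "Pitch_60", "Vel_25", "Dur_4"]
merges, compressed = learn_bpe(note * 3, num_merges=3)
assert len(compressed) == 3
```

This is the mechanism behind the effective-context gain: after merging, each slot in a fixed-length window can carry a whole note event instead of one field of one.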
2.4 Architectural components from LLM research
The Transformer architecture draws on several components developed for language modelling: SwiGLU feed-forward layers [Shazeer, 2020], RMSNorm [Zhang and Sennrich, 2019], and weight tying between input embeddings and the output projection [Press and Wolf, 2017]. The learning rate schedule follows warmup followed by cosine decay [Loshchilov and Hutter, 2016]. These are standard choices in current Transformer training and are not specific to music.
3. Tokenisation
Three tokenisation strategies were used across model versions, each encoding progressively more musical information.
Pitch strings (vocabulary: 59–1,279 tokens)
Notes represented as music21 pitch names (e.g., "C4", "F#3") and chords as dot-separated integers. No explicit encoding of timing, velocity, or dynamics; temporal structure is learned implicitly from position in the sequence.
REMI (vocabulary: 362 tokens)
Typed tokens: Bar, Position, Pitch, Velocity (32 levels), Duration, Tempo, Chord, Rest. Musical structure made explicit. Vocabulary decreased despite encoding strictly more information, because the typed scheme is more efficient than enumerating all observed pitch/chord combinations.
REMI + BPE (vocabulary: 512–1,024 tokens)
BPE compression applied to REMI sequences. Frequently co-occurring token groups (e.g., a note event: Position + Pitch + Velocity + Duration) merged into single tokens. A 256-token BPE sequence covers approximately 2–3x more musical content than 256 raw REMI tokens.
4. Model architectures
4.1 MelodyLSTM (V1–V6)
Three-layer LSTM. V1 through V4 used hidden units of [256, 512, 256] with float-normalised pitch input (dividing by vocabulary size). V5 widened to [512, 512, 512]. V6 added a learned embedding layer (128 dimensions), replacing float normalisation. The final hidden state is projected through a linear layer to the vocabulary size.
V1 through V5 were trained with TensorFlow/Keras and later converted to PyTorch. V6 was the first version trained natively in PyTorch. The codebase includes support for multi-head self-attention over the LSTM output, but no shipped model was trained with this feature enabled.
The LSTM processes sequences of length 100 with stride 1 during training. At inference, a seed sequence (randomly selected from training data) is extended autoregressively using a sliding window.
4.2 MusicTransformer (V7–V8)
8-layer decoder-only Transformer with 8 attention heads, 512 model dimension, and 2048 feed-forward dimension. Approximately 34 million parameters.
RoPE
Rotary positional encoding. Position is encoded by rotating query/key vectors in the complex plane. An auto-extending cache handles variable sequence lengths.
SwiGLU
Gated feed-forward: W₂(SiLU(xW₁) ⊙ xW₃). Three weight matrices instead of two.
RMSNorm
Pre-norm without mean centering. Applied before both attention and feed-forward sublayers.
Weight tying
Token embedding and output projection share weights. Reduces parameters and regularises.
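The two normalisation and feed-forward components are compact enough to sketch directly. The following is a minimal numpy illustration under the V7/V8 dimensions (512 model dimension, 2048 feed-forward dimension); it is not the project's PyTorch code.

```python
import numpy as np

def rmsnorm(x, g, eps=1e-6):
    """RMSNorm: scale by the root mean square, no mean centering."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * g

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w1, w3, w2):
    """SwiGLU feed-forward: W2(SiLU(x W1) * (x W3)).
    Three weight matrices instead of the two in a vanilla FFN."""
    return (silu(x @ w1) * (x @ w3)) @ w2

rng = np.random.default_rng(0)
d, d_ff = 512, 2048                      # dimensions from the V7/V8 model
x = rng.normal(size=(1, d))
h = rmsnorm(x, np.ones(d))               # pre-norm before the sublayer
y = swiglu_ffn(h,
               rng.normal(size=(d, d_ff)),
               rng.normal(size=(d, d_ff)),
               rng.normal(size=(d_ff, d)))
assert y.shape == (1, d)
```

Weight tying then amounts to reusing the (vocab, d) embedding matrix, transposed, as the final output projection.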
Causal masking prevents attention to future positions. At inference, KV caching stores previous key/value projections so each autoregressive step only computes attention for the new token. The cache resets when the sequence exceeds max_seq_len (512).
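A minimal sketch of that caching logic, simplified to a single head in plain numpy (the class and method names are illustrative, not the project's API). Causality holds by construction, since the cache only ever contains past positions, and the full clear on overflow mirrors the reset behaviour described above.

```python
import numpy as np

class KVCache:
    """Per-layer key/value cache for autoregressive decoding. Each step
    appends one position, so attention for the new token needs only a
    single query against all cached keys/values."""
    def __init__(self, max_seq_len: int = 512):
        self.max_seq_len = max_seq_len
        self.k, self.v = [], []

    def step(self, k_new: np.ndarray, v_new: np.ndarray):
        if len(self.k) >= self.max_seq_len:
            self.k.clear()   # reset when the window is exceeded
            self.v.clear()
        self.k.append(k_new)
        self.v.append(v_new)
        return np.stack(self.k), np.stack(self.v)

def attend(q, k, v):
    """Single-query attention over the cached keys/values."""
    scores = k @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v

cache = KVCache(max_seq_len=512)
rng = np.random.default_rng(0)
for _ in range(3):                       # three decoding steps
    k, v = cache.step(rng.normal(size=64), rng.normal(size=64))
out = attend(rng.normal(size=64), k, v)  # attention for the newest token
assert out.shape == (64,)
```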
5. Training methodology
5.1 Data pipeline
MIDI files are split into training and validation sets (90/10) before any augmentation to prevent data leakage. Each MIDI file is then augmented via pitch transposition before tokenisation: the score is shifted by a random number of semitones within +/-6, producing up to 2 augmented copies per original. This approximately triples the effective dataset size while preserving all structural relationships (a melody transposed to a different key is musically equivalent). Files are then tokenised (pitch-string or REMI depending on version).
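The transposition step can be sketched as follows (a pure-Python illustration over pitch lists; function and parameter names are hypothetical, and the real pipeline transposes full scores before tokenisation):

```python
import random

def transpose_copies(pitches: list[int], n_copies: int = 2,
                     max_shift: int = 6, seed: int = 0) -> list[list[int]]:
    """Pitch-transposition augmentation: shift every note by the same
    random number of semitones in [-max_shift, +max_shift], producing
    up to n_copies augmented versions per original."""
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        shift = rng.randint(-max_shift, max_shift)
        shifted = [p + shift for p in pitches]
        if all(0 <= p <= 127 for p in shifted):  # stay in MIDI range
            copies.append(shifted)
    return copies

melody = [60, 62, 64, 65, 67]            # C major fragment
for copy in transpose_copies(melody):
    # Intervals are preserved: each copy is the same melody in another key.
    assert [b - a for a, b in zip(copy, copy[1:])] == [2, 2, 1, 2]
```

Because every note shifts by the same amount, all intervallic and rhythmic relationships survive, which is what makes the augmentation musically valid.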
Token sequences are then sliced into fixed-length windows using a sliding window with configurable stride. The LSTM uses stride 1 (dense overlap, sequence length 100). The Transformer uses stride 64 (sequence length 256). For the LSTM, the target is the single next token. For the Transformer, the target is the full sequence shifted by one position (causal language modelling loss at all positions).
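The windowing step amounts to a short loop. This sketch shows the Transformer-style shifted target (loss at every position); the LSTM variant would instead keep only the final element of each target as the single next token.

```python
def windows(tokens: list[int], seq_len: int, stride: int):
    """Slice a token sequence into fixed-length (input, target) pairs,
    where the target is the input shifted one position to the right."""
    for start in range(0, len(tokens) - seq_len, stride):
        x = tokens[start:start + seq_len]
        y = tokens[start + 1:start + seq_len + 1]  # shifted by one
        yield x, y

pairs = list(windows(list(range(10)), seq_len=4, stride=2))
assert pairs[0] == ([0, 1, 2, 3], [1, 2, 3, 4])
assert len(pairs) == 3
```

Stride trades dataset size against redundancy: stride 1 (LSTM) yields maximally overlapping windows, while stride 64 (Transformer) keeps 75% overlap at sequence length 256 with far fewer near-duplicate examples.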
5.2 Hyperparameters
The LSTM (V6) and the Transformer (V7/V8) used separate hyperparameter configurations; the key settings (sequence length and stride above, warmup schedule and batch size) are noted alongside each version in the experimental results.
5.3 Training data
| Model | Genre | Tracks | Tokeniser | Vocab |
|---|---|---|---|---|
| V2 | R&B / 90s hip-hop | 24 | Pitch strings | 59 |
| V3 | Dance | ~200 | Pitch strings | 635 |
| V4 | Jazz | ~120 | Pitch strings | 1,279 |
| V5 | Mixed | 275 | Pitch strings | 629 |
| V6 | Mixed | 275 | REMI | 362 |
| V7 | Mixed | 275 | REMI + BPE | 512 |
| V8 | Mixed + classical | 275 + MAESTRO | REMI + BPE | 1,024 |
V8 includes the MAESTRO dataset (v3): 1,276 competition piano performances from the International Piano-e-Competition.
6. Sampling strategy
At each autoregressive step, the model produces a logit vector over the vocabulary. Three transformations are applied before sampling:
Temperature (default 0.8, range 0.1–2.0)
Logits are divided by the temperature value before softmax. Values below 1.0 sharpen the distribution (more deterministic); values above 1.0 flatten it (more diverse). This is applied first.
Top-k (default 50, range 0–500)
All tokens outside the k highest-probability candidates are masked to negative infinity. This truncates the long tail of low-probability tokens that can produce incoherent output.
Top-p / nucleus (default 0.95, range 0.01–1.0)
Tokens are sorted by probability and included until the cumulative probability exceeds p. This adapts the candidate set size dynamically: fewer candidates when the model is confident, more when uncertain.
If both top-k and top-p mask all tokens (possible with aggressive settings), the implementation falls back to uniform sampling over the first k token indices, or the full vocabulary if top-k is disabled.
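The three transformations compose as below. This is a minimal numpy sketch of the described order (temperature, then top-k, then top-p), not the project's implementation, and it omits the uniform-sampling fallback path.

```python
import numpy as np

def sample(logits: np.ndarray, temperature: float = 0.8,
           top_k: int = 50, top_p: float = 0.95, rng=None) -> int:
    """Apply temperature, then top-k, then top-p, and sample one token."""
    rng = rng or np.random.default_rng()
    logits = logits / temperature                        # 1. temperature
    if top_k and top_k < len(logits):
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)  # 2. top-k mask
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                      # 3. nucleus:
    cum = np.cumsum(probs[order])                        #    keep tokens
    keep = order[: np.searchsorted(cum, top_p) + 1]      #    until cum > p
    p = probs[keep] / probs[keep].sum()                  # renormalise
    return int(rng.choice(keep, p=p))

# With top_k=1 only the highest-probability token survives.
peaked = np.array([0.0, 10.0, 0.0])
assert sample(peaked, temperature=1.0, top_k=1) == 1
```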
7. Evaluation metrics
Perplexity
exp(cross-entropy loss) on held-out validation data. Measures how well the model predicts the next token. V7 reached validation perplexity of approximately 1.1.
N-gram repetition
Computed at 2, 4, 8, and 16-gram levels over generated sequences. High repetition indicates the model is stuck in loops. REMI tokenisation reduced repetition relative to pitch-string encoding, because explicit timing tokens break up otherwise identical pitch sequences.
Compression ratio
Ratio of unique n-grams to total n-grams in generated output. Values close to 1.0 indicate high diversity. Low values indicate heavy repetition. Used as a quick diagnostic during training.
Pitch distribution
Histogram comparison between generated output and training data. Checks for mode collapse (over-representing a few pitches) and out-of-range values. A well-trained model should approximate the training distribution without memorising it.
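Two of these metrics are simple enough to sketch directly. The following is a minimal pure-Python illustration (not the project's evaluation code); the function names are ours.

```python
import math

def perplexity(nll_per_token: list[float]) -> float:
    """exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

def unique_ngram_ratio(tokens: list[int], n: int) -> float:
    """Unique n-grams over total n-grams: values near 1.0 indicate
    diverse output, values near 0 indicate heavy repetition."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / len(grams)

# A mean validation loss of 0.0952 corresponds to perplexity ~1.1, as
# reported for V7.
assert round(perplexity([0.0952]), 1) == 1.1

# A two-token loop scores poorly at the 2-gram level.
assert unique_ngram_ratio([1, 2, 1, 2, 1, 2], 2) == 2 / 5
```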
8. Experimental results
Eight model versions were trained sequentially. Each version changed one or two variables from the previous. The following summarises the progression and observations.
V1: Baseline LSTM
LSTM [256, 512, 256], float input, TF/Keras
Initial proof of concept. Float-normalised pitch input (dividing by vocabulary size), no embedding layer. Produced vaguely musical output without coherent phrase structure. This version has since been lost but established that the autoregressive approach was viable.
V2: Genre-specific data
LSTM [256, 512, 256], float input, TF/Keras
Same architecture as V1. Trained on 24 R&B/hip-hop tracks. Vocabulary of 59 tokens due to limited harmonic complexity.
V3: Increased training data
LSTM [256, 512, 256], float input, TF/Keras
Same architecture as V2, trained on ~200 dance tracks. Vocabulary grew to 635 tokens. More data improved diversity but the model tended toward repetitive loops. The pitch-string encoding struggled with the harmonic variety in the dataset.
V4: Jazz dataset
LSTM [256, 512, 256], float input, TF/Keras
Same architecture as V3. Trained on ~120 jazz tracks. Vocabulary reached 1,279 tokens due to the combinatorial explosion of jazz chord voicings. The vocabulary size became a clear bottleneck for this architecture.
V5: Wider LSTM, mixed-genre training
LSTM [512, 512, 512], float input, TF/Keras
LSTM hidden units widened from [256, 512, 256] to [512, 512, 512]. Combined all 275 tracks across genres into a single training set. Vocabulary 629 tokens. Output quality was acceptable but had reached the ceiling of pitch-string encoding and float normalisation.
V6: PyTorch, embeddings, REMI tokenisation
LSTM [512, 512, 512], embedding dim=128, REMI, PyTorch
Rewritten in PyTorch. Added a learned embedding layer (128 dimensions), replacing float normalisation. Switched from pitch-string to REMI tokenisation. Vocabulary dropped from 629 to 362 while encoding timing, velocity, dynamics, and tempo. Output quality improved substantially. This confirmed that the input representation was the primary constraint, not the model capacity.
V7: Transformer architecture, BPE compression
Transformer (8L/8H/512d), REMI + BPE 512
Replaced LSTM with an 8-layer Transformer. Added RoPE, SwiGLU, RMSNorm, weight tying. Applied BPE (512 vocab) over REMI tokens. Sequence length increased to 256. Linear warmup (2000 steps) followed by cosine decay. Gradient accumulation (2 steps) for effective batch size of 128. Early-stopped at epoch 45 with best validation loss of 0.0952 (at epoch 30).
V8: Expanded vocabulary and dataset
Transformer (8L/8H/512d), REMI + BPE 1024
Same architecture as V7. BPE vocabulary doubled to 1024. Training data expanded with MAESTRO dataset (1,276 classical performances). Warmup increased to 4000 steps. Ran all 100 training epochs without early stopping, suggesting the model continued to benefit from the larger dataset throughout training.
9. Discussion
Tokenisation had more impact than model capacity
V5 and V6 used the same LSTM dimensions and training data. V6 changed three things simultaneously: the framework (TensorFlow to PyTorch), the input representation (float normalisation to learned embeddings), and the tokenisation scheme (pitch strings to REMI). Of these, tokenisation had the largest impact: the vocabulary dropped from 629 to 362 tokens while encoding strictly more musical information. This is consistent with findings in NLP, where improvements to the input representation can outweigh model scaling at small scale.
Genre-specific training had diminishing returns at this data scale
V2 through V4 were genre-specific (R&B, dance, jazz). V5 combined all genres. With fewer than 300 MIDI files total, the diversity of the combined dataset outweighed the benefit of genre focus. Genre-specific models may become worthwhile at dataset sizes of thousands of tracks per genre.
Widening the LSTM had limited returns
V5 increased hidden units from [256, 512, 256] to [512, 512, 512] and combined all genres. This improved output quality modestly, but the fundamental limitation was the pitch-string representation and float normalisation, not the model capacity. The recurrent bottleneck also remained: all context must pass through the hidden state sequentially. The Transformer in V7 removed this bottleneck by allowing direct attention between any pair of positions.
BPE compression increased effective context at negligible cost
Applying BPE over REMI tokens compressed multi-token note events into single tokens. A fixed-length window of 256 BPE tokens covers 2-3x more musical content than 256 raw REMI tokens. The encoding/decoding overhead is negligible. V8 doubled the BPE vocabulary (512 to 1024) to accommodate the more complex patterns in the MAESTRO dataset.
Data augmentation was necessary to prevent overfitting
Pitch transposition (+/-6 semitones, up to 2 augmented copies per file) approximately tripled the effective dataset. Without augmentation, models overfit rapidly on the relatively small training sets. The augmentation is musically valid because transposition preserves all structural relationships between notes.
V8 did not early-stop, unlike V7
V7 early-stopped at epoch 45 of 100 with best validation loss 0.0952 (at epoch 30). V8, with the same architecture but a larger dataset and vocabulary, ran all 100 epochs. This suggests V7 was data-limited and V8 had enough data to continue learning throughout the full training run.
10. Limitations and future work
Single-track generation
The model produces a single melodic line. Multi-track generation (melody, bass, chords, drums) would require interleaved token representations or a multi-stream architecture.
Limited training data
The primary dataset is 275 MIDI files. While MAESTRO adds classical performances for V8, the total training data is small by modern standards. Scaling to tens of thousands of curated MIDI files would likely improve both quality and diversity.
No model-level conditioning
The API supports key, tempo, and style parameters, but the models do not yet use these as conditioning inputs. Prefix tokens or cross-attention over condition embeddings are two approaches under consideration.
Evaluation is primarily automated
Perplexity, n-gram repetition, and pitch distribution are useful diagnostics but do not directly measure musical quality. Human evaluation or a learned critic model would provide stronger signal.
No music-specific positional encoding
RoPE encodes sequential position but not musical concepts like beat position or bar structure. The original Music Transformer used relative position encodings designed for pitch intervals. Combining RoPE with music-aware relative encoding could improve harmonic coherence.
No fine-tuning from user feedback
The gallery collects generated melodies but does not capture user preferences. A rating system could provide signal for RLHF-style fine-tuning on perceived musical quality.
References
Huang, C-Z. A., Vaswani, A., Uszkoreit, J., Shazeer, N., Simon, I., Hawthorne, C., Dai, A., Hoffman, M., Dinculescu, M., Eck, D. (2019). Music Transformer: Generating Music with Long-Term Structure. arXiv
Huang, Y-S., Yang, Y-H. (2020). Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions. arXiv
Fradet, N., Gutowski, N., et al. (2023). Byte Pair Encoding for Symbolic Music. arXiv
Hawthorne, C., Stasyuk, A., Roberts, A., Simon, I., Huang, C-Z. A., Dieleman, S., Elsen, E., Engel, J., Eck, D. (2019). Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset. arXiv
Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv
Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv
Zhang, B., Sennrich, R. (2019). Root Mean Square Layer Normalization. arXiv
Press, O., Wolf, L. (2017). Using the Output Embedding to Improve Language Models. arXiv
Sennrich, R., Haddow, B., Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. arXiv
Loshchilov, I., Hutter, F. (2016). SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv