Autoregressive Symbolic Music Generation
with LSTM and Transformer Architectures

This document describes the design, training, and evaluation of eight model versions developed for monophonic melody generation from MIDI data. It covers the evolution from a baseline LSTM to a 34M-parameter Transformer with REMI tokenisation and BPE compression, and reports observations from each iteration.

1. Problem statement

The task is next-token prediction over symbolic music sequences. Given a sequence of tokens representing musical events, the model predicts a probability distribution over the vocabulary for the next token. At inference time, tokens are sampled autoregressively from this distribution to produce novel sequences.

The input representation is symbolic MIDI, not audio. Each note is a discrete event with pitch, onset time, duration, and velocity. The output is a MIDI file that can be rendered to audio, edited in a DAW, or transposed to a different key. The model has no access to audio signals and learns structure entirely from the symbolic representation.

This differs from waveform generation systems (Suno, Udio, MusicGen), which produce finished audio. Symbolic generation preserves editability at the cost of requiring a separate synthesis step.

2. Related work

2.1 Attention mechanisms for music

Huang et al. (2019) demonstrated that self-attention captures long-range musical structure more effectively than recurrent networks. Their Music Transformer used relative positional encoding to learn interval relationships rather than absolute positions, which is musically motivated: a perfect fifth is a perfect fifth regardless of the starting pitch.

This work motivated the transition from LSTM to Transformer in V7. However, the implementation here uses Rotary Position Embeddings (RoPE) [Su et al., 2021] instead of the relative attention scheme from the original paper. RoPE encodes position by rotating query and key vectors, producing a natural attention decay with distance at lower computational cost.
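The rotation idea can be illustrated in a few lines of plain Python. This is a toy two-dimensional sketch of the mechanism, not the model's actual implementation; the function names are illustrative.

```python
import cmath

def rope_rotate(vec, pos, theta=10000.0):
    """Rotate consecutive component pairs by a position-dependent angle,
    treating each pair as a complex number (a toy sketch of RoPE)."""
    out = []
    for i in range(0, len(vec), 2):
        freq = theta ** (-i / len(vec))      # lower frequencies for later pairs
        rot = cmath.exp(1j * pos * freq)     # rotation by pos * freq radians
        z = complex(vec[i], vec[i + 1]) * rot
        out.extend([z.real, z.imag])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# The score <rotate(q, m), rotate(k, n)> depends only on the offset m - n,
# not on the absolute positions m and n.
q, k = [1.0, 0.0], [0.0, 1.0]
s1 = dot(rope_rotate(q, 3), rope_rotate(k, 1))   # offset 2
s2 = dot(rope_rotate(q, 7), rope_rotate(k, 5))   # offset 2
assert abs(s1 - s2) < 1e-9
```

Because queries and keys are each rotated by their own position, the attention score between them depends only on their relative offset, which is what makes RoPE a relative encoding despite being applied per position.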

2.2 REMI tokenisation

Huang and Yang (2020) introduced REMI (REvamped MIDI-derived Events), a tokenisation scheme that encodes musical events as typed tokens: Bar, Position, Pitch, Velocity, Duration, Tempo, and Chord. This replaced earlier approaches that treated MIDI as a flat sequence of note-on/note-off events or pitch strings.

REMI makes musical structure explicit in the token stream. A note is no longer a single pitch token but a structured group of tokens encoding when it occurs, how loud it is, and how long it lasts. This was adopted in V6 using the MidiTok library with 32 velocity levels, chord detection, tempo tokens, and rest tokens (a MidiTok extension not in the original REMI specification).

2.3 Subword compression for music

Fradet et al. (2023) applied Byte Pair Encoding (BPE) [Sennrich et al., 2016] to REMI token sequences. BPE iteratively merges the most frequent token pairs into single tokens, compressing common multi-token note events (e.g., Position + Pitch + Velocity + Duration) into atomic units. This increases the musical context visible within a fixed sequence length.
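The merge loop can be sketched in plain Python. This is an illustrative toy, not the MidiTok implementation; the integer token IDs stand in for REMI events.

```python
from collections import Counter

def merge_pair(seq, pair, new_id):
    """Replace every occurrence of the adjacent pair with new_id."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe(sequences, num_merges):
    """Iteratively merge the most frequent adjacent token pair into a new
    token (toy sketch of BPE over REMI streams)."""
    merges = []
    next_id = max(t for seq in sequences for t in seq) + 1
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:
            break
        merges.append((best, next_id))
        sequences = [merge_pair(seq, best, next_id) for seq in sequences]
        next_id += 1
    return sequences, merges

# Position(0) Pitch(1) Velocity(2) Duration(3): the recurring note event
# collapses into a single token after three merges.
corpus = [[0, 1, 2, 3, 0, 1, 2, 3], [0, 1, 2, 3]]
compressed, merges = learn_bpe(corpus, num_merges=3)
```

After three merges the four-token note event is a single atomic token, which is exactly how BPE increases the musical context visible within a fixed window.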

BPE was adopted in V7 with a learned vocabulary of 512 tokens, and expanded to 1024 in V8 to accommodate patterns from a larger training corpus.

2.4 Architectural components from LLM research

The Transformer architecture draws on several components developed for language modelling: SwiGLU feed-forward layers [Shazeer, 2020], RMSNorm [Zhang and Sennrich, 2019], and weight tying between input embeddings and the output projection [Press and Wolf, 2017]. The learning rate schedule follows warmup followed by cosine decay [Loshchilov and Hutter, 2016]. These are standard choices in current Transformer training and are not specific to music.
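The warmup-plus-cosine schedule is simple to write down. The function below is a sketch with illustrative step counts, not the project's training code.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr=3e-4, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Illustrative: 10,000 total steps, 2,000 warmup steps, peak 3e-4.
schedule = [lr_at_step(s, 10_000, 2_000) for s in range(0, 10_000, 1_000)]
```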

3. Tokenisation

Three tokenisation strategies were used across model versions, each encoding progressively more musical information.

V2–V5: Pitch strings (vocabulary 59–1,279 tokens)

Notes represented as music21 pitch names (e.g., "C4", "F#3") and chords as dot-separated integers. No explicit encoding of timing, velocity, or dynamics; temporal structure is learned implicitly from position in the sequence.

V6: REMI (vocabulary 362 tokens)

Typed tokens: Bar, Position, Pitch, Velocity (32 levels), Duration, Tempo, Chord, Rest. Musical structure is made explicit. The vocabulary decreased despite encoding strictly more information, because the typed scheme is more efficient than enumerating all observed pitch/chord combinations.

V7–V8: REMI + BPE (vocabulary 512–1,024 tokens)

BPE compression applied to REMI sequences. Frequently co-occurring token groups (e.g., a note event: Position + Pitch + Velocity + Duration) are merged into single tokens. A 256-token BPE sequence covers approximately 2–3x more musical content than 256 raw REMI tokens.

4. Architecture

4.1 MelodyLSTM (V1–V6)

Three-layer LSTM. V1 through V4 used hidden units of [256, 512, 256] with float-normalised pitch input (dividing by vocabulary size). V5 widened to [512, 512, 512]. V6 added a learned embedding layer (128 dimensions), replacing float normalisation. The final hidden state is projected through a linear layer to the vocabulary size.

V1 through V5 were trained with TensorFlow/Keras and later converted to PyTorch. V6 was the first version trained natively in PyTorch. The codebase includes support for multi-head self-attention over the LSTM output, but no shipped model was trained with this feature enabled.

The LSTM processes sequences of length 100 with stride 1 during training. At inference, a seed sequence (randomly selected from training data) is extended autoregressively using a sliding window.

4.2 MusicTransformer (V7–V8)

8-layer decoder-only Transformer with 8 attention heads, 512 model dimension, and 2048 feed-forward dimension. Approximately 34 million parameters.

RoPE: Rotary positional encoding. Position is encoded by rotating query/key vectors in the complex plane, with an auto-extending cache for variable sequence lengths.

SwiGLU: Gated feed-forward, W₂(SiLU(xW₁) ⊙ xW₃), using three weight matrices instead of two.

RMSNorm: Pre-norm without mean centering, applied before both the attention and feed-forward sublayers.

Weight tying: The token embedding and output projection share weights, which reduces parameter count and acts as a regulariser.

Causal masking prevents attention to future positions. At inference, KV caching stores previous key/value projections so each autoregressive step only computes attention for the new token. The cache resets when the sequence exceeds max_seq_len (512).
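The cache behaviour described above (one appended position per step, reset once the sequence exceeds max_seq_len) can be sketched as follows. The class is illustrative and stores raw per-position entries rather than real key/value tensors.

```python
class KVCache:
    """Per-layer key/value cache for autoregressive decoding (sketch)."""

    def __init__(self, max_seq_len=512):
        self.max_seq_len = max_seq_len
        self.keys, self.values = [], []

    def step(self, k, v):
        # Reset once the cached sequence would exceed max_seq_len.
        if len(self.keys) >= self.max_seq_len:
            self.keys.clear()
            self.values.clear()
        self.keys.append(k)
        self.values.append(v)
        # Attention for the new token runs over all cached positions,
        # so each step only computes projections for the new token.
        return self.keys, self.values

# Toy run with max_seq_len=3: the fourth step triggers a reset.
cache = KVCache(max_seq_len=3)
for t in range(4):
    keys, values = cache.step(("k", t), ("v", t))
```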

5. Training

5.1 Data pipeline

MIDI files are split into training and validation sets (90/10) before any augmentation to prevent data leakage. Each MIDI file is then augmented via pitch transposition before tokenisation: the score is shifted by a random number of semitones within +/-6, producing up to 2 augmented copies per original. This approximately triples the effective dataset size while preserving all structural relationships (a melody transposed to a different key is musically equivalent). Files are then tokenised (pitch-string or REMI depending on version).
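The transposition step can be sketched over plain MIDI pitch numbers. This is an illustrative stand-in for the actual pipeline, which operates on scores before tokenisation; the helper name is hypothetical.

```python
import random

def transpose_copies(pitches, n_copies=2, max_shift=6, seed=0):
    """Pitch-transposition augmentation: each copy shifts every MIDI pitch
    by a random non-zero offset within +/-max_shift semitones."""
    rng = random.Random(seed)
    copies = [pitches]                       # keep the original
    for _ in range(n_copies):
        shift = rng.choice([s for s in range(-max_shift, max_shift + 1) if s])
        copies.append([p + shift for p in pitches])
    return copies

melody = [60, 62, 64, 65, 67]                # C-major fragment
augmented = transpose_copies(melody)         # original + 2 transposed copies
```

Every copy preserves the interval sequence of the original, which is why transposition is a structure-preserving augmentation.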

Token sequences are then sliced into fixed-length windows using a sliding window with configurable stride. The LSTM uses stride 1 (dense overlap, sequence length 100). The Transformer uses stride 64 (sequence length 256). For the LSTM, the target is the single next token. For the Transformer, the target is the full sequence shifted by one position (causal language modelling loss at all positions).
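The slicing step for the Transformer's causal-LM targets can be sketched as follows (illustrative helper, not the project's dataset code):

```python
def slice_windows(tokens, seq_len, stride):
    """Slice a token stream into (input, target) training pairs with a
    sliding window; the target is the input shifted one position."""
    pairs = []
    for start in range(0, len(tokens) - seq_len, stride):
        window = tokens[start : start + seq_len + 1]
        pairs.append((window[:-1], window[1:]))
    return pairs

# 10 tokens, windows of 4 with stride 2 -> inputs start at 0, 2, 4.
pairs = slice_windows(list(range(10)), seq_len=4, stride=2)
```

With stride 1 and a single-token target this reduces to the LSTM's dense-overlap scheme; the Transformer's stride 64 trades overlap for fewer, longer windows.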

5.2 Hyperparameters

LSTM (V6)

Layers: 3x LSTM [512, 512, 512]
Embedding: 128 dim (V6 only)
Sequence length: 100
Stride: 1
Batch size: 256
Learning rate: 4e-3
Optimiser: AdamW (weight decay 0.01)
LR schedule: cosine annealing
Gradient clipping: max norm 1.0
Early stopping: 10 epochs patience
Augmentation: 2x, +/-6 semitones

Transformer (V7/V8)

Layers: 8 transformer blocks
Attention heads: 8
d_model: 512
d_ff: 2048 (SwiGLU)
Sequence length: 256
Stride: 64
Batch size: 64 (gradient accumulation 2)
Learning rate: 3e-4
Optimiser: AdamW (weight decay 0.01)
LR schedule: warmup + cosine decay
Warmup: 2000 steps (V7) / 4000 steps (V8)
Early stopping: 15 epochs patience

5.3 Training data

Model | Genre             | Tracks        | Tokeniser     | Vocab
V2    | R&B / 90s hip-hop | 24            | Pitch strings | 59
V3    | Dance             | ~200          | Pitch strings | 635
V4    | Jazz              | ~120          | Pitch strings | 1,279
V5    | Mixed             | 275           | Pitch strings | 629
V6    | Mixed             | 275           | REMI          | 362
V7    | Mixed             | 275           | REMI + BPE    | 512
V8    | Mixed + classical | 275 + MAESTRO | REMI + BPE    | 1,024

V8 includes the MAESTRO dataset (v3): 1,276 competition piano performances from the International Piano-e-Competition.

6. Sampling

At each autoregressive step, the model produces a logit vector over the vocabulary. Three transformations are applied before sampling:

Temperature (default 0.8, range 0.1–2.0). Logits are divided by the temperature value before softmax. Values below 1.0 sharpen the distribution (more deterministic); values above 1.0 flatten it (more diverse). This is applied first.

Top-k (default 50, range 0–500). All tokens outside the k highest-probability candidates are masked to negative infinity. This truncates the long tail of low-probability tokens that can produce incoherent output.

Top-p, nucleus (default 0.95, range 0.01–1.0). Tokens are sorted by probability and included until the cumulative probability exceeds p. This adapts the candidate set size dynamically: fewer candidates when the model is confident, more when uncertain.

If both top-k and top-p mask all tokens (possible with aggressive settings), the implementation falls back to uniform sampling over the first k token indices, or the full vocabulary if top-k is disabled.
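The three transformations, applied in the order described (temperature, then top-k, then top-p), can be sketched in plain Python. `sample_next` is an illustrative helper, not the shipped implementation, and it omits the uniform-sampling fallback described above.

```python
import math, random

def sample_next(logits, temperature=0.8, top_k=50, top_p=0.95, seed=None):
    """Temperature -> softmax -> top-k -> top-p sampling over one logit vector."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                               # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-k: keep only the k most probable token indices.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:top_k])
    # Top-p: smallest nucleus whose cumulative probability exceeds p.
    cum, nucleus = 0.0, set()
    for i in order:
        if i in keep:
            nucleus.add(i)
            cum += probs[i]
            if cum > top_p:
                break
    # Sample from the renormalised nucleus.
    mass = sum(probs[i] for i in nucleus)
    rng = random.Random(seed)
    r, acc = rng.random() * mass, 0.0
    for i in order:
        if i in nucleus:
            acc += probs[i]
            if acc >= r:
                return i
    return order[0]
```

With an aggressive top_k=1 or a small top_p, the nucleus shrinks to the single most probable token, so sampling becomes greedy decoding.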

7. Evaluation

Perplexity

Defined as exp(cross-entropy loss) on held-out validation data; it measures how well the model predicts the next token. V7 reached a validation perplexity of approximately 1.1.
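The metric follows directly from the definition; a minimal sketch with an illustrative helper name:

```python
import math

def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (cross-entropy)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A mean cross-entropy of 0.0952 nats corresponds to a perplexity of about
# 1.10, matching the V7 figure quoted above.
ppl = perplexity([0.0952, 0.0952, 0.0952])
```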

N-gram repetition

Computed at 2, 4, 8, and 16-gram levels over generated sequences. High repetition indicates the model is stuck in loops. REMI tokenisation reduced repetition relative to pitch-string encoding, because explicit timing tokens break up otherwise identical pitch sequences.

Compression ratio

Ratio of unique n-grams to total n-grams in generated output. Values close to 1.0 indicate high diversity. Low values indicate heavy repetition. Used as a quick diagnostic during training.
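Both n-gram diagnostics reduce to counting unique versus total n-grams; a minimal sketch with an illustrative helper name:

```python
def ngram_stats(tokens, n):
    """Unique-to-total n-gram ratio: near 1.0 means diverse output,
    near 0.0 means heavy repetition."""
    grams = [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / len(grams)

looped = [60, 62, 64] * 8          # a melody stuck in a 3-note loop
varied = list(range(24))           # no repeated n-grams at all
```

The looped sequence has only three distinct 4-grams out of twenty-one, so its ratio is low, while the varied sequence scores exactly 1.0.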

Pitch distribution

Histogram comparison between generated output and training data. Checks for mode collapse (over-representing a few pitches) and out-of-range values. A well-trained model should approximate the training distribution without memorising it.

8. Experiments

Eight model versions were trained sequentially. Each version changed one or two variables from the previous. The following summarises the progression and observations.

V1: Baseline LSTM (LSTM [256, 512, 256], float input, TF/Keras)

Initial proof of concept. Float-normalised pitch input (dividing by vocabulary size), no embedding layer. Produced vaguely musical output without coherent phrase structure. This version has since been lost but established that the autoregressive approach was viable.

V2: Genre-specific data (LSTM [256, 512, 256], float input, TF/Keras)

Same architecture as V1. Trained on 24 R&B/hip-hop tracks. Vocabulary of 59 tokens due to limited harmonic complexity.

V3: Increased training data (LSTM [256, 512, 256], float input, TF/Keras)

Same architecture as V2, trained on ~200 dance tracks. Vocabulary grew to 635 tokens. More data improved diversity but the model tended toward repetitive loops. The pitch-string encoding struggled with the harmonic variety in the dataset.

V4: Jazz dataset (LSTM [256, 512, 256], float input, TF/Keras)

Same architecture as V3. Trained on ~120 jazz tracks. Vocabulary reached 1,279 tokens due to the combinatorial explosion of jazz chord voicings. The vocabulary size became a clear bottleneck for this architecture.

V5: Wider LSTM, mixed-genre training (LSTM [512, 512, 512], float input, TF/Keras)

LSTM hidden units widened from [256, 512, 256] to [512, 512, 512]. Combined all 275 tracks across genres into a single training set. Vocabulary 629 tokens. Output quality was acceptable but had reached the ceiling of pitch-string encoding and float normalisation.

V6: PyTorch, embeddings, REMI tokenisation (LSTM [512, 512, 512], embedding dim=128, REMI)

Rewritten in PyTorch. Added a learned embedding layer (128 dimensions), replacing float normalisation. Switched from pitch-string to REMI tokenisation. Vocabulary dropped from 629 to 362 while encoding timing, velocity, dynamics, and tempo. Output quality improved substantially. This confirmed that the input representation was the primary constraint, not the model capacity.

V7: Transformer architecture, BPE compression (Transformer 8L/8H/512d, REMI + BPE 512)

Replaced LSTM with an 8-layer Transformer. Added RoPE, SwiGLU, RMSNorm, weight tying. Applied BPE (512 vocab) over REMI tokens. Sequence length increased to 256. Linear warmup (2000 steps) followed by cosine decay. Gradient accumulation (2 steps) for effective batch size of 128. Early-stopped at epoch 45 with best validation loss of 0.0952 (at epoch 30).

V8: Expanded vocabulary and dataset (Transformer 8L/8H/512d, REMI + BPE 1024)

Same architecture as V7. BPE vocabulary doubled to 1024. Training data expanded with MAESTRO dataset (1,276 classical performances). Warmup increased to 4000 steps. Ran all 100 training epochs without early stopping, suggesting the model continued to benefit from the larger dataset throughout training.

9. Discussion

Tokenisation had more impact than model capacity

V5 and V6 use the same LSTM dimensions and training data. V6 changed three things simultaneously: the framework (TensorFlow to PyTorch), the input representation (float normalisation to learned embeddings), and the tokenisation scheme (pitch strings to REMI). Of these, the tokenisation change appears to have had the largest effect: the vocabulary dropped from 629 to 362 tokens while encoding strictly more musical information. This mirrors a recurring finding in NLP: improvements to the input representation can outweigh increases in model capacity.

Genre-specific training had diminishing returns at this data scale

V2 through V4 were genre-specific (R&B, dance, jazz). V5 combined all genres. With fewer than 300 MIDI files total, the diversity of the combined dataset outweighed the benefit of genre focus. Genre-specific models may become worthwhile at dataset sizes of thousands of tracks per genre.

Widening the LSTM had limited returns

V5 increased hidden units from [256, 512, 256] to [512, 512, 512] and combined all genres. This improved output quality modestly, but the fundamental limitation was the pitch-string representation and float normalisation, not the model capacity. The recurrent bottleneck also remained: all context must pass through the hidden state sequentially. The Transformer in V7 removed this bottleneck by allowing direct attention between any pair of positions.

BPE compression increased effective context at negligible cost

Applying BPE over REMI tokens compressed multi-token note events into single tokens. A fixed-length window of 256 BPE tokens covers 2-3x more musical content than 256 raw REMI tokens. The encoding/decoding overhead is negligible. V8 doubled the BPE vocabulary (512 to 1024) to accommodate the more complex patterns in the MAESTRO dataset.

Data augmentation was necessary to prevent overfitting

Pitch transposition (+/-6 semitones, up to 2 augmented copies per file) approximately tripled the effective dataset. Without augmentation, models overfit rapidly on the relatively small training sets. The augmentation is musically valid because transposition preserves all structural relationships between notes.

V8 did not early-stop, unlike V7

V7 early-stopped at epoch 45 of 100 with best validation loss 0.0952 (at epoch 30). V8, with the same architecture but a larger dataset and vocabulary, ran all 100 epochs. This suggests V7 was data-limited and V8 had enough data to continue learning throughout the full training run.

10. Limitations and future work

Single-track generation

The model produces a single melodic line. Multi-track generation (melody, bass, chords, drums) would require interleaved token representations or a multi-stream architecture.

Limited training data

The primary dataset is 275 MIDI files. While MAESTRO adds classical performances for V8, the total training data is small by modern standards. Scaling to tens of thousands of curated MIDI files would likely improve both quality and diversity.

No model-level conditioning

The API supports key, tempo, and style parameters, but the models do not yet use these as conditioning inputs. Prefix tokens or cross-attention over condition embeddings are two approaches under consideration.

Evaluation is primarily automated

Perplexity, n-gram repetition, and pitch distribution are useful diagnostics but do not directly measure musical quality. Human evaluation or a learned critic model would provide a stronger signal.

No music-specific positional encoding

RoPE encodes sequential position but not musical concepts like beat position or bar structure. The original Music Transformer used relative position encodings designed for pitch intervals. Combining RoPE with music-aware relative encoding could improve harmonic coherence.

No fine-tuning from user feedback

The gallery collects generated melodies but does not capture user preferences. A rating system could provide signal for RLHF-style fine-tuning on perceived musical quality.

References

[1] Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Shazeer, N., Simon, I., Hawthorne, C., Dai, A., Hoffman, M., Dinculescu, M., Eck, D. (2019). Music Transformer: Generating Music with Long-Term Structure. arXiv.

[2] Huang, Y.-S., Yang, Y.-H. (2020). Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions. arXiv.

[3] Fradet, N., Gutowski, N., et al. (2023). Byte Pair Encoding for Symbolic Music. arXiv.

[4] Hawthorne, C., Stasyuk, A., Roberts, A., Simon, I., Huang, C.-Z. A., Dieleman, S., Elsen, E., Engel, J., Eck, D. (2019). Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset. arXiv.

[5] Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv.

[6] Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv.

[7] Zhang, B., Sennrich, R. (2019). Root Mean Square Layer Normalization. arXiv.

[8] Press, O., Wolf, L. (2017). Using the Output Embedding to Improve Language Models. arXiv.

[9] Sennrich, R., Haddow, B., Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. arXiv.

[10] Loshchilov, I., Hutter, F. (2016). SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv.