Chord Detection: From Spectrogram to Chord Chart
A chord chart is the most useful object Rast produces. Open a song, scroll the timeline, and there it is: D minor for two bars, then G minor, then A7 over the verse turnaround. Getting from raw audio to those labels involves a neural network, a vocabulary of 170 chord labels, and two cleanup passes that turn a stream of jittery per-frame predictions into something a musician can actually read.
The input: an instrumental, a spectrogram
Chord detection runs on the instrumental stem produced by separation, not the full mix — the singer would only confuse things. Rast's default backend is CREMA (Convolutional and Recurrent Estimators for Music Analysis), a convolutional neural network from McFee & Bello that has held up remarkably well since 2017.
CREMA does not look at the audio waveform directly. It looks at a Constant-Q Transform (CQT) — a special kind of spectrogram where the frequency axis is musically spaced, with three bins per semitone over six octaves starting at C1. Two harmonic stacks are computed (one at the fundamental, one an octave up) and converted to a decibel scale, giving the network a tensor that explicitly encodes "where pitched energy lives" in a way that makes transposition a simple shift along the frequency axis. The CQT machinery lives in rust/rast-analysis/src/librosa_cqt.rs and is bit-for-bit compatible with librosa's reference implementation.
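The bin layout is easy to make concrete. The following is a minimal sketch — not the actual librosa_cqt.rs code — that computes the center frequencies of that grid (three bins per semitone, six octaves, starting from C1 at roughly 32.70 Hz in 12-tone equal temperament):

```rust
// Illustrative sketch of the CQT bin layout described above; the real
// machinery lives in rust/rast-analysis/src/librosa_cqt.rs.
fn cqt_bin_frequencies() -> Vec<f64> {
    const C1_HZ: f64 = 32.703_195_662_574_82; // C1, 12-TET, A4 = 440 Hz
    const BINS_PER_OCTAVE: usize = 36; // 3 bins per semitone
    const N_OCTAVES: usize = 6;
    (0..BINS_PER_OCTAVE * N_OCTAVES)
        // Each bin is a constant frequency *ratio* above the last,
        // which is what makes the axis "musically spaced".
        .map(|k| C1_HZ * 2f64.powf(k as f64 / BINS_PER_OCTAVE as f64))
        .collect()
}

fn main() {
    let freqs = cqt_bin_frequencies();
    assert_eq!(freqs.len(), 216); // 36 bins × 6 octaves
    // A semitone up is exactly 3 bins over; an octave is 36 bins.
    println!(
        "first bin {:.2} Hz, semitone ratio {:.4}",
        freqs[0],
        freqs[3] / freqs[0]
    );
}
```

This geometric spacing is the reason a transposed performance shows up as the same pattern shifted along the frequency axis.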
The model: 170 classes per frame
The network emits, for each ~93 ms frame of audio, a probability distribution over 170 chord labels. The vocabulary is twelve roots times fourteen qualities, plus N (no chord) and X (unknown):
- triads: maj, min, dim, aug
- sixths: maj6, min6
- sevenths: maj7, min7, 7 (dominant), dim7, hdim7 (half-diminished), minmaj7
- suspensions: sus2, sus4
So at every frame the model is voting on whether the music right now is, say, D:min7 or A:7 or N. The full CREMA vocabulary table lives in rust/rast-analysis/src/crema_inference.rs.
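The arithmetic is easy to check. Here is a hedged sketch that enumerates a 170-label vocabulary of the same shape — the canonical table lives in crema_inference.rs, and the exact spelling of roots and qualities there may differ from these illustrative names:

```rust
// Illustrative reconstruction of the vocabulary shape: 12 roots × 14
// qualities, plus N (no chord) and X (unknown). Names are assumptions;
// the canonical table is in rust/rast-analysis/src/crema_inference.rs.
const ROOTS: [&str; 12] = [
    "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B",
];
const QUALITIES: [&str; 14] = [
    "maj", "min", "dim", "aug", // triads
    "maj6", "min6", // sixths
    "maj7", "min7", "7", "dim7", "hdim7", "minmaj7", // sevenths
    "sus2", "sus4", // suspensions
];

fn chord_vocabulary() -> Vec<String> {
    let mut vocab: Vec<String> = ROOTS
        .iter()
        .flat_map(|r| QUALITIES.iter().map(move |q| format!("{r}:{q}")))
        .collect();
    vocab.push("N".into()); // no chord
    vocab.push("X".into()); // unknown
    vocab
}

fn main() {
    let vocab = chord_vocabulary();
    assert_eq!(vocab.len(), 170); // 12 × 14 + 2
    println!("{} labels, e.g. {}", vocab.len(), vocab[0]);
}
```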
                   ~93 ms frames
time ─────────────────────────────────────▶
┌───────┬───────┬───────┬───────┬───────┬───────┐
│ D:min │ D:min │ D:min │ D:min │ D:min │   N   │ ← per-frame argmax
│ D:min7│ D:min │ D:min │ D:min │ D:min │ D:min │ ← raw frame label
│ D:min │ A:maj │ D:min │ D:min │ D:min │ D:min │ ← noisy frame!
└───────┴───────┴───────┴───────┴───────┴───────┘
            ▲
            │
   stutter from a single
   confused frame

A single noisy frame can flip the chord for a tenth of a second and produce the visual equivalent of a stutter. That is what the next two passes fix.
Cleanup pass 1: smoothing
For our alternative chord backend BTC (Bi-directional Transformer for Chord recognition), Rast applies a beat-bucketed log-sum smoother. Within every beat span, all the per-frame probability distributions are added in log space, and the chord with the highest summed log-probability wins the whole beat.
The trick is how the maths weights confidence. A confident frame (say 95% on D:min) contributes a log-probability near zero to its own chord (ln 0.95 ≈ −0.05) and a strongly negative number to every other chord. So one confident frame can veto a sea of weak frames. A frame where the model said "well, maybe D:min, maybe F:maj, maybe A:min" contributes roughly equal log-mass to all three and washes out. The smoother lives in chord_smooth.rs. CREMA's outputs are already segmented internally and skip this pass.
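The veto behaviour falls straight out of the log-sum. Here is a minimal sketch of the idea — the real logic lives in chord_smooth.rs, and the function name and types here are illustrative only:

```rust
// Illustrative sketch of beat-bucketed log-sum smoothing (not the
// chord_smooth.rs code). `frames` holds one probability distribution
// over the chord vocabulary per ~93 ms frame; `beat` selects the
// frames that fall inside one beat span.
fn smooth_beat(frames: &[Vec<f64>], beat: std::ops::Range<usize>) -> usize {
    let n_classes = frames[0].len();
    let mut log_sum = vec![0.0f64; n_classes];
    for frame in &frames[beat] {
        for (class, p) in frame.iter().enumerate() {
            // A confident frame adds ~0 to its own chord and a very
            // negative number to every other chord; clamp avoids ln(0).
            log_sum[class] += p.max(1e-12).ln();
        }
    }
    // The chord with the highest summed log-probability wins the beat.
    log_sum
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(class, _)| class)
        .unwrap()
}

fn main() {
    // One confident D:min frame (95%) vs two wishy-washy frames that
    // slightly prefer another chord: the confident frame wins the beat
    // even though it loses two of the three per-frame argmaxes.
    let frames = vec![vec![0.95, 0.05], vec![0.4, 0.6], vec![0.4, 0.6]];
    println!("winning class: {}", smooth_beat(&frames, 0..3));
}
```

With these toy numbers, class 0 sums to ln 0.95 + 2 ln 0.4 ≈ −1.88 while class 1 sums to ln 0.05 + 2 ln 0.6 ≈ −4.02, so the single confident frame carries the beat.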
Cleanup pass 2: snapping to beats
Whichever backend produced them, the chord segment boundaries do not line up with beats. They land at arbitrary 93 ms frame edges. Two problems follow: the chord lane in the timeline looks ragged, and edits the user makes via the chord picker have no clean target to snap to.
The solution, in chord_snap.rs, is straightforward. Take the beat grid; for every beat-to-next-beat interval, look at all chord segments that overlap it, and pick whichever one covers the most of that beat. That label becomes the chord for that whole beat. Then merge adjacent same-label beats into longer segments.
beats: │ │ │ │ │ │ │ │
raw chords: Dm Dm G Dm A7 A7
per-beat: Dm Dm Dm G Dm Dm A7 A7
merged:     [── Dm ──][ G ][── Dm ──][── A7 ──]

The result is a chord stream where every boundary lands on a beat, every segment spans a whole number of beats, and the editor has clean targets.
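The snap-and-merge pass above can be sketched in a few lines. This is an illustrative reconstruction, not the chord_snap.rs implementation — the `Segment` type and function name are assumptions:

```rust
// Illustrative sketch of beat snapping (the real code is chord_snap.rs).
#[derive(Clone, PartialEq, Debug)]
struct Segment {
    label: String,
    start: f64, // seconds
    end: f64,
}

/// For each beat-to-next-beat interval, the raw segment covering the
/// largest share of the interval wins the whole beat; adjacent beats
/// with the same label are then merged into longer segments.
fn snap_to_beats(raw: &[Segment], beats: &[f64]) -> Vec<Segment> {
    let mut out: Vec<Segment> = Vec::new();
    for w in beats.windows(2) {
        let (b0, b1) = (w[0], w[1]);
        // Find the label with the largest overlap against this beat.
        let best = raw
            .iter()
            .map(|s| (s.end.min(b1) - s.start.max(b0), &s.label))
            .filter(|(overlap, _)| *overlap > 0.0)
            .max_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
        if let Some((_, label)) = best {
            if out.last().map(|s| &s.label) == Some(label) {
                // Same label as the previous beat: extend it.
                out.last_mut().unwrap().end = b1;
            } else {
                out.push(Segment { label: label.clone(), start: b0, end: b1 });
            }
        }
    }
    out
}

fn main() {
    // A raw boundary at 2.4 s gets pulled back onto the beat at 2.0 s,
    // because A7 covers more of the 2–3 s beat than Dm does.
    let raw = vec![
        Segment { label: "Dm".into(), start: 0.0, end: 2.4 },
        Segment { label: "A7".into(), start: 2.4, end: 4.0 },
    ];
    let beats = [0.0, 1.0, 2.0, 3.0, 4.0];
    println!("{:?}", snap_to_beats(&raw, &beats));
}
```

Note that ties and beats with no overlapping segment would need a policy of their own; this sketch simply skips empty beats.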
The trade-off, honestly
Smoothing too aggressively erases real chord changes — a one-bar passing chord can vanish if the smoother decides its neighbours are louder. Smoothing too lightly leaves the chart stuttering. The current settings are tuned for a sweet spot on Greek material, but the model still makes mistakes. That is why every chord in the timeline is editable: click it, pick a replacement from the chord picker, and the change is persisted alongside the analysis. See Editing chords for the workflow. The chord stream feeds Key detection, which uses chord qualities — not just notes — to disambiguate dromoi.