Beat Tracking: Finding the Pulse

Tap your foot along to a song. The moments your foot lands on the floor are beats. Now count "one, two, three, four" — every "one" is a downbeat, the start of a bar. Beat tracking, the task Rast performs to build a timeline you can scrub and snap chords to, is the art of getting a computer to tap its foot and count bars convincingly.

It is harder than it sounds, especially for Greek music. Most off-the-shelf beat trackers are trained on Western 4/4 pop and stumble on 7/8 kalamatiano or 9/8 zeibekiko. Rast does its best with a careful two-step process, and we are honest about its limits.

What you get out

A list of beat times in seconds, plus a BPM (beats per minute). The BPM is simply 60 divided by the median gap between consecutive beats. Downbeats — the bar starts — are computed but currently not trusted as the rendered "1" in the UI: see the note on Greek meters under "Where it struggles" below.
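
If you want that arithmetic spelled out, here is a small Rust sketch. The function name and types are made up for illustration; the real computation lives inside rast-analysis.

// Illustrative sketch, not the actual rast-analysis function:
// BPM is 60 divided by the median gap between consecutive beat times.
fn bpm_from_beats(beat_times_s: &[f32]) -> Option<f32> {
    if beat_times_s.len() < 2 {
        return None; // can't measure a gap from fewer than two beats
    }
    // Gaps between consecutive beats, in seconds.
    let mut gaps: Vec<f32> = beat_times_s
        .windows(2)
        .map(|w| w[1] - w[0])
        .collect();
    // The median is robust to a few spurious or missed beats.
    gaps.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let median = gaps[gaps.len() / 2];
    Some(60.0 / median) // seconds per beat -> beats per minute
}

Beats at 0.0, 0.5, 1.0, 1.5 seconds give a median gap of 0.5 s, hence 120 BPM.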

Step one: a neural network listens

Rast uses beat_this, a deep neural network trained on a large corpus of annotated music. The audio is downsampled to 22.05 kHz mono, transformed into a log-mel spectrogram (a heat-map of which frequencies are loud at each 20 ms tick, on a perceptually-spaced frequency axis), and fed to the network in 30-second windows with a 5-second overlap so beats near a chunk seam aren't missed.
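
The exact window bookkeeping is internal to the pipeline, but the geometry is easy to sketch. Assuming roughly 50 activation frames per second (one every 20 ms), something like the following produces the overlapping 30-second windows; the constants and function name are illustrative, not the real API:

// Split a run of spectrogram frames (about 20 ms apart) into 30-second
// windows that overlap by 5 seconds, so a beat landing on a chunk boundary
// is seen whole by at least one window.
fn chunk_ranges(n_frames: usize) -> Vec<std::ops::Range<usize>> {
    const FPS: usize = 50;            // frames per second at a 20 ms hop
    const WIN: usize = 30 * FPS;      // 30-second window
    const HOP: usize = WIN - 5 * FPS; // advance 25 s, leaving 5 s of overlap
    let mut ranges = Vec::new();
    let mut start = 0;
    loop {
        let end = (start + WIN).min(n_frames);
        ranges.push(start..end);
        if end == n_frames {
            break;
        }
        start += HOP;
    }
    ranges
}

Each window is run through beat_this separately, and the overlapping predictions are reconciled, for example by averaging the activations where two windows cover the same frames.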

For each 20 ms frame, beat_this emits two numbers between 0 and 1: a beat probability and a downbeat probability. The activation curves look something like this — peaks rise wherever the model thinks a beat is happening:

beat activation
1.0 ┤        ▲          ▲          ▲          ▲
    │       ╱ ╲        ╱ ╲        ╱ ╲        ╱ ╲
    │      ╱   ╲      ╱   ╲      ╱   ╲      ╱   ╲
0.0 ┤─────╯     ╰────╯     ╰────╯     ╰────╯     ╰─
    └───────────────────────────────────────────── time

A naive approach would now just take every peak above 0.5 and call it a beat. That works on a clean recording with a metronome, but real music has weak transients (a soft acoustic strum), tempo drift, ornaments, and dropped beats during fills. Raw peak-picking produces a stuttering, drifting grid.
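
For concreteness, the naive strategy looks something like this (illustrative Rust, not code from the repository):

// Naive peak picking: every local maximum of the beat activation above 0.5
// becomes a beat. Fine for metronomic audio, drifts and stutters on real music.
fn naive_peak_pick(activation: &[f32], frames_per_second: f32) -> Vec<f32> {
    let mut beats = Vec::new();
    for i in 1..activation.len().saturating_sub(1) {
        let is_peak = activation[i] > 0.5
            && activation[i] >= activation[i - 1]
            && activation[i] > activation[i + 1];
        if is_peak {
            beats.push(i as f32 / frames_per_second); // frame index -> seconds
        }
    }
    beats
}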

Step two: a Dynamic Bayesian Network cleans it up

A Dynamic Bayesian Network, or DBN, is a probabilistic decoder. Think of it as a model of "what a believable beat grid looks like" — it knows that beats want to be roughly evenly spaced, that tempo changes are usually gradual, and that downbeats arrive every few beats in a regular pattern. Given the noisy beat and downbeat probabilities from beat_this, the DBN finds the single most likely tidy grid that explains those probabilities, using the Viterbi algorithm.
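
To make the shape of that computation concrete, here is a deliberately tiny toy decoder, not the production DBN in rast-analysis. It tracks a state of (tempo period in frames, phase within the beat), rewards frames whose activation is high when the phase hits zero, and charges a penalty whenever the tempo drifts by a frame. The real decoder adds downbeat states, a proper tempo prior, and the tempo band described below; the names, penalties, and state space here are simplified assumptions.

// Toy Viterbi decode over (tempo-period, phase) states. `act` holds the
// per-frame beat activation; the returned values are frame indices of beats.
fn toy_viterbi_beats(act: &[f32], t_min: usize, t_max: usize) -> Vec<usize> {
    if act.is_empty() || t_min == 0 || t_min > t_max {
        return Vec::new();
    }
    // Enumerate states (period t, phase p) with p in 0..t; phase 0 = "beat here".
    let mut states: Vec<(usize, usize)> = Vec::new();
    let mut offset = vec![0usize; t_max + 1];
    for t in t_min..=t_max {
        offset[t] = states.len();
        for p in 0..t {
            states.push((t, p));
        }
    }
    let n_states = states.len();
    let idx = |t: usize, p: usize| offset[t] + p;

    let eps = 1e-6f32;
    let emit = |a: f32, p: usize| -> f32 {
        if p == 0 { a.max(eps).ln() } else { (1.0 - a).max(eps).ln() }
    };
    let tempo_change_penalty = -2.0f32; // cost of drifting +/- 1 frame per beat

    // delta[s]: best log-score ending in state s; back[f][s]: its predecessor.
    let mut delta: Vec<f32> = states.iter().map(|&(_, p)| emit(act[0], p)).collect();
    let mut back: Vec<Vec<usize>> = vec![(0..n_states).collect()];

    for f in 1..act.len() {
        let mut next = vec![f32::NEG_INFINITY; n_states];
        let mut bp = vec![0usize; n_states];
        for (s, &(t, p)) in states.iter().enumerate() {
            if delta[s] == f32::NEG_INFINITY {
                continue;
            }
            // Either advance the phase, or wrap to phase 0 with optional tempo drift.
            let succ: Vec<(usize, f32)> = if p + 1 < t {
                vec![(idx(t, p + 1), 0.0)]
            } else {
                let mut v = vec![(idx(t, 0), 0.0)];
                if t > t_min { v.push((idx(t - 1, 0), tempo_change_penalty)); }
                if t < t_max { v.push((idx(t + 1, 0), tempo_change_penalty)); }
                v
            };
            for (s2, trans) in succ {
                let p2 = states[s2].1;
                let score = delta[s] + trans + emit(act[f], p2);
                if score > next[s2] {
                    next[s2] = score;
                    bp[s2] = s;
                }
            }
        }
        delta = next;
        back.push(bp);
    }

    // Backtrack from the best final state, collecting frames where phase == 0.
    let mut s = (0..n_states)
        .max_by(|&a, &b| delta[a].partial_cmp(&delta[b]).unwrap())
        .unwrap();
    let mut beats = Vec::new();
    for f in (0..act.len()).rev() {
        if states[s].1 == 0 {
            beats.push(f);
        }
        s = back[f][s];
    }
    beats.reverse();
    beats
}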

Before the DBN runs, Rast estimates the song's overall tempo from the autocorrelation of the beat activations — basically asking "what spacing repeats most strongly?" — and locks the DBN's tempo search to a narrow band around that peak. This stops the decoder from flipping between half-time and double-time mid-song when the beat signal is bi-modal (a common failure mode on rebetiko, where slow-feeling 4/4 can be heard either as 60 BPM or 120 BPM). After Viterbi picks integer frame indices, a parabolic interpolation step refines each beat to sub-frame precision — roughly 5 ms instead of 20 — by fitting a small parabola through the activation values either side of the peak.
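
Both of those tricks are small enough to sketch. The following is an illustration of the two ideas, not the actual rast-analysis internals; constants and names are assumptions.

// Dominant beat spacing (in frames) from the autocorrelation of the beat
// activation, searched over a plausible range of lags.
fn dominant_lag(act: &[f32], min_lag: usize, max_lag: usize) -> usize {
    let mut best = (min_lag, f32::NEG_INFINITY);
    for lag in min_lag..=max_lag.min(act.len().saturating_sub(1)) {
        let mut sum = 0.0;
        for i in lag..act.len() {
            sum += act[i] * act[i - lag]; // how strongly the signal repeats at this spacing
        }
        if sum > best.1 {
            best = (lag, sum);
        }
    }
    best.0
}

// Three-point parabolic refinement: fit a parabola through the activation at
// (peak-1, peak, peak+1) and return the vertex as a fractional frame index.
fn refine_peak(act: &[f32], peak: usize) -> f32 {
    if peak == 0 || peak + 1 >= act.len() {
        return peak as f32; // no neighbours to fit against
    }
    let (ym1, y0, yp1) = (act[peak - 1], act[peak], act[peak + 1]);
    let denom = ym1 - 2.0 * y0 + yp1;
    if denom.abs() < 1e-12 {
        return peak as f32; // flat region, keep the integer frame
    }
    let shift = (0.5 * (ym1 - yp1) / denom).clamp(-0.5, 0.5);
    peak as f32 + shift
}

The winning lag converts to a tempo (60 times the frame rate divided by the lag, in BPM), and a band of a few frames around it becomes the allowed tempo range handed to the decoder.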

The DBN code lives in rust/rast-analysis/src/dbn/ and the orchestration in beat_detection.rs. There's an extra step called octave-dedup that catches the rare case where the DBN settles on exactly twice the right tempo and emits a beat between every real beat.
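
The real check is in the repository; purely as an illustration of the idea, a double-time grid can be caught by comparing the activation strength under alternating beats and keeping the stronger half when the other half is consistently weak:

// Illustrative octave-dedup check, not the production code. In a double-time
// grid, every other "beat" tends to sit on a much weaker activation value.
fn octave_dedup(beat_frames: &[usize], act: &[f32]) -> Vec<usize> {
    if beat_frames.len() < 4 {
        return beat_frames.to_vec();
    }
    let mean = |idx: &[usize]| idx.iter().map(|&f| act[f]).sum::<f32>() / idx.len() as f32;
    let evens: Vec<usize> = beat_frames.iter().copied().step_by(2).collect();
    let odds: Vec<usize> = beat_frames.iter().copied().skip(1).step_by(2).collect();
    let (me, mo) = (mean(&evens[..]), mean(&odds[..]));
    // The factor of two is an arbitrary illustrative threshold.
    if me > 2.0 * mo {
        evens
    } else if mo > 2.0 * me {
        odds
    } else {
        beat_frames.to_vec() // activations comparable: the grid was fine as-is
    }
}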

Where it struggles

  • Tempo changes — sudden ritardandos and rubato passages produce a grid that smooths over the change rather than tracking it. In a rubato amanes introduction, this is a bug.
  • Weak transients — solo voice-and-strings passages give the network little to latch onto. Beats during long sustained notes are guesswork.
  • Non-Western meters — beat_this was trained mostly on 4/4 and 3/4 music. For Greek asymmetric meters the beat positions are usually still accurate (a 9/8 zeibekiko still has nine pulses, and the model finds them), but the downbeat head mislabels them: most zeibekika get tagged as 4/4. The UI therefore shows a cosmetic every-fourth-beat bar grouping instead of the model's downbeat output. The downbeat activations still feed the DBN because the joint model produces more stable beat positions than a beat-only decode — we just don't trust the resulting bar-line labels for asymmetric music.

The output of this stage feeds two consumers: Chord detection, which uses beat boundaries to snap chord segments to musically meaningful frames, and the chroma-similarity matrix that powers "find similar sections" in the editor.