Source Separation: Splitting the Mix

A finished song is a mix — voice, bouzouki, bass, drums, strings, all summed into two stereo channels. Source separation is the art of un-summing that, pulling individual instruments back out as if you had the original studio multi-tracks. Rast separates every imported song into two stems (a stem is just an isolated track):

vocals.flac — the singer
instrumental.flac — everything that isn't the singer

Both files live in your local cache at ~/Rast/cache/<hash>/, ready for playback. You can mute the singer to practise along, solo the singer to learn a melody, or just listen to the bouzouki without the voice on top.
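If you want to reach the stems from a script, the cache layout above is easy to address. The following is a minimal sketch, not Rast's own code: the helper names `stem_paths` and `stems_ready` are made up for illustration, and the song hash is taken as an opaque string because how it is computed is not described here.

```python
from __future__ import annotations

from pathlib import Path

# Stem file names exactly as listed above.
STEM_FILES = {"vocals": "vocals.flac", "instrumental": "instrumental.flac"}


def default_cache_root() -> Path:
    """The cache location named in the text: ~/Rast/cache/."""
    return Path.home() / "Rast" / "cache"


def stem_paths(song_hash: str, cache_root: Path | None = None) -> dict[str, Path]:
    """Return where each stem is expected to live for one song.

    `song_hash` is whatever string names the song's cache folder
    (hypothetical parameter; its derivation is not covered here).
    """
    root = cache_root if cache_root is not None else default_cache_root()
    song_dir = root / song_hash
    return {name: song_dir / filename for name, filename in STEM_FILES.items()}


def stems_ready(song_hash: str, cache_root: Path | None = None) -> bool:
    """True only if both stem files already exist on disk."""
    return all(path.is_file() for path in stem_paths(song_hash, cache_root).values())
```

Passing `cache_root` explicitly keeps the helper usable if your cache lives somewhere else.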

Why bother — it's not just for karaoke

Separation also makes the rest of the pipeline smarter. Chord detection runs on the instrumental stem, not the full mix. The reason is straightforward: a sung melody is a pitched sound that floats over a chord, and the neural network we use for chord recognition can't always tell whether a high A is part of the harmony or the melody on top of it. Strip the voice out first and the model sees a much cleaner harmonic picture. Key detection inherits that benefit because it reads the chord stream.

Note transcription — the optional step that lists individual notes for use in key detection — also runs on the instrumental, for the same reason: bass lines and held instrumental notes are good signals for what dromos a song is in; vocal melisma is noisy.

Two engines, one job

Rast ships two separation backends and lets you pick per song. Both are pre-trained models exported to ONNX (Open Neural Network Exchange, a portable format that lets the same model run on CPU or GPU without a Python interpreter). Neither needs an internet connection: inference runs entirely on your machine.
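Picking a backend per song only makes sense for backends that are actually present on disk. Here is a rough sketch of that check. The model file names and the `model_dir` argument are assumptions for illustration, not Rast's real layout.

```python
from pathlib import Path

# Hypothetical model file names; the real install layout may differ.
BACKEND_MODELS = {
    "spleeter": ["spleeter_vocals.onnx", "spleeter_accompaniment.onnx"],
    "demucs": ["htdemucs.onnx"],
}


def installed_backends(model_dir: Path) -> list[str]:
    """List the backends whose model files are all present in model_dir."""
    return [
        name
        for name, files in BACKEND_MODELS.items()
        if all((model_dir / filename).is_file() for filename in files)
    ]


def pick_backend(requested: str, model_dir: Path) -> str:
    """Return the requested backend if it is installed, otherwise raise."""
    available = installed_backends(model_dir)
    if requested not in available:
        raise ValueError(
            f"backend {requested!r} is not installed (available: {available})"
        )
    return requested
```

A UI could call `installed_backends` once and offer only the names it returns.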

                        ┌──────────────────┐
                        │   full mix       │
                        │   stereo, 44.1k  │
                        └────────┬─────────┘
                                 │
              ┌──────────────────┴──────────────────┐
              v                                     v
   ┌────────────────────────┐          ┌────────────────────────┐
   │   Spleeter (2-stem)    │          │   Demucs (4-stem htx)  │
   │   STFT magnitude mask  │          │   hybrid time + freq   │
   │   ~smaller, ~faster    │          │   ~larger, ~cleaner    │
   └────────┬───────────────┘          └────────┬───────────────┘
            │                                   │
            v                                   v
   vocals + instrumental            drums + bass + other + vocals
                                           │
                                           ▼
                              instrumental = drums + bass + other

Spleeter — fast and serviceable

Spleeter (originally from Deezer) treats separation as a masking problem. It looks at the audio's short-time Fourier transform — a spectrogram, basically a heat-map of which frequencies are loud at each moment — and predicts a multiplier for each pixel: keep this much for vocals, throw the rest at the instrumental. Two ONNX networks run sequentially, one per stem. The math after that (inverse STFT) reconstructs each stream as audio. It's fast, the file is small, and it gets the job done for most rebetiko and laiko material.
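To make the "multiplier for each pixel" idea concrete, here is a toy sketch in plain Python. It assumes the two networks each output a magnitude estimate for their stem and that the multipliers are formed as a soft ratio mask, which is one common approach. Whether Rast builds its masks exactly this way is an assumption, and the function names are illustrative.

```python
EPS = 1e-10  # keeps the division defined in completely silent bins


def ratio_masks(vocal_mag, instr_mag, power=2.0):
    """Turn two per-stem magnitude estimates into masks that sum to 1 per bin.

    Both inputs are 2-D lists (time frames x frequency bins) of non-negative
    floats. A bin where the vocal estimate dominates gets a vocal mask near 1.
    """
    vocal_mask, instr_mask = [], []
    for v_row, i_row in zip(vocal_mag, instr_mag):
        v_out, i_out = [], []
        for v, i in zip(v_row, i_row):
            v_p, i_p = v ** power, i ** power
            share = (v_p + EPS / 2) / (v_p + i_p + EPS)
            v_out.append(share)         # keep this much for vocals
            i_out.append(1.0 - share)   # the rest goes to the instrumental
        vocal_mask.append(v_out)
        instr_mask.append(i_out)
    return vocal_mask, instr_mask


def apply_mask(mix_spec, mask):
    """Scale each complex STFT bin of the mix by its mask value."""
    return [
        [bin_value * m for bin_value, m in zip(spec_row, mask_row)]
        for spec_row, mask_row in zip(mix_spec, mask)
    ]
```

Because the two masks add up to 1 in every bin, the masked vocal and instrumental spectrograms add back up to the mix. The inverse STFT that turns each masked spectrogram into audio is left out of this sketch.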

Demucs — slower, cleaner

Demucs (specifically the hybrid transformer variant, "htdemucs", from Meta AI) goes further. It looks at the waveform and the spectrogram together, predicts four sources at once (drums, bass, other, vocals) and stitches the outputs in overlapping segments to keep the seams quiet. Rast then sums drums + bass + other into a single instrumental stem. The model is bigger and the inference is slower, but the result is noticeably less smeared, especially for percussive transients (a bouzouki pluck or a darbuka slap survives the round-trip more intact). The frontend dialog hides whichever backend isn't installed.
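The stitching and the final sum can be sketched in a few lines. This is a simplified mono illustration, not the Demucs implementation: `model` is a hypothetical callable that takes a chunk of samples and returns four equally long lists keyed by stem name, the triangular cross-fade weight is an assumed choice, and the tiny segment length is only for readability (real segments are seconds long).

```python
STEMS = ("drums", "bass", "other", "vocals")


def triangular_window(length):
    """Weights that peak at the segment centre and fade towards both ends.

    The weights never reach zero, so every sample keeps some contribution.
    """
    half = (length + 1) / 2
    return [1.0 - abs((n + 1) - half) / half for n in range(length)]


def separate_in_segments(signal, model, segment=8, overlap=0.25):
    """Run `model` on overlapping segments and cross-fade the four outputs."""
    hop = max(1, int(segment * (1 - overlap)))
    out = {name: [0.0] * len(signal) for name in STEMS}
    weight_sum = [0.0] * len(signal)
    for start in range(0, len(signal), hop):
        chunk = signal[start:start + segment]
        estimates = model(chunk)  # dict: stem name -> samples, same length as chunk
        window = triangular_window(len(chunk))
        for n, w in enumerate(window):
            weight_sum[start + n] += w
            for name in STEMS:
                out[name][start + n] += w * estimates[name][n]
    # Dividing by the summed weights makes overlapping regions a weighted average.
    return {
        name: [value / w for value, w in zip(out[name], weight_sum)]
        for name in STEMS
    }


def instrumental_from(stems):
    """instrumental = drums + bass + other, sample by sample."""
    return [d + b + o for d, b, o in zip(stems["drums"], stems["bass"], stems["other"])]
```

Where two segments overlap, each output sample is a weighted average of both estimates, so a disagreement at a segment boundary fades in and out instead of clicking.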

Honest about artefacts

Source separation is a hard problem and these models are not magic. Expect to hear:

Bleed — faint traces of the singer in the instrumental, faint traces of the band in the vocal, especially on long held notes that overlap the chord harmonics.
Smearing — fast transients (a hi-hat, a finger snap) can sound mushy or "underwater" because the spectrogram representation throws away the sharpest edges.
Pitch ghosts — when an instrument tracks the melody closely (an oud playing the singer's line in unison), the network sometimes can't decide who the note belongs to and assigns it to both stems faintly.
Silence holes — during dense polyphony, the network occasionally drops a fraction of a second of an instrument completely.

For practising along this is fine; for archival use, accept that the stems are derivatives of the mix, not what came off the studio console. Demucs gives a meaningful quality bump for percussion-heavy material; for a sparse rebetiko trio, Spleeter's output is often indistinguishable from Demucs's.

The instrumental stem is the input to Chord detection; the full mix (not the instrumental) feeds Beat tracking so that drum hits and other percussive cues are not lost to the separator.
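Put together, the routing described on this page fits in a small lookup table. The step names and the `input_for` helper below are hypothetical labels for illustration. Key detection is left out because it reads the chord stream rather than audio.

```python
# Which audio each analysis step reads, as described on this page.
STEP_INPUT = {
    "chord_detection": "instrumental",
    "note_transcription": "instrumental",
    "beat_tracking": "full_mix",
}


def input_for(step, full_mix_path, instrumental_path):
    """Return the audio file a given analysis step should be fed."""
    source = STEP_INPUT[step]  # raises KeyError for an unknown step
    return instrumental_path if source == "instrumental" else full_mix_path
```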
