HT-Demucs (single-file 4-stem): ONNX

The first ONNX export of the standard htdemucs (non-FT) model on the Hugging Face Hub. Runs in onnxruntime on CPU out of the box, and on CoreML / CUDA / DirectML with a one-line provider change. No PyTorch required at inference.

This repo is the single-file companion to StemSplitio/htdemucs-ft-onnx. You get all 4 stems out of one 316 MB .onnx file (htdemucs.onnx), or 166 MB if you grab the fp16weights variant. The FT bag is higher quality; this single model has ~30% lower latency (~1.4× faster) and uses 1 session instead of 4.


TL;DR

# 316 MB fp32 model:
pip install onnxruntime numpy soundfile
python infer.py your-song.mp3 ./out/ --write-all-stems
# writes ./out/{drums,bass,other,vocals}.wav at 44.1 kHz stereo

# 166 MB fp16weights variant (same runtime cost):
python infer.py your-song.mp3 ./out/ --small --write-all-stems

The repo contains:

  • htdemucs.onnx: 316 MB, opset 17, parity-verified vs PyTorch fp32.
  • htdemucs_fp16weights.onnx: 166 MB, fp16-stored weights, same runtime memory / latency.
  • infer.py: pure-numpy reference inference (~200 lines, no torch).
  • requirements.txt: three small packages, no PyTorch.

Quality

The official htdemucs model is the precursor to htdemucs_ft: same architecture, with a single set of weights instead of 4 specialist sub-models. On MUSDB18-HQ:

| Metric | htdemucs (this) | htdemucs_ft (4-bag) |
|---|---|---|
| Median vocals SDR | ~8.8 dB | 9.19 dB |
| Median drums SDR | ~9.5 dB | 10.11 dB |
| Total model size | 316 MB | 1.26 GB |
| Sessions to load | 1 | 4 |
| Speed vs the bag | ~1.4× faster | baseline |

Parity vs PyTorch fp32 (random input, 7.8 s segment):

  • htdemucs.onnx max abs diff: 6.62 × 10⁻⁴
  • htdemucs_fp16weights.onnx max abs diff (vs fp32 weights): 4.6 × 10⁻⁵

Both are well within the 1e-3 publish threshold.
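
A parity check of this kind reduces to a max-abs-diff comparison between two output arrays. The following is a minimal sketch, assuming the PyTorch and ONNX outputs are already available as numpy arrays; `max_abs_diff`, `passes_parity`, and the stand-in data are hypothetical and not part of infer.py.

```python
import numpy as np

PUBLISH_THRESHOLD = 1e-3  # threshold quoted in this section


def max_abs_diff(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Largest element-wise absolute difference between two same-shape arrays."""
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    # Compare in float64 so the subtraction itself adds no rounding error.
    diff = reference.astype(np.float64) - candidate.astype(np.float64)
    return float(np.max(np.abs(diff)))


def passes_parity(reference, candidate, threshold=PUBLISH_THRESHOLD) -> bool:
    """True when the two outputs agree to within the threshold."""
    return max_abs_diff(reference, candidate) < threshold


# Stand-in data only: random arrays instead of real model outputs.
rng = np.random.default_rng(0)
ref = rng.standard_normal((1, 4, 2, 1024)).astype(np.float32)
close = ref + np.float32(5e-4)  # offset below the threshold
far = ref + np.float32(5e-3)    # offset above the threshold
print(passes_parity(ref, close), passes_parity(ref, far))
```

In a real check, `ref` would come from the PyTorch model and the candidate from the ONNX session, both run on the same input.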


Performance

Single 7.8 s segment, Apple M4 Pro CPU:

| Variant | RAM | Latency | RTF |
|---|---|---|---|
| htdemucs.onnx (fp32) | ~1.1 GB | ~1.6 s | 0.20 |
| htdemucs_fp16weights.onnx | ~1.1 GB | ~1.6 s | 0.20 |
| For comparison: htdemucs_ft (4-session bag) | ~4.0 GB | ~6.4 s | 0.49 |

CUDA / DirectML / CoreML EPs are typically ≥ 5× faster on real GPUs.
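
For readers unfamiliar with RTF (real-time factor), it is processing time divided by audio duration. A short sketch of the arithmetic, assuming the 343980-sample segment and 44.1 kHz rate given elsewhere in this card; the function names are illustrative only.

```python
SAMPLE_RATE = 44_100
SEGMENT_SAMPLES = 343_980  # fixed model input length


def segment_seconds(samples: int = SEGMENT_SAMPLES,
                    sample_rate: int = SAMPLE_RATE) -> float:
    """Duration of one model segment in seconds."""
    return samples / sample_rate


def real_time_factor(latency_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    return latency_s / audio_s


seg = segment_seconds()            # 7.8 s
rtf = real_time_factor(1.6, seg)   # roughly 0.2 for the ~1.6 s latency above
print(f"{seg:.1f} s segment, RTF about {rtf:.2f}")
```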


Quick start

Python

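import soundfile as sf
import infer  # infer.py from this repo, on the import path

# soundfile returns (frames, channels); always_2d keeps mono files 2-D.
audio, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)

# Pass channels-first audio, hence the transpose.
stems = infer.separate(audio.T, sr,
                       model_path=infer.DEFAULT_MODEL,
                       providers=["CPUExecutionProvider"])

# Write each stem back out as (frames, channels).
for stem, arr in stems.items():
    sf.write(f"{stem}.wav", arr.T, sr)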

CLI

python infer.py your-song.mp3 ./out/ --write-all-stems
python infer.py your-song.mp3 ./out/ --providers coreml   # macOS arm64
python infer.py your-song.mp3 ./out/ --providers cuda     # Linux + NVIDIA
python infer.py your-song.mp3 ./out/ --providers dml      # Windows + DX12
python infer.py your-song.mp3 ./out/ --small              # 166 MB variant
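
The short `--providers` names presumably map onto onnxruntime execution-provider names. Below is a minimal sketch of such a mapping with a CPU fallback. `PROVIDER_ALIASES` and `resolve_providers` are hypothetical helpers, not functions from infer.py, and the alias table is an assumption based on the CLI flags shown here.

```python
# Alias table is an assumption; the right-hand names are onnxruntime's own.
PROVIDER_ALIASES = {
    "cpu": "CPUExecutionProvider",
    "coreml": "CoreMLExecutionProvider",
    "cuda": "CUDAExecutionProvider",
    "dml": "DmlExecutionProvider",
}


def resolve_providers(requested: str, available: list[str]) -> list[str]:
    """Return an ordered provider list that always ends with the CPU fallback."""
    name = PROVIDER_ALIASES.get(requested.lower())
    if name is None:
        raise ValueError(f"unknown provider alias: {requested!r}")
    # Use the accelerated provider only if this onnxruntime build offers it.
    providers = [name] if name in available else []
    if "CPUExecutionProvider" not in providers:
        providers.append("CPUExecutionProvider")
    return providers


# Typical use (requires onnxruntime and the model file, so left commented):
# import onnxruntime as ort
# providers = resolve_providers("cuda", ort.get_available_providers())
# session = ort.InferenceSession("htdemucs.onnx", providers=providers)
```

Keeping the CPU provider last in the list means a missing GPU runtime degrades to slower inference rather than a failed session.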

Mobile / Web (after adding the onnxruntime-objc pod for iOS, or running npm install onnxruntime-web for the browser)

// iOS / Swift
import onnxruntime_objc
// The session needs an ORTEnv; create one before building the session.
let env = try ORTEnv(loggingLevel: ORTLoggingLevel.warning)
let opts = try ORTSessionOptions()
try opts.appendCoreMLExecutionProvider(with: ORTCoreMLExecutionProviderOptions())
let session = try ORTSession(env: env,
    modelPath: Bundle.main.path(forResource: "htdemucs", ofType: "onnx")!,
    sessionOptions: opts)

// Browser / web
import * as ort from "onnxruntime-web";
const sess = await ort.InferenceSession.create("htdemucs_fp16weights.onnx", {
  executionProviders: ["wasm"],
});
// audioBuffer: Float32Array of length 2 * 343980 (channels-first stereo)
const t = new ort.Tensor("float32", audioBuffer, [1, 2, 343980]);
const out = await sess.run({ mix: t });   // out.stems is (1, 4, 2, 343980)

For a turnkey browser demo with file-picker + chunked overlap-add, see demucs-onnx browser-demo.


Input / output spec

| Tensor | Name | Shape | Dtype | Notes |
|---|---|---|---|---|
| Input | mix | (1, 2, 343980) | float32 | Stereo, 44.1 kHz, 7.8 s segment. Values in [-1, 1]. |
| Output | stems | (1, 4, 2, 343980) | float32 | Stems in order [drums, bass, other, vocals]. All 4 are real predictions (unlike the FT specialists). |
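
To make the fixed shapes concrete, here is a hedged sketch of fitting a short stereo clip into the input tensor and naming the output stems. `to_model_input` and `split_stems` are hypothetical helpers written for illustration; zero-padding a short clip is an assumption, not necessarily what infer.py does.

```python
import numpy as np

SEGMENT = 343_980
STEM_NAMES = ["drums", "bass", "other", "vocals"]  # output order from the spec


def to_model_input(audio: np.ndarray) -> np.ndarray:
    """Zero-pad or trim (2, n) stereo audio to the fixed (1, 2, 343980) input."""
    if audio.ndim != 2 or audio.shape[0] != 2:
        raise ValueError("expected stereo audio shaped (2, n_samples)")
    mix = np.zeros((1, 2, SEGMENT), dtype=np.float32)
    n = min(audio.shape[1], SEGMENT)
    mix[0, :, :n] = audio[:, :n]
    return mix


def split_stems(stems: np.ndarray, valid_samples: int) -> dict:
    """Map the (1, 4, 2, 343980) output to {name: (2, valid_samples)}."""
    return {name: stems[0, i, :, :valid_samples]
            for i, name in enumerate(STEM_NAMES)}


# With a real session (commented out; needs onnxruntime and the model file):
# stems = session.run(["stems"], {"mix": to_model_input(clip)})[0]
# named = split_stems(stems, clip.shape[1])
```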

For longer audio, chunk with overlap-add; see infer.py::separate for a working 60-line implementation.
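
The general idea of overlap-add chunking can be sketched as follows. This is not a copy of infer.py::separate: the 25% overlap, the triangular cross-fade weights, and the `run_segment` callable (a stand-in for one ONNX session call) are all assumptions chosen for illustration.

```python
import numpy as np


def overlap_add_separate(audio, run_segment, segment=343_980,
                         overlap=0.25, n_stems=4):
    """Separate (channels, n) audio of any length with a fixed-length model.

    run_segment: hypothetical callable mapping a (1, channels, segment) float32
    array to a (1, n_stems, channels, segment) array.
    """
    channels, n = audio.shape
    stride = int(segment * (1.0 - overlap))
    # Triangular weights: small at chunk edges, large in the middle, never zero.
    half = segment // 2
    weight = np.concatenate([np.arange(1, half + 1),
                             np.arange(segment - half, 0, -1)]).astype(np.float32)
    out = np.zeros((n_stems, channels, n), dtype=np.float32)
    weight_sum = np.zeros(n, dtype=np.float32)
    for start in range(0, n, stride):
        chunk = audio[:, start:start + segment]
        length = chunk.shape[1]
        # The last chunk may be short; zero-pad it to the model's fixed length.
        padded = np.zeros((1, channels, segment), dtype=np.float32)
        padded[0, :, :length] = chunk
        stems = run_segment(padded)[0]
        out[:, :, start:start + length] += weight[:length] * stems[:, :, :length]
        weight_sum[start:start + length] += weight[:length]
    # Normalize by the accumulated weights so overlapping regions average out.
    return out / weight_sum


def identity_model(x):
    """Fake model for a smoke test: every stem is a copy of the mix."""
    return np.repeat(x[:, None], 4, axis=1)


demo = np.random.default_rng(0).standard_normal((2, 350)).astype(np.float32)
result = overlap_add_separate(demo, identity_model, segment=100)
print(result.shape)
```

With the identity stand-in, each returned stem should reproduce the input, which is a cheap way to confirm that the weighting and normalization cancel out before plugging in the real session.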


Tooling: demucs-onnx Python package

This model can be run (and re-exported from PyTorch) via the open-source demucs-onnx Python package on PyPI. It auto-downloads from this repo on first use, so you don't have to clone or wrangle file paths.

pip install demucs-onnx

# Single-file 4-stem flavor (this repo):
demucs-onnx separate song.mp3 stems/ --model htdemucs

# Python API:
python -c "from demucs_onnx import separate; \
  print(separate('song.mp3', model='htdemucs').keys())"

To re-export your own fine-tune:

pip install 'demucs-onnx[export]'
demucs-onnx export htdemucs out/htdemucs.onnx

How it was built

The export pipeline lives in the open-source demucs-onnx package at demucs_onnx/export/. It applies four patches to make torch.onnx.export work on htdemucs:

  1. Complex-typed torch.stft outputs → Conv1d with sin/cos kernels.
  2. model.segment fractions.Fraction → plain float.
  3. random.randrange in transformer pos-embedding → hardcoded shift=0.
  4. aten::_native_multi_head_attention (no ONNX symbolic) → drop-in nn.MultiheadAttention.forward built from Linear/bmm/softmax.

These are the four blockers every previous community attempt at "demucs onnx" stalled on. See the README of the demucs-onnx package for the full write-up with code references.
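
The first patch rests on the fact that an STFT is a strided correlation of the signal with windowed cosine and sine kernels, so it can be computed without complex tensors. The numpy sketch below illustrates that identity only. It is not the package's export code, it omits the padding and normalization a real model applies, and the Hann window and function names are assumptions.

```python
import numpy as np


def stft_kernels(n_fft: int):
    """Real cos/sin kernels, one row per frequency bin, Hann window folded in."""
    n = np.arange(n_fft)
    k = np.arange(n_fft // 2 + 1)[:, None]
    window = 0.5 - 0.5 * np.cos(2 * np.pi * n / n_fft)  # periodic Hann
    angle = 2 * np.pi * k * n / n_fft
    # Real part uses +cos, imaginary part uses -sin (from e^{-i*angle}).
    return window * np.cos(angle), -window * np.sin(angle)


def conv_stft(signal: np.ndarray, n_fft: int, hop: int):
    """Strided correlation with the kernels, i.e. what Conv1d(stride=hop) computes."""
    cos_k, sin_k = stft_kernels(n_fft)
    frames = np.lib.stride_tricks.sliding_window_view(signal, n_fft)[::hop]
    real = frames @ cos_k.T
    imag = frames @ sin_k.T
    return real, imag  # each shaped (n_frames, n_fft // 2 + 1)


# Compare against numpy's complex FFT on the same windowed frames.
sig = np.random.default_rng(0).standard_normal(4096)
real, imag = conv_stft(sig, n_fft=512, hop=128)
win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(512) / 512)
ref = np.fft.rfft(np.lib.stride_tricks.sliding_window_view(sig, 512)[::128] * win, axis=-1)
print(np.allclose(real, ref.real), np.allclose(imag, ref.imag))
```

Because the two real outputs carry the same information as the complex spectrogram, the exported graph only needs ordinary real-valued convolution ops.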


Related work

Sibling ONNX repos from the same export pipeline:

| Repo | Format | Stems | Use when |
|---|---|---|---|
| htdemucs-onnx (this) | Single file | 4 | Faster startup, fewer sessions, ~30% lower latency than the FT bag. |
| htdemucs-ft-onnx | Bag of 4 files | 4 | Best SDR, especially on vocals. The default in StemSplit production. |
| htdemucs-6s-onnx | Single file | 6 | Need guitar + piano stems on top of the standard 4. |
| htdemucs-ft-{drums,bass,other,vocals}-onnx | Single specialist | 1 | Fastest single-stem inference; 4× faster than the bag. |

Full benchmark across every popular open-source separator: StemSplitio/stem-separation-benchmark-2026.


Skip the infrastructure: use the StemSplit API

Don't want to bundle a 316 MB model in your app, manage a GPU pool, or write overlap-add chunking? Use the StemSplit API instead: same model under the hood, hosted for you, with credits and a dashboard.

Or use the no-code tools that ship the same model family:


License & attribution

This repo is MIT-licensed, matching the original HT-Demucs.

@inproceedings{rouard2023hybrid,
  title     = {Hybrid Transformers for Music Source Separation},
  author    = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
  booktitle = {ICASSP},
  year      = {2023}
}
  • Original PyTorch model: facebookresearch/demucs
  • ONNX export, parity verification, and packaging by StemSplit
  • Search keywords: htdemucs onnx, demucs onnx single file, demucs ios, demucs android, music source separation onnx, stem separation mobile.