[Models] YJ Audio Stack
Curated
Text-to-Speech • 5B • 129k • 979
Note: FishAudio S2 Pro is the default TTS. Already integrated, expressive, multilingual, voice cloning, inline emotion tags, good latency.
CohereLabs/cohere-transcribe-03-2026
Automatic Speech Recognition • 300k • 948
Note: Cohere Transcribe is the ASR for offline transcription/evaluation. Strong batch transcription, but less interesting than Voxtral for realtime.
Soul-AILab/SoulX-Singer
Text-to-Speech • 522 • 153
Note: use SoulX-Singer only if we want sung lines, chants, musical taunts, or character songs. It adds singing synthesis, not normal speech.
Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
Text-to-Speech • 2B • 1.59M • 1.5k
Note: Qwen3-TTS is worth a quick bake-off, but lower priority. It overlaps with Fish/VoxCPM; main appeal is Apache-2.0 and smaller variants.
ResembleAI/Dramabox
Text-to-Speech • 1.35k • 220
Note: Dramabox is only for pre-rendered dramatic/acted clips. Good stage-direction prompting, but too heavy and niche for default TTS.
openbmb/VoxCPM2
Text-to-Speech • 200k • 1.32k
Note: VoxCPM2 is the best alternate TTS to prototype. Apache-2.0, ~8GB VRAM, 48kHz output, text-only voice design, controllable cloning, strong commercial fit.
k2-fsa/OmniVoice
Text-to-Speech • 2.19M • 913
Note: OmniVoice is only worth it if we need very broad language coverage. 600+ languages is its main reason to exist for us.
Supertone/supertonic-3
Text-to-Speech • 37.5k • 550
Note: Supertonic 3 is the CPU/on-device fallback. Tiny compared with the others, ONNX, no GPU required, good for reliable local narration when Fish is unavailable.
mistralai/Voxtral-Mini-4B-Realtime-2602
Automatic Speech Recognition • 4B • 1.39M • 857
Note: Voxtral Mini Realtime is the ASR for live player speech, voice commands, subtitles, or audio eval. Apache-2.0, realtime streaming, vLLM path.
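
Sketch: the notes above amount to a small routing policy, so here is one possible way to write it down as code. This is a minimal illustration, not an existing integration. The `TTSRequest` fields, the function names `pick_tts` and `pick_asr`, and the priority order of the checks are all assumptions made for the example. The Fish entry is a plain label because its repo ID is not shown in this list, and the sketch assumes only Supertonic 3 runs without a GPU, since it is the only entry noted as needing none.

```python
from dataclasses import dataclass

# Repo IDs copied from the list above. The Fish entry is a label only,
# because its repo ID does not appear in the list.
FISH = "FishAudio S2 Pro"
SUPERTONIC = "Supertone/supertonic-3"
SOULX = "Soul-AILab/SoulX-Singer"
DRAMABOX = "ResembleAI/Dramabox"
OMNIVOICE = "k2-fsa/OmniVoice"
VOXTRAL = "mistralai/Voxtral-Mini-4B-Realtime-2602"
COHERE = "CohereLabs/cohere-transcribe-03-2026"


@dataclass(frozen=True)
class TTSRequest:
    """Hypothetical request flags; none of these names come from a real API."""
    singing: bool = False              # sung lines, chants, character songs
    prerendered_drama: bool = False    # offline acted/dramatic clips
    needs_rare_language: bool = False  # language outside the default model's coverage
    gpu_available: bool = True
    fish_available: bool = True


def pick_tts(req: TTSRequest) -> str:
    """Return the TTS model suggested by the notes (check order is an assumption)."""
    if not req.gpu_available:
        return SUPERTONIC  # assumed to be the only GPU-free option here
    if req.singing:
        return SOULX
    if req.prerendered_drama:
        return DRAMABOX
    if req.needs_rare_language:
        return OMNIVOICE
    if not req.fish_available:
        return SUPERTONIC  # noted fallback when Fish is unavailable
    return FISH


def pick_asr(realtime: bool) -> str:
    """Voxtral for live speech, Cohere Transcribe for offline batch work."""
    return VOXTRAL if realtime else COHERE


if __name__ == "__main__":
    print(pick_tts(TTSRequest()))
    print(pick_tts(TTSRequest(gpu_available=False)))
    print(pick_asr(realtime=True))
```

VoxCPM2 and Qwen3-TTS are left out of the routing on purpose: the notes treat them as bake-off candidates rather than production paths.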