[Models] YJ Audio Stack - a YellowjacketGames Collection

updated 3 days ago

Curated

fishaudio/s2-pro

Text-to-Speech • 5B • Updated Mar 11 • 129k • 979

Note FishAudio S2 Pro: default TTS. Already integrated; expressive and multilingual, with voice cloning, inline emotion tags, and good latency.

CohereLabs/cohere-transcribe-03-2026

Automatic Speech Recognition • Updated 18 days ago • 300k • 948

Note Cohere Transcribe: ASR for offline transcription/evaluation. Strong batch transcription, but less interesting than Voxtral for realtime.

Soul-AILab/SoulX-Singer

Text-to-Speech • Updated Mar 13 • 522 • 153

Note SoulX-Singer: only if we want sung lines, chants, musical taunts, or character songs. It adds singing synthesis, not normal speech.

Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice

Text-to-Speech • 2B • Updated Jan 29 • 1.59M • 1.5k

Note Qwen3-TTS: worth a quick bake-off, but lower priority. It overlaps with Fish/VoxCPM; the main appeal is the Apache-2.0 license and the smaller variants.

ResembleAI/Dramabox

Text-to-Speech • Updated 9 days ago • 1.35k • 220

Note Dramabox: only for pre-rendered dramatic/acted clips. Good stage-direction prompting, but too heavy and niche for default TTS.

openbmb/VoxCPM2

Text-to-Speech • Updated Apr 16 • 200k • 1.32k

Note VoxCPM2: best alternate TTS to prototype. Apache-2.0, ~8GB VRAM, 48kHz output, text-only voice design, controllable cloning, strong commercial fit.

k2-fsa/OmniVoice

Text-to-Speech • Updated 15 days ago • 2.19M • 913

Note OmniVoice: only if we need very broad language coverage. 600+ languages is its main reason to exist for us.

Supertone/supertonic-3

Text-to-Speech • Updated 4 days ago • 37.5k • 550

Note Supertonic 3: CPU/on-device fallback. Tiny compared with the others, ONNX, no GPU required, good for reliable local narration when Fish is unavailable.

mistralai/Voxtral-Mini-4B-Realtime-2602

Automatic Speech Recognition • 4B • Updated Mar 11 • 1.39M • 857

Note Voxtral Mini Realtime: ASR for live player speech, voice commands, subtitles, or audio eval. Apache-2.0, realtime streaming, vLLM path.
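
Taken together, the notes above read like a small routing table: one default TTS, a few special-purpose TTS models, and two ASR roles. A minimal sketch of that reading in plain Python follows. The model IDs are copied from this collection; everything else (the `TTSRequest` fields, the `pick_tts` and `pick_asr` helpers, and the priority order of the checks) is a hypothetical illustration, not an existing integration or a tested configuration. VoxCPM2 and Qwen3-TTS are left out because the notes treat them as prototype and bake-off candidates rather than routed roles.

```python
from dataclasses import dataclass

# Model IDs copied from the collection. The routing rules below are an
# illustrative interpretation of the notes, not a tested configuration.
TTS_DEFAULT = "fishaudio/s2-pro"
TTS_CPU_FALLBACK = "Supertone/supertonic-3"
TTS_SINGING = "Soul-AILab/SoulX-Singer"
TTS_DRAMATIC = "ResembleAI/Dramabox"
TTS_BROAD_LANGUAGE = "k2-fsa/OmniVoice"
ASR_REALTIME = "mistralai/Voxtral-Mini-4B-Realtime-2602"
ASR_OFFLINE = "CohereLabs/cohere-transcribe-03-2026"


@dataclass
class TTSRequest:
    """Hypothetical request descriptor; all field names are assumptions."""
    sung: bool = False                # sung lines, chants, character songs
    prerendered_drama: bool = False   # offline, acted clips
    rare_language: bool = False       # needs very broad language coverage
    fish_available: bool = True       # is the default TTS reachable?


def pick_tts(req: TTSRequest) -> str:
    """Return a model ID for the request (assumed priority order)."""
    if req.sung:
        return TTS_SINGING
    if req.prerendered_drama:
        return TTS_DRAMATIC
    if req.rare_language:
        return TTS_BROAD_LANGUAGE
    if not req.fish_available:
        return TTS_CPU_FALLBACK
    return TTS_DEFAULT


def pick_asr(live: bool) -> str:
    """Realtime ASR for live player speech, offline ASR for batch work."""
    return ASR_REALTIME if live else ASR_OFFLINE


if __name__ == "__main__":
    print(pick_tts(TTSRequest()))
    print(pick_tts(TTSRequest(fish_available=False)))
    print(pick_asr(live=True))
```

Keeping the choice in one function makes it easy to swap a model ID after a bake-off without touching call sites.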