StyleTTS2 — Basque Multispeaker Emotional TTS
This is a Basque text-to-speech (TTS) model based on the StyleTTS2 architecture, adapted for emotional Basque speech synthesis. The model supports three emotional styles: neutral, happy (poza), and sad (tristura).
Examples (playable):
Sample 1 — Antton (Neutral) — "Suhiltzaileek esan dutenez, hiru hildakoek jauzi egin zuten puzgarria su hartzen ari zela ikusita."
Sample 1 — Antton (Happy) — "Suhiltzaileek esan dutenez, hiru hildakoek jauzi egin zuten puzgarria su hartzen ari zela ikusita."
Sample 1 — Antton (Sad) — "Suhiltzaileek esan dutenez, hiru hildakoek jauzi egin zuten puzgarria su hartzen ari zela ikusita."
Sample 2 — Maider (Neutral) — "Gure patua hau izatea litekeena da, baina okerra deritzot."
Sample 2 — Maider (Happy) — "Gure patua hau izatea litekeena da, baina okerra deritzot."
Sample 2 — Maider (Sad) — "Gure patua hau izatea litekeena da, baina okerra deritzot."
Main modifications:
- PL-BERT-eu: PL-BERT model trained with a WordPiece tokenizer for phonemized Basque text.
- ASR-eu: ASR model trained on a subset of the multispeaker speech corpus. It uses the same architecture as the original ASR from StyleTTS2.
- Phonemizer: We used code developed by Aholab to generate IPA phonemes for training the model. A demo of the Basque phonemizer is available at arrandi/phonemizer-eus-esp, and the code used to generate the IPA phonemes is in the phonemizer directory. We collapsed multi-character phonemes into single-character phonemes for better grapheme–phoneme alignment.
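The collapsing step mentioned for the phonemizer can be pictured as a longest-match-first string substitution. The sketch below is only an illustration: the mapping table is hypothetical (the actual symbol inventory is defined by the phonemizer code), and `collapse_phonemes` is a name introduced here, not a function from the repository.

```python
# Hypothetical mapping from multi-character IPA sequences to single
# stand-in characters; the table actually used by the authors may differ.
COLLAPSE_MAP = {
    "tʃ": "ʧ",
    "ts": "ʦ",
    "dʒ": "ʤ",
}


def collapse_phonemes(ipa: str, mapping: dict[str, str] = COLLAPSE_MAP) -> str:
    """Replace each multi-character phoneme with its single-character stand-in."""
    # Process longer sequences first so they are never split by shorter ones.
    for seq in sorted(mapping, key=len, reverse=True):
        ipa = ipa.replace(seq, mapping[seq])
    return ipa


# After collapsing, every phoneme occupies exactly one character.
print(collapse_phonemes("etʃe"))
```

With one character per phoneme, the length of the phoneme string equals the number of phonemes, which keeps token positions simple for the text aligner.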
Emotions
The original dataset contains four emotion categories. This model was trained on a subset of three emotions — Neutral, Happy, and Sad — as listed below.
| Emotion | Basque Tag | Description |
|---|---|---|
| Neutral | neu | Neutral/calm delivery |
| Happy | poz | Happy/expressive delivery (Poza) |
| Sad | tri | Sad/contemplative delivery (Tristura) |
Model details
| Property | Value |
|---|---|
| Architecture | StyleTTS2 (from scratch) |
| Language | Basque (eu) |
| Speakers | Multispeaker (two speakers: Antton, Maider) |
| Emotions | Neutral, Happy (Poza), Sad (Tristura) |
| Text input | Basque IPA phonemes |
| Speech LM | WavLM-Base-Plus |
| Sample rate | 24 000 Hz |
| Decoder | HiFiGAN |
Training dataset
HiTZ-Aholab emotional speech synthesis dataset in Basque — emotional speech corpus.
- Number of speakers: two (Antton, Maider)
- Audio: 16,000 utterances per speaker, totalling approximately 43 hours and 58 minutes
- Maider: ~21h 22min
- Antton: ~22h 36min
- Emotions: four categories (4,000 utterances per emotion per speaker) — Poza (joy), Haserre (anger), Harridura (surprise), Tristura (sadness)
- Note: although the dataset contains four emotions, this model was trained on a balanced subset of three: Neutral, Happy (Poza), Sad (Tristura) — with the same number of samples per emotion.
- Dataset split: 100 samples for validation, 600 for testing (300 per speaker)
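As a quick consistency check, the per-speaker durations and per-emotion counts listed above can be added up. The snippet uses only figures stated in this section; the variable names are ours.

```python
from datetime import timedelta

# Approximate per-speaker durations as listed above.
durations = {
    "Maider": timedelta(hours=21, minutes=22),
    "Antton": timedelta(hours=22, minutes=36),
}

total = sum(durations.values(), timedelta())
total_minutes = int(total.total_seconds() // 60)
hours, minutes = divmod(total_minutes, 60)
print(f"total audio: {hours}h {minutes}min")

# Four emotion categories with 4,000 utterances each, per speaker.
utterances_per_speaker = 4 * 4_000
print(f"utterances per speaker: {utterances_per_speaker}")
```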
Training
Brief summary of training parameters used (from config_basque_multispeaker_phoneme_wavlm_emo_plbertemo_no_acc.yml):
- Device: cuda
- Stages: 1st-stage epochs = 50; 2nd-stage epochs = 30
- Batch: batch_size = 1
- Max length: max_len = 500
- Learning rates: lr = 0.0001; bert_lr = 1e-5; ft_lr = 1e-5
- Audio / features: sr = 24000; n_mels = 80; spectrogram (n_fft=2048, win_length=1200, hop_length=300)
- Model: multispeaker = true; n_token = 178 (phonemes); style_dim = 128; decoder = HiFiGAN
- Diffusion / schedule: diff_epoch = 10; joint_epoch = 15; estimate_sigma_data = true (sigma ≈ 0.2)
- Loss highlights: lambda_mel = 5.0; lambda_ce = 20.0; lambda_diff = 1.0
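The audio settings above imply a specific frame timing, which is useful when reasoning about segment lengths. The arithmetic below is a sketch using only the listed values; it assumes that `max_len` counts mel frames, as it does in the upstream StyleTTS2 configuration, which this README does not state explicitly.

```python
# Values taken from the parameter summary above.
sr = 24_000        # sample rate in Hz
hop_length = 300   # samples between successive frames
win_length = 1200  # samples per analysis window
n_fft = 2048       # FFT size (the window is zero-padded up to this length)
max_len = 500      # ASSUMPTION: a frame count, as in upstream StyleTTS2

frames_per_second = sr / hop_length            # mel frames per second of audio
hop_ms = 1000 * hop_length / sr                # frame shift in milliseconds
win_ms = 1000 * win_length / sr                # window length in milliseconds
max_clip_seconds = max_len * hop_length / sr   # longest training segment
n_freq_bins = n_fft // 2 + 1                   # linear bins before the mel filterbank

print(frames_per_second, hop_ms, win_ms, max_clip_seconds, n_freq_bins)
```

Under that assumption, training segments are capped at a few seconds of audio, which is worth keeping in mind when synthesizing much longer sentences.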
Files in this repository
| File | Description |
|---|---|
| config_basque_multispeaker_phoneme_wavlm_emo_plbertemo_no_acc.yml | Training & model config → place at Models/Basque_Multispeaker_Phoneme_wavlm_emo/ (the setup script saves it as config.yml) |
| epoch_2nd_00030.pth | Main TTS checkpoint → place at Models/Basque_Multispeaker_Phoneme_wavlm_emo/ |
| epoch_00200.pth | Basque ASR / text aligner → place at Utils/ASR_basque/ |
| step_4000000.t7 | Phoneme PLBERT → place at Utils/PLBERT_phoneme/ |
Note: The JDC F0 extractor (Utils/JDC/bst.t7) is not Basque-specific; download it from the original StyleTTS2 repository and place it at Utils/JDC/bst.t7.
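Because inference needs all of these files to be present, a small check can catch a misplaced file before loading the model. This is a minimal sketch: `missing_files` and `EXPECTED_FILES` are names introduced here, and the target paths follow the ones used by the setup script and inference commands in this README (with the config saved as config.yml).

```python
from pathlib import Path

# Target locations relative to the repository root.
EXPECTED_FILES = [
    "Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml",
    "Models/Basque_Multispeaker_Phoneme_wavlm_emo/epoch_2nd_00030.pth",
    "Utils/ASR_basque/epoch_00200.pth",
    "Utils/PLBERT_phoneme/step_4000000.t7",
    "Utils/JDC/bst.t7",
]


def missing_files(root: str = ".") -> list[str]:
    """Return the expected files that are not present under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All model files are in place.")
```

Run it from the root of the cloned code repository; an empty result means every checkpoint is where the commands in this README expect it.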
Setup
# 1. Clone the code repository
git clone https://github.com/AArriandiaga/StyleTTS2_basque
cd StyleTTS2_basque
# 2. Install dependencies
pip install -r requirements.txt
# 3. Download model weights from this HF repo and place them:
mkdir -p Models/Basque_Multispeaker_Phoneme_wavlm_emo Utils/ASR_basque Utils/PLBERT_phoneme Utils/JDC
# Download bst.t7 from the original StyleTTS2 repo (not Basque-specific):
wget -P Utils/JDC https://github.com/yl4579/StyleTTS2/raw/main/Utils/JDC/bst.t7
# using huggingface_hub:
python - <<'EOF'
from huggingface_hub import hf_hub_download
import shutil
repo = "HiTZ/StyleTTS2-eu_emo"
files = {
    "config_basque_multispeaker_phoneme_wavlm_emo_plbertemo_no_acc.yml": "Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml",
    "epoch_2nd_00030.pth": "Models/Basque_Multispeaker_Phoneme_wavlm_emo/epoch_2nd_00030.pth",
    "epoch_00200.pth": "Utils/ASR_basque/epoch_00200.pth",
    "step_4000000.t7": "Utils/PLBERT_phoneme/step_4000000.t7",
}
# bst.t7 comes from the original StyleTTS2 repo — download separately:
# https://github.com/yl4579/StyleTTS2/tree/main/Utils/JDC
for hf_name, local_path in files.items():
    # Download into the local HF cache, then copy to the path the code expects.
    src = hf_hub_download(repo_id=repo, filename=hf_name)
    shutil.copy(src, local_path)
    print(f"✓ {local_path}")
EOF
Inference
CLI:
python inference.py \
  --config Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml \
  --model Models/Basque_Multispeaker_Phoneme_wavlm_emo/epoch_2nd_00030.pth \
  --ref Demo/ref_antton_poz.wav \
  --text "Kaixo, zelan zaude?" \
  --output output/kaixo.wav
Python API:
from inference import Synthesizer
synth = Synthesizer(
    config='Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml',
    checkpoint='Models/Basque_Multispeaker_Phoneme_wavlm_emo/epoch_2nd_00030.pth',
    default_ref='Demo/ref_antton_neu.wav',
)
# Neutral emotion
wav = synth.run("Kaixo, zelan zaude?", ref='Demo/ref_antton_neu.wav')
synth.save(wav, "output/kaixo_neu.wav")
# Happy emotion (using poza reference)
wav2 = synth.run("Zorioneko gara!", ref='Demo/ref_antton_poz.wav')
synth.save(wav2, "output/kaixo_poz.wav")
# Sad emotion (using tristura reference)
wav3 = synth.run("Hau oso tristea da.", ref='Demo/ref_antton_tri.wav')
synth.save(wav3, "output/kaixo_tri.wav")
Key parameters for run():
| Parameter | Default | Description |
|---|---|---|
| ref | constructor default | Reference WAV for speaker & emotion style |
| alpha | 0.3 | Timbre mixing (0 = reference, 1 = sampled) |
| beta | 0.7 | Prosody mixing (0 = reference, 1 = sampled) |
| diffusion_steps | 5 | Quality vs. speed trade-off |
| embedding_scale | 1.0 | Expressiveness (>1 = more expressive) |
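The alpha and beta descriptions correspond to a linear interpolation between a style vector taken from the reference audio and one sampled by the diffusion model. The sketch below illustrates that blend on plain lists. It assumes this repository follows the upstream StyleTTS2 inference demo, and `mix_style` is a hypothetical helper, not part of inference.py.

```python
def mix_style(reference: list[float], sampled: list[float], weight: float) -> list[float]:
    """Blend two style vectors: weight=0 keeps the reference, weight=1 keeps the sample."""
    if len(reference) != len(sampled):
        raise ValueError("style vectors must have the same length")
    return [weight * s + (1.0 - weight) * r for r, s in zip(reference, sampled)]


# Toy 2-dimensional vectors; the real style vectors use style_dim = 128.
ref_timbre, sampled_timbre = [0.0, 1.0], [1.0, 0.0]
ref_prosody, sampled_prosody = [0.5, 0.5], [1.0, 0.0]

timbre = mix_style(ref_timbre, sampled_timbre, weight=0.3)     # plays the role of alpha
prosody = mix_style(ref_prosody, sampled_prosody, weight=0.7)  # plays the role of beta
print(timbre, prosody)
```

With the defaults, timbre stays closer to the reference recording while prosody leans toward the sampled style, so lowering beta is the natural knob when the output drifts from the emotion of the reference.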
Reference speakers
Six reference audios are included in the repo under Demo/, covering both speakers and all three emotions:
| Speaker | Neutral | Happy | Sad |
|---|---|---|---|
| Antton (male) | ref_antton_neu.wav | ref_antton_poz.wav | ref_antton_tri.wav |
| Maider (female) | ref_maider_neu.wav | ref_maider_poz.wav | ref_maider_tri.wav |
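The reference file names follow a regular ref_<speaker>_<tag>.wav pattern, so the path can be built from a speaker name and an emotion. The helper below is a small sketch of that lookup; `reference_path`, `EMOTION_TAGS`, and `SPEAKERS` are names introduced here and are not part of the repository.

```python
from pathlib import Path

# Basque emotion tags and speaker names as they appear in the Demo/ file names.
EMOTION_TAGS = {"neutral": "neu", "happy": "poz", "sad": "tri"}
SPEAKERS = ("antton", "maider")


def reference_path(speaker: str, emotion: str, demo_dir: str = "Demo") -> Path:
    """Build the path of the bundled reference WAV for a speaker and emotion."""
    speaker, emotion = speaker.lower(), emotion.lower()
    if speaker not in SPEAKERS:
        raise ValueError(f"unknown speaker: {speaker!r}")
    if emotion not in EMOTION_TAGS:
        raise ValueError(f"unknown emotion: {emotion!r}")
    return Path(demo_dir) / f"ref_{speaker}_{EMOTION_TAGS[emotion]}.wav"


# List all six bundled references.
for spk in SPEAKERS:
    for emo in EMOTION_TAGS:
        print(reference_path(spk, emo).as_posix())
```

The result, converted with str(), can be passed as the ref argument of run() to select both the speaker and the emotional style.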
All credit goes to the authors of StyleTTS2.
Citation
@inproceedings{li2023styletts2,
title = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
author = {Li, Yinghao Aaron and Han, Cong and Mesgarani, Nima},
booktitle = {Advances in Neural Information Processing Systems},
year = {2023},
}
Additional Information
Authors
- Ander Arriandiaga — Aholab (HiTZ), EHU
- Inmaculada Hernáez Rioja — Aholab (HiTZ), EHU
Contact
For further information, please send an email to inma.hernaez@ehu.eus.
Copyright
Copyright (c) 2026 by Aholab, HiTZ.
License
Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA.