Update model card: fix audio encoder (conformer, not Whisper), SigLIP-400M, add official LoRA flags note
README.md
CHANGED
@@ -259,7 +259,7 @@ print(response.choices[0].message.content)
 | Vision (images) | ✅ | Requires `mmproj-phi4-mm-f16.gguf` + `--mmproj` flag |
 | Audio / Speech | ❌ | Not available – see below |

-> **Audio is not supported in the GGUF files.** phi4-mm's speech capability …
+> **Audio is not supported in the GGUF files.** phi4-mm's speech capability uses a custom conformer-based audio encoder (24 conformer blocks, initialized from a proprietary AED ASR model) plus a rank-320 speech LoRA applied to the language decoder. The GGUF conversion pipeline (`convert_hf_to_gguf.py`) only exports the text transformer and the SigLIP vision encoder (mmproj) – the audio encoder tensors are not extracted. There is currently no `audioproj` equivalent in llama.cpp for phi4-mm.
 >
 > **To use audio/speech transcription, use the vLLM path below** with the original bf16 safetensors model.

@@ -312,6 +312,17 @@ python -m vllm.entrypoints.openai.api_server \
 | `--tool-call-parser phi4_mini_json` | ✅ | phi4-mm emits `functools[...]`, not Hermes – required for tool calling |
 | `--trust-remote-code` | ✅ | Required for phi4-mm's custom modelling code |

+> **Official vLLM LoRA flags:** Microsoft's published vLLM command includes explicit LoRA adapter flags to activate
+> the rank-320 vision and speech adapters stored in separate subfolders of the model directory:
+> ```bash
+> --enable-lora \
+> --max-lora-rank 320 \
+> --lora-extra-vocab-size 0 \
+> --max-loras 2 \
+> --lora-modules speech=~/phi4-mm-hf/speech-lora vision=~/phi4-mm-hf/vision-lora
+> ```
+> If you experience degraded vision or audio quality, add these flags to the launch command above.
+
 Wait for the server to finish loading (~60 s):
 ```bash
 curl http://localhost:8080/health  # → {"status":"ok"}

@@ -398,7 +409,7 @@ response = client.chat.completions.create(
 print(response.choices[0].message.content)
 ```

-> phi4-mm uses …
+> phi4-mm uses a custom conformer-based audio encoder with a rank-320 speech LoRA – no separate ASR model needed.
 > Supported formats: `wav`, `mp3`, `ogg`, `flac`.

 #### Tool calling

@@ -434,10 +445,10 @@ PARAMETER temperature 0.7
 | Total parameters | ~5.6 B |
 | GGUF arch | `phi3` |
 | Context length | 128 K tokens (131,072) |
-| Modalities | Text, Vision ( …
+| Modalities | Text, Vision (SigLIP-400M), Audio/Speech |

-The vision encoder (`mmproj-phi4-mm-f16.gguf`) is a …
-with a …
+The vision encoder (`mmproj-phi4-mm-f16.gguf`) is a SigLIP-400M encoder finetuned
+with LLM2CLIP, with a 2-layer MLP projector. Audio/speech is **not embedded in the GGUF** – see the audio limitation callout above.

 ---
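For readers wiring up tool calling against this server, the `functools[...]` output format noted in the flags table can also be parsed client-side when the `phi4_mini_json` parser is not available. A minimal sketch, assuming the model emits a JSON array directly after the `functools` prefix (the exact payload shape is an assumption based on the note in the diff, and `parse_functools_calls` is a hypothetical helper, not part of any library):

```python
import json

def parse_functools_calls(text: str):
    """Parse phi4-mm style tool-call output of the form
    functools[{"name": ..., "arguments": {...}}, ...] into a list of dicts.
    Returns [] if the text is not a functools payload."""
    prefix = "functools"
    stripped = text.strip()
    if not stripped.startswith(prefix):
        return []
    try:
        calls = json.loads(stripped[len(prefix):])
    except json.JSONDecodeError:
        return []
    return calls if isinstance(calls, list) else []

calls = parse_functools_calls(
    'functools[{"name": "get_weather", "arguments": {"city": "Paris"}}]'
)
print(calls[0]["name"])  # get_weather
```

When the server is launched with `--tool-call-parser phi4_mini_json`, this parsing happens server-side and tool calls arrive in the standard OpenAI `tool_calls` field instead.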
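The vLLM transcription path above takes audio through the OpenAI-compatible chat endpoint. A hedged sketch of constructing such a request message, assuming vLLM's `audio_url` content-part extension and inlining the file as a base64 data URL (`build_audio_message` is an illustrative helper, not part of either API):

```python
import base64

def build_audio_message(audio_bytes: bytes, prompt: str, mime: str = "audio/wav"):
    """Build an OpenAI-style chat message carrying inline audio as a
    base64 data URL, using vLLM's `audio_url` content-part extension."""
    b64 = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": f"data:{mime};base64,{b64}"}},
            {"type": "text", "text": prompt},
        ],
    }

msg = build_audio_message(b"\x00\x01", "Transcribe this audio.")
print(msg["content"][0]["type"])  # audio_url
```

The resulting dict can be passed as-is in the `messages` list of `client.chat.completions.create(...)` against the server started above.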