Swicked86 committed on
Commit 1e67ce0 · verified · 1 Parent(s): 9979b66

Update model card: fix audio encoder (conformer, not Whisper), SigLIP-400M, add official LoRA flags note

Files changed (1):
  1. README.md +16 -5

README.md CHANGED
@@ -259,7 +259,7 @@ print(response.choices[0].message.content)
 | Vision (images) | ✅ | Requires `mmproj-phi4-mm-f16.gguf` + `--mmproj` flag |
 | Audio / Speech | ❌ | Not available — see below |
 
-> **Audio is not supported in the GGUF files.** phi4-mm's speech capability lives in a Whisper-based audio encoder that is embedded in the original safetensors weights. The GGUF conversion pipeline (`convert_hf_to_gguf.py`) only exports the text transformer and the SigLIP vision encoder (mmproj) — the audio encoder tensors are not extracted. There is currently no `audioproj` equivalent in llama.cpp for phi4-mm.
+> **Audio is not supported in the GGUF files.** phi4-mm's speech capability uses a custom conformer-based audio encoder (24 conformer blocks, initialized from a proprietary AED ASR model) plus a rank-320 speech LoRA applied to the language decoder. The GGUF conversion pipeline (`convert_hf_to_gguf.py`) only exports the text transformer and the SigLIP vision encoder (mmproj) — the audio encoder tensors are not extracted. There is currently no `audioproj` equivalent in llama.cpp for phi4-mm.
 >
 > **To use audio/speech transcription, use the vLLM path below** with the original bf16 safetensors model.
@@ -312,6 +312,17 @@ python -m vllm.entrypoints.openai.api_server \
 | `--tool-call-parser phi4_mini_json` | — | phi4-mm emits `functools[...]` not Hermes — required for tool calling |
 | `--trust-remote-code` | — | Required for phi4-mm's custom modelling code |
 
+> **Official vLLM LoRA flags:** Microsoft's published vLLM command includes explicit LoRA adapter flags to activate
+> the rank-320 vision and speech adapters stored in separate subfolders of the model directory:
+> ```bash
+> --enable-lora \
+> --max-lora-rank 320 \
+> --lora-extra-vocab-size 0 \
+> --max-loras 2 \
+> --lora-modules speech=~/phi4-mm-hf/speech-lora vision=~/phi4-mm-hf/vision-lora
+> ```
+> If you experience degraded vision or audio quality, add these flags to the launch command above.
+
 Wait for the server to finish loading (~60 s):
 ```bash
 curl http://localhost:8080/health   # → {"status":"ok"}
@@ -398,7 +409,7 @@ response = client.chat.completions.create(
 print(response.choices[0].message.content)
 ```
 
-> phi4-mm uses its own native speech LoRA for audio — no separate Whisper model is needed.
+> phi4-mm uses a custom conformer-based audio encoder with rank-320 speech LoRA — no separate ASR model needed.
 > Supported formats: `wav`, `mp3`, `ogg`, `flac`.
 
 #### Tool calling
@@ -434,10 +445,10 @@ PARAMETER temperature 0.7
 | Total parameters | ~5.6 B |
 | GGUF arch | `phi3` |
 | Context length | 128 K tokens (131,072) |
-| Modalities | Text, Vision (CLIP-based), Audio/Speech |
+| Modalities | Text, Vision (SigLIP-400M), Audio/Speech |
 
-The vision encoder (`mmproj-phi4-mm-f16.gguf`) is a CLIP-style image encoder
-with a projection MLP. Audio/speech capability is embedded in the base GGUF weights.
+The vision encoder (`mmproj-phi4-mm-f16.gguf`) is a SigLIP-400M encoder finetuned
+with LLM2CLIP, with a 2-layer MLP projector. Audio/speech is **not embedded in the GGUF** — see the audio limitation callout above.
 
 ---
 
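The transcription path the updated card points to (vLLM's OpenAI-compatible server) can be sanity-checked offline by constructing the request body the `/v1/chat/completions` endpoint expects. A minimal sketch, assuming an `audio_url` content part with a base64 data URL (the shape vLLM's multimodal chat API accepts) and the upstream model name; the payload is only built here, never sent:

```python
import base64
import json


def build_transcription_request(audio_path: str,
                                model: str = "microsoft/Phi-4-multimodal-instruct") -> str:
    """Build a JSON body for POST /v1/chat/completions with inline audio."""
    with open(audio_path, "rb") as f:
        # Embed the audio file as a base64 data URL inside an "audio_url"
        # content part alongside the text instruction.
        b64 = base64.b64encode(f.read()).decode("ascii")
    body = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this audio clip."},
                {"type": "audio_url",
                 "audio_url": {"url": f"data:audio/wav;base64,{b64}"}},
            ],
        }],
    }
    return json.dumps(body)
```

The resulting string can be POSTed to the running server (e.g. with `curl -d @body.json http://localhost:8080/v1/chat/completions`); the rank-320 speech LoRA must be active for transcription quality, per the flags note above.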