Swicked86 committed on
Commit 1e67ce0 · verified · 1 Parent(s): 9979b66

Update model card: fix audio encoder (conformer, not Whisper), SigLIP-400M, add official LoRA flags note

Files changed (1):
  1. README.md +16 -5

README.md CHANGED
@@ -259,7 +259,7 @@ print(response.choices[0].message.content)
 | Vision (images) | ✅ | Requires `mmproj-phi4-mm-f16.gguf` + `--mmproj` flag |
 | Audio / Speech | ❌ | Not available — see below |
 
-> **Audio is not supported in the GGUF files.** phi4-mm's speech capability lives in a Whisper-based audio encoder that is embedded in the original safetensors weights. The GGUF conversion pipeline (`convert_hf_to_gguf.py`) only exports the text transformer and the SigLIP vision encoder (mmproj) — the audio encoder tensors are not extracted. There is currently no `audioproj` equivalent in llama.cpp for phi4-mm.
+> **Audio is not supported in the GGUF files.** phi4-mm's speech capability uses a custom conformer-based audio encoder (24 conformer blocks, initialized from a proprietary AED ASR model) plus a rank-320 speech LoRA applied to the language decoder. The GGUF conversion pipeline (`convert_hf_to_gguf.py`) only exports the text transformer and the SigLIP vision encoder (mmproj) — the audio encoder tensors are not extracted. There is currently no `audioproj` equivalent in llama.cpp for phi4-mm.
 >
 > **To use audio/speech transcription, use the vLLM path below** with the original bf16 safetensors model.
@@ -312,6 +312,17 @@ python -m vllm.entrypoints.openai.api_server \
 | `--tool-call-parser phi4_mini_json` | — | phi4-mm emits `functools[...]` not Hermes — required for tool calling |
 | `--trust-remote-code` | — | Required for phi4-mm's custom modelling code |
 
+> **Official vLLM LoRA flags:** Microsoft's published vLLM command includes explicit LoRA adapter flags to activate
+> the rank-320 vision and speech adapters stored in separate subfolders of the model directory:
+> ```bash
+> --enable-lora \
+> --max-lora-rank 320 \
+> --lora-extra-vocab-size 0 \
+> --max-loras 2 \
+> --lora-modules speech=~/phi4-mm-hf/speech-lora vision=~/phi4-mm-hf/vision-lora
+> ```
+> If you experience degraded vision or audio quality, add these flags to the launch command above.
+
 Wait for the server to finish loading (~60 s):
 ```bash
 curl http://localhost:8080/health   # → {"status":"ok"}
@@ -398,7 +409,7 @@ response = client.chat.completions.create(
 print(response.choices[0].message.content)
 ```
 
-> phi4-mm uses its own native speech LoRA for audio — no separate Whisper model is needed.
+> phi4-mm uses a custom conformer-based audio encoder with rank-320 speech LoRA — no separate ASR model needed.
 > Supported formats: `wav`, `mp3`, `ogg`, `flac`.
 
 #### Tool calling
@@ -434,10 +445,10 @@ PARAMETER temperature 0.7
 | Total parameters | ~5.6 B |
 | GGUF arch | `phi3` |
 | Context length | 128 K tokens (131,072) |
-| Modalities | Text, Vision (CLIP-based), Audio/Speech |
+| Modalities | Text, Vision (SigLIP-400M), Audio/Speech |
 
-The vision encoder (`mmproj-phi4-mm-f16.gguf`) is a CLIP-style image encoder
-with a projection MLP. Audio/speech capability is embedded in the base GGUF weights.
+The vision encoder (`mmproj-phi4-mm-f16.gguf`) is a SigLIP-400M encoder finetuned
+with LLM2CLIP, with a 2-layer MLP projector. Audio/speech is **not embedded in the GGUF** — see the audio limitation callout above.
 
 ---
 
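The transcription path the updated card points to (vLLM's OpenAI-compatible server) can be sanity-checked offline by constructing the request body the `/v1/chat/completions` endpoint expects. A minimal sketch, assuming an `audio_url` content part with a base64 data URL (the shape vLLM's multimodal chat API accepts) and the upstream model name; the payload is only built here, never sent:

```python
import base64
import json


def build_transcription_request(audio_path: str,
                                model: str = "microsoft/Phi-4-multimodal-instruct") -> str:
    """Build a JSON body for POST /v1/chat/completions with inline audio."""
    with open(audio_path, "rb") as f:
        # Embed the audio file as a base64 data URL inside an "audio_url"
        # content part alongside the text instruction.
        b64 = base64.b64encode(f.read()).decode("ascii")
    body = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this audio clip."},
                {"type": "audio_url",
                 "audio_url": {"url": f"data:audio/wav;base64,{b64}"}},
            ],
        }],
    }
    return json.dumps(body)
```

The resulting string can be POSTed to the running server (e.g. with `curl -d @body.json http://localhost:8080/v1/chat/completions`); the rank-320 speech LoRA must be active for transcription quality, per the flags note above.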