Model latency is too long. It took over 5 minutes to start generating on a Radeon 680M with 12 GB of UMA DDR5-4800 as VRAM.
What program are you using to run the model (LM Studio, Ollama, text-generation-webui, etc.)? Depending on the UI, switching the backend to Vulkan or forcing a ROCm environment variable usually fixes this exact issue for RDNA2 integrated graphics. Let me know and I can help you find the right setting!
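For reference, here is a minimal sketch of the ROCm workaround mentioned above, assuming a llama.cpp HIP/ROCm build. The Radeon 680M (gfx1035) is not an officially supported ROCm target, so a common approach is to spoof a supported RDNA2 target via `HSA_OVERRIDE_GFX_VERSION`; the model path and flags below are illustrative, not your actual setup.

```shell
# Radeon 680M reports as gfx1035, which ROCm does not officially support.
# Overriding to gfx1030 (10.3.0) is a widely used RDNA2 workaround.
export HSA_OVERRIDE_GFX_VERSION=10.3.0

# Illustrative launch (substitute your own model path and flags):
# llama-server -m model.gguf -ngl 99 --port 11434

# Confirm the override is visible to child processes:
echo "$HSA_OVERRIDE_GFX_VERSION"
```

Whether this helps depends on the backend actually being ROCm; with a Vulkan build the variable has no effect.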
llama.cpp, with this setup: x64 Vulkan, build 8660.
llama-server -m /home/user/Gemma4_Q4_K_M.gguf -fa on -np 1 -ctk q8_0 -ctv q8_0 --host 0.0.0.0 -b 4096 -ub 1024 --port 11434 -ngl 99 -c 16384 --mmproj /home/user/Gemma4_mmproj.gguf
Tried the Gemma 4 fix in build 8661; nothing improved.
I just stress-tested the new Q4_K_M and Q5_K_M with your exact settings: -fa on, -ctk q8_0, -ctv q8_0, and -b 4096. It now hits 100+ t/s on prompt processing and responds instantly, without the 5-minute hang. The model is now correctly identified as gemma4 in the logs. Test it again; all the quants are updated as well.
It wasn't the model's fault, it was my setup's fault. I updated the settings and the llama.cpp build, and everything is completely fine now. Very sorry, and thanks.
I have updated the model for better performance.
Model latency is too long. It took over 5 minutes to start generating on a Radeon 680M with 12 GB of UMA DDR5-4800 as VRAM.
AI is made for CUDA. You can't expect fast inference and warmup on AMD.