model latency is too long.

#1
by AMator - opened

model latency is too long. It took over 5 minutes to start generating on a Radeon 680M with 12 GB of UMA DDR5-4800 as VRAM.

What program are you using to run the model (LM Studio, Ollama, text-generation-webui, etc.)? Depending on the UI, switching the backend to Vulkan or forcing a ROCm environment variable usually fixes this exact issue for RDNA2 integrated graphics. Let me know and I can help you find the right setting!

llama.cpp, with the settings below: x64 Vulkan, build 8660.

llama-server -m /home/user/Gemma4_Q4_K_M.gguf -fa on -np 1 -ctk q8_0 -ctv q8_0 --host 0.0.0.0 -b 4096 -ub 1024 --port 11434 -ngl 99 -c 16384 --mmproj /home/user/Gemma4_mmproj.gguf
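The -ctk q8_0 / -ctv q8_0 flags in the command above quantize the KV cache, which matters a lot on a 12 GB UMA setup. A rough back-of-the-envelope sketch of the savings versus the f16 default (the layer/head dimensions below are placeholders, not Gemma's real config; q8_0 stores 32 elements in 34 bytes, hence ~1.06 bytes/element):

```python
# Rough KV-cache size estimate, illustrating why -ctk q8_0 -ctv q8_0
# shrinks memory use versus the f16 default. Dimensions are placeholders.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    # K and V each store n_ctx vectors of n_kv_heads * head_dim per layer.
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

f16 = kv_cache_bytes(32, 8, 128, 16384, 2.0)     # f16: 2 bytes/element
q8  = kv_cache_bytes(32, 8, 128, 16384, 1.0625)  # q8_0: ~1.06 bytes/element

print(f"f16:  {f16 / 2**20:.0f} MiB")   # f16:  2048 MiB
print(f"q8_0: {q8 / 2**20:.0f} MiB")    # q8_0: 1088 MiB
```

At the 16384-token context from the command line, that roughly halves the KV-cache footprint, which is often the difference between fitting in UMA memory and spilling.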

Tried the Gemma 4 fix in build 8661; nothing improved.

I just stress-tested the new Q4_K_M and Q5_K_M with your exact settings: -fa on, -ctk q8_0, -ctv q8_0, and -b 4096. It now hits 100+ t/s on prompt processing and responds instantly, without the 5-minute hang. The model is now correctly identified as gemma4 in the logs. Please test it again; all the quants have been updated as well.
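To verify throughput numbers like the 100+ t/s quoted above, llama-server returns a per-request "timings" block in its completion JSON. A minimal sketch of turning those counts into tokens-per-second (the field names match what recent llama.cpp builds emit, but check your server's actual response; the sample values below are made up):

```python
# Convert llama-server's per-request "timings" block into tokens/s for
# the prompt-processing and generation phases.

def tokens_per_second(timings):
    # Token counts divided by elapsed seconds for each phase.
    pp = timings["prompt_n"] / (timings["prompt_ms"] / 1000.0)
    tg = timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)
    return pp, tg

# Hypothetical sample response, not real benchmark data.
sample = {"prompt_n": 512, "prompt_ms": 4000.0,
          "predicted_n": 128, "predicted_ms": 16000.0}
pp, tg = tokens_per_second(sample)
print(f"prompt: {pp:.0f} t/s, generation: {tg:.0f} t/s")
```

Comparing these two numbers across builds is the quickest way to tell whether a regression is in prompt processing (the long startup reported here) or in token generation.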

It wasn't the model's fault; it was my setup's fault. I updated the settings and the llama.cpp build, and everything works fine now. Very sorry, and thanks.

I have updated the model for better theoretical performance.


AI is made for CUDA. You can't expect fast inference and warmup on AMD.
