Model latency is too long. It took over 5 minutes to start generating on a Radeon 680M with 12 GB of UMA DDR5-4800 as VRAM.
What program are you using to run the model (LM Studio, Ollama, text-generation-webui, etc.)? Depending on the UI, switching the backend to Vulkan or forcing a ROCm environment variable usually fixes this exact issue for RDNA2 integrated graphics. Let me know and I can help you find the right setting!
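For reference, here is a minimal sketch of the ROCm workaround mentioned above, assuming a llama.cpp HIP/ROCm build. The Radeon 680M (gfx1035) is not an officially supported ROCm target, so a common approach is to spoof a supported RDNA2 target via `HSA_OVERRIDE_GFX_VERSION`; the model path and flags below are illustrative, not your actual setup.

```shell
# Radeon 680M reports as gfx1035, which ROCm does not officially support.
# Overriding to gfx1030 (10.3.0) is a widely used RDNA2 workaround.
export HSA_OVERRIDE_GFX_VERSION=10.3.0

# Illustrative launch (substitute your own model path and flags):
# llama-server -m model.gguf -ngl 99 --port 11434

# Confirm the override is visible to child processes:
echo "$HSA_OVERRIDE_GFX_VERSION"
```

Whether this helps depends on the backend actually being ROCm; with a Vulkan build the variable has no effect.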
llama.cpp, with this setup: x64 Vulkan, build 8660.
llama-server -m /home/user/Gemma4_Q4_K_M.gguf -fa on -np 1 -ctk q8_0 -ctv q8_0 --host 0.0.0.0 -b 4096 -ub 1024 --port 11434 -ngl 99 -c 16384 --mmproj /home/user/Gemma4_mmproj.gguf
Tried the Gemma 4 fix in build 8661; nothing improved.
I just stress-tested the new Q4_K_M and Q5_K_M with your exact settings: -fa on, -ctk q8_0, -ctv q8_0, and -b 4096. It now hits 100+ t/s on prompt processing and responds instantly, without the 5-minute hang. The model is now correctly identified as gemma4 in the logs. Test it again; all the quants are updated as well.
It wasn't the model's fault, it was my setup's fault. I updated the settings and the llama.cpp build, and everything is completely fine now. Very sorry, and thanks.
I have updated the model for better performance.
Model latency is too long. It took over 5 minutes to start generating on a Radeon 680M with 12 GB of UMA DDR5-4800 as VRAM.
AI is made for CUDA. You can't expect fast inference and warmup on AMD.