Qwen3-Reranker-8B GGUF – Quantized by BatiAI

BatiFlow Upstream

GGUF quantizations of Qwen/Qwen3-Reranker-8B – the top tier of the Qwen3 reranker family, for maximum ranking precision. Part of BatiAI's on-device RAG stack for BatiFlow.

What is a reranker?

RAG pipeline: embedding (coarse retrieval) → reranker (precise scoring) → LLM (answer).

A reranker takes a (query, candidate_document) pair and returns a relevance score. It's the "second pass" after vector search – it turns "probably relevant" candidates into an ordered top-K that the LLM can use confidently.
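The second pass can be sketched in a few lines. This is a toy illustration only: `overlap_score` is a naive word-overlap stand-in for the actual model call, and every name in it is hypothetical.

```python
# Toy sketch of the reranker "second pass": the embedding stage returns
# coarse candidates; the reranker scores each (query, doc) pair and we
# keep an ordered top-K.
def rerank(query, candidates, score, top_k=3):
    scored = [(score(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

# Stand-in scorer: naive word overlap. A real deployment would call the
# reranker model here instead.
def overlap_score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

docs = ["RAG combines retrieval with generation",
        "Paris is the capital of France",
        "Retrieval augmented generation grounds LLM answers"]
top = rerank("what is retrieval augmented generation", docs,
             overlap_score, top_k=2)
```

The `score` function is the only part the model replaces; the sort-and-truncate step is the same regardless of scorer.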

When to pick 8B over 0.6B / 4B?

| Use case | Pick |
|---|---|
| Desktop workstation / plenty of RAM | 8B – best ranking accuracy, clearest margin on adversarial/ambiguous negatives |
| Typical laptop / 32 GB Mac | 4B – close to 8B quality at half the size |
| Edge / small Mac / batch rerank at scale | 0.6B – 13× smaller than 8B, still hits 100% pairwise accuracy on our test |

All three are from the same Qwen3-Reranker family, just at different sizes. 8B is the quality ceiling.

Quick Start (llama.cpp)

./llama-server -m Qwen3-Reranker-8B-Q8_0.gguf \
  --rerank --pooling rank -c 4096 \
  --host 127.0.0.1 --port 8090

curl http://127.0.0.1:8090/rerank -d '{
  "query": "What is RAG?",
  "documents": ["RAG ...", "Paris ..."]
}'

Note: Ollama doesn't have a native reranker endpoint yet, so this GGUF is intended for direct llama.cpp integration or tools like LangChain / LlamaIndex.
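For direct integration, a minimal Python client only needs to sort the server's results back onto the documents. This sketch assumes the Jina/Cohere-style response schema (`results` entries with `index` and `relevance_score`) that llama.cpp's rerank endpoint is modeled on; the response body below is a made-up placeholder, so verify the schema against your server version.

```python
import json

# Hypothetical /rerank response (placeholder values, assumed schema):
# one {"index", "relevance_score"} entry per input document.
response_body = json.dumps({
    "results": [
        {"index": 0, "relevance_score": 4.2},
        {"index": 1, "relevance_score": -7.9},
    ]
})

documents = ["RAG ...", "Paris ..."]

def ranked_documents(body, docs):
    # Map each result back to its document and sort best-first.
    results = json.loads(body)["results"]
    results.sort(key=lambda r: r["relevance_score"], reverse=True)
    return [(docs[r["index"]], r["relevance_score"]) for r in results]

ranked = ranked_documents(response_body, documents)
```

Scores are raw model outputs, not probabilities; only their ordering (and relative gaps) matter for picking the top-K.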

Available Quantizations

| File | Quant | Size | Notes |
|---|---|---|---|
| Qwen3-Reranker-8B-Q6_K.gguf | Q6_K | 5.8 GB | balanced (recommended default) |
| Qwen3-Reranker-8B-Q8_0.gguf | Q8_0 | 7.5 GB | near-lossless |

Quality Verification (measured)

We ran 40 (query, positive, negative) triples – 20 English + 20 Korean – on a hard test set with topically close negatives:

| Quant | Accuracy | Margin (pos − neg) |
|---|---|---|
| Q6_K | 100% | 0.819 |
| Q8_0 | 100% | 0.825 |

Pearson correlation between Q6_K and Q8_0 scores: r = 0.9986, so quantization drift is essentially zero.
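The three metrics above are straightforward to reproduce. This sketch shows how they are defined; the `pos`/`neg` score lists here are made-up placeholders, not the measured data.

```python
# pos/neg are per-triple reranker scores for the positive and negative
# document. Placeholder values only.
q6_pos = [0.95, 0.90, 0.88]; q6_neg = [0.10, 0.12, 0.05]
q8_pos = [0.96, 0.91, 0.87]; q8_neg = [0.09, 0.11, 0.06]

def pairwise_accuracy(pos, neg):
    # Fraction of triples where the positive outscores the negative.
    return sum(p > n for p, n in zip(pos, neg)) / len(pos)

def mean_margin(pos, neg):
    # Average score gap between positive and negative.
    return sum(p - n for p, n in zip(pos, neg)) / len(pos)

def pearson(x, y):
    # Pearson r, computed from scratch to avoid dependencies.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

acc = pairwise_accuracy(q6_pos, q6_neg)
margin = mean_margin(q6_pos, q6_neg)
# Drift between quants: correlate all Q6_K scores against all Q8_0 scores.
r = pearson(q6_pos + q6_neg, q8_pos + q8_neg)
```

An r near 1.0 between the two quants means the quantization barely changes the score ordering, which is what the 0.9986 figure above indicates.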

8B vs smaller variants (same testset, same script):

| Model | Hard margin | Drift (Q6 ↔ Q8) |
|---|---|---|
| 0.6B | 0.723–0.751 | r = 0.996 |
| 4B | 0.650–0.672 | r = 0.998 |
| 8B | 0.819–0.825 | r = 0.999 |

The 8B's larger margin on adversarial negatives is its key differentiator: the score separation between "right answer" and "close-but-wrong" is visibly wider, which helps in high-stakes retrieval where you can't afford the top-1 result to be wrong.

Why Qwen3-Reranker?

  • SOTA among open rerankers – tops the MTEB reranking benchmarks
  • Multilingual – en / ko / ja / zh
  • Apache 2.0 – commercial-friendly

Why BatiAI?

  • Quantized directly from Alibaba's BF16 safetensors
  • BatiAI-signed – general.author: BatiAI, general.url: https://flow.bati.ai
  • Part of a full on-device RAG stack

Technical Details

  • Original Model: Qwen/Qwen3-Reranker-8B
  • Architecture: Qwen3 Causal LM (cross-encoder scorer)
  • Parameters: 8B
  • Context: 32K tokens
  • License: Apache 2.0
  • Quantized with: llama.cpp build bafae2765

BatiAI's RAG Stack

| Role | Model | HF |
|---|---|---|
| Reranker (0.6B) | Qwen3-Reranker-0.6B | batiai/Qwen3-Reranker-0.6B-GGUF |
| Reranker (4B) | Qwen3-Reranker-4B | batiai/Qwen3-Reranker-4B-GGUF |
| Reranker (8B) | Qwen3-Reranker-8B | this repo |
| VL Embedding (2B) | Qwen3-VL-Embedding-2B | batiai/Qwen3-VL-Embedding-2B-GGUF |
| Chat LLM (35B-A3B) | Qwen3.6-35B-A3B | batiai/Qwen3.6-35B-A3B-GGUF |

License

Mirrors the upstream Qwen Apache 2.0 license. Commercial use is permitted.
