Qwen3-Reranker-8B GGUF – Quantized by BatiAI
GGUF quantizations of Qwen/Qwen3-Reranker-8B, the top tier of the Qwen3 reranker family for maximum ranking precision. Part of BatiAI's on-device RAG stack for BatiFlow.
What is a reranker?
RAG pipeline: embedding (coarse retrieval) → reranker (precise scoring) → LLM (answer).
A reranker takes a (query, candidate_document) pair and returns a relevance score. It's the "second pass" after vector search: it turns "probably relevant" candidates into an ordered top-K that the LLM can use confidently.
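The two-pass idea can be sketched in a few lines. The `score` function below is a deliberately naive stand-in (token overlap), not the model's actual scorer; in practice you would replace it with a call to the cross-encoder served by llama.cpp:

```python
import re

def score(query: str, doc: str) -> float:
    # Stand-in relevance scorer using token overlap. A real reranker
    # returns a learned relevance score for the (query, doc) pair.
    q = set(re.findall(r"\w+", query.lower()))
    d = set(re.findall(r"\w+", doc.lower()))
    return len(q & d) / max(len(q), 1)

def rerank(query, candidates, top_k=3):
    # Second pass: score every vector-search candidate against the query,
    # then keep only the best top_k in descending order.
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_k]

docs = [
    "Paris is the capital of France.",
    "RAG combines retrieval with generation.",
    "Retrieval-augmented generation (RAG) grounds LLM answers in retrieved documents.",
]
print(rerank("What is retrieval-augmented generation?", docs, top_k=2))
```

The off-topic Paris document drops out; only the ordering step changes when a real model supplies the scores.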
When to pick 8B over 0.6B / 4B?
| Use case | Pick |
|---|---|
| Desktop workstation / plenty of RAM | 8B – best ranking accuracy, clearest margin on adversarial/ambiguous negatives |
| Typical laptop / 32 GB Mac | 4B – close to 8B quality at half the size |
| Edge / small Mac / batch rerank at scale | 0.6B – 13× smaller than 8B, still hits 100% pairwise accuracy on our test |
All three come from the same Qwen3-Reranker family at different sizes; 8B is the quality ceiling.
Quick Start (llama.cpp)
./llama-server -m Qwen3-Reranker-8B-Q8_0.gguf \
--rerank --pooling rank -c 4096 \
--host 127.0.0.1 --port 8090
curl http://127.0.0.1:8090/rerank -d '{
"query": "What is RAG?",
"documents": ["RAG ...", "Paris ..."]
}'
Note: Ollama doesn't have a native reranker endpoint yet, so this GGUF is intended for direct llama.cpp integration or tools like LangChain / LlamaIndex.
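For direct integration, the server's JSON reply has to be mapped back onto your documents. The sketch below assumes the response shape used by llama-server's reranking endpoint, a `results` list of `{"index": i, "relevance_score": s}` objects; verify the exact schema against your llama.cpp build. The sample scores are made up:

```python
def order_documents(documents, response, top_k=None):
    # Sort the server's scored entries and map each "index" back onto
    # the original documents list, highest relevance first.
    ranked = sorted(response["results"],
                    key=lambda r: r["relevance_score"], reverse=True)
    pairs = [(documents[r["index"]], r["relevance_score"]) for r in ranked]
    return pairs[:top_k] if top_k else pairs

# Mirrors the Quick Start curl call; fake_response is an illustrative
# payload, not actual model output.
docs = ["RAG ...", "Paris ..."]
fake_response = {"results": [
    {"index": 0, "relevance_score": 0.92},
    {"index": 1, "relevance_score": 0.11},
]}
print(order_documents(docs, fake_response))
```

Feeding the ordered top-K straight into the LLM prompt is the usual next step in the pipeline.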
Available Quantizations
| File | Quant | Size | Recommended |
|---|---|---|---|
| Qwen3-Reranker-8B-Q6_K.gguf | Q6_K | 5.8 GB | balanced (recommended default) |
| Qwen3-Reranker-8B-Q8_0.gguf | Q8_0 | 7.5 GB | near-lossless |
Quality Verification (measured)
Ran 40 (query, positive, negative) triples, 20 EN + 20 KO, on a hard test set with topically close negatives:
| Quant | Accuracy | Margin (pos-neg) |
|---|---|---|
| Q6_K | 100% | 0.819 |
| Q8_0 | 100% | 0.825 |
Pearson correlation Q6_K ↔ Q8_0: r = 0.9986, so quantization drift is essentially zero.
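For reference, the three metrics above are straightforward to compute from per-triple scores. The triples below are made-up stand-ins, not the actual test data:

```python
import math

def pairwise_accuracy(pairs):
    # Fraction of triples where the positive outscores the negative.
    return sum(pos > neg for pos, neg in pairs) / len(pairs)

def mean_margin(pairs):
    # Average score gap between positive and negative documents.
    return sum(pos - neg for pos, neg in pairs) / len(pairs)

def pearson(xs, ys):
    # Pearson correlation between two quants' raw scores: values near 1
    # mean quantization barely changes the ranking behaviour.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

q6 = [(0.95, 0.12), (0.90, 0.20), (0.88, 0.05)]  # (pos, neg) per triple
q8 = [(0.96, 0.11), (0.91, 0.19), (0.89, 0.06)]
print(pairwise_accuracy(q6), mean_margin(q6))
print(pearson([s for p in q6 for s in p], [s for p in q8 for s in p]))
```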
8B vs smaller variants (same testset, same script):
| Model | Hard margin | Drift (Q6↔Q8) |
|---|---|---|
| 0.6B | 0.723–0.751 | r = 0.996 |
| 4B | 0.650–0.672 | r = 0.998 |
| 8B | 0.819–0.825 | r = 0.999 |
The 8B's larger margin on adversarial negatives is its key differentiator: the score separation between the right answer and a close-but-wrong one is visibly wider, which matters in high-stakes retrieval where the top-1 result can't afford to be wrong.
Why Qwen3-Reranker?
- SOTA among open rerankers: top of MTEB reranking benchmarks
- Multilingual: en / ko / ja / zh
- Apache 2.0: commercial-friendly
Why BatiAI?
- Quantized directly from Alibaba's BF16 safetensors
- BatiAI-signed: `general.author: BatiAI`, `general.url: https://flow.bati.ai`
- Part of a full on-device RAG stack
Technical Details
- Original Model: Qwen/Qwen3-Reranker-8B
- Architecture: Qwen3 Causal LM (cross-encoder scorer)
- Parameters: 8 B
- Context: 32 K
- License: Apache 2.0
- Quantized with: llama.cpp build bafae2765
BatiAI's RAG Stack
| Role | Model | HF |
|---|---|---|
| Reranker (0.6 B) | Qwen3-Reranker-0.6B | batiai/Qwen3-Reranker-0.6B-GGUF |
| Reranker (4 B) | Qwen3-Reranker-4B | batiai/Qwen3-Reranker-4B-GGUF |
| Reranker (8 B) | Qwen3-Reranker-8B | this repo |
| VL Embedding (2 B) | Qwen3-VL-Embedding-2B | batiai/Qwen3-VL-Embedding-2B-GGUF |
| Chat LLM (35 B-A3B) | Qwen3.6-35B-A3B | batiai/Qwen3.6-35B-A3B-GGUF |
License
Mirrors upstream Qwen Apache 2.0. Commercial use permitted.