Qwen3-Reranker-8B GGUF – Quantized by BatiAI

BatiFlow Upstream

GGUF quantizations of Qwen/Qwen3-Reranker-8B – the top tier of the Qwen3 reranker family, for maximum ranking precision. Part of BatiAI's on-device RAG stack for BatiFlow.

What is a reranker?

RAG pipeline: embedding (coarse retrieval) → reranker (precise scoring) → LLM (answer).

A reranker takes a (query, candidate_document) pair and returns a relevance score. It's the "second pass" after vector search – it turns "probably relevant" candidates into an ordered top-K that the LLM can use confidently.
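The second pass can be sketched in a few lines. This is a toy illustration only: `overlap_score` is a naive word-overlap stand-in for the actual model call, and every name in it is hypothetical.

```python
# Toy sketch of the reranker "second pass": the embedding stage returns
# coarse candidates; the reranker scores each (query, doc) pair and we
# keep an ordered top-K.
def rerank(query, candidates, score, top_k=3):
    scored = [(score(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

# Stand-in scorer: naive word overlap. A real deployment would call the
# reranker model here instead.
def overlap_score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

docs = ["RAG combines retrieval with generation",
        "Paris is the capital of France",
        "Retrieval augmented generation grounds LLM answers"]
top = rerank("what is retrieval augmented generation", docs,
             overlap_score, top_k=2)
```

The `score` function is the only part the model replaces; the sort-and-truncate step is the same regardless of scorer.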

When to pick 8B over 0.6B / 4B?

| Use case | Pick |
|---|---|
| Desktop workstation / plenty of RAM | 8B – best ranking accuracy, clearest margin on adversarial/ambiguous negatives |
| Typical laptop / 32 GB Mac | 4B – close to 8B quality at half the size |
| Edge / small Mac / batch rerank at scale | 0.6B – 13× smaller than 8B, still hits 100% pairwise accuracy on our test |

All three are from the same Qwen3-Reranker family, just at different sizes. 8B is the quality ceiling.

Quick Start (llama.cpp)

./llama-server -m Qwen3-Reranker-8B-Q8_0.gguf \
  --rerank --pooling rank -c 4096 \
  --host 127.0.0.1 --port 8090

curl http://127.0.0.1:8090/rerank -d '{
  "query": "What is RAG?",
  "documents": ["RAG ...", "Paris ..."]
}'

Note: Ollama doesn't have a native reranker endpoint yet, so this GGUF is intended for direct llama.cpp integration or tools like LangChain / LlamaIndex.
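For direct integration, a minimal Python client only needs to sort the server's results back onto the documents. This sketch assumes the Jina/Cohere-style response schema (`results` entries with `index` and `relevance_score`) that llama.cpp's rerank endpoint is modeled on; the response body below is a made-up placeholder, so verify the schema against your server version.

```python
import json

# Hypothetical /rerank response (placeholder values, assumed schema):
# one {"index", "relevance_score"} entry per input document.
response_body = json.dumps({
    "results": [
        {"index": 0, "relevance_score": 4.2},
        {"index": 1, "relevance_score": -7.9},
    ]
})

documents = ["RAG ...", "Paris ..."]

def ranked_documents(body, docs):
    # Map each result back to its document and sort best-first.
    results = json.loads(body)["results"]
    results.sort(key=lambda r: r["relevance_score"], reverse=True)
    return [(docs[r["index"]], r["relevance_score"]) for r in results]

ranked = ranked_documents(response_body, documents)
```

Scores are raw model outputs, not probabilities; only their ordering (and relative gaps) matter for picking the top-K.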

Available Quantizations

| File | Quant | Size | Notes |
|---|---|---|---|
| Qwen3-Reranker-8B-Q6_K.gguf | Q6_K | 5.8 GB | balanced (recommended default) |
| Qwen3-Reranker-8B-Q8_0.gguf | Q8_0 | 7.5 GB | near-lossless |

Quality Verification (measured)

We ran 40 (query, positive, negative) triples – 20 English + 20 Korean – on a hard test set with topically close negatives:

| Quant | Accuracy | Margin (pos − neg) |
|---|---|---|
| Q6_K | 100% | 0.819 |
| Q8_0 | 100% | 0.825 |

Pearson correlation between Q6_K and Q8_0 scores: r = 0.9986, so quantization drift is essentially zero.
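The three metrics above are straightforward to reproduce. This sketch shows how they are defined; the `pos`/`neg` score lists here are made-up placeholders, not the measured data.

```python
# pos/neg are per-triple reranker scores for the positive and negative
# document. Placeholder values only.
q6_pos = [0.95, 0.90, 0.88]; q6_neg = [0.10, 0.12, 0.05]
q8_pos = [0.96, 0.91, 0.87]; q8_neg = [0.09, 0.11, 0.06]

def pairwise_accuracy(pos, neg):
    # Fraction of triples where the positive outscores the negative.
    return sum(p > n for p, n in zip(pos, neg)) / len(pos)

def mean_margin(pos, neg):
    # Average score gap between positive and negative.
    return sum(p - n for p, n in zip(pos, neg)) / len(pos)

def pearson(x, y):
    # Pearson r, computed from scratch to avoid dependencies.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

acc = pairwise_accuracy(q6_pos, q6_neg)
margin = mean_margin(q6_pos, q6_neg)
# Drift between quants: correlate all Q6_K scores against all Q8_0 scores.
r = pearson(q6_pos + q6_neg, q8_pos + q8_neg)
```

An r near 1.0 between the two quants means the quantization barely changes the score ordering, which is what the 0.9986 figure above indicates.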

8B vs smaller variants (same testset, same script):

| Model | Hard margin | Drift (Q6 ↔ Q8) |
|---|---|---|
| 0.6B | 0.723–0.751 | r = 0.996 |
| 4B | 0.650–0.672 | r = 0.998 |
| 8B | 0.819–0.825 | r = 0.999 |

The 8B's larger margin on adversarial negatives is its key differentiator: the score separation between "right answer" and "close-but-wrong" is visibly wider, which helps in high-stakes retrieval where you can't afford the top-1 result to be wrong.

Why Qwen3-Reranker?

  • SOTA among open rerankers – tops the MTEB reranking benchmarks
  • Multilingual – en / ko / ja / zh
  • Apache 2.0 – commercial-friendly

Why BatiAI?

  • Quantized directly from Alibaba's BF16 safetensors
  • BatiAI-signed – general.author: BatiAI, general.url: https://flow.bati.ai
  • Part of a full on-device RAG stack

Technical Details

  • Original Model: Qwen/Qwen3-Reranker-8B
  • Architecture: Qwen3 Causal LM (cross-encoder scorer)
  • Parameters: 8B
  • Context: 32K tokens
  • License: Apache 2.0
  • Quantized with: llama.cpp build bafae2765

BatiAI's RAG Stack

| Role | Model | HF |
|---|---|---|
| Reranker (0.6B) | Qwen3-Reranker-0.6B | batiai/Qwen3-Reranker-0.6B-GGUF |
| Reranker (4B) | Qwen3-Reranker-4B | batiai/Qwen3-Reranker-4B-GGUF |
| Reranker (8B) | Qwen3-Reranker-8B | this repo |
| VL Embedding (2B) | Qwen3-VL-Embedding-2B | batiai/Qwen3-VL-Embedding-2B-GGUF |
| Chat LLM (35B-A3B) | Qwen3.6-35B-A3B | batiai/Qwen3.6-35B-A3B-GGUF |

License

Mirrors the upstream Qwen Apache 2.0 license. Commercial use is permitted.
