# MiniMax-M2.7 JANGTQ4

MiniMax M2.7 228B MoE — 4-bit routed codebook + 8-bit affine, ~113 GB.
Highest-quality JANGTQ profile — near-bf16 accuracy at ~25% of bf16 disk (~113 GB vs ~457 GB).
⚠️ Recommended: Run in MLX Studio or Osaurus. Both bundle the JANGTQ runtime (custom Metal kernels for codebook + Hadamard matmul). Stock `mlx_lm.load()` will NOT load this checkpoint.
Follow development on Twitter: @jangq_ai
## What is JANGTQ4?
JANGTQ (JANG TurboQuant) is the most-compressed, highest-quality JANG
quantization format. Routed expert weights stay in a compact codebook +
Hadamard-rotated form at runtime — no decompression to affine — and the
matmul path uses custom Metal kernels that read packed uint32 weights,
look up centroids in a 16-entry codebook (at 4-bit), and accumulate dot
products against a Hadamard-rotated input (QuIP# rotate-input-once math).
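The rotate-once codebook matmul can be sketched in plain NumPy. This is an illustrative toy, not the Metal kernel path: the real runtime reads packed uint32 weights on GPU, and JANGTQ's codebooks are fitted to the weight distribution rather than the uniform grid used here.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction: orthonormal n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
n = 8
W = rng.standard_normal((n, n))   # a tiny weight tile
x = rng.standard_normal(n)        # an activation vector
H = hadamard(n)

# Offline: rotate the weights once, then snap each entry to a 16-entry codebook.
Wr = W @ H.T
codebook = np.linspace(Wr.min(), Wr.max(), 16)     # toy uniform codebook
idx = np.abs(Wr[..., None] - codebook).argmin(-1)  # one 4-bit index per weight

# Runtime: rotate the input once, then matmul straight off codebook lookups --
# no round trip through an affine (scale/zero-point) representation.
xr = H @ x
y_approx = codebook[idx] @ xr
y_exact = W @ x   # H is orthogonal, so (W @ H.T) @ (H @ x) == W @ x
```

Because the Hadamard matrix is orthogonal, rotating the input once exactly cancels the offline weight rotation; the only error left is the codebook snap.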
JANGTQ4 trades disk size for quality — 4-bit codebooks capture the routed-
expert weight distribution near-losslessly. Pick JANGTQ4 when RAM is
available and you want the closest quality to bf16 on Apple Silicon. Pick
JANGTQ (2-bit) for the smallest footprint at minimal quality cost.
## JANGTQ2 vs JANGTQ4 vs bf16
| | JANGTQ (2-bit) | JANGTQ4 | bf16 |
|---|---|---|---|
| Disk | 56.5 GB | ~113 GB | ~457 GB |
| Routed expert bits | 2 | 4 | 16 |
| Codebook size | 4 entries | 16 entries | — |
| Avg bits/param | ~2.15 | ~4.10 | 16 |
| MMLU 200q (baseline) | 91.5% | TBD (expected ≥ 94%) | 95.5% |
| Decode tok/s (M3 Ultra) | 44.3 | TBD | baseline |
Pick JANGTQ4 when RAM is available and you want the highest-quality 4-bit MiniMax M2.7. Pick the 2-bit JANGTQ for the smallest disk / RAM footprint.
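As a sanity check, the disk figure follows from the table's bits-per-parameter average (numbers taken from this card; decimal vs binary units bracket the quoted size):

```python
# Back-of-the-envelope disk size from the model-card averages.
total_params = 228.7e9   # total parameters
avg_bits = 4.10          # JANGTQ4 average bits/param

size_gb  = total_params * avg_bits / 8 / 1e9    # decimal gigabytes
size_gib = total_params * avg_bits / 8 / 2**30  # binary gibibytes
# The quoted ~113 GB sits between the decimal (~117) and binary (~109) readings.
```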
## Model Details
| Metric | Value |
|---|---|
| Source | MiniMaxAI/MiniMax-M2.7 (FP8 E4M3) |
| Architecture | MoE (256 experts, top-8 active), standard Q/K/V attention, partial RoPE |
| Total parameters | 228.7 B |
| Active per token | ~1.4 B |
| Profile | JANGTQ4 |
| Format | JANGTQ (codebook + Hadamard) — `weight_format: mxtq` in `jang_config.json` |
| Avg bits/param | ~4.10 |
| Disk | ~113 GB |
| Context | 192 K tokens |
| Chat template | Always-reasoning (`<think>\n` opened at assistant start) |
## JANGTQ4 Bit Allocation
| Component | Bits | Format | Why |
|---|---|---|---|
| Routed expert MLP (gate/up/down) — 98% of params | 4 | JANGTQ codebook + Hadamard | 16-entry codebook captures the routed-expert distribution near-losslessly |
| Attention (Q/K/V/O) | 8 | affine (`nn.QuantizedLinear`, group_size=64) | Runs on every token; quality-critical |
| Shared expert | 8 | affine | Runs on every token |
| Embed tokens / LM head | 8 | affine | Quality-critical input/output projections |
| Router gate | fp16 | unquantized `nn.Linear` | Routing precision matters |
| RMSNorms / RoPE / biases | fp16 | unquantized | Already tiny |
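The 8-bit affine tiers store uint8 codes plus a per-group scale and offset. A minimal NumPy sketch of the idea (function names here are illustrative, not the `nn.QuantizedLinear` internals):

```python
import numpy as np

def affine_quantize(w, group_size=64, bits=8):
    """Per-group affine quantization: uint8 codes + per-group scale/offset."""
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.maximum((hi - lo) / (2**bits - 1), 1e-12)  # avoid div by zero
    q = np.round((g - lo) / scale).astype(np.uint8)
    return q, scale, lo

def affine_dequantize(q, scale, lo, shape):
    return (q * scale + lo).reshape(shape)

w = np.random.default_rng(1).standard_normal(4096).astype(np.float32)
q, scale, lo = affine_quantize(w)
w_hat = affine_dequantize(q, scale, lo, w.shape)
# Reconstruction error is bounded by half a quantization step per weight.
```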
## Important Settings
MiniMax M2.7 is an always-reasoning model. The chat template unconditionally opens `<think>\n` at each assistant turn.
| Setting | Value | Notes |
|---|---|---|
| Temperature | 1.0 | REQUIRED — temp=0 can cause thinking loops |
| Top P | 0.95 | |
| Top K | 40 | |
| Repetition Penalty | 1.1 | Optional, helps prevent loops |
| max_tokens | ≥ 8192 | Give reasoning room to converge |
Strip `<think>…</think>` from the response before using the final answer.
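One minimal way to do that stripping (a hypothetical helper, not part of jang-tools):

```python
import re

def strip_think(text: str) -> str:
    """Drop the <think>...</think> block the template always opens."""
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()

strip_think("<think>\nplanning...\n</think>\nHere is the answer.")
# -> "Here is the answer."
```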
## Usage
```shell
pip install jang-tools
```

```python
from huggingface_hub import snapshot_download
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

model_path = snapshot_download("JANGQ-AI/MiniMax-M2.7-JANGTQ4")
model, tokenizer = load_jangtq_model(model_path)

messages = [{"role": "user", "content": "Explain photosynthesis in 5 sentences."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

out = generate(model, tokenizer, prompt, max_tokens=600, verbose=True, temperature=1.0)
```
### Swift — MLX Studio / Osaurus
Both clients auto-detect the JANGTQ runtime from `jang_config.json` and route through the `MiniMaxJANGTQModel` class. Just load the repo — no extra flags.
## What's In This Repo
| File | Role |
|---|---|
| `model-*.safetensors` | Weights — 4-bit routed TQ + 8-bit affine |
| `model.safetensors.index.json` | Shard index |
| `jangtq_runtime.safetensors` | Codebooks + Hadamard signs sidecar (Swift loader) |
| `jang_config.json` | JANG metadata + Tier-1 capabilities stamp (reasoning=qwen3, tool=minimax) |
| `config.json` | HF model config (minimax_m2, weight_format=mxtq, mxtq_bits=4) |
| `tokenizer.*`, `chat_template.jinja`, `vocab.json`, `merges.txt` | Tokenizer + chat template |
## Parser Capabilities (Tier-1 auto-detected by vmlx)
```json
{
  "reasoning_parser": "qwen3",
  "tool_parser": "minimax",
  "think_in_template": true,
  "supports_tools": true,
  "supports_thinking": true,
  "family": "minimax_m2",
  "modality": "text",
  "cache_type": "kv"
}
```
`<think>...</think>` and `<tool_call>...</tool_call>` are non-special tokens by design — the application layer parses them. vmlx's `CapabilityDetector` reads this block verbatim and wires the qwen3 reasoning parser + minimax tool parser automatically, so streamed responses route `reasoning_content` and `tool_calls` into the OpenAI-compatible SSE fields instead of leaking into `content`.
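The effect of that routing can be approximated with a toy post-hoc splitter. This is a sketch only: the real vmlx parsers operate incrementally on streamed tokens, and the function name below is hypothetical.

```python
import json
import re

TAGS = re.compile(r"<think>(.*?)</think>|<tool_call>(.*?)</tool_call>", re.DOTALL)

def split_response(text):
    """Route raw tags into OpenAI-style fields (post-hoc, non-streaming toy)."""
    reasoning, tool_calls = [], []
    for think, tool in TAGS.findall(text):
        if think:
            reasoning.append(think)
        if tool:
            tool_calls.append(json.loads(tool))  # tool payloads are JSON objects
    content = TAGS.sub("", text).strip()         # whatever is left is user-facing
    return {
        "reasoning_content": "".join(reasoning).strip(),
        "tool_calls": tool_calls,
        "content": content,
    }
```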
## License
MiniMax non-commercial (inherits from upstream — see LICENSE).
## Credits
Created by Jinho Jang — eric@jangq.ai
Based on MiniMaxAI's MiniMax M2.7. JANGTQ quantization © JANGQ-AI.
Kernels: `hadamardRotate`, `fusedGateUpSwiGLU` (P17 OPT=10), `gatherTQ`, compiled-router math (P15) — all in `jang_tools/turboquant/`.