MiniMax-M2.7 JANGTQ4

MiniMax M2.7 228B MoE — 4-bit routed codebook + 8-bit affine, ~113 GB

Highest-quality JANGTQ profile — near-bf16 accuracy at ~50% of bf16 disk.

⚠️ Recommended: Run in MLX Studio or Osaurus. Both bundle the JANGTQ runtime (custom Metal kernels for codebook + Hadamard matmul). Stock mlx_lm.load() will NOT load this checkpoint.

Follow development on Twitter: @jangq_ai


What is JANGTQ4?

JANGTQ (JANG TurboQuant) is the highest-quality JANG quantization format at a given compression level. Routed expert weights stay in a compact codebook + Hadamard-rotated form at runtime — no decompression to affine — and the matmul path uses custom Metal kernels that read packed uint32 weights, look up centroids in a 16-entry codebook (at 4-bit), and accumulate dot products against a Hadamard-rotated input (QuIP#-style rotate-input-once math).
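The kernel math above can be illustrated with a small NumPy sketch: pack 4-bit codebook indices into uint32 words, look each index up in a 16-entry codebook, and dot against a Hadamard-rotated input. This is a toy model of the scheme, not the actual Metal kernel; all function names here are illustrative.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction: n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def pack4(indices):
    # Pack eight 4-bit codebook indices into each uint32 word.
    idx = indices.astype(np.uint32).reshape(-1, 8)
    packed = np.zeros(idx.shape[0], dtype=np.uint32)
    for i in range(8):
        packed |= idx[:, i] << np.uint32(4 * i)
    return packed

def unpack4(packed, n):
    # Inverse of pack4: recover n 4-bit indices.
    out = np.empty(n, dtype=np.uint32)
    for i in range(8):
        out[i::8] = (packed >> np.uint32(4 * i)) & np.uint32(0xF)
    return out

def tq_dot(packed, codebook, x_rot):
    # One output element: codebook lookup, then dot with the
    # already-rotated activation (rotate-input-once: H @ x is
    # computed one time per token, not once per weight row).
    w = codebook[unpack4(packed, x_rot.size)]
    return float(w @ x_rot)

# Toy demo: a 16-element weight row quantized to a 16-entry codebook.
rng = np.random.default_rng(0)
codebook = np.linspace(-1.0, 1.0, 16)   # stand-in for learned centroids
idx = rng.integers(0, 16, size=16)      # one 4-bit index per weight
x = rng.standard_normal(16)
x_rot = hadamard(16) @ x                # rotate the input once
y = tq_dot(pack4(idx), codebook, x_rot) # equals codebook[idx] @ x_rot
```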

JANGTQ4 trades disk size for quality: 4-bit codebooks capture the routed-expert weight distribution near-losslessly. Pick JANGTQ4 when RAM is available and you want the closest quality to bf16 on Apple Silicon; pick JANGTQ (2-bit) for the smallest footprint at minimal quality cost.

JANGTQ2 vs JANGTQ4 vs bf16

| | JANGTQ (2-bit) | JANGTQ4 | bf16 |
|---|---|---|---|
| Disk | 56.5 GB | ~113 GB | ~457 GB |
| Routed expert bits | 2 | 4 | 16 |
| Codebook size | 4 entries | 16 entries | n/a |
| Avg bits/param | ~2.15 | ~4.10 | 16 |
| MMLU 200q (baseline) | 91.5% | TBD (expected ≥ 94%) | 95.5% |
| Decode tok/s (M3 Ultra) | 44.3 | TBD | baseline |

Pick JANGTQ4 when RAM is available and you want the highest-quality 4-bit MiniMax M2.7. Pick the 2-bit JANGTQ for the smallest disk / RAM footprint.


Model Details

| Metric | Value |
|---|---|
| Source | MiniMaxAI/MiniMax-M2.7 (FP8 E4M3) |
| Architecture | MoE (256 experts, top-8 active), standard Q/K/V attention, partial RoPE |
| Total parameters | 228.7 B |
| Active per token | ~1.4 B |
| Profile | JANGTQ4 |
| Format | JANGTQ (codebook + Hadamard); `weight_format: mxtq` in `jang_config.json` |
| Avg bits/param | ~4.10 |
| Disk | ~113 GB |
| Context | 192 K tokens |
| Chat template | Always-reasoning (`<think>\n` opened at assistant start) |

JANGTQ4 Bit Allocation

| Component | Bits | Format | Why |
|---|---|---|---|
| Routed expert MLP (gate/up/down), ~98% of params | 4 | JANGTQ codebook + Hadamard | 16-entry codebook captures the routed-expert distribution near-losslessly |
| Attention (Q/K/V/O) | 8 | affine (`nn.QuantizedLinear`, group_size=64) | Runs on every token; quality-critical |
| Shared expert | 8 | affine | Runs on every token |
| Embed tokens / LM head | 8 | affine | Quality-critical input/output projections |
| Router gate | fp16 | unquantized `nn.Linear` | Routing precision matters |
| RMSNorms / RoPE / biases | fp16 | unquantized | Already tiny |
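As a sanity check on the ~4.10 average, the allocation above can be folded into a weighted mean. The parameter fractions below are illustrative round numbers inferred from the "~98% of params" figure, not exact tensor counts; codebook and sidecar overhead accounts for the remaining few hundredths of a bit.

```python
# Back-of-envelope average bits/param from the allocation table.
# Fractions are illustrative assumptions, not exact tensor counts.
fractions = {
    4:  0.980,  # routed expert MLPs (codebook)
    8:  0.019,  # attention, shared expert, embeddings (affine)
    16: 0.001,  # router gate, norms, biases (fp16)
}
avg_bits = sum(bits * frac for bits, frac in fractions.items())
print(round(avg_bits, 2))  # → 4.09, before codebook/sidecar overhead
```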

Important Settings

MiniMax M2.7 is an always-reasoning model. The chat template unconditionally opens <think>\n at each assistant turn.

| Setting | Value | Notes |
|---|---|---|
| Temperature | 1.0 | REQUIRED (temp=0 can cause thinking loops) |
| Top P | 0.95 | |
| Top K | 40 | |
| Repetition Penalty | 1.1 | Optional, helps prevent loops |
| max_tokens | ≥ 8192 | Give reasoning room to converge |

Strip <think>…</think> from the response before using the final answer.
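One way to do that stripping, as a minimal sketch (assumes the literal `<think>…</think>` markers from the chat template; a truncated generation may leave the tag unclosed, which this regex leaves untouched):

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(text: str) -> str:
    # Drop closed <think>…</think> blocks, keeping the final answer.
    return THINK_RE.sub("", text).strip()

answer = strip_thinking("<think>\nstep 1... step 2\n</think>\nFinal answer here.")
# answer == "Final answer here."
```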


Usage

```shell
pip install jang-tools
```

```python
from huggingface_hub import snapshot_download
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

model_path = snapshot_download("JANGQ-AI/MiniMax-M2.7-JANGTQ4")
model, tokenizer = load_jangtq_model(model_path)

messages = [{"role": "user", "content": "Explain photosynthesis in 5 sentences."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
out = generate(model, tokenizer, prompt, max_tokens=600, verbose=True, temperature=1.0)
```

Swift — MLX Studio / Osaurus

Both clients auto-detect the JANGTQ runtime from jang_config.json and route through the MiniMaxJANGTQModel class. Just load the repo — no extra flags.


What's In This Repo

| File | Role |
|---|---|
| `model-*.safetensors` | Weights: 4-bit routed TQ + 8-bit affine |
| `model.safetensors.index.json` | Shard index |
| `jangtq_runtime.safetensors` | Codebooks + Hadamard signs sidecar (Swift loader) |
| `jang_config.json` | JANG metadata + Tier-1 capabilities stamp (reasoning=qwen3, tool=minimax) |
| `config.json` | HF model config (`minimax_m2`, `weight_format=mxtq`, `mxtq_bits=4`) |
| `tokenizer.*`, `chat_template.jinja`, `vocab.json`, `merges.txt` | Tokenizer + chat template |
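Since stock `mlx_lm.load()` cannot read this checkpoint, a loader can guard on the metadata file before choosing a code path. This is an illustrative sketch, not part of `jang_tools`; the field name follows the `weight_format: mxtq` entry described above.

```python
import json
from pathlib import Path

def is_jangtq_checkpoint(model_dir: str) -> bool:
    """Return True if the directory carries JANGTQ metadata.

    Illustrative guard: checks jang_config.json for weight_format
    == "mxtq" before routing to a JANGTQ-aware loader instead of
    a stock affine loader.
    """
    cfg_path = Path(model_dir) / "jang_config.json"
    if not cfg_path.exists():
        return False
    cfg = json.loads(cfg_path.read_text())
    return cfg.get("weight_format") == "mxtq"
```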

Parser Capabilities (Tier-1 auto-detected by vmlx)

```json
{
  "reasoning_parser": "qwen3",
  "tool_parser": "minimax",
  "think_in_template": true,
  "supports_tools": true,
  "supports_thinking": true,
  "family": "minimax_m2",
  "modality": "text",
  "cache_type": "kv"
}
```

<think>...</think> and <tool_call>...</tool_call> are non-special tokens by design — the application layer parses them. vmlx's CapabilityDetector reads this block verbatim and wires the qwen3 reasoning parser + minimax tool parser automatically, so streamed responses route reasoning_content and tool_calls into the OpenAI-compatible SSE fields instead of leaking into content.
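A minimal application-layer parser along those lines might split a completed (non-streamed) response as below. This is a sketch of the idea under the assumption that the tags arrive as plain text, not vmlx's actual CapabilityDetector; the function name is hypothetical.

```python
import re

TAG_RE = re.compile(r"<(think|tool_call)>(.*?)</\1>", re.DOTALL)

def route_segments(text: str):
    """Split output into (channel, payload) pairs: 'reasoning' for
    <think>…</think>, 'tool' for <tool_call>…</tool_call>, and
    'content' for everything else."""
    out, pos = [], 0
    for m in TAG_RE.finditer(text):
        chunk = text[pos:m.start()].strip()
        if chunk:
            out.append(("content", chunk))
        channel = "reasoning" if m.group(1) == "think" else "tool"
        out.append((channel, m.group(2).strip()))
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        out.append(("content", tail))
    return out

segs = route_segments(
    '<think>plan steps</think>It is 3pm. <tool_call>{"name": "get_time"}</tool_call>'
)
# segs == [("reasoning", "plan steps"), ("content", "It is 3pm."),
#          ("tool", '{"name": "get_time"}')]
```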

License

MiniMax non-commercial (inherits from upstream — see LICENSE).

Credits

Created by Jinho Jang (eric@jangq.ai)

Based on MiniMaxAI's MiniMax M2.7. JANGTQ quantization © JANGQ-AI.

Kernels: hadamardRotate, fusedGateUpSwiGLU (P17 OPT=10), gatherTQ, compiled-router math (P15) — all in jang_tools/turboquant/.
