MiniMax-M2.7 JANGTQ4

MiniMax M2.7 228B MoE — 4-bit routed codebook + 8-bit affine, ~113 GB

Highest-quality JANGTQ profile — near-bf16 accuracy at ~50% of bf16 disk.

⚠️ Recommended: Run in MLX Studio or Osaurus. Both bundle the JANGTQ runtime (custom Metal kernels for codebook + Hadamard matmul). Stock mlx_lm.load() will NOT load this checkpoint.

Follow development on Twitter: @jangq_ai


What is JANGTQ4?

JANGTQ (JANG TurboQuant) is the highest-quality JANG quantization format at a given compression level. Routed expert weights stay in a compact codebook + Hadamard-rotated form at runtime — no decompression to affine — and the matmul path uses custom Metal kernels that read packed uint32 weights, look up centroids in a 16-entry codebook (at 4-bit), and accumulate dot products against a Hadamard-rotated input (QuIP#-style rotate-input-once math).
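The kernel math above can be illustrated with a small NumPy sketch: pack 4-bit codebook indices into uint32 words, look each index up in a 16-entry codebook, and dot against a Hadamard-rotated input. This is a toy model of the scheme, not the actual Metal kernel; all function names here are illustrative.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction: n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def pack4(indices):
    # Pack eight 4-bit codebook indices into each uint32 word.
    idx = indices.astype(np.uint32).reshape(-1, 8)
    packed = np.zeros(idx.shape[0], dtype=np.uint32)
    for i in range(8):
        packed |= idx[:, i] << np.uint32(4 * i)
    return packed

def unpack4(packed, n):
    # Inverse of pack4: recover n 4-bit indices.
    out = np.empty(n, dtype=np.uint32)
    for i in range(8):
        out[i::8] = (packed >> np.uint32(4 * i)) & np.uint32(0xF)
    return out

def tq_dot(packed, codebook, x_rot):
    # One output element: codebook lookup, then dot with the
    # already-rotated activation (rotate-input-once: H @ x is
    # computed one time per token, not once per weight row).
    w = codebook[unpack4(packed, x_rot.size)]
    return float(w @ x_rot)

# Toy demo: a 16-element weight row quantized to a 16-entry codebook.
rng = np.random.default_rng(0)
codebook = np.linspace(-1.0, 1.0, 16)   # stand-in for learned centroids
idx = rng.integers(0, 16, size=16)      # one 4-bit index per weight
x = rng.standard_normal(16)
x_rot = hadamard(16) @ x                # rotate the input once
y = tq_dot(pack4(idx), codebook, x_rot) # equals codebook[idx] @ x_rot
```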

JANGTQ4 trades disk size for quality: 4-bit codebooks capture the routed-expert weight distribution near-losslessly. Pick JANGTQ4 when RAM is available and you want the closest quality to bf16 on Apple Silicon; pick JANGTQ (2-bit) for the smallest footprint at minimal quality cost.

JANGTQ2 vs JANGTQ4 vs bf16

| | JANGTQ (2-bit) | JANGTQ4 | bf16 |
|---|---|---|---|
| Disk | 56.5 GB | ~113 GB | ~457 GB |
| Routed expert bits | 2 | 4 | 16 |
| Codebook size | 4 entries | 16 entries | n/a |
| Avg bits/param | ~2.15 | ~4.10 | 16 |
| MMLU 200q (baseline) | 91.5% | TBD (expected ≥ 94%) | 95.5% |
| Decode tok/s (M3 Ultra) | 44.3 | TBD | baseline |

Pick JANGTQ4 when RAM is available and you want the highest-quality 4-bit MiniMax M2.7. Pick the 2-bit JANGTQ for the smallest disk / RAM footprint.


Model Details

| Metric | Value |
|---|---|
| Source | MiniMaxAI/MiniMax-M2.7 (FP8 E4M3) |
| Architecture | MoE (256 experts, top-8 active), standard Q/K/V attention, partial RoPE |
| Total parameters | 228.7 B |
| Active per token | ~1.4 B |
| Profile | JANGTQ4 |
| Format | JANGTQ (codebook + Hadamard); `weight_format: mxtq` in `jang_config.json` |
| Avg bits/param | ~4.10 |
| Disk | ~113 GB |
| Context | 192 K tokens |
| Chat template | Always-reasoning (`<think>\n` opened at assistant start) |

JANGTQ4 Bit Allocation

| Component | Bits | Format | Why |
|---|---|---|---|
| Routed expert MLP (gate/up/down), ~98% of params | 4 | JANGTQ codebook + Hadamard | 16-entry codebook captures the routed-expert distribution near-losslessly |
| Attention (Q/K/V/O) | 8 | affine (`nn.QuantizedLinear`, group_size=64) | Runs on every token; quality-critical |
| Shared expert | 8 | affine | Runs on every token |
| Embed tokens / LM head | 8 | affine | Quality-critical input/output projections |
| Router gate | fp16 | unquantized `nn.Linear` | Routing precision matters |
| RMSNorms / RoPE / biases | fp16 | unquantized | Already tiny |
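As a sanity check on the ~4.10 average, the allocation above can be folded into a weighted mean. The parameter fractions below are illustrative round numbers inferred from the "~98% of params" figure, not exact tensor counts; codebook and sidecar overhead accounts for the remaining few hundredths of a bit.

```python
# Back-of-envelope average bits/param from the allocation table.
# Fractions are illustrative assumptions, not exact tensor counts.
fractions = {
    4:  0.980,  # routed expert MLPs (codebook)
    8:  0.019,  # attention, shared expert, embeddings (affine)
    16: 0.001,  # router gate, norms, biases (fp16)
}
avg_bits = sum(bits * frac for bits, frac in fractions.items())
print(round(avg_bits, 2))  # → 4.09, before codebook/sidecar overhead
```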

Important Settings

MiniMax M2.7 is an always-reasoning model. The chat template unconditionally opens <think>\n at each assistant turn.

| Setting | Value | Notes |
|---|---|---|
| Temperature | 1.0 | REQUIRED (temp=0 can cause thinking loops) |
| Top P | 0.95 | |
| Top K | 40 | |
| Repetition Penalty | 1.1 | Optional, helps prevent loops |
| max_tokens | ≥ 8192 | Give reasoning room to converge |

Strip <think>…</think> from the response before using the final answer.
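One way to do that stripping, as a minimal sketch (assumes the literal `<think>…</think>` markers from the chat template; a truncated generation may leave the tag unclosed, which this regex leaves untouched):

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(text: str) -> str:
    # Drop closed <think>…</think> blocks, keeping the final answer.
    return THINK_RE.sub("", text).strip()

answer = strip_thinking("<think>\nstep 1... step 2\n</think>\nFinal answer here.")
# answer == "Final answer here."
```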


Usage

```shell
pip install jang-tools
```

```python
from huggingface_hub import snapshot_download
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

model_path = snapshot_download("JANGQ-AI/MiniMax-M2.7-JANGTQ4")
model, tokenizer = load_jangtq_model(model_path)

messages = [{"role": "user", "content": "Explain photosynthesis in 5 sentences."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
out = generate(model, tokenizer, prompt, max_tokens=600, verbose=True, temperature=1.0)
```

Swift — MLX Studio / Osaurus

Both clients auto-detect the JANGTQ runtime from jang_config.json and route through the MiniMaxJANGTQModel class. Just load the repo — no extra flags.


What's In This Repo

| File | Role |
|---|---|
| `model-*.safetensors` | Weights: 4-bit routed TQ + 8-bit affine |
| `model.safetensors.index.json` | Shard index |
| `jangtq_runtime.safetensors` | Codebooks + Hadamard signs sidecar (Swift loader) |
| `jang_config.json` | JANG metadata + Tier-1 capabilities stamp (reasoning=qwen3, tool=minimax) |
| `config.json` | HF model config (`minimax_m2`, `weight_format=mxtq`, `mxtq_bits=4`) |
| `tokenizer.*`, `chat_template.jinja`, `vocab.json`, `merges.txt` | Tokenizer + chat template |
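Since stock `mlx_lm.load()` cannot read this checkpoint, a loader can guard on the metadata file before choosing a code path. This is an illustrative sketch, not part of `jang_tools`; the field name follows the `weight_format: mxtq` entry described above.

```python
import json
from pathlib import Path

def is_jangtq_checkpoint(model_dir: str) -> bool:
    """Return True if the directory carries JANGTQ metadata.

    Illustrative guard: checks jang_config.json for weight_format
    == "mxtq" before routing to a JANGTQ-aware loader instead of
    a stock affine loader.
    """
    cfg_path = Path(model_dir) / "jang_config.json"
    if not cfg_path.exists():
        return False
    cfg = json.loads(cfg_path.read_text())
    return cfg.get("weight_format") == "mxtq"
```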

Parser Capabilities (Tier-1 auto-detected by vmlx)

```json
{
  "reasoning_parser": "qwen3",
  "tool_parser": "minimax",
  "think_in_template": true,
  "supports_tools": true,
  "supports_thinking": true,
  "family": "minimax_m2",
  "modality": "text",
  "cache_type": "kv"
}
```

<think>...</think> and <tool_call>...</tool_call> are non-special tokens by design — the application layer parses them. vmlx's CapabilityDetector reads this block verbatim and wires the qwen3 reasoning parser + minimax tool parser automatically, so streamed responses route reasoning_content and tool_calls into the OpenAI-compatible SSE fields instead of leaking into content.
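A minimal application-layer parser along those lines might split a completed (non-streamed) response as below. This is a sketch of the idea under the assumption that the tags arrive as plain text, not vmlx's actual CapabilityDetector; the function name is hypothetical.

```python
import re

TAG_RE = re.compile(r"<(think|tool_call)>(.*?)</\1>", re.DOTALL)

def route_segments(text: str):
    """Split output into (channel, payload) pairs: 'reasoning' for
    <think>…</think>, 'tool' for <tool_call>…</tool_call>, and
    'content' for everything else."""
    out, pos = [], 0
    for m in TAG_RE.finditer(text):
        chunk = text[pos:m.start()].strip()
        if chunk:
            out.append(("content", chunk))
        channel = "reasoning" if m.group(1) == "think" else "tool"
        out.append((channel, m.group(2).strip()))
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        out.append(("content", tail))
    return out

segs = route_segments(
    '<think>plan steps</think>It is 3pm. <tool_call>{"name": "get_time"}</tool_call>'
)
# segs == [("reasoning", "plan steps"), ("content", "It is 3pm."),
#          ("tool", '{"name": "get_time"}')]
```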

License

MiniMax non-commercial (inherits from upstream — see LICENSE).

Credits

Created by Jinho Jang (eric@jangq.ai)

Based on MiniMaxAI's MiniMax M2.7. JANGTQ quantization © JANGQ-AI.

Kernels: hadamardRotate, fusedGateUpSwiGLU (P17 OPT=10), gatherTQ, compiled-router math (P15) — all in jang_tools/turboquant/.
