# MiniMax-M2.7-REAP-172B-A10B-NVFP4-GB10
NVFP4-GB10 quantization of a 25%-REAP-pruned MiniMax-M2.7, targeted at NVIDIA DGX Spark (GB10) and Blackwell-family hardware. Both the REAP pruning AND the NVFP4 calibration use a 6-dataset agentic mix for coherent preservation of tool-use, code-generation, math, and software-engineering agent capabilities. 98.9 GB on disk — fits in a single 128 GB DGX Spark.
## Model Details

| Field | Value |
|---|---|
| Base Model (BF16) | saricles/MiniMax-M2.7-REAP-172B-A10B-BF16 |
| Original Base | MiniMaxAI/MiniMax-M2.7 |
| Architecture | MiniMaxM2ForCausalLM (MoE, 192 experts, top-K=8) |
| Total Parameters | 172B (REAP-pruned from 230B) |
| Active Parameters | ~10B per token |
| Hidden Layers | 62 |
| Quantization | NVFP4 (4-bit floating point) with GB10-tuned ignore list |
| Format | compressed-tensors (safetensors) |
| Size on Disk | 98.9 GB |
| Deployment | 1× DGX Spark (fits in a single 128 GB Spark) |
| License | Other (inherited from MiniMaxAI/MiniMax-M2.7) |
## Lineage
- Base: MiniMaxAI/MiniMax-M2.7 (230B, 256 experts, FP8 native)
- Dequantize: FP8 → BF16
- REAP prune: BF16 → 75% keep (192/256 experts), agentic 6-dataset calibration → saricles/MiniMax-M2.7-REAP-172B-A10B-BF16
- NVFP4 quantize (this model): REAP BF16 + same 6-dataset agentic calibration + GB10-tuned ignore list
Both pruning AND quantization calibration use the same agentic mix, preserving coherent capability targeting end-to-end.
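REAP's saliency metric isn't reproduced here, but the selection step it drives can be sketched in a few lines: score every expert in a layer on calibration data, keep the top 75%, drop the rest. The scores below are deterministic toy values, not the Cerebras metric.

```python
# Illustrative sketch of REAP-style expert selection: keep the top 75% of
# experts per MoE layer by a calibration-derived saliency score.
# Toy saliency values only; the real REAP metric is defined in arXiv:2510.13999.

def prune_experts(saliency, keep_ratio=0.75):
    """Return sorted indices of the experts to keep."""
    n_keep = int(len(saliency) * keep_ratio)
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:n_keep])

# 256 experts -> 192 kept, matching the 25% REAP prune described above.
toy_scores = [((i * 37) % 256) / 256 for i in range(256)]
kept = prune_experts(toy_scores)
assert len(kept) == 192
```

The kept indices then determine which expert weights are copied into the pruned checkpoint and how the router's output dimension is rebuilt.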
## Quantization Details

- Method: post-training quantization via NVIDIA TensorRT Model Optimizer (`nvidia-modelopt`)
- Scheme: `mtq.NVFP4_DEFAULT_CFG` plus a GB10-tuned disable list applied post-calibration
- Calibration Datasets: same 6-dataset agentic mix as the base REAP'd model (see base model card)
- Calibration Samples: 64 per dataset × 6 = ~384 total (before ≥32-token length-filter drops)
- Max Sequence Length: 2048 tokens
- Preserved in BF16 (ignore list): `lm_head`, `*block_sparse_moe.gate` (MoE router gate)
- GB10 specialization: `self_attn` stays QUANTIZED (the standard NVFP4 reference keeps attention in BF16). This trades a small quality loss for a meaningful speedup on GB10 NVFP4 kernels.
- Phase 2.5 safety net: `amax` is force-populated from weight statistics for any expert that never activated during calibration (critical for MoE with top-K=8 routing over 192 experts)
- Hardware Used: Hugging Face Jobs, 4× NVIDIA H200 141 GB. (Note: some `h200x4` allocations may hit NVIDIA Fabric Manager initialization errors on cold start; retrying until a good node is allocated is a known workaround.)
- Recipe script: `quantize-nvfp4-gb10-agentic.py`, env-var-configurable; applies to other MoE architectures with minor adjustments.
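To make the ignore-list semantics concrete, here is a minimal sketch of how glob patterns decide which modules stay BF16 versus NVFP4. The module names are illustrative placeholders; the actual names depend on the MiniMaxM2ForCausalLM module tree, and modelopt applies its patterns internally rather than through a helper like this.

```python
# Sketch: resolve the GB10 ignore list against module names.
# Modules matching an ignore pattern stay BF16; everything else, including
# self_attn (which the standard NVFP4 reference would skip), is quantized.
from fnmatch import fnmatch

IGNORE_PATTERNS = ["lm_head", "*block_sparse_moe.gate"]

def dtype_for(module_name):
    """Return the precision this module would be kept in."""
    if any(fnmatch(module_name, p) for p in IGNORE_PATTERNS):
        return "bf16"
    return "nvfp4"

assert dtype_for("lm_head") == "bf16"                         # output head preserved
assert dtype_for("model.layers.0.block_sparse_moe.gate") == "bf16"   # router preserved
assert dtype_for("model.layers.0.self_attn.q_proj") == "nvfp4"       # GB10: attention quantized
```

Keeping the router gate in BF16 matters because a 4-bit router would perturb top-K expert selection itself, compounding error across all 62 layers.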
## Performance
Benchmarks pending GB10 deployment validation runs — will be added when available. The companion full-size model (saricles/MiniMax-M2.7-NVFP4-GB10) benchmarks at ~26 tok/s decode on 2× DGX Spark; this REAP variant is expected to achieve similar or slightly higher decode throughput on a single Spark due to reduced per-layer expert count and elimination of inter-node communication.
## Running on 1× DGX Spark (single-node)
At 98.9 GB this model fits comfortably in a single DGX Spark's 128 GB unified memory, leaving headroom for KV cache and activation buffers. Single-node deployment eliminates the RoCE / Ray setup required for tensor parallel across two Sparks.
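As a back-of-envelope check of that headroom claim (decimal GB throughout, and treating on-disk size as a proxy for resident weight size, which vLLM's accounting only approximates):

```python
# Rough single-Spark memory budget using the numbers from this card.
total_gb = 128.0        # DGX Spark unified memory
gpu_mem_util = 0.85     # matches the --gpu-memory-utilization flag used below
weights_gb = 98.9       # on-disk size, approximating resident weight size

budget_gb = total_gb * gpu_mem_util          # what vLLM is allowed to use
kv_and_activations_gb = budget_gb - weights_gb

assert round(budget_gb, 1) == 108.8
assert round(kv_and_activations_gb, 1) == 9.9   # ~10 GB for KV cache + activations
```

That ~10 GB of KV headroom is why `--max-num-seqs` is capped at 32 in the server command below.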
### Environment variables

```shell
# NVFP4 kernel selection (validated on GB10 SM 12.1)
VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
VLLM_USE_FLASHINFER_MOE_FP4=0
SAFETENSORS_FAST_GPU=1
OMP_NUM_THREADS=8
TORCHINDUCTOR_MAX_AUTOTUNE=0
```
### vLLM server

```shell
vllm serve /models/MiniMax-M2.7-REAP-172B-A10B-NVFP4-GB10 \
  --host 0.0.0.0 --port 30000 \
  --served-model-name minimax-m2.7-reap \
  --gpu-memory-utilization 0.85 \
  --max-model-len 196608 \
  --kv-cache-dtype fp8_e4m3 \
  --attention-backend flashinfer \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-seqs 32 \
  --max-num-batched-tokens 8192 \
  --trust-remote-code \
  --enable-auto-tool-choice --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --compilation-config '{"cudagraph_mode":"none","inductor_compile_config":{"combo_kernels":false,"benchmark_combo_kernel":false,"max_autotune":false,"max_autotune_gemm":false}}'
```
First boot loads weights and runs JIT compilation — plan for 10-15 minutes before /v1/models responds.
### Test it

```shell
curl http://<HOST>:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax-m2.7-reap",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 1.0, "top_p": 0.95, "top_k": 40, "min_p": 0.01,
    "max_tokens": 512
  }'
```
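The same request can be issued from Python with only the standard library. `HOST` is a placeholder for your Spark's address, and the final send is left commented out so the snippet stands alone without a live server.

```python
# Build the same chat-completions request as the curl example above.
import json
import urllib.request

payload = {
    "model": "minimax-m2.7-reap",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 1.0, "top_p": 0.95, "top_k": 40, "min_p": 0.01,
    "max_tokens": 512,
}
req = urllib.request.Request(
    "http://HOST:30000/v1/chat/completions",   # replace HOST with your Spark
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = json.load(urllib.request.urlopen(req))  # uncomment against a live server
# print(resp["choices"][0]["message"]["content"])
```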
## Gotchas

- `--gpu-memory-utilization 0.85` is safe on a 128 GB Spark; raise cautiously.
- `--max-num-seqs 32` is chosen to leave KV-cache headroom on a single Spark at 196 K context (half of the dual-Spark default of 64, since you have half the aggregate KV budget). Raise it if your typical contexts stay short.
- `cudagraph_mode: "none"` is load-bearing: CUDA graphs deadlock on MiniMax-M2 MoE on GB10. Leave `torch.compile` otherwise enabled.
- Do not pass `--enable-expert-parallel`; uneven per-node memory on MoE breaks under current kernels.
## Recommended Sampling Parameters

```json
{ "temperature": 1.0, "top_p": 0.95, "top_k": 40, "min_p": 0.01 }
```
## When to choose REAP-NVFP4 vs. full MiniMax-M2.7-NVFP4-GB10
- Use this REAP-NVFP4 if: you have one DGX Spark, you need agentic + coder performance, you can accept 25% fewer experts in exchange for single-node deployment simplicity.
- Use full if: you have 2× Spark with RoCE, and want maximum capability with all 256 experts preserved.
## Target Hardware
Quantized for and tested on NVIDIA DGX Spark (GB10, 128 GB unified memory, 221 GB/s bandwidth). Should work on other Blackwell-class GPUs with NVFP4 tensor-core support. On Hopper-class hardware (H100/H200) the model will load and run, but the ignore list was tuned for Blackwell and will leave some performance on the table.
## Calibration notes (2026-04-17 correction)
Both the upstream REAP pruning AND this NVFP4 calibration used the same dataset-extractor, which silently dropped texts from SWE-bench/SWE-smith-trajectories because that dataset stores messages as a JSON-encoded string (not a list-of-dicts). Our extractor treated the string as an iterable of characters, found no dict entries, and collected zero texts.
Net effect on this artifact:
- REAP scoring used 5 of 6 documented datasets (see base model card for details)
- NVFP4 calibration used 5 of 6 documented datasets (same set: evol-codealpaca, xlam-function-calling, Mixture-of-Thoughts code/math/science)
- SWE-smith-trajectories did NOT contribute to either pruning or quantization calibration
Fix: the recipe script `quantize-nvfp4-gb10-agentic.py` has been updated to `json.loads()` string-encoded messages, and to add per-dataset assertions that fail the run if any dataset yields zero texts or if any selected dataset fails to load. Future variants will include SWE-smith as originally intended.
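A simplified sketch of the fixed extractor logic (reduced from `quantize-nvfp4-gb10-agentic.py`; field names follow the common `messages`/`content` convention, and details of the real script may differ):

```python
# Accept both list-of-dicts and JSON-string-encoded message columns, and fail
# loudly if a dataset yields no text, so the silent-drop bug cannot recur.
import json

def extract_texts(rows, dataset_name):
    texts = []
    for row in rows:
        messages = row["messages"]
        if isinstance(messages, str):        # SWE-smith-style: JSON-encoded string
            messages = json.loads(messages)  # the old code iterated the characters
        for m in messages:
            if isinstance(m, dict) and m.get("content"):
                texts.append(m["content"])
    assert texts, f"{dataset_name} yielded zero texts"
    return texts

# A string-encoded row no longer silently produces zero texts:
row = {"messages": json.dumps([{"role": "user", "content": "fix the bug"}])}
assert extract_texts([row], "SWE-smith-trajectories") == ["fix the bug"]
```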
Practical implication: agentic tool-use calibration still came through via xlam-function-calling (128 activations), and code/math/science reasoning via Mixture-of-Thoughts. What's missing is the specific long-horizon SWE-agent trajectory pattern. For typical OpenClaw / Aider / Claude Code use cases (single-call agentic + code), this is likely imperceptible; for long multi-step SWE-bench-style workflows, scales may be slightly misaligned at deep positions.
## Acknowledgments
- Base model by MiniMax
- REAP methodology: Cerebras Research (Lasby et al., arXiv:2510.13999)
- Quantization tooling: NVIDIA TensorRT Model Optimizer
- GB10 quantization profile guidance (`self_attn` QUANTIZED insight): Scott Glover (scottgl)
- Multi-Spark runtime tuning: eugr/spark-vllm-docker