# MiniMax-M2.7-REAP-172B-A10B-NVFP4-GB10
NVFP4-GB10 quantization of a 25%-REAP-pruned MiniMax-M2.7, targeted at NVIDIA DGX Spark (GB10) and Blackwell-family hardware. Both the REAP pruning AND the NVFP4 calibration use a 6-dataset agentic mix for coherent preservation of tool-use, code-generation, math, and software-engineering agent capabilities. 98.9 GB on disk — fits in a single 128 GB DGX Spark.
## Model Details

| Field | Value |
|---|---|
| Base Model (BF16) | saricles/MiniMax-M2.7-REAP-172B-A10B-BF16 |
| Original Base | MiniMaxAI/MiniMax-M2.7 |
| Architecture | MiniMaxM2ForCausalLM (MoE, 192 experts, top-K=8) |
| Total Parameters | 172B (REAP-pruned from 230B) |
| Active Parameters | ~10B per token |
| Hidden Layers | 62 |
| Quantization | NVFP4 (4-bit floating point) with GB10-tuned ignore list |
| Format | compressed-tensors (safetensors) |
| Size on Disk | 98.9 GB |
| Deployment | 1× DGX Spark (fits in a single 128 GB Spark) |
| License | Other (inherited from MiniMaxAI/MiniMax-M2.7) |
## Lineage
- Base: MiniMaxAI/MiniMax-M2.7 (230B, 256 experts, FP8 native)
- Dequantize: FP8 → BF16
- REAP prune: BF16 → 75% keep (192/256 experts), agentic 6-dataset calibration → saricles/MiniMax-M2.7-REAP-172B-A10B-BF16
- NVFP4 quantize (this model): REAP BF16 + same 6-dataset agentic calibration + GB10-tuned ignore list
Both pruning AND quantization calibration use the same agentic mix, preserving coherent capability targeting end-to-end.
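REAP's saliency metric isn't reproduced here, but the selection step it drives can be sketched in a few lines: score every expert in a layer on calibration data, keep the top 75%, drop the rest. The scores below are deterministic toy values, not the Cerebras metric.

```python
# Illustrative sketch of REAP-style expert selection: keep the top 75% of
# experts per MoE layer by a calibration-derived saliency score.
# Toy saliency values only; the real REAP metric is defined in arXiv:2510.13999.

def prune_experts(saliency, keep_ratio=0.75):
    """Return sorted indices of the experts to keep."""
    n_keep = int(len(saliency) * keep_ratio)
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:n_keep])

# 256 experts -> 192 kept, matching the 25% REAP prune described above.
toy_scores = [((i * 37) % 256) / 256 for i in range(256)]
kept = prune_experts(toy_scores)
assert len(kept) == 192
```

The kept indices then determine which expert weights are copied into the pruned checkpoint and how the router's output dimension is rebuilt.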
## Quantization Details

- Method: post-training quantization via NVIDIA TensorRT Model Optimizer (`nvidia-modelopt`)
- Scheme: `mtq.NVFP4_DEFAULT_CFG` plus a GB10-tuned disable list applied post-calibration
- Calibration Datasets: same 6-dataset agentic mix as the base REAP'd model (see base model card)
- Calibration Samples: 64 per dataset × 6 = ~384 total (before ≥32-token length-filter drops)
- Max Sequence Length: 2048 tokens
- Preserved in BF16 (ignore list): `lm_head`, `*block_sparse_moe.gate` (MoE router gate)
- GB10 specialization: `self_attn` stays QUANTIZED (the standard NVFP4 reference keeps attention in BF16). This trades a small quality loss for a meaningful speedup on GB10 NVFP4 kernels.
- Phase 2.5 safety net: `amax` is force-populated from weight statistics for any expert that never activated during calibration (critical for MoE with top-K=8 routing over 192 experts)
- Hardware Used: Hugging Face Jobs, 4× NVIDIA H200 141 GB. (Note: some `h200x4` allocations may hit NVIDIA Fabric Manager initialization errors on cold start; retrying until a good node is allocated is a known workaround.)
- Recipe script: `quantize-nvfp4-gb10-agentic.py`, env-var-configurable; applies to other MoE architectures with minor adjustments.
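To make the ignore-list semantics concrete, here is a minimal sketch of how glob patterns decide which modules stay BF16 versus NVFP4. The module names are illustrative placeholders; the actual names depend on the MiniMaxM2ForCausalLM module tree, and modelopt applies its patterns internally rather than through a helper like this.

```python
# Sketch: resolve the GB10 ignore list against module names.
# Modules matching an ignore pattern stay BF16; everything else, including
# self_attn (which the standard NVFP4 reference would skip), is quantized.
from fnmatch import fnmatch

IGNORE_PATTERNS = ["lm_head", "*block_sparse_moe.gate"]

def dtype_for(module_name):
    """Return the precision this module would be kept in."""
    if any(fnmatch(module_name, p) for p in IGNORE_PATTERNS):
        return "bf16"
    return "nvfp4"

assert dtype_for("lm_head") == "bf16"                         # output head preserved
assert dtype_for("model.layers.0.block_sparse_moe.gate") == "bf16"   # router preserved
assert dtype_for("model.layers.0.self_attn.q_proj") == "nvfp4"       # GB10: attention quantized
```

Keeping the router gate in BF16 matters because a 4-bit router would perturb top-K expert selection itself, compounding error across all 62 layers.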
## Performance
Benchmarks pending GB10 deployment validation runs — will be added when available. The companion full-size model (saricles/MiniMax-M2.7-NVFP4-GB10) benchmarks at ~26 tok/s decode on 2× DGX Spark; this REAP variant is expected to achieve similar or slightly higher decode throughput on a single Spark due to reduced per-layer expert count and elimination of inter-node communication.
## Running on 1× DGX Spark (single-node)
At 98.9 GB this model fits comfortably in a single DGX Spark's 128 GB unified memory, leaving headroom for KV cache and activation buffers. Single-node deployment eliminates the RoCE / Ray setup required for tensor parallel across two Sparks.
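As a back-of-envelope check of that headroom claim (decimal GB throughout, and treating on-disk size as a proxy for resident weight size, which vLLM's accounting only approximates):

```python
# Rough single-Spark memory budget using the numbers from this card.
total_gb = 128.0        # DGX Spark unified memory
gpu_mem_util = 0.85     # matches the --gpu-memory-utilization flag used below
weights_gb = 98.9       # on-disk size, approximating resident weight size

budget_gb = total_gb * gpu_mem_util          # what vLLM is allowed to use
kv_and_activations_gb = budget_gb - weights_gb

assert round(budget_gb, 1) == 108.8
assert round(kv_and_activations_gb, 1) == 9.9   # ~10 GB for KV cache + activations
```

That ~10 GB of KV headroom is why `--max-num-seqs` is capped at 32 in the server command below.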
### Environment variables

```shell
# NVFP4 kernel selection (validated on GB10 SM 12.1)
VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
VLLM_USE_FLASHINFER_MOE_FP4=0
SAFETENSORS_FAST_GPU=1
OMP_NUM_THREADS=8
TORCHINDUCTOR_MAX_AUTOTUNE=0
```
### vLLM server

```shell
vllm serve /models/MiniMax-M2.7-REAP-172B-A10B-NVFP4-GB10 \
  --host 0.0.0.0 --port 30000 \
  --served-model-name minimax-m2.7-reap \
  --gpu-memory-utilization 0.85 \
  --max-model-len 196608 \
  --kv-cache-dtype fp8_e4m3 \
  --attention-backend flashinfer \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-seqs 32 \
  --max-num-batched-tokens 8192 \
  --trust-remote-code \
  --enable-auto-tool-choice --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --compilation-config '{"cudagraph_mode":"none","inductor_compile_config":{"combo_kernels":false,"benchmark_combo_kernel":false,"max_autotune":false,"max_autotune_gemm":false}}'
```
First boot loads weights and runs JIT compilation — plan for 10-15 minutes before /v1/models responds.
### Test it

```shell
curl http://<HOST>:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax-m2.7-reap",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 1.0, "top_p": 0.95, "top_k": 40, "min_p": 0.01,
    "max_tokens": 512
  }'
```
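The same request can be issued from Python with only the standard library. `HOST` is a placeholder for your Spark's address, and the final send is left commented out so the snippet stands alone without a live server.

```python
# Build the same chat-completions request as the curl example above.
import json
import urllib.request

payload = {
    "model": "minimax-m2.7-reap",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 1.0, "top_p": 0.95, "top_k": 40, "min_p": 0.01,
    "max_tokens": 512,
}
req = urllib.request.Request(
    "http://HOST:30000/v1/chat/completions",   # replace HOST with your Spark
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = json.load(urllib.request.urlopen(req))  # uncomment against a live server
# print(resp["choices"][0]["message"]["content"])
```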
## Gotchas

- `--gpu-memory-utilization 0.85` is safe on a 128 GB Spark; raise cautiously.
- `--max-num-seqs 32` is chosen to leave KV-cache headroom on a single Spark at 196 K context (half of the dual-Spark default of 64, since you have half the aggregate KV budget). Raise it if your typical contexts stay short.
- `cudagraph_mode: "none"` is load-bearing: CUDA graphs deadlock on MiniMax-M2 MoE on GB10. Leave `torch.compile` otherwise enabled.
- Do not pass `--enable-expert-parallel`; uneven per-node memory on MoE breaks under current kernels.
## Recommended Sampling Parameters

```json
{ "temperature": 1.0, "top_p": 0.95, "top_k": 40, "min_p": 0.01 }
```
## When to choose REAP-NVFP4 vs. full MiniMax-M2.7-NVFP4-GB10
- Use this REAP-NVFP4 if: you have one DGX Spark, you need agentic + coder performance, you can accept 25% fewer experts in exchange for single-node deployment simplicity.
- Use full if: you have 2× Spark with RoCE, and want maximum capability with all 256 experts preserved.
## Target Hardware
Quantized for and tested on NVIDIA DGX Spark (GB10, 128 GB unified memory, 221 GB/s bandwidth). Should work on other Blackwell-class GPUs with NVFP4 tensor-core support. On Hopper-class hardware (H100/H200) the model will load and run, but the ignore list was tuned for Blackwell and will leave some performance on the table.
## Calibration notes (2026-04-17 correction)
Both the upstream REAP pruning AND this NVFP4 calibration used the same dataset-extractor, which silently dropped texts from SWE-bench/SWE-smith-trajectories because that dataset stores messages as a JSON-encoded string (not a list-of-dicts). Our extractor treated the string as an iterable of characters, found no dict entries, and collected zero texts.
Net effect on this artifact:
- REAP scoring used 5 of 6 documented datasets (see base model card for details)
- NVFP4 calibration used 5 of 6 documented datasets (same set: evol-codealpaca, xlam-function-calling, Mixture-of-Thoughts code/math/science)
- SWE-smith-trajectories did NOT contribute to either pruning or quantization calibration
Fix: the recipe script `quantize-nvfp4-gb10-agentic.py` has been updated to `json.loads()` string-encoded messages, and to add per-dataset assertions that fail the run if any dataset yields zero texts or if any selected dataset fails to load. Future variants will include SWE-smith as originally intended.
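A simplified sketch of the fixed extractor logic (reduced from `quantize-nvfp4-gb10-agentic.py`; field names follow the common `messages`/`content` convention, and details of the real script may differ):

```python
# Accept both list-of-dicts and JSON-string-encoded message columns, and fail
# loudly if a dataset yields no text, so the silent-drop bug cannot recur.
import json

def extract_texts(rows, dataset_name):
    texts = []
    for row in rows:
        messages = row["messages"]
        if isinstance(messages, str):        # SWE-smith-style: JSON-encoded string
            messages = json.loads(messages)  # the old code iterated the characters
        for m in messages:
            if isinstance(m, dict) and m.get("content"):
                texts.append(m["content"])
    assert texts, f"{dataset_name} yielded zero texts"
    return texts

# A string-encoded row no longer silently produces zero texts:
row = {"messages": json.dumps([{"role": "user", "content": "fix the bug"}])}
assert extract_texts([row], "SWE-smith-trajectories") == ["fix the bug"]
```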
Practical implication: agentic tool-use calibration still came through via xlam-function-calling (128 activations), and code/math/science reasoning via Mixture-of-Thoughts. What's missing is the specific long-horizon SWE-agent trajectory pattern. For typical OpenClaw / Aider / Claude Code use cases (single-call agentic + code), this is likely imperceptible; for long multi-step SWE-bench-style workflows, scales may be slightly misaligned at deep positions.
## Acknowledgments
- Base model by MiniMax
- REAP methodology: Cerebras Research (Lasby et al., arXiv:2510.13999)
- Quantization tooling: NVIDIA TensorRT Model Optimizer
- GB10 quantization profile guidance (`self_attn` QUANTIZED insight): Scott Glover (scottgl)
- Multi-Spark runtime tuning: eugr/spark-vllm-docker