Step-3.7-Flash MTP draft (for the NVFP4 checkpoint)
A tiny Multi-Token-Prediction (MTP / nextn) draft for stepfun-ai/Step-3.7-Flash-NVFP4, so you can run
speculative decoding on the NVFP4 checkpoint in vLLM.
Why this exists: the official `Step-3.7-Flash-NVFP4` checkpoint declares `num_nextn_predict_layers: 3` in its config but ships zero MTP weights: the 3 nextn layers were dropped during quantization, and the per-layer config arrays were truncated to 45 entries (so even loading them would raise an `IndexError`). The BF16 and FP8 releases keep the MTP weights, but the NVFP4 one (the SM120-friendly, smallest one) cannot do speculative decoding out of the box. This repo is the missing piece: the 3 MTP layers extracted from the BF16 release, kept in BF16 (they're tiny), packaged as a vLLM-loadable draft.
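The config mismatch described above can be checked mechanically. The following is a minimal sketch, not part of this repo's scripts: it assumes the config exposes the main-layer count under `num_hidden_layers` (a common Hugging Face key, assumed here rather than confirmed for this model), and the helper name `find_short_layer_arrays` is hypothetical.

```python
def find_short_layer_arrays(config, per_layer_keys=("layer_types", "partial_rotary_factors")):
    """Return (needed_length, {key: actual_length}) for per-layer arrays that
    are too short to cover the main layers plus the declared nextn layers."""
    n_main = config["num_hidden_layers"]  # assumed key name
    n_mtp = config.get("num_nextn_predict_layers", 0)
    needed = n_main + n_mtp
    short = {}
    for key in per_layer_keys:
        values = config.get(key)
        if isinstance(values, list) and len(values) < needed:
            short[key] = len(values)
    return needed, short


# Toy config mimicking the situation described: 3 nextn layers declared,
# per-layer arrays cut to 45 entries. Values are placeholders.
toy_config = {
    "num_hidden_layers": 45,
    "num_nextn_predict_layers": 3,
    "layer_types": ["x"] * 45,
    "partial_rotary_factors": [1.0] * 45,
}
print(find_short_layer_arrays(toy_config))
```

To run it against a real checkpoint, load its `config.json` with `json.load` and pass the resulting dict.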
- ~5.9 GB, BF16. Base = NVFP4 (mixed precision is fine; the draft is small).
- Lossless in the speculative sense: vLLM's rejection sampling provably matches the target distribution; at `temperature=0` it follows the target's greedy tokens.
- Drop-in: point vLLM's `--speculative-config` at this directory.
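The "lossless" property comes from the accept/reject rule used in speculative sampling. Below is a minimal sketch of the textbook rule for a single drafted token. It is not vLLM's implementation, and the function name `speculative_step` and its arguments are illustrative assumptions.

```python
import random


def speculative_step(p_target, q_draft, draft_token, rng=random):
    """Accept or replace one drafted token.

    p_target, q_draft: probability lists over the same vocabulary.
    Returns (token, accepted). The returned token is distributed
    according to p_target when draft_token was sampled from q_draft.
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]
    # Accept with probability min(1, p/q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True
    # Rejected: resample from the residual max(0, p - q), renormalized.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        # Only reachable if the draft token had q == 0; fall back to p.
        residual = list(p_target)
    token = rng.choices(range(len(p_target)), weights=residual)[0]
    return token, False
```

With one-hot distributions (the `temperature=0` case) the rule reduces to: keep the draft token if it equals the target's greedy token, otherwise emit the target's greedy token.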
Usage (vLLM: the stepfun37 image, or any vLLM build that includes Step3p5MTP)
The draft is auto-routed to vLLM's native Step3p5MTP + Step3p5MTPProposer
because its config is model_type: step3p7 with num_nextn_predict_layers > 0.
docker run -d --gpus all --ipc=host --shm-size=64g --network host \
-v /path/to/Step-3.7-Flash-NVFP4:/model:ro \
-v /path/to/Step-3.7-Flash-MTP-draft:/draft:ro \
vllm/vllm-openai:stepfun37 \
/model \
--served-model-name step3p7 --port 8000 \
--trust-remote-code --tensor-parallel-size 2 --enable-expert-parallel \
--quantization modelopt --kv-cache-dtype fp8 \
--max-model-len 262144 --gpu-memory-utilization 0.92 --async-scheduling \
--speculative-config '{"method":"mtp","model":"/draft","num_speculative_tokens":1}'
JSON for --speculative-config must have no spaces (brace-expansion safety).
num_speculative_tokens: 1 (K=1) is the sweet spot — see below.
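If you generate the launch command from a script, the space-free JSON can be produced rather than typed by hand. A small sketch using Python's standard `json` module; the helper name `compact_speculative_config` is hypothetical, and the keys are the ones shown in the command above.

```python
import json


def compact_speculative_config(draft_path, num_speculative_tokens=1):
    """Serialize the speculative config without any whitespace."""
    cfg = {
        "method": "mtp",
        "model": draft_path,
        "num_speculative_tokens": num_speculative_tokens,
    }
    # separators=(",", ":") drops the spaces json.dumps inserts by default.
    return json.dumps(cfg, separators=(",", ":"))


print(compact_speculative_config("/draft"))
```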
Benchmarks (2× RTX PRO 6000 Blackwell, SM120, TP=2)
Measured on the NVFP4 base + this draft, K=1, vs. NVFP4 with speculation off.
per_req = decode tok/s a single user feels (prefill excluded). Acceptance ≈ 0.80 in production traffic.
Single-stream decode (short context):
| workload | base (tok/s) | + MTP K=1 (tok/s) | speedup | accept |
|---|---|---|---|---|
| free-form | 106.8 | 125.5 | +17.5% | 0.77 |
| code | 106.7 | 133.7 | +25.3% | 0.88 |
| Japanese | 107.0 | 129.3 | +20.9% | 0.80 |
| tool-call | 106.9 | 135.4 | +26.6% | 0.90 |
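The speedup column can be recomputed from the two throughput columns. The sketch below does that, and also derives a rough per-step cost under a simple assumed model that is not from the benchmark harness: with K=1, each verify step yields `1 + accept` tokens on average, so the cost of an MTP step relative to a plain decode step is `(1 + accept) / (mtp / base)`. Small differences from the table are expected because the inputs here are already rounded.

```python
# (workload, base tok/s, MTP K=1 tok/s, acceptance) copied from the table above.
ROWS = [
    ("free-form", 106.8, 125.5, 0.77),
    ("code", 106.7, 133.7, 0.88),
    ("Japanese", 107.0, 129.3, 0.80),
    ("tool-call", 106.9, 135.4, 0.90),
]


def speedup_pct(base, mtp):
    """Relative throughput gain in percent."""
    return (mtp / base - 1.0) * 100.0


def implied_step_cost(base, mtp, accept):
    """Cost of one draft+verify step relative to one plain decode step,
    under the assumed model tokens_per_step = 1 + accept (K=1)."""
    return (1.0 + accept) / (mtp / base)


for name, base, mtp, accept in ROWS:
    print(f"{name:10s} speedup={speedup_pct(base, mtp):5.1f}%  "
          f"step cost x{implied_step_cost(base, mtp, accept):.2f}")
```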
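Decode speedup by context length and concurrency (C = concurrency). At C=1 and C=2 it grows with context length (longer KV → base is more memory-bound → bigger speculative win); at C=4 and C=8 the trend is less regular: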
| context | C=1 | C=2 | C=4 | C=8 |
|---|---|---|---|---|
| 1K | +20% | +8% | +32% | +34% |
| 8K | +22% | +24% | +25% | +44% |
| 32K | +22% | +26% | +20% | +17% |
| 128K | +28% | +33% | +38% | — |
Net-positive across the whole concurrency range we tested (MoE stays memory-bound
to high batch). Best K: K=1 (K=2/K=3 lose to draft cost — later positions
have lower acceptance and add forward cost). NaN-free on SM120 (Gate0 5/5).
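Why a larger K can lose is easier to see with a toy model. The sketch below assumes each verify step emits one guaranteed token plus every drafted token up to the first rejection, and that each extra draft position adds a fixed fraction of a target forward pass. The acceptance rates and the cost fraction are hypothetical values chosen only to show the shape of the trade-off; they are not measurements from this repo.

```python
def expected_tokens_per_step(accept_by_position):
    """1 guaranteed token plus each draft position that survives
    all earlier accept checks."""
    expected = 1.0
    survive = 1.0
    for a in accept_by_position:
        survive *= a
        expected += survive
    return expected


def relative_throughput(accept_by_position, draft_cost):
    """Throughput relative to no speculation, with draft_cost expressed
    as a fraction of one target forward pass per drafted position."""
    k = len(accept_by_position)
    return expected_tokens_per_step(accept_by_position) / (1.0 + k * draft_cost)


# Hypothetical per-position acceptance (decaying) and per-position cost.
ACCEPT = [0.80, 0.55, 0.35]
DRAFT_COST = 0.5

for k in (1, 2, 3):
    print(f"K={k}: x{relative_throughput(ACCEPT[:k], DRAFT_COST):.2f}")
```

Plugging in measured per-position acceptance and a measured step cost would tell you whether K=2 is worth trying on a given workload.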
How it was built (reproducible)
The draft is not retrained — it's the original StepFun MTP layers, extracted verbatim:
- From `stepfun-ai/Step-3.7-Flash` (BF16), take 52 tensors: everything under `model.layers.{45,46,47}.*` (the 3 nextn layers, dense-MLP, 17 tensors each) plus `model.embed_tokens.weight`. They all live in one shard (`model-00024.safetensors`).
- Keep the original BF16 weight names: vLLM's `Step3p5MTP` loader does its own renaming (`.transformer.` strip, `shared_head.output`→`head`, `.mtp_block.` insert).
- `config.json` = the BF16 original config (NOT the NVFP4 one): its per-layer arrays (`layer_types`, `partial_rotary_factors`, …) are length 48 and cover the MTP layer indices 45-47. Strip `quantization_config` so the draft loads as BF16.
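The extraction steps above can be sketched as follows. This is an illustration under assumptions, not the repo's `build_draft.py`: the function names are hypothetical, the I/O part assumes the `safetensors` and `torch` packages are installed, and the output path is whatever you choose.

```python
import copy

# Tensor names to keep, per the steps above. The trailing dot keeps
# "model.layers.45." from matching any longer layer index.
MTP_LAYER_PREFIXES = tuple(f"model.layers.{i}." for i in (45, 46, 47))
EMBED_NAME = "model.embed_tokens.weight"


def is_draft_tensor(name):
    """True for the nextn-layer tensors and the embedding table."""
    return name == EMBED_NAME or name.startswith(MTP_LAYER_PREFIXES)


def make_draft_config(bf16_config):
    """Copy the BF16 config and drop quantization_config if present."""
    cfg = copy.deepcopy(bf16_config)
    cfg.pop("quantization_config", None)
    return cfg


def extract_draft(shard_path, out_path):
    """Copy the selected tensors from one shard, names unchanged."""
    # Imported lazily so the pure helpers above work without these packages.
    from safetensors import safe_open
    from safetensors.torch import save_file

    tensors = {}
    with safe_open(shard_path, framework="pt") as f:
        for name in f.keys():
            if is_draft_tensor(name):
                tensors[name] = f.get_tensor(name)
    save_file(tensors, out_path, metadata={"format": "pt"})
    return sorted(tensors)
```

Calling `extract_draft` on the shard named above should return 52 names if the counts in the first step hold, which makes a cheap sanity check before writing `config.json`.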
Full scripts + benchmark harness: GitHub repo (build_draft.py,
launch_mtp.sh, eval_mtp.py, bench_matrix.py).
License & attribution
Apache-2.0, inherited from the base model stepfun-ai/Step-3.7-Flash. These are
StepFun's weights, redistributed unchanged (only re-sharded/re-packaged as a draft).
All credit for the model and the MTP layers goes to StepFun.