Qwen3.6-35B-A3B-heretic — NVFP4 (v2 multimodal-preserved)
🚀 PRODUCTION DEPLOYMENT GUIDE: github.com/AEON-7/Qwen3.6-NVFP4-DFlash
The GitHub repo is the definitive turn-key setup for DGX Spark — pre-built Docker image, end-to-end deployment guide, validated OpenClaw config, the 8 vLLM patches that actually make this work on SM121, and a concurrency-sweep benchmark harness.
- Image: `ghcr.io/aeon-7/vllm-spark-omni-q36:v1.2` (vLLM HEAD source-built for cu130/sm_120 + 5 source patches + flashinfer 0.6.8 + Marlin GEMM enforcement)
- Pairs with the `z-lab/Qwen3.6-35B-A3B-DFlash` drafter (must be the post-2026-04-19 revision)
- Production-stable under sustained chat load — measured 116.8 tok/s single-stream / 785.3 tok/s aggregate at 128 concurrency on DGX Spark
What changed in v2 (2026-04-19)
v1 of this checkpoint had `model.language_model.layers.X.*` keys remapped to
`model.layers.X.*` so vLLM's text-only `Qwen3_5MoeForCausalLM` loader would pick them up.
That layout was unstable in production — intermittent NaN/crash in the prefix-strip
codepath during real chat sessions.
v2 re-quantizes the same source (tvall43/Qwen3.6-35B-A3B-heretic) with
`AutoModelForImageTextToText`, preserving the canonical multimodal layout:
- Architecture: `Qwen3_5MoeForConditionalGeneration` (vLLM's canonical class — no registry hack required)
- Keys: `model.language_model.layers.X.*` retained natively (no post-quantization key rewriting)
- 27-block ViT vision encoder preserved in BF16
- 30 linear-attention (Mamba/GDN) layers preserved in BF16
- All 122,880 per-expert NVFP4 keys (40 layers × 256 experts × 3 projections × 4 quant components)
vLLM serves it via the canonical multimodal class with no prefix-strip code path in the inference hot loop. Result: rock-solid stability where v1 was crashing on virtually every interaction.
⚠️ If you cloned v1 of this repo, delete and re-pull. Same URL — v2 commits replaced v1.
NVFP4-quantized version of tvall43/Qwen3.6-35B-A3B-heretic — an abliterated (decensored, 5/100 refusal rate) Qwen 3.6 35B-A3B Mixture-of-Experts multimodal model with thinking/reasoning capabilities.
Quantized using llmcompressor with the compressed-tensors nvfp4-pack-quantized format. Calibrated with 256 samples from open-platypus over 40 sequential decoder-layer stages. Vision encoder, linear-attention (Mamba/GDN) layers, MoE routers, gates, norms, and lm_head/embed_tokens preserved in BF16.
Designed for deployment on NVIDIA DGX Spark (GB10, Blackwell SM 12.0+) with native FP4 tensor-core support. Pairs with z-lab/Qwen3.6-35B-A3B-DFlash for spec-decode acceptance of 2.7-4.4 mean accepted tokens per target step on greedy workloads.
Performance Benchmarks
Test Setup
Hardware: NVIDIA DGX Spark (GB10, SM 12.1, 128 GB unified memory)
Software: vLLM HEAD source-built (image ghcr.io/aeon-7/vllm-spark-omni-q36:v1.2), flashinfer 0.6.8, Marlin GEMM (CUTLASS NVFP4 unstable on SM121), DFlash speculative decoding with num_speculative_tokens=15, BF16 KV cache, --gpu-memory-utilization 0.85
Bench config (production): --max-num-seqs 128, --max-model-len 262144, --max-num-batched-tokens 65536. Single config — unlike other models with separate "single-stream" vs "throughput" configs, Qwen3.6 ships one production config that handles both well.
Methodology: All tests run enable_thinking=false for clean decode-rate measurement (production with thinking on adds reasoning-token overhead but does not change throughput). Greedy sampling (T=0) unless explicitly noted stochastic. SSE streaming. Median across N runs. Mixed-domain prompt set (code, math, QA, reasoning). Zero errors across 1,200+ requests in the full test.
⚠️ DFlash speedup is workload-dependent. Per-prompt decode rate ranges from 41 to 127 tok/s in the single-stream test, depending on how predictable the drafter finds the target's output. Greedy reasoning workloads (math, code) hit the upper end (78%+ acceptance). Creative / sampled workloads are more variable.
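The summary statistics in the tables below (median, p95, min, max) can be reproduced from raw per-run decode rates in a few lines. This is a minimal aggregation sketch, not the repo's bench harness; the sample run values are hypothetical:

```python
# Minimal sketch of the aggregation step behind the benchmark tables
# (median/p50, p95, min, max over per-run decode rates). Not the repo's
# bench harness; the sample run values below are hypothetical.

def percentile(values, q):
    """Percentile by rounding the fractional index into a sorted copy."""
    xs = sorted(values)
    idx = min(len(xs) - 1, max(0, round(q / 100 * (len(xs) - 1))))
    return xs[idx]

def summarize(decode_rates_tok_s):
    return {
        "median": percentile(decode_rates_tok_s, 50),
        "p95": percentile(decode_rates_tok_s, 95),
        "min": min(decode_rates_tok_s),
        "max": max(decode_rates_tok_s),
    }

# Hypothetical 10-trial sample shaped like the single-stream sweep:
runs = [41.1, 55.2, 62.0, 71.3, 83.9, 84.0, 96.5, 110.2, 121.7, 127.5]
stats = summarize(runs)
print(stats)
```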
1. Single-Stream Performance
Best for interactive chat and agentic UX. All measurements greedy (T=0) unless noted.
Decode rate (10 trials, 200-token outputs)
| Statistic | tok/s |
|---|---|
| Median | 83.9 |
| p95 | 127.5 |
| Min | 41.1 |
| Max | 127.5 |
Variance reflects DFlash acceptance differences across prompt classes — math/code prompts hit ~125 tok/s with high drafter agreement, more open-ended prompts settle around 60-90 tok/s.
TTFT by prompt length (5 trials per class)
| Prompt class | Approx. input tokens | TTFT p50 | TTFT p95 | TTFT min | Effective prefill (tok/s) |
|---|---|---|---|---|---|
| Tiny | 2 | 99 ms | 102 ms | 98 ms | 20 tok/s |
| Short | 7 | 114 ms | 115 ms | 110 ms | 62 tok/s |
| Medium | 50 | 123 ms | 128 ms | 121 ms | 407 tok/s |
| Long | 465 | 259 ms | 314 ms | 257 ms | 1,797 tok/s |
Sub-130ms TTFT for any prompt under ~50 tokens — fixed kernel-launch overhead dominates short prefill.
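The "Effective prefill" column is simply input tokens divided by TTFT. A quick sanity check against the table above (small rounding drift expected):

```python
# Sanity-check the "Effective prefill" column: input_tokens / TTFT_p50.
# Rows are taken from the TTFT table above; small rounding drift expected.
rows = [
    # (prompt class, input tokens, TTFT p50 in seconds, table's tok/s)
    ("Tiny",   2,   0.099, 20),
    ("Short",  7,   0.114, 62),
    ("Medium", 50,  0.123, 407),
    ("Long",   465, 0.259, 1797),
]
for name, toks, ttft, table_rate in rows:
    rate = toks / ttft
    assert abs(rate - table_rate) / table_rate < 0.02  # within ~2%
    print(f"{name}: {rate:.0f} tok/s effective prefill")
```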
Decode rate by output length (3 trials per length)
| Max tokens | Actual tokens (median) | TTFT | Decode rate | Total latency |
|---|---|---|---|---|
| 50 | 50 | 113 ms | 70.1 tok/s | 0.82 s |
| 200 | 200 | 112 ms | 88.4 tok/s | 2.37 s |
| 500 | 331* | 116 ms | 115.6 tok/s | 4.44 s |
| 1000 | 330* | 113 ms | 118.3 tok/s | 6.28 s |
* model emitted EOS naturally before hitting max_tokens.
Decode rate increases with output length — DFlash steady-state amortization improves over the first 100-200 tokens once the drafter and target lock into a stable acceptance pattern.
Sampling: greedy vs stochastic (5 trials per mode)
| Mode | Decode p50 | Decode p95 | TTFT p50 |
|---|---|---|---|
| Greedy (T=0) | 76.5 tok/s | 123.0 tok/s | 115 ms |
| Stochastic (T=0.7) | 64.8 tok/s | 125.4 tok/s | 113 ms |
A 15% decode-rate degradation from T=0 to T=0.7 — less dramatic than is typical for spec-decode systems, since DFlash's drafter remains useful even at moderate sampling temperatures. Use T=0 for maximum DFlash speedup; T=0.7 for diversity.
Long-prompt prefill (RAG / document workloads)
| Input tokens | TTFT (≈ prefill) | Prefill rate | Decode rate after prefill |
|---|---|---|---|
| 1K | 519 ms | 1,973 tok/s | 48.8 tok/s |
| 4K | 2,594 ms | 1,579 tok/s | 41.1 tok/s |
| 16K | 8,007 ms | 2,046 tok/s | 34.6 tok/s |
| 32K | 19,368 ms | 1,692 tok/s | 23.0 tok/s |
Prefill rate plateaus around 2K tok/s due to (a) the drafter prefilling the same context in parallel and (b) Qwen3.6's 30 linear-attention (Mamba/GDN) layers having higher prefill constant factor than parallel softmax attention. Decode-after-prefill drops gracefully (~50% from 1K → 32K context).
Single-stream summary
| Metric | Value |
|---|---|
| Single-stream decode (200-tok output) | 83.9 tok/s median |
| Decode @ 500-1000 tok output (DFlash steady state) | 115-118 tok/s |
| Short-prompt TTFT | 99-128 ms |
| 16K-prompt TTFT | 8.0 s |
| 32K-prompt TTFT | 19.4 s |
| Peak prefill throughput | ~2,046 tok/s @ 16K prompt |
| Decode rate with 32K context | 23.0 tok/s (53% drop vs short context) |
2. Concurrent-Session Performance
Best for agent fleets and multi-user serving. 3 trials per level, median run reported (sorted by aggregate throughput). Mixed prompts, 200-token output, T=0.7 (stochastic — production-realistic), SSE streaming.
Throughput scaling (N concurrent clients, 200-tok output)
| Concurrent | Errors | Agg tok/s (median of 3) | Per-req decode p50 | Per-req decode min | TTFT p50 | TTFT p95 |
|---|---|---|---|---|---|---|
| 1 | 0 | 102.9 | 109.1 | 109.1 | 111 ms | 111 ms |
| 2 | 0 | 131.3 | 94.0 | 68.9 | 144 ms | 144 ms |
| 4 | 0 | 128.1 | 48.5 | 38.9 | 191 ms | 191 ms |
| 8 | 0 | 163.3 | 29.2 | 14.2 | 355 ms | 356 ms |
| 16 | 0 | 227.6 | 19.3 | 8.6 | 501 ms | 503 ms |
| 32 | 0 | 275.5 | 11.6 | 5.2 | 701 ms | 703 ms |
| 64 | 0 | 310.8 | 6.9 | 3.3 | 1.07 s | 11.2 s |
| 128 | 0 | 313.6 | 6.5 | 3.0 | 14.1 s | 46.7 s |
Zero errors at the 128-concurrency level alone (3 runs × 128 = 384 requests), and zero across the full sweep including all lower levels (1,200+ requests total).
Aggregate throughput plateaus at ~313 tok/s from 64 concurrent onward — the GPU's compute wall for this 35B-total / ~3B-active MoE with linear-attention KV reads plus DFlash drafter overhead. TTFT spikes severely at 128 concurrent (14 s p50, 47 s p95): all 128 sequences fit in the scheduler, but compute is fully saturated, so each stream advances at roughly 1/128 of the aggregate rate. For latency-sensitive UX, target 16-32 concurrent; for maximum throughput, use the full 128.
TTFT-only scaling (1-token output, prefill + first-token)
Measures pure scheduler queue contention — critical for agent UX:
| Concurrent | TTFT p50 | TTFT p95 | TTFT min | TTFT max |
|---|---|---|---|---|
| 1 | 74 ms | 75 ms | 72 ms | 75 ms |
| 4 | 99 ms | 100 ms | 97 ms | 100 ms |
| 16 | 249 ms | 263 ms | 238 ms | 263 ms |
| 64 | 560 ms | 698 ms | 451 ms | 707 ms |
TTFT stays sub-700ms through 64 concurrent — smooth UX for small agent fleets. Beyond 64, TTFT accumulates queue-wait time as compute is fully consumed.
Concurrent with 1K-token prompts (RAG-style workload)
50-token output with 1,024-token prompts — simulates agents doing document QA or retrieval-augmented responses. Median of 2 runs.
| Concurrent | Errors | Agg tok/s | TTFT p50 | TTFT p95 | Decode p50 |
|---|---|---|---|---|---|
| 1 | 0 | 23.1 | 494 ms | 494 ms | 44.1 |
| 4 | 0 | 39.5 | 1,673 ms | 1,720 ms | 24.6 |
| 16 | 0 | 47.1 | 6,179 ms | 6,180 ms | 10.6 |
| 64 | 0 | 49.8 | 19,297 ms | 33,352 ms | 2.5 |
RAG throughput peaks around 50 tok/s at 16-64 concurrent. The aggregate is lower than the short-prompt sweep because each request spends most of its wall-clock in prefill (1K tokens) rather than decode. Use prefix caching if your RAG workload has repeated context blocks — the production compose enables --enable-prefix-caching which can give 5-10× speedup on shared-prefix RAG.
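Prefix caching only helps when requests share a byte-identical leading token sequence, so RAG requests should put the shared document first and the per-request question last. A sketch, assuming the `qwen36-fast` alias from the compose file; `SHARED_DOC` is a placeholder:

```python
# Sketch: structuring RAG requests so vLLM's --enable-prefix-caching can
# reuse the prefill KV for the shared document. The cache matches on the
# leading token sequence, so the shared context must come first and be
# byte-identical across requests; only the question varies at the end.
SHARED_DOC = "<your retrieved document block goes here>"  # placeholder

def make_request(question: str) -> dict:
    return {
        "model": "qwen36-fast",
        "messages": [
            # identical across requests -> cacheable prefix
            {"role": "system", "content": f"Answer using this document:\n{SHARED_DOC}"},
            # varies per request -> only this suffix is re-prefilled
            {"role": "user", "content": question},
        ],
        "max_tokens": 50,
        "temperature": 0,
    }

reqs = [make_request(q) for q in ("What is the title?", "Who is the author?")]
# Both requests share an identical leading message -> prefix-cache hit
assert reqs[0]["messages"][0] == reqs[1]["messages"][0]
```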
Concurrent-session summary
| Metric | Value |
|---|---|
| Peak aggregate throughput | 313.6 tok/s @ 128 concurrent (median of 3 trials) |
| Scaling from 1 → 128 | 3.05× throughput (compute-bound — DFlash + 35B MoE saturates GB10 around 64 streams) |
| Per-request decode @ 128 | 6.5 tok/s p50, 3.0 min |
| TTFT @ 64 concurrent | 1.07 s p50 (acceptable for agent fleets) |
| TTFT @ 128 concurrent | 14.1 s p50 (queue-bound — useful for batch only) |
| Error rate across full bench | 0.0% (1,200+ requests, conc 1 → 128) |
| Best concurrency for chat UX | 4-16 (per-req 19-48 tok/s, TTFT < 500 ms) |
| Best concurrency for max throughput | 64-128 (saturated compute, TTFT trade-off) |
Key Performance Metrics Summary
| Metric | Value |
|---|---|
| Single-stream decode (200-tok output) | 83.9 tok/s median |
| Single-stream decode @ DFlash steady state | 118 tok/s (1000-tok output) |
| Short-prompt TTFT | 99-128 ms |
| Peak aggregate throughput | 313.6 tok/s @ 128 concurrent |
| TTFT @ 16 concurrent (smooth UX) | 501 ms p50 |
| TTFT @ 64 concurrent (still usable) | 1.07 s p50 |
| Greedy vs stochastic decode penalty | 15% (76.5 → 64.8 tok/s) |
| DFlash position-0 acceptance (greedy workloads) | 62-78% |
| Mean accepted tokens per target step | 2.7-4.4 |
| Long-context decode @ 32K prompt | 23.0 tok/s |
| Total bench wall-clock | 11 minutes (1,200+ requests, 0 errors) |
Scaling efficiency (200-tok concurrent test)
| Concurrency | Throughput gain vs 1-req |
|---|---|
| 1 | 1.0× |
| 4 | 1.2× |
| 16 | 2.2× |
| 64 | 3.0× |
| 128 | 3.05× |
Scaling is GPU-compute-bound rather than memory-bound — DFlash on a 35B MoE with hybrid linear+full attention saturates the GB10's compute around 64 concurrent. Per-request throughput degrades from 109 tok/s (1-req) to 6.5 tok/s (128-req). For comparison, a non-spec-decode setup would scale much more linearly but lose the ~2-4× single-stream speedup DFlash provides.
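The gains above follow directly from the aggregate-throughput sweep: each level's aggregate tok/s divided by the 1-client baseline of 102.9 tok/s:

```python
# Reproduce the scaling-efficiency column from the concurrent sweep:
# gain = aggregate tok/s at concurrency N / aggregate tok/s at 1 client.
agg_tok_s = {1: 102.9, 4: 128.1, 16: 227.6, 64: 310.8, 128: 313.6}
baseline = agg_tok_s[1]
gains = {n: round(t / baseline, 2) for n, t in agg_tok_s.items()}
print(gains)
```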
Test methodology notes
- `enable_thinking=false` — the bench disables Qwen3.6's thinking tag for clean decode-rate measurement. Production with thinking on adds reasoning-token overhead before content emission (use `max_tokens` ≥ 2048 for thinking-enabled requests).
- DFlash speedup is workload-dependent — math, code, agentic, and reasoning workloads at T=0 hit the highest acceptance rates. Creative writing or open-ended chat sees lower acceptance.
- Mixed-prompt set in concurrent tests: code, math, QA, creative writing, single-line answers — to avoid biasing toward DFlash-friendly prompts.
- 3 trials per concurrency level for the throughput sweep, median run (by aggregate tok/s) reported. RAG section uses 2 trials.
- 200-token output as the standard test length (except TTFT-only test which uses 1 token, RAG which uses 50, and decode-by-output which sweeps 50→1000).
- Error tracking: 0/1,200+ requests failed across the full test (all sections combined).
- Reproducible: bench script at `scripts/bench_full.py`; raw JSON results at `bench/qwen36_v2_2026-04-20.json`.
⚠️ IMPORTANT REQUIREMENTS
| # | Requirement | Why |
|---|---|---|
| 1 | Native Blackwell GPU (SM 10.0+ — B200, GB10, RTX PRO 6000 Blackwell, RTX 5090) | NVFP4 needs hardware FP4 tensor cores |
| 2 | vLLM with sm_120 NVFP4 kernels — use `ghcr.io/aeon-7/vllm-spark-omni-q36:v1.2` (or build from Qwen3.6-NVFP4-DFlash) | Stock vLLM wheels don't compile FP4 kernels for SM 12.x; the SM121 SMEM workarounds aren't upstream yet |
| 3 | `--quantization compressed-tensors` (NOT modelopt) | This checkpoint uses llmcompressor's compressed-tensors NVFP4 format |
| 4 | `--trust-remote-code` | Qwen3.6 ships custom modeling code |
| 5 | `--attention-backend flash_attn` (when using DFlash) | DFlash spec decode requires the flash_attn backend |
| 6 | `VLLM_TEST_FORCE_FP8_MARLIN=1` env (mandatory on DGX Spark) | CUTLASS NVFP4 path is broken on SM121 (101 KB SMEM vs 228 KB on SM100) |
| 7 | DFlash drafter from the post-2026-04-19 revision | Earlier z-lab drafter had a long-context crash bug |
| 8 | Latest transformers (≥5.5.4) | `qwen3_5_moe` model_type registration |
Quick Start (DGX Spark, with DFlash spec decode)
```shell
# 1. Pull the image (anonymous public GHCR pull — anyone can run this)
docker pull ghcr.io/aeon-7/vllm-spark-omni-q36:v1.2

# 2. Pull both models
sudo mkdir -p /opt/qwen36 && sudo chown $USER:$USER /opt/qwen36
cd /opt/qwen36
export HF_HUB_ENABLE_HF_TRANSFER=1
hf download AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4 --local-dir ./qwen36-nvfp4 &
hf download z-lab/Qwen3.6-35B-A3B-DFlash --local-dir ./qwen36-dflash &
wait

# 3. Get the production compose file
curl -fsSL \
  https://raw.githubusercontent.com/AEON-7/Qwen3.6-NVFP4-DFlash/main/examples/docker-compose.yml \
  -o docker-compose.yml

# 4. Start
docker compose up -d
docker compose logs -f   # wait for "Application startup complete" (~3-5 min)

# 5. Test (use temperature=0 + max_tokens >= 2048 for thinking-enabled requests)
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen36-fast",
    "messages": [{"role":"user","content":"What is 17 × 23? Show your work."}],
    "max_tokens": 2048,
    "temperature": 0
  }'
```
Full step-by-step (with pre-flight checks, smoke tests, systemd service, OpenClaw integration): github.com/AEON-7/Qwen3.6-NVFP4-DFlash/blob/main/docs/dgx-spark-setup.md
Production docker-compose (the actual flags that work)
```yaml
services:
  vllm:
    image: ghcr.io/aeon-7/vllm-spark-omni-q36:v1.2
    container_name: vllm-qwen36-heretic
    restart: unless-stopped
    network_mode: host
    environment:
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - NVIDIA_FORWARD_COMPAT=1
      - VLLM_TEST_FORCE_FP8_MARLIN=1  # MANDATORY on DGX Spark / SM121
    volumes:
      - /opt/qwen36/qwen36-nvfp4:/models/qwen36
      - /opt/qwen36/qwen36-dflash:/models/qwen36-dflash
    command:
      - bash
      - -c
      - |
        exec vllm serve /models/qwen36 \
          --served-model-name qwen36-35b-heretic qwen36-fast qwen36-deep \
          --host 0.0.0.0 --port 8000 \
          --tensor-parallel-size 1 \
          --dtype auto \
          --quantization compressed-tensors \
          --max-model-len 262144 \
          --max-num-seqs 128 \
          --max-num-batched-tokens 65536 \
          --gpu-memory-utilization 0.85 \
          --enable-chunked-prefill \
          --enable-prefix-caching \
          --load-format safetensors \
          --trust-remote-code \
          --enable-auto-tool-choice \
          --tool-call-parser qwen3_coder \
          --reasoning-parser qwen3 \
          --speculative-config '{"method":"dflash","model":"/models/qwen36-dflash","num_speculative_tokens":15}' \
          --attention-backend flash_attn
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
Note: `--enforce-eager` is NOT required with the v1.2 image + the post-2026-04-19 DFlash drafter. Earlier writeups recommended it as a workaround for two separate bugs (drafter long-context crash + cudagraph capture-size misalignment). Both are now fixed — the drafter on HF, and the cudagraph alignment via the v1.2 image's `patch_cudagraph_align.py`. Running with cudagraphs enabled gives ~30% higher throughput than eager mode.
What's inside the v1.2 image (the 8 modifications)
| # | Modification | What it solves |
|---|---|---|
| 1 | `register_qwen3_5_text.py` | Adds text-only registry entries (used by v1 weights only — harmless on v2) |
| 2 | `patch_cuda_optional_import.py` | Wraps the `_C_stable_libtorch` import in RTLD_LAZY so SM100-only MXFP4 symbols don't break sm_120 |
| 3 | `patch_kv_cache_utils.py` (×4) | Defaults `mamba_block_size = cache_config.block_size or 16` for hybrid attention layers |
| 4 | `patch_mrope_text_fallback.py` | Inline M-RoPE fallback (T=H=W=arange) — neither Qwen3.6 class implements `get_mrope_input_positions` upstream |
| 5 | `patch_cudagraph_align.py` | Removes the FULL-only gate on cudagraph capture-size alignment so PIECEWISE + spec-decode doesn't hit `cudaErrorIllegalAddress` |
| 6 | `VLLM_TEST_FORCE_FP8_MARLIN=1` (env, baked default) | Forces Marlin GEMM — CUTLASS NVFP4 is broken on SM121 |
| 7 | `TORCH_CUDA_ARCH_LIST="12.0+PTX"` (build) | sm_120 build with PTX → driver JITs to sm_121a on Spark |
| 8 | `flashinfer-python>=0.6.8` | sm_120 NVFP4 KV-cache decode kernels |
Full per-patch breakdown with upstream-issue references: github.com/AEON-7/Qwen3.6-NVFP4-DFlash/blob/main/docs/patches.md
Model Architecture
| Property | Value |
|---|---|
| Architecture | qwen3_5_moe (multimodal — Qwen3_5MoeForConditionalGeneration) |
| Total params | ~35B |
| Active params | ~3B / token |
| Layers | 40 (3× Gated DeltaNet + 1× Gated Attention, repeating ×10) |
| Hidden | 2048 |
| Experts | 256 routed + 1 shared, top-8 per token |
| Vocabulary | 248,320 |
| Native context | 262,144 (256K) |
| Extended context (YaRN) | 1,010,000 (1M+) |
| Multimodal | 27-block ViT vision encoder (preserved BF16) |
Hybrid Attention
| Attention type | Layers | Q/K/V heads | Head dim |
|---|---|---|---|
| Gated DeltaNet (linear, BF16) | 30 (3 of every 4) | QK 16, V 32 | 128 |
| Gated Attention (NVFP4) | 10 (1 of every 4) | Q 16, KV 2 | 256 (rotary 64) |
Quantization Details
| Parameter | Value |
|---|---|
| Tool | llmcompressor |
| Format | compressed-tensors nvfp4-pack-quantized |
| Scheme | NVFP4 (FP4 E2M1 + per-block FP8 e4m3 scales + per-tensor FP32 scales) |
| Block size | 16 |
| Calibration data | open-platypus (256 samples) |
| Calibration seq_len | 2048 |
| Pipeline | Sequential (Qwen3_5MoeDecoderLayer, layer-by-layer to GPU) |
| Hardware | NVIDIA RTX PRO 6000 Blackwell (96 GB) |
| Calibration wall-clock | ~3 hours (40 decoder layers × ~3-4 min each) |
| Output | 9 safetensors shards, ~22 GB total |
| Expert keys (NVFP4) | 122,880 (40 × 256 × 3 × 4) |
| Visual keys (BF16) | ~333 |
| Linear-attn keys (BF16) | ~270 |
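To make the scheme row concrete, here is a toy sketch of NVFP4-style block quantization: each block of 16 values is mapped to the FP4 E2M1 codebook under a per-block scale. The real format stores block scales in FP8 e4m3 plus a per-tensor FP32 scale; this sketch keeps everything in plain floats for clarity:

```python
# Toy sketch of NVFP4-style block quantization (block size 16).
# FP4 E2M1 represents +/- {0, 0.5, 1, 1.5, 2, 3, 4, 6}. The real format
# stores per-block scales in FP8 e4m3 plus a per-tensor FP32 scale;
# here the scale stays a plain Python float for clarity.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16

def quantize_block(block):
    """Map one block of <= 16 floats to (scale, signed E2M1 codes)."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax else 1.0   # largest magnitude maps to E2M1's max
    codes = []
    for x in block:
        mag = min(E2M1, key=lambda c: abs(abs(x) / scale - c))  # nearest code
        codes.append(mag if x >= 0 else -mag)
    return scale, codes

def dequantize_block(scale, codes):
    return [scale * c for c in codes]

vals = [0.01 * i - 0.07 for i in range(BLOCK)]   # toy weight block
scale, codes = quantize_block(vals)
recon = dequantize_block(scale, codes)
err = max(abs(a - b) for a, b in zip(vals, recon))
assert err <= scale   # widest code gap is 2 (between 4 and 6), so err <= 1*scale
```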
Quantized layers (NVFP4)
- Gated Attention projections: `q_proj`, `k_proj`, `v_proj`, `o_proj` (10 layers)
- MoE experts (256 × 40 layers = 10,240 expert modules): `gate_proj`, `up_proj`, `down_proj`
- Shared expert: same projections
Excluded from quantization (kept BF16)
- `lm_head`, `embed_tokens` — accuracy-critical token projections
- `*.mlp.gate`, `*.shared_expert_gate` — MoE routing (sparsity-critical)
- `*.norm.*` — all RMSNorm layers
- `*.visual.*` — 27-block ViT vision tower
- `*.linear_attn.*` — 30 Gated DeltaNet (Mamba) layers (small relative to MoE; quantizing them tanks accuracy)
The exact recipe + script that produced this checkpoint is at `scripts/qwen36_requant_v2.py`.
Recommended sampling parameters
From the Qwen3.6 model card:
| Mode | General | Coding | Math/Reasoning |
|---|---|---|---|
| Thinking | T=1.0, P=0.95, K=20, PP=1.5 | T=0.6, P=0.95, K=20, PP=0.0 | T=1.0, P=1.0, K=40, PP=2.0 |
| Instruct (no think) | T=0.7, P=0.8, K=20, PP=1.5 | — | T=1.0, P=0.95, K=20, PP=1.5 |
For maximum DFlash speedup: use T=0 (greedy). Drafter ↔ target agreement falls as sampling temperature rises — acceptance that reaches 60-78% at T=0 is substantially lower at T=0.7 (the measured single-stream decode penalty on this setup was ~15%; see the greedy-vs-stochastic table above).
The production compose registers 3 served-model aliases for the same backend so chat clients can route greedy vs sampled requests separately:
- `qwen36-fast` → intended for greedy/agentic (T=0)
- `qwen36-deep` → intended for creative/sampled (T=0.7)
- `qwen36-35b-heretic` → canonical name
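A trivial client-side router over these aliases might look like the following. The alias names come from the compose file above; the task taxonomy is illustrative, not part of the deployment:

```python
# Client-side routing over the three served-model aliases registered by
# the production compose. Greedy/agentic traffic goes to qwen36-fast at
# T=0 (max DFlash acceptance); creative traffic to qwen36-deep at T=0.7.
# The task categories here are illustrative, not part of the deployment.
ROUTES = {
    "code":     ("qwen36-fast", 0.0),
    "math":     ("qwen36-fast", 0.0),
    "agentic":  ("qwen36-fast", 0.0),
    "creative": ("qwen36-deep", 0.7),
    "chat":     ("qwen36-deep", 0.7),
}

def route(task: str) -> dict:
    # unknown tasks fall back to the canonical alias
    model, temperature = ROUTES.get(task, ("qwen36-35b-heretic", 0.7))
    return {"model": model, "temperature": temperature}

assert route("math") == {"model": "qwen36-fast", "temperature": 0.0}
assert route("creative")["model"] == "qwen36-deep"
```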
Disable thinking per-request:
`{"chat_template_kwargs": {"enable_thinking": false}}`
Preserve thinking across multi-turn:
`{"chat_template_kwargs": {"preserve_thinking": true}}`
Common gotcha: with thinking enabled (default), Qwen3.6 spends most of its `max_tokens` budget on `<think>` reasoning before emitting `content`. Use `max_tokens` ≥ 2048 for thinking-enabled requests — lower budgets often produce `content: null` with `finish_reason: "length"`.
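Putting the per-request thinking controls and the `max_tokens` guidance together, a complete request body for this server's `/v1/chat/completions` endpoint might look like:

```python
import json

# A complete chat-completions request body combining the per-request
# thinking controls with the max_tokens guidance above. POST this as
# JSON to the server's /v1/chat/completions endpoint.
thinking_request = {
    "model": "qwen36-deep",
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    "max_tokens": 2048,   # >= 2048 so <think> doesn't eat the whole budget
    "temperature": 1.0,
}

no_thinking_request = {
    **thinking_request,
    "model": "qwen36-fast",
    "temperature": 0,
    "max_tokens": 512,    # smaller budgets are fine with thinking disabled
    "chat_template_kwargs": {"enable_thinking": False},
}

# Both serialize cleanly to the JSON the endpoint expects
body = json.dumps(no_thinking_request)
assert json.loads(body)["chat_template_kwargs"]["enable_thinking"] is False
```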
Hardware Requirements
| Tier | GPU | Notes |
|---|---|---|
| Target — production-validated | NVIDIA DGX Spark (128 GB unified, GB10 SM 12.1) | Full 256K context, 128 concurrent streams, this image |
| Compatible | RTX PRO 6000 Blackwell (96 GB) | What v2 was calibrated on; vLLM image must be rebuilt without VLLM_TEST_FORCE_FP8_MARLIN=1 (CUTLASS NVFP4 works on this chip) |
| Compatible | B200 / GB200 | Image rebuild required (SM 10.0, not SM 12.x) |
| Compatible | RTX 5090 (32 GB) | Reduced context, low concurrency |
| Minimum | Any Blackwell GPU (SM 10.0+) | Required for native FP4; sub-Blackwell can run via Marlin W4A16 fallback but with reduced throughput |
Files
| File | Size | Description |
|---|---|---|
| `model-00001-of-00009.safetensors` … `model-00009-of-00009.safetensors` | ~22 GB total | NVFP4 quantized weights (~123,724 tensors across 9 shards) |
| `model.safetensors.index.json` | ~5 MB | Shard index |
| `config.json` | ~7 KB | Model + quantization config (`Qwen3_5MoeForConditionalGeneration`) |
| `tokenizer.json` | ~20 MB | Qwen tokenizer (248K vocab) |
| `tokenizer_config.json` | ~1 KB | |
| `chat_template.jinja` | ~8 KB | Qwen3.6 chat template (thinking + tool calling) |
| `preprocessor_config.json` | ~500 B | Image preprocessor (kept for multimodal compat) |
| `generation_config.json` | ~213 B | |
| `recipe.yaml` | ~500 B | llmcompressor recipe used |
Disclaimer
THIS IS AN UNCENSORED MODEL. By downloading, accessing, or using this model you expressly assume full and sole responsibility for all outputs generated, all actions taken based on outputs, and compliance with applicable laws. The authors are not responsible for any harmful, illegal, or objectionable content. These tools serve legitimate purposes including security research, red-teaming, content analysis, and creative work. Implement safeguards appropriate to your use case and jurisdiction.
License
Apache 2.0 (inherited from Qwen3.6 base).
Credits
- Base model: tvall43/Qwen3.6-35B-A3B-heretic — abliteration via Heretic v1.2.0
- Original target: Qwen/Qwen3.6-35B-A3B by Alibaba Tongyi
- DFlash drafter: z-lab/Qwen3.6-35B-A3B-DFlash — z-lab (Soroush Mohri et al.)
- Quantization tool: llmcompressor — Neural Magic / RedHat
- vLLM build chain: vllm-project/vllm HEAD source-built for cu130/sm_120
- SM121 stability investigation: rmagur1203/vllm-dgx-spark — independent 4-day, 144-config investigation that surfaced the Marlin requirement
- Quantized by: AEON-7 on NVIDIA RTX PRO 6000 Blackwell (RunPod)
- Production-validated on: NVIDIA DGX Spark (GB10)