graphwiz.ai

Qwen3.6-35B-A3B: What the Numbers Actually Show

AI · Machine Learning
qwen · moe · llm · open-source · agentic · coding · alibaba

On 16 April 2026, Alibaba's Qwen team released Qwen3.6-35B-A3B, the first open-weight model in the Qwen3.6 series. It arrives two months after the Qwen3.5 line and targets a specific niche: agentic coding workflows where a model must plan multi-file edits, call tools, and iterate across conversation turns. The headline numbers are impressive. SWE-bench Verified jumps from 70.0 to 73.4. Terminal-Bench 2.0 leaps from 40.5 to 51.5. But before treating this as a generational leap, it helps to look at what actually changed, and what didn't.

Architecture: Same Bones, Different Training

Let's get the obvious point out of the way. Qwen3.6-35B-A3B is a post-training update, not an architectural one. The Hugging Face model card lists the architecture class as qwen3_5_moe, identical to its predecessor. The underlying model topology has not changed.

| Specification | Value |
| --- | --- |
| Total parameters | 35B |
| Active parameters per token | 3B |
| Number of experts | 256 |
| Active experts | 8 routed + 1 shared |
| Number of layers | 40 |
| Hidden dimension | 2048 |
| Native context length | 262,144 tokens |
| Extended context (YaRN) | 1,010,000 tokens |
| Licence | Apache 2.0 |

The model is a multimodal causal language model with a vision encoder, supporting text, image, and video input. It ships in BF16 and requires roughly 22 GB of VRAM for a Q4_K_M quantised variant, making it runnable on a single RTX 3090, 4090, or 5090.

The Hybrid Attention Layout

The architecture uses a repeating pattern of ten blocks, each structured as 3 x (Gated DeltaNet + MoE) followed by 1 x (Gated Attention + MoE). That means 75% of the 40 layers use linear attention (O(n) scaling with sequence length) and 25% use full quadratic attention. The 3:1 ratio is not arbitrary: both Alibaba and Kimi independently converged on this proportion for their respective hybrid architectures, suggesting it strikes a practical balance between long-context efficiency and precise token-level reasoning.
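The repeating block structure can be sketched in a few lines. This is an illustrative reconstruction of the layout described above, not the actual config keys from the model's implementation:

```python
# Sketch of the 40-layer hybrid layout: repeating blocks of
# 3 linear-attention (Gated DeltaNet) layers followed by
# 1 full-attention (Gated Attention) layer. Layer-type names
# here are illustrative, not real config values.
NUM_LAYERS = 40
BLOCK = ["linear_attention"] * 3 + ["full_attention"]

layout = (BLOCK * (NUM_LAYERS // len(BLOCK)))[:NUM_LAYERS]

linear = layout.count("linear_attention")  # 30 layers
full = layout.count("full_attention")      # 10 layers
print(f"{linear}/{NUM_LAYERS} linear ({linear / NUM_LAYERS:.0%}), "
      f"{full}/{NUM_LAYERS} full ({full / NUM_LAYERS:.0%})")
```

Ten repetitions of the 3+1 block give exactly 40 layers, which is where the 75%/25% split comes from.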

Gated DeltaNet

The linear attention layers use a Gated DeltaNet formulation. There are 32 linear attention heads for the value projection and 16 for the query-key projection, with a head dimension of 128. The design draws on Mamba2's gated decay mechanism combined with DeltaNet's structured recurrence, giving the model the ability to maintain long-range dependencies without the quadratic cost of standard attention.
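For intuition, a single-head gated delta rule update can be written down directly. This is a simplified NumPy sketch of the published Gated DeltaNet formulation (a decaying state matrix plus a rank-1 delta correction), not Qwen's actual implementation; dimensions, gate values, and the training-time details are all illustrative:

```python
import numpy as np

# Minimal single-head sketch of the gated delta rule: the recurrent
# state S decays via a data-dependent gate g_t (Mamba2-style) and is
# corrected with a rank-1 delta write, giving O(n) cost in sequence
# length instead of quadratic attention.
def gated_deltanet_step(S, q, k, v, g, beta):
    # S: (d_v, d_k) state; q, k: (d_k,); v: (d_v,); g, beta in (0, 1)
    S = g * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    o = S @ q  # read-out for the current token
    return S, o

d_k, d_v, T = 8, 8, 16
rng = np.random.default_rng(0)
S = np.zeros((d_v, d_k))
for _ in range(T):
    q, k, v = (rng.standard_normal(d) for d in (d_k, d_k, d_v))
    k /= np.linalg.norm(k)  # the delta rule assumes normalised keys
    S, o = gated_deltanet_step(S, q, k, v, g=0.95, beta=0.5)
```

The key property is that the per-token cost is constant in sequence length: only the fixed-size state `S` is carried forward, never the full token history.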

Gated Attention

The remaining 25% of layers use Gated Attention with an extreme form of grouped-query attention: 16 query heads but only 2 key-value heads. Head dimension is 256, and the rotary position embedding dimension is 64. This aggressive KV compression means the attention layers are cheap to run despite being quadratic in nature.
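The KV-cache saving from this compression is easy to quantify. The back-of-envelope below uses the figures stated above (10 full-attention layers, 2 KV heads, head dimension 256, BF16 at 2 bytes); the 16-KV-head baseline is a hypothetical comparison point, not a shipped configuration:

```python
# Per-token KV-cache cost for the quadratic layers: 2 tensors (K and V)
# per layer, each kv_heads x head_dim values at dtype_bytes each.
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

gqa = kv_bytes_per_token(layers=10, kv_heads=2, head_dim=256)   # as shipped
mha = kv_bytes_per_token(layers=10, kv_heads=16, head_dim=256)  # hypothetical
print(f"GQA: {gqa} B/token, full-head baseline: {mha} B/token, "
      f"ratio: {mha // gqa}x")
```

With 2 KV heads instead of 16, the attention layers' cache is 8x smaller per token, which matters at the 262K native context length.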

Naming Convention: A3B Is Not New

The "A3B" suffix denotes "Active 3 Billion" parameters: roughly 3 billion of the 35 billion total parameters are active in each forward pass. This naming convention was introduced with Qwen3-30B-A3B in April 2025 and continued through the Qwen3.5-35B-A3B release in February 2026. It tells you nothing new about Qwen3.6 specifically.
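The arithmetic behind the suffix, using the numbers from the specification table above:

```python
# "A3B" in practice: only ~3B of 35B parameters run per token,
# because just 8 routed experts plus 1 shared expert out of 256
# fire in each MoE layer. Figures from the spec table.
total_params = 35e9
active_params = 3e9
active_fraction = active_params / total_params
print(f"active fraction per forward pass: {active_fraction:.1%}")

experts_total = 256
experts_active = 8 + 1  # routed + shared
print(f"experts fired per MoE layer: {experts_active}/{experts_total}")
```

Under 9% of the weights do work on any given token, which is why a 35B-parameter model can run with the inference cost profile of a much smaller dense model.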

Benchmarks: Strong Gains in Coding, Flat Elsewhere

The benchmark results, sourced directly from the Hugging Face model card, paint a clear picture of where the post-training improvements landed.

Coding Agent Benchmarks

| Benchmark | Qwen3.5-35B-A3B | Qwen3.6-35B-A3B | Delta |
| --- | --- | --- | --- |
| SWE-bench Verified | 70.0 | 73.4 | +3.4 |
| SWE-bench Multilingual | 60.3 | 67.2 | +6.9 |
| SWE-bench Pro | 44.6 | 49.5 | +4.9 |
| Terminal-Bench 2.0 | 40.5 | 51.5 | +11.0 |
| MCPMark | 27.0 | 37.0 | +10.0 |
| Claw-Eval Avg | 65.4 | 68.7 | +3.3 |
| SkillsBench Avg5 | 4.4 | 28.7 | +24.3 |
| NL2Repo | 20.5 | 29.4 | +8.9 |

These numbers are solid. Terminal-Bench 2.0 and MCPMark, both measuring real tool-calling ability in agentic settings, show double-digit gains. SWE-bench Verified at 73.4 is competitive with much larger models.

General Knowledge and Reasoning

| Benchmark | Qwen3.5-35B-A3B | Qwen3.6-35B-A3B | Delta |
| --- | --- | --- | --- |
| MMLU-Pro | 85.3 | 85.2 | -0.1 |
| GPQA | 84.2 | 86.0 | +1.8 |
| HLE | 22.4 | 21.4 | -1.0 |
| LiveCodeBench v6 | 74.6 | 80.4 | +5.8 |
| AIME 26 | 91.0 | 92.7 | +1.7 |

MMLU-Pro is effectively flat. HLE dropped slightly. The coding-specific LiveCodeBench improved by nearly six points, consistent with the model's targeted post-training focus. The story here is straightforward: Alibaba trained for agentic coding capability, and the benchmarks reflect that trade-off.

Red Flags

Healthy scepticism is warranted. Several aspects of this release deserve scrutiny.

No Peer-Reviewed Paper

There is no arXiv paper for Qwen3.6. The technical details come from a blog post on qwen.ai and the Hugging Face model card. The most recent Qwen technical report on arXiv (2505.09388) covers the Qwen3 series, not 3.6. Without independent peer review, the post-training methodology, data composition, and alignment strategy remain opaque.

Internal and Unverified Benchmarks

Several of the most impressive-sounding numbers come from internal benchmarks. QwenWebBench, which shows a 419-point Elo jump (978 to 1397), is described as "an internal front-end code generation benchmark" with an "auto-render + multimodal judge." QwenClawBench is "an internal real-user-distribution Claw agent benchmark" that is "open-sourcing soon." Internal benchmarks evaluated by internal judges are not trustworthy evidence.

The SkillsBench Jump Demands Explanation

SkillsBench Avg5 went from 4.4 to 28.7, a gain of 24.3 points. That is an extraordinary improvement for a post-training update on an unchanged architecture. The model card notes that the evaluation for 3.6 was conducted "via OpenCode on 78 tasks (self-contained subset, excluding API-dependent tasks)" with an average of five runs, but does not explain why the Qwen3.5 score was so low in the first place. A jump this large in a single training iteration could indicate a benchmark evaluation fix rather than a genuine capability improvement.

Missing Standard Benchmarks

Notable absences from the benchmark table include HumanEval, GSM8K, and MATH. These are among the most widely reported and independently reproducible coding and reasoning benchmarks. Their omission is conspicuous.

Cherry-Picked Comparisons

The benchmark table on the model card compares Qwen3.6 against Qwen3.5-27B, Gemma4-31B, Qwen3.5-35B-A3B, and Gemma4-26B-A4B. For the vision-language section, Claude Sonnet 4.5 appears as a comparison point. The current frontier models (Claude 4.6, GPT-5.4) are absent. Comparing against last-generation proprietary models makes the numbers look stronger than they would against current ones.

Thinking Preservation

One genuinely useful feature is Thinking Preservation. By default, Qwen models only retain the thinking trace from the most recent message in a conversation. In agentic coding, where an assistant might make dozens of tool calls across many turns, losing the reasoning chain from earlier messages forces the model to re-derive context it has already worked through. Thinking Preservation retains reasoning context from historical messages across turns, enabled by setting preserve_thinking: True in the chat template kwargs.

This addresses a real pain point. Agents that plan across multiple iterations benefit from seeing their own prior reasoning, and the Qwen team claims it can reduce overall token consumption by cutting redundant reasoning while improving KV cache utilisation. The claim is plausible, though independent verification is needed.
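The default-versus-preserved behaviour can be illustrated with a toy message filter. This re-implements the idea in plain Python purely for illustration; in the released model the logic lives inside the Jinja chat template and is toggled via the `preserve_thinking` kwarg, and the `<think>` tag format here is an assumption:

```python
import re

# Toy model of the two behaviours: by default, only the most recent
# assistant message keeps its <think> reasoning trace; with
# preservation on, traces survive across all historical messages.
THINK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def render_history(messages, preserve_thinking=False):
    out = []
    last_assistant = max(
        (i for i, m in enumerate(messages) if m["role"] == "assistant"),
        default=-1,
    )
    for i, m in enumerate(messages):
        content = m["content"]
        if (m["role"] == "assistant" and not preserve_thinking
                and i != last_assistant):
            content = THINK.sub("", content)  # strip stale reasoning
        out.append({"role": m["role"], "content": content})
    return out

history = [
    {"role": "user", "content": "refactor module A"},
    {"role": "assistant", "content": "<think>plan: split A</think>done"},
    {"role": "user", "content": "now module B"},
    {"role": "assistant", "content": "<think>reuse A's split</think>ok"},
]
default = render_history(history)                          # old trace dropped
preserved = render_history(history, preserve_thinking=True)  # all traces kept
```

The difference matters most for agents: in the default rendering, the plan from the first turn is gone by the time the model works on module B, and it must re-derive that context.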

Deployment

The model integrates with the standard inference stack. vLLM 0.19.0 or later and SGLang 0.5.10 or later both support it out of the box, including tool-calling parsers and multi-token prediction. AMD published day-zero support via ROCm 7.0 for MI300X, MI325X, MI350X, and MI355X GPUs.

For local deployment, Ollama supports Q4_K_M quantisation at roughly 24 GB of VRAM. MLX provides Apple Silicon support. The quantised model fits comfortably on a single consumer GPU, which is the whole point of the 3B-active design.
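The quoted VRAM figures line up with a back-of-envelope estimate. Q4_K_M averages roughly 4.8 bits per weight across tensor types; that figure is an approximation (the exact size depends on the GGUF tensor layout), and the cache/runtime overhead on top of the weights is what pushes the totals toward the 22–24 GB range reported:

```python
# Rough weight-memory estimate for a Q4_K_M quant of a 35B model.
# bits_per_weight is an approximate average for Q4_K_M, not an
# exact figure; KV cache and runtime buffers come on top of this.
params = 35e9
bits_per_weight = 4.8
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"weights alone: ~{weights_gb:.0f} GB")
```

Around 21 GB for the weights plus a few gigabytes of cache and buffers is exactly the envelope of a 24 GB consumer card, which is why the 3B-active MoE design targets this class of hardware.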

Community Reaction

The Hacker News thread was positive but measured. Several commenters noted the contrast with earlier concerns about Alibaba's open-source commitment following reported internal restructuring and the departure of Qwen lead Junyang Lin. One commenter described it as "a relief to see the Qwen team still publishing open weights." Simon Willison tested the model on creative generation tasks and reported his findings on his blog. Within two days of release, the model accumulated 753 likes on Hugging Face and over 21,000 downloads.

The practical feedback from early users running quantised versions on consumer hardware has been generally favourable for coding tasks, with some noting that the model remains behind frontier proprietary models (Opus 4.7, GPT-5.4) on complex reasoning, as expected for its size class.

Verdict

Qwen3.6-35B-A3B delivers genuine improvements in agentic coding over its predecessor. The SWE-bench, Terminal-Bench, and MCPMark gains are substantial and, for the benchmarks with established methodology, credible. The model is open-weight, Apache-licensed, and runnable on a single consumer GPU. Thinking Preservation is a practical innovation for multi-turn agent workflows.

But this is not an architectural advance. The model shares the same qwen3_5_moe architecture as its predecessor. The post-training methodology is undocumented beyond a blog post. Several of the most striking benchmark results come from internal, unaudited evaluations. The omission of standard benchmarks like HumanEval and GSM8K, and the choice to compare against last-generation proprietary models, suggest selective presentation.

The honest framing is this: Alibaba took a proven architecture, applied targeted post-training for agentic coding, and produced a stronger model in that domain. The results are real where they can be independently verified. The rest should be treated with appropriate caution until third-party evaluations arrive.