
Arcee AI Trinity-Large-Thinking: The $20M Open Model Chasing Claude

AI · Open Source

Tags: arcee-ai, trinity, moe, open-source, apache-2, llm, agentic-ai, reasoning

Twenty-six people. Twenty million dollars. One model that outperforms GPT-5 on reasoning benchmarks.

Arcee AI's Trinity-Large-Thinking isn't supposed to exist. The conventional wisdom says you need hundreds of researchers and nine-figure budgets to train a frontier model. A startup that pivoted to frontier training less than a year ago shouldn't be topping leaderboards. Yet here we are.

What Is Trinity-Large-Thinking?

Trinity-Large-Thinking is a mixture-of-experts language model with roughly 400 billion total parameters, but only 13 billion active during any single forward pass. It was released in April 2026 under the Apache 2.0 licence, making it the only fully open US-trained model operating at this performance tier.

The model sits in an unusual category: strong enough to compete with closed-source leaders on reasoning tasks, cheap enough to serve at scale, and open enough for anyone to inspect, modify, or deploy on their own hardware. That combination didn't exist before.

The Company Behind It

Arcee AI was founded in late 2023 by CEO Mark McQuade (previously at Hugging Face) and CTO Jacob Solawetz (previously at Roboflow). The company started out building small language models for enterprise use, then made a sharp pivot toward frontier-scale training about nine months before Trinity's release.

The team is roughly 26 people. Total funding sits around $49 million. Those numbers are absurdly small for a company producing models that trade blows with Claude Opus 4.6 and GPT-5, but they reflect a deliberate strategy: spend heavily on compute for a single training run rather than maintaining a large research org.

Architecture

Trinity uses a sparse mixture-of-experts design with some unusual choices. The architecture class is registered as AfmoeForCausalLM, and the full specification is laid out in their technical report (arXiv:2602.17004).

| Parameter | Value |
|---|---|
| Total parameters | ~400B |
| Active parameters | ~13B |
| Number of experts | 256 |
| Active experts per token | 4 |
| Sparsity | 1.56% |
| Routing mechanism | Sigmoid |
| Dense layers | 6 |
| Attention pattern | 3:1 interleaved local/global (GQA + gated) |
| Load balancing | SMEBU |
| Optimiser | Muon (not AdamW) |
| Context length | 512K tokens |
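The sparsity row follows directly from the routing configuration: 4 active experts out of 256 is exactly 1.5625%. Note that the active-to-total parameter ratio (~13B / ~400B ≈ 3.25%) is higher than the expert-level sparsity, which is expected since attention, embeddings, and the six dense layers run on every token. A quick check:

```python
# Expert-level sparsity: fraction of the expert pool consulted per token.
active_experts, total_experts = 4, 256
expert_sparsity = active_experts / total_experts
print(f"expert-level sparsity: {expert_sparsity:.4%}")

# Parameter-level ratio is larger because shared (non-expert) weights
# participate in every forward pass.
print(f"active/total parameter ratio: {13 / 400:.2%}")
```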

A few things stand out.

256 experts with only 4 active. Most MoE models use 8 to 16 experts. Arcee went with a much larger expert pool, giving the model more specialised paths for different types of reasoning while keeping active parameters at just 13B. The 1.56% sparsity ratio is aggressive even by MoE standards.
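The routing mechanics are simple to sketch. Purely as an illustration (this is not Arcee's implementation), top-k selection over a 256-expert pool looks like this:

```python
import random

random.seed(0)

NUM_EXPERTS, TOP_K = 256, 4

# One router score per expert for a single token. In a real model these
# come from a learned router projection of the token's hidden state;
# random values stand in for that here.
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

# Keep only the TOP_K highest-scoring experts; the other 252 stay idle
# for this token, which is where the compute savings come from.
top = sorted(range(NUM_EXPERTS), key=lambda i: logits[i], reverse=True)[:TOP_K]
print(f"token routed to experts {sorted(top)} "
      f"({TOP_K}/{NUM_EXPERTS} = {TOP_K / NUM_EXPERTS:.4%} of the pool)")
```

In a full MoE layer the selected experts' outputs are then combined using the gate weights; the next section's routing function decides what those weights look like.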

Sigmoid routing, not softmax. Arcee uses a sigmoid function instead of the standard softmax gate, arguing it produces more stable expert specialisation. Whether this holds up under independent analysis remains to be seen.
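The gating difference is easy to state in code. A softmax gate makes the selected experts compete (their weights must sum to 1), while a sigmoid gate scores each expert independently. A minimal sketch of the two, not Arcee's actual router:

```python
import math

def softmax_gate(scores):
    """Softmax over the selected experts: weights compete and sum to 1."""
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_gate(scores):
    """Sigmoid per expert: each weight is independent, no normalisation."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

selected = [2.0, 0.5, 1.5, -0.3]         # router scores of the 4 chosen experts
print("softmax:", [round(w, 3) for w in softmax_gate(selected)])
print("sigmoid:", [round(w, 3) for w in sigmoid_gate(selected)])
```

Under softmax, raising one expert's score necessarily lowers the others' weights; under sigmoid it doesn't, which is the intuition behind the stability claim.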

Muon optimiser. Nearly every large model trained in the last three years uses AdamW or a variant. Muon is based on matrix orthogonalisation, and Arcee reports better convergence on this specific architecture. If that result replicates, it could shift how the field approaches MoE training.
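Muon's core operation can be sketched independently of any training framework: orthogonalise the (momentum-smoothed) gradient matrix before applying it, typically via a Newton-Schulz iteration. The toy pure-Python version below uses the classic cubic iteration on a small matrix; it is illustrative only, not Arcee's training code (production Muon implementations use a tuned quintic polynomial on GPU tensors):

```python
# Cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X, which drives the
# singular values of X toward 1, i.e. orthogonalises the update matrix.

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def newton_schulz(x, steps=10):
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x

# A non-orthogonal "gradient" whose singular values lie inside (0, sqrt(3)),
# the convergence region for this iteration.
g = [[0.6, 0.2], [0.1, 0.8]]
q = newton_schulz(g)

# q^T q should now be close to the identity matrix.
print([[round(v, 3) for v in row] for row in matmul(transpose(q), q)])
```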

SMEBU load balancing. Expert collapse, where routing sends most tokens to a small subset of experts, is a persistent MoE problem. SMEBU ("Sparse Mixture of Experts with Balanced Utilisation") is Arcee's custom loss term for keeping usage even. Training logs show consistent 90% utilisation across the expert pool.
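Arcee hasn't published SMEBU's exact formulation, so as context, here is the generic Switch-Transformer-style load-balancing penalty that such losses build on: sum each expert's token share times its mean gate probability, scaled by the expert count. It is minimised when routing is uniform and grows sharply under expert collapse:

```python
def balance_loss(tokens_per_expert, mean_gate_prob):
    """Generic MoE balancing penalty, NOT Arcee's actual SMEBU term.

    tokens_per_expert: how many tokens each expert received in a batch.
    mean_gate_prob:    the router's average probability for each expert.
    """
    n = len(tokens_per_expert)
    total = sum(tokens_per_expert)
    frac = [t / total for t in tokens_per_expert]   # f_i: share of tokens
    return n * sum(f * p for f, p in zip(frac, mean_gate_prob))

even = balance_loss([25, 25, 25, 25], [0.25] * 4)             # balanced routing
collapsed = balance_loss([97, 1, 1, 1], [0.91, 0.03, 0.03, 0.03])
print(f"balanced: {even:.2f}, collapsed: {collapsed:.2f}")
```

A perfectly even split yields the minimum value of 1.0; the collapsed case here scores about 3.5, so gradient descent on this term pushes the router back toward uniform usage.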

Training

The training run was large but not unprecedented by 2026 standards. What makes it notable is the efficiency.

Data: 17 trillion tokens curated by DatologyAI, plus over 8 trillion synthetic tokens. Arcee hasn't disclosed the exact composition, but emphasises heavy filtering for quality over raw volume.

Hardware: 2,048 NVIDIA B300 GPUs. Arcee was one of the first non-hyperscaler organisations to get a full cluster of the newest data centre chips. The pre-training run took 33 days at roughly 90% hardware utilisation, exceptionally high for a distributed job at this scale.

Post-training: A separate phase on 1,152 H100 GPUs focused on agentic reinforcement learning. This is where the "Thinking" in the name comes from. The model was trained with chain-of-thought reasoning traces, tool-use trajectories, and multi-step problem decomposition.

Total cost: Approximately $20 million all-in. That's roughly 1/50th of what Google or OpenAI spend on comparable frontier models.
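The reported figures make a rough sanity check easy: 2,048 GPUs for 33 days is about 1.6 million GPU-hours of pre-training. Dividing the $20M budget by that (ignoring the H100 post-training phase, data, and staff costs) implies an effective rate on the order of $12/GPU-hour. That rate is only an inference from the article's numbers, not a disclosed figure:

```python
gpus, days = 2048, 33
gpu_hours = gpus * days * 24          # pre-training run only
budget = 20_000_000                   # total reported cost, USD

print(f"pre-training GPU-hours: {gpu_hours:,}")
print(f"implied all-in rate:    ${budget / gpu_hours:.2f}/GPU-hour")
```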

Benchmarks

The numbers tell the story better than prose.

| Benchmark | Score | Rank |
|---|---|---|
| τ²-Airline | 88.0 | #1 |
| LiveCodeBench | 98.2 | #1 |
| PinchBench | 91.9 | #2 |
| AIME25 | 96.3 | #2 |
| GPQA | 76.3 | — |
| MMLU-Pro | 83.4 | — |
| SWE-bench | 63.2 | — |
| IFBench | 52.3 | — |

The top-line results are striking. Trinity holds the number one spot on τ²-Airline (an agentic benchmark testing multi-turn tool use in realistic airline customer-service scenarios) and LiveCodeBench (recently published competitive-programming problems, curated to limit training-data contamination). It sits second on PinchBench and AIME25, two benchmarks where the only models ahead of it are Claude Opus 4.6 and, on AIME25, a specialised math model.

But the scorecard isn't uniform. GPQA at 76.3 and MMLU-Pro at 83.4 are competitive but not leading. SWE-bench at 63.2 is solid for an open model, though behind proprietary leaders. IFBench at 52.3 is a genuine weak spot. Instruction following remains one of Trinity's clear limitations.

Pricing and Availability

Trinity is available through OpenRouter at $0.90 per million output tokens. For comparison, Claude Opus 4.6 costs $25/M output tokens. That's a nearly 28× price gap for a model that beats it on several reasoning benchmarks.
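The gap compounds quickly for agentic workloads, which generate many output tokens per task. Assuming a hypothetical workload of 1 billion output tokens a month (a figure chosen purely for illustration):

```python
trinity_rate = 0.90   # USD per million output tokens (OpenRouter)
opus_rate = 25.00     # USD per million output tokens

monthly_tokens_m = 1_000   # hypothetical: 1B output tokens = 1,000M

trinity_cost = trinity_rate * monthly_tokens_m
opus_cost = opus_rate * monthly_tokens_m
print(f"Trinity: ${trinity_cost:,.0f}/mo, Opus: ${opus_cost:,.0f}/mo "
      f"({opus_rate / trinity_rate:.1f}x)")
```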

In its first two months on OpenRouter, Trinity has served over 3.37 trillion tokens. That kind of adoption volume suggests the market is hungry for a capable open model at this price point.

The model also integrates with the OpenClaw and Hermes Agent frameworks, so it can plug into existing agentic tooling pipelines.

Running It Yourself

For self-hosting, Trinity works with vLLM. Launch it like this:

python -m vllm.entrypoints.openai.api_server \
  --model arcee-ai/Trinity-Large-Thinking \
  --tensor-parallel-size 8 \
  --max-model-len 524288 \
  --trust-remote-code

You'll need at least 8× H100 80GB or equivalent GPUs to serve the model with reasonable latency. The 400B total parameters don't need to all live in GPU memory simultaneously thanks to expert offloading, but the active 13B parameters plus KV cache for 512K context still require substantial VRAM.
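A back-of-envelope memory estimate shows why offloading matters. The weight figure follows from the parameter count; the KV-cache figure below uses assumed shapes (layer count, KV heads, head dimension are hypothetical values, not Trinity's published config):

```python
bytes_per_param = 2                     # bf16 precision

# Full weights: 400B parameters at 2 bytes each.
total_params = 400e9
weights_gb = total_params * bytes_per_param / 1e9
print(f"full weights (bf16): ~{weights_gb:.0f} GB")   # exceeds 8x80GB = 640 GB

# KV cache for ONE sequence at the full 512K context, assuming GQA with
# 64 layers, 8 KV heads, and head dim 128 (illustrative assumptions).
layers, kv_heads, head_dim, ctx = 64, 8, 128, 512 * 1024
kv_gb = layers * 2 * kv_heads * head_dim * ctx * bytes_per_param / 1e9
print(f"KV cache @ 512K ctx: ~{kv_gb:.0f} GB per sequence")
```

Even with experts offloaded, long-context batches eat VRAM fast, which is why the 8-GPU floor is about latency and cache headroom, not just fitting the active 13B.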

For API access via OpenRouter:

import openai

client = openai.OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY"
)

response = client.chat.completions.create(
    model="arcee-ai/trinity-large-thinking",
    messages=[
        {"role": "user", "content": "Explain why sigmoid routing might outperform softmax in MoE models."}
    ],
    temperature=0.7,
    max_tokens=4096
)

print(response.choices[0].message.content)

Limitations and Caveats

Trinity has real weaknesses.

The post-training phase was limited compared to what companies like Anthropic and OpenAI do. Arcee describes the current release as "preliminary." Instruction following (IFBench) is the most visible symptom: the model sometimes misinterprets complex multi-part instructions or drifts from the specified output format.

No independent peer review of the training methodology exists yet. Claims about Muon's superiority over AdamW, the benefits of sigmoid routing, and SMEBU's effectiveness all need external validation. The 512K context window hasn't been stress-tested with needle-in-a-haystack (NIAH) evaluations either.

Arcee is currently raising for a planned 1T parameter model with an estimated training cost around $60M.

Who Should Use Trinity

Trinity-Large-Thinking makes sense for three groups right now.

Teams building agentic systems. The model's strength on τ²-Airline and tool-use tasks, combined with its low serving cost, makes it a strong candidate for production agent pipelines where you're running many inference calls per task. The 28× cost advantage over Claude compounds fast at scale.

Researchers studying MoE architectures. An Apache 2.0 model with 256 experts, sigmoid routing, and Muon optimisation is a gift for anyone investigating sparse model design. You can inspect, modify, and retrain every component without licensing restrictions.

Organisations that need open models. If your compliance or security requirements demand on-premises deployment with full model weights, Trinity is currently the strongest option trained outside China. DeepSeek and Qwen offer competitive open models, but data sovereignty and export regulations make US-trained models preferable for many enterprise contexts.

For everyone else, the current limitations in instruction following and the preliminary nature of the release suggest waiting. Arcee has signalled that updated checkpoints with improved post-training are coming. The architecture is proven. The execution needs refinement.

The Bigger Picture

Trinity's existence matters not because it beats every closed model. It doesn't. It matters because a 26-person startup with $49M in funding built something that competes at the frontier, released it openly, and priced it at a fraction of the competition.

The cost to train a frontier-capable model keeps dropping. Trinity is evidence the trend is accelerating. If $20M buys you a model that ranks first on agentic reasoning benchmarks in 2026, the economics of frontier AI look very different from what most people assumed a year ago.

The open-source AI ecosystem has been waiting for a US-trained model that can credibly challenge the closed-source incumbents. Trinity isn't perfect, but it's the first model that makes that claim with a straight face.


Technical details sourced from arXiv:2602.17004 and Arcee AI's public documentation.