Tag: inference

4 articles

llm dgx-spark atlas multi-model inference qwen deepseek moe

Atlas Engine: Sub-2-Minute Cold Start for Multi-Model Orchestration on DGX Spark

May 10, 2026 · 7 min read

Run 3 specialised LLMs on a single DGX Spark in under 2 minutes with 100+ tok/s throughput. Production orchestration patterns revealed.

atlas dgx-spark multi-model llm inference qwen

DeepSeek V4: 1.6T Parameters, FP4 Precision, and the Huawei NPU Question

April 25, 2026 · 6 min read

DeepSeek V4 ships two open-weight MoE models — a 1.6T Pro and a 284B Flash — with novel sparse attention, FP4 quantisation, 1M token context, and validated Huawei Ascend NPU support. Here's what actually changed.

deepseek moe llm open-source huawei npu inference fp4

vLLM vs SGLang: Choosing an LLM Inference Framework in 2026

April 13, 2026 · 7 min read

A technical comparison of vLLM and SGLang, the two leading open-source LLM inference engines, covering architecture, performance, and when to pick each one.

vllm sglang llm inference machine-learning gpu serving

Self-Hosted LLM Inference: A Complete vLLM Setup Guide

February 25, 2026 · 8 min read

A practical guide to deploying production-ready LLM inference using vLLM on NVIDIA DGX Spark hardware, covering configuration, troubleshooting, and performance optimisation.

vllm llm self-hosted docker nvidia inference qwen