
Self-Hosted LLM Inference: A Complete vLLM Setup Guide



Running your own LLM inference server gives you complete control over data privacy, latency, and costs. This guide walks through deploying a production-ready vLLM server on NVIDIA DGX Spark hardware, with troubleshooting tips drawn from a real deployment.

Why Self-Host LLM Inference?

Before diving into the technical setup, consider the benefits of self-hosting:

  • Data Privacy: Sensitive data never leaves your infrastructure
  • Predictable Costs: No per-token API charges for heavy workloads
  • Low Latency: Local inference eliminates network round-trips
  • Model Freedom: Run any model, including fine-tuned variants
  • No Rate Limits: Scale horizontally without API throttling

Hardware Platform: NVIDIA DGX Spark (GB10)

This guide is based on deployment experience with the ASUS Ascent GX10, powered by NVIDIA's DGX Spark platform featuring the GB10 Grace Blackwell Superchip.

Key Specifications

The GB10 is a high-performance AI-focused system-on-a-chip (SoC) designed for desktop AI workstations:

Component Specification
CPU 20-core ARM v9.2-A (10× Cortex-X925 @ 3GHz + 10× Cortex-A725 @ 2GHz)
GPU Blackwell architecture, 6,144 shaders, 5th Gen Tensor Cores, 4th Gen RT Cores
AI Performance 1,000 TOPS FP4 (NVFP4), 31.03 TFLOPS FP32
Memory 128 GB LPDDR5X-9400 (256-bit bus, 273–301 GB/s bandwidth)
Interconnect NVLink-C2C (600 GB/s bidirectional CPU↔GPU)
Cache 32 MB L3 + 24 MB GPU L2 + 16 MB L4 system cache
Power 140 W TDP
Form Factor 150mm × 150mm × 50.5mm desktop
Storage Up to 4 TB NVMe SSD
Connectivity HDMI 2.1a, 4× USB-C, 10 GbE, 200 Gbps ConnectX-7, Wi-Fi 7, BT 5.4

The GB10 Grace Blackwell Superchip is optimized for inference workloads with:

  • Native FP4, FP8, and INT4 support for efficient quantization
  • Transformer Engine acceleration
  • Unified coherent memory architecture
  • High memory bandwidth for large context windows

vLLM: The Inference Engine

vLLM is a high-performance LLM inference engine that provides:

  • PagedAttention: Efficient memory management for KV cache
  • Continuous Batching: Dynamic request batching
  • Optimized Kernels: FlashAttention, FlashInfer integration
  • OpenAI-Compatible API: Drop-in replacement for OpenAI clients

Docker Deployment

The recommended deployment method uses Docker for reproducibility and isolation.

Docker Compose Configuration

services:
  vllm:
    image: vllm-node:latest
    container_name: vllm-qwen
    restart: unless-stopped
    runtime: nvidia
    ports:
      - "0.0.0.0:8888:8888"
    volumes:
      - ./templates/qwen3_chat.jinja:/templates/qwen3_chat.jinja:ro
      - ./data/sharegpt.json:/data/sharegpt.json:ro
      - ~/.cache/huggingface:/root/.cache/huggingface
    ipc: host
    shm_size: 32g
    environment:
      - VLLM_USE_FLASHINFER_MOE_FP8=1
      - VLLM_FLASHINFER_MOE_BACKEND=latency
      - VLLM_ATTENTION_BACKEND=FLASH_ATTN
      - VLLM_USE_DEEP_GEMM=0
    command:
      - vllm
      - serve
      - QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ
      - --port
      - "8888"
      - --host
      - 0.0.0.0
      - --max-model-len
      - "131072"
      - --gpu-memory-utilization
      - "0.7"
      - --load-format
      - fastsafetensors
      - --max-num-seqs
      - "64"
      - --max-num-batched-tokens
      - "8192"
      - --trust-remote-code
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_coder
      - --chat-template-kwargs
      - '{"enable_thinking": false}'
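Optionally, a Docker healthcheck can poll vLLM's /health endpoint so orchestrators notice a wedged server. This stanza is a sketch: it assumes curl is available inside the vllm-node image, and the generous start_period allows for model loading time:

```yaml
    healthcheck:
      test: ["CMD-SHELL", "curl -sf http://localhost:8888/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 5m   # model loading can take several minutes
```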

Configuration Parameters Explained

Parameter Value Purpose
--max-model-len 131072 128K token context window
--gpu-memory-utilization 0.7 Let vLLM use up to 70% of GPU memory (weights + KV cache)
--load-format fastsafetensors Fast model loading
--max-num-seqs 64 Maximum concurrent sequences
--max-num-batched-tokens 8192 Token batch limit for prefill
--enable-auto-tool-choice flag Enable function calling
--tool-call-parser qwen3_coder Tool call format parser

Resource Allocation

With this configuration on GB10 hardware:

Resource Allocation
Model Memory ~17 GiB
KV Cache 62.16 GiB
KV Cache Tokens 678,912
Max Concurrency 5.18x (at 131K context)
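The max-concurrency figure is just the KV-cache token capacity divided by the context length, so it can be sanity-checked directly:

```python
kv_cache_tokens = 678_912   # KV cache capacity reported by vLLM at startup
max_model_len = 131_072     # --max-model-len

# How many full-context requests can hold their KV cache simultaneously:
print(f"{kv_cache_tokens / max_model_len:.2f}x")  # → 5.18x
```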

Common Issues and Solutions

Issue 1: Tool Choice Configuration Error

Error:

"auto" tool choice requires --enable-auto-tool-choice and --tool-call-parser to be set

Solution: Add both flags to enable tool calling:

--enable-auto-tool-choice
--tool-call-parser qwen3_coder

Note: Qwen3-VL models DO support tool calling with the qwen3_coder parser, despite some documentation suggesting otherwise.
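With both flags set, a standard OpenAI-style tools request works. The get_weather function below is a hypothetical example schema, not part of this deployment; the payload is what gets POSTed to /v1/chat/completions:

```python
import json

# Hypothetical tool definition -- name and schema are illustrative only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ",
    "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
    "tools": tools,
    "tool_choice": "auto",  # needs --enable-auto-tool-choice server-side
}
print(json.dumps(payload, indent=2))
```

When the model decides to call the function, the parsed call appears in choices[0].message.tool_calls of the response.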

Issue 2: Context Window Overflow

Error:

You passed 23643 input tokens and requested 32000 output tokens.
However, the model's context length is only 32768.

Solution: Increase --max-model-len to accommodate both input and output:

--max-model-len 131072  # 128K tokens
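The underlying constraint is input_tokens + max_tokens ≤ max-model-len, so a client-side guard can clamp the output budget before sending. A minimal sketch:

```python
def clamp_max_tokens(input_tokens: int, requested: int, max_model_len: int) -> int:
    """Cap the output-token budget so input + output fits the context window."""
    available = max_model_len - input_tokens
    if available <= 0:
        raise ValueError(
            f"prompt ({input_tokens} tokens) already fills the {max_model_len}-token window"
        )
    return min(requested, available)

# The failing request above, against the old 32K window:
print(clamp_max_tokens(23_643, 32_000, 32_768))   # → 9125
# Against the enlarged 128K window the full request fits:
print(clamp_max_tokens(23_643, 32_000, 131_072))  # → 32000
```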

Issue 3: Thinking Mode Output Leakage

Symptom: Model output contains internal reasoning, such as stray <think> tags or exposed chain-of-thought.

Cause: Qwen3 models default to "thinking mode" which outputs reasoning before responses.

Solution: Disable at server level:

--chat-template-kwargs '{"enable_thinking": false}'

Or per-request: with raw HTTP, pass the kwargs at the top level of the request body (the OpenAI Python SDK can send the same field via its extra_body argument):

{
  "model": "QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ",
  "messages": [...],
  "chat_template_kwargs": {"enable_thinking": false}
}

Issue 4: Triton Kernels Warning

Warning:

Failed to import Triton kernels. No module named 'triton_kernels.routing'

Impact: None - this is a warning only. vLLM falls back to FLASH_ATTN backend.

Optional Fix:

pip install triton-kernels@git+https://github.com/triton-lang/triton.git@v3.5.0#subdirectory=python/triton_kernels

Issue 5: NVIDIA Driver 590.x Compatibility

Problem: Driver 590.x introduces compatibility issues with GB10 Blackwell systems.

Root Causes:

  • Library renaming (libnvidia-compute → libnvidia_compute)
  • Incomplete sm_121 compute capability support
  • FlashInfer kernel compilation failures

Recommended Solution: Use driver version 535.x series (stable, tested with vLLM).

Verification:

nvidia-smi  # Check driver version
nvcc --version  # Verify CUDA compatibility
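The driver series can also be checked programmatically. nvidia-smi's --query-gpu=driver_version flag is standard, and the helper below only flags the 590.x series discussed above (a sketch, not part of the original deployment):

```python
import subprocess

def driver_series(version: str) -> int:
    """Major series from a version string like '535.183.01'."""
    return int(version.split(".")[0])

def check_driver() -> None:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    ).strip().splitlines()[0]
    if driver_series(out) == 590:
        print(f"WARNING: driver {out} is in the 590.x series (known GB10 issues)")
    else:
        print(f"driver {out}: OK")
```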

Performance Benchmarks

Throughput Results

Test Concurrency Tokens/Request Throughput Latency
Sequential 1 128 868.81 tok/s 3.19s
Concurrent 8 128 315.85 tok/s 3.24s
Sustained 16 256 387.71 tok/s 10.23s
Long-form 4 512 187.85 tok/s 10.90s

Context Window Scaling

Prompt Size Actual Tokens Prefill Time Prefill Speed
10K 10,387 1.80s 5,768 tok/s
30K 31,547 7.15s 4,413 tok/s
50K 53,307 12.66s 4,211 tok/s
80K 85,947 29.06s 2,958 tok/s
100K 107,707 26.43s 4,074 tok/s
120K 129,467 32.03s 4,042 tok/s
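As a sanity check, the prefill-speed column is simply actual tokens divided by prefill time; recomputing it from the other two columns matches the reported values to within rounding of the timing column:

```python
rows = [  # (actual_tokens, prefill_seconds, reported_tok_per_s)
    (10_387, 1.80, 5_768),
    (31_547, 7.15, 4_413),
    (53_307, 12.66, 4_211),
    (85_947, 29.06, 2_958),
    (107_707, 26.43, 4_074),
    (129_467, 32.03, 4_042),
]
for tokens, seconds, reported in rows:
    derived = tokens / seconds
    assert abs(derived - reported) < 10, (tokens, derived, reported)
print("all rows consistent")
```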

Quick Start Commands

# Start server
cd /path/to/vllm-docker
docker compose up -d

# Check status
curl http://localhost:8888/v1/models
docker logs vllm-qwen --tail 50

# Stop server
docker compose down

API Usage

Available Endpoints

Endpoint Method Description
/v1/models GET List available models
/v1/chat/completions POST Chat completion
/v1/completions POST Text completion
/health GET Health check
/metrics GET Prometheus metrics

Python Client Example

#!/usr/bin/env python3
"""Test script for vLLM server."""
import requests

BASE_URL = "http://localhost:8888/v1"
MODEL = "QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ"

response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Content-Type": "application/json"},
    json={
        "model": MODEL,
        "messages": [
            {"role": "user", "content": "Hello! Please introduce yourself briefly."}
        ],
        "max_tokens": 256,
        "temperature": 0.7,
    },
    timeout=120,
)

if response.status_code == 200:
    data = response.json()
    print(data["choices"][0]["message"]["content"])
else:
    print(f"Error: {response.status_code}")
    print(response.text)

Using with OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8888/v1",
    api_key="not-needed"  # vLLM doesn't require API keys
)

response = client.chat.completions.create(
    model="QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ",
    messages=[
        {"role": "user", "content": "Explain quantum computing in one paragraph."}
    ],
    max_tokens=200
)

print(response.choices[0].message.content)

Optimization Tips

Memory Optimization

  1. Adjust GPU memory utilization based on your workload:

    --gpu-memory-utilization 0.8  # For single model, aggressive
    --gpu-memory-utilization 0.5  # For multi-tenant, conservative
    
  2. Use quantized models (AWQ, GPTQ) to reduce memory footprint

  3. Tune batch sizes for your typical request patterns:

    --max-num-seqs 32  # Lower for memory-constrained
    --max-num-batched-tokens 4096  # Lower for latency-sensitive
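A rough KV-cache estimate helps pick these numbers. Per token, the cache stores one key and one value vector for each layer. The dimensions below (48 layers, 4 KV heads of size 128, FP16 cache) are assumptions for a Qwen3-30B-A3B-class model rather than the confirmed config, but they do reproduce the 62.16 GiB figure from the resource table above:

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 dtype_bytes: int, tokens: int) -> float:
    """Approximate KV cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes."""
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return per_token * tokens / 1024**3

# Assumed dimensions; 678,912 tokens is the capacity vLLM reported at startup.
print(f"{kv_cache_gib(48, 4, 128, 2, 678_912):.2f} GiB")  # → 62.16 GiB
```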
    

Latency Optimization

  1. Use FlashAttention for faster prefill (selected via an environment variable, as in the compose file above, not a CLI flag):

    VLLM_ATTENTION_BACKEND=FLASH_ATTN
    
  2. Enable prefix caching for repeated prompts:

    --enable-prefix-caching
    
  3. Consider speculative decoding for faster generation:

    --speculative-model [smaller-model]
    --num-speculative-tokens 4
    

Production Considerations

High Availability

  • Deploy multiple vLLM instances behind a load balancer
  • Use health checks for automatic failover
  • Implement request queuing for burst handling

Monitoring

  • Enable Prometheus metrics: /metrics endpoint
  • Monitor GPU memory utilization
  • Track request latency and throughput
  • Set up alerts for error rates

Security

  • Bind to localhost only (127.0.0.1) for internal services
  • Use reverse proxy (nginx, Traefik) with TLS for external access
  • Implement rate limiting
  • Consider authentication for multi-tenant deployments

Self-hosting LLM inference puts you in control of your AI infrastructure. With vLLM and proper hardware, you can achieve production-grade performance while maintaining data privacy and predictable costs. Start with the basic configuration, benchmark your workloads, and optimize based on your specific needs.