# Self-Hosted LLM Inference: A Complete vLLM Setup Guide

Running your own LLM inference server gives you complete control over data privacy, latency, and costs. This guide walks through deploying a production-ready vLLM server on NVIDIA DGX Spark hardware, with real-world troubleshooting tips from actual deployment experience.

## Why Self-Host LLM Inference?

Before diving into the technical setup, consider the benefits of self-hosting:

  • Data Privacy: Sensitive data never leaves your infrastructure
  • Predictable Costs: No per-token API charges for heavy workloads
  • Low Latency: Local inference avoids network round-trips to an external API
  • Model Freedom: Run any model, including fine-tuned variants
  • No Rate Limits: Scale horizontally without API throttling

## Hardware Platform: NVIDIA DGX Spark (GB10)

This guide is based on deployment experience with the ASUS Ascent GX10, powered by NVIDIA's DGX Spark platform featuring the GB10 Grace Blackwell Superchip.

### Key Specifications

The GB10 is a high-performance AI-focused system-on-a-chip (SoC) designed for desktop AI workstations:

| Component | Specification |
| ----------- | --------------- |
| CPU | 20-core ARM v9.2-A (10× Cortex-X925 @ 3 GHz + 10× Cortex-A725 @ 2 GHz) |
| GPU | Blackwell architecture, 6,144 shaders, 5th Gen Tensor Cores, 4th Gen RT Cores |
| AI Performance | 1,000 TOPS FP4 (NVFP4), 31.03 TFLOPS FP32 |
| Memory | 128 GB LPDDR5X-9400 (256-bit bus, 273–301 GB/s bandwidth) |
| Interconnect | NVLink-C2C (600 GB/s bidirectional CPU↔GPU) |
| Cache | 32 MB L3 + 24 MB GPU L2 + 16 MB L4 system cache |
| Power | 140 W TDP |
| Form Factor | 150 mm × 150 mm × 50.5 mm desktop |
| Storage | Up to 4 TB NVMe SSD |
| Connectivity | HDMI 2.1a, 4× USB-C, 10 GbE, 200 Gbps ConnectX-7, Wi-Fi 7, BT 5.4 |

The GB10 Grace Blackwell Superchip is optimized for inference workloads with:

  • Native FP4, FP8, and INT4 support for efficient quantization
  • Transformer Engine acceleration
  • Unified coherent memory architecture
  • High memory bandwidth for large context windows

## vLLM: The Inference Engine

vLLM is a high-performance LLM inference engine that provides:

  • PagedAttention: Efficient memory management for KV cache
  • Continuous Batching: Dynamic request batching
  • Optimized Kernels: FlashAttention, FlashInfer integration
  • OpenAI-Compatible API: Drop-in replacement for OpenAI clients

## Docker Deployment

The recommended deployment method uses Docker for reproducibility and isolation.

### Docker Compose Configuration

```yaml
services:
  vllm:
    image: vllm-node:latest
    container_name: vllm-qwen
    restart: unless-stopped
    runtime: nvidia
    ports:
      - "0.0.0.0:8888:8888"
    volumes:
      - ./templates/qwen3_chat.jinja:/templates/qwen3_chat.jinja:ro
      - ./data/sharegpt.json:/data/sharegpt.json:ro
      - ~/.cache/huggingface:/root/.cache/huggingface
    ipc: host
    shm_size: 32g
    environment:
      - VLLM_USE_FLASHINFER_MOE_FP8=1
      - VLLM_FLASHINFER_MOE_BACKEND=latency
      - VLLM_ATTENTION_BACKEND=FLASH_ATTN
      - VLLM_USE_DEEP_GEMM=0
    command:
      - vllm
      - serve
      - QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ
      - --port
      - "8888"
      - --host
      - 0.0.0.0
      - --max-model-len
      - "131072"
      - --gpu-memory-utilization
      - "0.7"
      - --load-format
      - fastsafetensors
      - --max-num-seqs
      - "64"
      - --max-num-batched-tokens
      - "8192"
      - --trust-remote-code
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_coder
      - --chat-template-kwargs
      - '{"enable_thinking": false}'
```

### Configuration Parameters Explained

| Parameter | Value | Purpose |
| ----------- | ------- | --------- |
| `--max-model-len` | 131072 | 128K token context window |
| `--gpu-memory-utilization` | 0.7 | Let vLLM use up to 70% of GPU memory (weights, activations, and KV cache) |
| `--load-format` | fastsafetensors | Fast model loading |
| `--max-num-seqs` | 64 | Maximum concurrent sequences |
| `--max-num-batched-tokens` | 8192 | Token batch limit for prefill |
| `--enable-auto-tool-choice` | flag | Enable function calling |
| `--tool-call-parser` | qwen3_coder | Tool call format parser |

### Resource Allocation

With this configuration on GB10 hardware:

| Resource | Allocation |
| ---------- | ------------ |
| Model Memory | ~17 GiB |
| KV Cache | 62.16 GiB |
| KV Cache Tokens | 678,912 |
| Max Concurrency | 5.18x (at 131K context) |
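
The Max Concurrency figure follows from dividing the KV cache token capacity by the context length. A quick check in Python, using only the numbers from the tables above (the variable names are illustrative):

```python
# Figures taken from the tables above.
kv_cache_tokens = 678_912   # KV Cache Tokens
max_model_len = 131_072     # --max-model-len
kv_cache_gib = 62.16        # KV Cache size in GiB

# How many full-length (131,072-token) sequences fit in the KV cache at once.
max_concurrency = kv_cache_tokens / max_model_len

# Average KV cache cost per token, converted from GiB to KiB.
kib_per_token = kv_cache_gib * 1024 * 1024 / kv_cache_tokens

print(f"max concurrency: {max_concurrency:.2f}x")       # 5.18x
print(f"KV cache per token: {kib_per_token:.1f} KiB")   # 96.0 KiB
```

Requests shorter than the full context occupy fewer KV cache tokens, so the number of requests that fit at typical prompt lengths is higher, up to the `--max-num-seqs` limit of 64.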

## Common Issues and Solutions

### Issue 1: Tool Choice Configuration Error

**Error:**

```text
"auto" tool choice requires --enable-auto-tool-choice and --tool-call-parser to be set
```

**Solution:**
Add both flags to enable tool calling:

```bash
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
```

**Note:** Qwen3-VL models DO support tool calling with the `qwen3_coder` parser, despite some documentation suggesting otherwise.
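
With both flags set, a request can pass `tools` together with `tool_choice` set to `"auto"`. The sketch below builds such a request body in the OpenAI function-calling schema; the `get_weather` tool and its parameters are hypothetical and exist only for illustration.

```python
import json

MODEL = "QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ"

def build_tool_request(user_message: str) -> dict:
    """Build a chat-completions body that lets the model decide whether to call a tool."""
    # Hypothetical tool definition in the OpenAI function-calling schema.
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [weather_tool],
        "tool_choice": "auto",  # the setting that triggers the error without both flags
    }

body = build_tool_request("What is the weather in Berlin?")
print(json.dumps(body, indent=2))
```

POST this body to `/v1/chat/completions`. When the model decides to call the tool, the call appears under `choices[0].message.tool_calls` instead of plain text content.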

### Issue 2: Context Window Overflow

**Error:**

```text
You passed 23643 input tokens and requested 32000 output tokens.
However, the model's context length is only 32768.
```

**Solution:**
Increase `--max-model-len` to accommodate both input and output:

```bash
--max-model-len 131072  # 128K tokens
```
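
A client can also guard against this error before sending a request. The helper below is a minimal sketch (the function name is an assumption, not part of vLLM) that caps the requested output so input plus output always fits the configured context length:

```python
def clamp_max_tokens(input_tokens: int, requested_output: int, max_model_len: int) -> int:
    """Return the largest output budget that still fits in the context window."""
    remaining = max_model_len - input_tokens
    if remaining <= 0:
        raise ValueError(
            f"prompt ({input_tokens} tokens) already fills the {max_model_len}-token context"
        )
    return min(requested_output, remaining)

# The failing request from the error above: 23,643 + 32,000 > 32,768.
print(clamp_max_tokens(23_643, 32_000, 32_768))    # 9125
# With --max-model-len 131072 the full 32,000-token budget fits.
print(clamp_max_tokens(23_643, 32_000, 131_072))   # 32000
```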

### Issue 3: Thinking Mode Output Leakage

**Symptom:**
The model emits internal reasoning markers such as `<think>` and `</think>`, or exposes its chain-of-thought in the response.

**Cause:**
Qwen3 models default to "thinking mode", which emits reasoning before the final answer.

**Solution:**
Disable at server level:

```bash
--chat-template-kwargs '{"enable_thinking": false}'
```

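Or per request, by passing `chat_template_kwargs` in the JSON body (with the OpenAI Python SDK, wrap the same object in `extra_body`):

```json
{
  "model": "QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ",
  "messages": [...],
  "chat_template_kwargs": {"enable_thinking": false}
}
```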

### Issue 4: Triton Kernels Warning

**Warning:**

```text
Failed to import Triton kernels. No module named 'triton_kernels.routing'
```

**Impact:** None - this is a warning only. vLLM falls back to FLASH_ATTN backend.

**Optional Fix:**

```bash
pip install triton-kernels@git+https://github.com/triton-lang/triton.git@v3.5.0#subdirectory=python/triton_kernels
```

### Issue 5: NVIDIA Driver 590.x Compatibility

**Problem:**
Driver 590.x introduces compatibility issues with GB10 Blackwell systems.

**Root Causes:**

- Library renaming (`libnvidia-compute` → `libnvidia_compute`)
- Incomplete sm_121 compute capability support
- FlashInfer kernel compilation failures

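**Recommended Solution:**
Avoid the 590.x series and pin a driver series you have validated with vLLM on this hardware. The series recommended here is 535.x (stable, tested with vLLM); confirm in NVIDIA's release notes that the series you pin supports GB10 (sm_121) before installing.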

**Verification:**

```bash
nvidia-smi  # Check driver version
nvcc --version  # Verify CUDA compatibility
```

## Performance Benchmarks

### Throughput Results

| Test | Concurrency | Tokens/Request | Throughput | Latency |
| ------ | ------------- | ---------------- | ------------ | --------- |
| Sequential | 1 | 128 | 868.81 tok/s | 3.19s |
| Concurrent | 8 | 128 | 315.85 tok/s | 3.24s |
| Sustained | 16 | 256 | 387.71 tok/s | 10.23s |
| Long-form | 4 | 512 | 187.85 tok/s | 10.90s |
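
The throughput column for the concurrent tests can be approximately reproduced from the other columns, assuming throughput means total generated tokens divided by wall-clock latency and that every request produces its full token budget. A short sketch of that arithmetic (the function name is illustrative):

```python
# (test name, concurrency, tokens per request, latency in seconds, reported tok/s)
rows = [
    ("Concurrent", 8, 128, 3.24, 315.85),
    ("Sustained", 16, 256, 10.23, 387.71),
    ("Long-form", 4, 512, 10.90, 187.85),
]

def aggregate_throughput(concurrency: int, tokens_per_request: int, latency_s: float) -> float:
    """Total generated tokens across all requests divided by wall-clock time."""
    return concurrency * tokens_per_request / latency_s

for name, conc, toks, latency, reported in rows:
    estimate = aggregate_throughput(conc, toks, latency)
    print(f"{name:<10} estimated {estimate:7.1f} tok/s, reported {reported:7.2f} tok/s")
```

The Sequential row does not fit this relation (1 × 128 tokens over 3.19 s is about 40 tok/s), so its throughput figure was presumably measured differently.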

### Context Window Scaling

| Prompt Size | Actual Tokens | Prefill Time | Prefill Speed |
| ------------- | --------------- | -------------- | --------------- |
| 10K | 10,387 | 1.80s | 5,768 tok/s |
| 30K | 31,547 | 7.15s | 4,413 tok/s |
| 50K | 53,307 | 12.66s | 4,211 tok/s |
| 80K | 85,947 | 29.06s | 2,958 tok/s |
| 100K | 107,707 | 26.43s | 4,074 tok/s |
| 120K | 129,467 | 32.03s | 4,042 tok/s |

## Quick Start Commands

```bash
# Start server
cd /path/to/vllm-docker
docker compose up -d

# Check status
curl http://localhost:8888/v1/models
docker logs vllm-qwen --tail 50

# Stop server
docker compose down
```

## API Usage

### Available Endpoints

| Endpoint | Method | Description |
| ---------- | -------- | ------------- |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completion |
| `/v1/completions` | POST | Text completion |
| `/health` | GET | Health check |
| `/metrics` | GET | Prometheus metrics |

### Python Client Example

```python
#!/usr/bin/env python3
"""Test script for vLLM server."""
import requests

BASE_URL = "http://localhost:8888/v1"
MODEL = "QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ"

response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Content-Type": "application/json"},
    json={
        "model": MODEL,
        "messages": [
            {"role": "user", "content": "Hello! Please introduce yourself briefly."}
        ],
        "max_tokens": 256,
        "temperature": 0.7,
    },
    timeout=120,
)

if response.status_code == 200:
    data = response.json()
    print(data["choices"][0]["message"]["content"])
else:
    print(f"Error: {response.status_code}")
    print(response.text)
```

### Using with OpenAI SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8888/v1",
    api_key="not-needed"  # vLLM doesn't require API keys
)

response = client.chat.completions.create(
    model="QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ",
    messages=[
        {"role": "user", "content": "Explain quantum computing in one paragraph."}
    ],
    max_tokens=200
)

print(response.choices[0].message.content)
```

## Optimization Tips

### Memory Optimization

1. **Adjust GPU memory utilization** based on your workload:

   ```bash
   --gpu-memory-utilization 0.8  # For single model, aggressive
   --gpu-memory-utilization 0.5  # For multi-tenant, conservative
   ```

2. **Use quantized models** (AWQ, GPTQ) to reduce memory footprint.

3. **Tune batch sizes** for your typical request patterns:

   ```bash
   --max-num-seqs 32  # Lower for memory-constrained
   --max-num-batched-tokens 4096  # Lower for latency-sensitive
   ```

### Latency Optimization

1. **Use FlashAttention** for faster prefill:

   ```bash
   --attention-backend FLASH_ATTN
   ```

2. **Enable prefix caching** for repeated prompts:

   ```bash
   --enable-prefix-caching
   ```

3. **Consider speculative decoding** for faster generation (recent vLLM releases take these settings as a single JSON option):

   ```bash
   --speculative-config '{"model": "[smaller-model]", "num_speculative_tokens": 4}'
   ```

## Production Considerations

### High Availability

  • Deploy multiple vLLM instances behind a load balancer
  • Use health checks for automatic failover
  • Implement request queuing for burst handling
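
In production a load balancer such as nginx or Traefik normally handles health-checked routing. To make the failover logic concrete, here is a small illustrative sketch of round-robin selection that skips unhealthy instances; the class name and the backend addresses are hypothetical.

```python
from typing import Callable, Sequence

class RoundRobinPool:
    """Rotate over backends, skipping any that fail their health check."""

    def __init__(self, backends: Sequence[str], is_healthy: Callable[[str], bool]):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._is_healthy = is_healthy
        self._next = 0

    def pick(self) -> str:
        """Return the next healthy backend, or raise if none respond."""
        for _ in range(len(self._backends)):
            candidate = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if self._is_healthy(candidate):
                return candidate
        raise RuntimeError("no healthy vLLM backend available")

# Hypothetical instance addresses; the second one is simulated as down.
down = {"http://10.0.0.2:8888"}
pool = RoundRobinPool(
    ["http://10.0.0.1:8888", "http://10.0.0.2:8888", "http://10.0.0.3:8888"],
    is_healthy=lambda url: url not in down,
)
print([pool.pick() for _ in range(4)])
```

In a real deployment `is_healthy` would issue a GET against each instance's `/health` endpoint rather than consult a fixed set.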

### Monitoring

  • Scrape Prometheus metrics from the /metrics endpoint
  • Monitor GPU memory utilization
  • Track request latency and throughput
  • Set up alerts for error rates
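
For a quick check without a full Prometheus setup, the text returned by `/metrics` can be parsed directly. The sketch below handles the basic Prometheus text format and assumes samples carry no timestamps; the sample text, the `demo` label value, and the threshold of 10 are made up for illustration, so check your server's actual `/metrics` output for the names it exposes.

```python
def parse_prometheus(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition format into {sample_name_with_labels: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments and blank lines
            continue
        # The value is the last space-separated field (assumes no timestamps).
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples

# Illustrative sample only; real output contains many more series.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
vllm:num_requests_waiting{model_name="demo"} 12.0
"""

metrics = parse_prometheus(SAMPLE)
waiting = metrics['vllm:num_requests_waiting{model_name="demo"}']
if waiting > 10:  # hypothetical alert threshold
    print(f"alert: {waiting:.0f} requests are queued")
```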

### Security

  • Bind to localhost only (127.0.0.1) for internal services
  • Use reverse proxy (nginx, Traefik) with TLS for external access
  • Implement rate limiting
  • Consider authentication for multi-tenant deployments

Self-hosting LLM inference puts you in control of your AI infrastructure. With vLLM and proper hardware, you can achieve production-grade performance while maintaining data privacy and predictable costs. Start with the basic configuration, benchmark your workloads, and optimize based on your specific needs.
