Single DGX Spark Deployment Recipe

Your Outcome

By the end of this recipe you'll have a production-ready vLLM inference server running on your DGX Spark, serving models from 14B to 70B parameters. No guesswork, no trial-and-error tuning — just copy, paste, deploy.

The Problem

The DGX Spark (GB10) is a 128GB unified memory workstation, but the internet is full of conflicting advice on model sizing, memory utilization flags, and tensor parallelism. You can waste hours — or days — tweaking gpu-memory-utilization values, hitting OOMs, and debugging Docker configurations that worked for the author but not for you. There's no single authoritative source that tells you exactly what works on this specific hardware.

What This Recipe Does

This is a distilled, battle-tested recipe for deploying vLLM on a single DGX Spark. It covers three model sizes (14B BF16, 32B BF16, 70B INT4) with exact configurations that fit within 128GB unified memory. You get the Docker commands, systemd service files, memory tuning tables, and a LiteLLM proxy setup — everything you need to go from zero to inference in under five minutes.
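
As a rough sanity check on why these three configurations are plausible on 128GB, the weight footprint can be estimated from parameter count and bits per parameter. This is a back-of-the-envelope sketch, not the recipe's tuning table: the function name and the 1 GB = 1e9 bytes convention are assumptions made for illustration, and the estimate ignores KV cache, activations, quantization metadata, and the memory the operating system takes from the same unified pool.

```python
# Back-of-the-envelope weight sizing for the three configurations named above.
# Assumption: 1 GB = 1e9 bytes; only raw weights are counted.

TOTAL_MEMORY_GB = 128  # DGX Spark unified memory, shared by CPU and GPU

# name -> (parameters in billions, bits per parameter)
CONFIGS = {
    "14B BF16": (14, 16),
    "32B BF16": (32, 16),
    "70B INT4": (70, 4),
}


def weight_footprint_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate size of the model weights alone, in GB."""
    bytes_per_param = bits_per_param / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bytes_per_param


def summarize() -> list[tuple[str, float, float]]:
    """Return (name, weight GB, remaining GB) for each configuration."""
    rows = []
    for name, (params, bits) in CONFIGS.items():
        weights = weight_footprint_gb(params, bits)
        rows.append((name, weights, TOTAL_MEMORY_GB - weights))
    return rows


for name, weights, remaining in summarize():
    print(f"{name}: ~{weights:.0f} GB weights, ~{remaining:.0f} GB left for everything else")
```

The remaining figure is an upper bound on what is available for KV cache and the rest of the system, which is why the memory utilization setting still has to be tuned per configuration rather than read off this arithmetic.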

What You'll Be Able To Do

Deploy 14B to 70B models on vLLM: exact flags that fit your DGX Spark's 128GB memory envelope
Tune memory utilization by model size: know which --gpu-memory-utilization value works for each model size and precision
Run as a systemd service: production-grade auto-restart, logging, and dependency management
Set up a LiteLLM proxy: API key management, rate limiting, and model routing for multi-model setups
Health-check your deployment: curl endpoints, readiness probes, and monitoring basics
Mount model caches efficiently: volume mounts that don't waste RAM on duplicated weights

Who Will Benefit

Engineers who just got a DGX Spark and want to serve models immediately
Teams deploying private LLM inference on a single workstation
Anyone tired of guessing vLLM flags for 128GB unified memory hardware

What Success Looks Like

You have a running vLLM server at localhost:8000 that responds to chat completions in under 30 seconds from cold start. Models are cached, memory is stable, and your systemd service survives reboots. You can switch between 14B, 32B, and 70B models by changing a single flag.
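
One way to check the cold-start target is to poll the server until it reports healthy and measure how long that took. The sketch below is an illustration rather than part of the recipe: it assumes the server is vLLM's OpenAI-compatible server at localhost:8000 and that its /health endpoint returns HTTP 200 once it is ready. The function names, the 30-second default, and the injectable probe, sleep, and clock parameters are hypothetical choices made so the loop can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(
    base_url: str = "http://localhost:8000",
    deadline_s: float = 30.0,   # assumption: the 30-second target quoted above
    interval_s: float = 1.0,
    probe=http_ok,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Poll `<base_url>/health` until it succeeds.

    Returns the elapsed seconds on success, or None if the deadline passes.
    """
    start = clock()
    while True:
        if probe(f"{base_url}/health"):
            return clock() - start
        if clock() - start >= deadline_s:
            return None
        sleep(interval_s)
```

Calling wait_until_ready() right after starting the service returns the number of seconds until the first successful health check, or None if the server did not come up within the deadline. A healthy /health response only shows the server process is up, so a real chat completion request is still worth sending as a final check.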

Format & Delivery

Self-contained README (inside zip archive)
Copy-paste Docker commands and systemd configuration
Memory tuning table for the 14B BF16, 32B BF16, and 70B INT4 configurations
LiteLLM proxy setup guide
Immediate digital download (.zip)
