Single DGX Spark Deployment Recipe
Deploy LLMs on a single NVIDIA DGX Spark (GB10) with 128GB unified memory — from zero to serving in under 5 minutes.
Digital Download: $39.00
## Your Outcome
By the end of this recipe you'll have a production-ready vLLM inference server running on your DGX Spark, serving models from 14B to 70B parameters. No guesswork, no trial-and-error tuning — just copy, paste, deploy.
## The Problem
The DGX Spark (GB10) is a 128GB unified memory workstation, but the internet is full of conflicting advice on model sizing, memory utilization flags, and tensor parallelism. You can waste hours, or even days, tweaking `--gpu-memory-utilization` values, hitting out-of-memory (OOM) errors, and debugging Docker configurations that worked for the author but not for you. There's no single authoritative source that tells you exactly what works on this specific hardware.
## What This Recipe Does
This is a distilled, battle-tested recipe for deploying vLLM on a single DGX Spark. It covers three model sizes (14B BF16, 32B BF16, 70B INT4) with exact configurations that fit within 128GB unified memory. You get the Docker commands, systemd service files, memory tuning tables, and a LiteLLM proxy setup — everything you need to go from zero to inference in under five minutes.
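To see why these three precisions were chosen, it helps to run the weight-size arithmetic. The sketch below is a back-of-the-envelope estimate, not the recipe's tuning table: it assumes 2 bytes per parameter for BF16 and 0.5 bytes for INT4, and it ignores the KV cache, activations, quantization scales, and runtime overhead, all of which need additional headroom.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: weights only; KV cache and runtime overhead are NOT included.

BYTES_PER_PARAM = {"bf16": 2.0, "int4": 0.5}
UNIFIED_MEMORY_GB = 128  # DGX Spark (GB10) unified memory


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    # (params_billion * 1e9 params) * bytes / 1e9 bytes-per-GB
    return params_billion * BYTES_PER_PARAM[precision]


for size, precision in [(14, "bf16"), (32, "bf16"), (70, "int4")]:
    gb = weight_gb(size, precision)
    share = gb / UNIFIED_MEMORY_GB
    print(f"{size}B {precision}: ~{gb:.0f} GB weights ({share:.0%} of memory)")
```

Under these assumptions the weights come to roughly 28 GB, 64 GB, and 35 GB. A 70B model in BF16 would need about 140 GB for weights alone, which exceeds the 128GB envelope and is why the 70B configuration is quantized.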
## What You'll Be Able To Do
- **Deploy any 14B–70B model on vLLM** — with exact flags that fit your DGX Spark's 128GB memory envelope
- **Tune memory utilization by model size**: know which `--gpu-memory-utilization` value works for each model size and precision
- **Run as a systemd service** — production-grade auto-restart, logging, and dependency management
- **Set up a LiteLLM proxy** — API key management, rate limiting, model routing for multi-model setups
- **Health-check your deployment** — curl endpoints, readiness probes, and monitoring basics
- **Mount model caches efficiently** — volume mounts that don't waste RAM on duplicated weights
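As a rough illustration of what "exact flags" means in practice, the sketch below assembles a `vllm serve` command line from a few parameters. The flags shown (`--gpu-memory-utilization`, `--max-model-len`, `--port`) are standard vLLM options, but the model ID and the numeric values are placeholders chosen for this example, not the recipe's tuned settings. The helper name `vllm_serve_args` is hypothetical.

```python
import shlex
from typing import List


def vllm_serve_args(model: str, gpu_memory_utilization: float,
                    max_model_len: int, port: int = 8000) -> List[str]:
    """Build an argv list for `vllm serve` (hypothetical helper)."""
    # vLLM expects a fraction of device memory in the range (0, 1].
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return [
        "vllm", "serve", model,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
        "--port", str(port),
    ]


# Placeholder values only; substitute the ones from the tuning table.
args = vllm_serve_args("your-org/your-14b-model", 0.80, 8192)
print(shlex.join(args))
```

Building the command as a list rather than a string keeps quoting safe when the same arguments are passed to `subprocess.run` or pasted into a Docker or systemd unit.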
## Who Will Benefit
- Engineers who just got a DGX Spark and want to serve models immediately
- Teams deploying private LLM inference on a single workstation
- Anyone tired of guessing vLLM flags for 128GB unified memory hardware
## What Success Looks Like
You have a running vLLM server at `localhost:8000` that responds to chat completion requests in under 30 seconds from a cold start. Models are cached, memory is stable, and your systemd service survives reboots. You can switch between 14B, 32B, and 70B models by changing a single flag.
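One way to check the cold-start figure on your own machine is to poll the server until it reports ready and record how long that took. The sketch below is a minimal example using only the Python standard library. It assumes the server exposes vLLM's `/health` endpoint at `http://localhost:8000`; the function names, the 300-second deadline, and the 2-second polling interval are illustrative choices, not part of the recipe.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or non-2xx status: not ready yet.
        return False


def wait_until_ready(base_url: str = "http://localhost:8000",
                     deadline_s: float = 300.0,
                     interval_s: float = 2.0,
                     probe: Callable[[str], bool] = http_ok) -> Optional[float]:
    """Poll <base_url>/health; return seconds waited, or None on timeout."""
    start = time.monotonic()
    while True:
        if probe(f"{base_url}/health"):
            return time.monotonic() - start
        if time.monotonic() - start >= deadline_s:
            return None
        time.sleep(interval_s)
```

Calling `wait_until_ready()` right after starting the service returns the elapsed seconds once the server is up, or `None` if the deadline passes. The `probe` parameter exists so the polling logic can be exercised without a live server.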
## Format & Delivery
- Self-contained README (inside zip archive)
- Copy-paste Docker commands and systemd configuration
- Memory tuning table for the 14B BF16, 32B BF16, and 70B INT4 configurations
- LiteLLM proxy setup guide
- Immediate digital download (.zip)