Orchestrating 25+ LLMs Through a Single Proxy
Running one LLM through a single API key is straightforward. Running 25+ models across three providers — local GPUs, institutional HPC clusters, and commercial APIs — without coupling any agent to any specific model is an infrastructure problem. Here's how I solved it with LiteLLM, OpenCode, and Oh-My-OpenAgent.
The Core Problem
Every provider has its own API surface. SAIA uses OpenAI-compatible endpoints, but with rate limits that reset at midnight. Z.ai serves GLM models through a Chinese CDN with different latency characteristics. Local GPU nodes running Qwen3.5 397B have no rate limits, but they become unreachable when the VPN drops.
Hard-coding any agent to any provider means you're coupling your workflow to someone else's uptime schedule. The embedding model SAIA offered was silently removed last month — every chunk in the knowledge graph failed to embed because the code expected a response that never came.
LiteLLM: One Endpoint to Rule Them All
LiteLLM Proxy sits at http://127.0.0.1:4000/v1 and presents an OpenAI-compatible interface. Every agent in the system sees one endpoint, one authentication method, one retry policy. Behind the proxy, requests fan out to SAIA, Z.ai, or local GPU nodes.
```yaml
# litellm_config.yaml (simplified)
model_list:
  - model_name: glm-5-turbo
    litellm_params:
      model: openai/glm-5-turbo
      api_base: https://open.bigmodel.cn/api/paas/v4
      api_key: os.environ/ZAI_API_KEY
  - model_name: saia/gpt-oss-120b
    litellm_params:
      model: openai/gpt-oss-120b
      api_base: https://chat-ai.academiccloud.de/v1
      api_key: os.environ/SAIA_API_KEY
  - model_name: qwen3.5-397b
    litellm_params:
      model: ollama/qwen3.5:397b
      api_base: http://gpu-node:11434
```
The agent config just references aliases:
```json
{
  "provider": {
    "litellm": {
      "models": {
        "glm-5-turbo": { "name": "GLM 5 Turbo (Z.ai)" },
        "saia/gpt-oss-120b": { "name": "SAIA GPT OSS 120B" },
        "qwen3.5-397b": { "name": "Qwen3.5 397B (local)" }
      },
      "options": {
        "baseURL": "http://127.0.0.1:4000/v1",
        "timeout": 120000,
        "maxRetries": 5
      }
    }
  }
}
```
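From the agent's point of view, every request therefore has the same shape regardless of which provider ends up serving it. A minimal sketch of that payload (the helper name is mine; authentication is whatever key the proxy is configured with):

```python
import json

def build_chat_request(model: str, prompt: str) -> dict:
    """The JSON body an agent POSTs to http://127.0.0.1:4000/v1/chat/completions.
    `model` is a proxy alias from litellm_config.yaml, never a provider URL."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_chat_request("glm-5-turbo", "Summarise this diff")
body = json.dumps(payload)  # sent with an Authorization: Bearer <proxy key> header
```

Swapping the backing model is a one-line change to the proxy config; this payload never changes.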
Fallback Chains
When SAIA hits its 30 requests/minute rate limit, the proxy doesn't return a 429 to the agent. It routes to the next model in the fallback chain:
```yaml
router_settings:
  routing_strategy: "simple-shuffle"
  fallbacks:
    - "glm-5-turbo": ["saia/glm-4.7", "qwen3.5-397b"]
    - "saia/gpt-oss-120b": ["glm-5-turbo", "saia/qwen3.5-122b-a10b"]
```
The agent never knows a fallback happened. It receives a completion either way.
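Conceptually, the routing the proxy applies boils down to a few lines (a simplification, not LiteLLM's actual implementation; the backend callables stand in for real provider calls):

```python
class RateLimited(Exception):
    """Stand-in for an upstream HTTP 429."""

FALLBACKS = {
    "glm-5-turbo": ["saia/glm-4.7", "qwen3.5-397b"],
    "saia/gpt-oss-120b": ["glm-5-turbo", "saia/qwen3.5-122b-a10b"],
}

def complete(model, prompt, backends, chains=FALLBACKS):
    """Try the requested model, then walk its fallback chain in order.
    `backends` maps model name -> callable(prompt) -> completion text."""
    for candidate in [model] + chains.get(model, []):
        if candidate not in backends:
            continue
        try:
            return candidate, backends[candidate](prompt)
        except RateLimited:
            continue  # try the next model; the caller never sees the 429
    raise RuntimeError(f"all backends exhausted for {model}")

def rate_limited_backend(prompt):
    raise RateLimited

# The primary is saturated, so the first fallback answers transparently.
served_by, text = complete(
    "glm-5-turbo",
    "hello",
    {"glm-5-turbo": rate_limited_backend, "saia/glm-4.7": lambda p: f"ok: {p}"},
)
```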
Supporting Infrastructure
LiteLLM runs behind Docker Compose with PostgreSQL for request logging and Redis for prompt caching:
```yaml
# docker-compose.yml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports: ["4000:4000"]
    volumes: ["./litellm_config.yaml:/app/config.yaml"]
    environment:
      - DATABASE_URL=postgresql://...
      - REDIS_URL=redis://redis:6379
  postgresql:
    image: pgvector/pgvector:pg16
  redis:
    image: redis:7-alpine
```
Every API call is logged with token counts, latency, cost, and model routing decisions. When a model disappears (as SAIA's embedding endpoint did), you see it immediately in the logs rather than discovering it through cascading agent failures.
OpenCode: The Agent Layer
OpenCode is a CLI-based coding agent — no editor plugin, no GUI. It reads files, runs shell commands, queries LSP servers, and modifies code through a text interface. The key architectural decision: OpenCode doesn't know which model it's talking to. It sends every request through the LiteLLM proxy and receives completions from whichever model the proxy selects.
Plugin System
Three plugins extend OpenCode's capabilities:
| Plugin | Function |
|---|---|
| oh-my-openagent | Multi-agent orchestration (10 agents, 10 categories) |
| @tarquinen/opencode-dcp | Dynamic Context Protocol — context-sensitive tool management |
| opencode-morph-plugin | Runtime configuration adjustments |
A separate SAIA plugin automatically fetches all available models from the SAIA API and generates an opencode.json configuration. It marks reasoning models with can_reason: true and can optionally route them through the LiteLLM proxy for caching and fallback support.
MCP Integrations
OpenCode connects to external tools through the Model Context Protocol:
- Inkscape — SVG creation and manipulation for graphics workflows
- CodeGraphContext — Code-indexed knowledge graph (Neo4j), queried directly from the agent via Cypher
CodeGraphContext (CGC) indexes code repositories into a Neo4j graph database. Currently 4 repositories are indexed: 131 files, 736 functions, 86 classes. The agent can query call chains, class hierarchies, and cross-repository relationships without leaving the conversation. CGC runs as an MCP server, connecting to Neo4j through an SSH tunnel.
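A call-chain query of the kind the agent can issue might look like this (the node labels, relationship type, and function name are assumptions about CGC's schema, not taken from its documentation):

```cypher
// Hypothetical schema: Function nodes linked by CALLS relationships.
// Find everything reachable from a given entry point, up to 3 hops.
MATCH (f:Function {name: "schedule_embeddings"})-[:CALLS*1..3]->(callee:Function)
RETURN DISTINCT callee.name, callee.file
```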
Oh-My-OpenAgent: Multi-Agent Orchestration
The real power emerges when you stop thinking about "one agent, one model" and start routing tasks to specialised sub-agents. Oh-My-OpenAgent provides this orchestration layer on top of OpenCode.
Agent Hierarchy
10 agents, each with a model matched to its cognitive requirements:
| Agent | Role | Model | Why |
|---|---|---|---|
| Sisyphus | Orchestrator | GLM 5 Turbo | Strong reasoning for delegation decisions |
| Sisyphus-Junior | Worker | Devstral 2 123B | Code generation, cheaper than the orchestrator |
| Prometheus | Planner | GLM 4.7 | Structured work breakdowns |
| Oracle | Consultant | GPT OSS 120B | High-IQ read-only analysis |
| Librarian | Reference search | Qwen3.5 35B | External docs and API research |
| Explore | Code search | Gemma 4 26B | Fast local grep (runs locally, no VPN needed) |
| Metis | Pre-planning | GLM 5 Turbo | Ambiguity detection and edge case analysis |
| Momus | Reviewer | Qwen3.5 122B | Plan quality assurance |
| Atlas | Worker | GLM 5 Turbo | Implementation execution |
| Multimodal-Looker | Vision | GLM 4.6V | Image analysis, screenshots |
Category-Based Routing
When Sisyphus delegates a task, it assigns a category. Each category maps to an optimal model:
| Category | Model | Use case |
|---|---|---|
| local-quick | Gemma 4 26B | Trivial local tasks (fast, free, no VPN) |
| local-deep | Qwen3.5 397B | Heavy local tasks (no internet needed) |
| quick | Gemma 4 26B | Single-file typo fixes |
| deep | GLM 4.7 | Complex multi-file implementation |
| ultrabrain | GLM 5 Turbo | Hardest reasoning tasks |
| visual-engineering | Qwen3.5 122B | UI/UX, styling, animation |
| artistry | GLM 5 Turbo | Creative problem-solving |
| unspecified-high | Devstral 2 123B | General tasks, high effort |
| writing | GLM 5.1 | Documentation, articles |
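In code, the category table reduces to a lookup. A sketch (the resolution function and the fallback to unspecified-high are my assumptions, not the plugin's actual logic):

```python
# Category -> model mapping, mirroring the table above.
CATEGORY_MODELS = {
    "local-quick": "Gemma 4 26B",
    "local-deep": "Qwen3.5 397B",
    "quick": "Gemma 4 26B",
    "deep": "GLM 4.7",
    "ultrabrain": "GLM 5 Turbo",
    "visual-engineering": "Qwen3.5 122B",
    "artistry": "GLM 5 Turbo",
    "unspecified-high": "Devstral 2 123B",
    "writing": "GLM 5.1",
}

def model_for(category: str) -> str:
    # Unknown categories fall back to the general-purpose tier (assumption).
    return CATEGORY_MODELS.get(category, CATEGORY_MODELS["unspecified-high"])
```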
Up to 30 background agents run concurrently, each with its own model selection. A single user request can spawn parallel Explore agents searching different parts of the codebase while a Librarian agent researches external documentation — all simultaneously.
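The bounded fan-out can be sketched with a semaphore (illustrative only; the agent names are invented and the zero-length sleep stands in for a real model call):

```python
import asyncio

MAX_BACKGROUND_AGENTS = 30  # concurrency cap from the orchestrator

async def run_agent(name: str, sem: asyncio.Semaphore) -> str:
    async with sem:             # at most 30 agents hold a slot at once
        await asyncio.sleep(0)  # stands in for the actual model call
        return f"{name}: done"

async def fan_out(tasks: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_BACKGROUND_AGENTS)
    # gather preserves input order, so results line up with task names
    return await asyncio.gather(*(run_agent(t, sem) for t in tasks))

results = asyncio.run(fan_out(["explore-src", "explore-tests", "librarian"]))
```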
A Real Workflow
Here's what happened when I asked the system to fix a scheduler bug in a Neo4j knowledge base:
1. Sisyphus (GLM 5 Turbo) → Intent Gate: bugfix detected
2. Explore (Gemma 4 26B) → Searched codebase for scheduler code
3. Librarian (Qwen3.5 35B) → Researched SAIA embedding API documentation
4. Oracle (GPT OSS 120B) → Root cause analysis: embedding model silently removed
5. Sisyphus → Created fix plan and delegated implementation
6. Sisyphus-Junior (Devstral 2) → Implemented config changes
7. Sisyphus → Verified build, confirmed tests pass
Five different models, roughly 15 API calls, all through one LiteLLM endpoint. The agent that investigated the codebase (Gemma 4, running locally) cost nothing. The agent that analysed the root cause (GPT OSS 120B) ran on SAIA's cluster. The user typed one sentence.
ACP: Cross-Editor Orchestration
The Agent Client Protocol (ACP), standardised by Zed Industries, addresses a different problem: the coupling of agents to editors. ACP defines a JSON-RPC protocol between clients and agents — the same pattern LSP established for language servers.
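The wire format is plain JSON-RPC 2.0. A prompt request might be framed like this (treat the method name and fields as illustrative rather than a faithful rendering of the ACP spec):

```python
import itertools
import json

_ids = itertools.count(1)  # JSON-RPC requests need unique ids

def acp_request(method: str, params: dict) -> str:
    """Frame an ACP message as JSON-RPC 2.0 — the same envelope LSP uses."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": method,
        "params": params,
    })

msg = acp_request("session/prompt", {"prompt": "check for outdated dependencies"})
```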
With 30+ compatible agents and 40+ clients, ACP enables the ACP Orchestrator — a meta-orchestrator that drives multiple agents across multiple repositories:
```bash
# Batch prompt across all Python projects
acp run "check for outdated dependencies" --tags=python

# Autonomous improvement loop with session rotation
acp loop next-graphwiz-ai --max-iter 100 --rotate 25
```
Session rotation every 25 iterations prevents context saturation. The orchestrator supports five agents: OpenCode, Claude Code, Gemini CLI, Codex CLI, and Goose.
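The rotation schedule itself is simple enough to sketch (a hypothetical helper; only the reset-every-25-iterations behaviour mirrors the flags above):

```python
def improvement_loop(max_iter: int = 100, rotate: int = 25) -> list[tuple[int, int]]:
    """Return (iteration, session) pairs: a fresh session every `rotate`
    iterations, so accumulated context is dropped before it saturates."""
    schedule = []
    session = 0
    for i in range(max_iter):
        if i % rotate == 0:
            session += 1  # rotate: start a new session with empty context
        schedule.append((i, session))
    return schedule
```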
Infrastructure as Code
All of this is deployed through Ansible playbooks — 9 hosts, idempotent, reproducible:
| Playbook | Function |
|---|---|
| litellm-proxy | Docker Compose: LiteLLM + PostgreSQL + Redis |
| opencode-deploy | Build and install OpenCode from source (Go) |
| opencode-sync | Sync agent configs across all hosts |
| knowledge-graph | Neo4j + CodeGraphContext deployment |
| vpn-hub / vpn-peers | WireGuard mesh networking |
| traefik | Reverse proxy + TLS (Let's Encrypt) |
```bash
ansible-playbook site.yml --limit ai_cluster
```
What Actually Matters
The architecture has three decoupling layers:
- LiteLLM decouples models from agents. Swap any model, add any provider, remove any endpoint — agents don't change.
- Oh-My-OpenAgent decouples orchestration from implementation. The orchestrator decides what to do and who should do it. Sub-agents decide how.
- ACP decouples agents from editors. Any client can drive any agent through a standard protocol.
Each layer is independently replaceable. That's not architecture for its own sake — it's the difference between a system that breaks when one API changes and one that reroutes in milliseconds.
The full presentation is available at tobias-weiss.org/presentations.