
Orchestrating 25+ LLMs Through a Single Proxy

Tags: AI, litellm, multi-agent, opencode, llm-orchestration, mcp

Running one LLM through a single API key is straightforward. Running 25+ models across three providers — local GPUs, institutional HPC clusters, and commercial APIs — without coupling any agent to any specific model is an infrastructure problem. Here's how I solved it with LiteLLM, OpenCode, and Oh-My-OpenAgent.

The Core Problem

Every provider has its own API surface. SAIA uses OpenAI-compatible endpoints but with rate limits that reset at midnight. Z.ai serves GLM models through a Chinese CDN with different latency characteristics. Local GPU nodes running Qwen3.5 397B have no rate limits but become unreachable when the VPN drops.

Hard-coding any agent to any provider means you're coupling your workflow to someone else's uptime schedule. The embedding model SAIA offered was silently removed last month — every chunk in the knowledge graph failed to embed because the code expected a response that never came.

LiteLLM: One Endpoint to Rule Them All

LiteLLM Proxy sits at http://127.0.0.1:4000/v1 and presents an OpenAI-compatible interface. Every agent in the system sees one endpoint, one authentication method, one retry policy. Behind the proxy, requests fan out to SAIA, Z.ai, or local GPU nodes.

```yaml
# litellm_config.yaml (simplified)
model_list:
  - model_name: glm-5-turbo
    litellm_params:
      model: openai/glm-5-turbo
      api_base: https://open.bigmodel.cn/api/paas/v4
      api_key: os.environ/ZAI_API_KEY

  - model_name: saia/gpt-oss-120b
    litellm_params:
      model: openai/gpt-oss-120b
      api_base: https://chat-ai.academiccloud.de/v1
      api_key: os.environ/SAIA_API_KEY

  - model_name: qwen3.5-397b
    litellm_params:
      model: ollama/qwen3.5:397b
      api_base: http://gpu-node:11434
```

The agent config just references aliases:

```json
{
  "provider": {
    "litellm": {
      "models": {
        "glm-5-turbo": { "name": "GLM 5 Turbo (Z.ai)" },
        "saia/gpt-oss-120b": { "name": "SAIA GPT OSS 120B" },
        "qwen3.5-397b": { "name": "Qwen3.5 397B (local)" }
      },
      "options": {
        "baseURL": "http://127.0.0.1:4000/v1",
        "timeout": 120000,
        "maxRetries": 5
      }
    }
  }
}
```
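Because the proxy is OpenAI-compatible, any OpenAI-style client can drive it; agent code only ever mentions an alias. A minimal sketch of what such a request looks like (the alias and prompt are illustrative):

```python
import json

PROXY_URL = "http://127.0.0.1:4000/v1"  # the LiteLLM proxy endpoint from the config above

def chat_request(model_alias: str, prompt: str) -> dict:
    """Build an OpenAI-compatible chat completion payload.

    The agent only ever names a proxy alias ("glm-5-turbo", "qwen3.5-397b", ...);
    which backend actually serves the request is the proxy's decision.
    """
    return {
        "model": model_alias,
        "messages": [{"role": "user", "content": prompt}],
    }

body = json.dumps(chat_request("glm-5-turbo", "Summarise this diff"))
# POST `body` to f"{PROXY_URL}/chat/completions" with any OpenAI-compatible client.
```

Swapping the model behind an agent is a one-string change to the alias, with no client code touched.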

Fallback Chains

When SAIA hits its 30 requests/minute rate limit, the proxy doesn't return a 429 to the agent. It routes to the next model in the fallback chain:

```yaml
router_settings:
  routing_strategy: "simple-shuffle"
  fallbacks:
    - "glm-5-turbo": ["saia/glm-4.7", "qwen3.5-397b"]
    - "saia/gpt-oss-120b": ["glm-5-turbo", "saia/qwen3.5-122b-a10b"]
```

The agent never knows a fallback happened. It receives a completion either way.
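Conceptually, the proxy-side behaviour is a walk down the chain. An illustrative Python sketch (not LiteLLM's actual implementation; model names taken from the config above):

```python
# Fallback routing sketch: try the requested alias first, then walk its
# fallback chain whenever a provider answers with a rate-limit error.

FALLBACKS = {
    "glm-5-turbo": ["saia/glm-4.7", "qwen3.5-397b"],
    "saia/gpt-oss-120b": ["glm-5-turbo", "saia/qwen3.5-122b-a10b"],
}

class RateLimited(Exception):
    pass

def complete_with_fallback(model, prompt, call):
    """`call(model, prompt)` is the real provider invocation; the caller
    only sees the final completion, never the intermediate 429s."""
    for candidate in [model, *FALLBACKS.get(model, [])]:
        try:
            return candidate, call(candidate, prompt)
        except RateLimited:
            continue  # next model in the chain
    raise RateLimited(f"all fallbacks exhausted for {model}")

def flaky(model, prompt):          # stand-in provider: the primary is rate-limited
    if model == "glm-5-turbo":
        raise RateLimited()
    return f"completion from {model}"

served_by, text = complete_with_fallback("glm-5-turbo", "hi", flaky)
# served_by is the first fallback; the caller just gets `text` back.
```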

Supporting Infrastructure

LiteLLM runs behind Docker Compose with PostgreSQL for request logging and Redis for prompt caching:

```yaml
# docker-compose.yml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    command: ["--config", "/app/config.yaml"]
    ports: ["4000:4000"]
    volumes: ["./litellm_config.yaml:/app/config.yaml"]
    environment:
      - DATABASE_URL=postgresql://...
      - REDIS_URL=redis://redis:6379
  postgresql:
    image: pgvector/pgvector:pg16
  redis:
    image: redis:7-alpine
```

Every API call is logged with token counts, latency, cost, and model routing decisions. When a model disappears (as SAIA's embedding endpoint did), you see it immediately in the logs rather than discovering it through cascading agent failures.
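One thing the log makes cheap is exactly that failure mode: a model that silently stops appearing. A sketch of such a check over log rows (the field names and the embedding-model entry are assumptions, not LiteLLM's actual log schema):

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical shape of the per-request log rows stored in PostgreSQL.
LOGS = [
    {"model": "saia/e5-mistral-embed", "ts": datetime(2025, 11, 2), "status": 200},
    {"model": "glm-5-turbo", "ts": datetime(2025, 12, 1), "status": 200},
]

def stale_models(logs, now, max_age=timedelta(days=7)):
    """Models that stopped appearing in the logs, e.g. a silently removed endpoint."""
    last_seen = defaultdict(lambda: datetime.min)
    for row in logs:
        last_seen[row["model"]] = max(last_seen[row["model"]], row["ts"])
    return [m for m, ts in last_seen.items() if now - ts > max_age]
```

Running this as a periodic check turns "cascading agent failures" into a one-line alert.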

OpenCode: The Agent Layer

OpenCode is a CLI-based coding agent — no editor plugin, no GUI. It reads files, runs shell commands, queries LSP servers, and modifies code through a text interface. The key architectural decision: OpenCode doesn't know which model it's talking to. It sends every request through the LiteLLM proxy and receives completions from whichever model the proxy selects.

Plugin System

Three plugins extend OpenCode's capabilities:

| Plugin | Function |
|---|---|
| oh-my-openagent | Multi-agent orchestration (10 agents, 10 categories) |
| @tarquinen/opencode-dcp | Dynamic Context Protocol — context-sensitive tool management |
| opencode-morph-plugin | Runtime configuration adjustments |

The SAIA Plugin is a separate OpenCode plugin that automatically fetches all available models from the SAIA API and generates an opencode.json configuration. It marks reasoning models with can_reason: true and optionally routes through the LiteLLM proxy for caching and fallback support.
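Roughly, the plugin's job reduces to mapping a model listing onto a provider block. A sketch under assumptions (the input shape and the `can_reason` handling are guesses at the plugin's behaviour, not its actual schema):

```python
def build_opencode_config(models, base_url="http://127.0.0.1:4000/v1"):
    """Turn a provider's model listing into an opencode.json provider block.

    `models` is assumed to be a list of dicts with "id", "name" and an
    optional "reasoning" flag; the real plugin fetches this from the SAIA API.
    """
    entries = {}
    for m in models:
        entry = {"name": m["name"]}
        if m.get("reasoning"):
            entry["can_reason"] = True  # mark reasoning models
        entries[m["id"]] = entry
    return {
        "provider": {
            "litellm": {
                "models": entries,
                "options": {"baseURL": base_url},  # route through the proxy
            }
        }
    }

cfg = build_opencode_config(
    [{"id": "saia/gpt-oss-120b", "name": "SAIA GPT OSS 120B", "reasoning": True}]
)
```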

MCP Integrations

OpenCode connects to external tools through the Model Context Protocol:

  • Inkscape — SVG creation and manipulation for graphics workflows
  • CodeGraphContext — Code-indexed knowledge graph (Neo4j), queried directly from the agent via Cypher

CodeGraphContext (CGC) indexes code repositories into a Neo4j graph database. Currently 4 repositories are indexed: 131 files, 736 functions, 86 classes. The agent can query call chains, class hierarchies, and cross-repository relationships without leaving the conversation. CGC runs as an MCP server, connecting to Neo4j through an SSH tunnel.
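A call-chain lookup against that graph might be built like this (the `Function` label, `CALLS` relationship, and the `embed_chunk` name are assumptions about CGC's schema, not its documented model):

```python
def call_chain_query(function_name: str, depth: int = 3) -> str:
    """Build a Cypher query for the transitive callers of a function,
    following CALLS edges up to `depth` hops back through the graph."""
    return (
        f"MATCH (callee:Function {{name: '{function_name}'}})"
        f"<-[:CALLS*1..{depth}]-(caller:Function) "
        "RETURN DISTINCT caller.name"
    )

query = call_chain_query("embed_chunk")  # hypothetical function name
# The agent sends `query` to the CGC MCP server, which runs it against Neo4j.
```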

Oh-My-OpenAgent: Multi-Agent Orchestration

The real power emerges when you stop thinking about "one agent, one model" and start routing tasks to specialised sub-agents. Oh-My-OpenAgent provides this orchestration layer on top of OpenCode.

Agent Hierarchy

10 agents, each with a model matched to its cognitive requirements:

| Agent | Role | Model | Why |
|---|---|---|---|
| Sisyphus | Orchestrator | GLM 5 Turbo | Strong reasoning for delegation decisions |
| Sisyphus-Junior | Worker | Devstral 2 123B | Code generation, cheaper than the orchestrator |
| Prometheus | Planner | GLM 4.7 | Structured work breakdowns |
| Oracle | Consultant | GPT OSS 120B | High-IQ read-only analysis |
| Librarian | Reference search | Qwen3.5 35B | External docs and API research |
| Explore | Code search | Gemma 4 26B | Fast local grep (runs locally, no VPN needed) |
| Metis | Pre-planning | GLM 5 Turbo | Ambiguity detection and edge case analysis |
| Momus | Reviewer | Qwen3.5 122B | Plan quality assurance |
| Atlas | Worker | GLM 5 Turbo | Implementation execution |
| Multimodal-Looker | Vision | GLM 4.6V | Image analysis, screenshots |

Category-Based Routing

When Sisyphus delegates a task, it assigns a category. Each category maps to an optimal model:

| Category | Model | Use case |
|---|---|---|
| local-quick | Gemma 4 26B | Trivial local tasks (fast, free, no VPN) |
| local-deep | Qwen3.5 397B | Heavy local tasks (no internet needed) |
| quick | Gemma 4 26B | Single-file typo fixes |
| deep | GLM 4.7 | Complex multi-file implementation |
| ultrabrain | GLM 5 Turbo | Hardest reasoning tasks |
| visual-engineering | Qwen3.5 122B | UI/UX, styling, animation |
| artistry | GLM 5 Turbo | Creative problem-solving |
| unspecified-high | Devstral 2 123B | General tasks, high effort |
| writing | GLM 5.1 | Documentation, articles |

Up to 30 background agents run concurrently, each with its own model selection. A single user request can spawn parallel Explore agents searching different parts of the codebase while a Librarian agent researches external documentation — all simultaneously.
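The fan-out itself is ordinary concurrency plus a routing table. A minimal sketch, assuming the category-to-model mapping above (model identifiers and dispatch mechanics are illustrative, not Oh-My-OpenAgent's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

# Category -> model routing, abbreviated from the table above.
CATEGORY_MODELS = {
    "local-quick": "gemma-4-26b",
    "quick": "gemma-4-26b",
    "deep": "glm-4.7",
    "ultrabrain": "glm-5-turbo",
}

def run_agent(task):
    """Stand-in for a real sub-agent call through the LiteLLM proxy."""
    model = CATEGORY_MODELS[task["category"]]
    return f"{task['goal']} -> {model}"

def dispatch(tasks, max_agents=30):
    """Fan delegated tasks out to background agents, capped at 30 concurrent."""
    with ThreadPoolExecutor(max_workers=max_agents) as pool:
        return list(pool.map(run_agent, tasks))

results = dispatch([
    {"category": "quick", "goal": "search scheduler code"},
    {"category": "deep", "goal": "research embedding API"},
])
```

Each task carries only a category; the model choice stays in one table, so retuning the fleet is a config edit, not a code change.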

A Real Workflow

Here's what happened when I asked the system to fix a scheduler bug in a Neo4j knowledge base:

```text
1. Sisyphus (GLM 5 Turbo)        → Intent Gate: bugfix detected
2. Explore (Gemma 4 26B)         → Searched codebase for scheduler code
3. Librarian (Qwen3.5 35B)       → Researched SAIA embedding API documentation
4. Oracle (GPT OSS 120B)         → Root cause analysis: embedding model silently removed
5. Sisyphus                      → Created fix plan and delegated implementation
6. Sisyphus-Junior (Devstral 2)  → Implemented config changes
7. Sisyphus                      → Verified build, confirmed tests pass
```

Five different models, roughly 15 API calls, all through one LiteLLM endpoint. The agent that investigated the codebase (Gemma 4, running locally) cost nothing. The agent that analysed the root cause (GPT OSS 120B) ran on SAIA's cluster. The user typed one sentence.

ACP: Cross-Editor Orchestration

The Agent Client Protocol (ACP), standardised by Zed Industries, solves a different problem: agents being hard-wired to specific editors. ACP defines a JSON-RPC protocol between clients and agents — the same pattern LSP established for language servers.

With 30+ compatible agents and 40+ clients, ACP enables the ACP Orchestrator — a meta-orchestrator that drives multiple agents across multiple repositories:

```sh
# Batch prompt across all Python projects
acp run "check for outdated dependencies" --tags=python

# Autonomous improvement loop with session rotation
acp loop next-graphwiz-ai --max-iter 100 --rotate 25
```

Session rotation every 25 iterations prevents context saturation. The orchestrator supports five agents: OpenCode, Claude Code, Gemini CLI, Codex CLI, and Goose.
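The rotation logic is simple to state. A sketch, assuming the loop semantics described above (illustrative; not the orchestrator's actual implementation):

```python
def improvement_loop(step, new_session, max_iter=100, rotate=25):
    """Run `step` up to max_iter times, starting a fresh agent session
    every `rotate` iterations so accumulated context never saturates."""
    session = new_session()
    for i in range(max_iter):
        if i > 0 and i % rotate == 0:
            session = new_session()  # discard the saturated context
        step(session, i)

sessions = []
def fresh_session():
    """Hypothetical session factory; in reality `acp` starts a new agent session."""
    sessions.append(object())
    return sessions[-1]

improvement_loop(step=lambda session, i: None, new_session=fresh_session)
# 100 iterations, rotations at i = 25, 50, 75 -> four sessions in total
```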

Infrastructure as Code

All of this is deployed through Ansible playbooks — 9 hosts, idempotent, reproducible:

| Playbook | Function |
|---|---|
| litellm-proxy | Docker Compose: LiteLLM + PostgreSQL + Redis |
| opencode-deploy | Build and install OpenCode from source (Go) |
| opencode-sync | Sync agent configs across all hosts |
| knowledge-graph | Neo4j + CodeGraphContext deployment |
| vpn-hub / vpn-peers | WireGuard mesh networking |
| traefik | Reverse proxy + TLS (Let's Encrypt) |

One command deploys the whole cluster:

```sh
ansible-playbook site.yml --limit ai_cluster
```

What Actually Matters

The architecture has three decoupling layers:

  1. LiteLLM decouples models from agents. Swap any model, add any provider, remove any endpoint — agents don't change.
  2. Oh-My-OpenAgent decouples orchestration from implementation. The orchestrator decides what to do and who should do it. Sub-agents decide how.
  3. ACP decouples agents from editors. Any client can drive any agent through a standard protocol.

Each layer is independently replaceable. That's not architecture for its own sake — it's the difference between a system that breaks when one API changes and one that reroutes in milliseconds.

The full presentation is available at tobias-weiss.org/presentations.