Orchestrating 25+ LLMs Through a Single Proxy
Running one LLM through a single API key is straightforward. Running 25+ models across three providers — local GPUs, institutional HPC clusters, and commercial APIs — without coupling any agent to any specific model is an infrastructure problem. Here's how I solved it with LiteLLM, OpenCode, and Oh-My-OpenAgent.
The Core Problem
Every provider has its own API surface. SAIA uses OpenAI-compatible endpoints, but with rate limits that reset at midnight. Z.ai serves GLM models through a Chinese CDN with different latency characteristics. Local GPU nodes running Qwen3.5 397B have no rate limits, but they become unreachable when the VPN drops.
Hard-coding any agent to any provider means you're coupling your workflow to someone else's uptime schedule. The embedding model SAIA offered was silently removed last month — every chunk in the knowledge graph failed to embed because the code expected a response that never came.
LiteLLM: One Endpoint to Rule Them All
LiteLLM Proxy sits at http://127.0.0.1:4000/v1 and presents an OpenAI-compatible interface. Every agent in the system sees one endpoint, one authentication method, one retry policy. Behind the proxy, requests fan out to SAIA, Z.ai, or local GPU nodes.
```yaml
# litellm_config.yaml (simplified)
model_list:
  - model_name: glm-5-turbo
    litellm_params:
      model: openai/glm-5-turbo
      api_base: https://open.bigmodel.cn/api/paas/v4
      api_key: os.environ/ZAI_API_KEY
  - model_name: saia/gpt-oss-120b
    litellm_params:
      model: openai/gpt-oss-120b
      api_base: https://chat-ai.academiccloud.de/v1
      api_key: os.environ/SAIA_API_KEY
  - model_name: qwen3.5-397b
    litellm_params:
      model: ollama/qwen3.5:397b
      api_base: http://gpu-node:11434
```
The agent config just references aliases:
```json
{
  "provider": {
    "litellm": {
      "models": {
        "glm-5-turbo": { "name": "GLM 5 Turbo (Z.ai)" },
        "saia/gpt-oss-120b": { "name": "SAIA GPT OSS 120B" },
        "qwen3.5-397b": { "name": "Qwen3.5 397B (local)" }
      },
      "options": {
        "baseURL": "http://127.0.0.1:4000/v1",
        "timeout": 120000,
        "maxRetries": 5
      }
    }
  }
}
```
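From the agent's point of view, every request therefore has the same shape regardless of which provider ends up serving it. A minimal sketch of that payload (the helper name is mine; authentication is whatever key the proxy is configured with):

```python
import json

def build_chat_request(model: str, prompt: str) -> dict:
    """The JSON body an agent POSTs to http://127.0.0.1:4000/v1/chat/completions.
    `model` is a proxy alias from litellm_config.yaml, never a provider URL."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_chat_request("glm-5-turbo", "Summarise this diff")
body = json.dumps(payload)  # sent with an Authorization: Bearer <proxy key> header
```

Swapping the backing model is a one-line change to the proxy config; this payload never changes.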
Fallback Chains
When SAIA hits its 30 requests/minute rate limit, the proxy doesn't return a 429 to the agent. It routes to the next model in the fallback chain:
```yaml
router_settings:
  routing_strategy: "simple-shuffle"
  fallbacks:
    - "glm-5-turbo": ["saia/glm-4.7", "qwen3.5-397b"]
    - "saia/gpt-oss-120b": ["glm-5-turbo", "saia/qwen3.5-122b-a10b"]
```
The agent never knows a fallback happened. It receives a completion either way.
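Conceptually, the routing the proxy applies boils down to a few lines (a simplification, not LiteLLM's actual implementation; the backend callables stand in for real provider calls):

```python
class RateLimited(Exception):
    """Stand-in for an upstream HTTP 429."""

FALLBACKS = {
    "glm-5-turbo": ["saia/glm-4.7", "qwen3.5-397b"],
    "saia/gpt-oss-120b": ["glm-5-turbo", "saia/qwen3.5-122b-a10b"],
}

def complete(model, prompt, backends, chains=FALLBACKS):
    """Try the requested model, then walk its fallback chain in order.
    `backends` maps model name -> callable(prompt) -> completion text."""
    for candidate in [model] + chains.get(model, []):
        if candidate not in backends:
            continue
        try:
            return candidate, backends[candidate](prompt)
        except RateLimited:
            continue  # try the next model; the caller never sees the 429
    raise RuntimeError(f"all backends exhausted for {model}")

def rate_limited_backend(prompt):
    raise RateLimited

# The primary is saturated, so the first fallback answers transparently.
served_by, text = complete(
    "glm-5-turbo",
    "hello",
    {"glm-5-turbo": rate_limited_backend, "saia/glm-4.7": lambda p: f"ok: {p}"},
)
```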
Supporting Infrastructure
LiteLLM runs behind Docker Compose with PostgreSQL for request logging and Redis for prompt caching:
```yaml
# docker-compose.yml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports: ["4000:4000"]
    volumes: ["./litellm_config.yaml:/app/config.yaml"]
    environment:
      - DATABASE_URL=postgresql://...
      - REDIS_URL=redis://redis:6379
  postgresql:
    image: pgvector/pgvector:pg16
  redis:
    image: redis:7-alpine
```
Every API call is logged with token counts, latency, cost, and model routing decisions. When a model disappears (as SAIA's embedding endpoint did), you see it immediately in the logs rather than discovering it through cascading agent failures.
OpenCode: The Agent Layer
OpenCode is a CLI-based coding agent — no editor plugin, no GUI. It reads files, runs shell commands, queries LSP servers, and modifies code through a text interface. The key architectural decision: OpenCode doesn't know which model it's talking to. It sends every request through the LiteLLM proxy and receives completions from whichever model the proxy selects.
Plugin System
Three plugins extend OpenCode's capabilities:
| Plugin | Function |
|---|---|
| oh-my-openagent | Multi-agent orchestration (10 agents, 10 categories) |
| @tarquinen/opencode-dcp | Dynamic Context Protocol — context-sensitive tool management |
| opencode-morph-plugin | Runtime configuration adjustments |
A separate SAIA plugin automatically fetches all available models from the SAIA API and generates an opencode.json configuration. It marks reasoning models with can_reason: true and can optionally route them through the LiteLLM proxy for caching and fallback support.
MCP Integrations
OpenCode connects to external tools through the Model Context Protocol:
- Inkscape — SVG creation and manipulation for graphics workflows
- CodeGraphContext — Code-indexed knowledge graph (Neo4j), queried directly from the agent via Cypher
CodeGraphContext (CGC) indexes code repositories into a Neo4j graph database. Currently 4 repositories are indexed: 131 files, 736 functions, 86 classes. The agent can query call chains, class hierarchies, and cross-repository relationships without leaving the conversation. CGC runs as an MCP server, connecting to Neo4j through an SSH tunnel.
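A call-chain query of the kind the agent can issue might look like this (the node labels, relationship type, and function name are assumptions about CGC's schema, not taken from its documentation):

```cypher
// Hypothetical schema: Function nodes linked by CALLS relationships.
// Find everything reachable from a given entry point, up to 3 hops.
MATCH (f:Function {name: "schedule_embeddings"})-[:CALLS*1..3]->(callee:Function)
RETURN DISTINCT callee.name, callee.file
```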
Oh-My-OpenAgent: Multi-Agent Orchestration
The real power emerges when you stop thinking about "one agent, one model" and start routing tasks to specialised sub-agents. Oh-My-OpenAgent provides this orchestration layer on top of OpenCode.
Agent Hierarchy
10 agents, each with a model matched to its cognitive requirements:
| Agent | Role | Model | Why |
|---|---|---|---|
| Sisyphus | Orchestrator | GLM 5 Turbo | Strong reasoning for delegation decisions |
| Sisyphus-Junior | Worker | Devstral 2 123B | Code generation, cheaper than the orchestrator |
| Prometheus | Planner | GLM 4.7 | Structured work breakdowns |
| Oracle | Consultant | GPT OSS 120B | High-IQ read-only analysis |
| Librarian | Reference search | Qwen3.5 35B | External docs and API research |
| Explore | Code search | Gemma 4 26B | Fast local grep (runs locally, no VPN needed) |
| Metis | Pre-planning | GLM 5 Turbo | Ambiguity detection and edge case analysis |
| Momus | Reviewer | Qwen3.5 122B | Plan quality assurance |
| Atlas | Worker | GLM 5 Turbo | Implementation execution |
| Multimodal-Looker | Vision | GLM 4.6V | Image analysis, screenshots |
Category-Based Routing
When Sisyphus delegates a task, it assigns a category. Each category maps to an optimal model:
| Category | Model | Use case |
|---|---|---|
| local-quick | Gemma 4 26B | Trivial local tasks (fast, free, no VPN) |
| local-deep | Qwen3.5 397B | Heavy local tasks (no internet needed) |
| quick | Gemma 4 26B | Single-file typo fixes |
| deep | GLM 4.7 | Complex multi-file implementation |
| ultrabrain | GLM 5 Turbo | Hardest reasoning tasks |
| visual-engineering | Qwen3.5 122B | UI/UX, styling, animation |
| artistry | GLM 5 Turbo | Creative problem-solving |
| unspecified-high | Devstral 2 123B | General tasks, high effort |
| writing | GLM 5.1 | Documentation, articles |
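In code, the category table reduces to a lookup. A sketch (the resolution function and the fallback to unspecified-high are my assumptions, not the plugin's actual logic):

```python
# Category -> model mapping, mirroring the table above.
CATEGORY_MODELS = {
    "local-quick": "Gemma 4 26B",
    "local-deep": "Qwen3.5 397B",
    "quick": "Gemma 4 26B",
    "deep": "GLM 4.7",
    "ultrabrain": "GLM 5 Turbo",
    "visual-engineering": "Qwen3.5 122B",
    "artistry": "GLM 5 Turbo",
    "unspecified-high": "Devstral 2 123B",
    "writing": "GLM 5.1",
}

def model_for(category: str) -> str:
    # Unknown categories fall back to the general-purpose tier (assumption).
    return CATEGORY_MODELS.get(category, CATEGORY_MODELS["unspecified-high"])
```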
Up to 30 background agents run concurrently, each with its own model selection. A single user request can spawn parallel Explore agents searching different parts of the codebase while a Librarian agent researches external documentation — all simultaneously.
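The bounded fan-out can be sketched with a semaphore (illustrative only; the agent names are invented and the zero-length sleep stands in for a real model call):

```python
import asyncio

MAX_BACKGROUND_AGENTS = 30  # concurrency cap from the orchestrator

async def run_agent(name: str, sem: asyncio.Semaphore) -> str:
    async with sem:             # at most 30 agents hold a slot at once
        await asyncio.sleep(0)  # stands in for the actual model call
        return f"{name}: done"

async def fan_out(tasks: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_BACKGROUND_AGENTS)
    # gather preserves input order, so results line up with task names
    return await asyncio.gather(*(run_agent(t, sem) for t in tasks))

results = asyncio.run(fan_out(["explore-src", "explore-tests", "librarian"]))
```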
A Real Workflow
Here's what happened when I asked the system to fix a scheduler bug in a Neo4j knowledge base:
1. Sisyphus (GLM 5 Turbo) → Intent Gate: bugfix detected
2. Explore (Gemma 4 26B) → Searched codebase for scheduler code
3. Librarian (Qwen3.5 35B) → Researched SAIA embedding API documentation
4. Oracle (GPT OSS 120B) → Root cause analysis: embedding model silently removed
5. Sisyphus → Created fix plan and delegated implementation
6. Sisyphus-Junior (Devstral 2) → Implemented config changes
7. Sisyphus → Verified build, confirmed tests pass
Five different models, roughly 15 API calls, all through one LiteLLM endpoint. The agent that investigated the codebase (Gemma 4, running locally) cost nothing. The agent that analysed the root cause (GPT OSS 120B) ran on SAIA's cluster. The user typed one sentence.
ACP: Cross-Editor Orchestration
The Agent Client Protocol (ACP), standardised by Zed Industries, addresses a different problem: the coupling of agents to editors. ACP defines a JSON-RPC protocol between clients and agents — the same pattern LSP established for language servers.
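The wire format is plain JSON-RPC 2.0. A prompt request might be framed like this (treat the method name and fields as illustrative rather than a faithful rendering of the ACP spec):

```python
import itertools
import json

_ids = itertools.count(1)  # JSON-RPC requests need unique ids

def acp_request(method: str, params: dict) -> str:
    """Frame an ACP message as JSON-RPC 2.0 — the same envelope LSP uses."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": method,
        "params": params,
    })

msg = acp_request("session/prompt", {"prompt": "check for outdated dependencies"})
```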
With 30+ compatible agents and 40+ clients, ACP enables the ACP Orchestrator — a meta-orchestrator that drives multiple agents across multiple repositories:
```bash
# Batch prompt across all Python projects
acp run "check for outdated dependencies" --tags=python

# Autonomous improvement loop with session rotation
acp loop next-graphwiz-ai --max-iter 100 --rotate 25
```
Session rotation every 25 iterations prevents context saturation. The orchestrator supports five agents: OpenCode, Claude Code, Gemini CLI, Codex CLI, and Goose.
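The rotation schedule itself is simple enough to sketch (a hypothetical helper; only the reset-every-25-iterations behaviour mirrors the flags above):

```python
def improvement_loop(max_iter: int = 100, rotate: int = 25) -> list[tuple[int, int]]:
    """Return (iteration, session) pairs: a fresh session every `rotate`
    iterations, so accumulated context is dropped before it saturates."""
    schedule = []
    session = 0
    for i in range(max_iter):
        if i % rotate == 0:
            session += 1  # rotate: start a new session with empty context
        schedule.append((i, session))
    return schedule
```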
Infrastructure as Code
All of this is deployed through Ansible playbooks — 9 hosts, idempotent, reproducible:
| Playbook | Function |
|---|---|
| litellm-proxy | Docker Compose: LiteLLM + PostgreSQL + Redis |
| opencode-deploy | Build and install OpenCode from source (Go) |
| opencode-sync | Sync agent configs across all hosts |
| knowledge-graph | Neo4j + CodeGraphContext deployment |
| vpn-hub / vpn-peers | WireGuard mesh networking |
| traefik | Reverse proxy + TLS (Let's Encrypt) |
```bash
ansible-playbook site.yml --limit ai_cluster
```
What Actually Matters
The architecture has three decoupling layers:
- LiteLLM decouples models from agents. Swap any model, add any provider, remove any endpoint — agents don't change.
- Oh-My-OpenAgent decouples orchestration from implementation. The orchestrator decides what to do and who should do it. Sub-agents decide how.
- ACP decouples agents from editors. Any client can drive any agent through a standard protocol.
Each layer is independently replaceable. That's not architecture for its own sake — it's the difference between a system that breaks when one API changes and one that reroutes in milliseconds.
The full presentation is available at tobias-weiss.org/presentations.