# Multi-Model Mode

Use multiple LLMs in a single request: the reasoning model thinks, the tool model executes, and the synthesis model responds.
## Why multiple models?
Different models are good at different things. Opus is great at reasoning but expensive. Haiku is fast but shallow. GPT-4o handles tools well. o1 does extended thinking.
Multi-model mode picks the right model for each phase of a request instead of making one model do everything.
## The four roles

| Role | Responsibility | Typical models |
| --- | --- | --- |
| Reasoning | Analyzes the problem, plans the approach, does extended thinking | Claude Opus, o1 |
| Tool execution | Calls MCP tools, executes functions | Claude Sonnet, GPT-4o |
| Synthesis | Combines results into the final response | Claude Sonnet, GPT-4o |
| Fallback | Error recovery when something fails | Gemini Flash, Haiku |
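As a rough illustration, the table above amounts to a role-to-model mapping. A minimal Python sketch, where the role keys and model IDs are placeholders rather than the product's actual configuration schema:

```python
# Placeholder role-to-model mapping; keys and model IDs are illustrative only.
ROLE_MODELS: dict[str, str] = {
    "reasoning": "claude-opus",         # plans and does extended thinking
    "tool_execution": "claude-sonnet",  # calls MCP tools, executes functions
    "synthesis": "claude-sonnet",       # combines results into the final response
    "fallback": "gemini-flash",         # error recovery when another call fails
}
```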
## How it decides

Not every request needs multi-model. The system analyzes several signals (sketched in code after this list):

- **Query complexity** - simple questions go to a single model
- **Tool requirements** - if tools are likely needed, the tool model gets involved
- **Slider position** - the higher the slider, the more likely multi-model routing is
- **Trigger patterns** - certain keywords ("analyze", "investigate", "debug") trigger it
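A minimal sketch of how a router might combine these signals. The function name, thresholds, and keyword list are assumptions for illustration, not the actual implementation:

```python
TRIGGER_KEYWORDS = {"analyze", "investigate", "debug"}

def should_use_multi_model(query: str, slider: float, tools_likely: bool) -> bool:
    """Hypothetical heuristic: any strong enough signal opts the request in."""
    triggered = any(kw in query.lower() for kw in TRIGGER_KEYWORDS)
    complex_query = len(query.split()) > 50  # crude complexity proxy
    # A higher slider value lowers the bar for multi-model routing.
    return triggered or (slider > 0.7 and (complex_query or tools_likely))

# Example: the "debug" keyword triggers multi-model regardless of slider position.
assert should_use_multi_model("debug the flaky test", slider=0.2, tools_likely=False)
```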
## Handoffs
Models pass context to each other through handoffs. The reasoning model's analysis becomes context for the tool model. Tool results become context for synthesis.
The maximum number of handoffs is configurable (default: 5) to prevent infinite loops.
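A minimal sketch of that chain, assuming a hypothetical `call_model` helper; the stage names and context shape are illustrative:

```python
MAX_HANDOFFS = 5  # configurable cap; prevents infinite loops

def call_model(role: str, context: str) -> str:
    """Stub standing in for a real model call."""
    return f"[{role} output given: {context[:40]}...]"

def run_request(query: str, stages: list[str]) -> str:
    """Thread each stage's output into the next stage's context."""
    context = query
    for handoff, role in enumerate(stages):
        if handoff >= MAX_HANDOFFS:
            break  # stop handing off once the cap is reached
        context = call_model(role, context)
    return context

print(run_request("Why is checkout slow?", ["reasoning", "tool_execution", "synthesis"]))
```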
## Configuration

### Environment variables
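The variable names below are guesses for illustration; check your deployment's configuration reference for the real keys.

```bash
# Hypothetical keys -- consult the configuration reference for the actual names.
MULTI_MODEL_ENABLED=true
MULTI_MODEL_MAX_HANDOFFS=5            # default: 5
MULTI_MODEL_REASONING_MODEL=claude-opus
MULTI_MODEL_TOOL_MODEL=claude-sonnet
MULTI_MODEL_SYNTHESIS_MODEL=claude-sonnet
MULTI_MODEL_FALLBACK_MODEL=gemini-flash
```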
### Runtime toggle
Admins can enable or disable it via the Admin Portal under Pipeline Settings.
### Via SDK
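A hedged sketch of a per-request opt-in through a Python SDK; the package, client class, and parameter names are assumptions and may not match the real SDK surface:

```python
from pipeline_sdk import Client  # hypothetical package name

client = Client(api_key="...")
response = client.chat(
    messages=[{"role": "user", "content": "Investigate why deploys started failing"}],
    multi_model=True,  # opt this request in to multi-model routing
    max_handoffs=5,    # per-request handoff cap
)
print(response.text)
```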
## Cost tracking
Each role tracks its own costs, and the total is reported in the response metadata.
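A representative shape for that metadata (field names and figures are illustrative, not actual output):

```json
{
  "costs": {
    "reasoning":      { "model": "claude-opus",   "tokens": 2110, "usd": 0.0840 },
    "tool_execution": { "model": "claude-sonnet", "tokens": 950,  "usd": 0.0071 },
    "synthesis":      { "model": "claude-sonnet", "tokens": 1430, "usd": 0.0107 }
  },
  "total_usd": 0.1018
}
```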
## When to use it

Good for:

- Complex analysis tasks
- Multi-step tool workflows
- Architecture and debugging questions
- Anything where you'd want to "think, then do"

Skip it for:

- Simple Q&A
- Single-tool operations
- High-volume, low-complexity tasks
## Limitations

- Adds latency (multiple model calls)
- Higher cost for complex requests
- Handoff context has size limits
- Not all providers support extended thinking