AgentEvolve: When LLM Orchestration Helps — and When It Hurts

Abstract
Multi-step LLM workflow orchestration (plan→execute, critique→refine) is widely assumed to improve output quality. I tested this assumption empirically using evolutionary search across 6 models and 18 experimental runs.
The core finding: orchestration effectiveness is entirely model-dependent. For models ≤30B parameters, adding orchestration steps degrades quality while multiplying token cost; for models ≥70B, orchestration consistently helps, placing the inflection point between 30B and 70B. Direct single-step prompting scored 0.908 on the benchmark suite with GLM-4.7-Flash (30B MoE), beating the best multi-step workflow (0.890) while using 40% fewer tokens.
The Problem
The AI engineering community has converged on a pattern: breaking LLM tasks into multi-step workflows improves output quality. Popular agent frameworks default to plan→execute→critique→refine pipelines, reasoning that explicit decomposition and self-reflection must outperform single-shot generation.
This assumption makes intuitive sense. Humans break complex tasks into steps. Expert performance involves planning, execution, and review. But most agent benchmarks don't isolate the orchestration pattern itself—they compare "agent with tools" vs "no agent."
Core question: Does multi-step workflow orchestration causally improve LLM output quality, holding prompt quality constant?
Methodology
I built AgentEvolve, a FunSearch-inspired evolutionary search system that discovers optimal LLM workflow configurations.
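The search loop itself is the standard FunSearch recipe applied to workflow configs. The sketch below is illustrative, not AgentEvolve's actual code; the helper names and population parameters are assumptions:
```python
import random

# Illustrative FunSearch-style loop over workflow configs. Helper names and
# parameters here are assumptions for the sketch, not AgentEvolve's actual API.

ROLES = ["direct", "plan", "execute", "critique", "refine", "decompose", "synthesize"]

def random_step():
    return {"role": random.choice(ROLES),
            "temperature": round(random.uniform(0.0, 1.0), 2),
            "max_tokens": random.choice([512, 1024, 2048])}

def random_workflow():
    # A candidate is a sequence of 1-4 steps.
    return [random_step() for _ in range(random.randint(1, 4))]

def mutate(workflow):
    # Perturb a copy: tweak one step, or add/drop a step within the 1-4 bound.
    child = [dict(step) for step in workflow]
    op = random.choice(["role", "temperature", "add", "drop"])
    if op == "add" and len(child) < 4:
        child.insert(random.randrange(len(child) + 1), random_step())
    elif op == "drop" and len(child) > 1:
        child.pop(random.randrange(len(child)))
    else:
        step = random.choice(child)
        if op == "role":
            step["role"] = random.choice(ROLES)
        else:
            step["temperature"] = round(random.uniform(0.0, 1.0), 2)
    return child

def evolve(score, generations=20, population=8):
    # Keep the top half each generation and refill by mutating elites.
    pool = [random_workflow() for _ in range(population)]
    for _ in range(generations):
        elites = sorted(pool, key=score, reverse=True)[: population // 2]
        pool = elites + [mutate(random.choice(elites)) for _ in elites]
    return max(pool, key=score)
```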
Benchmark Tasks
Five tasks designed to reward multi-step reasoning:
| Task | Description | Ideal Workflow |
|---|---|---|
| multi_step_reasoning | Multi-hop logical deduction | Plan → Execute |
| code_with_edge_cases | Robust code with edge cases | Decompose → Implement → Test |
| contradictory_extraction | Facts from conflicting sources | Decompose → Synthesize |
| synthesis_from_perspectives | Combine multiple viewpoints | Decompose → Synthesize |
| debug_the_bug | Identify and fix subtle bugs | Analyze → Fix → Verify |
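To make the fitness signal concrete, the harness has roughly this shape; the task prompts and grade() below are stand-ins for the real suite, which isn't reproduced here. A candidate's score is its mean grade across the five tasks:
```python
# Hypothetical harness shape: prompts and grading are stand-ins for the real suite.
TASKS = {
    "multi_step_reasoning":        "...",  # multi-hop logical deduction
    "code_with_edge_cases":        "...",  # robust code with edge cases
    "contradictory_extraction":    "...",  # facts from conflicting sources
    "synthesis_from_perspectives": "...",  # combine multiple viewpoints
    "debug_the_bug":               "...",  # identify and fix subtle bugs
}

def score_workflow(workflow, run_workflow, grade):
    """Fitness for evolve(): mean grade in [0, 1] over the five tasks."""
    grades = [grade(name, run_workflow(workflow, prompt))
              for name, prompt in TASKS.items()]
    return sum(grades) / len(grades)
```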
Candidate Workflow Structure
A candidate is a sequence of 1-4 steps, each with a role and hyperparameters:
Workflow = [
    {
        "role": "plan",        # or: execute, critique, refine, decompose, synthesize, direct
        "temperature": 0.67,
        "max_tokens": 1024
    },
    # ... optional steps 2-4 (a candidate has 1-4 steps total)
]
Step roles:
- direct: Answer the question in a single shot (baseline)
- plan: Break task into steps, outline approach
- execute: Carry out the plan / generate the answer
- critique: Review output for errors or gaps
- refine: Improve output based on critique
- decompose: Split complex input into simpler sub-parts
- synthesize: Combine sub-part answers into coherent whole
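A minimal sketch of how an executor might chain these roles. The prompt templates and call_llm helper are assumptions for illustration; a real executor would also thread intermediate state (e.g. keep both the answer and its critique visible to the refine step):
```python
# Illustrative step templates; the real prompts are not reproduced here.
ROLE_TEMPLATES = {
    "direct":     "Answer the task directly:\n{task}",
    "plan":       "Break this task into steps and outline an approach:\n{task}",
    "execute":    "Carry out the plan below to solve the task.\nTask: {task}\nPlan:\n{prev}",
    "critique":   "Review the output below for errors or gaps:\n{prev}",
    "refine":     "Revise the output below to fix the issues it notes:\n{prev}",
    "decompose":  "Split this input into simpler sub-parts:\n{task}",
    "synthesize": "Combine the sub-part answers below into a coherent whole:\n{prev}",
}

def run_workflow(workflow, task, call_llm):
    """Run steps in order; each step sees the previous step's output as {prev}."""
    prev = ""
    for step in workflow:
        prompt = ROLE_TEMPLATES[step["role"]].format(task=task, prev=prev)
        prev = call_llm(prompt,
                        temperature=step["temperature"],
                        max_tokens=step["max_tokens"])
    return prev  # the final step's output is the candidate answer
```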
Models Tested
| Model | Parameters | Architecture | Seeds |
|---|---|---|---|
| llama3.1:8b | 8B | Dense | 3 |
| mistral:7b | 7B | Dense | 3 |
| qwen2.5:7b | 7B | Dense | 3 |
| glm-4.7-flash | 30B MoE (3B active) | Mixture-of-Experts | 2 |
| llama-3.3-70b | 70B | Dense | 1 |
| llama-3.1-405b | 405B | Dense | 1 |
Results
Primary Finding: Orchestration Inflection Point
| Model Class | Best Workflow | Score | Token Cost |
|---|---|---|---|
| 7-8B models | 1-2 steps (direct) | 0.73-0.77 | 4K-7K |
| 30B MoE | 1 step (direct, temp=0.67) | 0.908 | 4.5K |
| 70B models | 2-4 steps | 0.853 | 12K-18K |
| 405B models | 2 steps | 0.893 | 8K-12K |
The inflection point: 30B–70B parameters. Below this threshold, orchestration amplifies noise. Above it, reflection and critique genuinely improve output.
GLM-4.7-Flash Deep Dive
| Rank | Workflow | Score | Avg Tokens |
|---|---|---|---|
| #1 | direct (T=0.67, tok=1024) | 0.908 | 4,555 |
| #2 | direct → direct | 0.890 | 7,528 |
| #3 | direct → direct → direct | 0.828 | 13,673 |
| #4 | plan → execute | 0.823 | 9,917 |
| #9 | critique → execute | 0.743 | 9,105 |
| #10 | refine → verify | 0.661 | 11,250 |
Simple wins decisively. The best workflow is just direct with temp=0.67, max_tokens=1024. Traditional patterns (plan → execute, critique → execute) underperformed.
Why Multi-Step Failed at Small Scale
Pipeline noise compounds: Each orchestration step introduces formatting drift, hallucinated details, and error amplification. At 7-30B scale, each generation pass degrades quality.
Prompt overhead dilutes signal: Multi-step workflows spend tokens on meta-instructions ("Now plan your approach...") that compete with actual task content.
Weak self-critique: A 7B model's "critique" is typically vacuous ("This looks good") or hallucinated ("There's an error in line 7" when line 7 is correct).
Built-in reasoning: GLM-4.7-Flash is an MoE model trained with built-in chain-of-thought. Even with think: false, its internal reasoning is strong enough that external scaffolding adds redundancy rather than signal.
Task-Specific Insights
Per-task performance gap between the 405B model and the 7B average:
| Task | 405B | 7B Avg | Delta |
|---|---|---|---|
| multi_step_reasoning | 1.00 | 0.65 | +0.35 |
| code_with_edge_cases | 0.83 | 0.75 | +0.08 |
| debug_the_bug | 1.00 | 0.78 | +0.22 |
| synthesis_from_perspectives | 0.88 | 0.73 | +0.15 |
| contradictory_extraction | 0.75 | 0.80 | -0.05 |
Surprise: 7B models slightly outperformed 405B on contradictory extraction—this task rewards literal text matching where smaller models are less prone to overthinking.
Synthesis exception: Even at 7-8B scale, synthesis tasks showed a +0.12 benefit from two-pass approaches. This was the only task where multi-step consistently helped.
Production Implications
For Small Models (≤30B)
Use direct single-shot prompting. Temperature ~0.67 with a 1024-token cap is the sweet spot. Don't build agent pipelines: they cost 2-3× the tokens for equal or worse output.
Exception: Synthesis tasks. Combining multiple perspectives benefits from two-pass decompose → synthesize even at small scale.
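As a concrete reference, a direct call with these settings against a local Ollama server might look like the sketch below (endpoint and option names follow Ollama's /api/generate API; the model tag is whatever small model you deploy). For the synthesis exception, call the same endpoint twice: once to decompose, once to synthesize.
```python
import requests

def direct_answer(task: str, model: str = "qwen2.5:7b") -> str:
    """The optimal small-model 'agent': one well-crafted prompt, one pass."""
    resp = requests.post(
        "http://localhost:11434/api/generate",  # default local Ollama endpoint
        json={
            "model": model,
            "prompt": task,
            "stream": False,
            "options": {
                "temperature": 0.67,  # the sweet spot found above
                "num_predict": 1024,  # Ollama's max-token option
            },
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```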
For Large Models (≥70B)
Orchestration pays off. Plan→execute and refine→direct patterns show consistent quality gains. Budget for 2-4 step workflows.
For Agent Framework Design
Model-aware routing. Claudia's voice bridge uses two-tier routing: GLM for fast single-shot tasks, Claude for complex multi-step work. These results validate that design: GLM handles most tasks in one pass, and only complex work routes to frontier models.
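A sketch of what such routing might look like; the complexity heuristic and model identifiers below are illustrative assumptions, not the bridge's actual logic:
```python
# Illustrative two-tier router; the heuristic and model names are assumptions.
SYNTHESIS_HINTS = ("combine", "synthesize", "perspectives")

def route(task: str) -> dict:
    """Default to the small model in one pass; escalate only when the task
    plausibly benefits from orchestration (long input or synthesis-shaped)."""
    escalate = len(task) > 2000 or any(h in task.lower() for h in SYNTHESIS_HINTS)
    if escalate:
        return {"model": "claude", "workflow": ["plan", "execute", "critique", "refine"]}
    return {"model": "glm-4.7-flash", "workflow": ["direct"]}
```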
The community has overcorrected. Most "agentic workflow" patterns are designed for frontier models and actively hurt when applied to smaller models.
Conclusion
Multi-step LLM workflow orchestration is not universally beneficial. It helps at large scale (≥70B) and hurts at small scale (≤30B). For practitioners deploying small-to-medium models, the optimal "agent" is just a well-crafted single prompt.
Key results:
- Direct prompting scored 0.908 with GLM-4.7-Flash, beating best multi-step (0.890)
- Adding orchestration at small scale increased token cost 2-3× while decreasing quality
- Inflection point: 30B–70B parameters where orchestration starts helping
- Seed variance (±0.013–0.020) exceeds model differences at 7B scale
- Exception: Synthesis tasks benefit from two-pass even at small scale (+0.12)
The AI engineering community has overcorrected toward orchestration complexity. For most production deployments using sub-70B models, simple prompting wins.