AgentEvolve: When LLM Orchestration Helps — and When It Hurts

Abstract
Multi-step LLM workflow orchestration (plan→execute, critique→refine) is widely assumed to improve output quality. I tested this assumption empirically using evolutionary search across 6 models and 18 experimental runs.
The core finding: orchestration effectiveness is entirely model-dependent. For models ≤30B parameters, adding orchestration steps degrades quality while multiplying token cost; for models ≥70B, orchestration consistently helps, placing the inflection point between 30B and 70B. Direct single-step prompting scored 0.908 on the benchmark suite with GLM-4.7-Flash (30B MoE), beating the best multi-step workflow (0.890) while using 40% fewer tokens.
The Problem
The AI engineering community has converged on a pattern: breaking LLM tasks into multi-step workflows improves output quality. Popular agent frameworks default to plan→execute→critique→refine pipelines, reasoning that explicit decomposition and self-reflection must outperform single-shot generation.
This assumption makes intuitive sense. Humans break complex tasks into steps. Expert performance involves planning, execution, and review. But most agent benchmarks don't isolate the orchestration pattern itself—they compare "agent with tools" vs "no agent."
Core question: Does multi-step workflow orchestration causally improve LLM output quality, holding prompt quality constant?
Methodology
I built AgentEvolve, a FunSearch-inspired evolutionary search system that discovers optimal LLM workflow configurations.
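The search loop itself is the standard FunSearch recipe applied to workflow configs. The sketch below is illustrative, not AgentEvolve's actual code; the helper names and population parameters are assumptions:
```python
import random

# Illustrative FunSearch-style loop over workflow configs. Helper names and
# parameters here are assumptions for the sketch, not AgentEvolve's actual API.

ROLES = ["direct", "plan", "execute", "critique", "refine", "decompose", "synthesize"]

def random_step():
    return {"role": random.choice(ROLES),
            "temperature": round(random.uniform(0.0, 1.0), 2),
            "max_tokens": random.choice([512, 1024, 2048])}

def random_workflow():
    # A candidate is a sequence of 1-4 steps.
    return [random_step() for _ in range(random.randint(1, 4))]

def mutate(workflow):
    # Perturb a copy: tweak one step, or add/drop a step within the 1-4 bound.
    child = [dict(step) for step in workflow]
    op = random.choice(["role", "temperature", "add", "drop"])
    if op == "add" and len(child) < 4:
        child.insert(random.randrange(len(child) + 1), random_step())
    elif op == "drop" and len(child) > 1:
        child.pop(random.randrange(len(child)))
    else:
        step = random.choice(child)
        if op == "role":
            step["role"] = random.choice(ROLES)
        else:
            step["temperature"] = round(random.uniform(0.0, 1.0), 2)
    return child

def evolve(score, generations=20, population=8):
    # Keep the top half each generation and refill by mutating elites.
    pool = [random_workflow() for _ in range(population)]
    for _ in range(generations):
        elites = sorted(pool, key=score, reverse=True)[: population // 2]
        pool = elites + [mutate(random.choice(elites)) for _ in elites]
    return max(pool, key=score)
```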
Benchmark Tasks
Five tasks designed to reward multi-step reasoning:
| Task | Description | Ideal Workflow |
|---|---|---|
| multi_step_reasoning | Multi-hop logical deduction | Plan → Execute |
| code_with_edge_cases | Robust code with edge cases | Decompose → Implement → Test |
| contradictory_extraction | Facts from conflicting sources | Decompose → Synthesize |
| synthesis_from_perspectives | Combine multiple viewpoints | Decompose → Synthesize |
| debug_the_bug | Identify and fix subtle bugs | Analyze → Fix → Verify |
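To make the fitness signal concrete, the harness has roughly this shape; the task prompts and grade() below are stand-ins for the real suite, which isn't reproduced here. A candidate's score is its mean grade across the five tasks:
```python
# Hypothetical harness shape: prompts and grading are stand-ins for the real suite.
TASKS = {
    "multi_step_reasoning":        "...",  # multi-hop logical deduction
    "code_with_edge_cases":        "...",  # robust code with edge cases
    "contradictory_extraction":    "...",  # facts from conflicting sources
    "synthesis_from_perspectives": "...",  # combine multiple viewpoints
    "debug_the_bug":               "...",  # identify and fix subtle bugs
}

def score_workflow(workflow, run_workflow, grade):
    """Fitness for evolve(): mean grade in [0, 1] over the five tasks."""
    grades = [grade(name, run_workflow(workflow, prompt))
              for name, prompt in TASKS.items()]
    return sum(grades) / len(grades)
```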
Candidate Workflow Structure
A candidate is a sequence of 1-4 steps, each with a role and hyperparameters:
Workflow = [
    {
        "role": "plan",        # or: execute, critique, refine, decompose, synthesize, direct
        "temperature": 0.67,
        "max_tokens": 1024
    },
    # ... optional steps 2-4 (a candidate has 1-4 steps total)
]
Step roles:
- direct: Answer the question in a single shot (baseline)
- plan: Break task into steps, outline approach
- execute: Carry out the plan / generate the answer
- critique: Review output for errors or gaps
- refine: Improve output based on critique
- decompose: Split complex input into simpler sub-parts
- synthesize: Combine sub-part answers into coherent whole
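A minimal sketch of how an executor might chain these roles. The prompt templates and call_llm helper are assumptions for illustration; a real executor would also thread intermediate state (e.g. keep both the answer and its critique visible to the refine step):
```python
# Illustrative step templates; the real prompts are not reproduced here.
ROLE_TEMPLATES = {
    "direct":     "Answer the task directly:\n{task}",
    "plan":       "Break this task into steps and outline an approach:\n{task}",
    "execute":    "Carry out the plan below to solve the task.\nTask: {task}\nPlan:\n{prev}",
    "critique":   "Review the output below for errors or gaps:\n{prev}",
    "refine":     "Revise the output below to fix the issues it notes:\n{prev}",
    "decompose":  "Split this input into simpler sub-parts:\n{task}",
    "synthesize": "Combine the sub-part answers below into a coherent whole:\n{prev}",
}

def run_workflow(workflow, task, call_llm):
    """Run steps in order; each step sees the previous step's output as {prev}."""
    prev = ""
    for step in workflow:
        prompt = ROLE_TEMPLATES[step["role"]].format(task=task, prev=prev)
        prev = call_llm(prompt,
                        temperature=step["temperature"],
                        max_tokens=step["max_tokens"])
    return prev  # the final step's output is the candidate answer
```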
Models Tested
| Model | Parameters | Architecture | Seeds |
|---|---|---|---|
| llama3.1:8b | 8B | Dense | 3 |
| mistral:7b | 7B | Dense | 3 |
| qwen2.5:7b | 7B | Dense | 3 |
| glm-4.7-flash | 30B MoE (3B active) | Mixture-of-Experts | 2 |
| llama-3.3-70b | 70B | Dense | 1 |
| llama-3.1-405b | 405B | Dense | 1 |
Results
Primary Finding: Orchestration Inflection Point
| Model Class | Best Workflow | Score | Token Cost |
|---|---|---|---|
| 7-8B models | 1-2 steps (direct) | 0.73-0.77 | 4K-7K |
| 30B MoE | 1 step (direct, temp=0.67) | 0.908 | 4.5K |
| 70B models | 2-4 steps | 0.853 | 12K-18K |
| 405B models | 2 steps | 0.893 | 8K-12K |
The inflection point: 30B–70B parameters. Below this threshold, orchestration amplifies noise. Above it, reflection and critique genuinely improve output.
GLM-4.7-Flash Deep Dive
| Rank | Workflow | Score | Avg Tokens |
|---|---|---|---|
| #1 | direct (T=0.67, tok=1024) | 0.908 | 4,555 |
| #2 | direct → direct | 0.890 | 7,528 |
| #3 | direct → direct → direct | 0.828 | 13,673 |
| #4 | plan → execute | 0.823 | 9,917 |
| #9 | critique → execute | 0.743 | 9,105 |
| #10 | refine → verify | 0.661 | 11,250 |
Simple wins decisively. The best workflow is just direct with temp=0.67, max_tokens=1024. Traditional patterns (plan → execute, critique → execute) underperformed.
Why Multi-Step Failed at Small Scale
Pipeline noise compounds: Each orchestration step introduces formatting drift, hallucinated details, and error amplification. At 7-30B scale, each generation pass degrades quality.
Prompt overhead dilutes signal: Multi-step workflows spend tokens on meta-instructions ("Now plan your approach...") that compete with actual task content.
Weak self-critique: A 7B model's "critique" is typically vacuous ("This looks good") or hallucinated ("There's an error in line 7" when line 7 is correct).
Built-in reasoning: GLM-4.7-Flash is an MoE model trained with built-in chain-of-thought. Even with think: false, its internal reasoning is strong enough that external scaffolding adds redundancy rather than signal.
Task-Specific Insights
Per-task performance gap between the 405B model and the 7B average:
| Task | 405B | 7B Avg | Delta |
|---|---|---|---|
| multi_step_reasoning | 1.00 | 0.65 | +0.35 |
| code_with_edge_cases | 0.83 | 0.75 | +0.08 |
| debug_the_bug | 1.00 | 0.78 | +0.22 |
| synthesis_from_perspectives | 0.88 | 0.73 | +0.15 |
| contradictory_extraction | 0.75 | 0.80 | -0.05 |
Surprise: 7B models slightly outperformed 405B on contradictory extraction—this task rewards literal text matching where smaller models are less prone to overthinking.
Synthesis exception: Even at 7-8B scale, synthesis tasks showed a +0.12 benefit from two-pass approaches. This was the only task where multi-step consistently helped.
Production Implications
For Small Models (≤30B)
Use direct single-shot prompting. Temperature ~0.67 with a 1024-token cap is the sweet spot. Don't build agent pipelines: they cost 2-3× the tokens for equal or worse output.
Exception: Synthesis tasks. Combining multiple perspectives benefits from two-pass decompose → synthesize even at small scale.
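As a concrete reference, a direct call with these settings against a local Ollama server might look like the sketch below (endpoint and option names follow Ollama's /api/generate API; the model tag is whatever small model you deploy). For the synthesis exception, call the same endpoint twice: once to decompose, once to synthesize.
```python
import requests

def direct_answer(task: str, model: str = "qwen2.5:7b") -> str:
    """The optimal small-model 'agent': one well-crafted prompt, one pass."""
    resp = requests.post(
        "http://localhost:11434/api/generate",  # default local Ollama endpoint
        json={
            "model": model,
            "prompt": task,
            "stream": False,
            "options": {
                "temperature": 0.67,  # the sweet spot found above
                "num_predict": 1024,  # Ollama's max-token option
            },
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```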
For Large Models (≥70B)
Orchestration pays off. Plan→execute and refine→direct patterns show consistent quality gains. Budget for 2-4 step workflows.
For Agent Framework Design
Model-aware routing. Claudia's voice bridge uses two-tier routing: GLM for fast single-shot tasks, Claude for complex multi-step work. These results validate that design: GLM handles most tasks in one pass, and only complex work routes to frontier models.
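A sketch of what such routing might look like; the complexity heuristic and model identifiers below are illustrative assumptions, not the bridge's actual logic:
```python
# Illustrative two-tier router; the heuristic and model names are assumptions.
SYNTHESIS_HINTS = ("combine", "synthesize", "perspectives")

def route(task: str) -> dict:
    """Default to the small model in one pass; escalate only when the task
    plausibly benefits from orchestration (long input or synthesis-shaped)."""
    escalate = len(task) > 2000 or any(h in task.lower() for h in SYNTHESIS_HINTS)
    if escalate:
        return {"model": "claude", "workflow": ["plan", "execute", "critique", "refine"]}
    return {"model": "glm-4.7-flash", "workflow": ["direct"]}
```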
The community has overcorrected. Most "agentic workflow" patterns are designed for frontier models and actively hurt when applied to smaller models.
Conclusion
Multi-step LLM workflow orchestration is not universally beneficial. It helps at large scale (≥70B) and hurts at small scale (≤30B). For practitioners deploying small-to-medium models, the optimal "agent" is just a well-crafted single prompt.
Key results:
- Direct prompting scored 0.908 with GLM-4.7-Flash, beating best multi-step (0.890)
- Adding orchestration at small scale increased token cost 2-3× while decreasing quality
- Inflection point: 30B–70B parameters where orchestration starts helping
- Seed variance (±0.013–0.020) exceeds model differences at 7B scale
- Exception: Synthesis tasks benefit from two-pass even at small scale (+0.12)
The AI engineering community has overcorrected toward orchestration complexity. For most production deployments using sub-70B models, simple prompting wins.