AI Research

AgentEvolve: When LLM Orchestration Helps — and When It Hurts

Ben Zanghi
February 14, 2026
12 min read

Abstract

Multi-step LLM workflow orchestration (plan→execute, critique→refine) is widely assumed to improve output quality. I tested this assumption empirically using evolutionary search across 6 models and 18 experimental runs. The answer: orchestration helps at 70B parameters and above but hurts at 30B and below; the inflection point lies between 30B and 70B.

The core finding: orchestration effectiveness is entirely model-dependent. For models ≤30B parameters, adding orchestration steps degrades quality while multiplying token cost. For models ≥70B, orchestration consistently helps. Direct single-step prompting scored 0.908 on the benchmark suite with GLM-4.7-Flash (30B MoE), beating the best multi-step workflow (0.890) while using 40% fewer tokens.

The Problem

The AI engineering community has converged on a pattern: breaking LLM tasks into multi-step workflows improves output quality. Popular agent frameworks default to plan→execute→critique→refine pipelines, reasoning that explicit decomposition and self-reflection must outperform single-shot generation.

This assumption makes intuitive sense. Humans break complex tasks into steps. Expert performance involves planning, execution, and review. But most agent benchmarks don't isolate the orchestration pattern itself—they compare "agent with tools" vs "no agent."

Core question: Does multi-step workflow orchestration causally improve LLM output quality, holding prompt quality constant?

Methodology

I built AgentEvolve, a FunSearch-inspired evolutionary search system that discovers optimal LLM workflow configurations.
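The search loop itself is a simple evolutionary procedure: sample candidate workflows, score them on the benchmark tasks, keep the best, and mutate. The sketch below is illustrative rather than the actual AgentEvolve implementation (which is FunSearch-inspired); random_workflow, mutate_workflow, and score_workflow are hypothetical helpers standing in for the real sampling operators and benchmark harness.

import random

POPULATION_SIZE = 20
GENERATIONS = 15

def evolve_workflows(random_workflow, mutate_workflow, score_workflow):
    """Evolve workflow configurations toward higher benchmark scores.

    random_workflow() -> list[dict]   : sample a fresh 1-4 step workflow
    mutate_workflow(wf) -> list[dict] : tweak roles / temperature / max_tokens
    score_workflow(wf) -> float       : average score across benchmark tasks
    """
    # Start from a random population of candidate workflows.
    population = [random_workflow() for _ in range(POPULATION_SIZE)]
    scores = [score_workflow(wf) for wf in population]

    for _ in range(GENERATIONS):
        # Keep the top half as parents (simple truncation selection).
        ranked = sorted(range(len(population)), key=lambda i: scores[i], reverse=True)
        parents = [population[i] for i in ranked[: POPULATION_SIZE // 2]]

        # Refill the population with mutated copies of random parents.
        children = [mutate_workflow(random.choice(parents))
                    for _ in range(POPULATION_SIZE - len(parents))]
        population = parents + children
        scores = [score_workflow(wf) for wf in population]

    best = max(range(len(population)), key=lambda i: scores[i])
    return population[best], scores[best]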

Benchmark Tasks

Five tasks designed to reward multi-step reasoning:

| Task | Description | Ideal Workflow |
| --- | --- | --- |
| multi_step_reasoning | Multi-hop logical deduction | Plan → Execute |
| code_with_edge_cases | Robust code with edge cases | Decompose → Implement → Test |
| contradictory_extraction | Facts from conflicting sources | Decompose → Synthesize |
| synthesis_from_perspectives | Combine multiple viewpoints | Decompose → Synthesize |
| debug_the_bug | Identify and fix subtle bugs | Analyze → Fix → Verify |
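Each task in the table pairs a prompt with an automatic scorer in [0, 1]. A minimal sketch of how such a task could be represented; the TaskSpec name, the example prompt, and the toy scorer below are illustrative, not AgentEvolve's actual schema or data.

from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskSpec:
    name: str
    prompt: str                     # what the workflow is asked to do
    score: Callable[[str], float]   # maps the final output to a 0-1 score

def score_contradictory_extraction(output: str) -> float:
    """Toy scorer: credit for each ground-truth fact recovered in the output."""
    expected_facts = ["the meeting moved to tuesday", "the budget is $40k"]  # invented example facts
    found = sum(1 for fact in expected_facts if fact in output.lower())
    return found / len(expected_facts)

TASKS = [
    TaskSpec(
        name="contradictory_extraction",
        prompt="Extract the facts both of the conflicting emails below agree on...",
        score=score_contradictory_extraction,
    ),
    # ... the other four tasks follow the same shape
]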

Candidate Workflow Structure

A candidate is a sequence of 1-4 steps, each with a role and hyperparameters:

workflow = [
    {
        "role": "plan",        # one of: plan, execute, critique, refine, decompose, synthesize, direct
        "temperature": 0.67,
        "max_tokens": 1024,
    },
    # ... up to 3 more steps (1-4 steps total)
]
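Executing a candidate is a left-to-right pass over its steps: each step sees the original task plus the previous step's output. A minimal sketch, assuming a hypothetical call_model(prompt, temperature, max_tokens) wrapper around whatever inference backend is in use; the role templates are illustrative, not the system's actual prompt wording.

# Hypothetical per-role prompt templates.
ROLE_TEMPLATES = {
    "direct":     "{task}",
    "plan":       "Outline a step-by-step plan for the task below. Do not solve it yet.\n\n{task}",
    "execute":    "Task:\n{task}\n\nPlan / prior work:\n{prior}\n\nProduce the final answer.",
    "critique":   "Task:\n{task}\n\nDraft:\n{prior}\n\nList concrete problems with the draft.",
    "refine":     "Task:\n{task}\n\nDraft and critique:\n{prior}\n\nProduce an improved answer.",
    "decompose":  "Break the task below into independent subproblems.\n\n{task}",
    "synthesize": "Task:\n{task}\n\nIntermediate results:\n{prior}\n\nCombine them into one answer.",
}

def run_workflow(workflow, task, call_model):
    """Run a 1-4 step workflow; each step sees the task and the previous output."""
    prior = ""
    for step in workflow:
        prompt = ROLE_TEMPLATES[step["role"]].format(task=task, prior=prior)
        prior = call_model(
            prompt,
            temperature=step["temperature"],
            max_tokens=step["max_tokens"],
        )
    return prior  # the last step's output is the workflow's answer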

Step roles:

direct: answer the task in a single pass, with no scaffolding
plan: outline an approach before solving
execute: carry out the plan or produce the final answer
critique: review the previous step's output for problems
refine: revise the output based on earlier critique
decompose: break the task into subproblems
synthesize: combine intermediate results into one answer

Models Tested

| Model | Parameters | Architecture | Seeds |
| --- | --- | --- | --- |
| llama3.1:8b | 8B | Dense | 3 |
| mistral:7b | 7B | Dense | 3 |
| qwen2.5:7b | 7B | Dense | 3 |
| glm-opus-distill | 30B MoE (3B active) | Mixture-of-Experts | 2 |
| llama-3.3-70b | 70B | Dense | 1 |
| llama-3.1-405b | 405B | Dense | 1 |

Results

Primary Finding: Orchestration Inflection Point

| Model Class | Best Workflow | Score | Token Cost |
| --- | --- | --- | --- |
| 7-8B models | 1-2 steps (direct) | 0.73-0.77 | 4K-7K |
| 30B MoE | 1 step (direct, temp=0.67) | 0.908 | 4.5K |
| 70B models | 2-4 steps | 0.853 | 12K-18K |
| 405B models | 2 steps | 0.893 | 8K-12K |

The inflection point: 30B–70B parameters. Below this threshold, orchestration amplifies noise. Above it, reflection and critique genuinely improve output.

GLM-4.7-Flash Deep Dive

| Rank | Workflow | Score | Avg Tokens |
| --- | --- | --- | --- |
| #1 | direct (T=0.67, tok=1024) | 0.908 | 4,555 |
| #2 | direct → direct | 0.890 | 7,528 |
| #3 | direct → direct → direct | 0.828 | 13,673 |
| #4 | plan → execute | 0.823 | 9,917 |
| #9 | critique → execute | 0.743 | 9,105 |
| #10 | refine → verify | 0.661 | 11,250 |

Simple wins decisively. The best workflow is just direct with temp=0.67, max_tokens=1024. Traditional patterns (plan → execute, critique → execute) underperformed.

Why Multi-Step Failed at Small Scale

Pipeline noise compounds: Each orchestration step introduces formatting drift, hallucinated details, and error amplification. At 7-30B scale, each generation pass degrades quality.

Prompt overhead dilutes signal: Multi-step workflows spend tokens on meta-instructions ("Now plan your approach...") that compete with actual task content.

Weak self-critique: A 7B model's "critique" is typically vacuous ("This looks good") or hallucinated ("There's an error in line 7" when line 7 is correct).

Built-in reasoning: GLM-4.7-Flash pairs an MoE architecture with built-in chain-of-thought. Even with think: false, the model handles internal reasoning well enough that external scaffolding only adds redundancy.

Task-Specific Insights

Performance gap between 405B and 7B average (per task):

| Task | 405B | 7B Avg | Delta |
| --- | --- | --- | --- |
| multi_step_reasoning | 1.00 | 0.65 | +0.35 |
| code_with_edge_cases | 0.83 | 0.75 | +0.08 |
| debug_the_bug | 1.00 | 0.78 | +0.22 |
| synthesis_from_perspectives | 0.88 | 0.73 | +0.15 |
| contradictory_extraction | 0.75 | 0.80 | -0.05 |

Surprise: 7B models slightly outperformed 405B on contradictory extraction—this task rewards literal text matching where smaller models are less prone to overthinking.

Synthesis exception: Even at 7-8B scale, synthesis tasks showed a +0.12 benefit from two-pass approaches. This was the only task where multi-step consistently helped.

Production Implications

For Small Models (≤30B)

Use direct single-shot prompting. A temperature of ~0.67 with a 1,024-token budget is the sweet spot. Don't build agent pipelines for these models: they cost 2-3× the tokens for equal or worse output.
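In practice that recommendation is a single call. A sketch against a local Ollama server; the model tag and prompt are placeholders, so swap in whichever ≤30B model you actually run.

import requests

def direct_answer(task: str, model: str = "glm-4.7-flash") -> str:
    """Single-shot call with the hyperparameters that won the search."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": task,
            "stream": False,
            "options": {
                "temperature": 0.67,   # best-performing temperature in the sweep
                "num_predict": 1024,   # Ollama's name for the max-token budget
            },
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]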

Exception: Synthesis tasks. Combining multiple perspectives benefits from two-pass decompose → synthesize even at small scale.
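The two-pass exception looks like this in practice, reusing the direct_answer sketch above; the prompt wording is illustrative.

def synthesize_from_perspectives(question: str, perspectives: list[str]) -> str:
    """Pass 1: pull out each perspective's key claims. Pass 2: merge them."""
    # Pass 1: decompose each source independently.
    summaries = [
        direct_answer(
            f"List the key claims in the following perspective on '{question}':\n\n{p}"
        )
        for p in perspectives
    ]
    # Pass 2: synthesize the per-source summaries into one answer.
    merged = "\n\n".join(summaries)
    return direct_answer(
        f"Question: {question}\n\n"
        f"Key claims from each perspective:\n{merged}\n\n"
        "Write a single answer that reconciles these perspectives."
    )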

For Large Models (≥70B)

Orchestration pays off. Plan→execute and refine→direct patterns show consistent quality gains. Budget for 2-4 step workflows.
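With the run_workflow sketch from the methodology section, a plan→execute pipeline for a 70B-class model is just a two-element configuration; the hyperparameters below are illustrative, not tuned values from the search.

plan_execute = [
    {"role": "plan",    "temperature": 0.7, "max_tokens": 512},
    {"role": "execute", "temperature": 0.4, "max_tokens": 1024},
]

# answer = run_workflow(plan_execute, task, call_model)  # call_model should target a 70B+ model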

For Agent Framework Design

Model-aware routing. Claudia's voice bridge uses two-tier routing: GLM for fast single-shot tasks, Claude for complex multi-step work. These results validate that design: GLM handles most tasks in a single pass, and only genuinely complex work routes to frontier models.
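A hedged sketch of that two-tier idea. The task-type buckets and the callables are assumptions standing in for a real classifier and real model clients, not the voice bridge's actual routing code.

# Illustrative task buckets; a real router would classify tasks more carefully.
SINGLE_SHOT_TASK_TYPES = {"extraction", "reasoning", "debugging", "short_qa"}

def route(task_type: str, task: str, small_model, frontier_model) -> str:
    """Send single-shot-friendly work to the small local model, the rest upstream.

    small_model / frontier_model are callables: prompt -> completion text.
    """
    if task_type in SINGLE_SHOT_TASK_TYPES:
        # One direct pass, no orchestration (e.g. the direct_answer sketch above).
        return small_model(task)
    # Complex multi-step work gets a 2-4 step workflow on the frontier tier.
    return frontier_model(task)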

The community has overcorrected. Most "agentic workflow" patterns are designed for frontier models and actively hurt when applied to smaller models.

Conclusion

Multi-step LLM workflow orchestration is not universally beneficial. It helps at large scale (≥70B) and hurts at small scale (≤30B). For practitioners deploying small-to-medium models, the optimal "agent" is just a well-crafted single prompt.

Key results:

- Direct single-step prompting with GLM-4.7-Flash (30B MoE) scored 0.908, beating the best multi-step workflow (0.890) while using roughly 40% fewer tokens.
- The inflection point sits between 30B and 70B parameters: below it, extra orchestration steps amplify noise; above it, reflection and critique genuinely help.
- Synthesis was the only task family where two-pass workflows helped even at 7-8B scale (+0.12).

The AI engineering community has overcorrected toward orchestration complexity. For most production deployments using sub-70B models, simple prompting wins.
