GEPA Fundamentals

This document explains what GEPA is, how the evolution loop works, and what the adapter contract requires.

What is GEPA?

GEPA = Genetic-Pareto prompt optimizer

GEPA is an evolutionary algorithm that optimizes text components of any system. It treats prompts, instructions, and configurations as evolvable "genes" that can be improved through iteration.

The core loop:

  1. Evaluation - Run the system, get scores
  2. Reflection - Analyze what worked and what didn't
  3. Mutation - Propose improved text based on reflection
  4. Selection - Accept improvements, track Pareto frontier

How the Evolution Loop Works

GEPA takes a candidate (dict[str, str] mapping component names to text) and iteratively improves it:

┌────────────────────────────────────────────────────────────┐
│                       Evolution Loop                       │
│                                                            │
│   ┌──────────────┐         ┌──────────────┐                │
│   │   EVALUATE   │────────▶│   REFLECT    │                │
│   │              │         │              │                │
│   │ Run on batch │         │ Build trials │                │
│   │ Get scores   │         │ from results │                │
│   └──────────────┘         └──────┬───────┘                │
│          ▲                        │                        │
│          │                        ▼                        │
│   ┌──────┴───────┐         ┌──────────────┐                │
│   │ ACCEPT/REJECT│◀────────│   PROPOSE    │                │
│   │              │         │              │                │
│   │ Score better?│         │ LLM suggests │                │
│   │ Keep change  │         │ improvements │                │
│   └──────────────┘         └──────────────┘                │
│                                                            │
└────────────────────────────────────────────────────────────┘

Each iteration:

  1. EVALUATE: Run the candidate on a batch of examples → outputs, scores, trajectories
  2. REFLECT: Transform trajectories into a dataset the reflection LLM can analyze
  3. PROPOSE: LLM analyzes the dataset → proposes improved component text
  4. ACCEPT/REJECT: If the new score is better → accept the mutation
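
For a feel of the mechanics, here is a minimal Python sketch of one iteration. It assumes a hypothetical adapter object implementing the contract described below; which calls are awaited and the aggregate-score acceptance rule are illustrative simplifications (the real optimizer also tracks a Pareto frontier, covered later).

# A candidate maps component names to their current text, e.g.:
candidate = {
    "planner_instructions": "Break the task into steps...",
    "critic_prompt": "Check the answer for factual errors...",
}

async def gepa_iteration(adapter, candidate, batch, components_to_update):
    # 1. EVALUATE: run the system on the batch
    current = await adapter.evaluate(batch, candidate)

    # 2. REFLECT: turn trajectories into an LLM-readable dataset
    reflective = adapter.make_reflective_dataset(
        candidate, current, components_to_update
    )

    # 3. PROPOSE: ask the reflection LLM for improved component text
    new_texts = adapter.propose_new_texts(
        candidate, reflective, components_to_update
    )
    mutated = {**candidate, **new_texts}

    # 4. ACCEPT/REJECT: keep the mutation only if it scores better
    mutated_eval = await adapter.evaluate(batch, mutated)
    return mutated if sum(mutated_eval.scores) > sum(current.scores) else candidate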

The Key Insight

GEPA doesn't care what "the system" is. It just needs:

  • A way to run the system and get scores
  • A way to capture execution context (trajectories)
  • A way to surface that context for reflection

This abstraction is powerful: the same evolution loop works for single agents, multi-agent pipelines, and complex workflow structures.

The Adapter Contract

The AsyncGEPAAdapter protocol defines the three methods GEPA needs from any system:

1. evaluate(batch, candidate) → EvaluationBatch

Runs the system and returns results.

Inputs:

  • batch - List of examples to run
  • candidate - Current component texts (dict[str, str])

Outputs:

  • outputs - What the system produced (opaque to GEPA)
  • scores - Per-example numeric scores (higher = better)
  • trajectories - Execution context for reflection (opaque to GEPA)
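
For instance, a minimal evaluate might look like the sketch below, with EvaluationBatch shown as a plain dataclass and run_system / score_output standing in as hypothetical helpers:

from dataclasses import dataclass
from typing import Any

@dataclass
class EvaluationBatch:
    outputs: list[Any]        # opaque to GEPA
    scores: list[float]       # higher = better
    trajectories: list[Any]   # opaque to GEPA; consumed during reflection

async def evaluate(batch: list[Any], candidate: dict[str, str]) -> EvaluationBatch:
    outputs, scores, trajectories = [], [], []
    for example in batch:
        # Hypothetical helper: runs the system with the candidate's
        # component texts and returns (final_output, execution_trace).
        result, trace = await run_system(example, candidate)
        outputs.append(result)
        scores.append(score_output(result, example))  # hypothetical scorer
        trajectories.append(trace)
    return EvaluationBatch(outputs, scores, trajectories)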

2. make_reflective_dataset(candidate, eval_batch, components_to_update) → dict

Transforms trajectories into a JSON-serializable dataset for the reflection LLM.

Recommended output format:

{
    "component_name": [
        {
            "Inputs": {...},            # What went in
            "Generated Outputs": {...}, # What came out
            "Feedback": "..."           # What was wrong/right
        },
        ...
    ]
}
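
As a sketch, trajectories might be transformed into that format like this; the trace fields ("inputs", "outputs", "feedback") are assumptions about what the adapter chose to capture during evaluate:

def make_reflective_dataset(candidate, eval_batch, components_to_update):
    dataset = {}
    for component in components_to_update:
        examples = []
        for trace, score in zip(eval_batch.trajectories, eval_batch.scores):
            examples.append({
                "Inputs": trace["inputs"],
                "Generated Outputs": trace["outputs"],
                "Feedback": f"Score: {score:.2f}. {trace.get('feedback', '')}",
            })
        dataset[component] = examples
    return dataset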

3. propose_new_texts(candidate, reflective_dataset, components_to_update) → dict[str, str]

(Optional) Custom proposal logic. If not implemented, GEPA uses its default reflection agent.
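
Taken together, the contract can be pictured as a Python Protocol, reusing the EvaluationBatch dataclass sketched above (signatures approximate, not the library's exact definitions):

from typing import Any, Protocol

class AsyncGEPAAdapter(Protocol):
    async def evaluate(
        self, batch: list[Any], candidate: dict[str, str]
    ) -> EvaluationBatch: ...

    def make_reflective_dataset(
        self,
        candidate: dict[str, str],
        eval_batch: EvaluationBatch,
        components_to_update: list[str],
    ) -> dict: ...

    # Optional override; GEPA's default reflection agent is used otherwise.
    def propose_new_texts(
        self,
        candidate: dict[str, str],
        reflective_dataset: dict,
        components_to_update: list[str],
    ) -> dict[str, str]: ...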

Trajectories vs Outputs

A critical distinction:

Concept      Who Sees It         Purpose
Output       Scorer/Critic       Single result to evaluate
Trajectory   Reflection Agent    Full execution context

The scorer sees ONE thing—whatever we decide is "the output."

The reflection sees EVERYTHING we capture in trajectories—tool calls, intermediate states, token usage, reasoning chains.
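
To make the distinction concrete, here is an illustrative pair; the trajectory's structure is entirely the adapter's choice:

# What the scorer sees: a single output.
output = "The capital of France is Paris."

# What the reflection agent sees: everything the adapter captured.
trajectory = {
    "inputs": {"question": "What is the capital of France?"},
    "tool_calls": [{"tool": "search", "args": {"query": "capital of France"}}],
    "intermediate_states": ["Retrieved: 'Paris is the capital of France.'"],
    "token_usage": {"prompt": 412, "completion": 38},
    "outputs": "The capital of France is Paris.",
}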

This separation frames the key design question for any adapter:

What execution context (trajectories) should we capture, and how do we transform that into a reflective dataset that helps the reflection LLM propose better component text?

For workflow agents with multiple steps, the question isn't "which output do we score?" but rather "what context helps the reflection agent understand what went wrong and how to fix it?"

Pareto Frontier

GEPA tracks a Pareto frontier of candidates when optimizing multiple objectives. A candidate is Pareto-optimal if no other candidate dominates it: that is, no candidate is at least as good on every objective and strictly better on at least one.

Score B
    │     ○ Pareto-optimal
    │    ╱
    │   ○
    │  ╱
    │ ○
    │╱
    └──────────────▶ Score A

This enables multi-objective optimization without forcing a single "best" solution.
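
A minimal sketch of the dominance test behind the frontier (the standard definition, not GEPA's internal code):

def dominates(a: list[float], b: list[float]) -> bool:
    # a dominates b: at least as good on every objective,
    # strictly better on at least one.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(candidates: dict[str, list[float]]) -> list[str]:
    # Keep every candidate that no other candidate dominates.
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]

# C is dominated by B, so the frontier is A and B:
pareto_frontier({"A": [0.9, 0.2], "B": [0.5, 0.8], "C": [0.4, 0.1]})  # ["A", "B"]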

Next Steps