Scorer
scorer ¶
Protocol definition for scoring agent outputs.
This module defines the Scorer protocol, which enables custom scoring logic for evaluating agent outputs in the evolution engine. The protocol defines both synchronous and asynchronous scoring methods, each returning a (score, metadata) tuple.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| Scorer | Protocol for scoring agent outputs. |
Examples:

Implement a simple exact-match scorer:

```python
from gepa_adk.ports import Scorer


class ExactMatchScorer:
    def score(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
    ) -> tuple[float, dict]:
        if expected is None:
            return 0.0, {"error": "Expected value required"}
        match = output.strip() == expected.strip()
        return (1.0 if match else 0.0), {"exact_match": match}

    async def async_score(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
    ) -> tuple[float, dict]:
        return self.score(input_text, output, expected)
```

Use the scorer:

```python
scorer = ExactMatchScorer()
score, metadata = scorer.score("What is 2+2?", "4", "4")
assert score == 1.0
assert metadata["exact_match"] is True
```
Note
The protocol defines both synchronous and asynchronous scoring methods to support various use cases. Score values should be normalized between 0.0 and 1.0 by convention, with higher values indicating better performance. The protocol does not enforce this range.
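Because the range is only a convention, implementations that derive scores from unbounded raw metrics typically normalize or clamp them before returning. A minimal sketch of such a helper (this clamp_score function is illustrative, not part of gepa_adk):

```python
def clamp_score(raw: float) -> float:
    """Clamp an arbitrary raw metric into the conventional [0.0, 1.0] range."""
    return max(0.0, min(1.0, raw))


# Used inside a scorer's score() method, e.g.:
# return clamp_score(raw_similarity), {"raw_score": raw_similarity}
```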
Scorer ¶
Bases: Protocol
Protocol for scoring agent outputs.
Implementations provide scoring logic that evaluates how well an agent's output matches expected results or quality criteria.
Both synchronous and asynchronous methods are defined. Implementations should provide both, though callers may use only one based on context.
Examples:

Implement a simple fixed scorer for testing:

```python
class FixedScorer:
    def score(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
    ) -> tuple[float, dict]:
        return 0.5, {"note": "Fixed score for testing"}

    async def async_score(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
    ) -> tuple[float, dict]:
        return self.score(input_text, output, expected)
```

Verify protocol compliance:

```python
from gepa_adk.ports import Scorer

scorer = FixedScorer()
assert isinstance(scorer, Scorer)  # Runtime check works
```
Note
All implementations must provide both score() and async_score() methods to satisfy the protocol. Score values should be normalized between 0.0 and 1.0 by convention, with higher values indicating better performance. The protocol does not enforce this range.
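The runtime isinstance() check works because Scorer is a runtime-checkable structural protocol, as the check above implies. The authoritative definition lives in src/gepa_adk/ports/scorer.py; what follows is only a minimal sketch of what such a protocol looks like, with signatures taken from the documentation below:

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Scorer(Protocol):
    """Structural type: any class providing both methods satisfies the protocol."""

    def score(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
    ) -> tuple[float, dict[str, Any]]: ...

    async def async_score(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
    ) -> tuple[float, dict[str, Any]]: ...
```

Keep in mind that a runtime_checkable isinstance() check only verifies that the method names exist; full signatures and return types are enforced by static type checkers, not at runtime.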
Source code in src/gepa_adk/ports/scorer.py
score ¶
Score an agent output synchronously.
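For reference, the synchronous signature mirrors async_score() documented below:

```python
score(
    input_text: str,
    output: str,
    expected: str | None = None,
) -> tuple[float, dict[str, Any]]
```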
| PARAMETER | DESCRIPTION |
|---|---|
| input_text | The input provided to the agent. TYPE: str |
| output | The agent's generated output to score. TYPE: str |
| expected | Optional expected/reference output for comparison. Pass None for open-ended evaluation without expected output. TYPE: str \| None DEFAULT: None |
| RETURNS | DESCRIPTION |
|---|---|
| tuple[float, dict[str, Any]] | A tuple of (score, metadata), where score is a float (by convention between 0.0 and 1.0, higher is better) and metadata is a dict of additional scoring details such as feedback or match flags. |
Examples:

Basic usage:

```python
score, meta = scorer.score("What is 2+2?", "4", "4")
assert score == 1.0
assert meta.get("exact_match") is True
```

Scoring without expected output:

```python
score, meta = scorer.score("Explain gravity", response)
# Scorer evaluates based on quality criteria, not exact match
```
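For open-ended evaluation, the implementation has to judge quality rather than compare against a reference string. A minimal sketch, assuming a simple keyword-coverage heuristic (the KeywordCoverageScorer class and its keyword list are illustrative, not part of gepa_adk):

```python
class KeywordCoverageScorer:
    """Illustrative open-ended scorer: fraction of required keywords present."""

    def __init__(self, keywords: list[str]) -> None:
        self.keywords = [k.lower() for k in keywords]

    def score(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
    ) -> tuple[float, dict]:
        # `expected` is ignored; the score is purely keyword coverage.
        text = output.lower()
        hits = [k for k in self.keywords if k in text]
        coverage = len(hits) / len(self.keywords) if self.keywords else 0.0
        return coverage, {"matched_keywords": hits}

    async def async_score(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
    ) -> tuple[float, dict]:
        return self.score(input_text, output, expected)
```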
Note
Operations complete synchronously and block until scoring finishes. Use async_score() for I/O-bound operations like LLM calls.
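Conversely, if an implementation's scoring logic is inherently blocking (for example a CPU-heavy text metric), its async_score() can still avoid stalling the event loop by delegating to a worker thread. A sketch assuming Python 3.9+ for asyncio.to_thread; the difflib-based metric is illustrative only:

```python
import asyncio
import difflib


class RatioScorer:
    """Illustrative scorer built on a blocking similarity metric."""

    def score(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
    ) -> tuple[float, dict]:
        if expected is None:
            return 0.0, {"error": "Expected value required"}
        ratio = difflib.SequenceMatcher(None, output, expected).ratio()
        return ratio, {"ratio": ratio}

    async def async_score(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
    ) -> tuple[float, dict]:
        # Run the blocking computation off the event loop (Python 3.9+).
        return await asyncio.to_thread(self.score, input_text, output, expected)
```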
Source code in src/gepa_adk/ports/scorer.py
async_score async ¶
```python
async_score(
    input_text: str,
    output: str,
    expected: str | None = None,
) -> tuple[float, dict[str, Any]]
```
Score an agent output asynchronously.
| PARAMETER | DESCRIPTION |
|---|---|
| input_text | The input provided to the agent. TYPE: str |
| output | The agent's generated output to score. TYPE: str |
| expected | Optional expected/reference output for comparison. Pass None for open-ended evaluation without expected output. TYPE: str \| None DEFAULT: None |
| RETURNS | DESCRIPTION |
|---|---|
| tuple[float, dict[str, Any]] | A tuple of (score, metadata), where score is a float (by convention between 0.0 and 1.0, higher is better) and metadata is a dict of additional scoring details such as feedback or match flags. |
Examples:

Async usage:

```python
score, meta = await scorer.async_score("Explain gravity", response)
print(f"Quality: {score:.2f} - {meta.get('feedback')}")
```

Concurrent scoring:

```python
import asyncio

tasks = [
    scorer.async_score(input, output, expected)
    for input, output, expected in batch
]
scores = await asyncio.gather(*tasks)
```
Note
Operations run asynchronously and can be executed concurrently. Prefer this method for I/O-bound scoring operations such as LLM-based evaluation or external API calls.
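When scoring large batches against a rate-limited backend such as an LLM judge, it can help to bound concurrency instead of gathering everything at once. A small sketch (the score_batch helper and the limit of 8 are illustrative, not part of gepa_adk):

```python
import asyncio
from typing import Any

from gepa_adk.ports import Scorer


async def score_batch(
    scorer: Scorer,
    batch: list[tuple[str, str, str | None]],
    limit: int = 8,
) -> list[tuple[float, dict[str, Any]]]:
    """Score (input_text, output, expected) triples with bounded concurrency."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(input_text: str, output: str, expected: str | None):
        async with semaphore:
            return await scorer.async_score(input_text, output, expected)

    return await asyncio.gather(*(bounded(*item) for item in batch))
```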