
Scoring

scoring

Scoring infrastructure for evolution evaluation.

Contains CriticScorer for LLM-based evaluation and the create_critic() preset factory for pre-configured critic agents.

ATTRIBUTE DESCRIPTION
CriticScorer

LLM-based scorer using critic agents.

SimpleCriticOutput

KISS schema with score + feedback.

CriticOutput

Advanced schema with dimensions and guidance.

SIMPLE_CRITIC_INSTRUCTION

Generic instruction for simple critics.

TYPE: str

ADVANCED_CRITIC_INSTRUCTION

Generic instruction for advanced critics.

TYPE: str

STRUCTURED_OUTPUT_CRITIC_INSTRUCTION

Preset instruction for structure evaluation.

TYPE: str

ACCURACY_CRITIC_INSTRUCTION

Preset instruction for factual accuracy evaluation.

TYPE: str

RELEVANCE_CRITIC_INSTRUCTION

Preset instruction for relevance evaluation.

TYPE: str

normalize_feedback

Normalizes critic output to trial format.

TYPE: dict[str, Any]

create_critic

Factory for pre-configured critic agents by preset name.

TYPE: LlmAgent

critic_presets

Maps preset name to human-readable description.

TYPE: dict[str, str]

Examples:

Create a critic scorer with an executor:

from google.adk.agents import LlmAgent
from gepa_adk.adapters.scoring import CriticScorer, CriticOutput
from gepa_adk.adapters.execution.agent_executor import AgentExecutor

critic = LlmAgent(
    name="quality_critic",
    model="gemini-2.5-flash",
    instruction="Evaluate response quality...",
    output_schema=CriticOutput,
)
executor = AgentExecutor()
scorer = CriticScorer(critic_agent=critic, executor=executor)

Note

This package isolates critic-based scoring from other adapter concerns.

CriticOutput

Bases: BaseModel



Advanced schema for structured critic feedback with dimensions.

This schema defines the expected JSON structure that critic agents should return when configured with output_schema. The score field is required, while other fields are optional and will be preserved in metadata.

ATTRIBUTE DESCRIPTION
score

Score value between 0.0 and 1.0 (required).

TYPE: float

feedback

Human-readable feedback text (optional).

TYPE: str

dimension_scores

Per-dimension evaluation scores (optional).

TYPE: dict[str, float]

actionable_guidance

Specific improvement suggestions (optional).

TYPE: str

Examples:

Advanced critic output:

{
    "score": 0.75,
    "feedback": "Good response but could be more concise",
    "dimension_scores": {
        "accuracy": 0.9,
        "clarity": 0.6,
        "completeness": 0.8
    },
    "actionable_guidance": "Reduce response length by 30%"
}
Note

All critic agents using this schema must return structured JSON. When this schema is used as output_schema on an LlmAgent, the agent can ONLY reply and CANNOT use any tools. This is acceptable for critic agents focused on scoring.
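The contract stated above (a required `score` in [0.0, 1.0], everything else optional) can be checked in a few lines of stdlib Python. `validate_critic_payload` below is a hypothetical helper written for illustration, not part of gepa_adk:

```python
import json


def validate_critic_payload(raw: str) -> dict:
    """Check a critic JSON payload against the documented contract:
    'score' is required and must lie in [0.0, 1.0]; other fields
    (feedback, dimension_scores, actionable_guidance) are optional."""
    payload = json.loads(raw)
    if "score" not in payload:
        raise ValueError("missing required 'score' field")
    score = payload["score"]
    if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be a number in [0.0, 1.0], got {score!r}")
    return payload


payload = validate_critic_payload(
    '{"score": 0.75, "feedback": "Good response but could be more concise"}'
)
print(payload["score"])  # 0.75
```

In the library itself this validation is handled by Pydantic via the `Field` constraints on `CriticOutput`; the sketch just makes the bounds explicit.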

See Also

gepa_adk.adapters.scoring.critic_scorer.SimpleCriticOutput: KISS schema with just score + feedback.

Source code in src/gepa_adk/adapters/scoring/critic_scorer.py
class CriticOutput(BaseModel):
    """Advanced schema for structured critic feedback with dimensions.

    This schema defines the expected JSON structure that critic agents
    should return when configured with output_schema. The score field is
    required, while other fields are optional and will be preserved in
    metadata.

    Attributes:
        score (float): Score value between 0.0 and 1.0 (required).
        feedback (str): Human-readable feedback text (optional).
        dimension_scores (dict[str, float]): Per-dimension evaluation scores (optional).
        actionable_guidance (str): Specific improvement suggestions (optional).

    Examples:
        Advanced critic output:

        ```json
        {
            "score": 0.75,
            "feedback": "Good response but could be more concise",
            "dimension_scores": {
                "accuracy": 0.9,
                "clarity": 0.6,
                "completeness": 0.8
            },
            "actionable_guidance": "Reduce response length by 30%"
        }
        ```

    Note:
        All critic agents using this schema must return structured JSON.
        When this schema is used as output_schema on an LlmAgent, the
        agent can ONLY reply and CANNOT use any tools. This is acceptable
        for critic agents focused on scoring.

    See Also:
        [gepa_adk.adapters.scoring.critic_scorer.SimpleCriticOutput][]:
            KISS schema with just score + feedback.
    """

    score: float = Field(
        ...,
        ge=0.0,
        le=1.0,
        description="Score from 0.0 to 1.0",
    )
    feedback: str = Field(
        default="",
        description="Human-readable feedback",
    )
    dimension_scores: dict[str, float] = Field(
        default_factory=dict,
        description="Per-dimension scores",
    )
    actionable_guidance: str = Field(
        default="",
        description="Improvement suggestions",
    )

CriticScorer

Adapter that wraps ADK critic agents to provide structured scoring.

CriticScorer implements the Scorer protocol, enabling integration with gepa-adk's evaluation and evolution workflows. It executes ADK critic agents (LlmAgent, SequentialAgent, etc.) and extracts structured scores with metadata from their outputs.

ATTRIBUTE DESCRIPTION
critic_agent

ADK agent configured for evaluation.

TYPE: BaseAgent

_session_service

Session service for state management.

TYPE: BaseSessionService

_app_name

Application name for session identification.

TYPE: str

_logger

Bound logger with scorer context.

TYPE: BoundLogger

Examples:

Basic usage:

from google.adk.agents import LlmAgent
from gepa_adk.adapters.scoring.critic_scorer import CriticScorer, CriticOutput
from gepa_adk.adapters.execution.agent_executor import AgentExecutor

critic = LlmAgent(
    name="quality_critic",
    model="gemini-2.5-flash",
    instruction="Evaluate response quality...",
    output_schema=CriticOutput,
)

executor = AgentExecutor()
scorer = CriticScorer(critic_agent=critic, executor=executor)
score, metadata = await scorer.async_score(
    input_text="What is Python?",
    output="Python is a programming language.",
)
Note

Adapter wraps ADK critic agents to provide structured scoring. Implements Scorer protocol for compatibility with evolution engine. Creates isolated sessions per scoring call unless session_id provided.

Source code in src/gepa_adk/adapters/scoring/critic_scorer.py
class CriticScorer:
    """Adapter that wraps ADK critic agents to provide structured scoring.

    CriticScorer implements the Scorer protocol, enabling integration with
    gepa-adk's evaluation and evolution workflows. It executes ADK critic
    agents (LlmAgent, SequentialAgent, etc.) and extracts structured scores
    with metadata from their outputs.

    Attributes:
        critic_agent (BaseAgent): ADK agent configured for evaluation.
        _session_service (BaseSessionService): Session service for state
            management.
        _app_name (str): Application name for session identification.
        _logger (structlog.BoundLogger): Bound logger with scorer context.

    Examples:
        Basic usage:

        ```python
        from google.adk.agents import LlmAgent
        from gepa_adk.adapters.scoring.critic_scorer import CriticScorer, CriticOutput
        from gepa_adk.adapters.execution.agent_executor import AgentExecutor

        critic = LlmAgent(
            name="quality_critic",
            model="gemini-2.5-flash",
            instruction="Evaluate response quality...",
            output_schema=CriticOutput,
        )

        executor = AgentExecutor()
        scorer = CriticScorer(critic_agent=critic, executor=executor)
        score, metadata = await scorer.async_score(
            input_text="What is Python?",
            output="Python is a programming language.",
        )
        ```

    Note:
        Adapter wraps ADK critic agents to provide structured scoring.
        Implements Scorer protocol for compatibility with evolution engine.
        Creates isolated sessions per scoring call unless session_id provided.
    """

    def __init__(
        self,
        critic_agent: BaseAgent,
        executor: AgentExecutorProtocol,
        session_service: BaseSessionService | None = None,
        app_name: str = "critic_scorer",
    ) -> None:
        """Initialize CriticScorer with critic agent.

        Args:
            critic_agent: ADK agent (LlmAgent or workflow agent) configured
                for evaluation.
            executor: AgentExecutorProtocol implementation for unified agent
                execution. Handles session management and execution, enabling
                feature parity across all agent types.
            session_service: Optional session service for state management.
                If None, creates an InMemorySessionService.
            app_name: Application name for session identification.

        Raises:
            TypeError: If critic_agent is not a BaseAgent instance.
            ValueError: If app_name is empty string.

        Examples:
            Basic setup with executor:

            ```python
            from gepa_adk.adapters.execution.agent_executor import AgentExecutor

            executor = AgentExecutor()
            scorer = CriticScorer(critic_agent=critic, executor=executor)
            ```

            With shared session service:

            ```python
            from google.adk.sessions import InMemorySessionService
            from gepa_adk.adapters.execution.agent_executor import AgentExecutor

            session_service = InMemorySessionService()
            executor = AgentExecutor(session_service=session_service)
            scorer = CriticScorer(
                critic_agent=critic,
                executor=executor,
                session_service=session_service,
            )
            ```

        Note:
            Creates logger with scorer context and validates agent type.
        """
        if not isinstance(critic_agent, BaseAgent):
            raise TypeError(f"critic_agent must be BaseAgent, got {type(critic_agent)}")

        if not app_name or not app_name.strip():
            raise ValueError("app_name cannot be empty")

        self.critic_agent = critic_agent
        self._session_service = session_service or InMemorySessionService()
        self._app_name = app_name.strip()
        self._executor = executor

        # Bind logger with scorer context
        self._logger = logger.bind(
            scorer="CriticScorer",
            agent_name=self.critic_agent.name,
            app_name=self._app_name,
            uses_executor=True,  # Always true since executor is required
        )

        self._logger.info("scorer.initialized")

    def _format_critic_input(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
    ) -> str:
        """Format input for critic agent evaluation.

        Builds a prompt that presents the input query, agent output, and
        optionally the expected output for the critic to evaluate.

        Args:
            input_text: The original input provided to the agent being evaluated.
            output: The agent's generated output to score.
            expected: Optional expected/reference output for comparison.

        Returns:
            Formatted prompt string for the critic agent.

        Examples:
            Basic formatting:

            ```python
            prompt = scorer._format_critic_input(
                input_text="What is 2+2?",
                output="4",
                expected="4",
            )
            ```

        Note:
            Organizes input for critic evaluation with clearly labeled sections.
            Format is designed to give critic context for evaluation.
            Expected output is included only if provided.
        """
        parts = [
            "Input Query:",
            input_text,
            "",
            "Agent Output:",
            output,
        ]

        if expected is not None:
            parts.extend(
                [
                    "",
                    "Expected Output:",
                    expected,
                ]
            )

        parts.append("")
        parts.append(
            "Please evaluate the agent output and provide a score with feedback."
        )

        return "\n".join(parts)

    def _parse_critic_output(self, output_text: str) -> tuple[float, dict[str, Any]]:
        """Parse critic agent output and extract score with metadata.

        Parses the critic's output text as JSON and extracts the score field
        along with optional metadata (feedback, dimension_scores,
        actionable_guidance, and any additional fields).

        Args:
            output_text: Raw text output from critic agent.

        Returns:
            Tuple of (score, metadata) where:
            - score: Float value extracted from output
            - metadata: Dict containing feedback, dimension_scores,
                actionable_guidance, and any additional fields

        Raises:
            CriticOutputParseError: If output cannot be parsed as JSON.
            MissingScoreFieldError: If parsed JSON lacks required score field.

        Examples:
            Parse structured output:

            ```python
            output = '{"score": 0.75, "feedback": "Good", "dimension_scores": {"accuracy": 0.9}}'
            score, metadata = scorer._parse_critic_output(output)
            assert score == 0.75
            assert metadata["feedback"] == "Good"
            ```

        Note:
            Obtains score and metadata from critic JSON output with validation.
            Preserves all fields from parsed JSON in metadata, not just
            the known CriticOutput schema fields. This allows for extensibility.
        """
        # Parse JSON output
        try:
            parsed = json.loads(output_text)
        except json.JSONDecodeError as e:
            raise CriticOutputParseError(
                f"Critic output is not valid JSON: {e}",
                raw_output=output_text,
                parse_error=str(e),
                cause=e,
            ) from e

        # Validate parsed output is a dict
        if not isinstance(parsed, dict):
            raise CriticOutputParseError(
                f"Critic output must be a JSON object, got {type(parsed).__name__}",
                raw_output=output_text,
                parse_error="Not a JSON object",
            )

        # Extract required score field
        if "score" not in parsed:
            raise MissingScoreFieldError(
                "Critic output missing required 'score' field",
                parsed_output=parsed,
            )

        score = parsed["score"]
        if not isinstance(score, (int, float)):
            raise MissingScoreFieldError(
                f"Score field must be numeric, got {type(score).__name__}",
                parsed_output=parsed,
            )

        # Build metadata dict with known fields and any additional fields
        metadata: dict[str, Any] = {}

        # Extract known fields if present
        if "feedback" in parsed:
            metadata["feedback"] = str(parsed["feedback"])
        if "dimension_scores" in parsed:
            # Preserve dimension_scores as-is (may contain non-numeric values)
            metadata["dimension_scores"] = parsed["dimension_scores"]
        if "actionable_guidance" in parsed:
            metadata["actionable_guidance"] = str(parsed["actionable_guidance"])

        # Preserve any additional fields
        known_fields = {"score", "feedback", "dimension_scores", "actionable_guidance"}
        for key, value in parsed.items():
            if key not in known_fields:
                metadata[key] = value

        return float(score), metadata

    def _extract_json_from_text(self, text: str) -> str:
        """Extract JSON from text that may contain markdown code blocks.

        Minimal implementation - tries direct parse and markdown extraction.
        A more robust implementation will be added per GitHub issue #78.

        Args:
            text: Text that may contain JSON.

        Returns:
            Extracted JSON string, or original text if extraction fails.

        Note:
            Operates as a minimal JSON extractor; robust implementation planned
            per GitHub issue #78.
        """
        # Try parsing the entire text as-is
        try:
            json.loads(text.strip())
            return text.strip()
        except json.JSONDecodeError:
            pass

        # Extract from markdown code blocks (```json ... ``` or ``` ... ```)
        json_block_pattern = r"```(?:json)?\s*\n?(.*?)\n?```"
        matches = re.findall(json_block_pattern, text, re.DOTALL | re.IGNORECASE)
        for match in matches:
            try:
                json.loads(match.strip())
                return match.strip()
            except json.JSONDecodeError:
                continue

        # Try to find JSON object embedded in text (minimal regex for { ... })
        # Look for opening brace and try to find matching closing brace
        brace_start = text.find("{")
        if brace_start != -1:
            # Try to find the matching closing brace
            # NOTE: This algorithm doesn't account for braces within string literals
            # (e.g., JSON with template strings like "instruction": "Use {variable}").
            # This is a minimal implementation; a more robust parser will be added
            # per GitHub issue #78.
            depth = 0
            for i in range(brace_start, len(text)):
                if text[i] == "{":
                    depth += 1
                elif text[i] == "}":
                    depth -= 1
                    if depth == 0:
                        candidate = text[brace_start : i + 1]
                        try:
                            json.loads(candidate)
                            return candidate
                        except json.JSONDecodeError:
                            break

        # Return original text (will fail with clear error message)
        return text

    async def async_score(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
        session_id: str | None = None,
    ) -> tuple[float, dict[str, Any]]:
        """Score an agent output asynchronously using the critic agent.

        Executes the critic agent with formatted input and extracts structured
        score and metadata from the response.

        Args:
            input_text: The original input provided to the agent being evaluated.
            output: The agent's generated output to score.
            expected: Optional expected/reference output for comparison.
            session_id: Optional session ID to share state with main agent
                workflow. If None, creates an isolated session.

        Returns:
            Tuple of (score, metadata) where:
            - score: Float value, conventionally 0.0-1.0
            - metadata: Dict with feedback, dimension_scores,
                actionable_guidance, and any additional fields

        Raises:
            CriticOutputParseError: If critic output is not valid JSON.
            MissingScoreFieldError: If score field missing from output.

        Examples:
            Basic async scoring:

            ```python
            score, metadata = await scorer.async_score(
                input_text="What is Python?",
                output="Python is a programming language.",
            )
            ```

            With session sharing:

            ```python
            score, metadata = await scorer.async_score(
                input_text="...",
                output="...",
                session_id="existing_session_123",
            )
            ```

        Note:
            Orchestrates critic agent execution via AgentExecutor and extracts
            structured output. Creates isolated session unless session_id provided
            for state sharing.
        """
        self._logger.debug(
            "scorer.async_score.start",
            input_preview=input_text[:50] if input_text else "",
            output_preview=output[:50] if output else "",
            has_expected=expected is not None,
            session_id=session_id,
        )

        # Format input for critic
        critic_input = self._format_critic_input(input_text, output, expected)

        # Execute via AgentExecutor
        result = await self._executor.execute_agent(
            agent=self.critic_agent,
            input_text=critic_input,
            existing_session_id=session_id,
        )

        if result.status == ExecutionStatus.FAILED:
            raise ScoringError(
                f"Critic agent execution failed: {result.error_message}",
            )

        final_output = result.extracted_value

        if not final_output:
            raise ScoringError("Critic agent returned empty output")

        # Parse output and extract score
        try:
            score, metadata = self._parse_critic_output(final_output)
        except (CriticOutputParseError, MissingScoreFieldError) as e:
            self._logger.error(
                "scorer.async_score.parse_error",
                error=str(e),
                error_type=type(e).__name__,
            )
            raise

        # Log multi-dimensional scoring context if present
        log_context: dict[str, Any] = {
            "score": score,
            "has_feedback": "feedback" in metadata,
            "has_dimension_scores": "dimension_scores" in metadata,
            "has_actionable_guidance": "actionable_guidance" in metadata,
        }
        if "dimension_scores" in metadata:
            log_context["dimension_count"] = len(metadata["dimension_scores"])

        self._logger.info(
            "scorer.async_score.complete",
            **log_context,
        )

        return score, metadata

    def score(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
    ) -> tuple[float, dict[str, Any]]:
        """Score an agent output synchronously using the critic agent.

        Synchronous wrapper around async_score() using asyncio.run().

        Args:
            input_text: The original input provided to the agent being evaluated.
            output: The agent's generated output to score.
            expected: Optional expected/reference output for comparison.

        Returns:
            Tuple of (score, metadata) where:
            - score: Float value, conventionally 0.0-1.0
            - metadata: Dict with feedback, dimension_scores,
                actionable_guidance, and any additional fields

        Raises:
            CriticOutputParseError: If critic output is not valid JSON.
            MissingScoreFieldError: If score field missing from output.

        Examples:
            Basic sync scoring:

            ```python
            score, metadata = scorer.score(
                input_text="What is 2+2?",
                output="4",
                expected="4",
            )
            ```

        Note:
            Operates synchronously by wrapping async_score() with asyncio.run().
            Uses asyncio.run() to execute async_score(). Prefer async_score()
            for better performance in async contexts.
        """
        return asyncio.run(self.async_score(input_text, output, expected))
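The extract-then-parse pipeline implemented by `_extract_json_from_text` and `_parse_critic_output` above can be sketched as one standalone stdlib function (a simplified sketch, not the library code; it handles only the markdown-fence case):

```python
import json
import re
from typing import Any


def extract_and_score(text: str) -> tuple[float, dict[str, Any]]:
    """Strip a markdown code fence if present, parse the JSON inside,
    and split the required score from the remaining metadata fields."""
    candidate = text.strip()
    match = re.search(r"```(?:json)?\s*\n?(.*?)\n?```", text, re.DOTALL | re.IGNORECASE)
    if match:
        candidate = match.group(1).strip()
    parsed = json.loads(candidate)
    if "score" not in parsed:
        raise KeyError("critic output missing required 'score' field")
    # Everything except the score is preserved as metadata, mirroring
    # the extensibility behavior of _parse_critic_output.
    metadata = {k: v for k, v in parsed.items() if k != "score"}
    return float(parsed["score"]), metadata


raw = '```json\n{"score": 0.8, "feedback": "Clear and accurate"}\n```'
score, metadata = extract_and_score(raw)
print(score, metadata["feedback"])  # 0.8 Clear and accurate
```

The real implementation additionally falls back to brace matching for JSON embedded in free text and raises typed errors (`CriticOutputParseError`, `MissingScoreFieldError`) rather than stdlib exceptions.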

__init__

__init__(
    critic_agent: BaseAgent,
    executor: AgentExecutorProtocol,
    session_service: BaseSessionService | None = None,
    app_name: str = "critic_scorer",
) -> None

Initialize CriticScorer with critic agent.

PARAMETER DESCRIPTION
critic_agent

ADK agent (LlmAgent or workflow agent) configured for evaluation.

TYPE: BaseAgent

executor

AgentExecutorProtocol implementation for unified agent execution. Handles session management and execution, enabling feature parity across all agent types.

TYPE: AgentExecutorProtocol

session_service

Optional session service for state management. If None, creates an InMemorySessionService.

TYPE: BaseSessionService | None DEFAULT: None

app_name

Application name for session identification.

TYPE: str DEFAULT: 'critic_scorer'

RAISES DESCRIPTION
TypeError

If critic_agent is not a BaseAgent instance.

ValueError

If app_name is empty string.

Examples:

Basic setup with executor:

from gepa_adk.adapters.execution.agent_executor import AgentExecutor

executor = AgentExecutor()
scorer = CriticScorer(critic_agent=critic, executor=executor)

With shared session service:

from google.adk.sessions import InMemorySessionService
from gepa_adk.adapters.execution.agent_executor import AgentExecutor

session_service = InMemorySessionService()
executor = AgentExecutor(session_service=session_service)
scorer = CriticScorer(
    critic_agent=critic,
    executor=executor,
    session_service=session_service,
)
Note

Creates logger with scorer context and validates agent type.

Source code in src/gepa_adk/adapters/scoring/critic_scorer.py
def __init__(
    self,
    critic_agent: BaseAgent,
    executor: AgentExecutorProtocol,
    session_service: BaseSessionService | None = None,
    app_name: str = "critic_scorer",
) -> None:
    """Initialize CriticScorer with critic agent.

    Args:
        critic_agent: ADK agent (LlmAgent or workflow agent) configured
            for evaluation.
        executor: AgentExecutorProtocol implementation for unified agent
            execution. Handles session management and execution, enabling
            feature parity across all agent types.
        session_service: Optional session service for state management.
            If None, creates an InMemorySessionService.
        app_name: Application name for session identification.

    Raises:
        TypeError: If critic_agent is not a BaseAgent instance.
        ValueError: If app_name is empty string.

    Examples:
        Basic setup with executor:

        ```python
        from gepa_adk.adapters.execution.agent_executor import AgentExecutor

        executor = AgentExecutor()
        scorer = CriticScorer(critic_agent=critic, executor=executor)
        ```

        With shared session service:

        ```python
        from google.adk.sessions import InMemorySessionService
        from gepa_adk.adapters.execution.agent_executor import AgentExecutor

        session_service = InMemorySessionService()
        executor = AgentExecutor(session_service=session_service)
        scorer = CriticScorer(
            critic_agent=critic,
            executor=executor,
            session_service=session_service,
        )
        ```

    Note:
        Creates logger with scorer context and validates agent type.
    """
    if not isinstance(critic_agent, BaseAgent):
        raise TypeError(f"critic_agent must be BaseAgent, got {type(critic_agent)}")

    if not app_name or not app_name.strip():
        raise ValueError("app_name cannot be empty")

    self.critic_agent = critic_agent
    self._session_service = session_service or InMemorySessionService()
    self._app_name = app_name.strip()
    self._executor = executor

    # Bind logger with scorer context
    self._logger = logger.bind(
        scorer="CriticScorer",
        agent_name=self.critic_agent.name,
        app_name=self._app_name,
        uses_executor=True,  # Always true since executor is required
    )

    self._logger.info("scorer.initialized")

async_score async

async_score(
    input_text: str,
    output: str,
    expected: str | None = None,
    session_id: str | None = None,
) -> tuple[float, dict[str, Any]]

Score an agent output asynchronously using the critic agent.

Executes the critic agent with formatted input and extracts structured score and metadata from the response.

PARAMETER DESCRIPTION
input_text

The original input provided to the agent being evaluated.

TYPE: str

output

The agent's generated output to score.

TYPE: str

expected

Optional expected/reference output for comparison.

TYPE: str | None DEFAULT: None

session_id

Optional session ID to share state with main agent workflow. If None, creates an isolated session.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
tuple[float, dict[str, Any]]

Tuple of (score, metadata) where:

  • score: Float value, conventionally 0.0-1.0
  • metadata: Dict with feedback, dimension_scores, actionable_guidance, and any additional fields

RAISES DESCRIPTION
CriticOutputParseError

If critic output is not valid JSON.

MissingScoreFieldError

If score field missing from output.

Examples:

Basic async scoring:

score, metadata = await scorer.async_score(
    input_text="What is Python?",
    output="Python is a programming language.",
)

With session sharing:

score, metadata = await scorer.async_score(
    input_text="...",
    output="...",
    session_id="existing_session_123",
)
Note

Orchestrates critic agent execution via AgentExecutor and extracts structured output. Creates isolated session unless session_id provided for state sharing.

Source code in src/gepa_adk/adapters/scoring/critic_scorer.py
async def async_score(
    self,
    input_text: str,
    output: str,
    expected: str | None = None,
    session_id: str | None = None,
) -> tuple[float, dict[str, Any]]:
    """Score an agent output asynchronously using the critic agent.

    Executes the critic agent with formatted input and extracts structured
    score and metadata from the response.

    Args:
        input_text: The original input provided to the agent being evaluated.
        output: The agent's generated output to score.
        expected: Optional expected/reference output for comparison.
        session_id: Optional session ID to share state with main agent
            workflow. If None, creates an isolated session.

    Returns:
        Tuple of (score, metadata) where:
        - score: Float value, conventionally 0.0-1.0
        - metadata: Dict with feedback, dimension_scores,
            actionable_guidance, and any additional fields

    Raises:
        CriticOutputParseError: If critic output is not valid JSON.
        MissingScoreFieldError: If score field missing from output.

    Examples:
        Basic async scoring:

        ```python
        score, metadata = await scorer.async_score(
            input_text="What is Python?",
            output="Python is a programming language.",
        )
        ```

        With session sharing:

        ```python
        score, metadata = await scorer.async_score(
            input_text="...",
            output="...",
            session_id="existing_session_123",
        )
        ```

    Note:
        Orchestrates critic agent execution via AgentExecutor and extracts
        structured output. Creates isolated session unless session_id provided
        for state sharing.
    """
    self._logger.debug(
        "scorer.async_score.start",
        input_preview=input_text[:50] if input_text else "",
        output_preview=output[:50] if output else "",
        has_expected=expected is not None,
        session_id=session_id,
    )

    # Format input for critic
    critic_input = self._format_critic_input(input_text, output, expected)

    # Execute via AgentExecutor
    result = await self._executor.execute_agent(
        agent=self.critic_agent,
        input_text=critic_input,
        existing_session_id=session_id,
    )

    if result.status == ExecutionStatus.FAILED:
        raise ScoringError(
            f"Critic agent execution failed: {result.error_message}",
        )

    final_output = result.extracted_value

    if not final_output:
        raise ScoringError("Critic agent returned empty output")

    # Parse output and extract score
    try:
        score, metadata = self._parse_critic_output(final_output)
    except (CriticOutputParseError, MissingScoreFieldError) as e:
        self._logger.error(
            "scorer.async_score.parse_error",
            error=str(e),
            error_type=type(e).__name__,
        )
        raise

    # Log multi-dimensional scoring context if present
    log_context: dict[str, Any] = {
        "score": score,
        "has_feedback": "feedback" in metadata,
        "has_dimension_scores": "dimension_scores" in metadata,
        "has_actionable_guidance": "actionable_guidance" in metadata,
    }
    if "dimension_scores" in metadata:
        log_context["dimension_count"] = len(metadata["dimension_scores"])

    self._logger.info(
        "scorer.async_score.complete",
        **log_context,
    )

    return score, metadata

score

score(
    input_text: str,
    output: str,
    expected: str | None = None,
) -> tuple[float, dict[str, Any]]

Score an agent output synchronously using the critic agent.

Synchronous wrapper around async_score() using asyncio.run().

PARAMETER DESCRIPTION
input_text

The original input provided to the agent being evaluated.

TYPE: str

output

The agent's generated output to score.

TYPE: str

expected

Optional expected/reference output for comparison.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
tuple[float, dict[str, Any]]

Tuple of (score, metadata) where:
  • score: Float value, conventionally 0.0-1.0
  • metadata: Dict with feedback, dimension_scores, actionable_guidance, and any additional fields
RAISES DESCRIPTION
CriticOutputParseError

If critic output is not valid JSON.

MissingScoreFieldError

If score field missing from output.

Examples:

Basic sync scoring:

score, metadata = scorer.score(
    input_text="What is 2+2?",
    output="4",
    expected="4",
)
Note

Synchronous wrapper that runs async_score() via asyncio.run(). Prefer async_score() for better performance in async contexts; asyncio.run() raises RuntimeError when called from an already-running event loop.

Source code in src/gepa_adk/adapters/scoring/critic_scorer.py
def score(
    self,
    input_text: str,
    output: str,
    expected: str | None = None,
) -> tuple[float, dict[str, Any]]:
    """Score an agent output synchronously using the critic agent.

    Synchronous wrapper around async_score() using asyncio.run().

    Args:
        input_text: The original input provided to the agent being evaluated.
        output: The agent's generated output to score.
        expected: Optional expected/reference output for comparison.

    Returns:
        Tuple of (score, metadata) where:
        - score: Float value, conventionally 0.0-1.0
        - metadata: Dict with feedback, dimension_scores,
            actionable_guidance, and any additional fields

    Raises:
        CriticOutputParseError: If critic output is not valid JSON.
        MissingScoreFieldError: If score field missing from output.

    Examples:
        Basic sync scoring:

        ```python
        score, metadata = scorer.score(
            input_text="What is 2+2?",
            output="4",
            expected="4",
        )
        ```

    Note:
        Synchronous wrapper that runs async_score() via asyncio.run().
        Prefer async_score() for better performance in async contexts;
        asyncio.run() raises RuntimeError when called from an
        already-running event loop.
    """
    return asyncio.run(self.async_score(input_text, output, expected))

SimpleCriticOutput

Bases: BaseModel


              flowchart TD
              gepa_adk.adapters.scoring.SimpleCriticOutput[SimpleCriticOutput]

              

              click gepa_adk.adapters.scoring.SimpleCriticOutput href "" "gepa_adk.adapters.scoring.SimpleCriticOutput"
            

KISS schema for basic critic feedback.

This is the minimal schema for critic agents that only need to provide a score and text feedback. Use this for straightforward evaluation tasks where dimension breakdowns are not needed.

ATTRIBUTE DESCRIPTION
score

Score value between 0.0 and 1.0 (required).

TYPE: float

feedback

Human-readable feedback text (required).

TYPE: str

Examples:

Simple critic output:

{
    "score": 0.75,
    "feedback": "Good response but could be more concise."
}

Using with LlmAgent:

from google.adk.agents import LlmAgent
from gepa_adk.adapters.scoring.critic_scorer import SimpleCriticOutput

critic = LlmAgent(
    name="simple_critic",
    model="gemini-2.5-flash",
    instruction=SIMPLE_CRITIC_INSTRUCTION,
    output_schema=SimpleCriticOutput,
)
Note

Applies to basic evaluation tasks where only a score and feedback are needed. For more detailed evaluations with dimension scores, use CriticOutput instead.

See Also

gepa_adk.adapters.scoring.critic_scorer.CriticOutput: Advanced schema with dimension scores and guidance.

Source code in src/gepa_adk/adapters/scoring/critic_scorer.py
class SimpleCriticOutput(BaseModel):
    """KISS schema for basic critic feedback.

    This is the minimal schema for critic agents that only need to provide
    a score and text feedback. Use this for straightforward evaluation tasks
    where dimension breakdowns are not needed.

    Attributes:
        score (float): Score value between 0.0 and 1.0 (required).
        feedback (str): Human-readable feedback text (required).

    Examples:
        Simple critic output:

        ```json
        {
            "score": 0.75,
            "feedback": "Good response but could be more concise."
        }
        ```

        Using with LlmAgent:

        ```python
        from google.adk.agents import LlmAgent
        from gepa_adk.adapters.scoring.critic_scorer import SimpleCriticOutput

        critic = LlmAgent(
            name="simple_critic",
            model="gemini-2.5-flash",
            instruction=SIMPLE_CRITIC_INSTRUCTION,
            output_schema=SimpleCriticOutput,
        )
        ```

    Note:
        Applies to basic evaluation tasks where only a score and feedback
        are needed. For more detailed evaluations with dimension scores,
        use CriticOutput instead.

    See Also:
        [gepa_adk.adapters.scoring.critic_scorer.CriticOutput][]:
            Advanced schema with dimension scores and guidance.
    """

    score: float = Field(
        ...,
        ge=0.0,
        le=1.0,
        description="Score from 0.0 to 1.0",
    )
    feedback: str = Field(
        ...,
        description="Human-readable feedback explaining the score",
    )

create_critic

create_critic(
    name: str, *, model: str | None = None
) -> LlmAgent

Create a pre-configured critic agent by preset name.

PARAMETER DESCRIPTION
name

Preset name. Must be a key in _PRESET_INSTRUCTIONS.

TYPE: str

model

Optional model override. When None, ADK uses its default.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
LlmAgent

Configured LlmAgent with CriticOutput schema and preset instruction.

RAISES DESCRIPTION
ConfigurationError

If name is not a valid preset.

Source code in src/gepa_adk/adapters/scoring/critic_scorer.py
def create_critic(name: str, *, model: str | None = None) -> LlmAgent:
    """Create a pre-configured critic agent by preset name.

    Args:
        name: Preset name. Must be a key in ``_PRESET_INSTRUCTIONS``.
        model: Optional model override. When None, ADK uses its default.

    Returns:
        Configured LlmAgent with CriticOutput schema and preset instruction.

    Raises:
        ConfigurationError: If name is not a valid preset.
    """
    if name not in _PRESET_INSTRUCTIONS:
        valid_presets = ", ".join(sorted(_PRESET_INSTRUCTIONS))
        raise ConfigurationError(
            f"Unknown critic preset '{name}'. Valid presets: {valid_presets}",
            constraint=f"Must be one of: {valid_presets}",
            value=name,
            field="name",
        )

    model_kwargs: dict[str, Any] = {}
    if model is not None:
        model_kwargs["model"] = model

    return LlmAgent(
        name=f"{name}_critic",
        instruction=_PRESET_INSTRUCTIONS[name],
        output_schema=CriticOutput,
        **model_kwargs,
    )

normalize_feedback

normalize_feedback(
    score: float, metadata: dict[str, Any] | None
) -> dict[str, Any]

Normalize critic feedback to consistent trial format.

Converts both simple and advanced critic outputs to a standardized format for use in trial records. This enables the reflection agent to receive consistent feedback regardless of which critic schema was used.

PARAMETER DESCRIPTION
score

The numeric score from the critic (0.0-1.0).

TYPE: float

metadata

Optional metadata dict from critic output. May contain:
  • feedback (str): Simple feedback text
  • dimension_scores (dict): Per-dimension scores
  • actionable_guidance (str): Improvement suggestions
  • Any additional fields from critic output

TYPE: dict[str, Any] | None

RETURNS DESCRIPTION
dict[str, Any]

Normalized feedback dict with structure:

{
    "score": 0.75,
    "feedback_text": "Main feedback message",
    "dimension_scores": {...},  # Optional
    "actionable_guidance": "...",  # Optional
}

Examples:

Normalize simple feedback:

normalized = normalize_feedback(0.8, {"feedback": "Good job"})
# {"score": 0.8, "feedback_text": "Good job"}

Normalize advanced feedback:

normalized = normalize_feedback(
    0.6,
    {
        "feedback": "Needs work",
        "dimension_scores": {"clarity": 0.5},
        "actionable_guidance": "Add examples",
    },
)
# {
#     "score": 0.6,
#     "feedback_text": "Needs work",
#     "dimension_scores": {"clarity": 0.5},
#     "actionable_guidance": "Add examples",
# }

Handle missing feedback:

normalized = normalize_feedback(0.5, None)
# {"score": 0.5, "feedback_text": ""}
Note

Supports both SimpleCriticOutput and CriticOutput schemas for flexible critic integration. Extracts the "feedback" field and renames it to "feedback_text" for consistent trial structure. Additional fields like dimension_scores are preserved when present.

Source code in src/gepa_adk/adapters/scoring/critic_scorer.py
def normalize_feedback(
    score: float,
    metadata: dict[str, Any] | None,
) -> dict[str, Any]:
    """Normalize critic feedback to consistent trial format.

    Converts both simple and advanced critic outputs to a standardized
    format for use in trial records. This enables the reflection agent
    to receive consistent feedback regardless of which critic schema
    was used.

    Args:
        score: The numeric score from the critic (0.0-1.0).
        metadata: Optional metadata dict from critic output. May contain:
            - feedback (str): Simple feedback text
            - dimension_scores (dict): Per-dimension scores
            - actionable_guidance (str): Improvement suggestions
            - Any additional fields from critic output

    Returns:
        Normalized feedback dict with structure:
        ```python
        {
            "score": 0.75,
            "feedback_text": "Main feedback message",
            "dimension_scores": {...},  # Optional
            "actionable_guidance": "...",  # Optional
        }
        ```

    Examples:
        Normalize simple feedback:

        ```python
        normalized = normalize_feedback(0.8, {"feedback": "Good job"})
        # {"score": 0.8, "feedback_text": "Good job"}
        ```

        Normalize advanced feedback:

        ```python
        normalized = normalize_feedback(
            0.6,
            {
                "feedback": "Needs work",
                "dimension_scores": {"clarity": 0.5},
                "actionable_guidance": "Add examples",
            },
        )
        # {
        #     "score": 0.6,
        #     "feedback_text": "Needs work",
        #     "dimension_scores": {"clarity": 0.5},
        #     "actionable_guidance": "Add examples",
        # }
        ```

        Handle missing feedback:

        ```python
        normalized = normalize_feedback(0.5, None)
        # {"score": 0.5, "feedback_text": ""}
        ```

    Note:
        Supports both SimpleCriticOutput and CriticOutput schemas for flexible
        critic integration. Extracts the "feedback" field and renames it to
        "feedback_text" for consistent trial structure. Additional fields
        like dimension_scores are preserved when present.
    """
    result: dict[str, Any] = {"score": score}

    if metadata is None:
        result["feedback_text"] = ""
        return result

    # Extract feedback text - handle both "feedback" and "feedback_text" keys
    feedback_text = metadata.get("feedback_text") or metadata.get("feedback") or ""
    if isinstance(feedback_text, str) and feedback_text.strip():
        result["feedback_text"] = feedback_text.strip()
    else:
        result["feedback_text"] = ""

    # Preserve dimension_scores if present
    dimension_scores = metadata.get("dimension_scores")
    if dimension_scores and isinstance(dimension_scores, dict):
        result["dimension_scores"] = dimension_scores

    # Preserve actionable_guidance if present
    actionable_guidance = metadata.get("actionable_guidance")
    if actionable_guidance and isinstance(actionable_guidance, str):
        guidance_str = actionable_guidance.strip()
        if guidance_str:
            result["actionable_guidance"] = guidance_str

    return result