
Critic scorer

critic_scorer

CriticScorer adapter for structured scoring with ADK critic agents.

This module provides the CriticScorer implementation that wraps ADK critic agents to provide structured scoring with feedback, dimension scores, and actionable guidance. The scorer implements the Scorer protocol, enabling integration with gepa-adk's evaluation and evolution workflows.

Also provides KISS and advanced critic output schemas with generic instruction templates for rapid critic agent development.

ATTRIBUTE DESCRIPTION
CriticScorer

Adapter that wraps ADK critic agents for scoring.

TYPE: class

SimpleCriticOutput

KISS schema with just score + feedback.

TYPE: class

CriticOutput

Advanced schema with dimensions and guidance.

TYPE: class

SIMPLE_CRITIC_INSTRUCTION

Generic instruction for simple critics.

TYPE: str

ADVANCED_CRITIC_INSTRUCTION

Generic instruction for advanced critics.

TYPE: str

normalize_feedback

Normalizes critic output to trial format.

TYPE: function

Examples:

Basic usage with LlmAgent critic:

from google.adk.agents import LlmAgent
from gepa_adk.adapters.agent_executor import AgentExecutor
from gepa_adk.adapters.critic_scorer import CriticScorer, CriticOutput

critic = LlmAgent(
    name="quality_critic",
    model="gemini-2.5-flash",
    instruction="Evaluate response quality...",
    output_schema=CriticOutput,
)

executor = AgentExecutor()
scorer = CriticScorer(critic_agent=critic, executor=executor)
score, metadata = await scorer.async_score(
    input_text="What is Python?",
    output="Python is a programming language.",
)
Note

This module wraps ADK critic agents to provide structured scoring. When using LlmAgent with output_schema, the agent can ONLY reply and CANNOT use any tools (ADK constraint). For evaluations requiring tool usage, use a SequentialAgent with tool-enabled agents before the output-constrained scorer.
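
A minimal sketch of that pattern, assuming a tool-enabled gathering step ahead of the output-constrained judge (the agent names and the word_count tool are illustrative, not part of this module):

from google.adk.agents import LlmAgent, SequentialAgent
from gepa_adk.adapters.critic_scorer import CriticOutput


def word_count(text: str) -> dict:
    """Toy tool: report the word count of the text under review."""
    return {"word_count": len(text.split())}


# Tool-enabled agent runs first and may call tools freely.
evidence_agent = LlmAgent(
    name="evidence_gatherer",
    model="gemini-2.5-flash",
    instruction="Gather facts relevant to the response being evaluated.",
    tools=[word_count],
)

# Output-constrained judge: with output_schema set it can only reply.
judge_agent = LlmAgent(
    name="structured_judge",
    model="gemini-2.5-flash",
    instruction="Score the response using the gathered context.",
    output_schema=CriticOutput,
)

critic = SequentialAgent(
    name="tool_assisted_critic",
    sub_agents=[evidence_agent, judge_agent],
)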

SimpleCriticOutput

Bases: BaseModel



KISS schema for basic critic feedback.

This is the minimal schema for critic agents that only need to provide a score and text feedback. Use this for straightforward evaluation tasks where dimension breakdowns are not needed.

ATTRIBUTE DESCRIPTION
score

Score value between 0.0 and 1.0 (required).

TYPE: float

feedback

Human-readable feedback text (required).

TYPE: str

Examples:

Simple critic output:

{
    "score": 0.75,
    "feedback": "Good response but could be more concise."
}

Using with LlmAgent:

from google.adk.agents import LlmAgent
from gepa_adk.adapters.critic_scorer import SIMPLE_CRITIC_INSTRUCTION, SimpleCriticOutput

critic = LlmAgent(
    name="simple_critic",
    model="gemini-2.5-flash",
    instruction=SIMPLE_CRITIC_INSTRUCTION,
    output_schema=SimpleCriticOutput,
)
Note

Applies to basic evaluation tasks where only a score and feedback are needed. For more detailed evaluations with dimension scores, use CriticOutput instead.

See Also

CriticOutput: Advanced schema with dimension scores and guidance.

Source code in src/gepa_adk/adapters/critic_scorer.py
class SimpleCriticOutput(BaseModel):
    """KISS schema for basic critic feedback.

    This is the minimal schema for critic agents that only need to provide
    a score and text feedback. Use this for straightforward evaluation tasks
    where dimension breakdowns are not needed.

    Attributes:
        score: Score value between 0.0 and 1.0 (required).
        feedback: Human-readable feedback text (required).

    Examples:
        Simple critic output:

        ```json
        {
            "score": 0.75,
            "feedback": "Good response but could be more concise."
        }
        ```

        Using with LlmAgent:

        ```python
        from google.adk.agents import LlmAgent
        from gepa_adk.adapters.critic_scorer import SimpleCriticOutput

        critic = LlmAgent(
            name="simple_critic",
            model="gemini-2.5-flash",
            instruction=SIMPLE_CRITIC_INSTRUCTION,
            output_schema=SimpleCriticOutput,
        )
        ```

    Note:
        Applies to basic evaluation tasks where only a score and feedback
        are needed. For more detailed evaluations with dimension scores,
        use CriticOutput instead.

    See Also:
        CriticOutput: Advanced schema with dimension scores and guidance.
    """

    score: float = Field(
        ...,
        ge=0.0,
        le=1.0,
        description="Score from 0.0 to 1.0",
    )
    feedback: str = Field(
        ...,
        description="Human-readable feedback explaining the score",
    )

CriticOutput

Bases: BaseModel



Advanced schema for structured critic feedback with dimensions.

This schema defines the expected JSON structure that critic agents should return when configured with output_schema. The score field is required, while other fields are optional and will be preserved in metadata.

ATTRIBUTE DESCRIPTION
score

Score value between 0.0 and 1.0 (required).

TYPE: float

feedback

Human-readable feedback text (optional).

TYPE: str

dimension_scores

Per-dimension evaluation scores (optional).

TYPE: dict[str, float]

actionable_guidance

Specific improvement suggestions (optional).

TYPE: str

Examples:

Advanced critic output:

{
    "score": 0.75,
    "feedback": "Good response but could be more concise",
    "dimension_scores": {
        "accuracy": 0.9,
        "clarity": 0.6,
        "completeness": 0.8
    },
    "actionable_guidance": "Reduce response length by 30%"
}
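
Since CriticOutput is a Pydantic model, the same JSON can also be validated directly; a minimal sketch, assuming Pydantic v2 (model_validate_json):

from gepa_adk.adapters.critic_scorer import CriticOutput

raw = '{"score": 0.75, "feedback": "Good response but could be more concise"}'
parsed = CriticOutput.model_validate_json(raw)

assert parsed.score == 0.75
assert parsed.dimension_scores == {}  # optional fields default to empty values
# A score outside [0.0, 1.0] would fail the ge/le constraints with a ValidationError.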
Note

All critic agents using this schema must return structured JSON. When this schema is used as output_schema on an LlmAgent, the agent can ONLY reply and CANNOT use any tools. This is acceptable for critic agents focused on scoring.

See Also

SimpleCriticOutput: KISS schema with just score + feedback.

Source code in src/gepa_adk/adapters/critic_scorer.py
class CriticOutput(BaseModel):
    """Advanced schema for structured critic feedback with dimensions.

    This schema defines the expected JSON structure that critic agents
    should return when configured with output_schema. The score field is
    required, while other fields are optional and will be preserved in
    metadata.

    Attributes:
        score: Score value between 0.0 and 1.0 (required).
        feedback: Human-readable feedback text (optional).
        dimension_scores: Per-dimension evaluation scores (optional).
        actionable_guidance: Specific improvement suggestions (optional).

    Examples:
        Advanced critic output:

        ```json
        {
            "score": 0.75,
            "feedback": "Good response but could be more concise",
            "dimension_scores": {
                "accuracy": 0.9,
                "clarity": 0.6,
                "completeness": 0.8
            },
            "actionable_guidance": "Reduce response length by 30%"
        }
        ```

    Note:
        All critic agents using this schema must return structured JSON.
        When this schema is used as output_schema on an LlmAgent, the
        agent can ONLY reply and CANNOT use any tools. This is acceptable
        for critic agents focused on scoring.

    See Also:
        SimpleCriticOutput: KISS schema with just score + feedback.
    """

    score: float = Field(
        ...,
        ge=0.0,
        le=1.0,
        description="Score from 0.0 to 1.0",
    )
    feedback: str = Field(
        default="",
        description="Human-readable feedback",
    )
    dimension_scores: dict[str, float] = Field(
        default_factory=dict,
        description="Per-dimension scores",
    )
    actionable_guidance: str = Field(
        default="",
        description="Improvement suggestions",
    )

CriticScorer

Adapter that wraps ADK critic agents to provide structured scoring.

CriticScorer implements the Scorer protocol, enabling integration with gepa-adk's evaluation and evolution workflows. It executes ADK critic agents (LlmAgent, SequentialAgent, etc.) and extracts structured scores with metadata from their outputs.

ATTRIBUTE DESCRIPTION
critic_agent

ADK agent configured for evaluation.

TYPE: BaseAgent

_session_service

Session service for state management.

TYPE: BaseSessionService

_app_name

Application name for session identification.

TYPE: str

_logger

Bound logger with scorer context.

TYPE: BoundLogger

Examples:

Basic usage:

from google.adk.agents import LlmAgent
from gepa_adk.adapters.critic_scorer import CriticScorer, CriticOutput
from gepa_adk.adapters.agent_executor import AgentExecutor

critic = LlmAgent(
    name="quality_critic",
    model="gemini-2.5-flash",
    instruction="Evaluate response quality...",
    output_schema=CriticOutput,
)

executor = AgentExecutor()
scorer = CriticScorer(critic_agent=critic, executor=executor)
score, metadata = await scorer.async_score(
    input_text="What is Python?",
    output="Python is a programming language.",
)
Note

Adapter wraps ADK critic agents to provide structured scoring. Implements Scorer protocol for compatibility with evolution engine. Creates isolated sessions per scoring call unless session_id provided.
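
A hedged sketch of defensive scoring around the documented failure modes (the import path for the exceptions and the 0.0 fallback are assumptions):

from gepa_adk.adapters.critic_scorer import (
    CriticOutputParseError,
    CriticScorer,
    MissingScoreFieldError,
)


async def safe_score(scorer: CriticScorer, input_text: str, output: str) -> float:
    """Return the critic score, falling back to 0.0 on malformed critic output."""
    try:
        score, _metadata = await scorer.async_score(input_text=input_text, output=output)
        return score
    except (CriticOutputParseError, MissingScoreFieldError):
        # Malformed critic output: treat as lowest score instead of failing the run.
        return 0.0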

Source code in src/gepa_adk/adapters/critic_scorer.py
class CriticScorer:
    """Adapter that wraps ADK critic agents to provide structured scoring.

    CriticScorer implements the Scorer protocol, enabling integration with
    gepa-adk's evaluation and evolution workflows. It executes ADK critic
    agents (LlmAgent, SequentialAgent, etc.) and extracts structured scores
    with metadata from their outputs.

    Attributes:
        critic_agent (BaseAgent): ADK agent configured for evaluation.
        _session_service (BaseSessionService): Session service for state
            management.
        _app_name (str): Application name for session identification.
        _logger (structlog.BoundLogger): Bound logger with scorer context.

    Examples:
        Basic usage:

        ```python
        from google.adk.agents import LlmAgent
        from gepa_adk.adapters.critic_scorer import CriticScorer, CriticOutput
        from gepa_adk.adapters.agent_executor import AgentExecutor

        critic = LlmAgent(
            name="quality_critic",
            model="gemini-2.5-flash",
            instruction="Evaluate response quality...",
            output_schema=CriticOutput,
        )

        executor = AgentExecutor()
        scorer = CriticScorer(critic_agent=critic, executor=executor)
        score, metadata = await scorer.async_score(
            input_text="What is Python?",
            output="Python is a programming language.",
        )
        ```

    Note:
        Adapter wraps ADK critic agents to provide structured scoring.
        Implements Scorer protocol for compatibility with evolution engine.
        Creates isolated sessions per scoring call unless session_id provided.
    """

    def __init__(
        self,
        critic_agent: BaseAgent,
        executor: AgentExecutorProtocol,
        session_service: BaseSessionService | None = None,
        app_name: str = "critic_scorer",
    ) -> None:
        """Initialize CriticScorer with critic agent.

        Args:
            critic_agent: ADK agent (LlmAgent or workflow agent) configured
                for evaluation.
            executor: AgentExecutorProtocol implementation for unified agent
                execution. Handles session management and execution, enabling
                feature parity across all agent types.
            session_service: Optional session service for state management.
                If None, creates an InMemorySessionService.
            app_name: Application name for session identification.

        Raises:
            TypeError: If critic_agent is not a BaseAgent instance.
            ValueError: If app_name is empty string.

        Examples:
            Basic setup with executor:

            ```python
            from gepa_adk.adapters.agent_executor import AgentExecutor

            executor = AgentExecutor()
            scorer = CriticScorer(critic_agent=critic, executor=executor)
            ```

            With shared session service:

            ```python
            from google.adk.sessions import InMemorySessionService
            from gepa_adk.adapters.agent_executor import AgentExecutor

            session_service = InMemorySessionService()
            executor = AgentExecutor(session_service=session_service)
            scorer = CriticScorer(
                critic_agent=critic,
                executor=executor,
                session_service=session_service,
            )
            ```

        Note:
            Creates logger with scorer context and validates agent type.
        """
        if not isinstance(critic_agent, BaseAgent):
            raise TypeError(f"critic_agent must be BaseAgent, got {type(critic_agent)}")

        if not app_name or not app_name.strip():
            raise ValueError("app_name cannot be empty")

        self.critic_agent = critic_agent
        self._session_service = session_service or InMemorySessionService()
        self._app_name = app_name.strip()
        self._executor = executor

        # Bind logger with scorer context
        self._logger = logger.bind(
            scorer="CriticScorer",
            agent_name=self.critic_agent.name,
            app_name=self._app_name,
            uses_executor=True,  # Always true since executor is required
        )

        self._logger.info("scorer.initialized")

    def _format_critic_input(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
    ) -> str:
        """Format input for critic agent evaluation.

        Builds a prompt that presents the input query, agent output, and
        optionally the expected output for the critic to evaluate.

        Args:
            input_text: The original input provided to the agent being evaluated.
            output: The agent's generated output to score.
            expected: Optional expected/reference output for comparison.

        Returns:
            Formatted prompt string for the critic agent.

        Examples:
            Basic formatting:

            ```python
            prompt = scorer._format_critic_input(
                input_text="What is 2+2?",
                output="4",
                expected="4",
            )
            ```

        Note:
            Organizes input for critic evaluation with clearly labeled sections.
            Format is designed to give critic context for evaluation.
            Expected output is included only if provided.
        """
        parts = [
            "Input Query:",
            input_text,
            "",
            "Agent Output:",
            output,
        ]

        if expected is not None:
            parts.extend(
                [
                    "",
                    "Expected Output:",
                    expected,
                ]
            )

        parts.append("")
        parts.append(
            "Please evaluate the agent output and provide a score with feedback."
        )

        return "\n".join(parts)

    def _parse_critic_output(self, output_text: str) -> tuple[float, dict[str, Any]]:
        """Parse critic agent output and extract score with metadata.

        Parses the critic's output text as JSON and extracts the score field
        along with optional metadata (feedback, dimension_scores,
        actionable_guidance, and any additional fields).

        Args:
            output_text: Raw text output from critic agent.

        Returns:
            Tuple of (score, metadata) where:
            - score: Float value extracted from output
            - metadata: Dict containing feedback, dimension_scores,
                actionable_guidance, and any additional fields

        Raises:
            CriticOutputParseError: If output cannot be parsed as JSON.
            MissingScoreFieldError: If parsed JSON lacks required score field.

        Examples:
            Parse structured output:

            ```python
            output = '{"score": 0.75, "feedback": "Good", "dimension_scores": {"accuracy": 0.9}}'
            score, metadata = scorer._parse_critic_output(output)
            assert score == 0.75
            assert metadata["feedback"] == "Good"
            ```

        Note:
            Obtains score and metadata from critic JSON output with validation.
            Preserves all fields from parsed JSON in metadata, not just
            the known CriticOutput schema fields. This allows for extensibility.
        """
        # Parse JSON output
        try:
            parsed = json.loads(output_text)
        except json.JSONDecodeError as e:
            raise CriticOutputParseError(
                f"Critic output is not valid JSON: {e}",
                raw_output=output_text,
                parse_error=str(e),
                cause=e,
            ) from e

        # Validate parsed output is a dict
        if not isinstance(parsed, dict):
            raise CriticOutputParseError(
                f"Critic output must be a JSON object, got {type(parsed).__name__}",
                raw_output=output_text,
                parse_error="Not a JSON object",
            )

        # Extract required score field
        if "score" not in parsed:
            raise MissingScoreFieldError(
                "Critic output missing required 'score' field",
                parsed_output=parsed,
            )

        score = parsed["score"]
        if not isinstance(score, (int, float)):
            raise MissingScoreFieldError(
                f"Score field must be numeric, got {type(score).__name__}",
                parsed_output=parsed,
            )

        # Build metadata dict with known fields and any additional fields
        metadata: dict[str, Any] = {}

        # Extract known fields if present
        if "feedback" in parsed:
            metadata["feedback"] = str(parsed["feedback"])
        if "dimension_scores" in parsed:
            # Preserve dimension_scores as-is (may contain non-numeric values)
            metadata["dimension_scores"] = parsed["dimension_scores"]
        if "actionable_guidance" in parsed:
            metadata["actionable_guidance"] = str(parsed["actionable_guidance"])

        # Preserve any additional fields
        known_fields = {"score", "feedback", "dimension_scores", "actionable_guidance"}
        for key, value in parsed.items():
            if key not in known_fields:
                metadata[key] = value

        return float(score), metadata

    def _extract_json_from_text(self, text: str) -> str:
        """Extract JSON from text that may contain markdown code blocks.

        Minimal implementation - tries direct parse and markdown extraction.
        A more robust implementation will be added per GitHub issue #78.

        Args:
            text: Text that may contain JSON.

        Returns:
            Extracted JSON string, or original text if extraction fails.

        Note:
            Operates as a minimal JSON extractor; robust implementation planned
            per GitHub issue #78.
        """
        # Try parsing the entire text as-is
        try:
            json.loads(text.strip())
            return text.strip()
        except json.JSONDecodeError:
            pass

        # Extract from markdown code blocks (```json ... ``` or ``` ... ```)
        json_block_pattern = r"```(?:json)?\s*\n?(.*?)\n?```"
        matches = re.findall(json_block_pattern, text, re.DOTALL | re.IGNORECASE)
        for match in matches:
            try:
                json.loads(match.strip())
                return match.strip()
            except json.JSONDecodeError:
                continue

        # Try to find JSON object embedded in text (minimal regex for { ... })
        # Look for opening brace and try to find matching closing brace
        brace_start = text.find("{")
        if brace_start != -1:
            # Try to find the matching closing brace
            # NOTE: This algorithm doesn't account for braces within string literals
            # (e.g., JSON with template strings like "instruction": "Use {variable}").
            # This is a minimal implementation; a more robust parser will be added
            # per GitHub issue #78.
            depth = 0
            for i in range(brace_start, len(text)):
                if text[i] == "{":
                    depth += 1
                elif text[i] == "}":
                    depth -= 1
                    if depth == 0:
                        candidate = text[brace_start : i + 1]
                        try:
                            json.loads(candidate)
                            return candidate
                        except json.JSONDecodeError:
                            break

        # Return original text (will fail with clear error message)
        return text

    async def async_score(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
        session_id: str | None = None,
    ) -> tuple[float, dict[str, Any]]:
        """Score an agent output asynchronously using the critic agent.

        Executes the critic agent with formatted input and extracts structured
        score and metadata from the response.

        Args:
            input_text: The original input provided to the agent being evaluated.
            output: The agent's generated output to score.
            expected: Optional expected/reference output for comparison.
            session_id: Optional session ID to share state with main agent
                workflow. If None, creates an isolated session.

        Returns:
            Tuple of (score, metadata) where:
            - score: Float value, conventionally 0.0-1.0
            - metadata: Dict with feedback, dimension_scores,
                actionable_guidance, and any additional fields

        Raises:
            CriticOutputParseError: If critic output is not valid JSON.
            MissingScoreFieldError: If score field missing from output.

        Examples:
            Basic async scoring:

            ```python
            score, metadata = await scorer.async_score(
                input_text="What is Python?",
                output="Python is a programming language.",
            )
            ```

            With session sharing:

            ```python
            score, metadata = await scorer.async_score(
                input_text="...",
                output="...",
                session_id="existing_session_123",
            )
            ```

        Note:
            Orchestrates critic agent execution via AgentExecutor and extracts
            structured output. Creates isolated session unless session_id provided
            for state sharing.
        """
        self._logger.debug(
            "scorer.async_score.start",
            input_preview=input_text[:50] if input_text else "",
            output_preview=output[:50] if output else "",
            has_expected=expected is not None,
            session_id=session_id,
        )

        # Format input for critic
        critic_input = self._format_critic_input(input_text, output, expected)

        # Execute via AgentExecutor
        result = await self._executor.execute_agent(
            agent=self.critic_agent,
            input_text=critic_input,
            existing_session_id=session_id,
        )

        if result.status == ExecutionStatus.FAILED:
            raise ScoringError(
                f"Critic agent execution failed: {result.error_message}",
            )

        final_output = result.extracted_value

        if not final_output:
            raise ScoringError("Critic agent returned empty output")

        # Parse output and extract score
        try:
            score, metadata = self._parse_critic_output(final_output)
        except (CriticOutputParseError, MissingScoreFieldError) as e:
            self._logger.error(
                "scorer.async_score.parse_error",
                error=str(e),
                error_type=type(e).__name__,
            )
            raise

        # Log multi-dimensional scoring context if present
        log_context: dict[str, Any] = {
            "score": score,
            "has_feedback": "feedback" in metadata,
            "has_dimension_scores": "dimension_scores" in metadata,
            "has_actionable_guidance": "actionable_guidance" in metadata,
        }
        if "dimension_scores" in metadata:
            log_context["dimension_count"] = len(metadata["dimension_scores"])

        self._logger.info(
            "scorer.async_score.complete",
            **log_context,
        )

        return score, metadata

    def score(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
    ) -> tuple[float, dict[str, Any]]:
        """Score an agent output synchronously using the critic agent.

        Synchronous wrapper around async_score() using asyncio.run().

        Args:
            input_text: The original input provided to the agent being evaluated.
            output: The agent's generated output to score.
            expected: Optional expected/reference output for comparison.

        Returns:
            Tuple of (score, metadata) where:
            - score: Float value, conventionally 0.0-1.0
            - metadata: Dict with feedback, dimension_scores,
                actionable_guidance, and any additional fields

        Raises:
            CriticOutputParseError: If critic output is not valid JSON.
            MissingScoreFieldError: If score field missing from output.

        Examples:
            Basic sync scoring:

            ```python
            score, metadata = scorer.score(
                input_text="What is 2+2?",
                output="4",
                expected="4",
            )
            ```

        Note:
            Operates synchronously by wrapping async_score() with asyncio.run().
            Uses asyncio.run() to execute async_score(). Prefer async_score()
            for better performance in async contexts.
        """
        return asyncio.run(self.async_score(input_text, output, expected))

__init__

__init__(
    critic_agent: BaseAgent,
    executor: AgentExecutorProtocol,
    session_service: BaseSessionService | None = None,
    app_name: str = "critic_scorer",
) -> None

Initialize CriticScorer with critic agent.

PARAMETER DESCRIPTION
critic_agent

ADK agent (LlmAgent or workflow agent) configured for evaluation.

TYPE: BaseAgent

executor

AgentExecutorProtocol implementation for unified agent execution. Handles session management and execution, enabling feature parity across all agent types.

TYPE: AgentExecutorProtocol

session_service

Optional session service for state management. If None, creates an InMemorySessionService.

TYPE: BaseSessionService | None DEFAULT: None

app_name

Application name for session identification.

TYPE: str DEFAULT: 'critic_scorer'

RAISES DESCRIPTION
TypeError

If critic_agent is not a BaseAgent instance.

ValueError

If app_name is empty string.

Examples:

Basic setup with executor:

from gepa_adk.adapters.agent_executor import AgentExecutor

executor = AgentExecutor()
scorer = CriticScorer(critic_agent=critic, executor=executor)

With shared session service:

from google.adk.sessions import InMemorySessionService
from gepa_adk.adapters.agent_executor import AgentExecutor

session_service = InMemorySessionService()
executor = AgentExecutor(session_service=session_service)
scorer = CriticScorer(
    critic_agent=critic,
    executor=executor,
    session_service=session_service,
)
Note

Creates logger with scorer context and validates agent type.
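
A minimal sketch of the constructor validation described above (both calls raise; critic is a valid agent as in the earlier examples):

from gepa_adk.adapters.agent_executor import AgentExecutor
from gepa_adk.adapters.critic_scorer import CriticScorer

executor = AgentExecutor()

try:
    CriticScorer(critic_agent="not an agent", executor=executor)  # type: ignore[arg-type]
except TypeError as exc:
    print(exc)  # critic_agent must be BaseAgent, got <class 'str'>

try:
    CriticScorer(critic_agent=critic, executor=executor, app_name="   ")
except ValueError as exc:
    print(exc)  # app_name cannot be empty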

Source code in src/gepa_adk/adapters/critic_scorer.py
def __init__(
    self,
    critic_agent: BaseAgent,
    executor: AgentExecutorProtocol,
    session_service: BaseSessionService | None = None,
    app_name: str = "critic_scorer",
) -> None:
    """Initialize CriticScorer with critic agent.

    Args:
        critic_agent: ADK agent (LlmAgent or workflow agent) configured
            for evaluation.
        executor: AgentExecutorProtocol implementation for unified agent
            execution. Handles session management and execution, enabling
            feature parity across all agent types.
        session_service: Optional session service for state management.
            If None, creates an InMemorySessionService.
        app_name: Application name for session identification.

    Raises:
        TypeError: If critic_agent is not a BaseAgent instance.
        ValueError: If app_name is empty string.

    Examples:
        Basic setup with executor:

        ```python
        from gepa_adk.adapters.agent_executor import AgentExecutor

        executor = AgentExecutor()
        scorer = CriticScorer(critic_agent=critic, executor=executor)
        ```

        With shared session service:

        ```python
        from google.adk.sessions import InMemorySessionService
        from gepa_adk.adapters.agent_executor import AgentExecutor

        session_service = InMemorySessionService()
        executor = AgentExecutor(session_service=session_service)
        scorer = CriticScorer(
            critic_agent=critic,
            executor=executor,
            session_service=session_service,
        )
        ```

    Note:
        Creates logger with scorer context and validates agent type.
    """
    if not isinstance(critic_agent, BaseAgent):
        raise TypeError(f"critic_agent must be BaseAgent, got {type(critic_agent)}")

    if not app_name or not app_name.strip():
        raise ValueError("app_name cannot be empty")

    self.critic_agent = critic_agent
    self._session_service = session_service or InMemorySessionService()
    self._app_name = app_name.strip()
    self._executor = executor

    # Bind logger with scorer context
    self._logger = logger.bind(
        scorer="CriticScorer",
        agent_name=self.critic_agent.name,
        app_name=self._app_name,
        uses_executor=True,  # Always true since executor is required
    )

    self._logger.info("scorer.initialized")

async_score async

async_score(
    input_text: str,
    output: str,
    expected: str | None = None,
    session_id: str | None = None,
) -> tuple[float, dict[str, Any]]

Score an agent output asynchronously using the critic agent.

Executes the critic agent with formatted input and extracts structured score and metadata from the response.

PARAMETER DESCRIPTION
input_text

The original input provided to the agent being evaluated.

TYPE: str

output

The agent's generated output to score.

TYPE: str

expected

Optional expected/reference output for comparison.

TYPE: str | None DEFAULT: None

session_id

Optional session ID to share state with main agent workflow. If None, creates an isolated session.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
tuple[float, dict[str, Any]]

Tuple of (score, metadata) where:

  • score: Float value, conventionally 0.0-1.0
  • metadata: Dict with feedback, dimension_scores, actionable_guidance, and any additional fields

RAISES DESCRIPTION
CriticOutputParseError

If critic output is not valid JSON.

MissingScoreFieldError

If score field missing from output.

Examples:

Basic async scoring:

score, metadata = await scorer.async_score(
    input_text="What is Python?",
    output="Python is a programming language.",
)

With session sharing:

score, metadata = await scorer.async_score(
    input_text="...",
    output="...",
    session_id="existing_session_123",
)
Note

Orchestrates critic agent execution via AgentExecutor and extracts structured output. Creates isolated session unless session_id provided for state sharing.

Source code in src/gepa_adk/adapters/critic_scorer.py
async def async_score(
    self,
    input_text: str,
    output: str,
    expected: str | None = None,
    session_id: str | None = None,
) -> tuple[float, dict[str, Any]]:
    """Score an agent output asynchronously using the critic agent.

    Executes the critic agent with formatted input and extracts structured
    score and metadata from the response.

    Args:
        input_text: The original input provided to the agent being evaluated.
        output: The agent's generated output to score.
        expected: Optional expected/reference output for comparison.
        session_id: Optional session ID to share state with main agent
            workflow. If None, creates an isolated session.

    Returns:
        Tuple of (score, metadata) where:
        - score: Float value, conventionally 0.0-1.0
        - metadata: Dict with feedback, dimension_scores,
            actionable_guidance, and any additional fields

    Raises:
        CriticOutputParseError: If critic output is not valid JSON.
        MissingScoreFieldError: If score field missing from output.

    Examples:
        Basic async scoring:

        ```python
        score, metadata = await scorer.async_score(
            input_text="What is Python?",
            output="Python is a programming language.",
        )
        ```

        With session sharing:

        ```python
        score, metadata = await scorer.async_score(
            input_text="...",
            output="...",
            session_id="existing_session_123",
        )
        ```

    Note:
        Orchestrates critic agent execution via AgentExecutor and extracts
        structured output. Creates isolated session unless session_id provided
        for state sharing.
    """
    self._logger.debug(
        "scorer.async_score.start",
        input_preview=input_text[:50] if input_text else "",
        output_preview=output[:50] if output else "",
        has_expected=expected is not None,
        session_id=session_id,
    )

    # Format input for critic
    critic_input = self._format_critic_input(input_text, output, expected)

    # Execute via AgentExecutor
    result = await self._executor.execute_agent(
        agent=self.critic_agent,
        input_text=critic_input,
        existing_session_id=session_id,
    )

    if result.status == ExecutionStatus.FAILED:
        raise ScoringError(
            f"Critic agent execution failed: {result.error_message}",
        )

    final_output = result.extracted_value

    if not final_output:
        raise ScoringError("Critic agent returned empty output")

    # Parse output and extract score
    try:
        score, metadata = self._parse_critic_output(final_output)
    except (CriticOutputParseError, MissingScoreFieldError) as e:
        self._logger.error(
            "scorer.async_score.parse_error",
            error=str(e),
            error_type=type(e).__name__,
        )
        raise

    # Log multi-dimensional scoring context if present
    log_context: dict[str, Any] = {
        "score": score,
        "has_feedback": "feedback" in metadata,
        "has_dimension_scores": "dimension_scores" in metadata,
        "has_actionable_guidance": "actionable_guidance" in metadata,
    }
    if "dimension_scores" in metadata:
        log_context["dimension_count"] = len(metadata["dimension_scores"])

    self._logger.info(
        "scorer.async_score.complete",
        **log_context,
    )

    return score, metadata

score

score(
    input_text: str,
    output: str,
    expected: str | None = None,
) -> tuple[float, dict[str, Any]]

Score an agent output synchronously using the critic agent.

Synchronous wrapper around async_score() using asyncio.run().

PARAMETER DESCRIPTION
input_text

The original input provided to the agent being evaluated.

TYPE: str

output

The agent's generated output to score.

TYPE: str

expected

Optional expected/reference output for comparison.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
tuple[float, dict[str, Any]]

Tuple of (score, metadata) where:

  • score: Float value, conventionally 0.0-1.0
  • metadata: Dict with feedback, dimension_scores, actionable_guidance, and any additional fields

RAISES DESCRIPTION
CriticOutputParseError

If critic output is not valid JSON.

MissingScoreFieldError

If score field missing from output.

Examples:

Basic sync scoring:

score, metadata = scorer.score(
    input_text="What is 2+2?",
    output="4",
    expected="4",
)
Note

Wraps async_score() with asyncio.run(), so it must not be called from within an already running event loop. Prefer async_score() in async contexts for better performance.
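
A minimal sketch of the two call patterns, assuming scorer was constructed as shown earlier; calling score() inside a coroutine would raise RuntimeError because asyncio.run() cannot nest in a running loop:

import asyncio

# Sync context: score() drives the event loop itself.
score_value, metadata = scorer.score(input_text="What is 2+2?", output="4")


# Async context: await async_score() directly instead of calling score().
async def evaluate() -> float:
    value, _ = await scorer.async_score(input_text="What is 2+2?", output="4")
    return value


result = asyncio.run(evaluate())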

Source code in src/gepa_adk/adapters/critic_scorer.py
def score(
    self,
    input_text: str,
    output: str,
    expected: str | None = None,
) -> tuple[float, dict[str, Any]]:
    """Score an agent output synchronously using the critic agent.

    Synchronous wrapper around async_score() using asyncio.run().

    Args:
        input_text: The original input provided to the agent being evaluated.
        output: The agent's generated output to score.
        expected: Optional expected/reference output for comparison.

    Returns:
        Tuple of (score, metadata) where:
        - score: Float value, conventionally 0.0-1.0
        - metadata: Dict with feedback, dimension_scores,
            actionable_guidance, and any additional fields

    Raises:
        CriticOutputParseError: If critic output is not valid JSON.
        MissingScoreFieldError: If score field missing from output.

    Examples:
        Basic sync scoring:

        ```python
        score, metadata = scorer.score(
            input_text="What is 2+2?",
            output="4",
            expected="4",
        )
        ```

    Note:
        Operates synchronously by wrapping async_score() with asyncio.run().
        Uses asyncio.run() to execute async_score(). Prefer async_score()
        for better performance in async contexts.
    """
    return asyncio.run(self.async_score(input_text, output, expected))

normalize_feedback

normalize_feedback(
    score: float, metadata: dict[str, Any] | None
) -> dict[str, Any]

Normalize critic feedback to consistent trial format.

Converts both simple and advanced critic outputs to a standardized format for use in trial records. This enables the reflection agent to receive consistent feedback regardless of which critic schema was used.

PARAMETER DESCRIPTION
score

The numeric score from the critic (0.0-1.0).

TYPE: float

metadata

Optional metadata dict from critic output. May contain: - feedback (str): Simple feedback text - dimension_scores (dict): Per-dimension scores - actionable_guidance (str): Improvement suggestions - Any additional fields from critic output

TYPE: dict[str, Any] | None

RETURNS DESCRIPTION
dict[str, Any]

Normalized feedback dict with structure:

{
    "score": 0.75,
    "feedback_text": "Main feedback message",
    "dimension_scores": {...},  # Optional
    "actionable_guidance": "...",  # Optional
}

Examples:

Normalize simple feedback:

normalized = normalize_feedback(0.8, {"feedback": "Good job"})
# {"score": 0.8, "feedback_text": "Good job"}

Normalize advanced feedback:

normalized = normalize_feedback(
    0.6,
    {
        "feedback": "Needs work",
        "dimension_scores": {"clarity": 0.5},
        "actionable_guidance": "Add examples",
    },
)
# {
#     "score": 0.6,
#     "feedback_text": "Needs work",
#     "dimension_scores": {"clarity": 0.5},
#     "actionable_guidance": "Add examples",
# }

Handle missing feedback:

normalized = normalize_feedback(0.5, None)
# {"score": 0.5, "feedback_text": ""}
Note

Supports both SimpleCriticOutput and CriticOutput schemas for flexible critic integration. Extracts the "feedback" field and renames it to "feedback_text" for consistent trial structure. Additional fields like dimension_scores are preserved when present.
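
A short sketch tying the pieces together: score with the critic, then normalize the result before storing it in a trial record (the trial dict shape is illustrative):

from gepa_adk.adapters.critic_scorer import normalize_feedback

score, metadata = await scorer.async_score(
    input_text="What is Python?",
    output="Python is a programming language.",
)
trial = {"input": "What is Python?", **normalize_feedback(score, metadata)}
# trial always contains "score" and "feedback_text"; dimension_scores and
# actionable_guidance appear only when the critic provided them.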

Source code in src/gepa_adk/adapters/critic_scorer.py
def normalize_feedback(
    score: float,
    metadata: dict[str, Any] | None,
) -> dict[str, Any]:
    """Normalize critic feedback to consistent trial format.

    Converts both simple and advanced critic outputs to a standardized
    format for use in trial records. This enables the reflection agent
    to receive consistent feedback regardless of which critic schema
    was used.

    Args:
        score: The numeric score from the critic (0.0-1.0).
        metadata: Optional metadata dict from critic output. May contain:
            - feedback (str): Simple feedback text
            - dimension_scores (dict): Per-dimension scores
            - actionable_guidance (str): Improvement suggestions
            - Any additional fields from critic output

    Returns:
        Normalized feedback dict with structure:
        ```python
        {
            "score": 0.75,
            "feedback_text": "Main feedback message",
            "dimension_scores": {...},  # Optional
            "actionable_guidance": "...",  # Optional
        }
        ```

    Examples:
        Normalize simple feedback:

        ```python
        normalized = normalize_feedback(0.8, {"feedback": "Good job"})
        # {"score": 0.8, "feedback_text": "Good job"}
        ```

        Normalize advanced feedback:

        ```python
        normalized = normalize_feedback(
            0.6,
            {
                "feedback": "Needs work",
                "dimension_scores": {"clarity": 0.5},
                "actionable_guidance": "Add examples",
            },
        )
        # {
        #     "score": 0.6,
        #     "feedback_text": "Needs work",
        #     "dimension_scores": {"clarity": 0.5},
        #     "actionable_guidance": "Add examples",
        # }
        ```

        Handle missing feedback:

        ```python
        normalized = normalize_feedback(0.5, None)
        # {"score": 0.5, "feedback_text": ""}
        ```

    Note:
        Supports both SimpleCriticOutput and CriticOutput schemas for flexible
        critic integration. Extracts the "feedback" field and renames it to
        "feedback_text" for consistent trial structure. Additional fields
        like dimension_scores are preserved when present.
    """
    result: dict[str, Any] = {"score": score}

    if metadata is None:
        result["feedback_text"] = ""
        return result

    # Extract feedback text - handle both "feedback" and "feedback_text" keys
    feedback_text = metadata.get("feedback_text") or metadata.get("feedback") or ""
    if isinstance(feedback_text, str) and feedback_text.strip():
        result["feedback_text"] = feedback_text.strip()
    else:
        result["feedback_text"] = ""

    # Preserve dimension_scores if present
    dimension_scores = metadata.get("dimension_scores")
    if dimension_scores and isinstance(dimension_scores, dict):
        result["dimension_scores"] = dimension_scores

    # Preserve actionable_guidance if present
    actionable_guidance = metadata.get("actionable_guidance")
    if actionable_guidance and isinstance(actionable_guidance, str):
        guidance_str = actionable_guidance.strip()
        if guidance_str:
            result["actionable_guidance"] = guidance_str

    return result