
Scoring

scoring

Scoring infrastructure for evolution evaluation.

Contains CriticScorer for LLM-based evaluation and the create_critic() preset factory for pre-configured critic agents.

ATTRIBUTE DESCRIPTION
CriticScorer

LLM-based scorer using critic agents.

SimpleCriticOutput

KISS schema with score + feedback.

CriticOutput

Advanced schema with dimensions and guidance.

SIMPLE_CRITIC_INSTRUCTION

Generic instruction for simple critics.

TYPE: str

ADVANCED_CRITIC_INSTRUCTION

Generic instruction for advanced critics.

TYPE: str

STRUCTURED_OUTPUT_CRITIC_INSTRUCTION

Preset instruction for structure evaluation.

TYPE: str

ACCURACY_CRITIC_INSTRUCTION

Preset instruction for factual accuracy evaluation.

TYPE: str

RELEVANCE_CRITIC_INSTRUCTION

Preset instruction for relevance evaluation.

TYPE: str

normalize_feedback

Normalizes critic output to trial format.

TYPE: dict[str, Any]

create_critic

Factory for pre-configured critic agents by preset name.

TYPE: LlmAgent

critic_presets

Maps preset name to human-readable description.

TYPE: dict[str, str]

Examples:

Create a critic scorer with an executor:

from google.adk.agents import LlmAgent
from gepa_adk.adapters.scoring import CriticScorer, CriticOutput
from gepa_adk.adapters.execution.agent_executor import AgentExecutor

critic = LlmAgent(
    name="quality_critic",
    model="gemini-2.5-flash",
    instruction="Evaluate response quality...",
    output_schema=CriticOutput,
)
executor = AgentExecutor()
scorer = CriticScorer(critic_agent=critic, executor=executor)

Note

This package isolates critic-based scoring from other adapter concerns.

CriticOutput

Bases: BaseModel



Advanced schema for structured critic feedback with dimensions.

This schema defines the expected JSON structure that critic agents should return when configured with output_schema. The score field is required, while other fields are optional and will be preserved in metadata.

ATTRIBUTE DESCRIPTION
score

Score value between 0.0 and 1.0 (required).

TYPE: float

feedback

Human-readable feedback text (optional).

TYPE: str

dimension_scores

Per-dimension evaluation scores (optional).

TYPE: dict[str, float]

actionable_guidance

Specific improvement suggestions (optional).

TYPE: str

Examples:

Advanced critic output:

{
    "score": 0.75,
    "feedback": "Good response but could be more concise",
    "dimension_scores": {
        "accuracy": 0.9,
        "clarity": 0.6,
        "completeness": 0.8
    },
    "actionable_guidance": "Reduce response length by 30%"
}
Note

All critic agents using this schema must return structured JSON. When this schema is used as output_schema on an LlmAgent, the agent can ONLY reply and CANNOT use any tools. This is acceptable for critic agents focused on scoring.
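The contract stated above (a required `score` in [0.0, 1.0], everything else optional) can be checked in a few lines of stdlib Python. `validate_critic_payload` below is a hypothetical helper written for illustration, not part of gepa_adk:

```python
import json


def validate_critic_payload(raw: str) -> dict:
    """Check a critic JSON payload against the documented contract:
    'score' is required and must lie in [0.0, 1.0]; other fields
    (feedback, dimension_scores, actionable_guidance) are optional."""
    payload = json.loads(raw)
    if "score" not in payload:
        raise ValueError("missing required 'score' field")
    score = payload["score"]
    if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be a number in [0.0, 1.0], got {score!r}")
    return payload


payload = validate_critic_payload(
    '{"score": 0.75, "feedback": "Good response but could be more concise"}'
)
print(payload["score"])  # 0.75
```

In the library itself this validation is handled by Pydantic via the `Field` constraints on `CriticOutput`; the sketch just makes the bounds explicit.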

See Also

gepa_adk.adapters.scoring.critic_scorer.SimpleCriticOutput: KISS schema with just score + feedback.

Source code in src/gepa_adk/adapters/scoring/critic_scorer.py
class CriticOutput(BaseModel):
    """Advanced schema for structured critic feedback with dimensions.

    This schema defines the expected JSON structure that critic agents
    should return when configured with output_schema. The score field is
    required, while other fields are optional and will be preserved in
    metadata.

    Attributes:
        score (float): Score value between 0.0 and 1.0 (required).
        feedback (str): Human-readable feedback text (optional).
        dimension_scores (dict[str, float]): Per-dimension evaluation scores (optional).
        actionable_guidance (str): Specific improvement suggestions (optional).

    Examples:
        Advanced critic output:

        ```json
        {
            "score": 0.75,
            "feedback": "Good response but could be more concise",
            "dimension_scores": {
                "accuracy": 0.9,
                "clarity": 0.6,
                "completeness": 0.8
            },
            "actionable_guidance": "Reduce response length by 30%"
        }
        ```

    Note:
        All critic agents using this schema must return structured JSON.
        When this schema is used as output_schema on an LlmAgent, the
        agent can ONLY reply and CANNOT use any tools. This is acceptable
        for critic agents focused on scoring.

    See Also:
        [gepa_adk.adapters.scoring.critic_scorer.SimpleCriticOutput][]:
            KISS schema with just score + feedback.
    """

    score: float = Field(
        ...,
        ge=0.0,
        le=1.0,
        description="Score from 0.0 to 1.0",
    )
    feedback: str = Field(
        default="",
        description="Human-readable feedback",
    )
    dimension_scores: dict[str, float] = Field(
        default_factory=dict,
        description="Per-dimension scores",
    )
    actionable_guidance: str = Field(
        default="",
        description="Improvement suggestions",
    )

CriticScorer

Adapter that wraps ADK critic agents to provide structured scoring.

CriticScorer implements the Scorer protocol, enabling integration with gepa-adk's evaluation and evolution workflows. It executes ADK critic agents (LlmAgent, SequentialAgent, etc.) and extracts structured scores with metadata from their outputs.

ATTRIBUTE DESCRIPTION
critic_agent

ADK agent configured for evaluation.

TYPE: BaseAgent

_session_service

Session service for state management.

TYPE: BaseSessionService

_app_name

Application name for session identification.

TYPE: str

_logger

Bound logger with scorer context.

TYPE: BoundLogger

Examples:

Basic usage:

from google.adk.agents import LlmAgent
from gepa_adk.adapters.scoring.critic_scorer import CriticScorer, CriticOutput
from gepa_adk.adapters.execution.agent_executor import AgentExecutor

critic = LlmAgent(
    name="quality_critic",
    model="gemini-2.5-flash",
    instruction="Evaluate response quality...",
    output_schema=CriticOutput,
)

executor = AgentExecutor()
scorer = CriticScorer(critic_agent=critic, executor=executor)
score, metadata = await scorer.async_score(
    input_text="What is Python?",
    output="Python is a programming language.",
)
Note

Adapter wraps ADK critic agents to provide structured scoring. Implements Scorer protocol for compatibility with evolution engine. Creates isolated sessions per scoring call unless session_id provided.

Source code in src/gepa_adk/adapters/scoring/critic_scorer.py
class CriticScorer:
    """Adapter that wraps ADK critic agents to provide structured scoring.

    CriticScorer implements the Scorer protocol, enabling integration with
    gepa-adk's evaluation and evolution workflows. It executes ADK critic
    agents (LlmAgent, SequentialAgent, etc.) and extracts structured scores
    with metadata from their outputs.

    Attributes:
        critic_agent (BaseAgent): ADK agent configured for evaluation.
        _session_service (BaseSessionService): Session service for state
            management.
        _app_name (str): Application name for session identification.
        _logger (structlog.BoundLogger): Bound logger with scorer context.

    Examples:
        Basic usage:

        ```python
        from google.adk.agents import LlmAgent
        from gepa_adk.adapters.scoring.critic_scorer import CriticScorer, CriticOutput
        from gepa_adk.adapters.execution.agent_executor import AgentExecutor

        critic = LlmAgent(
            name="quality_critic",
            model="gemini-2.5-flash",
            instruction="Evaluate response quality...",
            output_schema=CriticOutput,
        )

        executor = AgentExecutor()
        scorer = CriticScorer(critic_agent=critic, executor=executor)
        score, metadata = await scorer.async_score(
            input_text="What is Python?",
            output="Python is a programming language.",
        )
        ```

    Note:
        Adapter wraps ADK critic agents to provide structured scoring.
        Implements Scorer protocol for compatibility with evolution engine.
        Creates isolated sessions per scoring call unless session_id provided.
    """

    def __init__(
        self,
        critic_agent: BaseAgent,
        executor: AgentExecutorProtocol,
        session_service: BaseSessionService | None = None,
        app_name: str = "critic_scorer",
    ) -> None:
        """Initialize CriticScorer with critic agent.

        Args:
            critic_agent: ADK agent (LlmAgent or workflow agent) configured
                for evaluation.
            executor: AgentExecutorProtocol implementation for unified agent
                execution. Handles session management and execution, enabling
                feature parity across all agent types.
            session_service: Optional session service for state management.
                If None, creates an InMemorySessionService.
            app_name: Application name for session identification.

        Raises:
            TypeError: If critic_agent is not a BaseAgent instance.
            ValueError: If app_name is empty string.

        Examples:
            Basic setup with executor:

            ```python
            from gepa_adk.adapters.execution.agent_executor import AgentExecutor

            executor = AgentExecutor()
            scorer = CriticScorer(critic_agent=critic, executor=executor)
            ```

            With shared session service:

            ```python
            from google.adk.sessions import InMemorySessionService
            from gepa_adk.adapters.execution.agent_executor import AgentExecutor

            session_service = InMemorySessionService()
            executor = AgentExecutor(session_service=session_service)
            scorer = CriticScorer(
                critic_agent=critic,
                executor=executor,
                session_service=session_service,
            )
            ```

        Note:
            Creates logger with scorer context and validates agent type.
        """
        if not isinstance(critic_agent, BaseAgent):
            raise TypeError(f"critic_agent must be BaseAgent, got {type(critic_agent)}")

        if not app_name or not app_name.strip():
            raise ValueError("app_name cannot be empty")

        self.critic_agent = critic_agent
        self._session_service = session_service or InMemorySessionService()
        self._app_name = app_name.strip()
        self._executor = executor

        # Bind logger with scorer context
        self._logger = logger.bind(
            scorer="CriticScorer",
            agent_name=self.critic_agent.name,
            app_name=self._app_name,
            uses_executor=True,  # Always true since executor is required
        )

        self._logger.info("scorer.initialized")

    def _format_critic_input(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
    ) -> str:
        """Format input for critic agent evaluation.

        Builds a prompt that presents the input query, agent output, and
        optionally the expected output for the critic to evaluate.

        Args:
            input_text: The original input provided to the agent being evaluated.
            output: The agent's generated output to score.
            expected: Optional expected/reference output for comparison.

        Returns:
            Formatted prompt string for the critic agent.

        Examples:
            Basic formatting:

            ```python
            prompt = scorer._format_critic_input(
                input_text="What is 2+2?",
                output="4",
                expected="4",
            )
            ```

        Note:
            Organizes input for critic evaluation with clearly labeled sections.
            Format is designed to give critic context for evaluation.
            Expected output is included only if provided.
        """
        parts = [
            "Input Query:",
            input_text,
            "",
            "Agent Output:",
            output,
        ]

        if expected is not None:
            parts.extend(
                [
                    "",
                    "Expected Output:",
                    expected,
                ]
            )

        parts.append("")
        parts.append(
            "Please evaluate the agent output and provide a score with feedback."
        )

        return "\n".join(parts)

    def _parse_critic_output(self, output_text: str) -> tuple[float, dict[str, Any]]:
        """Parse critic agent output and extract score with metadata.

        Parses the critic's output text as JSON and extracts the score field
        along with optional metadata (feedback, dimension_scores,
        actionable_guidance, and any additional fields).

        Args:
            output_text: Raw text output from critic agent.

        Returns:
            Tuple of (score, metadata) where:
            - score: Float value extracted from output
            - metadata: Dict containing feedback, dimension_scores,
                actionable_guidance, and any additional fields

        Raises:
            CriticOutputParseError: If output cannot be parsed as JSON.
            MissingScoreFieldError: If parsed JSON lacks required score field.

        Examples:
            Parse structured output:

            ```python
            output = '{"score": 0.75, "feedback": "Good", "dimension_scores": {"accuracy": 0.9}}'
            score, metadata = scorer._parse_critic_output(output)
            assert score == 0.75
            assert metadata["feedback"] == "Good"
            ```

        Note:
            Obtains score and metadata from critic JSON output with validation.
            Preserves all fields from parsed JSON in metadata, not just
            the known CriticOutput schema fields. This allows for extensibility.
        """
        # Parse JSON output
        try:
            parsed = json.loads(output_text)
        except json.JSONDecodeError as e:
            raise CriticOutputParseError(
                f"Critic output is not valid JSON: {e}",
                raw_output=output_text,
                parse_error=str(e),
                cause=e,
            ) from e

        # Validate parsed output is a dict
        if not isinstance(parsed, dict):
            raise CriticOutputParseError(
                f"Critic output must be a JSON object, got {type(parsed).__name__}",
                raw_output=output_text,
                parse_error="Not a JSON object",
            )

        # Extract required score field
        if "score" not in parsed:
            raise MissingScoreFieldError(
                "Critic output missing required 'score' field",
                parsed_output=parsed,
            )

        score = parsed["score"]
        if not isinstance(score, (int, float)):
            raise MissingScoreFieldError(
                f"Score field must be numeric, got {type(score).__name__}",
                parsed_output=parsed,
            )

        # Build metadata dict with known fields and any additional fields
        metadata: dict[str, Any] = {}

        # Extract known fields if present
        if "feedback" in parsed:
            metadata["feedback"] = str(parsed["feedback"])
        if "dimension_scores" in parsed:
            # Preserve dimension_scores as-is (may contain non-numeric values)
            metadata["dimension_scores"] = parsed["dimension_scores"]
        if "actionable_guidance" in parsed:
            metadata["actionable_guidance"] = str(parsed["actionable_guidance"])

        # Preserve any additional fields
        known_fields = {"score", "feedback", "dimension_scores", "actionable_guidance"}
        for key, value in parsed.items():
            if key not in known_fields:
                metadata[key] = value

        return float(score), metadata

    def _extract_json_from_text(self, text: str) -> str:
        """Extract JSON from text that may contain markdown code blocks.

        Minimal implementation - tries direct parse and markdown extraction.
        A more robust implementation will be added per GitHub issue #78.

        Args:
            text: Text that may contain JSON.

        Returns:
            Extracted JSON string, or original text if extraction fails.

        Note:
            Operates as a minimal JSON extractor; robust implementation planned
            per GitHub issue #78.
        """
        # Try parsing the entire text as-is
        try:
            json.loads(text.strip())
            return text.strip()
        except json.JSONDecodeError:
            pass

        # Extract from markdown code blocks (```json ... ``` or ``` ... ```)
        json_block_pattern = r"```(?:json)?\s*\n?(.*?)\n?```"
        matches = re.findall(json_block_pattern, text, re.DOTALL | re.IGNORECASE)
        for match in matches:
            try:
                json.loads(match.strip())
                return match.strip()
            except json.JSONDecodeError:
                continue

        # Try to find JSON object embedded in text (minimal regex for { ... })
        # Look for opening brace and try to find matching closing brace
        brace_start = text.find("{")
        if brace_start != -1:
            # Try to find the matching closing brace
            # NOTE: This algorithm doesn't account for braces within string literals
            # (e.g., JSON with template strings like "instruction": "Use {variable}").
            # This is a minimal implementation; a more robust parser will be added
            # per GitHub issue #78.
            depth = 0
            for i in range(brace_start, len(text)):
                if text[i] == "{":
                    depth += 1
                elif text[i] == "}":
                    depth -= 1
                    if depth == 0:
                        candidate = text[brace_start : i + 1]
                        try:
                            json.loads(candidate)
                            return candidate
                        except json.JSONDecodeError:
                            break

        # Return original text (will fail with clear error message)
        return text

    async def async_score(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
        session_id: str | None = None,
    ) -> tuple[float, dict[str, Any]]:
        """Score an agent output asynchronously using the critic agent.

        Executes the critic agent with formatted input and extracts structured
        score and metadata from the response.

        Args:
            input_text: The original input provided to the agent being evaluated.
            output: The agent's generated output to score.
            expected: Optional expected/reference output for comparison.
            session_id: Optional session ID to share state with main agent
                workflow. If None, creates an isolated session.

        Returns:
            Tuple of (score, metadata) where:
            - score: Float value, conventionally 0.0-1.0
            - metadata: Dict with feedback, dimension_scores,
                actionable_guidance, and any additional fields

        Raises:
            CriticOutputParseError: If critic output is not valid JSON.
            MissingScoreFieldError: If score field missing from output.

        Examples:
            Basic async scoring:

            ```python
            score, metadata = await scorer.async_score(
                input_text="What is Python?",
                output="Python is a programming language.",
            )
            ```

            With session sharing:

            ```python
            score, metadata = await scorer.async_score(
                input_text="...",
                output="...",
                session_id="existing_session_123",
            )
            ```

        Note:
            Orchestrates critic agent execution via AgentExecutor and extracts
            structured output. Creates isolated session unless session_id provided
            for state sharing.
        """
        self._logger.debug(
            "scorer.async_score.start",
            input_preview=input_text[:50] if input_text else "",
            output_preview=output[:50] if output else "",
            has_expected=expected is not None,
            session_id=session_id,
        )

        # Format input for critic
        critic_input = self._format_critic_input(input_text, output, expected)

        # Execute via AgentExecutor
        result = await self._executor.execute_agent(
            agent=self.critic_agent,
            input_text=critic_input,
            existing_session_id=session_id,
        )

        if result.status == ExecutionStatus.FAILED:
            raise ScoringError(
                f"Critic agent execution failed: {result.error_message}",
            )

        final_output = result.extracted_value

        if not final_output:
            raise ScoringError("Critic agent returned empty output")

        # Parse output and extract score
        try:
            score, metadata = self._parse_critic_output(final_output)
        except (CriticOutputParseError, MissingScoreFieldError) as e:
            self._logger.error(
                "scorer.async_score.parse_error",
                error=str(e),
                error_type=type(e).__name__,
            )
            raise

        # Log multi-dimensional scoring context if present
        log_context: dict[str, Any] = {
            "score": score,
            "has_feedback": "feedback" in metadata,
            "has_dimension_scores": "dimension_scores" in metadata,
            "has_actionable_guidance": "actionable_guidance" in metadata,
        }
        if "dimension_scores" in metadata:
            log_context["dimension_count"] = len(metadata["dimension_scores"])

        self._logger.info(
            "scorer.async_score.complete",
            **log_context,
        )

        return score, metadata

    def score(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
    ) -> tuple[float, dict[str, Any]]:
        """Score an agent output synchronously using the critic agent.

        Synchronous wrapper around async_score() using asyncio.run().

        Args:
            input_text: The original input provided to the agent being evaluated.
            output: The agent's generated output to score.
            expected: Optional expected/reference output for comparison.

        Returns:
            Tuple of (score, metadata) where:
            - score: Float value, conventionally 0.0-1.0
            - metadata: Dict with feedback, dimension_scores,
                actionable_guidance, and any additional fields

        Raises:
            CriticOutputParseError: If critic output is not valid JSON.
            MissingScoreFieldError: If score field missing from output.

        Examples:
            Basic sync scoring:

            ```python
            score, metadata = scorer.score(
                input_text="What is 2+2?",
                output="4",
                expected="4",
            )
            ```

        Note:
            Operates synchronously by wrapping async_score() with asyncio.run().
            Uses asyncio.run() to execute async_score(). Prefer async_score()
            for better performance in async contexts.
        """
        return asyncio.run(self.async_score(input_text, output, expected))
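The extract-then-parse pipeline implemented by `_extract_json_from_text` and `_parse_critic_output` above can be sketched as one standalone stdlib function (a simplified sketch, not the library code; it handles only the markdown-fence case):

```python
import json
import re
from typing import Any


def extract_and_score(text: str) -> tuple[float, dict[str, Any]]:
    """Strip a markdown code fence if present, parse the JSON inside,
    and split the required score from the remaining metadata fields."""
    candidate = text.strip()
    match = re.search(r"```(?:json)?\s*\n?(.*?)\n?```", text, re.DOTALL | re.IGNORECASE)
    if match:
        candidate = match.group(1).strip()
    parsed = json.loads(candidate)
    if "score" not in parsed:
        raise KeyError("critic output missing required 'score' field")
    # Everything except the score is preserved as metadata, mirroring
    # the extensibility behavior of _parse_critic_output.
    metadata = {k: v for k, v in parsed.items() if k != "score"}
    return float(parsed["score"]), metadata


raw = '```json\n{"score": 0.8, "feedback": "Clear and accurate"}\n```'
score, metadata = extract_and_score(raw)
print(score, metadata["feedback"])  # 0.8 Clear and accurate
```

The real implementation additionally falls back to brace matching for JSON embedded in free text and raises typed errors (`CriticOutputParseError`, `MissingScoreFieldError`) rather than stdlib exceptions.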

__init__

__init__(
    critic_agent: BaseAgent,
    executor: AgentExecutorProtocol,
    session_service: BaseSessionService | None = None,
    app_name: str = "critic_scorer",
) -> None

Initialize CriticScorer with critic agent.

PARAMETER DESCRIPTION
critic_agent

ADK agent (LlmAgent or workflow agent) configured for evaluation.

TYPE: BaseAgent

executor

AgentExecutorProtocol implementation for unified agent execution. Handles session management and execution, enabling feature parity across all agent types.

TYPE: AgentExecutorProtocol

session_service

Optional session service for state management. If None, creates an InMemorySessionService.

TYPE: BaseSessionService | None DEFAULT: None

app_name

Application name for session identification.

TYPE: str DEFAULT: 'critic_scorer'

RAISES DESCRIPTION
TypeError

If critic_agent is not a BaseAgent instance.

ValueError

If app_name is empty string.

Examples:

Basic setup with executor:

from gepa_adk.adapters.execution.agent_executor import AgentExecutor

executor = AgentExecutor()
scorer = CriticScorer(critic_agent=critic, executor=executor)

With shared session service:

from google.adk.sessions import InMemorySessionService
from gepa_adk.adapters.execution.agent_executor import AgentExecutor

session_service = InMemorySessionService()
executor = AgentExecutor(session_service=session_service)
scorer = CriticScorer(
    critic_agent=critic,
    executor=executor,
    session_service=session_service,
)
Note

Creates logger with scorer context and validates agent type.

Source code in src/gepa_adk/adapters/scoring/critic_scorer.py
def __init__(
    self,
    critic_agent: BaseAgent,
    executor: AgentExecutorProtocol,
    session_service: BaseSessionService | None = None,
    app_name: str = "critic_scorer",
) -> None:
    """Initialize CriticScorer with critic agent.

    Args:
        critic_agent: ADK agent (LlmAgent or workflow agent) configured
            for evaluation.
        executor: AgentExecutorProtocol implementation for unified agent
            execution. Handles session management and execution, enabling
            feature parity across all agent types.
        session_service: Optional session service for state management.
            If None, creates an InMemorySessionService.
        app_name: Application name for session identification.

    Raises:
        TypeError: If critic_agent is not a BaseAgent instance.
        ValueError: If app_name is empty string.

    Examples:
        Basic setup with executor:

        ```python
        from gepa_adk.adapters.execution.agent_executor import AgentExecutor

        executor = AgentExecutor()
        scorer = CriticScorer(critic_agent=critic, executor=executor)
        ```

        With shared session service:

        ```python
        from google.adk.sessions import InMemorySessionService
        from gepa_adk.adapters.execution.agent_executor import AgentExecutor

        session_service = InMemorySessionService()
        executor = AgentExecutor(session_service=session_service)
        scorer = CriticScorer(
            critic_agent=critic,
            executor=executor,
            session_service=session_service,
        )
        ```

    Note:
        Creates logger with scorer context and validates agent type.
    """
    if not isinstance(critic_agent, BaseAgent):
        raise TypeError(f"critic_agent must be BaseAgent, got {type(critic_agent)}")

    if not app_name or not app_name.strip():
        raise ValueError("app_name cannot be empty")

    self.critic_agent = critic_agent
    self._session_service = session_service or InMemorySessionService()
    self._app_name = app_name.strip()
    self._executor = executor

    # Bind logger with scorer context
    self._logger = logger.bind(
        scorer="CriticScorer",
        agent_name=self.critic_agent.name,
        app_name=self._app_name,
        uses_executor=True,  # Always true since executor is required
    )

    self._logger.info("scorer.initialized")

async_score async

async_score(
    input_text: str,
    output: str,
    expected: str | None = None,
    session_id: str | None = None,
) -> tuple[float, dict[str, Any]]

Score an agent output asynchronously using the critic agent.

Executes the critic agent with formatted input and extracts structured score and metadata from the response.

PARAMETER DESCRIPTION
input_text

The original input provided to the agent being evaluated.

TYPE: str

output

The agent's generated output to score.

TYPE: str

expected

Optional expected/reference output for comparison.

TYPE: str | None DEFAULT: None

session_id

Optional session ID to share state with main agent workflow. If None, creates an isolated session.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
tuple[float, dict[str, Any]]

Tuple of (score, metadata) where:

  • score: Float value, conventionally 0.0-1.0
  • metadata: Dict with feedback, dimension_scores, actionable_guidance, and any additional fields

RAISES DESCRIPTION
CriticOutputParseError

If critic output is not valid JSON.

MissingScoreFieldError

If score field missing from output.

Examples:

Basic async scoring:

score, metadata = await scorer.async_score(
    input_text="What is Python?",
    output="Python is a programming language.",
)

With session sharing:

score, metadata = await scorer.async_score(
    input_text="...",
    output="...",
    session_id="existing_session_123",
)
Note

Orchestrates critic agent execution via AgentExecutor and extracts structured output. Creates isolated session unless session_id provided for state sharing.

Source code in src/gepa_adk/adapters/scoring/critic_scorer.py
async def async_score(
    self,
    input_text: str,
    output: str,
    expected: str | None = None,
    session_id: str | None = None,
) -> tuple[float, dict[str, Any]]:
    """Score an agent output asynchronously using the critic agent.

    Executes the critic agent with formatted input and extracts structured
    score and metadata from the response.

    Args:
        input_text: The original input provided to the agent being evaluated.
        output: The agent's generated output to score.
        expected: Optional expected/reference output for comparison.
        session_id: Optional session ID to share state with main agent
            workflow. If None, creates an isolated session.

    Returns:
        Tuple of (score, metadata) where:
        - score: Float value, conventionally 0.0-1.0
        - metadata: Dict with feedback, dimension_scores,
            actionable_guidance, and any additional fields

    Raises:
        CriticOutputParseError: If critic output is not valid JSON.
        MissingScoreFieldError: If score field missing from output.

    Examples:
        Basic async scoring:

        ```python
        score, metadata = await scorer.async_score(
            input_text="What is Python?",
            output="Python is a programming language.",
        )
        ```

        With session sharing:

        ```python
        score, metadata = await scorer.async_score(
            input_text="...",
            output="...",
            session_id="existing_session_123",
        )
        ```

    Note:
        Orchestrates critic agent execution via AgentExecutor and extracts
        structured output. Creates isolated session unless session_id provided
        for state sharing.
    """
    self._logger.debug(
        "scorer.async_score.start",
        input_preview=input_text[:50] if input_text else "",
        output_preview=output[:50] if output else "",
        has_expected=expected is not None,
        session_id=session_id,
    )

    # Format input for critic
    critic_input = self._format_critic_input(input_text, output, expected)

    # Execute via AgentExecutor
    result = await self._executor.execute_agent(
        agent=self.critic_agent,
        input_text=critic_input,
        existing_session_id=session_id,
    )

    if result.status == ExecutionStatus.FAILED:
        raise ScoringError(
            f"Critic agent execution failed: {result.error_message}",
        )

    final_output = result.extracted_value

    if not final_output:
        raise ScoringError("Critic agent returned empty output")

    # Parse output and extract score
    try:
        score, metadata = self._parse_critic_output(final_output)
    except (CriticOutputParseError, MissingScoreFieldError) as e:
        self._logger.error(
            "scorer.async_score.parse_error",
            error=str(e),
            error_type=type(e).__name__,
        )
        raise

    # Log multi-dimensional scoring context if present
    log_context: dict[str, Any] = {
        "score": score,
        "has_feedback": "feedback" in metadata,
        "has_dimension_scores": "dimension_scores" in metadata,
        "has_actionable_guidance": "actionable_guidance" in metadata,
    }
    if "dimension_scores" in metadata:
        log_context["dimension_count"] = len(metadata["dimension_scores"])

    self._logger.info(
        "scorer.async_score.complete",
        **log_context,
    )

    return score, metadata

score

score(
    input_text: str,
    output: str,
    expected: str | None = None,
) -> tuple[float, dict[str, Any]]

Score an agent output synchronously using the critic agent.

Synchronous wrapper around async_score() using asyncio.run().

PARAMETER DESCRIPTION
input_text

The original input provided to the agent being evaluated.

TYPE: str

output

The agent's generated output to score.

TYPE: str

expected

Optional expected/reference output for comparison.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
tuple[float, dict[str, Any]]

Tuple of (score, metadata) where:
  • score: Float value, conventionally 0.0-1.0
  • metadata: Dict with feedback, dimension_scores, actionable_guidance, and any additional fields
RAISES DESCRIPTION
CriticOutputParseError

If critic output is not valid JSON.

MissingScoreFieldError

If score field missing from output.

Examples:

Basic sync scoring:

score, metadata = scorer.score(
    input_text="What is 2+2?",
    output="4",
    expected="4",
)
Note

Synchronous wrapper that runs async_score() via asyncio.run(). Prefer async_score() for better performance in async contexts; asyncio.run() raises RuntimeError when called from an already-running event loop.

Source code in src/gepa_adk/adapters/scoring/critic_scorer.py
def score(
    self,
    input_text: str,
    output: str,
    expected: str | None = None,
) -> tuple[float, dict[str, Any]]:
    """Score an agent output synchronously using the critic agent.

    Synchronous wrapper around async_score() using asyncio.run().

    Args:
        input_text: The original input provided to the agent being evaluated.
        output: The agent's generated output to score.
        expected: Optional expected/reference output for comparison.

    Returns:
        Tuple of (score, metadata) where:
        - score: Float value, conventionally 0.0-1.0
        - metadata: Dict with feedback, dimension_scores,
            actionable_guidance, and any additional fields

    Raises:
        CriticOutputParseError: If critic output is not valid JSON.
        MissingScoreFieldError: If score field missing from output.

    Examples:
        Basic sync scoring:

        ```python
        score, metadata = scorer.score(
            input_text="What is 2+2?",
            output="4",
            expected="4",
        )
        ```

    Note:
        Synchronous wrapper that runs async_score() via asyncio.run().
        Prefer async_score() for better performance in async contexts;
        asyncio.run() raises RuntimeError when called from an
        already-running event loop.
    """
    return asyncio.run(self.async_score(input_text, output, expected))

SimpleCriticOutput

Bases: BaseModel


              flowchart TD
              gepa_adk.adapters.scoring.SimpleCriticOutput[SimpleCriticOutput]

              

              click gepa_adk.adapters.scoring.SimpleCriticOutput href "" "gepa_adk.adapters.scoring.SimpleCriticOutput"
            

KISS schema for basic critic feedback.

This is the minimal schema for critic agents that only need to provide a score and text feedback. Use this for straightforward evaluation tasks where dimension breakdowns are not needed.

ATTRIBUTE DESCRIPTION
score

Score value between 0.0 and 1.0 (required).

TYPE: float

feedback

Human-readable feedback text (required).

TYPE: str

Examples:

Simple critic output:

{
    "score": 0.75,
    "feedback": "Good response but could be more concise."
}

Using with LlmAgent:

from google.adk.agents import LlmAgent
from gepa_adk.adapters.scoring.critic_scorer import SimpleCriticOutput

critic = LlmAgent(
    name="simple_critic",
    model="gemini-2.5-flash",
    instruction=SIMPLE_CRITIC_INSTRUCTION,
    output_schema=SimpleCriticOutput,
)
Note

Applies to basic evaluation tasks where only a score and feedback are needed. For more detailed evaluations with dimension scores, use CriticOutput instead.

See Also

gepa_adk.adapters.scoring.critic_scorer.CriticOutput: Advanced schema with dimension scores and guidance.

Source code in src/gepa_adk/adapters/scoring/critic_scorer.py
class SimpleCriticOutput(BaseModel):
    """KISS schema for basic critic feedback.

    This is the minimal schema for critic agents that only need to provide
    a score and text feedback. Use this for straightforward evaluation tasks
    where dimension breakdowns are not needed.

    Attributes:
        score (float): Score value between 0.0 and 1.0 (required).
        feedback (str): Human-readable feedback text (required).

    Examples:
        Simple critic output:

        ```json
        {
            "score": 0.75,
            "feedback": "Good response but could be more concise."
        }
        ```

        Using with LlmAgent:

        ```python
        from google.adk.agents import LlmAgent
        from gepa_adk.adapters.scoring.critic_scorer import SimpleCriticOutput

        critic = LlmAgent(
            name="simple_critic",
            model="gemini-2.5-flash",
            instruction=SIMPLE_CRITIC_INSTRUCTION,
            output_schema=SimpleCriticOutput,
        )
        ```

    Note:
        Applies to basic evaluation tasks where only a score and feedback
        are needed. For more detailed evaluations with dimension scores,
        use CriticOutput instead.

    See Also:
        [gepa_adk.adapters.scoring.critic_scorer.CriticOutput][]:
            Advanced schema with dimension scores and guidance.
    """

    score: float = Field(
        ...,
        ge=0.0,
        le=1.0,
        description="Score from 0.0 to 1.0",
    )
    feedback: str = Field(
        ...,
        description="Human-readable feedback explaining the score",
    )

create_critic

create_critic(
    name: str, *, model: str | None = None
) -> LlmAgent

Create a pre-configured critic agent by preset name.

PARAMETER DESCRIPTION
name

Preset name. Must be a key in _PRESET_INSTRUCTIONS.

TYPE: str

model

Optional model override. When None, ADK uses its default.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
LlmAgent

Configured LlmAgent with CriticOutput schema and preset instruction.

RAISES DESCRIPTION
ConfigurationError

If name is not a valid preset.

Source code in src/gepa_adk/adapters/scoring/critic_scorer.py
def create_critic(name: str, *, model: str | None = None) -> LlmAgent:
    """Create a pre-configured critic agent by preset name.

    Args:
        name: Preset name. Must be a key in ``_PRESET_INSTRUCTIONS``.
        model: Optional model override. When None, ADK uses its default.

    Returns:
        Configured LlmAgent with CriticOutput schema and preset instruction.

    Raises:
        ConfigurationError: If name is not a valid preset.
    """
    if name not in _PRESET_INSTRUCTIONS:
        valid_presets = ", ".join(sorted(_PRESET_INSTRUCTIONS))
        raise ConfigurationError(
            f"Unknown critic preset '{name}'. Valid presets: {valid_presets}",
            constraint=f"Must be one of: {valid_presets}",
            value=name,
            field="name",
        )

    model_kwargs: dict[str, Any] = {}
    if model is not None:
        model_kwargs["model"] = model

    return LlmAgent(
        name=f"{name}_critic",
        instruction=_PRESET_INSTRUCTIONS[name],
        output_schema=CriticOutput,
        **model_kwargs,
    )

normalize_feedback

normalize_feedback(
    score: float, metadata: dict[str, Any] | None
) -> dict[str, Any]

Normalize critic feedback to consistent trial format.

Converts both simple and advanced critic outputs to a standardized format for use in trial records. This enables the reflection agent to receive consistent feedback regardless of which critic schema was used.

PARAMETER DESCRIPTION
score

The numeric score from the critic (0.0-1.0).

TYPE: float

metadata

Optional metadata dict from critic output. May contain:
  • feedback (str): Simple feedback text
  • dimension_scores (dict): Per-dimension scores
  • actionable_guidance (str): Improvement suggestions
  • Any additional fields from critic output

TYPE: dict[str, Any] | None

RETURNS DESCRIPTION
dict[str, Any]

Normalized feedback dict with structure:

{
    "score": 0.75,
    "feedback_text": "Main feedback message",
    "dimension_scores": {...},  # Optional
    "actionable_guidance": "...",  # Optional
}

Examples:

Normalize simple feedback:

normalized = normalize_feedback(0.8, {"feedback": "Good job"})
# {"score": 0.8, "feedback_text": "Good job"}

Normalize advanced feedback:

normalized = normalize_feedback(
    0.6,
    {
        "feedback": "Needs work",
        "dimension_scores": {"clarity": 0.5},
        "actionable_guidance": "Add examples",
    },
)
# {
#     "score": 0.6,
#     "feedback_text": "Needs work",
#     "dimension_scores": {"clarity": 0.5},
#     "actionable_guidance": "Add examples",
# }

Handle missing feedback:

normalized = normalize_feedback(0.5, None)
# {"score": 0.5, "feedback_text": ""}
Note

Supports both SimpleCriticOutput and CriticOutput schemas for flexible critic integration. Extracts the "feedback" field and renames it to "feedback_text" for consistent trial structure. Additional fields like dimension_scores are preserved when present.

Source code in src/gepa_adk/adapters/scoring/critic_scorer.py
def normalize_feedback(
    score: float,
    metadata: dict[str, Any] | None,
) -> dict[str, Any]:
    """Normalize critic feedback to consistent trial format.

    Converts both simple and advanced critic outputs to a standardized
    format for use in trial records. This enables the reflection agent
    to receive consistent feedback regardless of which critic schema
    was used.

    Args:
        score: The numeric score from the critic (0.0-1.0).
        metadata: Optional metadata dict from critic output. May contain:
            - feedback (str): Simple feedback text
            - dimension_scores (dict): Per-dimension scores
            - actionable_guidance (str): Improvement suggestions
            - Any additional fields from critic output

    Returns:
        Normalized feedback dict with structure:
        ```python
        {
            "score": 0.75,
            "feedback_text": "Main feedback message",
            "dimension_scores": {...},  # Optional
            "actionable_guidance": "...",  # Optional
        }
        ```

    Examples:
        Normalize simple feedback:

        ```python
        normalized = normalize_feedback(0.8, {"feedback": "Good job"})
        # {"score": 0.8, "feedback_text": "Good job"}
        ```

        Normalize advanced feedback:

        ```python
        normalized = normalize_feedback(
            0.6,
            {
                "feedback": "Needs work",
                "dimension_scores": {"clarity": 0.5},
                "actionable_guidance": "Add examples",
            },
        )
        # {
        #     "score": 0.6,
        #     "feedback_text": "Needs work",
        #     "dimension_scores": {"clarity": 0.5},
        #     "actionable_guidance": "Add examples",
        # }
        ```

        Handle missing feedback:

        ```python
        normalized = normalize_feedback(0.5, None)
        # {"score": 0.5, "feedback_text": ""}
        ```

    Note:
        Supports both SimpleCriticOutput and CriticOutput schemas for flexible
        critic integration. Extracts the "feedback" field and renames it to
        "feedback_text" for consistent trial structure. Additional fields
        like dimension_scores are preserved when present.
    """
    result: dict[str, Any] = {"score": score}

    if metadata is None:
        result["feedback_text"] = ""
        return result

    # Extract feedback text - handle both "feedback" and "feedback_text" keys
    feedback_text = metadata.get("feedback_text") or metadata.get("feedback") or ""
    if isinstance(feedback_text, str) and feedback_text.strip():
        result["feedback_text"] = feedback_text.strip()
    else:
        result["feedback_text"] = ""

    # Preserve dimension_scores if present
    dimension_scores = metadata.get("dimension_scores")
    if dimension_scores and isinstance(dimension_scores, dict):
        result["dimension_scores"] = dimension_scores

    # Preserve actionable_guidance if present
    actionable_guidance = metadata.get("actionable_guidance")
    if actionable_guidance and isinstance(actionable_guidance, str):
        guidance_str = actionable_guidance.strip()
        if guidance_str:
            result["actionable_guidance"] = guidance_str

    return result