
Critic scorer

critic_scorer

CriticScorer adapter for structured scoring with ADK critic agents.

This module provides the CriticScorer implementation that wraps ADK critic agents to provide structured scoring with feedback, dimension scores, and actionable guidance. The scorer implements the Scorer protocol, enabling integration with gepa-adk's evaluation and evolution workflows.

Also provides KISS and advanced critic output schemas with generic instruction templates for rapid critic agent development.

ATTRIBUTE DESCRIPTION
CriticScorer

Adapter that wraps ADK critic agents for scoring.

TYPE: class

SimpleCriticOutput

KISS schema with just score + feedback.

TYPE: class

CriticOutput

Advanced schema with dimensions and guidance.

TYPE: class

SIMPLE_CRITIC_INSTRUCTION

Generic instruction for simple critics.

TYPE: str

ADVANCED_CRITIC_INSTRUCTION

Generic instruction for advanced critics.

TYPE: str

normalize_feedback

Normalizes critic output to trial format.

TYPE: function

Examples:

Basic usage with LlmAgent critic:

from google.adk.agents import LlmAgent
from gepa_adk.adapters.agent_executor import AgentExecutor
from gepa_adk.adapters.critic_scorer import CriticScorer, CriticOutput

critic = LlmAgent(
    name="quality_critic",
    model="gemini-2.5-flash",
    instruction="Evaluate response quality...",
    output_schema=CriticOutput,
)

executor = AgentExecutor()
scorer = CriticScorer(critic_agent=critic, executor=executor)
score, metadata = await scorer.async_score(
    input_text="What is Python?",
    output="Python is a programming language.",
)
Note

This module wraps ADK critic agents to provide structured scoring. When using LlmAgent with output_schema, the agent can ONLY reply and CANNOT use any tools (ADK constraint). For evaluations requiring tool usage, use a SequentialAgent with tool-enabled agents before the output-constrained scorer.
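
A minimal sketch of that pattern, assuming a tool-enabled gathering step ahead of the output-constrained judge (the agent names and the word_count tool are illustrative, not part of this module):

from google.adk.agents import LlmAgent, SequentialAgent
from gepa_adk.adapters.critic_scorer import CriticOutput


def word_count(text: str) -> dict:
    """Toy tool: report the word count of the text under review."""
    return {"word_count": len(text.split())}


# Tool-enabled agent runs first and may call tools freely.
evidence_agent = LlmAgent(
    name="evidence_gatherer",
    model="gemini-2.5-flash",
    instruction="Gather facts relevant to the response being evaluated.",
    tools=[word_count],
)

# Output-constrained judge: with output_schema set it can only reply.
judge_agent = LlmAgent(
    name="structured_judge",
    model="gemini-2.5-flash",
    instruction="Score the response using the gathered context.",
    output_schema=CriticOutput,
)

critic = SequentialAgent(
    name="tool_assisted_critic",
    sub_agents=[evidence_agent, judge_agent],
)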

SimpleCriticOutput

Bases: BaseModel



KISS schema for basic critic feedback.

This is the minimal schema for critic agents that only need to provide a score and text feedback. Use this for straightforward evaluation tasks where dimension breakdowns are not needed.

ATTRIBUTE DESCRIPTION
score

Score value between 0.0 and 1.0 (required).

TYPE: float

feedback

Human-readable feedback text (required).

TYPE: str

Examples:

Simple critic output:

{
    "score": 0.75,
    "feedback": "Good response but could be more concise."
}

Using with LlmAgent:

from google.adk.agents import LlmAgent
from gepa_adk.adapters.critic_scorer import SIMPLE_CRITIC_INSTRUCTION, SimpleCriticOutput

critic = LlmAgent(
    name="simple_critic",
    model="gemini-2.5-flash",
    instruction=SIMPLE_CRITIC_INSTRUCTION,
    output_schema=SimpleCriticOutput,
)
Note

Applies to basic evaluation tasks where only a score and feedback are needed. For more detailed evaluations with dimension scores, use CriticOutput instead.

See Also

CriticOutput: Advanced schema with dimension scores and guidance.

Source code in src/gepa_adk/adapters/critic_scorer.py
class SimpleCriticOutput(BaseModel):
    """KISS schema for basic critic feedback.

    This is the minimal schema for critic agents that only need to provide
    a score and text feedback. Use this for straightforward evaluation tasks
    where dimension breakdowns are not needed.

    Attributes:
        score: Score value between 0.0 and 1.0 (required).
        feedback: Human-readable feedback text (required).

    Examples:
        Simple critic output:

        ```json
        {
            "score": 0.75,
            "feedback": "Good response but could be more concise."
        }
        ```

        Using with LlmAgent:

        ```python
        from google.adk.agents import LlmAgent
        from gepa_adk.adapters.critic_scorer import SimpleCriticOutput

        critic = LlmAgent(
            name="simple_critic",
            model="gemini-2.5-flash",
            instruction=SIMPLE_CRITIC_INSTRUCTION,
            output_schema=SimpleCriticOutput,
        )
        ```

    Note:
        Applies to basic evaluation tasks where only a score and feedback
        are needed. For more detailed evaluations with dimension scores,
        use CriticOutput instead.

    See Also:
        CriticOutput: Advanced schema with dimension scores and guidance.
    """

    score: float = Field(
        ...,
        ge=0.0,
        le=1.0,
        description="Score from 0.0 to 1.0",
    )
    feedback: str = Field(
        ...,
        description="Human-readable feedback explaining the score",
    )

CriticOutput

Bases: BaseModel



Advanced schema for structured critic feedback with dimensions.

This schema defines the expected JSON structure that critic agents should return when configured with output_schema. The score field is required, while other fields are optional and will be preserved in metadata.

ATTRIBUTE DESCRIPTION
score

Score value between 0.0 and 1.0 (required).

TYPE: float

feedback

Human-readable feedback text (optional).

TYPE: str

dimension_scores

Per-dimension evaluation scores (optional).

TYPE: dict[str, float]

actionable_guidance

Specific improvement suggestions (optional).

TYPE: str

Examples:

Advanced critic output:

{
    "score": 0.75,
    "feedback": "Good response but could be more concise",
    "dimension_scores": {
        "accuracy": 0.9,
        "clarity": 0.6,
        "completeness": 0.8
    },
    "actionable_guidance": "Reduce response length by 30%"
}
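
Since CriticOutput is a Pydantic model, the same JSON can also be validated directly; a minimal sketch, assuming Pydantic v2 (model_validate_json):

from gepa_adk.adapters.critic_scorer import CriticOutput

raw = '{"score": 0.75, "feedback": "Good response but could be more concise"}'
parsed = CriticOutput.model_validate_json(raw)

assert parsed.score == 0.75
assert parsed.dimension_scores == {}  # optional fields default to empty values
# A score outside [0.0, 1.0] would fail the ge/le constraints with a ValidationError.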
Note

All critic agents using this schema must return structured JSON. When this schema is used as output_schema on an LlmAgent, the agent can ONLY reply and CANNOT use any tools. This is acceptable for critic agents focused on scoring.

See Also

SimpleCriticOutput: KISS schema with just score + feedback.

Source code in src/gepa_adk/adapters/critic_scorer.py
class CriticOutput(BaseModel):
    """Advanced schema for structured critic feedback with dimensions.

    This schema defines the expected JSON structure that critic agents
    should return when configured with output_schema. The score field is
    required, while other fields are optional and will be preserved in
    metadata.

    Attributes:
        score: Score value between 0.0 and 1.0 (required).
        feedback: Human-readable feedback text (optional).
        dimension_scores: Per-dimension evaluation scores (optional).
        actionable_guidance: Specific improvement suggestions (optional).

    Examples:
        Advanced critic output:

        ```json
        {
            "score": 0.75,
            "feedback": "Good response but could be more concise",
            "dimension_scores": {
                "accuracy": 0.9,
                "clarity": 0.6,
                "completeness": 0.8
            },
            "actionable_guidance": "Reduce response length by 30%"
        }
        ```

    Note:
        All critic agents using this schema must return structured JSON.
        When this schema is used as output_schema on an LlmAgent, the
        agent can ONLY reply and CANNOT use any tools. This is acceptable
        for critic agents focused on scoring.

    See Also:
        SimpleCriticOutput: KISS schema with just score + feedback.
    """

    score: float = Field(
        ...,
        ge=0.0,
        le=1.0,
        description="Score from 0.0 to 1.0",
    )
    feedback: str = Field(
        default="",
        description="Human-readable feedback",
    )
    dimension_scores: dict[str, float] = Field(
        default_factory=dict,
        description="Per-dimension scores",
    )
    actionable_guidance: str = Field(
        default="",
        description="Improvement suggestions",
    )

CriticScorer

Adapter that wraps ADK critic agents to provide structured scoring.

CriticScorer implements the Scorer protocol, enabling integration with gepa-adk's evaluation and evolution workflows. It executes ADK critic agents (LlmAgent, SequentialAgent, etc.) and extracts structured scores with metadata from their outputs.

ATTRIBUTE DESCRIPTION
critic_agent

ADK agent configured for evaluation.

TYPE: BaseAgent

_session_service

Session service for state management.

TYPE: BaseSessionService

_app_name

Application name for session identification.

TYPE: str

_logger

Bound logger with scorer context.

TYPE: BoundLogger

Examples:

Basic usage:

from google.adk.agents import LlmAgent
from gepa_adk.adapters.critic_scorer import CriticScorer, CriticOutput
from gepa_adk.adapters.agent_executor import AgentExecutor

critic = LlmAgent(
    name="quality_critic",
    model="gemini-2.5-flash",
    instruction="Evaluate response quality...",
    output_schema=CriticOutput,
)

executor = AgentExecutor()
scorer = CriticScorer(critic_agent=critic, executor=executor)
score, metadata = await scorer.async_score(
    input_text="What is Python?",
    output="Python is a programming language.",
)
Note

Adapter wraps ADK critic agents to provide structured scoring. Implements Scorer protocol for compatibility with evolution engine. Creates isolated sessions per scoring call unless session_id provided.
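
A hedged sketch of defensive scoring around the documented failure modes (the import path for the exceptions and the 0.0 fallback are assumptions):

from gepa_adk.adapters.critic_scorer import (
    CriticOutputParseError,
    CriticScorer,
    MissingScoreFieldError,
)


async def safe_score(scorer: CriticScorer, input_text: str, output: str) -> float:
    """Return the critic score, falling back to 0.0 on malformed critic output."""
    try:
        score, _metadata = await scorer.async_score(input_text=input_text, output=output)
        return score
    except (CriticOutputParseError, MissingScoreFieldError):
        # Malformed critic output: treat as lowest score instead of failing the run.
        return 0.0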

Source code in src/gepa_adk/adapters/critic_scorer.py
class CriticScorer:
    """Adapter that wraps ADK critic agents to provide structured scoring.

    CriticScorer implements the Scorer protocol, enabling integration with
    gepa-adk's evaluation and evolution workflows. It executes ADK critic
    agents (LlmAgent, SequentialAgent, etc.) and extracts structured scores
    with metadata from their outputs.

    Attributes:
        critic_agent (BaseAgent): ADK agent configured for evaluation.
        _session_service (BaseSessionService): Session service for state
            management.
        _app_name (str): Application name for session identification.
        _logger (structlog.BoundLogger): Bound logger with scorer context.

    Examples:
        Basic usage:

        ```python
        from google.adk.agents import LlmAgent
        from gepa_adk.adapters.critic_scorer import CriticScorer, CriticOutput
        from gepa_adk.adapters.agent_executor import AgentExecutor

        critic = LlmAgent(
            name="quality_critic",
            model="gemini-2.5-flash",
            instruction="Evaluate response quality...",
            output_schema=CriticOutput,
        )

        executor = AgentExecutor()
        scorer = CriticScorer(critic_agent=critic, executor=executor)
        score, metadata = await scorer.async_score(
            input_text="What is Python?",
            output="Python is a programming language.",
        )
        ```

    Note:
        Adapter wraps ADK critic agents to provide structured scoring.
        Implements Scorer protocol for compatibility with evolution engine.
        Creates isolated sessions per scoring call unless session_id provided.
    """

    def __init__(
        self,
        critic_agent: BaseAgent,
        executor: AgentExecutorProtocol,
        session_service: BaseSessionService | None = None,
        app_name: str = "critic_scorer",
    ) -> None:
        """Initialize CriticScorer with critic agent.

        Args:
            critic_agent: ADK agent (LlmAgent or workflow agent) configured
                for evaluation.
            executor: AgentExecutorProtocol implementation for unified agent
                execution. Handles session management and execution, enabling
                feature parity across all agent types.
            session_service: Optional session service for state management.
                If None, creates an InMemorySessionService.
            app_name: Application name for session identification.

        Raises:
            TypeError: If critic_agent is not a BaseAgent instance.
            ValueError: If app_name is empty string.

        Examples:
            Basic setup with executor:

            ```python
            from gepa_adk.adapters.agent_executor import AgentExecutor

            executor = AgentExecutor()
            scorer = CriticScorer(critic_agent=critic, executor=executor)
            ```

            With shared session service:

            ```python
            from google.adk.sessions import InMemorySessionService
            from gepa_adk.adapters.agent_executor import AgentExecutor

            session_service = InMemorySessionService()
            executor = AgentExecutor(session_service=session_service)
            scorer = CriticScorer(
                critic_agent=critic,
                executor=executor,
                session_service=session_service,
            )
            ```

        Note:
            Creates logger with scorer context and validates agent type.
        """
        if not isinstance(critic_agent, BaseAgent):
            raise TypeError(f"critic_agent must be BaseAgent, got {type(critic_agent)}")

        if not app_name or not app_name.strip():
            raise ValueError("app_name cannot be empty")

        self.critic_agent = critic_agent
        self._session_service = session_service or InMemorySessionService()
        self._app_name = app_name.strip()
        self._executor = executor

        # Bind logger with scorer context
        self._logger = logger.bind(
            scorer="CriticScorer",
            agent_name=self.critic_agent.name,
            app_name=self._app_name,
            uses_executor=True,  # Always true since executor is required
        )

        self._logger.info("scorer.initialized")

    def _format_critic_input(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
    ) -> str:
        """Format input for critic agent evaluation.

        Builds a prompt that presents the input query, agent output, and
        optionally the expected output for the critic to evaluate.

        Args:
            input_text: The original input provided to the agent being evaluated.
            output: The agent's generated output to score.
            expected: Optional expected/reference output for comparison.

        Returns:
            Formatted prompt string for the critic agent.

        Examples:
            Basic formatting:

            ```python
            prompt = scorer._format_critic_input(
                input_text="What is 2+2?",
                output="4",
                expected="4",
            )
            ```

        Note:
            Organizes input for critic evaluation with clearly labeled sections.
            Format is designed to give critic context for evaluation.
            Expected output is included only if provided.
        """
        parts = [
            "Input Query:",
            input_text,
            "",
            "Agent Output:",
            output,
        ]

        if expected is not None:
            parts.extend(
                [
                    "",
                    "Expected Output:",
                    expected,
                ]
            )

        parts.append("")
        parts.append(
            "Please evaluate the agent output and provide a score with feedback."
        )

        return "\n".join(parts)

    def _parse_critic_output(self, output_text: str) -> tuple[float, dict[str, Any]]:
        """Parse critic agent output and extract score with metadata.

        Parses the critic's output text as JSON and extracts the score field
        along with optional metadata (feedback, dimension_scores,
        actionable_guidance, and any additional fields).

        Args:
            output_text: Raw text output from critic agent.

        Returns:
            Tuple of (score, metadata) where:
            - score: Float value extracted from output
            - metadata: Dict containing feedback, dimension_scores,
                actionable_guidance, and any additional fields

        Raises:
            CriticOutputParseError: If output cannot be parsed as JSON.
            MissingScoreFieldError: If parsed JSON lacks required score field.

        Examples:
            Parse structured output:

            ```python
            output = '{"score": 0.75, "feedback": "Good", "dimension_scores": {"accuracy": 0.9}}'
            score, metadata = scorer._parse_critic_output(output)
            assert score == 0.75
            assert metadata["feedback"] == "Good"
            ```

        Note:
            Obtains score and metadata from critic JSON output with validation.
            Preserves all fields from parsed JSON in metadata, not just
            the known CriticOutput schema fields. This allows for extensibility.
        """
        # Parse JSON output
        try:
            parsed = json.loads(output_text)
        except json.JSONDecodeError as e:
            raise CriticOutputParseError(
                f"Critic output is not valid JSON: {e}",
                raw_output=output_text,
                parse_error=str(e),
                cause=e,
            ) from e

        # Validate parsed output is a dict
        if not isinstance(parsed, dict):
            raise CriticOutputParseError(
                f"Critic output must be a JSON object, got {type(parsed).__name__}",
                raw_output=output_text,
                parse_error="Not a JSON object",
            )

        # Extract required score field
        if "score" not in parsed:
            raise MissingScoreFieldError(
                "Critic output missing required 'score' field",
                parsed_output=parsed,
            )

        score = parsed["score"]
        if not isinstance(score, (int, float)):
            raise MissingScoreFieldError(
                f"Score field must be numeric, got {type(score).__name__}",
                parsed_output=parsed,
            )

        # Build metadata dict with known fields and any additional fields
        metadata: dict[str, Any] = {}

        # Extract known fields if present
        if "feedback" in parsed:
            metadata["feedback"] = str(parsed["feedback"])
        if "dimension_scores" in parsed:
            # Preserve dimension_scores as-is (may contain non-numeric values)
            metadata["dimension_scores"] = parsed["dimension_scores"]
        if "actionable_guidance" in parsed:
            metadata["actionable_guidance"] = str(parsed["actionable_guidance"])

        # Preserve any additional fields
        known_fields = {"score", "feedback", "dimension_scores", "actionable_guidance"}
        for key, value in parsed.items():
            if key not in known_fields:
                metadata[key] = value

        return float(score), metadata

    def _extract_json_from_text(self, text: str) -> str:
        """Extract JSON from text that may contain markdown code blocks.

        Minimal implementation - tries direct parse and markdown extraction.
        A more robust implementation will be added per GitHub issue #78.

        Args:
            text: Text that may contain JSON.

        Returns:
            Extracted JSON string, or original text if extraction fails.

        Note:
            Operates as a minimal JSON extractor; robust implementation planned
            per GitHub issue #78.
        """
        # Try parsing the entire text as-is
        try:
            json.loads(text.strip())
            return text.strip()
        except json.JSONDecodeError:
            pass

        # Extract from markdown code blocks (```json ... ``` or ``` ... ```)
        json_block_pattern = r"```(?:json)?\s*\n?(.*?)\n?```"
        matches = re.findall(json_block_pattern, text, re.DOTALL | re.IGNORECASE)
        for match in matches:
            try:
                json.loads(match.strip())
                return match.strip()
            except json.JSONDecodeError:
                continue

        # Try to find JSON object embedded in text (minimal regex for { ... })
        # Look for opening brace and try to find matching closing brace
        brace_start = text.find("{")
        if brace_start != -1:
            # Try to find the matching closing brace
            # NOTE: This algorithm doesn't account for braces within string literals
            # (e.g., JSON with template strings like "instruction": "Use {variable}").
            # This is a minimal implementation; a more robust parser will be added
            # per GitHub issue #78.
            depth = 0
            for i in range(brace_start, len(text)):
                if text[i] == "{":
                    depth += 1
                elif text[i] == "}":
                    depth -= 1
                    if depth == 0:
                        candidate = text[brace_start : i + 1]
                        try:
                            json.loads(candidate)
                            return candidate
                        except json.JSONDecodeError:
                            break

        # Return original text (will fail with clear error message)
        return text

    async def async_score(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
        session_id: str | None = None,
    ) -> tuple[float, dict[str, Any]]:
        """Score an agent output asynchronously using the critic agent.

        Executes the critic agent with formatted input and extracts structured
        score and metadata from the response.

        Args:
            input_text: The original input provided to the agent being evaluated.
            output: The agent's generated output to score.
            expected: Optional expected/reference output for comparison.
            session_id: Optional session ID to share state with main agent
                workflow. If None, creates an isolated session.

        Returns:
            Tuple of (score, metadata) where:
            - score: Float value, conventionally 0.0-1.0
            - metadata: Dict with feedback, dimension_scores,
                actionable_guidance, and any additional fields

        Raises:
            CriticOutputParseError: If critic output is not valid JSON.
            MissingScoreFieldError: If score field missing from output.

        Examples:
            Basic async scoring:

            ```python
            score, metadata = await scorer.async_score(
                input_text="What is Python?",
                output="Python is a programming language.",
            )
            ```

            With session sharing:

            ```python
            score, metadata = await scorer.async_score(
                input_text="...",
                output="...",
                session_id="existing_session_123",
            )
            ```

        Note:
            Orchestrates critic agent execution via AgentExecutor and extracts
            structured output. Creates isolated session unless session_id provided
            for state sharing.
        """
        self._logger.debug(
            "scorer.async_score.start",
            input_preview=input_text[:50] if input_text else "",
            output_preview=output[:50] if output else "",
            has_expected=expected is not None,
            session_id=session_id,
        )

        # Format input for critic
        critic_input = self._format_critic_input(input_text, output, expected)

        # Execute via AgentExecutor
        result = await self._executor.execute_agent(
            agent=self.critic_agent,
            input_text=critic_input,
            existing_session_id=session_id,
        )

        if result.status == ExecutionStatus.FAILED:
            raise ScoringError(
                f"Critic agent execution failed: {result.error_message}",
            )

        final_output = result.extracted_value

        if not final_output:
            raise ScoringError("Critic agent returned empty output")

        # Parse output and extract score
        try:
            score, metadata = self._parse_critic_output(final_output)
        except (CriticOutputParseError, MissingScoreFieldError) as e:
            self._logger.error(
                "scorer.async_score.parse_error",
                error=str(e),
                error_type=type(e).__name__,
            )
            raise

        # Log multi-dimensional scoring context if present
        log_context: dict[str, Any] = {
            "score": score,
            "has_feedback": "feedback" in metadata,
            "has_dimension_scores": "dimension_scores" in metadata,
            "has_actionable_guidance": "actionable_guidance" in metadata,
        }
        if "dimension_scores" in metadata:
            log_context["dimension_count"] = len(metadata["dimension_scores"])

        self._logger.info(
            "scorer.async_score.complete",
            **log_context,
        )

        return score, metadata

    def score(
        self,
        input_text: str,
        output: str,
        expected: str | None = None,
    ) -> tuple[float, dict[str, Any]]:
        """Score an agent output synchronously using the critic agent.

        Synchronous wrapper around async_score() using asyncio.run().

        Args:
            input_text: The original input provided to the agent being evaluated.
            output: The agent's generated output to score.
            expected: Optional expected/reference output for comparison.

        Returns:
            Tuple of (score, metadata) where:
            - score: Float value, conventionally 0.0-1.0
            - metadata: Dict with feedback, dimension_scores,
                actionable_guidance, and any additional fields

        Raises:
            CriticOutputParseError: If critic output is not valid JSON.
            MissingScoreFieldError: If score field missing from output.

        Examples:
            Basic sync scoring:

            ```python
            score, metadata = scorer.score(
                input_text="What is 2+2?",
                output="4",
                expected="4",
            )
            ```

        Note:
            Operates synchronously by wrapping async_score() with asyncio.run().
            Uses asyncio.run() to execute async_score(). Prefer async_score()
            for better performance in async contexts.
        """
        return asyncio.run(self.async_score(input_text, output, expected))

__init__

__init__(
    critic_agent: BaseAgent,
    executor: AgentExecutorProtocol,
    session_service: BaseSessionService | None = None,
    app_name: str = "critic_scorer",
) -> None

Initialize CriticScorer with critic agent.

PARAMETER DESCRIPTION
critic_agent

ADK agent (LlmAgent or workflow agent) configured for evaluation.

TYPE: BaseAgent

executor

AgentExecutorProtocol implementation for unified agent execution. Handles session management and execution, enabling feature parity across all agent types.

TYPE: AgentExecutorProtocol

session_service

Optional session service for state management. If None, creates an InMemorySessionService.

TYPE: BaseSessionService | None DEFAULT: None

app_name

Application name for session identification.

TYPE: str DEFAULT: 'critic_scorer'

RAISES DESCRIPTION
TypeError

If critic_agent is not a BaseAgent instance.

ValueError

If app_name is empty string.

Examples:

Basic setup with executor:

from gepa_adk.adapters.agent_executor import AgentExecutor

executor = AgentExecutor()
scorer = CriticScorer(critic_agent=critic, executor=executor)

With shared session service:

from google.adk.sessions import InMemorySessionService
from gepa_adk.adapters.agent_executor import AgentExecutor

session_service = InMemorySessionService()
executor = AgentExecutor(session_service=session_service)
scorer = CriticScorer(
    critic_agent=critic,
    executor=executor,
    session_service=session_service,
)
Note

Creates logger with scorer context and validates agent type.
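
A minimal sketch of the constructor validation described above (both calls raise; critic is a valid agent as in the earlier examples):

from gepa_adk.adapters.agent_executor import AgentExecutor
from gepa_adk.adapters.critic_scorer import CriticScorer

executor = AgentExecutor()

try:
    CriticScorer(critic_agent="not an agent", executor=executor)  # type: ignore[arg-type]
except TypeError as exc:
    print(exc)  # critic_agent must be BaseAgent, got <class 'str'>

try:
    CriticScorer(critic_agent=critic, executor=executor, app_name="   ")
except ValueError as exc:
    print(exc)  # app_name cannot be empty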

Source code in src/gepa_adk/adapters/critic_scorer.py
def __init__(
    self,
    critic_agent: BaseAgent,
    executor: AgentExecutorProtocol,
    session_service: BaseSessionService | None = None,
    app_name: str = "critic_scorer",
) -> None:
    """Initialize CriticScorer with critic agent.

    Args:
        critic_agent: ADK agent (LlmAgent or workflow agent) configured
            for evaluation.
        executor: AgentExecutorProtocol implementation for unified agent
            execution. Handles session management and execution, enabling
            feature parity across all agent types.
        session_service: Optional session service for state management.
            If None, creates an InMemorySessionService.
        app_name: Application name for session identification.

    Raises:
        TypeError: If critic_agent is not a BaseAgent instance.
        ValueError: If app_name is empty string.

    Examples:
        Basic setup with executor:

        ```python
        from gepa_adk.adapters.agent_executor import AgentExecutor

        executor = AgentExecutor()
        scorer = CriticScorer(critic_agent=critic, executor=executor)
        ```

        With shared session service:

        ```python
        from google.adk.sessions import InMemorySessionService
        from gepa_adk.adapters.agent_executor import AgentExecutor

        session_service = InMemorySessionService()
        executor = AgentExecutor(session_service=session_service)
        scorer = CriticScorer(
            critic_agent=critic,
            executor=executor,
            session_service=session_service,
        )
        ```

    Note:
        Creates logger with scorer context and validates agent type.
    """
    if not isinstance(critic_agent, BaseAgent):
        raise TypeError(f"critic_agent must be BaseAgent, got {type(critic_agent)}")

    if not app_name or not app_name.strip():
        raise ValueError("app_name cannot be empty")

    self.critic_agent = critic_agent
    self._session_service = session_service or InMemorySessionService()
    self._app_name = app_name.strip()
    self._executor = executor

    # Bind logger with scorer context
    self._logger = logger.bind(
        scorer="CriticScorer",
        agent_name=self.critic_agent.name,
        app_name=self._app_name,
        uses_executor=True,  # Always true since executor is required
    )

    self._logger.info("scorer.initialized")

async_score async

async_score(
    input_text: str,
    output: str,
    expected: str | None = None,
    session_id: str | None = None,
) -> tuple[float, dict[str, Any]]

Score an agent output asynchronously using the critic agent.

Executes the critic agent with formatted input and extracts structured score and metadata from the response.

PARAMETER DESCRIPTION
input_text

The original input provided to the agent being evaluated.

TYPE: str

output

The agent's generated output to score.

TYPE: str

expected

Optional expected/reference output for comparison.

TYPE: str | None DEFAULT: None

session_id

Optional session ID to share state with main agent workflow. If None, creates an isolated session.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
tuple[float, dict[str, Any]]

Tuple of (score, metadata) where:

  • score: Float value, conventionally 0.0-1.0
  • metadata: Dict with feedback, dimension_scores, actionable_guidance, and any additional fields

RAISES DESCRIPTION
CriticOutputParseError

If critic output is not valid JSON.

MissingScoreFieldError

If score field missing from output.

Examples:

Basic async scoring:

score, metadata = await scorer.async_score(
    input_text="What is Python?",
    output="Python is a programming language.",
)

With session sharing:

score, metadata = await scorer.async_score(
    input_text="...",
    output="...",
    session_id="existing_session_123",
)
Note

Orchestrates critic agent execution via AgentExecutor and extracts structured output. Creates isolated session unless session_id provided for state sharing.

Source code in src/gepa_adk/adapters/critic_scorer.py
async def async_score(
    self,
    input_text: str,
    output: str,
    expected: str | None = None,
    session_id: str | None = None,
) -> tuple[float, dict[str, Any]]:
    """Score an agent output asynchronously using the critic agent.

    Executes the critic agent with formatted input and extracts structured
    score and metadata from the response.

    Args:
        input_text: The original input provided to the agent being evaluated.
        output: The agent's generated output to score.
        expected: Optional expected/reference output for comparison.
        session_id: Optional session ID to share state with main agent
            workflow. If None, creates an isolated session.

    Returns:
        Tuple of (score, metadata) where:
        - score: Float value, conventionally 0.0-1.0
        - metadata: Dict with feedback, dimension_scores,
            actionable_guidance, and any additional fields

    Raises:
        CriticOutputParseError: If critic output is not valid JSON.
        MissingScoreFieldError: If score field missing from output.

    Examples:
        Basic async scoring:

        ```python
        score, metadata = await scorer.async_score(
            input_text="What is Python?",
            output="Python is a programming language.",
        )
        ```

        With session sharing:

        ```python
        score, metadata = await scorer.async_score(
            input_text="...",
            output="...",
            session_id="existing_session_123",
        )
        ```

    Note:
        Orchestrates critic agent execution via AgentExecutor and extracts
        structured output. Creates isolated session unless session_id provided
        for state sharing.
    """
    self._logger.debug(
        "scorer.async_score.start",
        input_preview=input_text[:50] if input_text else "",
        output_preview=output[:50] if output else "",
        has_expected=expected is not None,
        session_id=session_id,
    )

    # Format input for critic
    critic_input = self._format_critic_input(input_text, output, expected)

    # Execute via AgentExecutor
    result = await self._executor.execute_agent(
        agent=self.critic_agent,
        input_text=critic_input,
        existing_session_id=session_id,
    )

    if result.status == ExecutionStatus.FAILED:
        raise ScoringError(
            f"Critic agent execution failed: {result.error_message}",
        )

    final_output = result.extracted_value

    if not final_output:
        raise ScoringError("Critic agent returned empty output")

    # Parse output and extract score
    try:
        score, metadata = self._parse_critic_output(final_output)
    except (CriticOutputParseError, MissingScoreFieldError) as e:
        self._logger.error(
            "scorer.async_score.parse_error",
            error=str(e),
            error_type=type(e).__name__,
        )
        raise

    # Log multi-dimensional scoring context if present
    log_context: dict[str, Any] = {
        "score": score,
        "has_feedback": "feedback" in metadata,
        "has_dimension_scores": "dimension_scores" in metadata,
        "has_actionable_guidance": "actionable_guidance" in metadata,
    }
    if "dimension_scores" in metadata:
        log_context["dimension_count"] = len(metadata["dimension_scores"])

    self._logger.info(
        "scorer.async_score.complete",
        **log_context,
    )

    return score, metadata

score

score(
    input_text: str,
    output: str,
    expected: str | None = None,
) -> tuple[float, dict[str, Any]]

Score an agent output synchronously using the critic agent.

Synchronous wrapper around async_score() using asyncio.run().

PARAMETER DESCRIPTION
input_text

The original input provided to the agent being evaluated.

TYPE: str

output

The agent's generated output to score.

TYPE: str

expected

Optional expected/reference output for comparison.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
tuple[float, dict[str, Any]]

Tuple of (score, metadata) where:

  • score: Float value, conventionally 0.0-1.0
  • metadata: Dict with feedback, dimension_scores, actionable_guidance, and any additional fields

RAISES DESCRIPTION
CriticOutputParseError

If critic output is not valid JSON.

MissingScoreFieldError

If score field missing from output.

Examples:

Basic sync scoring:

score, metadata = scorer.score(
    input_text="What is 2+2?",
    output="4",
    expected="4",
)
Note

Wraps async_score() with asyncio.run(), so it must not be called from within an already running event loop. Prefer async_score() in async contexts for better performance.
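
A minimal sketch of the two call patterns, assuming scorer was constructed as shown earlier; calling score() inside a coroutine would raise RuntimeError because asyncio.run() cannot nest in a running loop:

import asyncio

# Sync context: score() drives the event loop itself.
score_value, metadata = scorer.score(input_text="What is 2+2?", output="4")


# Async context: await async_score() directly instead of calling score().
async def evaluate() -> float:
    value, _ = await scorer.async_score(input_text="What is 2+2?", output="4")
    return value


result = asyncio.run(evaluate())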

Source code in src/gepa_adk/adapters/critic_scorer.py
def score(
    self,
    input_text: str,
    output: str,
    expected: str | None = None,
) -> tuple[float, dict[str, Any]]:
    """Score an agent output synchronously using the critic agent.

    Synchronous wrapper around async_score() using asyncio.run().

    Args:
        input_text: The original input provided to the agent being evaluated.
        output: The agent's generated output to score.
        expected: Optional expected/reference output for comparison.

    Returns:
        Tuple of (score, metadata) where:
        - score: Float value, conventionally 0.0-1.0
        - metadata: Dict with feedback, dimension_scores,
            actionable_guidance, and any additional fields

    Raises:
        CriticOutputParseError: If critic output is not valid JSON.
        MissingScoreFieldError: If score field missing from output.

    Examples:
        Basic sync scoring:

        ```python
        score, metadata = scorer.score(
            input_text="What is 2+2?",
            output="4",
            expected="4",
        )
        ```

    Note:
        Operates synchronously by wrapping async_score() with asyncio.run().
        Uses asyncio.run() to execute async_score(). Prefer async_score()
        for better performance in async contexts.
    """
    return asyncio.run(self.async_score(input_text, output, expected))

normalize_feedback

normalize_feedback(
    score: float, metadata: dict[str, Any] | None
) -> dict[str, Any]

Normalize critic feedback to consistent trial format.

Converts both simple and advanced critic outputs to a standardized format for use in trial records. This enables the reflection agent to receive consistent feedback regardless of which critic schema was used.

PARAMETER DESCRIPTION
score

The numeric score from the critic (0.0-1.0).

TYPE: float

metadata

Optional metadata dict from critic output. May contain: - feedback (str): Simple feedback text - dimension_scores (dict): Per-dimension scores - actionable_guidance (str): Improvement suggestions - Any additional fields from critic output

TYPE: dict[str, Any] | None

RETURNS DESCRIPTION
dict[str, Any]

Normalized feedback dict with structure:

{
    "score": 0.75,
    "feedback_text": "Main feedback message",
    "dimension_scores": {...},  # Optional
    "actionable_guidance": "...",  # Optional
}

Examples:

Normalize simple feedback:

normalized = normalize_feedback(0.8, {"feedback": "Good job"})
# {"score": 0.8, "feedback_text": "Good job"}

Normalize advanced feedback:

normalized = normalize_feedback(
    0.6,
    {
        "feedback": "Needs work",
        "dimension_scores": {"clarity": 0.5},
        "actionable_guidance": "Add examples",
    },
)
# {
#     "score": 0.6,
#     "feedback_text": "Needs work",
#     "dimension_scores": {"clarity": 0.5},
#     "actionable_guidance": "Add examples",
# }

Handle missing feedback:

normalized = normalize_feedback(0.5, None)
# {"score": 0.5, "feedback_text": ""}
Note

Supports both SimpleCriticOutput and CriticOutput schemas for flexible critic integration. Extracts the "feedback" field and renames it to "feedback_text" for consistent trial structure. Additional fields like dimension_scores are preserved when present.
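
A short sketch tying the pieces together: score with the critic, then normalize the result before storing it in a trial record (the trial dict shape is illustrative):

from gepa_adk.adapters.critic_scorer import normalize_feedback

score, metadata = await scorer.async_score(
    input_text="What is Python?",
    output="Python is a programming language.",
)
trial = {"input": "What is Python?", **normalize_feedback(score, metadata)}
# trial always contains "score" and "feedback_text"; dimension_scores and
# actionable_guidance appear only when the critic provided them.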

Source code in src/gepa_adk/adapters/critic_scorer.py
def normalize_feedback(
    score: float,
    metadata: dict[str, Any] | None,
) -> dict[str, Any]:
    """Normalize critic feedback to consistent trial format.

    Converts both simple and advanced critic outputs to a standardized
    format for use in trial records. This enables the reflection agent
    to receive consistent feedback regardless of which critic schema
    was used.

    Args:
        score: The numeric score from the critic (0.0-1.0).
        metadata: Optional metadata dict from critic output. May contain:
            - feedback (str): Simple feedback text
            - dimension_scores (dict): Per-dimension scores
            - actionable_guidance (str): Improvement suggestions
            - Any additional fields from critic output

    Returns:
        Normalized feedback dict with structure:
        ```python
        {
            "score": 0.75,
            "feedback_text": "Main feedback message",
            "dimension_scores": {...},  # Optional
            "actionable_guidance": "...",  # Optional
        }
        ```

    Examples:
        Normalize simple feedback:

        ```python
        normalized = normalize_feedback(0.8, {"feedback": "Good job"})
        # {"score": 0.8, "feedback_text": "Good job"}
        ```

        Normalize advanced feedback:

        ```python
        normalized = normalize_feedback(
            0.6,
            {
                "feedback": "Needs work",
                "dimension_scores": {"clarity": 0.5},
                "actionable_guidance": "Add examples",
            },
        )
        # {
        #     "score": 0.6,
        #     "feedback_text": "Needs work",
        #     "dimension_scores": {"clarity": 0.5},
        #     "actionable_guidance": "Add examples",
        # }
        ```

        Handle missing feedback:

        ```python
        normalized = normalize_feedback(0.5, None)
        # {"score": 0.5, "feedback_text": ""}
        ```

    Note:
        Supports both SimpleCriticOutput and CriticOutput schemas for flexible
        critic integration. Extracts the "feedback" field and renames it to
        "feedback_text" for consistent trial structure. Additional fields
        like dimension_scores are preserved when present.
    """
    result: dict[str, Any] = {"score": score}

    if metadata is None:
        result["feedback_text"] = ""
        return result

    # Extract feedback text - handle both "feedback" and "feedback_text" keys
    feedback_text = metadata.get("feedback_text") or metadata.get("feedback") or ""
    if isinstance(feedback_text, str) and feedback_text.strip():
        result["feedback_text"] = feedback_text.strip()
    else:
        result["feedback_text"] = ""

    # Preserve dimension_scores if present
    dimension_scores = metadata.get("dimension_scores")
    if dimension_scores and isinstance(dimension_scores, dict):
        result["dimension_scores"] = dimension_scores

    # Preserve actionable_guidance if present
    actionable_guidance = metadata.get("actionable_guidance")
    if actionable_guidance and isinstance(actionable_guidance, str):
        guidance_str = actionable_guidance.strip()
        if guidance_str:
            result["actionable_guidance"] = guidance_str

    return result