# ADR-005: Three-Layer Testing Strategy

- Status: Accepted
- Date: 2026-01-10
- Deciders: gepa-adk maintainers
## Context
gepa-adk has multiple testing concerns:
- Protocol compliance: Do adapters correctly implement port interfaces?
- Business logic: Does the evolution engine work correctly in isolation?
- End-to-end: Does evolution actually improve agents with real ADK/LLM calls?
We need a testing strategy that balances:

- Fast feedback during development
- Confidence that real integrations work
- A maintainable test suite
## Decision
Adopt a three-layer testing strategy aligned with hexagonal architecture:
```
┌─────────────────────────────────────────────────────────┐
│ Contract Tests (tests/contracts/)                       │
│ • Verify protocols are correctly defined                │
│ • Ensure adapters implement ports                       │
│ • Mock ADK for speed                                    │
│ • Run on every commit                                   │
└─────────────────────────────────────────────────────────┘
                             │
┌─────────────────────────────────────────────────────────┐
│ Integration Tests (tests/integration/)                  │
│ • End-to-end evolution with real ADK agents             │
│ • Real LLM calls (marked @pytest.mark.slow)             │
│ • Verify async concurrency works                        │
│ • Run in CI, skip locally by default                    │
└─────────────────────────────────────────────────────────┘
                             │
┌─────────────────────────────────────────────────────────┐
│ Unit Tests (tests/unit/)                                │
│ • Engine logic with mock adapter                        │
│ • State guard, parsing utilities                        │
│ • No I/O, fastest execution                             │
│ • Run on every save (watch mode)                        │
└─────────────────────────────────────────────────────────┘
```
### Test Directory Structure
```
tests/
├── conftest.py                          # Shared fixtures and test utilities
├── contracts/
│   ├── test_adapter_protocol.py         # AsyncGEPAAdapter compliance
│   ├── test_scorer_protocol.py          # Scorer compliance
│   └── test_agent_provider_protocol.py
├── integration/
│   ├── conftest.py                      # Real ADK fixtures
│   ├── test_adk_evolution.py            # End-to-end evolution
│   ├── test_concurrent_evaluation.py
│   └── test_multi_agent.py
└── unit/
    ├── test_engine.py                   # AsyncGEPAEngine
    ├── test_proposer.py                 # Mutation proposer
    ├── test_state_guard.py              # State key preservation
    └── test_parsing.py                  # JSON/YAML utilities
```
### Layer Details
#### Contract Tests
Verify that adapters implement port protocols correctly:
```python
# tests/contracts/test_adapter_protocol.py
import inspect

from gepa_adk.adapters import ADKAdapter


def test_adk_adapter_implements_protocol():
    """ADKAdapter must implement the AsyncGEPAAdapter protocol."""
    # Note: actually instantiating ADKAdapter requires mocked dependencies,
    # so we check the class surface instead.
    assert hasattr(ADKAdapter, "evaluate")
    assert hasattr(ADKAdapter, "make_reflective_dataset")
    assert hasattr(ADKAdapter, "propose_new_texts")


def test_protocol_methods_are_async():
    """All adapter methods must be coroutines."""
    assert inspect.iscoroutinefunction(ADKAdapter.evaluate)
    assert inspect.iscoroutinefunction(ADKAdapter.make_reflective_dataset)
    assert inspect.iscoroutinefunction(ADKAdapter.propose_new_texts)
```
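If the port protocols are decorated with `@runtime_checkable` (an assumption about `gepa_adk.ports`, not confirmed here), compliance can also be asserted structurally with `isinstance`. A self-contained sketch using illustrative stand-in names:

```python
# Sketch of the runtime_checkable variant. AsyncAdapterPort and StubAdapter
# are illustrative stand-ins, not gepa_adk names.
from typing import Protocol, runtime_checkable


@runtime_checkable
class AsyncAdapterPort(Protocol):
    async def evaluate(self, batch): ...
    async def make_reflective_dataset(self, batch): ...
    async def propose_new_texts(self, dataset): ...


class StubAdapter:
    async def evaluate(self, batch): return batch
    async def make_reflective_dataset(self, batch): return []
    async def propose_new_texts(self, dataset): return {}


def test_stub_satisfies_port():
    # runtime_checkable isinstance checks method *names* only, not
    # signatures or async-ness — hence the explicit checks above.
    assert isinstance(StubAdapter(), AsyncAdapterPort)
```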
#### Unit Tests
Test core logic with mock adapters (no external dependencies):
```python
# tests/unit/test_engine.py
import pytest
from pytest_mock import MockerFixture

from gepa_adk.domain.models import EvaluationBatch
from gepa_adk.engine import AsyncGEPAEngine


@pytest.fixture
def mock_adapter(mocker: MockerFixture):
    """Mock adapter for unit tests — no ADK dependency."""
    adapter = mocker.AsyncMock()
    adapter.evaluate.return_value = EvaluationBatch(
        outputs=["output1", "output2"],
        scores=[0.8, 0.9],
        trajectories=[{}, {}],
    )
    adapter.propose_new_texts.return_value = {"instruction": "improved"}
    return adapter


@pytest.mark.asyncio
async def test_engine_runs_evolution_loop(mock_adapter):
    """Engine executes the evaluation → proposal → acceptance loop."""
    engine = AsyncGEPAEngine(adapter=mock_adapter, max_iterations=3)
    state = await engine.run()
    assert mock_adapter.evaluate.call_count >= 1
    assert state.iterations_completed > 0


@pytest.mark.asyncio
async def test_engine_accepts_improved_candidate(mock_adapter):
    """Engine accepts candidates with higher scores."""
    mock_adapter.evaluate.side_effect = [
        EvaluationBatch(outputs=["o"], scores=[0.5], trajectories=[{}]),
        EvaluationBatch(outputs=["o"], scores=[0.8], trajectories=[{}]),
    ]
    engine = AsyncGEPAEngine(adapter=mock_adapter, max_iterations=2)
    state = await engine.run()
    assert state.best_score >= 0.5
```
#### Shared Test Utilities (`tests/conftest.py`)

The root `conftest.py` provides reusable test utilities:
```python
from tests.conftest import MockExecutor, MockScorer


def test_with_scorer():
    # MockScorer implements the Scorer protocol for testing
    scorer = MockScorer(score_value=0.9)  # custom score value
    score, metadata = scorer.score("input", "output", "expected")
    assert score == 0.9
    assert scorer.score_calls == [("input", "output", "expected")]  # tracks calls


def test_with_executor():
    # MockExecutor implements AgentExecutorProtocol for testing;
    # executor.execute_count and executor.calls track usage.
    executor = MockExecutor()
    assert executor.execute_count == 0
```
Fixtures provided:

- `mock_scorer_factory` — factory for creating `MockScorer` with custom scores
- `mock_executor` — fresh `MockExecutor` instance per test
- `mock_proposer` — mock `AsyncReflectiveMutationProposer`
- `trainset_samples`, `valset_samples` — standard test datasets
- `deterministic_scores`, `deterministic_score_batch` — predictable score sequences
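For reference, a `MockScorer` along these lines takes only a few lines of code; a sketch, assuming the `(score, metadata)` two-tuple return shape shown in the usage example above:

```python
# Sketch of a MockScorer — assumes Scorer.score() returns a
# (score, metadata) tuple, as in the usage example above.
class MockScorer:
    def __init__(self, score_value: float = 1.0) -> None:
        self.score_value = score_value
        self.score_calls: list[tuple[str, str, str]] = []

    def score(self, input_text: str, output: str, expected: str) -> tuple[float, dict]:
        self.score_calls.append((input_text, output, expected))  # record for assertions
        return self.score_value, {}  # fixed score, empty metadata
```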
#### Integration Tests
Test real evolution with ADK agents (slow, requires API keys):
```python
# tests/integration/test_adk_evolution.py
import pytest
from google.adk.agents import LlmAgent
from pydantic import BaseModel

from gepa_adk import evolve


class CriticScore(BaseModel):
    """Structured critic output; LlmAgent.output_schema expects a Pydantic model."""

    score: float


@pytest.mark.slow
@pytest.mark.integration
@pytest.mark.asyncio
async def test_evolve_improves_instruction():
    """End-to-end: evolution improves the agent instruction."""
    agent = LlmAgent(
        name="test_agent",
        model="gemini-2.5-flash",
        instruction="Answer the question.",
    )
    critic = LlmAgent(
        name="critic",
        model="gemini-2.5-flash",
        instruction="Rate the answer quality from 0 to 1.",
        output_schema=CriticScore,
    )
    result = await evolve(
        agent=agent,
        trainset=[{"input": "What is 2+2?", "expected": "4"}],
        critic=critic,
        max_iterations=5,
    )
    assert result.final_score >= result.original_score
    assert result.evolved_instruction != agent.instruction
```
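Because this layer needs real credentials, the integration `conftest.py` can skip cleanly when they are absent instead of erroring; a sketch, assuming `GOOGLE_API_KEY` is the relevant variable:

```python
# tests/integration/conftest.py — skip rather than fail when no key is set.
# GOOGLE_API_KEY is an assumption; substitute whatever credential ADK expects.
import os

import pytest


@pytest.fixture(autouse=True)
def _require_api_key() -> None:
    if not os.environ.get("GOOGLE_API_KEY"):
        pytest.skip("integration tests require GOOGLE_API_KEY")
```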
### Test Markers
Configure pytest markers in pyproject.toml:
```toml
[tool.pytest.ini_options]
markers = [
    "unit: Fast, isolated unit tests (no I/O)",
    "contract: Interface compliance tests",
    "integration: Real ADK/LLM tests (requires API keys)",
    "slow: Tests taking >10s (LLM calls)",
]
# Skip slow/integration by default for fast local development
addopts = "-m 'not slow and not integration'"
asyncio_mode = "auto"
```
### Running Tests
```bash
# Fast feedback (unit + contract only; addopts filters out the rest)
uv run pytest

# All tests including integration (a later -m overrides the addopts filter)
uv run pytest -m ""

# Only integration tests
uv run pytest -m integration

# With coverage
uv run pytest --cov=src --cov-report=term-missing
```
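The unit layer's "run on every save" loop needs a file watcher, which pytest does not ship. One option is pytest-watcher (an assumption; any watcher works):

```bash
# Watch mode for the unit layer (assumes `uv add --dev pytest-watcher`);
# arguments after the watched path are forwarded to pytest.
uv run ptw . tests/unit
```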
### TDD Approach
Follow Test-Driven Development:
1. Write a failing test for the new feature
2. Implement the minimum code to pass
3. Refactor while keeping tests green
```python
# Example TDD cycle for the state guard

# Step 1: write a failing test
from gepa_adk import StateGuard  # import path illustrative


def test_state_guard_repairs_missing_token():
    guard = StateGuard(repair_missing=True)
    original = "Use {session_data} in your response"
    mutated = "Use the data in your response"  # token removed
    repaired = guard.repair(mutated, original)
    assert "{session_data}" in repaired


# Step 2: implement StateGuard.repair()
# Step 3: refactor if needed
```
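For step 2, an implementation satisfying that test can be very small. A minimal sketch, not necessarily how gepa-adk implements it, assuming state keys are `{name}` placeholders that a repair simply re-appends:

```python
# Illustrative StateGuard sketch — just enough to pass the test above.
import re

_TOKEN_RE = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")


class StateGuard:
    def __init__(self, repair_missing: bool = True) -> None:
        self.repair_missing = repair_missing

    def repair(self, mutated: str, original: str) -> str:
        """Re-append any {state_key} placeholders the mutation dropped."""
        missing = [t for t in _TOKEN_RE.findall(original) if t not in mutated]
        if self.repair_missing and missing:
            return mutated + "\n" + " ".join(missing)
        return mutated
```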
## Consequences

### Positive
- Fast feedback: Unit tests run in <1 second
- Confidence: Integration tests verify real behavior
- Clear separation: Each layer tests different concerns
- CI-friendly: Can run fast tests on every PR, slow tests on merge
- TDD support: Easy to write tests first with mock adapters
### Negative
- Mock maintenance: Mock adapters must stay in sync with real ones
- Integration test cost: Real LLM calls cost money and time
- Complexity: Three layers require understanding of when to use each
### Neutral
- Coverage targets: Aim for >90% on unit tests, lower acceptable for integration
- API key management: Integration tests need secure credential handling
## Alternatives Considered

### 1. Single Test Layer
Rejected: mixes fast and slow tests in one suite, making quick feedback hard to get.
### 2. Two Layers (Unit + Integration)
Rejected: without a contract layer, protocol violations are caught late, at integration time.
### 3. Property-Based Testing (Hypothesis)
```python
from hypothesis import given, strategies as st


@given(st.text(), st.floats(0, 1))
def test_scorer_returns_valid_score(input_text, expected_score): ...
```

Considered for the future: good for utility functions, overkill for the MVP.
## References
- pytest documentation
- pytest-asyncio
- Test Pyramid (Martin Fowler)
- ADR-001: Async-First Architecture (async testing patterns)
- ADR-002: Protocol for Interfaces (contract testing)
- ADR-009: Exception Hierarchy (exception testing)
- ADR Index - All architectural decisions