Skip to content

Encoding

encoding

Cross-platform encoding support for structured logging.

This module provides the EncodingSafeProcessor, a structlog processor that sanitizes string values in event dictionaries to prevent UnicodeEncodeError exceptions on consoles with limited encoding support (e.g., Windows cp1252).

The processor uses a two-phase sanitization strategy: 1. Smart character replacements that preserve semantic meaning (e.g., smart quotes → regular quotes, em dash → double hyphen) 2. Fallback encode/decode with 'replace' error handler for any remaining unencodable characters

This approach ensures log output is always writable to the console without raising exceptions, while preserving as much of the original meaning as possible.

Example

Add to structlog processor chain before the renderer:

import structlog
from gepa_adk.utils.encoding import EncodingSafeProcessor

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        EncodingSafeProcessor(),  # Before renderer
        structlog.dev.ConsoleRenderer(),
    ],
    wrapper_class=structlog.BoundLogger,
    context_class=dict,
    logger_factory=structlog.PrintLoggerFactory(),
)
See Also
  • ADR-011: Cross-Platform Encoding for design decisions
  • ADR-008: Structured Logging for processor chain patterns
Note

This processor is designed to be transparent on UTF-8 consoles (macOS, Linux) while preventing crashes on cp1252 consoles (Windows).

EncodingSafeProcessor

Sanitize strings for console encoding compatibility.

A structlog processor that ensures all string values in the event dict can be safely written to the console regardless of its encoding (cp1252 on Windows, UTF-8 on macOS/Linux).

The sanitization strategy: 1. Apply smart character replacements (preserve meaning) 2. Encode to console encoding with 'replace' error handler 3. Decode back to string

This produces strings that are guaranteed to be writable to the console without raising UnicodeEncodeError.

ATTRIBUTE DESCRIPTION
REPLACEMENTS

Class constant mapping Unicode characters to ASCII equivalents that preserve semantic meaning.

TYPE: dict[str, str]

encoding

The target console encoding detected at initialization.

TYPE: str

sanitize_string

Sanitize a single string for console encoding.

TYPE: method

Examples:

Use as a structlog processor:

processor = EncodingSafeProcessor()
event = {"event": "User said \u2018hello\u2019"}
result = processor(None, "info", event)
# result["event"] == "User said 'hello'"
Note

All sanitization is idempotent - processing already-sanitized strings produces identical output.

Source code in src/gepa_adk/utils/encoding.py
class EncodingSafeProcessor:
    r"""Sanitize strings for console encoding compatibility.

    A structlog processor that ensures all string values in the event dict
    can be safely written to the console regardless of its encoding (cp1252
    on Windows, UTF-8 on macOS/Linux).

    The sanitization strategy:
    1. Apply smart character replacements (preserve meaning)
    2. Encode to console encoding with 'replace' error handler
    3. Decode back to string

    This produces strings that are guaranteed to be writable to the console
    without raising UnicodeEncodeError.

    Attributes:
        REPLACEMENTS (dict[str, str]): Class constant mapping Unicode characters
            to ASCII equivalents that preserve semantic meaning.
        encoding (str): The target console encoding detected at initialization.
        sanitize_string (method): Sanitize a single string for console encoding.

    Examples:
        Use as a structlog processor:

        ```python
        processor = EncodingSafeProcessor()
        event = {"event": "User said \u2018hello\u2019"}
        result = processor(None, "info", event)
        # result["event"] == "User said 'hello'"
        ```

    Note:
        All sanitization is idempotent - processing already-sanitized strings
        produces identical output.
    """

    # Smart character replacements (preserve meaning)
    # These map common Unicode characters that cause issues on cp1252
    # to their closest ASCII equivalents
    REPLACEMENTS: dict[str, str] = {
        "\u2018": "'",  # Left single quote → apostrophe
        "\u2019": "'",  # Right single quote → apostrophe
        "\u201c": '"',  # Left double quote → quotation mark
        "\u201d": '"',  # Right double quote → quotation mark
        "\u2011": "-",  # Non-breaking hyphen → hyphen-minus
        "\u2013": "-",  # En dash → hyphen-minus
        "\u2014": "--",  # Em dash → double hyphen
        "\u2026": "...",  # Horizontal ellipsis → three periods
        "\u00a0": " ",  # Non-breaking space → regular space
    }

    def __init__(self) -> None:
        """Initialize the processor with console encoding detection.

        Detects the console encoding from sys.stdout.encoding, falling back
        to UTF-8 if detection fails (e.g., when stdout is redirected or
        unavailable).

        Note:
            Console encoding is cached at initialization time to avoid
            repeated attribute lookups during log processing.
        """
        self.encoding: str = getattr(sys.stdout, "encoding", None) or "utf-8"

    def __call__(
        self,
        logger: Any,
        method_name: str,
        event_dict: MutableMapping[str, Any],
    ) -> MutableMapping[str, Any]:
        """Process event dictionary, sanitizing all string values.

        Implements the structlog processor protocol. Recursively sanitizes
        all string values in the event dictionary, including nested dicts
        and lists.

        Args:
            logger: The wrapped logger instance (unused but required by
                protocol).
            method_name: Name of the log method called (e.g., "info", "debug").
                Unused but required by protocol.
            event_dict: Mutable mapping of log event data to sanitize.

        Returns:
            The event_dict with all string values sanitized for console
            encoding compatibility.

        Note:
            Original event_dict is not modified; a new dict is returned
            with sanitized values.
        """
        return self._sanitize_dict(event_dict)

    def sanitize_string(self, s: str) -> str:
        """Sanitize a single string for console encoding.

        Applies smart character replacements first to preserve meaning,
        then uses encode/decode with 'replace' error handler for any
        remaining unencodable characters.

        Args:
            s (str): The string to sanitize.

        Returns:
            A string that is guaranteed to be encodable to the console
            encoding without raising UnicodeEncodeError.

        Examples:
            Sanitize a string with smart quotes:

            ```python
            processor = EncodingSafeProcessor()
            result = processor.sanitize_string("User said \u2018hello\u2019")
            assert result == "User said 'hello'"
            ```

        Note:
            Order of operations: smart replacements run first to preserve
            semantic meaning, then encode/decode handles remaining characters.
        """
        # Apply smart replacements first (preserve meaning)
        for char, replacement in self.REPLACEMENTS.items():
            s = s.replace(char, replacement)

        # Encode/decode with replace for any remaining unencodable chars
        # This handles any Unicode characters not in our explicit mapping
        return s.encode(self.encoding, errors="replace").decode(self.encoding)

    def _sanitize_value(self, value: Any) -> Any:
        """Recursively sanitize any value.

        Handles strings, dicts, lists, and tuples. Other types pass through
        unchanged.

        Args:
            value: The value to sanitize.

        Returns:
            The sanitized value, with the same type as the input (except
            strings may have different content).
        """
        if isinstance(value, str):
            return self.sanitize_string(value)
        elif isinstance(value, MutableMapping):
            return self._sanitize_dict(value)
        elif isinstance(value, (list, tuple)):
            # Preserve the original type (list or tuple)
            return type(value)(self._sanitize_value(v) for v in value)
        # Non-string, non-collection types pass through unchanged
        return value

    def _sanitize_dict(self, d: MutableMapping[str, Any]) -> MutableMapping[str, Any]:
        """Sanitize all values in a dictionary.

        Args:
            d: The dictionary or mutable mapping to sanitize.

        Returns:
            A new dictionary with all string values sanitized.
        """
        return {k: self._sanitize_value(v) for k, v in d.items()}

__init__

__init__() -> None

Initialize the processor with console encoding detection.

Detects the console encoding from sys.stdout.encoding, falling back to UTF-8 if detection fails (e.g., when stdout is redirected or unavailable).

Note

Console encoding is cached at initialization time to avoid repeated attribute lookups during log processing.

Source code in src/gepa_adk/utils/encoding.py
def __init__(self) -> None:
    """Initialize the processor with console encoding detection.

    Detects the console encoding from sys.stdout.encoding, falling back
    to UTF-8 if detection fails (e.g., when stdout is redirected or
    unavailable).

    Note:
        Console encoding is cached at initialization time to avoid
        repeated attribute lookups during log processing.
    """
    self.encoding: str = getattr(sys.stdout, "encoding", None) or "utf-8"

__call__

__call__(
    logger: Any,
    method_name: str,
    event_dict: MutableMapping[str, Any],
) -> MutableMapping[str, Any]

Process event dictionary, sanitizing all string values.

Implements the structlog processor protocol. Recursively sanitizes all string values in the event dictionary, including nested dicts and lists.

PARAMETER DESCRIPTION
logger

The wrapped logger instance (unused but required by protocol).

TYPE: Any

method_name

Name of the log method called (e.g., "info", "debug"). Unused but required by protocol.

TYPE: str

event_dict

Mutable mapping of log event data to sanitize.

TYPE: MutableMapping[str, Any]

RETURNS DESCRIPTION
MutableMapping[str, Any]

The event_dict with all string values sanitized for console

MutableMapping[str, Any]

encoding compatibility.

Note

Original event_dict is not modified; a new dict is returned with sanitized values.

Source code in src/gepa_adk/utils/encoding.py
def __call__(
    self,
    logger: Any,
    method_name: str,
    event_dict: MutableMapping[str, Any],
) -> MutableMapping[str, Any]:
    """Process event dictionary, sanitizing all string values.

    Implements the structlog processor protocol. Recursively sanitizes
    all string values in the event dictionary, including nested dicts
    and lists.

    Args:
        logger: The wrapped logger instance (unused but required by
            protocol).
        method_name: Name of the log method called (e.g., "info", "debug").
            Unused but required by protocol.
        event_dict: Mutable mapping of log event data to sanitize.

    Returns:
        The event_dict with all string values sanitized for console
        encoding compatibility.

    Note:
        Original event_dict is not modified; a new dict is returned
        with sanitized values.
    """
    return self._sanitize_dict(event_dict)

sanitize_string

sanitize_string(s: str) -> str

Sanitize a single string for console encoding.

Applies smart character replacements first to preserve meaning, then uses encode/decode with 'replace' error handler for any remaining unencodable characters.

PARAMETER DESCRIPTION
s

The string to sanitize.

TYPE: str

RETURNS DESCRIPTION
str

A string that is guaranteed to be encodable to the console

str

encoding without raising UnicodeEncodeError.

Examples:

Sanitize a string with smart quotes:

processor = EncodingSafeProcessor()
result = processor.sanitize_string("User said ‘hello’")
assert result == "User said 'hello'"
Note

Order of operations: smart replacements run first to preserve semantic meaning, then encode/decode handles remaining characters.

Source code in src/gepa_adk/utils/encoding.py
def sanitize_string(self, s: str) -> str:
    """Sanitize a single string for console encoding.

    Applies smart character replacements first to preserve meaning,
    then uses encode/decode with 'replace' error handler for any
    remaining unencodable characters.

    Args:
        s (str): The string to sanitize.

    Returns:
        A string that is guaranteed to be encodable to the console
        encoding without raising UnicodeEncodeError.

    Examples:
        Sanitize a string with smart quotes:

        ```python
        processor = EncodingSafeProcessor()
        result = processor.sanitize_string("User said \u2018hello\u2019")
        assert result == "User said 'hello'"
        ```

    Note:
        Order of operations: smart replacements run first to preserve
        semantic meaning, then encode/decode handles remaining characters.
    """
    # Apply smart replacements first (preserve meaning)
    for char, replacement in self.REPLACEMENTS.items():
        s = s.replace(char, replacement)

    # Encode/decode with replace for any remaining unencodable chars
    # This handles any Unicode characters not in our explicit mapping
    return s.encode(self.encoding, errors="replace").decode(self.encoding)