font_utils

Unified font family handling across all file types (image, office, PDF).

Implements a hybrid font selection strategy:
1. Determine the generic family (serif / sans-serif / monospace) from the source font name or PDF font flags.
2. Select a concrete font that supports the target language and belongs to the same generic family.
3. Fall back to the generic CSS family name when no concrete match is found.

This module is PySide6-free — it works headlessly for CLI / MCP / REST usage.

detect_script

detect_script(text)

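Detects the first non-Latin script family that appears in text.

Scans characters in order and returns as soon as one falls in a known non-Latin script range, so the result reflects the first such character, not a frequency count. Returns "latin" for ASCII / Latin-only text (including extended Latin for Vietnamese, Turkish, etc.).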

Parameters:
- text (str): The text to analyse.

Returns:
- str: A script family identifier (e.g. "latin", "cyrillic").

Source code in src/utils/font_utils.py
def detect_script(text: str) -> str:
    """Detects the dominant non-Latin script family from *text*.

    Scans characters until a non-Latin script is identified.  Returns
    ``"latin"`` for ASCII / Latin-only text (including extended Latin
    for Vietnamese, Turkish, etc.).

    Args:
        text: The text to analyse.

    Returns:
        A script family identifier (e.g. ``"latin"``, ``"cyrillic"``).
    """
    _latin_upper = 0x02FF
    for ch in text:
        cp = ord(ch)
        if cp <= _latin_upper:
            continue
        for lo, hi, family in _SCRIPT_RANGES:
            if lo <= cp <= hi:
                if family is not None:
                    return family
                break
    return SCRIPT_LATIN
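
The first-match behaviour is easier to see with a concrete table. The sketch below re-implements the same loop against a hypothetical `_SCRIPT_RANGES`; the real table is not shown in this reference, so the ranges and the family names `"japanese"`, `"cjk"`, and `"korean"` are assumptions made for illustration only.

```python
SCRIPT_LATIN = "latin"

# Hypothetical stand-in for the module's private table (an assumption).
# A family of None marks a range that is skipped as script-neutral.
_SCRIPT_RANGES = [
    (0x0300, 0x036F, None),        # combining diacritical marks
    (0x0400, 0x04FF, "cyrillic"),
    (0x1E00, 0x1EFF, None),        # Latin Extended Additional (Vietnamese)
    (0x3040, 0x30FF, "japanese"),  # hiragana and katakana
    (0x4E00, 0x9FFF, "cjk"),       # CJK unified ideographs
    (0xAC00, 0xD7AF, "korean"),    # hangul syllables
]


def detect_script(text: str) -> str:
    """Return the family of the first character in a non-Latin range."""
    for ch in text:
        cp = ord(ch)
        if cp <= 0x02FF:  # Basic Latin through Spacing Modifier Letters
            continue
        for lo, hi, family in _SCRIPT_RANGES:
            if lo <= cp <= hi:
                if family is not None:
                    return family
                break  # neutral range: move on to the next character
    return SCRIPT_LATIN


print(detect_script("Tiếng Việt"))  # Latin with Vietnamese diacritics
print(detect_script("Hello мир"))   # Latin prefix, then Cyrillic
print(detect_script("漢字とかな"))    # starts with an ideograph
```

With this table, the last call is classified by its first character (an ideograph) even though kana follow, which is worth keeping in mind for mixed-script input.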

classify_generic_family

classify_generic_family(*, font_name=None, font_flags=None)

Determines the generic CSS family from a source font.

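Uses two inputs (either or both may be provided):
- font_name: The font's family name (e.g. "Times New Roman").
- font_flags: PyMuPDF font flags (bit 3 = mono, bit 2 = serif).

When both are provided, font_name takes precedence if it matches a known monospace or serif name or pattern, since it is more specific than PyMuPDF's coarse 2-bit classification. If the name matches neither, font_flags decides; with no flags available, the result is "sans-serif".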

Parameters:
- font_name (str | None, default None): The source font family name.
- font_flags (int | None, default None): PyMuPDF span font flags.

Returns:
- str: One of "serif", "sans-serif", or "monospace".

Source code in src/utils/font_utils.py
def classify_generic_family(  # noqa: PLR0911
    *,
    font_name: str | None = None,
    font_flags: int | None = None,
) -> str:
    """Determines the generic CSS family from a source font.

    Uses two inputs (either or both may be provided):
    - ``font_name``: The font's family name (e.g. "Times New Roman").
    - ``font_flags``: PyMuPDF font flags (bit 3 = mono, bit 2 = serif).

    When both are provided, ``font_name`` takes precedence if it matches
    a known monospace or serif name; otherwise ``font_flags`` decides.

    Args:
        font_name: The source font family name.
        font_flags: PyMuPDF span font flags.

    Returns:
        One of ``"serif"``, ``"sans-serif"``, or ``"monospace"``.
    """
    # 1. Try font name classification (more specific)
    if font_name:
        lower = font_name.lower().strip()
        if lower in _MONO_NAMES or _MONO_RE.search(lower):
            return FAMILY_MONO
        if lower in _SERIF_NAMES or _SERIF_RE.search(lower):
            return FAMILY_SERIF
        # Most UI / document fonts default to sans-serif when not
        # explicitly serif or monospace.
        # But if we also have font_flags, fall through to let flags decide.
        if font_flags is None:
            return FAMILY_SANS

    # 2. Fall back to PyMuPDF font flags
    if font_flags is not None:
        if font_flags & 8:
            return FAMILY_MONO
        if font_flags & 4:
            return FAMILY_SERIF
        return FAMILY_SANS

    # 3. Default
    return FAMILY_SANS
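
To make the interaction between the two inputs concrete, here is a minimal sketch of the same decision order. The name sets and regular expressions are hypothetical placeholders (the module's `_MONO_NAMES`, `_SERIF_NAMES`, `_MONO_RE`, and `_SERIF_RE` are not shown here), and `classify`, `FLAG_SERIF`, and `FLAG_MONO` are illustrative names, not part of the module.

```python
import re

FAMILY_SERIF, FAMILY_SANS, FAMILY_MONO = "serif", "sans-serif", "monospace"

# Hypothetical name tables and patterns (assumptions for this sketch).
_MONO_NAMES = {"courier new", "consolas", "menlo"}
_SERIF_NAMES = {"times new roman", "georgia", "garamond"}
_MONO_RE = re.compile(r"mono|courier")
_SERIF_RE = re.compile(r"times|mincho")

# PyMuPDF span flag bits tested by the function.
FLAG_SERIF = 1 << 2  # 4
FLAG_MONO = 1 << 3   # 8


def classify(font_name=None, font_flags=None):
    """Mirror the decision order: name match, then flags, then default."""
    if font_name:
        lower = font_name.lower().strip()
        if lower in _MONO_NAMES or _MONO_RE.search(lower):
            return FAMILY_MONO
        if lower in _SERIF_NAMES or _SERIF_RE.search(lower):
            return FAMILY_SERIF
        if font_flags is None:
            return FAMILY_SANS  # unknown name, nothing else to consult
    if font_flags is not None:
        if font_flags & FLAG_MONO:  # monospace is checked before serif
            return FAMILY_MONO
        if font_flags & FLAG_SERIF:
            return FAMILY_SERIF
    return FAMILY_SANS


print(classify(font_name="Courier New", font_flags=FLAG_SERIF))  # name wins
print(classify(font_name="Helvetica", font_flags=FLAG_SERIF))    # flags decide
print(classify(font_flags=FLAG_MONO | FLAG_SERIF))               # mono bit first
```

An unrecognised name such as "Helvetica" only yields "sans-serif" directly when no flags are supplied; with flags present, the flags have the final say.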

_resolve_font_key

_resolve_font_key(target_lang)

Resolve the target language to a _FONT_DB key.

Tries exact match, then _LANG_TO_SCRIPT mapping, then substring match against _FONT_DB keys, and finally "default".

Source code in src/utils/font_utils.py
def _resolve_font_key(target_lang: str) -> str:
    """Resolve the target language to a _FONT_DB key.

    Tries exact match, then ``_LANG_TO_SCRIPT`` mapping, then substring
    match against _FONT_DB keys, and finally ``"default"``.
    """
    lang = target_lang.lower()

    # Exact match
    if lang in _FONT_DB:
        return lang

    # Explicit language → script mapping
    if lang in _LANG_TO_SCRIPT:
        return _LANG_TO_SCRIPT[lang]

    # Substring match (e.g. "chinese" in "chinese (simplified)")
    for key in _FONT_DB:
        if key in lang or lang in key:
            return key

    return "default"
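
The four resolution steps can be traced with a small stand-in table. In the sketch below, the keys of `_FONT_DB` and the contents of `_LANG_TO_SCRIPT` are hypothetical (the real tables are not shown in this reference), and the function is a copy of the logic above under the public-style name `resolve_font_key`.

```python
# Hypothetical tables (assumptions); only the keys matter for resolution.
_FONT_DB = {
    "japanese": {},
    "chinese": {},
    "cyrillic": {},
    "default": {},
}
_LANG_TO_SCRIPT = {"russian": "cyrillic", "ukrainian": "cyrillic"}


def resolve_font_key(target_lang: str) -> str:
    lang = target_lang.lower()
    if lang in _FONT_DB:                 # 1. exact match
        return lang
    if lang in _LANG_TO_SCRIPT:          # 2. language -> script mapping
        return _LANG_TO_SCRIPT[lang]
    for key in _FONT_DB:                 # 3. substring match, either direction
        if key in lang or lang in key:
            return key
    return "default"                     # 4. nothing matched


for name in ["Japanese", "Russian", "Chinese (Simplified)", "Klingon", ""]:
    print(repr(name), "->", resolve_font_key(name))
```

One edge case follows from the substring test: `"" in key` is always true in Python, so an empty language string resolves to the first key in the table instead of "default". Callers that may pass an empty string might want to guard against it.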

get_font_for_language

get_font_for_language(target_lang, generic_family=FAMILY_SANS)

Selects the best concrete font for a target language and generic family.

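Returns the first candidate from _FONT_DB for the resolved language/script key and the requested generic family. If the entry has no list for that family, its sans-serif list is used instead. Falls back to the generic CSS family name when no candidates exist.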

Parameters:
- target_lang (str): Target language name (e.g. "Japanese", "Vietnamese").
- generic_family (str, default FAMILY_SANS): One of "serif", "sans-serif", or "monospace".

Returns:
- str: A concrete font family name or a generic CSS family name.

Source code in src/utils/font_utils.py
def get_font_for_language(
    target_lang: str,
    generic_family: str = FAMILY_SANS,
) -> str:
    """Selects the best concrete font for a target language and generic family.

    Returns the first candidate from ``_FONT_DB`` for the resolved
    language/script key.  Falls back to the generic CSS family name
    when no candidates exist.

    Args:
        target_lang: Target language name (e.g. "Japanese", "Vietnamese").
        generic_family: One of ``"serif"``, ``"sans-serif"``, ``"monospace"``.

    Returns:
        A concrete font family name or a generic CSS family name.
    """
    key = _resolve_font_key(target_lang)
    entry = _FONT_DB.get(key, _FONT_DB["default"])
    candidates = entry.get(generic_family, entry.get(FAMILY_SANS, []))

    if candidates:
        return candidates[0]

    return generic_family
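
The fallback chain (requested family, then the entry's sans-serif list, then the generic CSS name) can be sketched as follows. The `_FONT_DB` contents are hypothetical and chosen to exercise each branch, and key resolution is reduced to an exact lowercase match here, which is a simplification of `_resolve_font_key`.

```python
FAMILY_SERIF, FAMILY_SANS, FAMILY_MONO = "serif", "sans-serif", "monospace"

# Hypothetical database (an assumption), shaped to hit every fallback branch.
_FONT_DB = {
    "japanese": {
        FAMILY_SANS: ["Noto Sans JP", "Yu Gothic"],
        FAMILY_SERIF: ["Noto Serif JP"],
        # no monospace list: lookups fall back to the sans-serif list
    },
    "default": {
        FAMILY_SANS: [],  # left empty to show the generic-name fallback
        FAMILY_SERIF: ["Noto Serif"],
    },
}


def get_font_for_language(target_lang, generic_family=FAMILY_SANS):
    lang = target_lang.lower()
    key = lang if lang in _FONT_DB else "default"  # simplified resolution
    entry = _FONT_DB.get(key, _FONT_DB["default"])
    candidates = entry.get(generic_family, entry.get(FAMILY_SANS, []))
    return candidates[0] if candidates else generic_family


print(get_font_for_language("Japanese", FAMILY_SERIF))  # concrete serif font
print(get_font_for_language("Japanese", FAMILY_MONO))   # falls back to sans list
print(get_font_for_language("Klingon", FAMILY_MONO))    # generic CSS name
```

Because the function always returns a string, callers can pass the result straight into a CSS `font-family` value or a rendering API without checking for a missing font.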