font_utils

Unified font family handling across all file types (image, office, PDF).

Implements a hybrid font selection strategy:

1. Determine the generic family (serif / sans-serif / monospace) from the source font name or PDF font flags.
2. Select a concrete font that supports the target language and belongs to the same generic family.
3. Fall back to the generic CSS family name when no concrete match is found.

This module does not depend on PySide6, so it works headlessly for CLI / MCP / REST usage.
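
The three-step strategy described above can be sketched end to end as follows. This is a minimal illustration rather than the module's implementation: `SERIF_HINTS`, `MONO_HINTS`, `FONT_DB`, `generic_family`, and `pick_font` are hypothetical stand-ins, and the font names in the table are sample data chosen for the example.

```python
# Hypothetical stand-ins for the module's private tables; the real
# name lists and font database are not shown in this reference.
SERIF_HINTS = ("times", "georgia", "garamond")
MONO_HINTS = ("courier", "consolas", "mono")

FONT_DB = {
    "japanese": {"serif": ["Noto Serif JP"], "sans-serif": ["Noto Sans JP"]},
    "default": {},
}


def generic_family(font_name: str) -> str:
    """Step 1: map a source font name to a generic CSS family."""
    lower = font_name.lower()
    if any(hint in lower for hint in MONO_HINTS):
        return "monospace"
    if any(hint in lower for hint in SERIF_HINTS):
        return "serif"
    return "sans-serif"


def pick_font(source_font: str, target_lang: str) -> str:
    """Steps 2 and 3: choose a concrete font, else the generic name."""
    family = generic_family(source_font)
    entry = FONT_DB.get(target_lang.lower(), FONT_DB["default"])
    candidates = entry.get(family, entry.get("sans-serif", []))
    return candidates[0] if candidates else family


print(pick_font("Times New Roman", "Japanese"))
print(pick_font("Courier New", "Klingon"))
```

With this sample table, the first call resolves to a concrete serif font for Japanese, while the second has no entry to draw from and returns the generic name `"monospace"`.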

detect_script

detect_script(text)

Detects the first non-Latin script family that appears in text.

Scans characters until a non-Latin script is identified. Returns "latin" for ASCII / Latin-only text (including extended Latin for Vietnamese, Turkish, etc.).

Parameter Description
text

The text to analyse.

Type: str

Returns Description
str

A script family identifier (e.g. "latin", "cyrillic").

Source code in src/utils/font_utils.py
def detect_script(text: str) -> str:
    """Detects the dominant non-Latin script family from *text*.

    Scans characters until a non-Latin script is identified.  Returns
    ``"latin"`` for ASCII / Latin-only text (including extended Latin
    for Vietnamese, Turkish, etc.).

    Args:
        text: The text to analyse.

    Returns:
        A script family identifier (e.g. ``"latin"``, ``"cyrillic"``).
    """
    _latin_upper = 0x02FF
    for ch in text:
        cp = ord(ch)
        if cp <= _latin_upper:
            continue
        for lo, hi, family in _SCRIPT_RANGES:
            if lo <= cp <= hi:
                if family is not None:
                    return family
                break
    return SCRIPT_LATIN
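
To make the scanning behaviour concrete, the sketch below runs the same loop against a small, assumed range table. The contents of `_SCRIPT_RANGES` are not shown in this reference, so `SCRIPT_RANGES`, the `"kana"` identifier, and the choice of ranges are illustrative assumptions; only `"latin"` and `"cyrillic"` are identifiers named in the docstring.

```python
SCRIPT_LATIN = "latin"

# Hypothetical (lo, hi, family) rows. A family of None marks a range
# that carries no script information and should be skipped.
SCRIPT_RANGES = [
    (0x0400, 0x04FF, "cyrillic"),  # Cyrillic block
    (0x1E00, 0x1EFF, None),        # Latin Extended Additional (Vietnamese)
    (0x2000, 0x206F, None),        # General Punctuation
    (0x3040, 0x30FF, "kana"),      # Hiragana and Katakana
]


def detect_script_sketch(text: str) -> str:
    """Return the family of the first character with a named script."""
    for ch in text:
        cp = ord(ch)
        if cp <= 0x02FF:  # ASCII through the Latin and IPA blocks
            continue
        for lo, hi, family in SCRIPT_RANGES:
            if lo <= cp <= hi:
                if family is not None:
                    return family
                break  # neutral range: move on to the next character
    return SCRIPT_LATIN


print(detect_script_sketch("Ti\u1ebfng Vi\u1ec7t"))
print(detect_script_sketch("abc \u3053\u3093 \u041f\u0440\u0438"))
```

Because the function returns on the first character that maps to a named family, mixed-script text is classified by whichever non-Latin script appears first, not by character counts.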

classify_generic_family

classify_generic_family(*, font_name=None, font_flags=None)

Determines the generic CSS family from a source font.

Uses two inputs (either or both may be provided):

- font_name: The font's family name (e.g. "Times New Roman").
- font_flags: PyMuPDF font flags (bit 3 = mono, bit 2 = serif).

When both are provided, font_name takes precedence if it matches a known monospace or serif font, since it's more specific than PyMuPDF's coarse 2-bit classification. If the name matches neither, font_flags decides.

Parameter Description
font_name

The source font family name.

Type: str | None Default: None

font_flags

PyMuPDF span font flags.

Type: int | None Default: None

Returns Description
str

One of "serif", "sans-serif", or "monospace".

Source code in src/utils/font_utils.py
def classify_generic_family(  # noqa: PLR0911
    *,
    font_name: str | None = None,
    font_flags: int | None = None,
) -> str:
    """Determines the generic CSS family from a source font.

    Uses two inputs (either or both may be provided):
    - ``font_name``: The font's family name (e.g. "Times New Roman").
    - ``font_flags``: PyMuPDF font flags (bit 3 = mono, bit 2 = serif).

    When both are provided, ``font_name`` takes precedence since it's
    more specific than PyMuPDF's coarse 2-bit classification.

    Args:
        font_name: The source font family name.
        font_flags: PyMuPDF span font flags.

    Returns:
        One of ``"serif"``, ``"sans-serif"``, or ``"monospace"``.
    """
    # 1. Try font name classification (more specific)
    if font_name:
        lower = font_name.lower().strip()
        if lower in _MONO_NAMES or _MONO_RE.search(lower):
            return FAMILY_MONO
        if lower in _SERIF_NAMES or _SERIF_RE.search(lower):
            return FAMILY_SERIF
        # Most UI / document fonts default to sans-serif when not
        # explicitly serif or monospace.
        # But if we also have font_flags, fall through to let flags decide.
        if font_flags is None:
            return FAMILY_SANS

    # 2. Fall back to PyMuPDF font flags
    if font_flags is not None:
        if font_flags & 8:
            return FAMILY_MONO
        if font_flags & 4:
            return FAMILY_SERIF
        return FAMILY_SANS

    # 3. Default
    return FAMILY_SANS
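
The flag fallback in the listing above is a plain bit test. The sketch below isolates it, assuming the span flag layout stated in the docstring (bit 2, value 4, for serifed and bit 3, value 8, for monospaced); `family_from_flags` and the constant names are hypothetical.

```python
SERIF_BIT = 1 << 2  # value 4
MONO_BIT = 1 << 3   # value 8


def family_from_flags(font_flags: int) -> str:
    """Decode a generic CSS family from span font flags.

    Monospace is tested first, so a font flagged as both serifed and
    monospaced is reported as monospace. All other bits are ignored.
    """
    if font_flags & MONO_BIT:
        return "monospace"
    if font_flags & SERIF_BIT:
        return "serif"
    return "sans-serif"


for flags in (0, 4, 8, 12, 20):
    print(flags, family_from_flags(flags))
```

The value 20 (16 + 4) still decodes as serif, because bits outside the two tested positions do not affect the result.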

_resolve_font_key

_resolve_font_key(target_lang)

Resolve the target language to a _FONT_DB key.

Tries exact match, then _LANG_TO_SCRIPT mapping, then substring match against _FONT_DB keys, and finally "default".

Source code in src/utils/font_utils.py
def _resolve_font_key(target_lang: str) -> str:
    """Resolve the target language to a _FONT_DB key.

    Tries exact match, then ``_LANG_TO_SCRIPT`` mapping, then substring
    match against _FONT_DB keys, and finally ``"default"``.
    """
    lang = target_lang.lower()

    # Exact match
    if lang in _FONT_DB:
        return lang

    # Explicit language → script mapping
    if lang in _LANG_TO_SCRIPT:
        return _LANG_TO_SCRIPT[lang]

    # Substring match (e.g. "chinese" in "chinese (simplified)")
    for key in _FONT_DB:
        if key in lang or lang in key:
            return key

    return "default"
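
The resolution order is easiest to see with sample data. The sketch below repeats the same four steps against assumed tables; `FONT_DB`, `LANG_TO_SCRIPT`, their contents, and `resolve_font_key_sketch` are hypothetical, since the real `_FONT_DB` and `_LANG_TO_SCRIPT` are not shown here.

```python
# Hypothetical tables; only the keys matter for key resolution.
FONT_DB = {"chinese": {}, "japanese": {}, "cyrillic": {}, "default": {}}
LANG_TO_SCRIPT = {"russian": "cyrillic", "ukrainian": "cyrillic"}


def resolve_font_key_sketch(target_lang: str) -> str:
    lang = target_lang.lower()
    if lang in FONT_DB:               # 1. exact match
        return lang
    if lang in LANG_TO_SCRIPT:        # 2. language -> script mapping
        return LANG_TO_SCRIPT[lang]
    for key in FONT_DB:               # 3. substring match, either direction
        if key in lang or lang in key:
            return key
    return "default"                  # 4. nothing matched


for name in ("Japanese", "Russian", "Chinese (Simplified)", "Klingon"):
    print(name, "->", resolve_font_key_sketch(name))
```

One consequence of the two-way substring test is worth keeping in mind: an empty string is a substring of every key, so in this sketch an empty `target_lang` resolves to the first key in iteration order rather than `"default"`. Very short inputs can match unintended keys in the same way.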

get_font_for_language

get_font_for_language(target_lang, generic_family=FAMILY_SANS)

Selects the best concrete font for a target language and generic family.

Returns the first candidate from _FONT_DB for the resolved language/script key. Falls back to the generic CSS family name when no candidates exist.

Parameter Description
target_lang

Target language name (e.g. "Japanese", "Vietnamese").

Type: str

generic_family

One of "serif", "sans-serif", "monospace".

Type: str Default: FAMILY_SANS

Returns Description
str

A concrete font family name or a generic CSS family name.

Source code in src/utils/font_utils.py
def get_font_for_language(
    target_lang: str,
    generic_family: str = FAMILY_SANS,
) -> str:
    """Selects the best concrete font for a target language and generic family.

    Returns the first candidate from ``_FONT_DB`` for the resolved
    language/script key.  Falls back to the generic CSS family name
    when no candidates exist.

    Args:
        target_lang: Target language name (e.g. "Japanese", "Vietnamese").
        generic_family: One of ``"serif"``, ``"sans-serif"``, ``"monospace"``.

    Returns:
        A concrete font family name or a generic CSS family name.
    """
    key = _resolve_font_key(target_lang)
    entry = _FONT_DB.get(key, _FONT_DB["default"])
    candidates = entry.get(generic_family, entry.get(FAMILY_SANS, []))

    if candidates:
        return candidates[0]

    return generic_family
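
The lookup has three distinct fallbacks that are easy to conflate: an unknown key uses the "default" entry, a family missing from an entry falls back to that entry's sans-serif list, and an empty list yields the generic family name. The sketch below exercises each path against an assumed table; `FONT_DB`, its contents, and `pick_candidate` are hypothetical, and the real `_FONT_DB` may be organised differently.

```python
FAMILY_SANS = "sans-serif"

# Hypothetical database: key -> generic family -> ordered candidates.
FONT_DB = {
    "japanese": {
        "serif": ["Noto Serif JP"],
        "sans-serif": ["Noto Sans JP"],
        # no "monospace" entry: lookups fall back to the sans-serif list
    },
    "thai": {
        "sans-serif": ["Noto Sans Thai"],
        "monospace": [],  # present but empty: generic name is returned
    },
    "default": {},
}


def pick_candidate(key: str, generic_family: str = FAMILY_SANS) -> str:
    entry = FONT_DB.get(key, FONT_DB["default"])
    candidates = entry.get(generic_family, entry.get(FAMILY_SANS, []))
    return candidates[0] if candidates else generic_family


print(pick_candidate("japanese", "serif"))
print(pick_candidate("japanese", "monospace"))
print(pick_candidate("thai", "monospace"))
print(pick_candidate("unknown", "serif"))
```

Note the asymmetry between the second and third calls: a missing family key silently substitutes a sans-serif font, whereas an explicitly empty list returns the generic CSS name so the renderer can choose.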