text_utils
Utility functions for text processing and HTML cleaning.
_AttrEntry
Bases: NamedTuple
A single attribute with its translatable flag.
AttrRecord
Bases: NamedTuple
Stores all original attributes for a single tag, in order.
_AttrStripper
Bases: HTMLParser
HTMLParser subclass that strips non-translatable attributes.
Uses get_starttag_text() to obtain the raw tag text (preserving
original formatting, entities, quoting) and then applies _ATTR_RE
to classify individual attributes into keep/strip groups.
Each tag that has attributes stripped receives a data-ftid="N"
marker so that :func:restore_html_attributes can match tags by
ID instead of sequential order — robust against LLM tag mutations.
All original attributes are stored in document order so that restoration can reconstruct the original attribute sequence.
Source code in src/utils/text_utils.py
_emit_raw
Emit raw source text up to end_offset.
_process_start_tag
Classify attributes in the current start tag, emit rebuilt tag.
handle_starttag
handle_startendtag
feed_and_collect
Feed source HTML and return the stripped result.
_TagInfo
Bases: NamedTuple
Stores a tag's text, signature, and position in the string.
strip_bom
Strips a leading UTF-8 BOM (U+FEFF) from text if present.
| PARAMETER | DESCRIPTION |
|---|---|
| text | Input string that may start with a BOM. |

| RETURNS | DESCRIPTION |
|---|---|
| str | The string without the leading BOM character. |
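
The behaviour described above can be sketched in a couple of lines. This is a minimal illustration, not the source of `strip_bom`; the name `strip_bom_sketch` is hypothetical.

```python
def strip_bom_sketch(text: str) -> str:
    """Return text without a single leading U+FEFF character, if present."""
    # Only one leading BOM is removed; U+FEFF elsewhere in the string is kept.
    return text[1:] if text.startswith("\ufeff") else text


print(repr(strip_bom_sketch("\ufeffhello")))  # 'hello'
```

When the input is still bytes, decoding with the standard `utf-8-sig` codec removes the BOM at decode time, so a helper like this is mostly useful for text that is already a `str`.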
clean_llm_html
Removes leading/trailing noise tags that interfere with layout.
Handles all
variants:
,
,
.
| PARAMETER | DESCRIPTION |
|---|---|
| html | The raw HTML string from the LLM. |

| RETURNS | DESCRIPTION |
|---|---|
| str | Cleaned HTML string. |
html_to_plain_text
Converts enriched HTML to plain text for fallback or logging.
| PARAMETER | DESCRIPTION |
|---|---|
| html | HTML string containing markup tags. |

| RETURNS | DESCRIPTION |
|---|---|
| str | Stripped plain text. |
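
One way such a conversion could work is to collect only the text nodes with the standard library's `html.parser.HTMLParser`. The sketch below is an assumption about the approach, not the actual implementation; `_TextCollector` and `html_to_plain_text_sketch` are hypothetical names, and turning `<br>` into a newline is an illustrative choice.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and ignores all markup (hypothetical helper)."""

    def __init__(self) -> None:
        # convert_charrefs=True makes the parser decode entities such as &amp;.
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Assumption: a line break tag becomes a newline in the plain text.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_plain_text_sketch(html: str) -> str:
    collector = _TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts).strip()


print(html_to_plain_text_sketch("<p>Hello <b>world</b> &amp; co</p>"))
```

This sketch also keeps the contents of `<script>` and `<style>` elements, which a production version would probably want to skip.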
normalize_for_search
Normalizes text for accent/case-insensitive search.
Uses NFKD for compatibility decomposition (ligatures ﬁ→fi,
CJK width variants), casefold() for locale-independent Unicode case folding
(German ß→ss), strips combining marks (Mn) and invisible
formatting chars (Cf, e.g. zero-width joiners), then maps
non-decomposable extended-Latin letters (Đ, Ł, Ø, Å, Æ, Œ, Þ, Ð)
to their base letter via :data:_EXTENDED_LATIN_BASE_MAP.
"Xin Chào" → "xin chao", "café" → "cafe",
"Straße" → "strasse", "Đan Mạch" → "dan mach", "Łukasz" → "lukasz".
| PARAMETER | DESCRIPTION |
|---|---|
| text | Input string. |

| RETURNS | DESCRIPTION |
|---|---|
| str | Casefolded string with diacritics and invisible chars removed. |
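
The pipeline described above can be approximated with the standard `unicodedata` module. This is a sketch under stated assumptions: `_BASE_MAP` is a hypothetical, partial stand-in for `_EXTENDED_LATIN_BASE_MAP` (whose real contents are not shown on this page), and `normalize_sketch` is not the library function itself.

```python
import unicodedata

# Hypothetical, partial stand-in for _EXTENDED_LATIN_BASE_MAP.
# Keys are already casefolded because the lookup happens after casefold().
_BASE_MAP = {"đ": "d", "ł": "l", "ø": "o"}


def normalize_sketch(text: str) -> str:
    # 1. Compatibility decomposition, then Unicode case folding (ß becomes ss).
    folded = unicodedata.normalize("NFKD", text).casefold()
    # 2. Drop combining marks (Mn) and invisible formatting characters (Cf).
    kept = (ch for ch in folded if unicodedata.category(ch) not in ("Mn", "Cf"))
    # 3. Map letters that NFKD cannot decompose to an assumed base letter.
    return "".join(_BASE_MAP.get(ch, ch) for ch in kept)


print(normalize_sketch("Straße"))  # strasse
```

Search code would normalize both the query and the candidate text with the same function before comparing them.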
build_norm_map
Builds a normalized string with a position map back to the original.
Each character in the original text is casefolded and NFKD-decomposed,
then combining marks (Mn) and invisible chars (Cf) are stripped, and
extended-Latin letters (Đ, Ł, Ø, …) are mapped to their base letter
via :data:_EXTENDED_LATIN_BASE_MAP. The resulting characters are
collected along with the index of the original character that produced
them. Because _EXTENDED_LATIN_BASE_MAP contains only 1:1 substitutions,
applying it does not disturb the position alignment.
This is used by the HighlightDelegate to find match spans in normalized text and map them back to the correct original-text positions for highlighting.
Example::
    build_norm_map("Café") → ("cafe", [0, 1, 2, 3])
    build_norm_map("Straße") → ("strasse", [0, 1, 2, 3, 4, 4, 5])
    # ß casefolds to "ss", so both characters map back to original index 4.
| PARAMETER | DESCRIPTION |
|---|---|
| text | Original (un-normalized) string. |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[str, list[int]] | Tuple of (normalized_text, orig_indices) where orig_indices[i] is the index of the original character that produced the i-th character of normalized_text. |
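
To make the position map concrete, here is a minimal sketch that follows the per-character steps described above and then maps a match span back to the original text. The names `build_norm_map_sketch`, `_BASE_MAP`, and `find_original_span` are hypothetical, and `_BASE_MAP` is only an assumed, partial stand-in for `_EXTENDED_LATIN_BASE_MAP`.

```python
import unicodedata

# Hypothetical, partial stand-in for _EXTENDED_LATIN_BASE_MAP (1:1 entries only).
_BASE_MAP = {"đ": "d", "ł": "l", "ø": "o"}


def build_norm_map_sketch(text: str) -> tuple[str, list[int]]:
    chars: list[str] = []
    indices: list[int] = []
    for i, orig in enumerate(text):
        # One original character may expand to several (ß becomes "ss").
        for ch in unicodedata.normalize("NFKD", orig.casefold()):
            if unicodedata.category(ch) in ("Mn", "Cf"):
                continue  # dropped characters leave no entry in the map
            chars.append(_BASE_MAP.get(ch, ch))
            indices.append(i)
    return "".join(chars), indices


def find_original_span(text: str, needle: str):
    """Return (start, end) in the original text for the first match, or None."""
    norm, idx = build_norm_map_sketch(text)
    norm_needle, _ = build_norm_map_sketch(needle)
    pos = norm.find(norm_needle)
    if pos < 0 or not norm_needle:
        return None
    last = pos + len(norm_needle) - 1
    return idx[pos], idx[last] + 1


print(build_norm_map_sketch("Stra\u00dfe"))  # ('strasse', [0, 1, 2, 3, 4, 4, 5])
```

The end of the span is taken from the last matched normalized character, so a match ending inside an expansion such as "ss" still covers the whole original character.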
strip_html_attributes
Strips non-translatable attributes from HTML tags.
Keeps translatable attributes (alt, title, placeholder, aria-label,
etc.) in the tag for the LLM to translate. Strips all other
attributes, records them for later restoration, and adds a
data-ftid="N" marker to each modified tag.
Uses html.parser.HTMLParser for robust tag boundary detection,
which correctly handles > inside quoted attribute values and
multiline attributes.
| PARAMETER | DESCRIPTION |
|---|---|
| html_text | Raw HTML string. |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[str, dict[int, AttrRecord]] | Tuple of (stripped_html, attr_records) where attr_records is a dict mapping marker ID → AttrRecord. |
restore_html_attributes
Re-injects stripped attributes into translated HTML.
Finds tags with data-ftid="N" markers, looks up the
corresponding :class:AttrRecord, and rebuilds the tag with
all original attributes in their original order. Translated
(translatable) attribute values are taken from the LLM output;
non-translatable values are taken from the stored record.
| PARAMETER | DESCRIPTION |
|---|---|
| html_text | Translated HTML (with markers from stripping). |
| records | Attribute records: dict keyed by marker ID. |

| RETURNS | DESCRIPTION |
|---|---|
| str | HTML with original attributes restored and markers removed. |
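
The strip and restore round trip can be illustrated with a deliberately simplified sketch. It uses regular expressions instead of the HTMLParser-based approach the module describes, only understands double-quoted attribute values, and stores plain (name, value) pairs rather than AttrRecord objects. All names here (`strip_attrs_sketch`, `restore_attrs_sketch`, `TRANSLATABLE`) and the numbering from 0 are assumptions made for illustration.

```python
import re

# Assumed subset of translatable attribute names; the real list may be longer.
TRANSLATABLE = {"alt", "title", "placeholder", "aria-label"}

# Simplified patterns: only double-quoted attribute values are recognised.
_TAG_RE = re.compile(r'<([A-Za-z][\w-]*)((?:\s+[\w:-]+\s*=\s*"[^"]*")+)\s*(/?)>')
_ATTR_RE = re.compile(r'([\w:-]+)\s*=\s*"([^"]*)"')


def strip_attrs_sketch(html: str):
    records: dict[int, list[tuple[str, str]]] = {}

    def repl(match):
        name, raw_attrs, slash = match.groups()
        attrs = _ATTR_RE.findall(raw_attrs)
        if all(key in TRANSLATABLE for key, _ in attrs):
            return match.group(0)  # nothing to strip, no marker needed
        ftid = len(records)
        records[ftid] = attrs  # every attribute, in document order
        kept = "".join(f' {k}="{v}"' for k, v in attrs if k in TRANSLATABLE)
        return f'<{name} data-ftid="{ftid}"{kept}{slash}>'

    return _TAG_RE.sub(repl, html), records


def restore_attrs_sketch(html: str, records) -> str:
    def repl(match):
        name, raw_attrs, slash = match.groups()
        current = dict(_ATTR_RE.findall(raw_attrs))
        marker = current.get("data-ftid", "")
        if not marker.isdigit() or int(marker) not in records:
            return match.group(0)  # unknown or missing marker: leave untouched
        rebuilt = "".join(
            # Translatable values come from the translated tag, the rest from the record.
            f' {k}="{current.get(k, v) if k in TRANSLATABLE else v}"'
            for k, v in records[int(marker)]
        )
        return f"<{name}{rebuilt}{slash}>"

    return _TAG_RE.sub(repl, html)


stripped, recs = strip_attrs_sketch('<a href="/x" class="btn" title="Home">Go</a>')
translated = stripped.replace("Home", "Accueil").replace("Go", "Aller")
print(restore_attrs_sketch(translated, recs))
```

Because tags are matched by the ID in the marker rather than by position, the restore step still works if the translated output reorders or duplicates tags.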
_tag_signature
Extracts a tag signature for matching: name + type.
Opening tags like '
' → "br" (treated as opening).
| PARAMETER | DESCRIPTION |
|---|---|
| tag_text | Full tag string. |

| RETURNS | DESCRIPTION |
|---|---|
| str | Signature string for comparison. |
repair_html_tags
Re-inserts tags that the LLM dropped from the translated HTML.
Uses greedy two-pointer alignment between the original and translated tag sequences. Tags are matched by name and type (opening/closing), not by full text — so attribute changes from LLM translation don't cause false mismatches.
Any tag present in the original but missing from the translated output is re-inserted at the corresponding position. Tags that the LLM added (not in original) are left as-is.
| PARAMETER | DESCRIPTION |
|---|---|
| original | The attribute-stripped HTML sent to the LLM. |
| translated | The LLM's translated response. |

| RETURNS | DESCRIPTION |
|---|---|
| str | Translated HTML with missing tags re-inserted. |
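
A greedy two-pointer alignment of this kind might look like the sketch below. It is an illustration under assumptions, not the module's code: the signature format (`"b"` for opening or self-closing, `"/b"` for closing), the choice to re-insert a missing tag just before the next unmatched translated tag, and the names `repair_tags_sketch` and `_signature` are all hypothetical.

```python
import re

_TAG = re.compile(r"<(/?)([A-Za-z][\w-]*)[^<>]*>")


def _signature(match) -> str:
    # Assumed signature: optional "/" plus the lowercased tag name.
    return match.group(1) + match.group(2).lower()


def repair_tags_sketch(original: str, translated: str) -> str:
    orig = [(m.group(0), _signature(m)) for m in _TAG.finditer(original)]
    trans = [(m.start(), _signature(m)) for m in _TAG.finditer(translated)]

    inserts: list[tuple[int, str]] = []  # (offset in translated, tag text)
    j = 0  # pointer into the translated tag sequence
    for tag_text, sig in orig:
        k = j
        while k < len(trans) and trans[k][1] != sig:
            k += 1
        if k < len(trans):
            j = k + 1  # matched; tags skipped on the way were added by the LLM
        else:
            # Missing tag: place it before the next unmatched translated tag,
            # or at the end of the string if none is left.
            pos = trans[j][0] if j < len(trans) else len(translated)
            inserts.append((pos, tag_text))

    out, last = [], 0
    for pos, tag_text in inserts:  # offsets are non-decreasing by construction
        out.append(translated[last:pos])
        out.append(tag_text)
        last = pos
    out.append(translated[last:])
    return "".join(out)


print(repair_tags_sketch("<p>Hello <b>world</b></p>", "<p>Bonjour monde</b></p>"))
```

The lookahead is greedy, so a tag that the LLM both dropped and later duplicated can be matched against the wrong occurrence. The re-inserted tag restores the tag sequence, but its exact text position is only an approximation.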
strip_xml_overhead
Strips processing instructions and CDATA markers from XML.
Replaces each non-translatable construct with a bracketed placeholder
[__PRESERVE_XML_N__] so the LLM only sees translatable text and
tag structure. Comments (<!-- ... -->) are left intact — LLMs
naturally skip them. CDATA text content is preserved — only the
<![CDATA[ and ]]> markers are replaced.
| PARAMETER | DESCRIPTION |
|---|---|
| xml | Raw XML string. |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[str, list[str]] | Tuple of (stripped_xml, records) where each entry of records is the original text that a placeholder replaced. |
restore_xml_overhead
Re-injects XML processing instructions and CDATA markers.
Replaces [__PRESERVE_XML_N__] placeholders back with the original
content stored in records.
| PARAMETER | DESCRIPTION |
|---|---|
| xml | Translated XML containing [__PRESERVE_XML_N__] placeholders. |
| records | List of original constructs from :func:strip_xml_overhead. |

| RETURNS | DESCRIPTION |
|---|---|
| str | XML with original non-translatable constructs restored. |
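
The placeholder round trip for XML overhead can be sketched with two regular expressions. The patterns, the numbering from 0, and the function names below are assumptions made for illustration; the real functions may recognise more constructs or number placeholders differently.

```python
import re

# Processing instructions and the two CDATA markers; comments are not touched.
_OVERHEAD = re.compile(r"<\?.*?\?>|<!\[CDATA\[|\]\]>", re.DOTALL)
_PLACEHOLDER = re.compile(r"\[__PRESERVE_XML_(\d+)__\]")


def strip_xml_overhead_sketch(xml: str) -> tuple[str, list[str]]:
    records: list[str] = []

    def repl(match):
        records.append(match.group(0))
        return f"[__PRESERVE_XML_{len(records) - 1}__]"

    return _OVERHEAD.sub(repl, xml), records


def restore_xml_overhead_sketch(xml: str, records: list[str]) -> str:
    def repl(match):
        index = int(match.group(1))
        # Leave placeholders with an unknown index untouched.
        return records[index] if index < len(records) else match.group(0)

    return _PLACEHOLDER.sub(repl, xml)


source = '<?xml version="1.0"?><a><![CDATA[Hi & bye]]></a>'
stripped, records = strip_xml_overhead_sketch(source)
print(stripped)
```

Since the CDATA text itself stays in the stripped string, the translated text between the two marker placeholders is what ends up inside the restored CDATA section.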
strip_xml_attributes
Strips ALL attributes from XML tags.
Unlike :func:strip_html_attributes which keeps translatable
attributes (alt, title, etc.), XML attributes are almost never
translatable so everything is stripped. Each modified tag receives
a data-ftid="N" marker for robust restoration.
| PARAMETER | DESCRIPTION |
|---|---|
| xml | XML string (with overhead already stripped, if desired). |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[str, dict[int, AttrRecord]] | Tuple of (stripped_xml, attr_records) where attr_records is a dict mapping marker ID → AttrRecord. |
strip_rtf_overhead
Strips RTF control words, symbols, braces, and Unicode escapes.
Replaces each non-text construct with a bracketed placeholder
[__PRESERVE_RTF_N__]. Unicode escapes (\uN?) are decoded
to the actual Unicode character in the stripped text so the LLM can
read them; the original escape is still recorded for round-trip
fidelity.
| PARAMETER | DESCRIPTION |
|---|---|
| rtf | RTF text chunk (already split by …). |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[str, list[str]] | Tuple of (stripped_text, records) where each entry of records is the original RTF construct that was replaced. |
restore_rtf_overhead
Re-injects RTF control words and symbols from placeholders.
Replaces [__PRESERVE_RTF_N__] placeholders back with the original
RTF constructs stored in records.
| PARAMETER | DESCRIPTION |
|---|---|
| text | Translated text containing [__PRESERVE_RTF_N__] placeholders. |
| records | List of original RTF constructs from :func:strip_rtf_overhead. |

| RETURNS | DESCRIPTION |
|---|---|
| str | RTF text with original control sequences restored. |
strip_md_overhead
Strips URLs from Markdown links/images and reference definitions.
Replaces each URL with a bracketed placeholder
[__PRESERVE_MD_N__] so the LLM only sees translatable text.
The caller should chain :func:strip_html_attributes afterwards
to handle embedded HTML.
Handles:
- Inline links: [text](url) → [text]([__PRESERVE_MD_N__])
- Inline images: ![alt](url) → ![alt]([__PRESERVE_MD_N__])
- Reference definitions: [id]: url → [id]: [__PRESERVE_MD_N__]
| PARAMETER | DESCRIPTION |
|---|---|
| md | Raw Markdown string. |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[str, list[str]] | Tuple of (stripped_md, records) where each entry of records is the original URL/content that a placeholder replaced. |
restore_md_overhead
Re-injects Markdown URLs from placeholders.
Replaces [__PRESERVE_MD_N__] placeholders back with the original
URLs stored in records.
| PARAMETER | DESCRIPTION |
|---|---|
| md | Translated Markdown containing [__PRESERVE_MD_N__] placeholders. |
| records | List of original URLs from :func:strip_md_overhead. |

| RETURNS | DESCRIPTION |
|---|---|
| str | Markdown with original URLs restored. |
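
The Markdown URL round trip can be sketched in the same placeholder style. The regular expressions below are simplified assumptions (no link titles, no URLs containing spaces or parentheses), numbering starts at 0 by assumption, and the function names and example URLs are hypothetical.

```python
import re

# "[text](url)" and "![alt](url)": group 2 is the URL.
_INLINE = re.compile(r"(!?\[[^\]]*\]\()([^)\s]+)(\))")
# "[id]: url" at the start of a line: group 2 is the URL.
_REFDEF = re.compile(r"^(\[[^\]]+\]:\s*)(\S+)", re.MULTILINE)
_PLACEHOLDER = re.compile(r"\[__PRESERVE_MD_(\d+)__\]")


def strip_md_overhead_sketch(md: str) -> tuple[str, list[str]]:
    records: list[str] = []

    def stash(url: str) -> str:
        records.append(url)
        return f"[__PRESERVE_MD_{len(records) - 1}__]"

    # Inline links and images first, then reference definitions.
    md = _INLINE.sub(lambda m: m.group(1) + stash(m.group(2)) + m.group(3), md)
    md = _REFDEF.sub(lambda m: m.group(1) + stash(m.group(2)), md)
    return md, records


def restore_md_overhead_sketch(md: str, records: list[str]) -> str:
    def repl(match):
        index = int(match.group(1))
        # Leave placeholders with an unknown index untouched.
        return records[index] if index < len(records) else match.group(0)

    return _PLACEHOLDER.sub(repl, md)


source = (
    "See [docs](https://example.com/a) and ![logo](img/logo.png).\n"
    "\n"
    "[ref]: https://example.com/ref\n"
)
stripped, records = strip_md_overhead_sketch(source)
print(stripped)
```

Link text and image alt text stay visible to the LLM, while every URL is hidden behind a numbered placeholder and restored by index afterwards.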