office_processor
Office document processing for DOCX, XLSX, PPTX, ODT, ODS, ODP and legacy formats.
Uses a 3-tier backend system:
1. win32com (Windows + MS Office)
2. LibreOffice UNO API (cross-platform)
3. python-docx / openpyxl / python-pptx / odfpy (modern + ODF formats)
Legacy formats (.doc, .xls, .ppt) require backend 1 or 2.
_detect_backend
Detects the best available backend for the given file extension.
Priority order depends on format family:
- OOXML (.docx/.xlsx/.pptx): python_lib immediately (lightweight).
- ODF (.odt/.ods/.odp): UNO → win32com → python_lib (odfpy).
- Legacy (.doc/.xls/.ppt): win32com → UNO → error.
| PARAMETER | DESCRIPTION |
|---|---|
| suffix | Lowercase file extension (e.g. ".docx", ".doc"). |
| libreoffice_path | User-configured LibreOffice path; forwarded to |

| RETURNS | DESCRIPTION |
|---|---|
| str | One of the backend identifiers. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If no backend is available for the format. |
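
The priority rules above can be sketched as a small lookup. This is a minimal illustration rather than the module's implementation: `pick_backend`, the `available` set, and the identifier strings `"win32com"` and `"uno"` are assumptions made for the example (only `python_lib` is named in the description).

```python
OOXML = {".docx", ".xlsx", ".pptx"}
ODF = {".odt", ".ods", ".odp"}
LEGACY = {".doc", ".xls", ".ppt"}


def pick_backend(suffix: str, available: set[str]) -> str:
    """Return the first usable backend for `suffix` (hypothetical helper)."""
    if suffix in OOXML:
        # OOXML goes straight to the pure-Python libraries, without probing.
        return "python_lib"
    if suffix in ODF:
        order = ["uno", "win32com", "python_lib"]
    elif suffix in LEGACY:
        # Legacy binary formats have no pure-Python fallback.
        order = ["win32com", "uno"]
    else:
        raise ValueError(f"Unsupported extension: {suffix}")
    for name in order:
        if name in available:
            return name
    raise ValueError(f"No backend available for {suffix}")
```

For example, `pick_backend(".odt", {"win32com", "python_lib"})` would select `"win32com"`, because UNO is missing from the assumed `available` set.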
Source code in src/core/office_processor.py
_substitute_font
Determines the font name to use after translation.
When the original and translated texts share the same script family, the original font name is returned unchanged. When scripts differ (e.g. Latin → CJK), a compatible font from the same generic family (serif / sans-serif / monospace) is selected for the target language.
| PARAMETER | DESCRIPTION |
|---|---|
| original_font | The source document's font name. |
| original_text | Text before translation. |
| translated_text | Text after translation. |
| target_lang | Target language name (used for font selection). |

| RETURNS | DESCRIPTION |
|---|---|
| str \| None | The font name to apply, or None (letting the application pick a default). |
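
One way to picture the script comparison is a helper that classifies text by Unicode character names. The sketch below is only an assumption about how such a check could work: `script_family`, `substitute_font`, both lookup tables, and the font names in them are hypothetical and are not taken from the module.

```python
import unicodedata

# Hypothetical lookup tables; the module's real tables are not shown here.
GENERIC_FAMILY = {
    "Arial": "sans-serif",
    "Times New Roman": "serif",
    "Courier New": "monospace",
}
FALLBACK_FONTS = {
    ("Japanese", "sans-serif"): "Noto Sans CJK JP",
    ("Japanese", "serif"): "Noto Serif CJK JP",
}


def script_family(text: str) -> str:
    """Return the dominant script family among the letters of `text`."""
    counts: dict[str, int] = {}
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith(("CJK", "HIRAGANA", "KATAKANA", "HANGUL")):
            family = "cjk"
        elif name.startswith("LATIN"):
            family = "latin"
        else:
            family = "other"
        counts[family] = counts.get(family, 0) + 1
    return max(counts, key=counts.get) if counts else "none"


def substitute_font(original_font, original_text, translated_text, target_lang):
    """Keep the font when scripts match, else look up a same-family fallback."""
    if script_family(original_text) == script_family(translated_text):
        return original_font
    generic = GENERIC_FAMILY.get(original_font, "sans-serif")
    # None means: no known substitute, let the application decide.
    return FALLBACK_FONTS.get((target_lang, generic))
```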
Source code in src/core/office_processor.py
_save_win32com_font
Saves font properties from a win32com Font object.
Reads each property in WIN32COM_FONT_PROPERTIES and stores non-undefined values. Properties that raise (e.g. on merged cells) are silently skipped.
| PARAMETER | DESCRIPTION |
|---|---|
| font_obj | A win32com Range.Font COM object. |

| RETURNS | DESCRIPTION |
|---|---|
| dict | Mapping of property name to saved value. |
Source code in src/core/office_processor.py
_restore_win32com_font
Restores previously saved font properties to a win32com Font object.
Sets each property independently so a single failure does not prevent other properties from being restored.
When target_lang is provided and "Name" is present in saved, the font name is substituted via _substitute_font when the source and target scripts differ.

| PARAMETER | DESCRIPTION |
|---|---|
| font_obj | A win32com Range.Font COM object. |
| saved | Mapping of property name to value (from _save_win32com_font). |
| original_text | The text before translation (for script detection). |
| translated_text | The text after translation (for script detection). |
| target_lang | Target language name for font substitution. |
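
The save/restore pair can be illustrated without COM by using a stand-in object. In this sketch `FONT_PROPERTIES` stands in for WIN32COM_FONT_PROPERTIES, `UNDEFINED` stands in for WIN32COM_UNDEFINED (assumed here to be 9999999, Word's wdUndefined value), and `FakeFont` is a hypothetical test double; none of these reflect the module's actual contents.

```python
FONT_PROPERTIES = ("Name", "Size", "Bold", "Italic")  # assumed subset
UNDEFINED = 9999999  # assumption: Word's wdUndefined sentinel


def save_font(font_obj) -> dict:
    """Read each property, skipping undefined values and failing reads."""
    saved = {}
    for prop in FONT_PROPERTIES:
        try:
            value = getattr(font_obj, prop)
        except Exception:
            continue  # e.g. a COM error on merged cells
        if value != UNDEFINED:
            saved[prop] = value
    return saved


def restore_font(font_obj, saved: dict) -> None:
    """Set each property on its own so one failure does not block the rest."""
    for prop, value in saved.items():
        try:
            setattr(font_obj, prop, value)
        except Exception:
            continue


class FakeFont:
    """Stand-in for a COM Font object; Italic always raises."""

    def __init__(self):
        self.Name = "Arial"
        self.Size = 11.0
        self.Bold = UNDEFINED

    @property
    def Italic(self):
        raise RuntimeError("simulated COM failure")
```

Wrapping each `setattr` in its own `try` is what keeps a single read-only or failing property from aborting the whole restore.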
Source code in src/core/office_processor.py
_read_win32com_char_formatting
Reads inline formatting from a single win32com Word character range.
| PARAMETER | DESCRIPTION |
|---|---|
| char_range | A win32com |

| RETURNS | DESCRIPTION |
|---|---|
| tuple | Tuple of (bold, italic, underline, strike, superscript, subscript, font_size_pt, color_hex, bg_color_hex). Properties equal to |
Source code in src/core/office_processor.py
_has_win32com_range_mixed_formatting
Checks whether a win32com Range has mixed per-character formatting.
Uses a quick-exit via rng.Font.Bold == WIN32COM_UNDEFINED before
falling back to full character-level iteration. Returns False on
any COM exception (conservative: assume uniform formatting).
| PARAMETER | DESCRIPTION |
|---|---|
| rng | A win32com |

| RETURNS | DESCRIPTION |
|---|---|
| bool | True if at least two characters have different formatting. |
Source code in src/core/office_processor.py
_has_win32com_range_hyperlinks
Checks whether a win32com Range contains hyperlinks.
| PARAMETER | DESCRIPTION |
|---|---|
| rng | A win32com |

| RETURNS | DESCRIPTION |
|---|---|
| bool | True if the range has at least one hyperlink. |
Source code in src/core/office_processor.py
_win32com_range_runs_to_html
Converts a win32com Range's characters to inline HTML.
Groups consecutive characters with identical formatting and hyperlink
URL into runs, skipping paragraph marks (\r), then emits HTML via
_wrap_with_tags. Characters inside a hyperlink are tagged with
<a href="..."> so the LLM can preserve links during translation.
| PARAMETER | DESCRIPTION |
|---|---|
| rng | A win32com |

| RETURNS | DESCRIPTION |
|---|---|
| str | HTML string representing the range's formatted text. |
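
The grouping step described above can be sketched on plain tuples instead of COM character ranges. Here `chars_to_html` and `wrap_with_tags` are hypothetical stand-ins (the latter only loosely mirrors the role of _wrap_with_tags), and each character is assumed to be a `(char, (bold, italic), url)` tuple.

```python
from html import escape
from itertools import groupby


def wrap_with_tags(text, bold, italic, url):
    """Wrap escaped text in <b>/<i>/<a> tags (simplified stand-in)."""
    out = escape(text)
    if bold:
        out = f"<b>{out}</b>"
    if italic:
        out = f"<i>{out}</i>"
    if url:
        out = f'<a href="{escape(url, quote=True)}">{out}</a>'
    return out


def chars_to_html(chars):
    """Group consecutive characters sharing formatting and URL into runs."""
    # Paragraph marks carry no visible text, so drop them first.
    visible = [c for c in chars if c[0] != "\r"]
    parts = []
    for (fmt, url), group in groupby(visible, key=lambda c: (c[1], c[2])):
        text = "".join(c[0] for c in group)
        parts.append(wrap_with_tags(text, fmt[0], fmt[1], url))
    return "".join(parts)
```

Including the URL in the grouping key means a hyperlink boundary starts a new run even when the visual formatting is unchanged.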
Source code in src/core/office_processor.py
_has_win32com_word_mixed_formatting
Checks whether a win32com Word paragraph has mixed per-char formatting.
Delegates to _has_win32com_range_mixed_formatting on the
paragraph's Range.
| PARAMETER | DESCRIPTION |
|---|---|
| para | A win32com |

| RETURNS | DESCRIPTION |
|---|---|
| bool | True if at least two characters have different formatting. |
Source code in src/core/office_processor.py
_has_win32com_word_hyperlinks
Checks whether a win32com Word paragraph contains hyperlinks.
Delegates to _has_win32com_range_hyperlinks on the paragraph's
Range.
| PARAMETER | DESCRIPTION |
|---|---|
| para | A win32com |

| RETURNS | DESCRIPTION |
|---|---|
| bool | True if the paragraph has at least one hyperlink. |
Source code in src/core/office_processor.py
_win32com_word_runs_to_html
Converts a win32com Word paragraph's characters to inline HTML.
Delegates to _win32com_range_runs_to_html on the paragraph's
Range.
| PARAMETER | DESCRIPTION |
|---|---|
| para | A win32com |

| RETURNS | DESCRIPTION |
|---|---|
| str | HTML string representing the paragraph's formatted text. |
Source code in src/core/office_processor.py
_extract_win32com_word
Extracts text from a Word document via win32com.
For paragraphs with mixed per-run formatting, inline HTML is emitted
via _win32com_word_runs_to_html so the LLM can preserve it.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | Path to the .doc or .docx file. |

| RETURNS | DESCRIPTION |
|---|---|
| list | (location_key, text) pairs. |
Source code in src/core/office_processor.py
_inject_win32com_word_html_runs
_inject_win32com_word_html_runs(
doc, rng, html_text, original_text="", *, is_cell=False, target_lang=""
)
Replaces a win32com Word range's text with HTML-formatted segments.
Parses html_text via _parse_html_formatting, sets the full
plain text on the range, then applies per-segment formatting by
creating sub-ranges via doc.Range(start, end).
The original font Name is preserved on the whole range (unless source and target script families differ).
| PARAMETER | DESCRIPTION |
|---|---|
| doc | The win32com Word |
| rng | The target |
| html_text | Translated text with inline |
| original_text | The text before translation (for script detection). |
| is_cell | True when injecting into a table cell (no trailing |
| target_lang | Target language name for font substitution. |
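
The parse-then-apply approach relies on turning tagged text into plain text plus character offsets, which can then be mapped to sub-ranges. The sketch below shows one possible way to compute those offsets with the standard library; `SegmentParser` and `parse_segments` are hypothetical names and are not the module's _parse_html_formatting.

```python
from html.parser import HTMLParser

FORMAT_TAGS = {"b", "i", "u", "s"}  # assumed tag set for this sketch


class SegmentParser(HTMLParser):
    """Collect plain text and (start, end, active_tags) segments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.active = []
        self.plain = ""
        self.segments = []

    def handle_starttag(self, tag, attrs):
        if tag in FORMAT_TAGS:
            self.active.append(tag)

    def handle_endtag(self, tag):
        # Close the most recently opened matching tag, if any.
        for idx in range(len(self.active) - 1, -1, -1):
            if self.active[idx] == tag:
                del self.active[idx]
                break

    def handle_data(self, data):
        start = len(self.plain)
        self.plain += data
        self.segments.append((start, len(self.plain), frozenset(self.active)))


def parse_segments(html_text):
    parser = SegmentParser()
    parser.feed(html_text)
    parser.close()  # flushes any trailing text
    return parser.plain, parser.segments
```

Each `(start, end)` pair is a 0-based offset into the plain text; a caller applying formatting to a document range would add the range's own start position.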
Source code in src/core/office_processor.py
_inject_win32com_word
Injects translations into a Word document via win32com.
For translations containing inline HTML formatting tags, uses
_inject_win32com_word_html_runs to preserve per-run formatting.
Otherwise falls back to uniform font save/restore.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | Source file path. |
| output_path | Output file path. |
| translations | Mapping of location_key to translated text. |
| target_lang | Target language name for font substitution. |
Source code in src/core/office_processor.py
_extract_win32com_excel
Extracts text from an Excel workbook via win32com.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | Path to the .xls or .xlsx file. |

| RETURNS | DESCRIPTION |
|---|---|
| list | (location_key, text) pairs. |
Source code in src/core/office_processor.py
_inject_win32com_excel
Injects translations into an Excel workbook via win32com.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | Source file path. |
| output_path | Output file path. |
| translations | Mapping of location_key to translated text. |
| target_lang | Target language name for font substitution. |
Source code in src/core/office_processor.py
_read_win32com_ppt_run_formatting
Reads inline formatting from a win32com PPT run TextRange.
PPT Font.Color is a ColorFormat object — the BGR integer
is accessed via .RGB. PPT Font.Strikethrough is lowercase 't'.
Superscript/subscript is detected via Font.BaselineOffset:
positive values indicate superscript, negative values indicate subscript.
Background colour is read via Font.Highlight.ForeColor.RGB
(Office 365 / 2019+). Older versions silently return None.
| PARAMETER | DESCRIPTION |
|---|---|
| run_range | A win32com PPT |

| RETURNS | DESCRIPTION |
|---|---|
| tuple | Tuple of (bold, italic, underline, strike, superscript, subscript, font_size_pt, color_hex, bg_color_hex). |
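
Two of the conversions mentioned above are easy to get backwards, so a small sketch may help. Office stores colours as BGR integers (red in the lowest byte), and the sign of the baseline offset distinguishes superscript from subscript. The helper names and the `#rrggbb` output format are assumptions for illustration.

```python
def bgr_int_to_hex(bgr: int) -> str:
    """Convert an Office BGR integer (R + G*256 + B*65536) to #rrggbb."""
    red = bgr & 0xFF
    green = (bgr >> 8) & 0xFF
    blue = (bgr >> 16) & 0xFF
    return f"#{red:02x}{green:02x}{blue:02x}"


def baseline_flags(offset: float) -> tuple[bool, bool]:
    """Return (superscript, subscript) from a baseline offset."""
    return offset > 0, offset < 0
```

Note that pure red is the integer 255 in this encoding, not 0xFF0000.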
Source code in src/core/office_processor.py
_has_win32com_ppt_mixed_formatting
Checks whether a win32com PPT paragraph has mixed per-run formatting.
Iterates para_range.Runs(i) (1-based) and compares formatting tuples.
| PARAMETER | DESCRIPTION |
|---|---|
| para_range | A win32com PPT |

| RETURNS | DESCRIPTION |
|---|---|
| bool | True if at least two runs have different formatting. |
Source code in src/core/office_processor.py
_has_win32com_ppt_hyperlinks
Checks whether a win32com PPT paragraph has hyperlinked runs.
Iterates para_range.Runs(i) and checks each run's
ActionSettings(ppMouseClick).Hyperlink.Address.
| PARAMETER | DESCRIPTION |
|---|---|
| para_range | A win32com PPT |

| RETURNS | DESCRIPTION |
|---|---|
| bool | True if at least one run has a non-empty hyperlink address. |
Source code in src/core/office_processor.py
_win32com_ppt_runs_to_html
Converts a win32com PPT paragraph's runs to inline HTML.
Two-pass: first collects run data, then emits HTML with <span>
only when size/colour actually vary.
| PARAMETER | DESCRIPTION |
|---|---|
| para_range | A win32com PPT |

| RETURNS | DESCRIPTION |
|---|---|
| str | HTML string representing the paragraph's formatted text. |
Source code in src/core/office_processor.py
_extract_win32com_ppt
Extracts text from a PowerPoint presentation via win32com.
For paragraphs with mixed per-run formatting or hyperlinks, inline
HTML is emitted via _win32com_ppt_runs_to_html so the LLM can
preserve them.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | Path to the .ppt or .pptx file. |

| RETURNS | DESCRIPTION |
|---|---|
| list | (location_key, text) pairs. |
Source code in src/core/office_processor.py
_inject_win32com_ppt_html_runs
Replaces a win32com PPT paragraph's text with HTML-formatted segments.
Parses html_text via _parse_html_formatting, sets the full
plain text on the paragraph, then applies per-segment formatting
using para_rng.Characters(offset + 1, length) (1-based).
The original font Name is preserved on the whole paragraph (unless source and target script families differ).
| PARAMETER | DESCRIPTION |
|---|---|
| tf | The win32com PPT |
| p_idx | 1-based paragraph index within the text frame. |
| html_text | Translated text with inline |
| original_text | The text before translation (for script detection). |
| target_lang | Target language name for font substitution. |
Source code in src/core/office_processor.py
_inject_win32com_ppt
Injects translations into a PowerPoint presentation via win32com.
For translations containing inline HTML formatting tags, uses
_inject_win32com_ppt_html_runs to preserve per-run formatting.
Otherwise falls back to uniform font save/restore.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | Source file path. |
| output_path | Output file path. |
| translations | Mapping of location_key to translated text. |
| target_lang | Target language name for font substitution. |
Source code in src/core/office_processor.py
_extract_win32com_word_comments
Extracts comments from a Word document via win32com.
Only top-level comments (where Ancestor is None) are extracted.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | Path to the .doc file. |

| RETURNS | DESCRIPTION |
|---|---|
| list | (location_key, text) pairs with keys like 'comment:{index}'. |
Source code in src/core/office_processor.py
_inject_win32com_word_comments
Injects translated comments into a Word document via win32com.
| PARAMETER | DESCRIPTION |
|---|---|
| output_path | Path to the .doc file to modify in place. |
| translations | Mapping of 'comment:{index}' to translated text. |
Source code in src/core/office_processor.py
_extract_win32com_excel_comments
Extracts cell comments from an Excel workbook via win32com.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | Path to the .xls file. |

| RETURNS | DESCRIPTION |
|---|---|
| list | (location_key, text) pairs with keys like 'comment:{sheet}:{row}:{col}'. |
Source code in src/core/office_processor.py
_inject_win32com_excel_comments
Injects translated comments into an Excel workbook via win32com.
| PARAMETER | DESCRIPTION |
|---|---|
| output_path | Path to the .xls file to modify in place. |
| translations | Mapping of 'comment:{sheet}:{row}:{col}' to translated text. |
Source code in src/core/office_processor.py
_extract_win32com_ppt_comments
Extracts comments from a PowerPoint presentation via win32com.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | Path to the .ppt file. |

| RETURNS | DESCRIPTION |
|---|---|
| list | (location_key, text) pairs with keys like 'comment:{slide_idx}:{comment_idx}'. |
Source code in src/core/office_processor.py
_inject_win32com_ppt_comments
Injects translated comments into a PowerPoint presentation via win32com.
Comment.Text in PowerPoint COM may be read-only. Falls back to deleting and re-adding with the same author and metadata.
| PARAMETER | DESCRIPTION |
|---|---|
| output_path | Path to the .ppt file to modify in place. |
| translations | Mapping of 'comment:{slide_idx}:{comment_idx}' to translated text. |
Source code in src/core/office_processor.py
_uno_file_url
Converts a file path to a file:/// URL for UNO.
| PARAMETER | DESCRIPTION |
|---|---|
| path | File path to convert. |

| RETURNS | DESCRIPTION |
|---|---|
| str | The file URL. |
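
A path-to-URL conversion of this kind can be done with the standard library alone. The following is a sketch under that assumption; `to_file_url` is a hypothetical name and the module may well build the URL differently.

```python
from pathlib import Path


def to_file_url(path: str) -> str:
    """Return an absolute file:/// URL with special characters escaped."""
    # as_uri() requires an absolute path, so resolve relative paths first.
    return Path(path).resolve().as_uri()
```

`Path.as_uri()` percent-encodes characters such as spaces, which matters because UNO expects a well-formed URL rather than a raw path.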
_uno_open
Opens a document via LibreOffice UNO in hidden mode.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | Path to the document. |

| RETURNS | DESCRIPTION |
|---|---|
| object | The UNO document object. Caller MUST call in a |
Source code in src/core/office_processor.py
_uno_save
Saves a UNO document preserving its original format.
Reads the FilterName from the document's own MediaDescriptor
(set during import) and passes it to storeToURL so UNO writes in
the same format as the source file rather than defaulting to ODF.
Falls back to a hardcoded lookup if the descriptor is unavailable.
| PARAMETER | DESCRIPTION |
|---|---|
| doc | The UNO document object. |
| output_path | Destination file path. |
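
The descriptor-first, lookup-second logic can be sketched independently of UNO. In this illustration the media descriptor is modelled as a list of `(name, value)` pairs, and `resolve_filter_name` and `FALLBACK_FILTERS` are hypothetical; the filter names listed are ones LibreOffice is commonly documented to use, but they should be checked against the installed version.

```python
from pathlib import Path

# Assumed fallback table keyed by extension; not taken from the module.
FALLBACK_FILTERS = {
    ".doc": "MS Word 97",
    ".docx": "MS Word 2007 XML",
    ".odt": "writer8",
    ".xls": "MS Excel 97",
    ".xlsx": "Calc MS Excel 2007 XML",
    ".ods": "calc8",
    ".ppt": "MS PowerPoint 97",
    ".pptx": "Impress MS PowerPoint 2007 XML",
    ".odp": "impress8",
}


def resolve_filter_name(media_descriptor, output_path):
    """Prefer the FilterName recorded at import time, else use the table."""
    for name, value in media_descriptor:
        if name == "FilterName" and value:
            return value
    # None signals that no filter is known for this extension.
    return FALLBACK_FILTERS.get(Path(output_path).suffix.lower())
```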
Source code in src/core/office_processor.py
_save_uno_char_props
Saves character formatting properties from a UNO text object.
Reads each property in UNO_CHAR_PROPERTIES via getPropertyValue(). Properties that raise are silently skipped.
| PARAMETER | DESCRIPTION |
|---|---|
| text_obj | A UNO object supporting XPropertySet (paragraph, cell, etc.). |

| RETURNS | DESCRIPTION |
|---|---|
| dict | Mapping of property name to saved value. |
Source code in src/core/office_processor.py
_restore_uno_char_props
Restores previously saved character properties to a UNO text object.
Sets each property independently so a single failure does not prevent other properties from being restored.
When target_lang is provided and "CharFontName" is present in saved, the font name is substituted via _substitute_font when the source and target scripts differ.

| PARAMETER | DESCRIPTION |
|---|---|
| text_obj | A UNO object supporting XPropertySet. |
| saved | Mapping of property name to value (from _save_uno_char_props). |
| original_text | The text before translation (for script detection). |
| translated_text | The text after translation (for script detection). |
| target_lang | Target language name for font substitution. |
Source code in src/core/office_processor.py
_read_uno_effective_formatting
Reads the effective (resolved) formatting from a UNO text object.
Returns the effective values, which include formatting inherited from paragraph/character styles.
Note: UNO's CharPosture returns a uno.Enum (FontSlant) object,
not a plain integer. Comparing enum != 0 always evaluates to
True, so we detect the enum via its .value string attribute
(e.g. "NONE", "ITALIC").
| PARAMETER | DESCRIPTION |
|---|---|
| obj | A UNO object supporting getPropertyValue (paragraph, portion, or text cursor). |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[bool, bool, bool, bool, bool, bool] | (bold, italic, underline, strike, superscript, subscript) booleans. |
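
The enum pitfall in the note can be reproduced without LibreOffice by using a stand-in class. `FakeUnoEnum` below is a hypothetical double that only mimics the `.value` attribute, and treating "DONTKNOW" as non-italic is an assumption of this sketch.

```python
class FakeUnoEnum:
    """Stand-in for uno.Enum: carries a type name and a string value."""

    def __init__(self, type_name: str, value: str):
        self.typeName = type_name
        self.value = value


def is_italic(posture) -> bool:
    """Detect italics from either an enum-like object or a plain number."""
    value = getattr(posture, "value", None)
    if isinstance(value, str):
        return value not in ("NONE", "DONTKNOW")
    return bool(posture)


upright = FakeUnoEnum("com.sun.star.awt.FontSlant", "NONE")
# The naive check is always True because an object never equals 0.
naive_result = upright != 0
```

Here `naive_result` is `True` even though the posture is "NONE", which is exactly the false positive the string comparison avoids.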
Source code in src/core/office_processor.py
_read_uno_portion_formatting
Reads effective inline formatting flags from a UNO text portion.
Delegates to _read_uno_effective_formatting which handles the
uno.Enum comparison for CharPosture.
| PARAMETER | DESCRIPTION |
|---|---|
| portion | A UNO TextPortion object (XPropertySet). |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[bool, bool, bool, bool, bool, bool] | (bold, italic, underline, strike, superscript, subscript) booleans. |
Source code in src/core/office_processor.py
_read_uno_portion_bg_hex
Reads background/highlight colour from a UNO text portion.
Checks CharHighlight first, then CharBackColor.
Both are integer RGB values; -1 / 0xFFFFFFFF means no colour.
| PARAMETER | DESCRIPTION |
|---|---|
| portion | A UNO TextPortion object (XPropertySet). |

| RETURNS | DESCRIPTION |
|---|---|
| str \| None | Lowercase hex colour string like |
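
The sentinel handling and lookup order can be sketched with a plain dict in place of a UNO portion. The helper names and the `#rrggbb` output format are assumptions here, since the exact string format is not shown in the description.

```python
NO_COLOUR = (-1, 0xFFFFFFFF)  # both spellings of "no colour set"


def colour_int_to_hex(value):
    """Return a lowercase #rrggbb string, or None for unset colours."""
    if value is None or value in NO_COLOUR:
        return None
    return f"#{value & 0xFFFFFF:06x}"


def portion_bg_hex(props: dict):
    """Check CharHighlight first, then fall back to CharBackColor."""
    for name in ("CharHighlight", "CharBackColor"):
        result = colour_int_to_hex(props.get(name))
        if result is not None:
            return result
    return None
```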
Source code in src/core/office_processor.py
_read_uno_portion_full_formatting
Reads formatting flags plus font size, colour and bg from a UNO portion.
Extends _read_uno_portion_formatting with CharHeight (float pt),
CharColor (int → hex), and background colour via
_read_uno_portion_bg_hex.
| PARAMETER | DESCRIPTION |
|---|---|
| portion | A UNO TextPortion object (XPropertySet). |

| RETURNS | DESCRIPTION |
|---|---|
| tuple | (bold, italic, underline, strike, superscript, subscript, font_size_pt, color_hex, bg_color_hex). |
Source code in src/core/office_processor.py
_has_uno_mixed_formatting
Checks whether a UNO paragraph has text portions with differing formatting.
Compares each portion's full formatting (bold, italic, underline, strike, superscript, subscript, font size, colour, background colour). Only considers portions with TextPortionType == "Text" and non-empty text. Returns False if 0 or 1 text portions remain.
| PARAMETER | DESCRIPTION |
|---|---|
| para | A UNO paragraph supporting createEnumeration(). |

| RETURNS | DESCRIPTION |
|---|---|
| bool | True if at least two text portions have different formatting. |
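
The filtering and comparison rule can be shown on plain dictionaries. In this sketch each portion is assumed to be a dict with hypothetical `type`, `text`, and `fmt` keys, where `fmt` is the full formatting tuple.

```python
def has_mixed_formatting(portions) -> bool:
    """True when at least two non-empty text portions differ in formatting."""
    formats = [
        p["fmt"]
        for p in portions
        if p["type"] == "Text" and p["text"]
    ]
    # Zero or one remaining portion can never be "mixed".
    return len(set(formats)) > 1
```

Non-text portions (fields, bookmarks) and empty strings are dropped before comparing, so they cannot cause a false positive.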
Source code in src/core/office_processor.py
_has_uno_hyperlinks
Checks whether a UNO paragraph has any portions with hyperlinks.
| PARAMETER | DESCRIPTION |
|---|---|
| para | A UNO paragraph supporting createEnumeration(). |

| RETURNS | DESCRIPTION |
|---|---|
| bool | True if at least one text portion has a non-empty HyperLinkURL. |
Source code in src/core/office_processor.py
_uno_runs_to_html
Converts a UNO paragraph's text portions to inline HTML.
Two-pass approach: first collects all portion data to detect
size/colour/bg variation, then emits HTML with <span> only when
needed. Portions with hyperlinks are wrapped in <a href="..."> tags.
| PARAMETER | DESCRIPTION |
|---|---|
| para | A UNO paragraph supporting createEnumeration(). |

| RETURNS | DESCRIPTION |
|---|---|
| str | HTML string representing the paragraph's formatted text. |
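
The two-pass idea is that a `<span>` is only worth emitting when an attribute actually varies across the paragraph. The sketch below demonstrates this on plain dicts; `runs_to_html`, the dict keys, and the inline `style` syntax are assumptions, and hyperlink and background handling are omitted for brevity.

```python
from html import escape


def runs_to_html(runs) -> str:
    """Emit inline HTML, adding <span> styles only for varying attributes."""
    # Pass 1: find out which attributes differ between runs.
    size_varies = len({r["size"] for r in runs}) > 1
    color_varies = len({r["color"] for r in runs}) > 1

    # Pass 2: emit each run, styling only what varies.
    parts = []
    for r in runs:
        html = escape(r["text"])
        if r["bold"]:
            html = f"<b>{html}</b>"
        styles = []
        if size_varies:
            styles.append(f"font-size:{r['size']:g}pt")
        if color_varies:
            styles.append(f"color:{r['color']}")
        if styles:
            style = ";".join(styles)
            html = f'<span style="{style}">{html}</span>'
        parts.append(html)
    return "".join(parts)
```

Keeping uniform attributes out of the markup gives the translator less tagging to preserve, which is the motivation suggested by the description.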
Source code in src/core/office_processor.py
_save_uno_first_portion_props
Reads UNO_CHAR_PROPERTIES from the first text portion of a paragraph.
This captures the actual font properties (name, size, colour) from the first run rather than from the paragraph level, which may differ.
| PARAMETER | DESCRIPTION |
|---|---|
| para | A UNO paragraph supporting createEnumeration(). |

| RETURNS | DESCRIPTION |
|---|---|
| dict[str, object] | Mapping of property names to values. Empty if no text portion is found. |
Source code in src/core/office_processor.py
_inject_uno_html_runs
Replaces a UNO paragraph's text with HTML-formatted segments.
Parses html_text via _parse_html_formatting, sets the full
plain text on the paragraph, then applies per-segment formatting via
a text cursor.
Base properties (font name, size, colour) from base_props are restored
on the whole paragraph first, excluding the four formatting properties
that are applied per-segment. CharFontName is substituted with a
compatible font when original and translated script families differ.
| PARAMETER | DESCRIPTION |
|---|---|
| para | A UNO paragraph object. |
| html_text | Translated text with inline <b>/<i>/<u>/<s> tags. |
| base_props | Saved properties from |
| target_lang | Target language name for font substitution. |
Source code in src/core/office_processor.py
_inject_uno_impress_html_runs
Impress-specific variant of _inject_uno_html_runs.
Impress text cursors do not implement XParagraphCursor
(no gotoStartOfParagraph/gotoEndOfParagraph). This
function uses pure offset-based positioning via goRight
from the paragraph start range instead.
| PARAMETER | DESCRIPTION |
|---|---|
| para | A UNO Impress paragraph object. |
| html_text | Translated text with inline HTML tags. |
| base_props | Saved properties from |
| target_lang | Target language name for font substitution. |
Source code in src/core/office_processor.py
_inject_uno_impress_para_text
Injects translated text into a single UNO Impress paragraph.
Uses _inject_uno_impress_html_runs for HTML-tagged text,
plain setString with property save/restore otherwise.
| PARAMETER | DESCRIPTION |
|---|---|
| para | A UNO Impress paragraph object. |
| text | Translated text (plain or HTML-tagged). |
| target_lang | Target language name for font substitution. |
Source code in src/core/office_processor.py
_extract_uno_writer
Extracts text from a Writer document via UNO.
When a paragraph has mixed per-run formatting (e.g. bold + italic portions), the text is encoded as inline HTML so the LLM can preserve formatting tags.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | Path to the .doc or .docx file. |

| RETURNS | DESCRIPTION |
|---|---|
| list | (location_key, text) pairs, where text is plain text or inline HTML. |
Source code in src/core/office_processor.py
_inject_uno_para_text
Injects translated text into a single UNO paragraph.
Dispatches to _inject_uno_html_runs when text contains inline
HTML formatting tags, otherwise uses plain setString with
paragraph-level property save/restore.
| PARAMETER | DESCRIPTION |
|---|---|
| para | A UNO paragraph object. |
| text | Translated text (plain or HTML-tagged). |
| target_lang | Target language name for font substitution. |
Source code in src/core/office_processor.py
_inject_uno_cell_text
Injects translated text into a UNO table cell.
For single-paragraph cells with HTML tags, dispatches to
_inject_uno_html_runs. Otherwise uses plain setString
with cell-level property save/restore.
| PARAMETER | DESCRIPTION |
|---|---|
| cell | A UNO table cell object. |
| text | Translated text (plain or HTML-tagged). |
| target_lang | Target language name for font substitution. |
Source code in src/core/office_processor.py
_inject_uno_writer
Injects translations into a Writer document via UNO.
When the translated text contains inline HTML formatting tags
(<b>, <i>, <u>, <s>), per-segment formatting is
applied via _inject_uno_html_runs. Otherwise, plain text is
set with paragraph-level property restore.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | Source file path. |
| output_path | Output file path. |
| translations | Mapping of location_key to translated text. |
| target_lang | Target language name for font substitution. |
Source code in src/core/office_processor.py
_extract_uno_calc
Extracts text from a Calc spreadsheet via UNO.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | Path to the .xls or .xlsx file. |

| RETURNS | DESCRIPTION |
|---|---|
| list | (location_key, text) pairs. |
Source code in src/core/office_processor.py
_inject_uno_calc
Injects translations into a Calc spreadsheet via UNO.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | Source file path. |
| output_path | Output file path. |
| translations | Mapping of location_key to translated text. |
| target_lang | Target language name for font substitution. |
Source code in src/core/office_processor.py
_extract_uno_impress
Extracts text from an Impress presentation via UNO.
When any paragraph within a shape has mixed per-run formatting, the
entire shape is extracted as inline HTML via _uno_runs_to_html
(paragraphs joined by newlines). Otherwise, plain text is returned.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | Path to the .ppt or .pptx file. |

| RETURNS | DESCRIPTION |
|---|---|
| list | (location_key, text) pairs. |
Source code in src/core/office_processor.py
_inject_uno_impress
Injects translations into an Impress presentation via UNO.
When the translated text contains inline HTML formatting tags,
dispatches to _inject_uno_impress_para_text for per-run
formatting on each paragraph (lines separated by newlines).
Uses offset-based cursor positioning instead of XParagraphCursor
methods. Otherwise, uses plain setString with shape-level
property save/restore.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | Source file path. |
| output_path | Output file path. |
| translations | Mapping of location_key to translated text. |
| target_lang | Target language name for font substitution. |
Source code in src/core/office_processor.py
_extract_uno_writer_comments
Extracts annotation comments from a Writer document via UNO.
Enumerates text fields and filters by Annotation service.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .doc file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs with keys like 'comment:{index}'.
TYPE:
|
Source code in src/core/office_processor.py
_inject_uno_writer_comments
¶
Injects translated comments into a Writer document via UNO.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .doc file to modify in place.
TYPE:
|
translations
|
Mapping of 'comment:{index}' to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_extract_uno_calc_comments
¶
Extracts cell annotations from a Calc spreadsheet via UNO.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .xls file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs with keys like 'comment:{sheet}:{row}:{col}' (1-based for XLSX compatibility).
TYPE:
|
Source code in src/core/office_processor.py
_inject_uno_calc_comments
¶
Injects translated comments into a Calc spreadsheet via UNO.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .xls file to modify in place.
TYPE:
|
translations
|
Mapping of 'comment:{sheet}:{row}:{col}' to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_extract_uno_impress_comments
¶
Extracts annotations from an Impress presentation via UNO.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .ppt file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs with keys like 'comment:{page_idx}:{anno_idx}'.
TYPE:
|
Source code in src/core/office_processor.py
_inject_uno_impress_comments
¶
Injects translated comments into an Impress presentation via UNO.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .ppt file to modify in place.
TYPE:
|
translations
|
Mapping of 'comment:{page_idx}:{anno_idx}' to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_convert_with_win32com
¶
Converts an office file to another format using win32com SaveAs.
Uses the output extension to determine the application and format code.
| PARAMETER | DESCRIPTION |
|---|---|
input_path
|
Path to the source file.
TYPE:
|
output_path
|
Path for the converted file.
TYPE:
|
Source code in src/core/office_processor.py
_convert_with_uno
¶
Converts an office file to another format using LibreOffice UNO.
Uses the output extension to select the export filter name.
| PARAMETER | DESCRIPTION |
|---|---|
input_path
|
Path to the source file.
TYPE:
|
output_path
|
Path for the converted file.
TYPE:
|
Source code in src/core/office_processor.py
convert_to_modern_format
¶
Converts a legacy/ODF office file to modern format (.docx/.xlsx/.pptx).
Detects the available backend (win32com or UNO) and delegates to the appropriate conversion helper. Returns True on success, False on failure (logs a warning instead of raising).
| PARAMETER | DESCRIPTION |
|---|---|
input_path
|
Path to the translated file in legacy/ODF format.
TYPE:
|
output_path
|
Path for the converted modern format file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
bool
|
True if conversion succeeded, False otherwise.
TYPE:
|
Source code in src/core/office_processor.py
_odf_qnames
¶
Returns cached (tab_qname, linebreak_qname, span_qname, a_qname).
Source code in src/core/office_processor.py
_odf_element_text
¶
Recursively extracts all text content from an ODF element.
Walks the element's childNodes tree. Text nodes (nodeType == 3) have their data collected. Element nodes (nodeType == 1) are recursed into. Tab elements produce a tab character; line-break elements produce a newline.
When preserve_links is True, <text:a> hyperlinks are emitted as
<a href="url">text</a> HTML tags instead of plain text. This is
used during extraction so the LLM sees (and preserves) hyperlink
structure.
| PARAMETER | DESCRIPTION |
|---|---|
element
|
An odfpy element node.
TYPE:
|
preserve_links
|
If True, emit
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
str
|
The concatenated text content (may contain
TYPE:
|
Source code in src/core/office_processor.py
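The traversal described above can be sketched with the standard library's xml.dom.minidom, which exposes the same nodeType and childNodes attributes. Using minidom instead of odfpy is an assumption made so the example runs without extra packages; the name element_text and the sample markup are illustrative only, and the preserve_links branch is left out.

```python
from xml.dom import minidom

TEXT_NODE = 3      # numeric nodeType for text nodes
ELEMENT_NODE = 1   # numeric nodeType for element nodes


def element_text(node) -> str:
    """Recursively collect text; tabs and line breaks become characters."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == ELEMENT_NODE:
            if child.tagName == "text:tab":
                parts.append("\t")
            elif child.tagName == "text:line-break":
                parts.append("\n")
            else:
                # Any other element (e.g. a span) is recursed into.
                parts.append(element_text(child))
    return "".join(parts)


SAMPLE = (
    '<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    "Hello<text:tab/><text:span>bold</text:span><text:line-break/>world"
    "</text:p>"
)
paragraph = minidom.parseString(SAMPLE).documentElement
print(repr(element_text(paragraph)))
```

The sample paragraph mixes a direct text node, a tab, a nested span and a line break, which covers each branch of the walk.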
_odf_replace_text
¶
Replaces all text content in an ODF element with new text.
Preserves the first <text:span>'s stylename attribute so that
character formatting (bold, italic, font, etc.) is retained. If no
span is found, falls back to plain addText().
When new_text contains <a href="..."> HTML tags (from hyperlink
preservation during extraction), parses them via
_parse_html_formatting and creates <text:a> elements with the
correct xlink:href attribute.
Note
odfpy's removeChild() cannot handle text nodes (nodeType == 3) because its internal cache assertion requires Element instances. We manually clear childNodes and only update caches for Elements.
| PARAMETER | DESCRIPTION |
|---|---|
element
|
An odfpy element node (typically a P or H element).
TYPE:
|
new_text
|
The replacement text (may contain
TYPE:
|
target_lang
|
Target language name for font substitution.
TYPE:
|
Source code in src/core/office_processor.py
_is_inside_table_cell
¶
Checks if an ODF element is nested inside a table cell.
| PARAMETER | DESCRIPTION |
|---|---|
element
|
An odfpy element node.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
bool
|
True if a TableCell ancestor is found.
TYPE:
|
Source code in src/core/office_processor.py
_resolve_para_hyperlink_rels
¶
Resolves hyperlink r:id values to URLs for a paragraph.
Scans para._element for <w:hyperlink> children, looks up
each r:id in the document's relationship collection, and
returns a mapping of r:id → target URL.
| PARAMETER | DESCRIPTION |
|---|---|
para
|
A python-docx Paragraph object.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
dict[str, str]
|
dict mapping |
dict[str, str]
|
external hyperlinks exist. |
Source code in src/core/office_processor.py
_extract_para_with_links
¶
Extracts text from a paragraph, preserving hyperlinks as <a> tags.
Uses the HTML path (_runs_to_html) when the paragraph has mixed
formatting or <w:hyperlink> children. Falls back to para.text
for simple uniform-formatting paragraphs without hyperlinks.
| PARAMETER | DESCRIPTION |
|---|---|
para
|
A python-docx Paragraph object.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
str
|
Plain text or inline HTML string. |
Source code in src/core/office_processor.py
_extract_python_docx
¶
Extracts text from a DOCX file via python-docx.
Extracts paragraph text and table cell text. Each paragraph or cell with non-empty text gets a unique location key. When a paragraph has mixed formatting or hyperlinks, the text is encoded as inline HTML so the LLM can preserve formatting and link tags.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .docx file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs — plain text or inline HTML.
TYPE:
|
Source code in src/core/office_processor.py
_set_odf_default_rtl
¶
Rewrites file_path (an ODF zip) so paragraphs default to RTL.
Adds — or extends — the <style:default-style style:family="paragraph">
block in styles.xml to set style:writing-mode="rl-tb" and
fo:text-align="end". Idempotent: running on an already-RTL
document is a no-op.
Source code in src/core/office_processor.py
_set_docx_paragraph_rtl
¶
Adds <w:bidi/> to the paragraph and <w:rtl/> to every run.
Word and LibreOffice Writer use these flags to flip paragraph direction and to mirror punctuation (parentheses, quotes) at run boundaries. Without them, an Arabic / Hebrew paragraph renders flush-left with broken punctuation.
Source code in src/core/office_processor.py
_inject_python_docx
¶
Injects translations into a DOCX file via python-docx.
When the translated text contains inline HTML formatting tags
(<b>, <i>, <u>, <s>, <a>), _inject_html_runs
creates per-run formatting and hyperlink wrappers. Otherwise falls
back to _replace_paragraph_text.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Source file path.
TYPE:
|
output_path
|
Output file path.
TYPE:
|
translations
|
Mapping of location_key to translated text.
TYPE:
|
target_lang
|
Target language name; when RTL, every paragraph in
the document is marked with
TYPE:
|
Source code in src/core/office_processor.py
_extract_python_xlsx
¶
Extracts text from an XLSX file via openpyxl.
Iterates all sheets and collects cells with string values.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .xlsx file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_python_xlsx
¶
Injects translations into an XLSX file via openpyxl.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Source file path.
TYPE:
|
output_path
|
Output file path.
TYPE:
|
translations
|
Mapping of location_key to translated text.
TYPE:
|
target_lang
|
Target language name for font substitution.
TYPE:
|
Source code in src/core/office_processor.py
_walk_pptx_text_shapes
¶
Yields (shape_path, leaf_shape) for every text-bearing shape.
Recurses into shape groups via duck-typing on .shapes: a
GroupShape exposes child shapes there, a regular text box
doesn't. The returned shape_path is a dotted index chain
("0", "0.1", "2.0.3", …) so leaf positions stay stable
across runs and survive the extract → inject round trip.
Source code in src/core/office_processor.py
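The dotted-path recursion can be illustrated without python-pptx. In the sketch below, FakeTextBox and FakeGroup are hypothetical stand-ins for real shapes, and the has_text_frame check is an assumption about how text-bearing leaves are recognised; only the duck-typing on .shapes comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class FakeTextBox:
    """Stand-in for a text-bearing shape (illustrative only)."""
    text: str
    has_text_frame: bool = True


@dataclass
class FakeGroup:
    """Stand-in for a group shape: exposes its children through .shapes."""
    shapes: list = field(default_factory=list)


def walk_text_shapes(shapes, prefix: str = "") -> Iterator[tuple]:
    """Yield (dotted_path, leaf_shape) pairs, recursing into groups."""
    for index, shape in enumerate(shapes):
        path = f"{prefix}.{index}" if prefix else str(index)
        children = getattr(shape, "shapes", None)
        if children is not None:
            # Group: descend, extending the index chain.
            yield from walk_text_shapes(children, path)
        elif getattr(shape, "has_text_frame", False):
            yield path, shape


slide_shapes = [
    FakeTextBox("title"),
    FakeGroup([FakeTextBox("a"), FakeGroup([FakeTextBox("b")])]),
]
paths = [path for path, _ in walk_text_shapes(slide_shapes)]
print(paths)
```

Because each path is derived purely from sibling positions, running the same walk during injection reproduces the same keys as long as the shape tree is unchanged.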
_extract_python_pptx
¶
Extracts text from a PPTX file via python-pptx.
Iterates slides and recurses through shape groups, then walks
paragraphs and runs of every text frame. Each non-empty paragraph
gets a location key encoding the slide + dotted shape path + para
index so grouped text round-trips through inject. Paragraphs with
mixed formatting or hyperlinks are encoded as inline HTML so the
LLM can preserve formatting and <a> tags.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .pptx file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs — plain text or inline HTML.
TYPE:
|
Source code in src/core/office_processor.py
_set_pptx_paragraph_rtl
¶
Adds rtl="1" to a python-pptx paragraph's <a:pPr>.
PowerPoint and Keynote use this attribute to flip text-frame paragraph direction. Idempotent.
Source code in src/core/office_processor.py
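A minimal sketch of the attribute change, written against raw DrawingML with xml.etree.ElementTree instead of python-pptx (an assumption made so the example has no third-party dependency). The function name is illustrative.

```python
import xml.etree.ElementTree as ET

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def set_drawingml_paragraph_rtl(p_el) -> None:
    """Set rtl="1" on the paragraph's <a:pPr>, creating it if missing."""
    ppr = p_el.find(f"{{{A_NS}}}pPr")
    if ppr is None:
        ppr = ET.Element(f"{{{A_NS}}}pPr")
        p_el.insert(0, ppr)  # a:pPr precedes the runs inside a:p
    ppr.set("rtl", "1")


p = ET.fromstring(f'<a:p xmlns:a="{A_NS}"><a:r><a:t>text</a:t></a:r></a:p>')
set_drawingml_paragraph_rtl(p)
set_drawingml_paragraph_rtl(p)  # second call leaves the element unchanged
```

Setting an attribute is naturally idempotent; the only state that needs guarding is the creation of the properties element.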
_inject_python_pptx
¶
Injects translations into a PPTX file via python-pptx.
For each translated paragraph: puts all text in the first run
and clears other runs (preserves first run's formatting).
When the translated text contains <a> tags, hyperlink
relationships are created via the slide part.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Source file path.
TYPE:
|
output_path
|
Output file path.
TYPE:
|
translations
|
Mapping of location_key to translated text.
TYPE:
|
target_lang
|
Target language; when RTL, every paragraph in every
text frame is marked with
TYPE:
|
Source code in src/core/office_processor.py
_extract_python_odt
¶
Extracts text from an ODT file via odfpy.
Extracts body paragraphs, headings, and table cell text. Paragraphs inside table cells are excluded from body paragraph counting (they are handled via the table iteration).
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .odt file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_python_odt
¶
Injects translations into an ODT file via odfpy.
For paragraphs and headings: replaces all child text with the translated text (inline formatting is not preserved, matching UNO backend behavior). For table cells: replaces text in the first paragraph element.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Source file path.
TYPE:
|
output_path
|
Output file path.
TYPE:
|
translations
|
Mapping of location_key to translated text.
TYPE:
|
target_lang
|
Target language name for font substitution.
TYPE:
|
Source code in src/core/office_processor.py
_extract_python_ods
¶
Extracts text from an ODS file via odfpy.
Iterates all sheets and collects cells with string text content.
Uses the same key format as _extract_python_xlsx:
sheet:{name}:{row}:{col} with 1-based indices.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .ods file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
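The sheet:{name}:{row}:{col} key format can be sketched as a pair of helpers. Both function names are hypothetical, and splitting from the right is a design assumption that keeps a sheet name containing a colon intact; the module's own parsing may differ.

```python
def make_cell_key(sheet: str, row: int, col: int) -> str:
    """Build a location key with 1-based row and column indices."""
    return f"sheet:{sheet}:{row}:{col}"


def parse_cell_key(key: str) -> tuple:
    """Split a location key back into (sheet, row, col)."""
    prefix, rest = key.split(":", 1)
    if prefix != "sheet":
        raise ValueError(f"Not a cell key: {key!r}")
    # Row and column are always the last two fields.
    name, row, col = rest.rsplit(":", 2)
    return name, int(row), int(col)


key = make_cell_key("Q1:Budget", 3, 2)
print(key, parse_cell_key(key))
```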
_inject_python_ods
¶
Injects translations into an ODS file via odfpy.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Source file path.
TYPE:
|
output_path
|
Output file path.
TYPE:
|
translations
|
Mapping of location_key to translated text.
TYPE:
|
target_lang
|
Target language name for font substitution.
TYPE:
|
Source code in src/core/office_processor.py
_extract_python_odp
¶
Extracts text from an ODP file via odfpy.
Iterates presentation pages, draw frames, and paragraphs within.
Each non-empty paragraph gets a location key using the same format
as _extract_python_pptx: slide:{s}:{sh}:{p}.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .odp file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_python_odp
¶
Injects translations into an ODP file via odfpy.
For each translated paragraph: replaces all text content, matching UNO backend behavior.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Source file path.
TYPE:
|
output_path
|
Output file path.
TYPE:
|
translations
|
Mapping of location_key to translated text.
TYPE:
|
target_lang
|
Target language name for font substitution.
TYPE:
|
Source code in src/core/office_processor.py
_extract_python_word
¶
Routes word-category extraction based on file extension.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the document (.docx or .odt).
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_python_word
¶
Routes word-category injection based on file extension.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Source file path.
TYPE:
|
output_path
|
Output file path.
TYPE:
|
translations
|
Mapping of location_key to translated text.
TYPE:
|
target_lang
|
Target language name for font substitution.
TYPE:
|
Source code in src/core/office_processor.py
_extract_python_excel
¶
Routes excel-category extraction based on file extension.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the spreadsheet (.xlsx or .ods).
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_python_excel
¶
Routes excel-category injection based on file extension.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Source file path.
TYPE:
|
output_path
|
Output file path.
TYPE:
|
translations
|
Mapping of location_key to translated text.
TYPE:
|
target_lang
|
Target language name for font substitution.
TYPE:
|
Source code in src/core/office_processor.py
_extract_python_ppt
¶
Routes ppt-category extraction based on file extension.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the presentation (.pptx or .odp).
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_python_ppt
¶
Routes ppt-category injection based on file extension.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Source file path.
TYPE:
|
output_path
|
Output file path.
TYPE:
|
translations
|
Mapping of location_key to translated text.
TYPE:
|
target_lang
|
Target language name for font substitution.
TYPE:
|
Source code in src/core/office_processor.py
_get_file_category
¶
Returns the file category for dispatch.
| PARAMETER | DESCRIPTION |
|---|---|
suffix
|
Lowercase file extension.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
str
|
"word", "excel", or "ppt".
TYPE:
|
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If the extension is not an office format. |
Source code in src/core/office_processor.py
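One plausible shape for this dispatch is a lookup table. The mapping below is assembled from the extensions named elsewhere on this page and is an assumption, as is the wording of the error message.

```python
# Hypothetical mapping; the module's real table is not shown here.
_CATEGORY_BY_SUFFIX = {
    ".docx": "word", ".doc": "word", ".odt": "word",
    ".xlsx": "excel", ".xls": "excel", ".ods": "excel",
    ".pptx": "ppt", ".ppt": "ppt", ".odp": "ppt",
}


def get_file_category(suffix: str) -> str:
    """Return "word", "excel" or "ppt" for a lowercase extension."""
    try:
        return _CATEGORY_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"Not an office format: {suffix!r}") from None


print(get_file_category(".ods"))
```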
_is_fatal_llm_error
¶
Returns True when error_tag is in _FATAL_LLM_ERRORS.
Delegates to :func:src.constants.errors.base_error_tag to strip
the optional :Service suffix the engine appends to AUTH_ERROR
so "AUTH_ERROR:Gemini" matches as fatal alongside the bare
"AUTH_ERROR".
Source code in src/core/office_processor.py
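The suffix stripping can be sketched as follows. The contents of the fatal set and the body of base_error_tag are assumptions; only the AUTH_ERROR and AUTH_ERROR:Gemini examples come from the description above.

```python
# Assumed contents; the real _FATAL_LLM_ERRORS set is not listed on this page.
FATAL_LLM_ERRORS = frozenset({"AUTH_ERROR"})


def base_error_tag(error_tag: str) -> str:
    """Drop an optional ':Service' suffix, e.g. 'AUTH_ERROR:Gemini'."""
    return error_tag.split(":", 1)[0]


def is_fatal_llm_error(error_tag: str) -> bool:
    """Return True when the bare tag is one of the fatal errors."""
    return base_error_tag(error_tag) in FATAL_LLM_ERRORS


print(is_fatal_llm_error("AUTH_ERROR:Gemini"))
```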
_should_translate_images
¶
Checks whether image translation should be attempted for this file.
Returns True when the setting is enabled, OCR is configured, and the
format supports embedded image translation. Modern/ODF formats use
zipfile directly; legacy formats (.doc, .xls, .ppt) use round-trip
conversion to a modern format first.
| PARAMETER | DESCRIPTION |
|---|---|
suffix
|
Lowercase file extension (e.g. ".docx").
TYPE:
|
backend
|
The detected backend identifier (unused, kept for API
consistency with
TYPE:
|
config
|
Optional TranslationConfig snapshot; falls back to load_setting().
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
bool
|
True if image translation should proceed.
TYPE:
|
Source code in src/core/office_processor.py
_should_translate_comments
¶
Checks whether comment translation should be attempted for this file.
Returns True when the setting is enabled and the format supports comment extraction. Comment handling uses its own libraries (python-docx, openpyxl, python-pptx, zipfile+lxml) independently of the text-extraction backend, so no backend restriction is needed.
| PARAMETER | DESCRIPTION |
|---|---|
suffix
|
Lowercase file extension (e.g. ".docx").
TYPE:
|
backend
|
The detected backend identifier (unused, kept for API
consistency with
TYPE:
|
config
|
Optional TranslationConfig snapshot; falls back to load_setting().
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
bool
|
True if comment translation should proceed.
TYPE:
|
Source code in src/core/office_processor.py
_should_translate_shapes
¶
Checks whether shape/text-box translation should be attempted.
Returns True when the setting is enabled and the format supports shape extraction. PPT formats are excluded because their primary extractors already handle shapes.
| PARAMETER | DESCRIPTION |
|---|---|
suffix
|
Lowercase file extension (e.g. ".docx").
TYPE:
|
backend
|
The detected backend identifier (unused, kept for API
consistency with
TYPE:
|
config
|
Optional TranslationConfig snapshot; falls back to load_setting().
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
bool
|
True if shape translation should proceed.
TYPE:
|
Source code in src/core/office_processor.py
_extract_comments
¶
Extracts comments from an office file.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the office file.
TYPE:
|
suffix
|
Lowercase file extension.
TYPE:
|
backend
|
Backend identifier for legacy format dispatch.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs for comments.
TYPE:
|
Source code in src/core/office_processor.py
_extract_docx_comments
¶
Extracts comments from a DOCX file via low-level XML access.
Detects <w:hyperlink> elements within comment paragraphs and emits
<a href="..."> HTML tags so that hyperlinks are preserved through
the LLM translation round-trip. Hyperlink URLs are resolved from the
comments part's .rels file.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .docx file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs with keys like 'comment:{id}'.
Text may contain
TYPE:
|
Source code in src/core/office_processor.py
_extract_xlsx_comments
¶
Extracts cell comments from an XLSX file via openpyxl.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .xlsx file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs with keys like 'comment:{sheet}:{row}:{col}'.
TYPE:
|
Source code in src/core/office_processor.py
_inject_comments
¶
Injects translated comments back into the output document.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the output file (already written by inject_fn).
TYPE:
|
translations
|
Mapping of location_key to translated text.
TYPE:
|
suffix
|
Lowercase file extension.
TYPE:
|
backend
|
Backend identifier for legacy format dispatch.
TYPE:
|
Source code in src/core/office_processor.py
_inject_docx_comments
¶
Injects translated comments into a DOCX file via low-level XML.
When a translation contains <a href="..."> tags, the comment's
paragraphs are rebuilt with <w:hyperlink> elements and the
corresponding relationships are added to
word/_rels/comments.xml.rels. Plain-text translations use the
simpler <w:t> replacement path.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .docx file to modify in place.
TYPE:
|
translations
|
Mapping of 'comment:{id}' to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_inject_docx_comment_html
¶
Rebuilds a single comment element's paragraphs from HTML.
Parses html_text via _parse_html_formatting to obtain
_FormattedSegment objects. Segments with hyperlink_url are
wrapped in <w:hyperlink> elements with relationship IDs created
via comments_part.relate_to().
| PARAMETER | DESCRIPTION |
|---|---|
comment_el
|
The
TYPE:
|
html_text
|
Translated HTML string (may contain
TYPE:
|
comments_part
|
The python-docx comments
TYPE:
|
qn
|
The python-docx
TYPE:
|
Source code in src/core/office_processor.py
_patch_docx_comment_rels
¶
Ensures word/_rels/comments.xml.rels is persisted in the DOCX ZIP.
python-docx may not serialize .rels for the comments part
when saved via doc.save(). This function verifies and patches
the ZIP directly if the rels data is missing or stale.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the saved .docx file.
TYPE:
|
comments_part
|
The python-docx comments Part (with .rels data).
TYPE:
|
Source code in src/core/office_processor.py
_inject_xlsx_comments
¶
Injects translated comments into an XLSX file via openpyxl.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .xlsx file to modify in place.
TYPE:
|
translations
|
Mapping of 'comment:{sheet}:{row}:{col}' to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_get_rels_path
¶
Returns the .rels path for a given XML part path inside a ZIP.
E.g. 'word/document.xml' → 'word/_rels/document.xml.rels'.
| PARAMETER | DESCRIPTION |
|---|---|
part_path
|
Path of the XML part inside the ZIP.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
str
|
Path of the corresponding |
Source code in src/core/office_processor.py
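The path mapping follows the Open Packaging Conventions layout, where a part's relationships live in a _rels folder beside it. A minimal sketch, with an illustrative function name:

```python
import posixpath


def get_rels_path(part_path: str) -> str:
    """Map a part path inside the ZIP to its .rels companion."""
    directory, name = posixpath.split(part_path)
    return posixpath.join(directory, "_rels", name + ".rels")


print(get_rels_path("word/document.xml"))
```

posixpath is used rather than os.path because ZIP entry names always use forward slashes, regardless of the host platform.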
_parse_hyperlink_rels
¶
Parses a .rels XML file into {r_id: url} for hyperlinks.
Only external hyperlink relationships (TargetMode="External") are
included.
| PARAMETER | DESCRIPTION |
|---|---|
rels_xml
|
Raw bytes of the
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
dict[str, str]
|
dict mapping relationship IDs to target URLs. |
Source code in src/core/office_processor.py
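A sketch of the filtering described above, using xml.etree.ElementTree (the module may well use lxml instead; that choice is an assumption here). The function name and sample data are illustrative.

```python
import xml.etree.ElementTree as ET

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/"
    "relationships/hyperlink"
)


def parse_hyperlink_rels(rels_xml: bytes) -> dict:
    """Return {r_id: url} for external hyperlink relationships only."""
    root = ET.fromstring(rels_xml)
    result = {}
    for rel in root.iter(f"{{{RELS_NS}}}Relationship"):
        if rel.get("Type") != HYPERLINK_TYPE:
            continue  # images, styles, and other non-hyperlink parts
        if rel.get("TargetMode") != "External":
            continue  # skip anything that is not an external target
        result[rel.get("Id")] = rel.get("Target")
    return result


SAMPLE_RELS = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    f'<Relationships xmlns="{RELS_NS}">'
    f'<Relationship Id="rId1" Type="{HYPERLINK_TYPE}" '
    'Target="https://example.com/" TargetMode="External"/>'
    f'<Relationship Id="rId2" Type="{HYPERLINK_TYPE}" Target="#bookmark"/>'
    '<Relationship Id="rId3" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/image" Target="media/image1.png"/>'
    "</Relationships>"
).encode("utf-8")

links = parse_hyperlink_rels(SAMPLE_RELS)
print(links)
```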
_add_hyperlink_to_rels
¶
Adds a hyperlink relationship to a .rels XML file.
If rels_xml is None, creates a new Relationships document.
| PARAMETER | DESCRIPTION |
|---|---|
rels_xml
|
Existing
TYPE:
|
url
|
The target URL for the hyperlink.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
tuple[bytes, str]
|
Tuple of |
Source code in src/core/office_processor.py
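The sketch below shows one way to append a hyperlink relationship, again with xml.etree.ElementTree. It assumes the returned tuple is (updated rels bytes, new relationship ID) and that IDs follow the rIdN pattern; both are assumptions, since the return description on this page is truncated.

```python
import xml.etree.ElementTree as ET

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/"
    "relationships/hyperlink"
)


def add_hyperlink_to_rels(rels_xml, url: str) -> tuple:
    """Append an external hyperlink relationship; return (xml_bytes, r_id)."""
    # Serialise with a default namespace instead of an ns0: prefix.
    ET.register_namespace("", RELS_NS)
    if rels_xml is None:
        root = ET.Element(f"{{{RELS_NS}}}Relationships")
    else:
        root = ET.fromstring(rels_xml)

    # Pick the lowest unused rIdN.
    existing = {rel.get("Id") for rel in root}
    number = 1
    while f"rId{number}" in existing:
        number += 1
    r_id = f"rId{number}"

    ET.SubElement(
        root,
        f"{{{RELS_NS}}}Relationship",
        {"Id": r_id, "Type": HYPERLINK_TYPE, "Target": url,
         "TargetMode": "External"},
    )
    data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
    return data, r_id


first_xml, first_id = add_hyperlink_to_rels(None, "https://example.com/")
second_xml, second_id = add_hyperlink_to_rels(first_xml, "https://example.org/")
print(first_id, second_id)
```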
_extract_drawingml_text
¶
Extracts plain text from a DrawingML <txBody> element.
Iterates <a:p> paragraphs and joins <a:t> runs within each.
Paragraphs are separated by newlines, and <a:br/> tags are preserved
as newlines.
| PARAMETER | DESCRIPTION |
|---|---|
tx_body_el
|
An lxml element representing
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
str
|
The concatenated plain text.
TYPE:
|
Source code in src/core/office_processor.py
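The paragraph and run walk can be sketched with xml.etree.ElementTree standing in for lxml (an assumption; the two share the element API used here). The function name and sample markup are illustrative.

```python
import xml.etree.ElementTree as ET

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def drawingml_text(tx_body) -> str:
    """Join <a:t> runs per paragraph; <a:br/> and paragraph breaks become newlines."""
    paragraphs = []
    for para in tx_body.iter(f"{{{A_NS}}}p"):
        parts = []
        for child in para:
            if child.tag == f"{{{A_NS}}}r":
                parts.extend(t.text or "" for t in child.iter(f"{{{A_NS}}}t"))
            elif child.tag == f"{{{A_NS}}}br":
                parts.append("\n")
        paragraphs.append("".join(parts))
    return "\n".join(paragraphs)


SAMPLE_BODY = (
    f'<a:txBody xmlns:a="{A_NS}">'
    "<a:p><a:r><a:t>Hello </a:t></a:r><a:r><a:t>world</a:t></a:r></a:p>"
    "<a:p><a:r><a:t>line one</a:t></a:r><a:br/>"
    "<a:r><a:t>line two</a:t></a:r></a:p>"
    "</a:txBody>"
)
body_text = drawingml_text(ET.fromstring(SAMPLE_BODY))
print(repr(body_text))
```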
_inject_drawingml_text
¶
Replaces text in a DrawingML <txBody> element.
Puts all translated text in the first <a:t> of the first
<a:r> in the first <a:p>, and clears remaining <a:t>
elements. Handles newlines by inserting <a:br/> and new <a:r>
elements.
| PARAMETER | DESCRIPTION |
|---|---|
tx_body_el
|
An lxml element representing
TYPE:
|
new_text
|
The replacement text.
TYPE:
|
Source code in src/core/office_processor.py
_inject_drawingml_html_runs
¶
Replaces DrawingML <a:txBody> runs with HTML-formatted segments.
Parses html_text via _parse_html_formatting, clears existing
<a:r> elements, and rebuilds runs with per-segment <a:rPr>
formatting. Falls back to _inject_drawingml_text if no HTML
formatting tags are detected.
When rels_adder is provided, segments with hyperlink_url get an
<a:hlinkClick> element created inside <a:rPr> with a
relationship ID returned by the callback.
| PARAMETER | DESCRIPTION |
|---|---|
tx_body_el
|
An lxml element representing
TYPE:
|
html_text
|
Translated text with inline
TYPE:
|
rels_adder
|
Callback that accepts a URL string and returns a
relationship ID (
TYPE:
|
Source code in src/core/office_processor.py
_extract_pptx_legacy_comments
¶
Extracts legacy comments from an already-opened Presentation.
Legacy comments use <p:cm> elements with <p:text> children
(PowerPoint 2007–2019).
| PARAMETER | DESCRIPTION |
|---|---|
prs
|
A python-pptx Presentation object.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs with keys like
TYPE:
|
Source code in src/core/office_processor.py
_extract_pptx_modern_comments
¶
Extracts modern threaded comments from an already-opened Presentation.
Modern comments use <p188:cm> elements with <txBody> rich
text and an optional <replyLst> (PowerPoint 365, 2021+).
| PARAMETER | DESCRIPTION |
|---|---|
prs
|
A python-pptx Presentation object.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs. Main comments use keys like
TYPE:
|
Source code in src/core/office_processor.py
_extract_pptx_comments
¶
Extracts comments from a PPTX file via low-level XML on slide parts.
Handles both legacy comments (<p:cm>) and modern threaded
comments (<p188:cm>). A single file typically uses one format,
but both are checked.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .pptx file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_pptx_legacy_comments
¶
Injects translated text into legacy PPTX comments.
| PARAMETER | DESCRIPTION |
|---|---|
prs
|
A python-pptx Presentation object.
TYPE:
|
translations
|
Mapping of location keys to translated text.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
bool
|
True if any comment was modified.
TYPE:
|
Source code in src/core/office_processor.py
_inject_pptx_modern_comments
¶
Injects translated text into modern threaded PPTX comments.
| PARAMETER | DESCRIPTION |
|---|---|
prs
|
A python-pptx Presentation object.
TYPE:
|
translations
|
Mapping of location keys to translated text.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
bool
|
True if any comment was modified.
TYPE:
|
Source code in src/core/office_processor.py
_inject_pptx_comments
¶
Injects translated comments into a PPTX file via low-level XML.
Handles both legacy and modern threaded comment formats.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .pptx file to modify in place.
TYPE:
|
translations
|
Mapping of location keys to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_rewrite_zip_content
¶
Atomically rewrites a ZIP archive with modified file data.
Writes to a temporary file then replaces the original. Used by all zip-based inject functions (DOCX/XLSX/ODF shapes and comments).
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the ZIP file to overwrite.
TYPE:
|
file_data
|
Mapping of archive entry names to their (possibly modified) content bytes.
TYPE:
|
all_items
|
Original
TYPE:
|
Source code in src/core/office_processor.py
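The write-then-replace pattern can be sketched with the standard library. This version assumes all_items is a list of zipfile.ZipInfo objects read from the original archive, which is a guess because the parameter description on this page is truncated; the function name is illustrative.

```python
import os
import tempfile
import zipfile


def rewrite_zip(output_path: str, file_data: dict, all_items: list) -> None:
    """Write a new archive beside output_path, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(output_path))
    fd, tmp_path = tempfile.mkstemp(suffix=".tmp", dir=directory)
    os.close(fd)
    try:
        with zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as dst:
            for info in all_items:
                # Passing the ZipInfo keeps each entry's name, order and
                # compression type from the original archive.
                dst.writestr(info, file_data[info.filename])
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_path, output_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Creating the temporary file in the same directory as the target keeps the final os.replace on one filesystem. Preserving entry order and per-entry compression matters for ODF packages, where the mimetype entry is expected to come first and be stored uncompressed.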
_patch_rels_for_embeddings
¶
Restores embedding relationship entries into output rels files.
| PARAMETER | DESCRIPTION |
|---|---|
file_data
|
Mutable mapping of output ZIP entries (modified in place).
TYPE:
|
src_rels
|
Source
TYPE:
|
new_items
|
Accumulator for new ZIP entries (appended if a rels file is entirely missing from file_data).
TYPE:
|
Source code in src/core/office_processor.py
_restore_xlsx_embeddings
¶
Restores embedded objects that openpyxl drops during save.
openpyxl does not preserve OLE/package embedded objects stored
under xl/embeddings/ or their relationship and content-type
entries. This function reads those artefacts from source_path
and patches them back into output_path after openpyxl's save.
| PARAMETER | DESCRIPTION |
|---|---|
source_path
|
Original XLSX before openpyxl processing.
TYPE:
|
output_path
|
XLSX written by openpyxl (modified in place).
TYPE:
|
Source code in src/core/office_processor.py
_extract_odf_paragraph_text
¶
Extracts concatenated paragraph text from an ODF element.
Works with any element that contains <text:p> children, such as
<draw:text-box> and <office:annotation>. Handles mixed
content: direct text, child element text, and tail text.
ODF hyperlinks (<text:a xlink:href="URL">) are emitted as
<a href="URL">text</a> so that downstream HTML-aware injection
can reconstruct them.
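The mixed-content walk (element text, child text, tail text) and the hyperlink rewriting can be illustrated as follows. This is a sketch under assumptions: it uses the standard-library ElementTree rather than lxml, the function names are hypothetical, and no escaping is applied to the URL or text.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
XLINK_NS = "http://www.w3.org/1999/xlink"


def _render(element) -> str:
    """Concatenate an element's own text, each child's content, and each child's tail."""
    parts = [element.text or ""]
    for child in element:
        inner = _render(child)
        if child.tag == f"{{{TEXT_NS}}}a":
            # Emit ODF hyperlinks as HTML anchors.
            href = child.get(f"{{{XLINK_NS}}}href", "")
            inner = f'<a href="{href}">{inner}</a>'
        parts.append(inner)
        parts.append(child.tail or "")
    return "".join(parts)


def extract_paragraph_text(parent) -> str:
    """Join the direct <text:p> children of parent with newlines, stripped."""
    paragraphs = [_render(p) for p in parent.findall(f"{{{TEXT_NS}}}p")]
    return "\n".join(paragraphs).strip()
```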
| PARAMETER | DESCRIPTION |
|---|---|
parent
|
An lxml element containing
TYPE:
|
text_p_tag
|
Fully-qualified
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
str
|
Paragraphs joined by newlines, stripped. May contain
TYPE:
|
Source code in src/core/office_processor.py
_inject_odf_paragraph_text
¶
Replaces text in an ODF element that contains <text:p> children.
Puts the translated text in the first <text:p>, clears its
children, and removes any extra <text:p> elements. Handles newlines
by creating additional <text:p> elements. Works for
both <draw:text-box> and <office:annotation>.
Preserves the first <text:span>'s attributes so that character
formatting (font name, size, bold, etc.) is retained.
| PARAMETER | DESCRIPTION |
|---|---|
parent
|
An lxml element containing
TYPE:
|
new_text
|
The replacement text.
TYPE:
|
text_p_tag
|
Fully-qualified
TYPE:
|
target_lang
|
Target language name for font substitution.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
bool
|
True if the element was modified.
TYPE:
|
Source code in src/core/office_processor.py
_inject_odf_paragraph_text_html
¶
Injects HTML-formatted text (with hyperlinks) into ODF paragraphs.
Parses new_text via _parse_html_formatting and reconstructs
<text:p> children. Segments with hyperlink_url become
<text:a xlink:href="..."> elements; plain segments become direct
text or <text:span> elements (preserving character style).
Falls back to plain-text injection when parsing yields no segments.
| PARAMETER | DESCRIPTION |
|---|---|
parent
|
An lxml element containing
TYPE:
|
new_text
|
Translated text with inline
TYPE:
|
text_p_tag
|
Fully-qualified
TYPE:
|
paras
|
Pre-found
TYPE:
|
target_lang
|
Target language name for font substitution.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
bool
|
True if the element was modified.
TYPE:
|
Source code in src/core/office_processor.py
_extract_odf_comments
¶
Extracts annotation text from an ODF file (.odt, .ods, .odp).
Opens the ZIP archive, parses content.xml, and collects text
from all <office:annotation> elements.
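As a rough illustration of that flow, the sketch below opens the archive, parses content.xml, and builds the two key forms listed in the returns table. It assumes the standard-library ElementTree instead of lxml and treats the `office:name` attribute as the annotation name; `extract_odf_comments` is a hypothetical stand-in, not the module's function.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_odf_comments(odf_file) -> list[tuple[str, str]]:
    """Collect (location_key, text) pairs from every <office:annotation>."""
    with zipfile.ZipFile(odf_file) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    pairs = []
    for index, ann in enumerate(root.iter(f"{{{OFFICE_NS}}}annotation")):
        name = ann.get(f"{{{OFFICE_NS}}}name")
        # Prefer the annotation name; fall back to the positional index.
        key = f"comment:{name}" if name else f"comment:{index}"
        paragraphs = ["".join(p.itertext()) for p in ann.findall(f"{{{TEXT_NS}}}p")]
        text = "\n".join(paragraphs).strip()
        if text:
            pairs.append((key, text))
    return pairs


# Tiny in-memory archive standing in for a real .odt file.
_demo = io.BytesIO()
with zipfile.ZipFile(_demo, "w") as zf:
    zf.writestr(
        "content.xml",
        f'<office:document-content xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">'
        '<office:annotation office:name="note1"><text:p>Check this</text:p></office:annotation>'
        '<office:annotation><text:p>No name</text:p></office:annotation>'
        '</office:document-content>',
    )
_demo.seek(0)
demo_comments = extract_odf_comments(_demo)
```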
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the ODF file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs with keys like 'comment:{annotation_name}' or 'comment:{index}'.
TYPE:
|
Source code in src/core/office_processor.py
_inject_odf_comments
¶
Injects translated comments into an ODF file (.odt, .ods, .odp).
Reads the ZIP archive, modifies <office:annotation> text in
content.xml, and writes the archive back.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the ODF file to modify in place.
TYPE:
|
translations
|
Mapping of
TYPE:
|
Source code in src/core/office_processor.py
_extract_shapes
¶
Extracts text from shapes and text boxes in an office file.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the office file.
TYPE:
|
suffix
|
Lowercase file extension.
TYPE:
|
backend
|
Backend identifier for legacy format dispatch.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs for shape text.
TYPE:
|
Source code in src/core/office_processor.py
_inject_shapes
¶
Injects translated shape text back into the output document.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the output file (already written by inject_fn).
TYPE:
|
translations
|
Mapping of location_key to translated text.
TYPE:
|
suffix
|
Lowercase file extension.
TYPE:
|
backend
|
Backend identifier for legacy format dispatch.
TYPE:
|
Source code in src/core/office_processor.py
_sanitize_sheet_name
¶
Sanitises a translated sheet name for Excel/Calc compatibility.
Removes invalid characters and truncates to 31 characters.
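A minimal sketch of such a sanitiser, assuming the invalid set is the characters Excel rejects in worksheet names (`\ / ? * [ ] :`). The exact character set and the whitespace trimming used by the real function are not shown in this page, so both are assumptions, as is the name `sanitize_sheet_name`.

```python
import re

# Characters Excel rejects in worksheet names: \ / ? * [ ] :
_INVALID_SHEET_CHARS = re.compile(r"[\\/?*\[\]:]")
_MAX_SHEET_NAME_LEN = 31


def sanitize_sheet_name(name: str) -> str:
    """Hypothetical stand-in: strip invalid characters, cap length, never return ''."""
    cleaned = _INVALID_SHEET_CHARS.sub("", name).strip()
    cleaned = cleaned[:_MAX_SHEET_NAME_LEN]
    return cleaned or "Sheet"
```

The fallback matters because a translation made up entirely of rejected characters would otherwise produce an empty, invalid sheet name.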
| PARAMETER | DESCRIPTION |
|---|---|
name
|
Raw translated sheet name.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
str
|
Sanitised name, or "Sheet" if the result is empty.
TYPE:
|
Source code in src/core/office_processor.py
_should_translate_sheet_names
¶
Checks whether sheet-name translation should be attempted.
| PARAMETER | DESCRIPTION |
|---|---|
suffix
|
Lowercase file extension.
TYPE:
|
backend
|
The detected backend identifier (unused).
TYPE:
|
config
|
Optional TranslationConfig; falls back to load_setting().
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
bool
|
True if sheet-name translation should proceed.
TYPE:
|
Source code in src/core/office_processor.py
_extract_sheet_names
¶
Extracts sheet names from a spreadsheet file.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the spreadsheet.
TYPE:
|
suffix
|
Lowercase file extension.
TYPE:
|
backend
|
Backend identifier for legacy format dispatch.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs with keys like 'sheetname:{name}'.
TYPE:
|
Source code in src/core/office_processor.py
_inject_sheet_names
¶
Injects translated sheet names back into the output spreadsheet.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the output file.
TYPE:
|
translations
|
Mapping of location_key to translated text.
TYPE:
|
suffix
|
Lowercase file extension.
TYPE:
|
backend
|
Backend identifier for legacy format dispatch.
TYPE:
|
Source code in src/core/office_processor.py
_extract_xlsx_sheet_names
¶
Extracts sheet names from an XLSX file via ZIP+lxml.
Reads only xl/workbook.xml (a few KB) instead of loading the
full workbook through openpyxl, which would parse all cell data.
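The lightweight read can be sketched like this. The example uses the standard-library ElementTree in place of lxml and a hypothetical function name; it assumes the transitional SpreadsheetML namespace and the 'sheetname:{name}' key format used elsewhere on this page.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace used by xl/workbook.xml (transitional OOXML).
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def read_xlsx_sheet_names(xlsx_file) -> list[tuple[str, str]]:
    """Return ('sheetname:{name}', name) pairs without parsing any cell data."""
    with zipfile.ZipFile(xlsx_file) as zf:
        root = ET.fromstring(zf.read("xl/workbook.xml"))
    pairs = []
    for sheet in root.iter(f"{{{MAIN_NS}}}sheet"):
        name = sheet.get("name")
        if name:
            pairs.append((f"sheetname:{name}", name))
    return pairs


# Tiny in-memory archive standing in for a real workbook.
_demo = io.BytesIO()
with zipfile.ZipFile(_demo, "w") as zf:
    zf.writestr(
        "xl/workbook.xml",
        f'<workbook xmlns="{MAIN_NS}"><sheets>'
        '<sheet name="Sales" sheetId="1"/><sheet name="Costs" sheetId="2"/>'
        '</sheets></workbook>',
    )
_demo.seek(0)
demo_sheets = read_xlsx_sheet_names(_demo)
```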
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .xlsx file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, sheet_name) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_xlsx_sheet_names
¶
Injects translated sheet names into XLSX via ZIP+lxml.
Uses direct XML manipulation to avoid openpyxl's lossy round-trip (which would drop restored embedded objects).
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .xlsx file to modify in place.
TYPE:
|
translations
|
Mapping of 'sheetname:{name}' to translated name.
TYPE:
|
Source code in src/core/office_processor.py
_extract_ods_sheet_names
¶
Extracts sheet names from an ODS file via odfpy.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .ods file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, sheet_name) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_ods_sheet_names
¶
Injects translated sheet names into ODS via ZIP+lxml.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .ods file to modify in place.
TYPE:
|
translations
|
Mapping of 'sheetname:{name}' to translated name.
TYPE:
|
Source code in src/core/office_processor.py
_extract_win32com_excel_sheet_names
¶
Extracts sheet names from an XLS file via win32com.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .xls file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, sheet_name) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_win32com_excel_sheet_names
¶
Injects translated sheet names into an XLS file via win32com.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .xls file to modify in place.
TYPE:
|
translations
|
Mapping of 'sheetname:{name}' to translated name.
TYPE:
|
Source code in src/core/office_processor.py
_extract_uno_calc_sheet_names
¶
Extracts sheet names from an XLS/ODS file via UNO.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the spreadsheet.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, sheet_name) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_uno_calc_sheet_names
¶
Injects translated sheet names into a spreadsheet via UNO.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the spreadsheet to modify in place.
TYPE:
|
translations
|
Mapping of 'sheetname:{name}' to translated name.
TYPE:
|
Source code in src/core/office_processor.py
_should_translate_notes
¶
Checks whether speaker-notes translation should be attempted.
| PARAMETER | DESCRIPTION |
|---|---|
suffix
|
Lowercase file extension.
TYPE:
|
backend
|
The detected backend identifier (unused).
TYPE:
|
config
|
Optional TranslationConfig; falls back to load_setting().
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
bool
|
True if speaker-notes translation should proceed.
TYPE:
|
Source code in src/core/office_processor.py
_extract_notes
¶
Extracts speaker notes from a presentation file.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the presentation.
TYPE:
|
suffix
|
Lowercase file extension.
TYPE:
|
backend
|
Backend identifier for legacy format dispatch.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs with keys like 'note:{slide}:{para}'.
TYPE:
|
Source code in src/core/office_processor.py
_inject_notes
¶
Injects translated speaker notes back into the output presentation.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the output file.
TYPE:
|
translations
|
Mapping of location_key to translated text.
TYPE:
|
suffix
|
Lowercase file extension.
TYPE:
|
backend
|
Backend identifier for legacy format dispatch.
TYPE:
|
Source code in src/core/office_processor.py
_extract_pptx_notes
¶
Extracts speaker notes from a PPTX file via python-pptx.
Paragraphs with mixed formatting or hyperlinks are encoded as inline HTML so the LLM can preserve them.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .pptx file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_pptx_notes
¶
Injects translated speaker notes into a PPTX file via python-pptx.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .pptx file to modify in place.
TYPE:
|
translations
|
Mapping of 'note:{slide}:{para}' to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_extract_odp_notes
¶
Extracts speaker notes from an ODP file via odfpy.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .odp file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_odp_notes
¶
Injects translated speaker notes into an ODP file via odfpy.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .odp file to modify in place.
TYPE:
|
translations
|
Mapping of 'note:{slide}:{para}' to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_extract_win32com_ppt_notes
¶
Extracts speaker notes from a PPT file via win32com.
Iterates the notes page of each slide and extracts text from shapes that have text frames.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .ppt file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_win32com_ppt_notes
¶
Injects translated speaker notes into a PPT file via win32com.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .ppt file to modify in place.
TYPE:
|
translations
|
Mapping of 'note:{slide}:{para}' to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_extract_uno_impress_notes
¶
Extracts speaker notes from a PPT/ODP file via UNO.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the presentation.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_uno_impress_notes
¶
Injects translated speaker notes into a presentation via UNO.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the presentation to modify in place.
TYPE:
|
translations
|
Mapping of 'note:{slide}:{para}' to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_extract_headers_footers
¶
Extracts headers and footers from a word-processing document.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the document.
TYPE:
|
suffix
|
Lowercase file extension.
TYPE:
|
backend
|
Backend identifier for legacy format dispatch.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs with keys like 'header:{section}:{type}:{para}' or 'footer:...'.
TYPE:
|
Source code in src/core/office_processor.py
_inject_headers_footers
¶
Injects translated headers/footers back into the output document.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the output file.
TYPE:
|
translations
|
Mapping of location_key to translated text.
TYPE:
|
suffix
|
Lowercase file extension.
TYPE:
|
backend
|
Backend identifier for legacy format dispatch.
TYPE:
|
Source code in src/core/office_processor.py
_extract_docx_hf_part
¶
Extracts text from a DOCX header/footer part's paragraphs.
| PARAMETER | DESCRIPTION |
|---|---|
paragraphs
|
List of python-docx Paragraph objects.
TYPE:
|
section_idx
|
Section index.
TYPE:
|
hf_type
|
Type identifier ('default', 'first', 'even').
TYPE:
|
prefix
|
'header' or 'footer'.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_extract_docx_headers_footers
¶
Extracts headers and footers from a DOCX file via python-docx.
Extracts default, first-page, and even-page headers/footers from each section. Paragraphs with mixed formatting or hyperlinks are encoded as inline HTML.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .docx file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_docx_headers_footers
¶
Injects translated headers/footers into a DOCX file via python-docx.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .docx file to modify in place.
TYPE:
|
translations
|
Mapping of 'header:...' / 'footer:...' to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_build_odf_hf_map
¶
Builds an ODF header/footer element-tag → (prefix, type) lookup.
Used by both _extract_odt_headers_footers and
_inject_odt_headers_footers to avoid duplicating the mapping.
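One plausible shape for such a lookup is shown below. The set of ODF elements covered and the type labels ('default', 'first', 'even', borrowed from the DOCX helper's `hf_type` values) are assumptions; the real mapping is not visible on this page, and `build_hf_map` is a hypothetical name.

```python
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"


def build_hf_map() -> dict[str, tuple[str, str]]:
    """Hypothetical mapping from fully-qualified element tag to (prefix, type)."""
    local_names = {
        "header": ("header", "default"),
        "footer": ("footer", "default"),
        "header-first": ("header", "first"),
        "footer-first": ("footer", "first"),
        # ODF uses the "-left" variants for left-hand (even) pages.
        "header-left": ("header", "even"),
        "footer-left": ("footer", "even"),
    }
    return {f"{{{STYLE_NS}}}{name}": value for name, value in local_names.items()}
```

Keying on the fully-qualified tag lets both the extract and inject paths classify a child of a master page with a single dictionary lookup.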
Source code in src/core/office_processor.py
_extract_odt_headers_footers
¶
Extracts headers and footers from an ODT file via ZIP+lxml.
ODT stores headers/footers in styles.xml under
<style:master-page> elements.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .odt file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_odt_headers_footers
¶
Injects translated headers/footers into an ODT file via ZIP+lxml.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .odt file to modify in place.
TYPE:
|
translations
|
Mapping of 'header:...' / 'footer:...' to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_extract_win32com_word_headers_footers
¶
Extracts headers/footers from a DOC file via win32com.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .doc file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_win32com_word_headers_footers
¶
Injects translated headers/footers into a DOC file via win32com.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .doc file to modify in place.
TYPE:
|
translations
|
Mapping of 'header:...' / 'footer:...' to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_extract_uno_writer_headers_footers
¶
Extracts headers/footers from a DOC/ODT file via UNO.
UNO stores headers/footers on page styles. Each unique page style is treated as a "section" for key purposes.
Note
Only default headers/footers are extracted. UNO exposes
first-page (HeaderTextFirst) and even-page
(HeaderTextLeft) properties, but they require additional
page-style flags (HeaderIsShared / FirstIsShared) that
vary across LibreOffice versions. Default-only is sufficient
for the vast majority of DOC files.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the document.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_uno_writer_headers_footers
¶
Injects translated headers/footers into a document via UNO.
Only default headers/footers are handled. See the note under
_extract_uno_writer_headers_footers for the first-page and even-page
limitation.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the document to modify in place.
TYPE:
|
translations
|
Mapping of 'header:...' / 'footer:...' to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_extract_footnotes
¶
Extracts footnotes and endnotes from a word-processing document.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the document.
TYPE:
|
suffix
|
Lowercase file extension.
TYPE:
|
backend
|
Backend identifier for legacy format dispatch.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs with keys like 'footnote:{id}' or 'endnote:{id}'.
TYPE:
|
Source code in src/core/office_processor.py
_inject_footnotes
¶
Injects translated footnotes/endnotes into the output document.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the output file.
TYPE:
|
translations
|
Mapping of location_key to translated text.
TYPE:
|
suffix
|
Lowercase file extension.
TYPE:
|
backend
|
Backend identifier for legacy format dispatch.
TYPE:
|
Source code in src/core/office_processor.py
_extract_docx_fn_xml
¶
Extracts text from DOCX footnote or endnote XML.
| PARAMETER | DESCRIPTION |
|---|---|
xml_data
|
Raw XML bytes of footnotes.xml or endnotes.xml.
TYPE:
|
element_tag
|
Fully-qualified tag (e.g.
TYPE:
|
key_prefix
|
Key prefix ('footnote' or 'endnote').
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_extract_docx_footnotes
¶
Extracts footnotes and endnotes from a DOCX file via ZIP+lxml.
Reads word/footnotes.xml and word/endnotes.xml. IDs 0, 1,
and -1 are internal separators and are skipped.
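The ID filter can be sketched on raw XML bytes as follows. This uses the standard-library ElementTree rather than lxml, takes a local element name instead of a fully-qualified tag, and the names `extract_notes` and `SKIPPED_IDS` are hypothetical; the skipped IDs are the ones listed above.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# IDs the text above lists as internal separators.
SKIPPED_IDS = {"0", "1", "-1"}


def extract_notes(xml_data: bytes, element_name: str, key_prefix: str):
    """Return (location_key, text) pairs for each note whose ID is not skipped."""
    root = ET.fromstring(xml_data)
    pairs = []
    for note in root.iter(f"{{{W_NS}}}{element_name}"):
        note_id = note.get(f"{{{W_NS}}}id")
        if note_id is None or note_id in SKIPPED_IDS:
            continue
        paragraphs = []
        for para in note.iter(f"{{{W_NS}}}p"):
            # Concatenate every <w:t> in the paragraph, across runs.
            paragraphs.append("".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t")))
        text = "\n".join(paragraphs).strip()
        if text:
            pairs.append((f"{key_prefix}:{note_id}", text))
    return pairs
```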
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .docx file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_docx_footnotes
¶
Injects translated footnotes/endnotes into a DOCX file via ZIP+lxml.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .docx file to modify in place.
TYPE:
|
translations
|
Mapping of 'footnote:{id}' / 'endnote:{id}' to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_inject_docx_fn_text
¶
Replaces text in DOCX footnote/endnote paragraphs.
Preserves the footnote-reference run (<w:footnoteRef/>) in the
first paragraph and replaces text in subsequent runs.
| PARAMETER | DESCRIPTION |
|---|---|
paras
|
List of
TYPE:
|
new_text
|
Translated text (paragraphs separated by newlines).
TYPE:
|
w_ns
|
WordprocessingML namespace URI.
TYPE:
|
Source code in src/core/office_processor.py
_extract_odt_footnotes
¶
Extracts footnotes and endnotes from an ODT file via ZIP+lxml.
ODT stores footnotes as <text:note> elements inline in
content.xml. The text:note-class attribute distinguishes
footnotes from endnotes.
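A small sketch of that distinction, operating on content.xml bytes with the standard-library ElementTree. It assumes the note's `text:id` attribute supplies the `{id}` part of the key and that the `text:note-class` value is used directly as the key prefix; `extract_odt_notes` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_odt_notes(content_xml: bytes):
    """Return ('footnote:{id}' | 'endnote:{id}', text) pairs from inline notes."""
    root = ET.fromstring(content_xml)
    pairs = []
    for note in root.iter(f"{{{TEXT_NS}}}note"):
        note_class = note.get(f"{{{TEXT_NS}}}note-class", "footnote")
        note_id = note.get(f"{{{TEXT_NS}}}id")
        body = note.find(f"{{{TEXT_NS}}}note-body")
        if note_id is None or body is None:
            continue
        # Only the note body is translated; the citation mark is left alone.
        paragraphs = ["".join(p.itertext()) for p in body.iter(f"{{{TEXT_NS}}}p")]
        text = "\n".join(paragraphs).strip()
        if text:
            pairs.append((f"{note_class}:{note_id}", text))
    return pairs
```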
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .odt file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_odt_footnotes
¶
Injects translated footnotes/endnotes into an ODT file via ZIP+lxml.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .odt file to modify in place.
TYPE:
|
translations
|
Mapping of 'footnote:{id}' / 'endnote:{id}' to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_extract_win32com_word_footnotes
¶
Extracts footnotes and endnotes from a DOC file via win32com.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .doc file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_win32com_word_footnotes
¶
Injects translated footnotes/endnotes into a DOC file via win32com.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .doc file to modify in place.
TYPE:
|
translations
|
Mapping of 'footnote:{id}' / 'endnote:{id}' to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_extract_uno_writer_footnotes
¶
Extracts footnotes and endnotes from a DOC/ODT file via UNO.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the document.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs.
TYPE:
|
Source code in src/core/office_processor.py
_inject_uno_writer_footnotes
¶
Injects translated footnotes/endnotes into a document via UNO.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the document to modify in place.
TYPE:
|
translations
|
Mapping of 'footnote:{id}' / 'endnote:{id}' to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_extract_win32com_word_shapes
¶
Extracts text from shapes/text boxes in a Word document via win32com.
When a shape's text range has mixed per-run formatting, inline HTML is
emitted via _win32com_range_runs_to_html so the LLM can preserve it.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .doc file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs with keys like 'shape:{index}'.
TYPE:
|
Source code in src/core/office_processor.py
_inject_win32com_word_shapes
¶
Injects translated text into Word shapes via win32com.
When the translated text contains inline HTML formatting tags,
per-segment formatting is applied via _inject_win32com_word_html_runs.
Otherwise, plain text is set with uniform font save/restore.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .doc file to modify in place.
TYPE:
|
translations
|
Mapping of 'shape:{index}' to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_extract_win32com_excel_shapes
¶
Extracts text from shapes in an Excel workbook via win32com.
When a shape's text range has mixed per-run formatting, inline HTML is
emitted via _win32com_range_runs_to_html so the LLM can preserve it.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .xls file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs with keys like 'shape:{sheet_name}:{index}'.
TYPE:
|
Source code in src/core/office_processor.py
_inject_win32com_excel_html_runs
¶
Replaces an Excel shape's text with HTML-formatted segments via win32com.
Parses html_text via _parse_html_formatting, sets the full
plain text on the range, then applies per-segment formatting using
Characters(start, length) sub-ranges (1-based indexing).
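The offset bookkeeping behind the 1-based sub-ranges can be shown without a COM object. In this sketch `Segment` is a hypothetical stand-in for whatever `_parse_html_formatting` returns, and `segment_char_ranges` is an illustrative name; only the arithmetic is the point.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for one parsed formatting segment."""
    text: str
    bold: bool = False


def segment_char_ranges(segments):
    """Return (plain_text, [(start, length, segment), ...]) with 1-based starts."""
    plain_text = "".join(seg.text for seg in segments)
    ranges = []
    position = 1  # Characters(start, length) counts from 1, not 0
    for seg in segments:
        if seg.text:
            ranges.append((position, len(seg.text), seg))
        position += len(seg.text)
    return plain_text, ranges
```

With the plain text already set on the range, each `(start, length)` pair identifies the sub-range whose font properties should receive that segment's formatting.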
| PARAMETER | DESCRIPTION |
|---|---|
text_rng
|
A win32com
TYPE:
|
html_text
|
Translated text with inline
TYPE:
|
original_text
|
The text before translation (for script detection).
TYPE:
|
target_lang
|
Target language name for font substitution.
TYPE:
|
Source code in src/core/office_processor.py
_inject_win32com_excel_shapes
¶
Injects translated text into Excel shapes via win32com.
When the translated text contains inline HTML formatting tags,
per-segment formatting is applied via
_inject_win32com_excel_html_runs. Otherwise, plain text is set
with uniform font save/restore.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .xls file to modify in place.
TYPE:
|
translations
|
Mapping of 'shape:{sheet_name}:{index}' to translated text.
TYPE:
|
target_lang
|
Target language name for font substitution.
TYPE:
|
Source code in src/core/office_processor.py
_extract_uno_writer_shapes
¶
Extracts text from shapes/text boxes in a Writer document via UNO.
When any paragraph within a shape has mixed per-run formatting, the
entire shape is extracted as inline HTML via _uno_runs_to_html
(paragraphs joined by newlines). Otherwise, plain text is returned.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .doc or .docx file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs with keys like 'shape:{index}'.
TYPE:
|
Source code in src/core/office_processor.py
_inject_uno_writer_shapes
¶
Injects translated text into Writer shapes via UNO.
When the translated text contains inline HTML formatting tags,
dispatches to _inject_uno_para_text for per-run formatting on
each paragraph (lines separated by newlines). Otherwise, uses plain
setString with shape-level property save/restore.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .doc or .docx file to modify in place.
TYPE:
|
translations
|
Mapping of 'shape:{index}' to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_extract_uno_calc_shapes
¶
Extracts text from shapes in a Calc spreadsheet via UNO.
When any paragraph within a shape has mixed per-run formatting, the
entire shape is extracted as inline HTML via _uno_runs_to_html
(paragraphs joined by newlines). Otherwise, plain text is returned.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .xls file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs with keys like 'shape:{sheet_name}:{index}'.
TYPE:
|
Source code in src/core/office_processor.py
_inject_uno_calc_shapes
¶
Injects translated text into Calc shapes via UNO.
When the translated text contains inline HTML formatting tags,
dispatches to _inject_uno_para_text for per-run formatting on
each paragraph (lines separated by newlines). Otherwise, uses plain
setString with shape-level property save/restore.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .xls file to modify in place.
TYPE:
|
translations
|
Mapping of 'shape:{sheet_name}:{index}' to translated text.
TYPE:
|
Source code in src/core/office_processor.py
_read_txbx_data
¶
Reads plain text and <w:t> elements from a single <wps:txbx>.
Iterates paragraph-by-paragraph to preserve structural newlines between paragraphs.
| PARAMETER | DESCRIPTION |
|---|---|
txbx_el
|
An lxml element for a
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
str
|
Tuple of (plain_text, t_elements) where plain_text is the stripped |
list[object]
|
concatenated text of all paragraphs joined by |
tuple[str, list[object]]
|
t_elements is the flat list of all |
Source code in src/core/office_processor.py
_wps_txbx_to_text_or_html
¶
Extracts text from a <wps:txbx> element, using HTML when formatting varies.
Iterates direct children of each <w:p> paragraph — both <w:r>
runs and <w:hyperlink> wrappers. If run formatting varies or any
hyperlinks are present, wraps the text in inline HTML tags via
_wrap_with_tags and <a href="..."> tags. Otherwise returns
plain text identical to _read_txbx_data. Paragraphs are joined
with '\n'.
Character-style references (<w:rStyle>) are resolved when
char_styles is provided: the style supplies base formatting and
direct <w:rPr> attributes override.
All <w:t> elements within a single run are concatenated so that
split runs (e.g. from spell-checking) do not silently drop text.
| PARAMETER | DESCRIPTION |
|---|---|
txbx_el
|
An lxml element for a
TYPE:
|
char_styles
|
Mapping of style IDs to formatting tuples, as returned
by
TYPE:
|
hyperlink_rels
|
Mapping of relationship IDs to target URLs,
parsed from the part's
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
str
|
Plain text or inline-HTML string representing the text box content. |
Source code in src/core/office_processor.py
_inject_wps_txbx_plain
¶
Injects plain text into a <wps:txbx> element in-place.
Sets the first <w:t> element's text to the first line and appends
<w:br/> and new <w:t> elements for subsequent lines. Remaining
original <w:t> elements are cleared.
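A rough sketch of this injection strategy follows, using the standard library's `xml.etree.ElementTree` in place of lxml. The helper name `inject_plain` is hypothetical, and placing the new `<w:br/>` and `<w:t>` elements directly after the first `<w:t>` in the same run is an assumption about where they are appended.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def inject_plain(txbx_el, plain_text, t_elements):
    """Write plain_text into the existing <w:t> elements of txbx_el."""
    if not t_elements:
        return
    lines = plain_text.split("\n")
    first = t_elements[0]
    first.text = lines[0]
    # ElementTree has no getparent(), so find the element that owns the first <w:t>.
    parent = next(p for p in txbx_el.iter() if first in list(p))
    insert_at = list(parent).index(first) + 1
    for line in lines[1:]:
        parent.insert(insert_at, ET.Element(f"{W}br"))
        new_t = ET.Element(f"{W}t")
        new_t.text = line
        parent.insert(insert_at + 1, new_t)
        insert_at += 2
    # Clear the remaining original <w:t> elements so old wording does not survive.
    for leftover in t_elements[1:]:
        leftover.text = ""


SAMPLE = f"""
<txbx xmlns:w="{W_NS}">
  <w:p><w:r><w:t>old one</w:t></w:r><w:r><w:t>old two</w:t></w:r></w:p>
</txbx>
"""
box = ET.fromstring(SAMPLE)
t_els = list(box.iter(f"{W}t"))
inject_plain(box, "first\nsecond", t_els)

first_run = box.find(f"{W}p/{W}r")
tags = [child.tag.split("}")[1] for child in first_run]
texts = [t.text for t in box.iter(f"{W}t")]
print(tags, texts)
```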
| PARAMETER | DESCRIPTION |
|---|---|
txbx_el
|
An lxml element for the
TYPE:
|
plain_text
|
The translated plain text (lines separated by
TYPE:
|
t_elements
|
Flat list of all
TYPE:
|
Source code in src/core/office_processor.py
_inject_wps_txbx_html_runs
¶
Injects HTML-formatted text into a <wps:txbx> element in-place.
Parses html_text via _parse_html_formatting to obtain
_FormattedSegment objects. Segments containing '\n' are split
across multiple <w:p> elements. Existing run children are cleared
and replaced with new <w:r>/<w:rPr>/<w:t> elements. Excess
paragraphs are removed; new ones are cloned from the last existing
paragraph when more are needed.
When rels_adder is provided, segments with hyperlink_url are
wrapped in <w:hyperlink> elements with the relationship ID
returned by the callback.
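The paragraph-splitting step can be sketched in isolation. The `Segment` dataclass below is a hypothetical stand-in for `_FormattedSegment`; only `hyperlink_url` is named in the text above, and the other fields are assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for _FormattedSegment (field names are assumptions)."""
    text: str
    bold: bool = False
    hyperlink_url: Optional[str] = None


def split_into_paragraphs(segments: List[Segment]) -> List[List[Segment]]:
    """Group segments into paragraphs, starting a new one at every newline."""
    paragraphs: List[List[Segment]] = [[]]
    for seg in segments:
        for i, piece in enumerate(seg.text.split("\n")):
            if i > 0:
                paragraphs.append([])  # each newline opens a new paragraph
            if piece:
                # Keep the segment's formatting, replace only its text.
                paragraphs[-1].append(replace(seg, text=piece))
    return paragraphs


segments = [
    Segment("Title\nSee ", bold=True),
    Segment("docs", hyperlink_url="https://example.com"),
]
paragraphs = split_into_paragraphs(segments)
for number, para in enumerate(paragraphs):
    print(number, [s.text for s in para])
```

Each inner list would then map to one `<w:p>`, with one run per segment.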
| PARAMETER | DESCRIPTION |
|---|---|
txbx_el
|
An lxml element for the
TYPE:
|
html_text
|
Translated HTML string with inline formatting tags.
TYPE:
|
rels_adder
|
Callback that accepts a URL string and returns a
relationship ID (
TYPE:
|
Source code in src/core/office_processor.py
_collect_wps_texts
¶
Finds all <wps:txbx> text boxes and their <w:t> elements.
Delegates per-element data reading to _read_txbx_data.
| PARAMETER | DESCRIPTION |
|---|---|
root
|
lxml root element of an XML part.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
Pairs of (concatenated_text, list_of_wt_elements).
TYPE:
|
Source code in src/core/office_processor.py
_extract_docx_shapes
¶
Extracts text from shapes/text boxes in a DOCX file via ZIP + lxml.
Parses word/document.xml and word/header*.xml / word/footer*.xml
looking for <wps:txbx> elements that contain <w:t> runs.
When run formatting varies or hyperlinks are present within a text box,
inline HTML is emitted so the LLM can preserve it.
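A minimal sketch of the ZIP scan described above, covering only the plain-text case. It uses the standard library's `xml.etree.ElementTree` rather than lxml, `extract_shape_texts` is a hypothetical name, and numbering the text boxes with one running index across all parts is an assumption about how the `shape:{index}` keys are built.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WPS_NS = "http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
# Main document body plus any numbered header/footer parts.
PART_RE = re.compile(r"word/(document|header\d*|footer\d*)\.xml")


def extract_shape_texts(docx_bytes):
    """Return (location_key, text) pairs for every <wps:txbx> found."""
    results = []
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        parts = sorted(n for n in zf.namelist() if PART_RE.fullmatch(n))
        for name in parts:
            root = ET.fromstring(zf.read(name))
            for txbx in root.iter(f"{{{WPS_NS}}}txbx"):
                text = "".join(t.text or "" for t in txbx.iter(f"{{{W_NS}}}t"))
                results.append((f"shape:{len(results)}", text))
    return results


def _part(text):
    """Build a tiny XML part containing one text box (test fixture only)."""
    return (f'<w:document xmlns:w="{W_NS}" xmlns:wps="{WPS_NS}">'
            f"<wps:txbx><w:txbxContent><w:p><w:r><w:t>{text}</w:t></w:r></w:p>"
            f"</w:txbxContent></wps:txbx></w:document>")


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as out:
    out.writestr("word/document.xml", _part("Box A"))
    out.writestr("word/header1.xml", _part("Box B"))
    out.writestr("word/styles.xml", "<styles/>")  # ignored by PART_RE

shapes = extract_shape_texts(buf.getvalue())
print(shapes)
```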
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .docx file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs with keys like 'shape:{index}'.
TYPE:
|
Source code in src/core/office_processor.py
_inject_docx_shapes
¶
Injects translated text into DOCX shapes/text boxes via ZIP + lxml.
When the translated text contains inline HTML formatting tags,
_inject_wps_txbx_html_runs is used to rebuild <w:r> elements
with per-segment <w:rPr> formatting. When <a href="...">
tags are present, hyperlink relationships are added to the part's
.rels file. Otherwise, plain text is injected via
_inject_wps_txbx_plain.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .docx file to modify in place.
TYPE:
|
translations
|
Mapping of
TYPE:
|
Source code in src/core/office_processor.py
_resolve_xlsx_sheet_drawings
¶
Resolves sheet-name → drawing-path mappings from an XLSX ZIP.
Reads xl/workbook.xml to get sheet names and
xl/worksheets/_rels/sheet{N}.xml.rels to find associated drawings.
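The lookup can be sketched as follows with the standard library only. `resolve_sheet_drawings` is a hypothetical name, and the sketch assumes that `sheet{N}.xml` corresponds to the N-th `<sheet>` entry in `xl/workbook.xml`; a stricter implementation would resolve the sheet file through `xl/_rels/workbook.xml.rels`.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
DRAWING_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/drawing"
)


def resolve_sheet_drawings(zf):
    """Return (sheet_name, drawing_path) pairs for sheets that own a drawing."""
    members = set(zf.namelist())
    workbook = ET.fromstring(zf.read("xl/workbook.xml"))
    sheet_names = [s.get("name") for s in workbook.iter(f"{{{MAIN_NS}}}sheet")]
    pairs = []
    for n, sheet_name in enumerate(sheet_names, start=1):
        rels_path = f"xl/worksheets/_rels/sheet{n}.xml.rels"
        if rels_path not in members:
            continue  # this sheet has no relationships, hence no drawing
        rels = ET.fromstring(zf.read(rels_path))
        for rel in rels.iter(f"{{{REL_NS}}}Relationship"):
            if rel.get("Type") == DRAWING_TYPE:
                # Targets are relative to xl/worksheets/, e.g. ../drawings/drawing1.xml
                target = posixpath.normpath(
                    posixpath.join("xl/worksheets", rel.get("Target"))
                )
                pairs.append((sheet_name, target))
    return pairs


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as out:
    out.writestr(
        "xl/workbook.xml",
        f'<workbook xmlns="{MAIN_NS}"><sheets>'
        '<sheet name="Summary" sheetId="1"/><sheet name="Data" sheetId="2"/>'
        "</sheets></workbook>",
    )
    out.writestr(
        "xl/worksheets/_rels/sheet2.xml.rels",
        f'<Relationships xmlns="{REL_NS}">'
        f'<Relationship Id="rId1" Type="{DRAWING_TYPE}" '
        'Target="../drawings/drawing1.xml"/></Relationships>',
    )

with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    pairs = resolve_sheet_drawings(zf)
print(pairs)
```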
| PARAMETER | DESCRIPTION |
|---|---|
zf
|
An open
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
Pairs of (sheet_name, drawing_xml_path) like
TYPE:
|
Source code in src/core/office_processor.py
_extract_xlsx_shapes
¶
Extracts text from shapes in an XLSX file via ZIP + lxml.
Uses DrawingML <a:txBody> elements within each sheet's drawing XML.
When run formatting varies or hyperlinks are present within a shape,
inline HTML is emitted via _drawingml_to_html so the LLM can
preserve it.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .xlsx file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs with keys like 'shape:{sheet_name}:{index}'.
TYPE:
|
Source code in src/core/office_processor.py
_inject_xlsx_shapes
¶
Injects translated text into XLSX shapes via ZIP + lxml.
When translated text contains <a href="..."> tags, hyperlink
relationships are added to the drawing's .rels file.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .xlsx file to modify in place.
TYPE:
|
translations
|
Mapping of
TYPE:
|
Source code in src/core/office_processor.py
_build_odf_style_map
¶
Builds a mapping of style names to <style:style> elements.
Scans <office:automatic-styles> for <style:style> entries
with style:family="text" and returns a dict keyed by
style:name.
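A small sketch of the scan, written against the standard library's `xml.etree.ElementTree` rather than lxml. The function name `build_style_map` and the sample document are illustrative only.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"


def build_style_map(root):
    """Map style:name to its <style:style> element for text-family styles."""
    style_map = {}
    for auto in root.iter(f"{{{OFFICE_NS}}}automatic-styles"):
        for style in auto.findall(f"{{{STYLE_NS}}}style"):
            # Attributes are namespaced too, so they need the Clark-notation key.
            if style.get(f"{{{STYLE_NS}}}family") == "text":
                style_map[style.get(f"{{{STYLE_NS}}}name")] = style
    return style_map


SAMPLE = f"""
<office:document-content xmlns:office="{OFFICE_NS}" xmlns:style="{STYLE_NS}">
  <office:automatic-styles>
    <style:style style:name="T1" style:family="text"/>
    <style:style style:name="P1" style:family="paragraph"/>
    <style:style style:name="T2" style:family="text"/>
  </office:automatic-styles>
</office:document-content>
"""

style_map = build_style_map(ET.fromstring(SAMPLE))
print(sorted(style_map))
```

The paragraph style `P1` is filtered out because its family is not `text`.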
| PARAMETER | DESCRIPTION |
|---|---|
root
|
The lxml root element of an ODF
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
dict
|
Mapping of style name to the
TYPE:
|
Source code in src/core/office_processor.py
_inject_odf_text_box_html_runs
¶
Injects HTML-formatted text into an ODF <draw:text-box> element.
Parses html_text via _parse_html_formatting. For each unique
formatting signature, generates a <style:style> entry in
auto_styles_el and wraps the text in <text:span> with the
corresponding text:style-name. Handles '\n' by creating
multiple <text:p> elements.
Falls back to _inject_odf_paragraph_text when no HTML tags are
detected.
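The "one style per unique formatting signature" idea might look roughly like this. The helper name `ensure_style`, the `TrT` name prefix, the tuple-of-flags signature, and the one-element list used as the mutable counter are all assumptions made for the sketch, and only bold and italic are handled.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
FO_NS = "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"


def ensure_style(signature, known, auto_styles_el, style_counter):
    """Return a style name for signature, adding a <style:style> if it is new."""
    if signature in known:
        return known[signature]
    style_counter[0] += 1  # mutated in place so the caller sees the new count
    name = f"TrT{style_counter[0]}"  # hypothetical naming scheme
    style = ET.SubElement(
        auto_styles_el,
        f"{{{STYLE_NS}}}style",
        {f"{{{STYLE_NS}}}name": name, f"{{{STYLE_NS}}}family": "text"},
    )
    props = ET.SubElement(style, f"{{{STYLE_NS}}}text-properties")
    if "bold" in signature:
        props.set(f"{{{FO_NS}}}font-weight", "bold")
    if "italic" in signature:
        props.set(f"{{{FO_NS}}}font-style", "italic")
    known[signature] = name
    return name


auto_styles = ET.Element("automatic-styles")
known_styles = {}
counter = [0]
names = [
    ensure_style(sig, known_styles, auto_styles, counter)
    for sig in [("bold",), ("bold", "italic"), ("bold",)]
]
print(names, len(auto_styles))
```

The repeated `("bold",)` signature reuses the first style, so only two `<style:style>` entries are generated for three segments.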
| PARAMETER | DESCRIPTION |
|---|---|
text_box_el
|
An lxml element for
TYPE:
|
html_text
|
Translated text with inline formatting tags.
TYPE:
|
text_p_tag
|
The fully-qualified
TYPE:
|
auto_styles_el
|
The
TYPE:
|
style_counter
|
Mutable
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
bool
|
True if the element was modified. |
Source code in src/core/office_processor.py
_extract_odt_shapes
¶
Extracts text from <draw:text-box> elements in an ODT file.
When span formatting varies within a text box, inline HTML is emitted
via _odf_text_box_to_html so the LLM can preserve it.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .odt file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs with keys like
TYPE:
|
Source code in src/core/office_processor.py
_inject_odt_shapes
¶
Injects translated text into <draw:text-box> elements in an ODT.
When the translated text contains inline HTML formatting tags,
_inject_odf_text_box_html_runs is used to create styled spans.
Otherwise, falls back to _inject_odf_paragraph_text.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .odt file to modify in place.
TYPE:
|
translations
|
Mapping of
TYPE:
|
Source code in src/core/office_processor.py
_extract_ods_shapes
¶
Extracts text from <draw:text-box> elements in an ODS file.
Iterates per <table:table> to produce sheet-qualified keys.
When span formatting varies within a text box, inline HTML is emitted
via _odf_text_box_to_html so the LLM can preserve it.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the .ods file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list
|
(location_key, text) pairs with keys like
TYPE:
|
Source code in src/core/office_processor.py
_inject_ods_shapes
¶
Injects translated text into <draw:text-box> elements in an ODS.
When the translated text contains inline HTML formatting tags,
_inject_odf_text_box_html_runs is used to create styled spans.
Otherwise, falls back to _inject_odf_paragraph_text.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the .ods file to modify in place.
TYPE:
|
translations
|
Mapping of
TYPE:
|
Source code in src/core/office_processor.py
_translate_single_image
¶
_translate_single_image(
image_bytes,
content_type,
target_lang,
src_lang,
glossary_entries,
ocr_method,
*,
provider=None,
model=None,
)
Translates a single image using the OCR → LLM → render pipeline.
Writes the image to a temp file, processes it, and returns the translated image bytes. Returns None if the image has no translatable text or rendering fails. Does not catch ValueError so that fatal LLM errors can propagate to the caller.
| PARAMETER | DESCRIPTION |
|---|---|
image_bytes
|
Raw image data.
TYPE:
|
content_type
|
MIME type (e.g. "image/png").
TYPE:
|
target_lang
|
Target language name.
TYPE:
|
src_lang
|
Source language name.
TYPE:
|
glossary_entries
|
Optional glossary entries.
TYPE:
|
ocr_method
|
OCR method name (e.g. "TesseractOCR").
TYPE:
|
provider
|
Optional LLM provider override.
TYPE:
|
model
|
Optional LLM model override.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
bytes | None
|
Translated image bytes, or None if the image has no translatable text or rendering fails. |
Source code in src/core/office_processor.py
_translate_zip_images
¶
_translate_zip_images(
output_path,
suffix,
target_lang,
src_lang,
glossary_entries,
ocr_method,
progress_callback,
cancel_check,
*,
provider=None,
model=None,
checkpoint_dir=None,
)
Translates images embedded in an Office document using zipfile.
Opens the document as a ZIP archive, identifies raster images in the
known media directory, translates each via the OCR → LLM → render
pipeline, replaces the originals in memory, and rewrites the archive
atomically (write to .tmp, then shutil.move).
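The write-then-move pattern can be sketched as below. `rewrite_zip_members` is a hypothetical helper, and the member names in the demo are illustrative; the real function's media-directory detection is not reproduced here.

```python
import os
import shutil
import tempfile
import zipfile


def rewrite_zip_members(archive_path, replacements):
    """Rewrite archive_path, swapping in new bytes for the named members."""
    tmp_path = archive_path + ".tmp"
    with zipfile.ZipFile(archive_path) as src:
        with zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as dst:
            for info in src.infolist():
                if info.filename in replacements:
                    data = replacements[info.filename]
                else:
                    data = src.read(info.filename)
                # Passing the ZipInfo keeps member order and compression type.
                dst.writestr(info, data)
    # The original is only replaced once the new archive is fully written.
    shutil.move(tmp_path, archive_path)


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "demo.docx")
with zipfile.ZipFile(path, "w") as zf:
    zf.writestr("word/document.xml", "<doc/>")
    zf.writestr("word/media/image1.png", b"original")

rewrite_zip_members(path, {"word/media/image1.png": b"translated"})

with zipfile.ZipFile(path) as zf:
    image_bytes = zf.read("word/media/image1.png")
    member_names = zf.namelist()
tmp_left_behind = os.path.exists(path + ".tmp")
shutil.rmtree(workdir)
print(image_bytes, member_names, tmp_left_behind)
```

If writing the temporary archive fails partway, the original document is still intact because the move never happens.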
Supports .docx, .xlsx, .pptx, .odt, .ods, .odp,
and .epub.
Skip-with-warning policy for non-fatal per-image errors: a
bad image (e.g. IMAGE_TOO_LARGE, an unreadable JPEG header,
a vision model returning empty text) leaves the original image in
place and the loop continues. The user gets a document with
most images translated and the broken ones in their source form,
rather than one stubborn image blocking the whole document.
Fatal LLM errors (AUTH_ERROR, QUOTA_ERROR, VISION_NOT_SUPPORTED)
still break out immediately — those indicate the entire pipeline
can't continue, not "this one image won't translate".
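A sketch of this policy as a loop follows. How the error codes are actually carried is not shown in this section, so the sketch assumes, purely for illustration, that every failure surfaces as a `ValueError` whose message starts with the code; `translate_all` and `fake_translate` are hypothetical names.

```python
import logging

logger = logging.getLogger(__name__)

FATAL_CODES = {"AUTH_ERROR", "QUOTA_ERROR", "VISION_NOT_SUPPORTED"}


def translate_all(images, translate_one):
    """Translate each image; skip non-fatal failures, stop on fatal ones."""
    results = {}
    for name, data in images.items():
        try:
            translated = translate_one(data)
        except ValueError as exc:
            # Assumed convention: "CODE: human-readable detail".
            code = str(exc).split(":", 1)[0]
            if code in FATAL_CODES:
                raise  # the whole pipeline cannot continue
            logger.warning("Skipping %s: %s", name, exc)
            continue  # the original image stays in place
        if translated is not None:
            results[name] = translated
    return results


def fake_translate(data):
    """Stand-in for the OCR, LLM and render pipeline."""
    if data == b"huge":
        raise ValueError("IMAGE_TOO_LARGE: 50 MB")
    if data == b"noauth":
        raise ValueError("AUTH_ERROR: invalid key")
    return data.upper()


partial = translate_all(
    {"a.png": b"ok", "b.png": b"huge", "c.png": b"fine"}, fake_translate
)
try:
    translate_all({"a.png": b"ok", "d.png": b"noauth"}, fake_translate)
    fatal_raised = False
except ValueError:
    fatal_raised = True
print(partial, fatal_raised)
```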
When checkpoint_dir is provided, each image's translated bytes
are persisted under <checkpoint_dir>/office_images/<sha256>.bin
and consulted on re-runs. This means an interrupted batch (50/100
images done, then a quota error or cancellation) only retries the
remaining 50 on resume instead of redoing the whole document. The
SHA256 of the source bytes is the cache key, so duplicate images
(e.g. a company logo repeated on every page) deduplicate naturally.
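The content-addressed cache can be sketched as follows. The directory layout follows the description above, while `cached_translate` and `fake_translate` are hypothetical names and the decision not to cache a `None` result is an assumption.

```python
import hashlib
import os
import tempfile


def cached_translate(image_bytes, checkpoint_dir, translate_one):
    """Return cached translated bytes if present, else translate and store."""
    cache_dir = os.path.join(checkpoint_dir, "office_images")
    os.makedirs(cache_dir, exist_ok=True)
    # The key depends only on the source bytes, so identical images share an entry.
    key = hashlib.sha256(image_bytes).hexdigest()
    cache_path = os.path.join(cache_dir, f"{key}.bin")
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return fh.read()
    translated = translate_one(image_bytes)
    if translated is not None:
        with open(cache_path, "wb") as fh:
            fh.write(translated)
    return translated


calls = []


def fake_translate(data):
    """Stand-in for the real pipeline; records how often it is invoked."""
    calls.append(data)
    return b"translated:" + data


with tempfile.TemporaryDirectory() as checkpoint_dir:
    first = cached_translate(b"logo", checkpoint_dir, fake_translate)
    second = cached_translate(b"logo", checkpoint_dir, fake_translate)
    other = cached_translate(b"chart", checkpoint_dir, fake_translate)
    cached_files = sorted(
        os.listdir(os.path.join(checkpoint_dir, "office_images"))
    )
print(len(calls), cached_files)
```

The second request for the logo is served from disk, which is the same mechanism that lets a resumed run skip images finished before the interruption.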
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the saved translated document (modified in place).
TYPE:
|
suffix
|
Lowercase file extension (e.g. ".docx").
TYPE:
|
target_lang
|
Target language name.
TYPE:
|
src_lang
|
Source language name.
TYPE:
|
glossary_entries
|
Optional glossary entries.
TYPE:
|
ocr_method
|
OCR method name (e.g. "TesseractOCR").
TYPE:
|
progress_callback
|
Called with 0-100 for the image phase.
TYPE:
|
cancel_check
|
Returns True if the task was cancelled.
TYPE:
|
provider
|
Optional LLM provider override.
TYPE:
|
model
|
Optional LLM model override.
TYPE:
|
checkpoint_dir
|
Task storage directory for per-image cache.
TYPE:
|
Source code in src/core/office_processor.py
_translate_legacy_images
¶
_translate_legacy_images(
output_path,
suffix,
backend,
target_lang,
src_lang,
glossary_entries,
ocr_method,
progress_callback,
cancel_check,
*,
provider=None,
model=None,
checkpoint_dir=None,
)
Translates images in legacy office files via round-trip conversion.
Converts the legacy file (.doc/.xls/.ppt) to its modern equivalent (.docx/.xlsx/.pptx), runs the existing ZIP-based image pipeline on the modern file, then converts back to the legacy format.
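The round trip reduces to three steps, sketched here with injected callables so that no office backend is needed. `round_trip`, `convert`, and `process_modern` are hypothetical names, and writing the modern file next to the legacy one is an assumption; the real function may use a temporary location.

```python
import os

LEGACY_TO_MODERN = {".doc": ".docx", ".xls": ".xlsx", ".ppt": ".pptx"}


def round_trip(legacy_path, convert, process_modern):
    """Convert legacy to modern, process the modern file, convert back."""
    stem, suffix = os.path.splitext(legacy_path)
    modern_path = stem + LEGACY_TO_MODERN[suffix.lower()]
    convert(legacy_path, modern_path)   # e.g. .doc to .docx via the backend
    process_modern(modern_path)         # ZIP-based image pipeline
    convert(modern_path, legacy_path)   # back to the legacy format
    return modern_path


steps = []
modern = round_trip(
    "report.doc",
    convert=lambda src, dst: steps.append(("convert", src, dst)),
    process_modern=lambda path: steps.append(("process", path)),
)
print(modern, steps)
```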
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the saved legacy document (modified in place).
TYPE:
|
suffix
|
Lowercase legacy extension (e.g. ".doc").
TYPE:
|
backend
|
Backend identifier ("win32com" or "uno").
TYPE:
|
target_lang
|
Target language name.
TYPE:
|
src_lang
|
Source language name.
TYPE:
|
glossary_entries
|
Optional glossary entries.
TYPE:
|
ocr_method
|
OCR method name (e.g. "TesseractOCR").
TYPE:
|
progress_callback
|
Called with 0-100 for the image phase.
TYPE:
|
cancel_check
|
Returns True if the task was cancelled.
TYPE:
|
provider
|
Optional LLM provider override.
TYPE:
|
model
|
Optional LLM model override.
TYPE:
|
checkpoint_dir
|
Task storage directory for per-image cache.
Forwarded to
TYPE:
|
Source code in src/core/office_processor.py
_translate_doc_images
¶
_translate_doc_images(
output_path,
suffix,
backend,
target_lang,
src_lang,
glossary_entries,
progress_callback,
cancel_check,
config=None,
*,
provider=None,
model=None,
checkpoint_dir=None,
)
Translates images embedded in an Office document.
For modern/ODF formats: uses the ZIP-based image pipeline directly. For legacy formats (.doc/.xls/.ppt): converts to modern format first, runs the ZIP pipeline, then converts back.
| PARAMETER | DESCRIPTION |
|---|---|
output_path
|
Path to the saved translated document.
TYPE:
|
suffix
|
Lowercase file extension (e.g. ".docx", ".doc").
TYPE:
|
backend
|
Backend identifier for legacy format conversion.
TYPE:
|
target_lang
|
Target language name.
TYPE:
|
src_lang
|
Source language name.
TYPE:
|
glossary_entries
|
Optional glossary entries.
TYPE:
|
progress_callback
|
Called with 0-100 for the image phase.
TYPE:
|
cancel_check
|
Returns True if the task was cancelled.
TYPE:
|
config
|
Optional TranslationConfig snapshot; falls back to load_setting().
TYPE:
|
provider
|
Optional LLM provider override.
TYPE:
|
model
|
Optional LLM model override.
TYPE:
|
checkpoint_dir
|
Task storage directory for per-image cache.
Forwarded to the underlying ZIP pipeline;
TYPE:
|
Source code in src/core/office_processor.py
process_office_file
¶
process_office_file(
file_path,
output_path,
target_lang,
src_lang="",
progress_callback=None,
glossary_entries=None,
cancel_check=None,
checkpoint_dir=None,
config=None,
*,
provider=None,
model=None,
)
Translates an Office document using the best available backend.
Extracts translatable text, translates via LLM, and injects translations back into a copy of the document.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the source office file.
TYPE:
|
output_path
|
Path to write the translated file.
TYPE:
|
target_lang
|
Target language name.
TYPE:
|
src_lang
|
Source language name.
TYPE:
|
progress_callback
|
Called with 0-100 progress percentage.
TYPE:
|
glossary_entries
|
Optional glossary entries for translation.
TYPE:
|
cancel_check
|
Returns True if the task was cancelled.
TYPE:
|
checkpoint_dir
|
Directory for saving/loading checkpoints.
TYPE:
|
config
|
Optional TranslationConfig for dependency injection.
TYPE:
|
provider
|
Optional LLM provider override (Gemini / Custom).
TYPE:
|
model
|
Optional LLM model override.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
bool
|
True on success, False if cancelled.
TYPE:
|
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
On backend or processing errors. |
Source code in src/core/office_processor.py