checkpoint
Checkpoint I/O for resumable translation tasks.
Saves and loads intermediate artifacts as JSON files in each task's storage directory. Writes are atomic (write to a temporary file, then rename) to prevent corruption on crash. The public functions have no side effects beyond the filesystem, and every loader returns None on any load failure so that callers fall back to a full restart.
get_storage_dir
Returns the task's storage directory from its cloned file path.
| PARAMETER | DESCRIPTION |
|---|---|
| storage_path | Absolute path to the cloned file in storage. |

| RETURNS | DESCRIPTION |
|---|---|
| Path | The parent directory of the cloned file. |
Source code in src/core/checkpoint.py
_write_checkpoint
Writes a checkpoint JSON file atomically (write to a temporary file, then rename).
| PARAMETER | DESCRIPTION |
|---|---|
| storage_dir | Task storage directory. |
| filename | Checkpoint file name. |
| data | Dictionary to serialize as JSON. |
Source code in src/core/checkpoint.py
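The write-then-rename pattern described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the function name `write_json_atomically` and the temp-file naming are assumptions made for the example.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any


def write_json_atomically(storage_dir: Path, filename: str, data: dict[str, Any]) -> None:
    """Hypothetical helper: write JSON to a temp file, then rename it over the target."""
    target = storage_dir / filename
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=storage_dir, prefix=filename, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, ensure_ascii=False)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace swaps the finished file into place, overwriting any old checkpoint.
        os.replace(tmp_name, target)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A reader of the target path then sees either the previous complete checkpoint or the new complete one, never a partially written file.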
_read_checkpoint
Reads a checkpoint JSON file. Returns None if the file is missing or corrupt.
Also returns None when the file's version differs from _VERSION, so a schema change forces a clean restart.
| PARAMETER | DESCRIPTION |
|---|---|
| storage_dir | Task storage directory. |
| filename | Checkpoint file name. |

| RETURNS | DESCRIPTION |
|---|---|
| dict[str, Any] \| None | Parsed data or None. |
Source code in src/core/checkpoint.py
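A tolerant read with a version gate might look like the sketch below. The constant `SCHEMA_VERSION` stands in for the module's `_VERSION`, and storing the version under a top-level `"version"` key is an assumption; the page does not show the actual file layout.

```python
import json
from pathlib import Path
from typing import Any, Optional

SCHEMA_VERSION = 1  # hypothetical stand-in for the module's _VERSION


def read_json_checkpoint(storage_dir: Path, filename: str) -> Optional[dict[str, Any]]:
    """Hypothetical helper: return parsed data, or None if the file is unusable."""
    path = storage_dir / filename
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing, unreadable, or malformed file: report "no checkpoint".
        return None
    # Assumed layout: the schema version sits under a top-level "version" key.
    if not isinstance(data, dict) or data.get("version") != SCHEMA_VERSION:
        return None
    return data
```

Returning None for every failure mode keeps the caller's logic to a single branch: use the data if present, otherwise start over.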
clear_checkpoints
Deletes all checkpoint artifacts from the storage directory.
Removes every checkpoint_*.json file plus the office_images/ per-image cache directory. Best-effort: errors are logged but never raised, because checkpoint cleanup must never block task completion.
| PARAMETER | DESCRIPTION |
|---|---|
| storage_dir | Task storage directory. |
Source code in src/core/checkpoint.py
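The best-effort cleanup described above could be sketched like this. The function name `clear_checkpoint_files` and the log messages are illustrative assumptions; only the `checkpoint_*.json` pattern and the `office_images/` directory come from the description.

```python
import logging
import shutil
from pathlib import Path

logger = logging.getLogger(__name__)


def clear_checkpoint_files(storage_dir: Path) -> None:
    """Hypothetical helper: remove checkpoint files, logging failures instead of raising."""
    for path in storage_dir.glob("checkpoint_*.json"):
        try:
            path.unlink()
        except OSError as exc:
            logger.warning("Could not remove %s: %s", path, exc)
    image_cache = storage_dir / "office_images"
    if image_cache.is_dir():
        try:
            shutil.rmtree(image_cache)
        except OSError as exc:
            logger.warning("Could not remove %s: %s", image_cache, exc)
```

Each deletion has its own try block, so one undeletable file does not stop the remaining files from being removed.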
_serialize_ocr_result
Converts an OCRResult to a fully serializable dictionary.
Extends OCRResult's built-in to_dict() with the fields needed for resuming (translated_html, original_text_height, line_height_ratio, is_single_line). Color and alignment are already strings.
| PARAMETER | DESCRIPTION |
|---|---|
| result | An OCRResult instance. |

| RETURNS | DESCRIPTION |
|---|---|
| dict | JSON-safe dictionary. |
Source code in src/core/checkpoint.py
_deserialize_ocr_result
Reconstructs an OCRResult from a serialized dictionary.
| PARAMETER | DESCRIPTION |
|---|---|
| data | Dictionary produced by _serialize_ocr_result(). |

| RETURNS | DESCRIPTION |
|---|---|
| OCRResult | Reconstructed result with all fields restored. |
Source code in src/core/checkpoint.py
save_ocr_checkpoint
Saves OCR results after the OCR step.
Best-effort: logs and returns on any serialization error.
| PARAMETER | DESCRIPTION |
|---|---|
| storage_dir | Task storage directory. |
| ocr_results | OCR results list. |
| raw_ocr_results | Raw/unmerged OCR results. |
| ocr_method | Name of the OCR engine used. |
Source code in src/core/checkpoint.py
load_ocr_checkpoint
Loads OCR checkpoint data.
| PARAMETER | DESCRIPTION |
|---|---|
| storage_dir | Task storage directory. |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[list[OCRResult], list[OCRResult], str] \| None | (ocr_results, raw_ocr_results, ocr_method) or None. |
Source code in src/core/checkpoint.py
save_llm_checkpoint
Saves LLM results after the LLM + merge step.
Best-effort: logs and returns on any serialization error.
| PARAMETER | DESCRIPTION |
|---|---|
| storage_dir | Task storage directory. |
| ocr_results | Merged paragraph-level OCR results. |
| translations | List of translated strings. |
| confirmed_raw_fragments | Raw fragments confirmed by merge. |
Source code in src/core/checkpoint.py
load_llm_checkpoint
Loads LLM checkpoint data.
| PARAMETER | DESCRIPTION |
|---|---|
| storage_dir | Task storage directory. |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[list[OCRResult], list[str], list[OCRResult]] \| None | (ocr_results, translations, confirmed_raw_fragments) or None. |
Source code in src/core/checkpoint.py
save_text_chunk
Incrementally saves a translated text chunk.
Reads the existing checkpoint, adds/updates the chunk, and writes back atomically. Best-effort: logs and returns on error.
| PARAMETER | DESCRIPTION |
|---|---|
| storage_dir | Task storage directory. |
| chunk_index | Zero-based index of the chunk. |
| translated_text | The translated chunk text. |
| total_chunks | Total number of chunks in the document. |
Source code in src/core/checkpoint.py
save_text_batch
Saves multiple translated text chunks in a single write.
Reads the existing checkpoint once, merges all chunks, and writes back atomically. Much more efficient than calling save_text_chunk() in a loop: one read-and-write round trip instead of N.
| PARAMETER | DESCRIPTION |
|---|---|
| storage_dir | Task storage directory. |
| chunks | Mapping of chunk_index to translated text. |
| total_chunks | Total number of chunks in the document. |
Source code in src/core/checkpoint.py
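The read-once, merge, write-once flow can be sketched as below. The file name `checkpoint_text.json`, the `{"total_chunks": ..., "chunks": {...}}` layout, and the function name `merge_text_chunks` are all assumptions made for illustration; the page does not document the on-disk format.

```python
import json
import os
from pathlib import Path

TEXT_CHECKPOINT = "checkpoint_text.json"  # hypothetical file name


def merge_text_chunks(storage_dir: Path, chunks: dict[int, str], total_chunks: int) -> None:
    """Hypothetical helper: merge many chunks into the checkpoint with one read and one write."""
    path = storage_dir / TEXT_CHECKPOINT
    try:
        stored = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        stored = {}
    if not isinstance(stored, dict):
        stored = {}
    merged = dict(stored.get("chunks", {}))
    # JSON object keys are always strings, so chunk indices are stored as str.
    merged.update({str(index): text for index, text in chunks.items()})
    payload = {"total_chunks": total_chunks, "chunks": merged}
    # Write to a sibling temp file, then rename it over the checkpoint.
    tmp_path = path.with_name(path.name + ".tmp")
    tmp_path.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
    os.replace(tmp_path, path)
```

Calling a per-chunk saver N times would repeat the read, parse, serialize, and rename steps N times; the batch form does each once regardless of how many chunks are passed.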
load_text_checkpoint
Loads text chunk checkpoint data.
| PARAMETER | DESCRIPTION |
|---|---|
| storage_dir | Task storage directory. |

| RETURNS | DESCRIPTION |
|---|---|
| dict[int, str] \| None | dict mapping chunk_index (int) to translated text, or None. |
Source code in src/core/checkpoint.py
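Because JSON object keys are always strings, a loader that returns `dict[int, str]` has to convert the keys back to integers after parsing. The sketch below shows that conversion and how a caller might use the result to resume; both helper names are hypothetical and not part of the module.

```python
import json


def chunks_from_json(raw: str) -> dict[int, str]:
    """Hypothetical helper: parse a JSON object and restore integer chunk indices."""
    stored = json.loads(raw)
    return {int(key): text for key, text in stored.items()}


def pending_chunks(done: dict[int, str], total_chunks: int) -> list[int]:
    """Hypothetical helper: list the chunk indices that still need translating."""
    return [index for index in range(total_chunks) if index not in done]


# Integer keys become strings when serialized.
serialized = json.dumps({0: "Bonjour", 2: "Merci"})
restored = chunks_from_json(serialized)
remaining = pending_chunks(restored, total_chunks=4)
```

Here `remaining` contains the indices 1 and 3, so a resumed task would translate only those chunks.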
save_batch_progress
Incrementally saves a batch of translated values.
Best-effort: logs and returns on error.
| PARAMETER | DESCRIPTION |
|---|---|
| storage_dir | Task storage directory. |
| batch_start | Zero-based start index of this batch. |
| translated_values | Translated strings for this batch. |
| total_values | Total number of values to translate. |
Source code in src/core/checkpoint.py
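Since the matching loader returns a mapping from value index to translated string, a batch presumably contributes one entry per value, starting at `batch_start`. The sketch below shows that indexing under this assumption; `index_batch` is a hypothetical name and the real storage layout is not shown on this page.

```python
def index_batch(batch_start: int, translated_values: list[str]) -> dict[int, str]:
    """Hypothetical helper: map each value in a batch to its absolute index."""
    return {
        batch_start + offset: value
        for offset, value in enumerate(translated_values)
    }


# A batch of two values starting at position 10 fills indices 10 and 11.
entries = index_batch(10, ["Hallo", "Welt"])
```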
load_batch_checkpoint
Loads batch checkpoint data.
| PARAMETER | DESCRIPTION |
|---|---|
| storage_dir | Task storage directory. |

| RETURNS | DESCRIPTION |
|---|---|
| dict[int, str] \| None | dict mapping value_index (int) to translated string, or None. |
Source code in src/core/checkpoint.py
save_epub_file_progress
Incrementally saves a translated EPUB content file.
Best-effort: logs and returns on error.
| PARAMETER | DESCRIPTION |
|---|---|
| storage_dir | Task storage directory. |
| file_path | Path within the EPUB archive (e.g. "OEBPS/ch1.xhtml"). |
| translated_content | Translated XHTML content. |
| content_files | Full list of content file paths in the EPUB. |
Source code in src/core/checkpoint.py
load_epub_checkpoint
Loads EPUB checkpoint data.
| PARAMETER | DESCRIPTION |
|---|---|
| storage_dir | Task storage directory. |

| RETURNS | DESCRIPTION |
|---|---|
| dict[str, str] \| None | dict mapping archive file path to translated content, or None. |
Source code in src/core/checkpoint.py
save_pdf_page_progress
Incrementally saves translated blocks for one PDF page.
Reads the existing checkpoint, adds/updates the page entry, and writes back atomically. Best-effort: logs and returns on error.
| PARAMETER | DESCRIPTION |
|---|---|
| storage_dir | Task storage directory. |
| page_index | Zero-based page index. |
| translated_blocks | List of block dicts for the page. |
| total_pages | Total number of pages in the PDF. |
Source code in src/core/checkpoint.py
load_pdf_checkpoint
Loads PDF page checkpoint data.
| PARAMETER | DESCRIPTION |
|---|---|
| storage_dir | Task storage directory. |
| expected_total_pages | When provided, the checkpoint is discarded if its on-disk |

| RETURNS | DESCRIPTION |
|---|---|
| dict[int, list[dict[str, Any]]] \| None | dict mapping page_index (int) to list of block dicts, or None. |
Source code in src/core/checkpoint.py
save_dubbing_checkpoint
save_dubbing_checkpoint(
storage_dir,
*,
srt_text=None,
translated_srt=None,
voice_file=None,
target_lang=None,
)
Saves dubbing pipeline checkpoint (incremental).
Each step appends its result to the existing checkpoint data.
| PARAMETER | DESCRIPTION |
|---|---|
| storage_dir | Persistent dubbing storage directory. |
| srt_text | Raw SRT text from the STT step. |
| translated_srt | Translated SRT text from the LLM step. |
| voice_file | Filename of the synthesized voice audio in storage_dir. |
| target_lang | Target language label for checkpoint validity check. |
Source code in src/core/checkpoint.py
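The incremental behaviour can be pictured as merging only the fields a step actually supplies into the data saved by earlier steps. The sketch below works on an in-memory dict and assumes that a None argument means "leave the stored value unchanged" and that the stored keys match the parameter names; neither detail is confirmed by this page, and `merge_dubbing_fields` is a hypothetical name.

```python
from typing import Any, Optional


def merge_dubbing_fields(
    existing: dict[str, Any],
    *,
    srt_text: Optional[str] = None,
    translated_srt: Optional[str] = None,
    voice_file: Optional[str] = None,
    target_lang: Optional[str] = None,
) -> dict[str, Any]:
    """Hypothetical helper: add the supplied fields to a copy of the stored data."""
    updates = {
        "srt_text": srt_text,
        "translated_srt": translated_srt,
        "voice_file": voice_file,
        "target_lang": target_lang,
    }
    merged = dict(existing)
    # Assumption: None means "not provided by this step", so it is skipped.
    merged.update({key: value for key, value in updates.items() if value is not None})
    return merged


# Each pipeline step contributes its own result without erasing earlier ones.
after_stt = merge_dubbing_fields({}, srt_text="1\n00:00:00,000 --> 00:00:01,000\nHello", target_lang="fr")
after_llm = merge_dubbing_fields(after_stt, translated_srt="1\n00:00:00,000 --> 00:00:01,000\nBonjour")
```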
load_dubbing_checkpoint
Loads dubbing pipeline checkpoint.
| RETURNS | DESCRIPTION |
|---|---|
| dict[str, Any] \| None | Dict with optional keys |
Source code in src/core/checkpoint.py
hash_office_image
Returns the SHA256 hex digest of an image's bytes.
Acts as both the cache key and the on-disk filename for the translated image, so identical images anywhere in any document naturally deduplicate.
| PARAMETER | DESCRIPTION |
|---|---|
| image_bytes | Raw bytes of the original (untranslated) image. |

| RETURNS | DESCRIPTION |
|---|---|
| str | 64-character lowercase hexadecimal digest. |
Source code in src/core/checkpoint.py
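A content hash of this kind is a one-liner with the standard library. The sketch below uses `hashlib.sha256`; the function name `sha256_hex` is a stand-in chosen for the example.

```python
import hashlib


def sha256_hex(image_bytes: bytes) -> str:
    """Hypothetical helper: return the SHA256 digest of the bytes as lowercase hex."""
    return hashlib.sha256(image_bytes).hexdigest()


# Identical bytes always produce the same key, wherever the image appears.
logo_in_slide_1 = sha256_hex(b"example image bytes")
logo_in_slide_9 = sha256_hex(b"example image bytes")
different_image = sha256_hex(b"other image bytes")
```

Because the key depends only on the image content, two copies of the same image map to the same cache entry, while any change to the bytes yields a different key.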
_office_image_path
Returns the on-disk path for a cached translated image.
save_office_image_checkpoint
Persists a translated image's bytes keyed by the source hash.
The write is atomic (temporary file, then rename), so a crash mid-write cannot leave a half-written cache entry that future runs would silently reuse. Best-effort: any I/O error is logged and swallowed, because a cache failure must never abort an in-flight translation.
| PARAMETER | DESCRIPTION |
|---|---|
| storage_dir | Task storage directory. |
| image_hash | SHA256 hex digest of the original image bytes. |
| translated_bytes | Rendered image bytes to cache. |
Source code in src/core/checkpoint.py
load_office_image_checkpoint
Returns previously translated bytes for image_hash, or None.
A missing or unreadable file is treated as a cache miss (logged once, after which the caller retranslates), never as an error: corruption here is a defensible reason to redo work, not to abort the run.
| PARAMETER | DESCRIPTION |
|---|---|
| storage_dir | Task storage directory. |
| image_hash | SHA256 hex digest of the original image bytes. |

| RETURNS | DESCRIPTION |
|---|---|
| bytes \| None | Translated image bytes if cached, otherwise None. |