Pdf

`PdfArchiver(path: Path)`

Bases: Archiver

An archiver for PDF files with support for embedded metadata.

This class provides an interface for reading pages from PDF documents and managing embedded metadata files. Each page is treated as a separate PNG image file within the archive, and metadata files (such as ComicInfo.xml and MetronInfo.xml) can be embedded as PDF attachments.

Pages are numbered sequentially with zero-padding (page_001.png, page_002.png, etc.).
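The naming scheme can be illustrated with a minimal sketch. It assumes three-digit padding and 1-based numbering, as in the examples above; `page_filename` is a hypothetical helper and not part of the archiver's API.

```python
def page_filename(page_index: int) -> str:
    """Return the virtual filename for a zero-based page index.

    Hypothetical helper: assumes 1-based, three-digit zero-padded names,
    matching the page_001.png pattern described above.
    """
    if page_index < 0:
        raise ValueError("page_index must be non-negative")
    return f"page_{page_index + 1:03d}.png"


# Names for the first three pages of a document.
names = [page_filename(i) for i in range(3)]
```

Zero-padding keeps a plain string sort in page order for documents of up to 999 pages.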

The implementation uses pymupdf for PDF rendering and processing.

ATTRIBUTE	DESCRIPTION
`path`	The filesystem path to the PDF file. TYPE: `Path`

Note

PDF pages are read-only virtual files rendered at 150 DPI
Metadata files can be embedded, read, and removed as PDF attachments
copy_from_archive() is not supported for PDFs
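PDF page dimensions are expressed in points (1/72 inch), so rendering at 150 DPI scales each dimension by 150/72. The sketch below estimates the resulting pixel size. `rendered_size` is a hypothetical helper, and rounding to the nearest pixel is an assumption; the renderer's actual output may differ by a pixel.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch
PAGE_DPI = 150        # resolution stated in the note above


def rendered_size(width_pt: float, height_pt: float, dpi: int = PAGE_DPI) -> tuple[int, int]:
    """Estimate the pixel size of a page rendered at the given DPI (hypothetical helper)."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# A US Letter page is 612 x 792 points.
letter_px = rendered_size(612, 792)
```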

Initialize a PdfArchiver with the provided path.

PARAMETER	DESCRIPTION
`path`	The filesystem path to the PDF file. TYPE: `Path`

Note

This constructor does not validate that the file exists or is a valid PDF document. Validation occurs when operations are performed on the document.

Methods:

`copy_from_archive(other_archive: Archiver) -> bool`

Attempt to copy files from another archive to the PDF.

PARAMETER	DESCRIPTION
`other_archive`	The source archive to copy files from. TYPE: `Archiver`

RETURNS	DESCRIPTION
`False`	This operation is not supported for PDF archives. TYPE: `bool`

Note

This method logs a warning and returns False immediately. Copying entire archives to PDF is not supported because:

- PDF pages cannot be replaced with image files
- PDFs maintain their original page structure
- Only embedded metadata files can be written individually

To add metadata to a PDF, use write_file() instead:

>>> pdf.write_file("ComicInfo.xml", xml_data)

Warning

A warning will be logged indicating that the copy operation was attempted on a PDF archive.

Examples:

Python Console Session

>>> pdf_archive = PdfArchiver(Path("target.pdf"))
>>> zip_archive = ZipArchiver(Path("source.cbz"))
>>> success = pdf_archive.copy_from_archive(zip_archive)
>>> print(f"Copy successful: {success}")  # Will print: Copy successful: False
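Because `copy_from_archive()` always returns False for PDFs, metadata can instead be transferred one file at a time. The sketch below uses only `get_filename_list()`, `read_file()`, and `write_file()` as documented on this page. `copy_metadata` and the `InMemoryArchive` stand-in are hypothetical and exist only so the example runs without a real PDF.

```python
METADATA_NAMES = ("ComicInfo.xml", "MetronInfo.xml")


def copy_metadata(source, target, names=METADATA_NAMES) -> list[str]:
    """Copy metadata files one at a time; return the names that were written."""
    available = set(source.get_filename_list())
    copied = []
    for name in names:
        # Skip names the source does not contain, and record only successful writes.
        if name in available and target.write_file(name, source.read_file(name)):
            copied.append(name)
    return copied


class InMemoryArchive:
    """Hypothetical stand-in exposing the three archiver methods used above."""

    def __init__(self, files=None):
        self.files = dict(files or {})

    def get_filename_list(self):
        return list(self.files)

    def read_file(self, archive_file):
        return self.files[archive_file]

    def write_file(self, archive_file, data):
        self.files[archive_file] = data.encode("utf-8") if isinstance(data, str) else data
        return True


source = InMemoryArchive({"001.jpg": b"...", "ComicInfo.xml": b"<ComicInfo/>"})
target = InMemoryArchive()
copied = copy_metadata(source, target)
```

With real archives, `source` would be the other archiver and `target` the `PdfArchiver`; page images in the source are left untouched.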

`get_filename_list() -> list[str]`

Get a list of all files in the PDF (pages and embedded files).

RETURNS	DESCRIPTION
`list[str]`	A list of filenames including: - Virtual page files: 'page_NNN.png' (zero-padded, starting from 001) - Embedded files: e.g., 'ComicInfo.xml', 'MetronInfo.xml'

RAISES	DESCRIPTION
`ArchiverReadError`	If the PDF cannot be read because of a corrupt or invalid PDF file, file system or permission errors, or PDF processing errors.

Examples:

Python Console Session

>>> archiver = PdfArchiver(Path("document.pdf"))
>>> files = archiver.get_filename_list()
>>> print(files)
['page_001.png', 'page_002.png', 'ComicInfo.xml']

Note

Page PNG files are virtual and generated on-demand when read
Embedded files are actual PDF attachments
The list is sorted with pages first, then embedded files alphabetically
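The ordering described in the note can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `order_filenames` is a hypothetical helper, and the page pattern is inferred from the `page_NNN.png` naming shown above.

```python
import re

# Assumed page naming pattern, inferred from the documented examples.
PAGE_PATTERN = re.compile(r"page_(\d+)\.png")


def order_filenames(names: list[str]) -> list[str]:
    """Return pages first (by page number), then embedded files alphabetically."""
    pages = [n for n in names if PAGE_PATTERN.fullmatch(n)]
    embedded = [n for n in names if not PAGE_PATTERN.fullmatch(n)]
    pages.sort(key=lambda n: int(PAGE_PATTERN.fullmatch(n).group(1)))
    embedded.sort()
    return pages + embedded


ordered = order_filenames(
    ["MetronInfo.xml", "page_002.png", "ComicInfo.xml", "page_001.png"]
)
```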

`is_write_operation_expected() -> bool`

Check if write operations are supported.

RETURNS	DESCRIPTION
`True`	PDF files support writing embedded files (metadata). TYPE: `bool`

Note

This method is used by the parent class to determine if write operations should be attempted. For PDF files, this returns True to allow embedding metadata files like ComicInfo.xml and MetronInfo.xml. Note that PDF pages themselves remain read-only virtual files.

`read_file(archive_file: str) -> bytes`

Read a file from the PDF (page or embedded file).

PARAMETER	DESCRIPTION
`archive_file`	The filename within the archive. Can be: - Page file: 'page_NNN.png' (zero-padded page number, starts at 001) - Embedded file: e.g., 'ComicInfo.xml', 'MetronInfo.xml' TYPE: `str`

RETURNS	DESCRIPTION
`bytes`	File data as bytes: - For page files: PNG image data rendered at PAGE_DPI resolution (150 DPI) - For embedded files: Raw file content

RAISES	DESCRIPTION
`ArchiverReadError`	If the file cannot be read because of an invalid page filename format, a page number out of range, a file not found in the archive, a corrupt or invalid PDF file, PDF processing errors, or other I/O errors.

Examples:

Python Console Session

>>> archiver = PdfArchiver(Path("document.pdf"))
>>> # Read a page
>>> image_data = archiver.read_file("page_001.png")
>>> # Read embedded metadata
>>> metadata = archiver.read_file("ComicInfo.xml")
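Two of the failure cases listed above, an invalid page filename and a page number out of range, can be illustrated with a small validation sketch. `page_index_from_name` is a hypothetical helper, the filename pattern is an assumption based on the documented examples, and `ValueError` stands in for `ArchiverReadError`, which is not defined here.

```python
import re

# Assumed page naming pattern, inferred from the documented examples.
PAGE_PATTERN = re.compile(r"page_(\d+)\.png")


def page_index_from_name(archive_file: str, page_count: int) -> int:
    """Map a page filename to a zero-based page index, validating the range."""
    match = PAGE_PATTERN.fullmatch(archive_file)
    if match is None:
        raise ValueError(f"Invalid page filename: {archive_file}")
    page_number = int(match.group(1))  # filenames are 1-based
    if not 1 <= page_number <= page_count:
        raise ValueError(f"Page {page_number} out of range (1-{page_count})")
    return page_number - 1


first = page_index_from_name("page_001.png", page_count=10)
```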

`remove_files(filename_list: list[str]) -> bool`

Remove embedded files from the PDF.

PARAMETER	DESCRIPTION
`filename_list`	A list of filenames to remove. Only embedded files can be removed (not page files like page_NNN.png). TYPE: `list[str]`

RETURNS	DESCRIPTION
`bool`	True if all existing embedded files were successfully removed, False if any error occurred. Returns True if the list is empty or contains only non-existent files.

Note

Only embedded files can be removed (page files are read-only)
Non-existent files are silently ignored
Page files (page_NNN.png) are skipped with a warning
All removals are performed in a single transaction

Examples:

Python Console Session

>>> archiver = PdfArchiver(Path("comic.pdf"))
>>> archiver.remove_files(["ComicInfo.xml", "MetronInfo.xml"])
>>> # Removes both metadata files if they exist
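The selection rules from the note (skip page files, ignore names that are not embedded) can be sketched as a filter. `select_removable` is a hypothetical helper under an assumed page naming pattern; the real method additionally logs a warning for page files and performs the removal itself.

```python
import re

# Assumed page naming pattern, inferred from the documented examples.
PAGE_PATTERN = re.compile(r"page_\d+\.png")


def select_removable(filename_list: list[str], embedded: set[str]) -> list[str]:
    """Return the names from filename_list that are removable embedded files."""
    removable = []
    for name in filename_list:
        if PAGE_PATTERN.fullmatch(name):
            continue  # page files are read-only and are skipped
        if name not in embedded:
            continue  # non-existent files are ignored
        removable.append(name)
    return removable


to_remove = select_removable(
    ["page_001.png", "ComicInfo.xml", "Missing.xml"],
    embedded={"ComicInfo.xml", "MetronInfo.xml"},
)
```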

`test() -> bool`

Test whether the file is a valid PDF document.

RETURNS	DESCRIPTION
`bool`	True if the file is a valid PDF, False otherwise. TYPE: `bool`

Note

This method attempts to open the PDF with pymupdf to validate its structure rather than only checking the file extension.

`write_file(archive_file: str, data: str | bytes) -> bool`

Write an embedded file to the PDF.

PARAMETER	DESCRIPTION
`archive_file`	The filename for the embedded file (e.g., 'ComicInfo.xml'). Cannot be a page file (page_NNN.png) as pages are read-only. TYPE: `str`
`data`	The data to embed (string or bytes). TYPE: `str \| bytes`

RETURNS	DESCRIPTION
`bool`	True if the file was successfully embedded, False otherwise.

Note

Only embedded files can be written (not PDF pages)
If the embedded file already exists, it will be replaced
String data is automatically encoded as UTF-8
Commonly used for ComicInfo.xml and MetronInfo.xml metadata

Warning

Attempting to write to a page file (page_NNN.png) will fail and log a warning, as pages are read-only virtual files.

Examples:

Python Console Session

>>> archiver = PdfArchiver(Path("comic.pdf"))
>>> xml_data = '<?xml version="1.0"?><ComicInfo>...</ComicInfo>'
>>> success = archiver.write_file("ComicInfo.xml", xml_data)
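
The note that string data is encoded as UTF-8 can be illustrated with a short sketch of the conversion. `to_embedded_bytes` is a hypothetical helper shown for illustration; it is not the library's internal function.

```python
def to_embedded_bytes(data):
    """Return data as bytes, encoding str input as UTF-8 (hypothetical helper)."""
    if isinstance(data, str):
        return data.encode("utf-8")
    return bytes(data)


# The accented character occupies two bytes once encoded as UTF-8.
xml_data = '<?xml version="1.0"?><ComicInfo><Title>Caf\u00e9</Title></ComicInfo>'
payload = to_embedded_bytes(xml_data)
```

Passing bytes directly avoids the encoding step, which is useful when the metadata has already been serialized with a specific encoding declaration.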