Pdf

PdfArchiver(path: Path)

Bases: Archiver

An archiver for PDF files with support for embedded metadata.

This class provides an interface for reading pages from PDF documents and managing embedded metadata files. Each page is treated as a separate PNG image file within the archive, and metadata files (like ComicInfo.xml and MetronInfo.xml) can be embedded as PDF attachments.

Pages are numbered sequentially with zero-padding (page_001.png, page_002.png, etc.).
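
The naming scheme can be illustrated with a minimal sketch. The helper `page_filename` is hypothetical and not part of the documented API; it only assumes the 1-based, three-digit zero-padded pattern described above.

```python
def page_filename(index: int) -> str:
    """Return the virtual filename for a zero-based page index.

    Hypothetical helper, written only to illustrate the naming pattern.
    """
    if index < 0:
        raise ValueError("page index must be non-negative")
    # Filenames are 1-based and padded to at least three digits.
    return f"page_{index + 1:03d}.png"


names = [page_filename(i) for i in range(3)]
print(names)  # ['page_001.png', 'page_002.png', 'page_003.png']
```

With this format specifier, page numbers above 999 simply use more digits. Whether the library behaves the same way for such documents is an assumption.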

The implementation uses pymupdf for PDF rendering and processing.

ATTRIBUTE DESCRIPTION
path

The filesystem path to the PDF file.

TYPE: Path

Note
  • PDF pages are read-only virtual files rendered at 150 DPI
  • Metadata files can be embedded, read, and removed as PDF attachments
  • copy_from_archive() is not supported for PDFs
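
As a rough guide to the size of a rendered page, the arithmetic behind a 150 DPI render can be sketched as follows. PDF page dimensions are expressed in points (72 per inch). The helper `rendered_size` is hypothetical, and the exact pixel dimensions produced by pymupdf may differ slightly because of rounding.

```python
PAGE_DPI = 150        # resolution stated in the note above
POINTS_PER_INCH = 72  # PDF page sizes are measured in points


def rendered_size(width_pt: float, height_pt: float, dpi: int = PAGE_DPI) -> tuple[int, int]:
    """Estimate the pixel size of a page rendered at the given DPI (hypothetical helper)."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


# A US Letter page is 612 x 792 points.
print(rendered_size(612, 792))  # (1275, 1650)
```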

Initialize a PdfArchiver with the provided path.

PARAMETER DESCRIPTION
path

The filesystem path to the PDF file.

TYPE: Path

Note

This constructor does not validate that the file exists or is a valid PDF document. Validation occurs when operations are performed on the document.

Functions

copy_from_archive(other_archive: Archiver) -> bool

Attempt to copy files from another archive to the PDF.

PARAMETER DESCRIPTION
other_archive

The source archive to copy files from.

TYPE: Archiver

RETURNS DESCRIPTION
False

This operation is not supported for PDF archives.

TYPE: bool

Note

This method logs a warning and returns False immediately. Copying entire archives to PDF is not supported because:

  • PDF pages cannot be replaced with image files
  • PDFs maintain their original page structure
  • Only embedded metadata files can be written individually

To add metadata to a PDF, use write_file() instead:

>>> pdf.write_file("ComicInfo.xml", xml_data)

Warning

A warning will be logged indicating that the copy operation was attempted on a PDF archive.

Examples:

Python Console Session
>>> pdf_archive = PdfArchiver(Path("target.pdf"))
>>> zip_archive = ZipArchiver(Path("source.cbz"))
>>> success = pdf_archive.copy_from_archive(zip_archive)
>>> print(f"Copy successful: {success}")  # Will print: Copy successful: False
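
Since only metadata can be written to a PDF, one possible workaround is to transfer the metadata files individually. The sketch below assumes nothing beyond the get_filename_list(), read_file(), and write_file() methods documented on this page. The names `copy_metadata`, `ArchiveLike`, and `InMemoryArchive` are hypothetical and exist only so the example runs without real files.

```python
from __future__ import annotations

from typing import Protocol

METADATA_NAMES = ("ComicInfo.xml", "MetronInfo.xml")


class ArchiveLike(Protocol):
    """The subset of the archiver interface this sketch relies on."""

    def get_filename_list(self) -> list[str]: ...
    def read_file(self, archive_file: str) -> bytes: ...
    def write_file(self, archive_file: str, data: str | bytes) -> bool: ...


def copy_metadata(source: ArchiveLike, target: ArchiveLike) -> list[str]:
    """Copy known metadata files from source to target and return the names copied."""
    copied = []
    available = set(source.get_filename_list())
    for name in METADATA_NAMES:
        if name in available and target.write_file(name, source.read_file(name)):
            copied.append(name)
    return copied


class InMemoryArchive:
    """Stand-in archive backed by a dict, used only to exercise the sketch."""

    def __init__(self, files: dict[str, bytes] | None = None) -> None:
        self.files = dict(files or {})

    def get_filename_list(self) -> list[str]:
        return sorted(self.files)

    def read_file(self, archive_file: str) -> bytes:
        return self.files[archive_file]

    def write_file(self, archive_file: str, data: str | bytes) -> bool:
        self.files[archive_file] = data.encode("utf-8") if isinstance(data, str) else data
        return True


source = InMemoryArchive({"001.jpg": b"...", "ComicInfo.xml": b"<ComicInfo/>"})
target = InMemoryArchive()
print(copy_metadata(source, target))  # ['ComicInfo.xml']
```

In real use, `source` would be something like a ZipArchiver and `target` a PdfArchiver; image pages in the source are deliberately left untouched.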

get_filename_list() -> list[str]

Get a list of all files in the PDF (pages and embedded files).

RETURNS DESCRIPTION
list[str]

A list of filenames including:

  • Virtual page files: 'page_NNN.png' (zero-padded, starting from 001)
  • Embedded files: e.g., 'ComicInfo.xml', 'MetronInfo.xml'

RAISES DESCRIPTION
ArchiverReadError

If the PDF cannot be read due to:

  • Corrupt or invalid PDF file
  • File system or permission errors
  • PDF processing errors

Examples:

Python Console Session
>>> archiver = PdfArchiver(Path("document.pdf"))
>>> files = archiver.get_filename_list()
>>> print(files)
['page_001.png', 'page_002.png', 'ComicInfo.xml']

Note
  • Page PNG files are virtual and generated on-demand when read
  • Embedded files are actual PDF attachments
  • The list is sorted with pages first, then embedded files alphabetically
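
The ordering described in the note can be reproduced with a sort key, shown here as a minimal sketch. The function `listing_order` is hypothetical, and plain case-sensitive string comparison for the embedded files is an assumption; the library's actual implementation may differ.

```python
import re

PAGE_PATTERN = re.compile(r"^page_(\d+)\.png$")


def listing_order(name: str) -> tuple[int, int, str]:
    """Sort key: page files first (by page number), then other files by name."""
    match = PAGE_PATTERN.match(name)
    if match:
        return (0, int(match.group(1)), "")
    return (1, 0, name)


files = ["MetronInfo.xml", "page_002.png", "ComicInfo.xml", "page_001.png"]
print(sorted(files, key=listing_order))
# ['page_001.png', 'page_002.png', 'ComicInfo.xml', 'MetronInfo.xml']
```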

is_write_operation_expected() -> bool

Check if write operations are supported.

RETURNS DESCRIPTION
True

PDF files support writing embedded files (metadata).

TYPE: bool

Note

This method is used by the parent class to determine if write operations should be attempted. For PDF files, this returns True to allow embedding metadata files like ComicInfo.xml and MetronInfo.xml. Note that PDF pages themselves remain read-only virtual files.

read_file(archive_file: str) -> bytes

Read a file from the PDF (page or embedded file).

PARAMETER DESCRIPTION
archive_file

The filename within the archive. Can be:

  • Page file: 'page_NNN.png' (zero-padded page number, starts at 001)
  • Embedded file: e.g., 'ComicInfo.xml', 'MetronInfo.xml'

TYPE: str

RETURNS DESCRIPTION
bytes

File data as bytes:

  • For page files: PNG image data rendered at PAGE_DPI resolution
  • For embedded files: Raw file content

RAISES DESCRIPTION
ArchiverReadError

If the file cannot be read due to:

  • Invalid page filename format
  • Page number out of range
  • File not found in archive
  • Corrupt or invalid PDF file
  • PDF processing errors
  • Other I/O errors

Examples:

Python Console Session
>>> archiver = PdfArchiver(Path("document.pdf"))
>>> # Read a page
>>> image_data = archiver.read_file("page_001.png")
>>> # Read embedded metadata
>>> metadata = archiver.read_file("ComicInfo.xml")
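
The first two failure cases listed above (invalid page filename format, page number out of range) can be pictured with a small validation sketch. Here `page_index` is a hypothetical helper, and the `ArchiverReadError` class is a local stand-in defined so the example runs on its own; it is not the library's exception class.

```python
import re


class ArchiverReadError(Exception):
    """Local stand-in for the library's exception, for illustration only."""


PAGE_PATTERN = re.compile(r"^page_(\d{3,})\.png$")


def page_index(archive_file: str, page_count: int) -> int:
    """Translate 'page_NNN.png' into a zero-based page index, validating the range."""
    match = PAGE_PATTERN.match(archive_file)
    if match is None:
        raise ArchiverReadError(f"Invalid page filename: {archive_file}")
    number = int(match.group(1))
    if not 1 <= number <= page_count:
        raise ArchiverReadError(f"Page {number} out of range (1-{page_count})")
    return number - 1


print(page_index("page_003.png", page_count=10))  # 2
```

Callers of read_file() would typically wrap the call in a try/except for ArchiverReadError, since all of the listed failures surface through that one exception type.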

remove_files(filename_list: list[str]) -> bool

Remove embedded files from the PDF.

PARAMETER DESCRIPTION
filename_list

A list of filenames to remove. Only embedded files can be removed (not page files like page_NNN.png).

TYPE: list[str]

RETURNS DESCRIPTION
bool

True if all existing embedded files were successfully removed, False if any error occurred. Returns True if the list is empty or contains only non-existent files.

Note
  • Only embedded files can be removed (page files are read-only)
  • Non-existent files are silently ignored
  • Page files (page_NNN.png) are skipped with a warning
  • All removals are performed in a single transaction

Examples:

Python Console Session
>>> archiver = PdfArchiver(Path("comic.pdf"))
>>> archiver.remove_files(["ComicInfo.xml", "MetronInfo.xml"])
>>> # Removes both metadata files if they exist
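
The filtering rules in the note can be sketched as a pure function that decides which names would actually be removed. The helper `removable` and its `embedded` argument are hypothetical; the sketch does not touch a PDF and only mirrors the documented skip and ignore behavior.

```python
import logging
import re

logger = logging.getLogger(__name__)
PAGE_PATTERN = re.compile(r"^page_\d+\.png$")


def removable(filename_list: list[str], embedded: set[str]) -> list[str]:
    """Return the names from filename_list that are embedded files eligible for removal."""
    selected = []
    for name in filename_list:
        if PAGE_PATTERN.match(name):
            # Page files are read-only, so they are skipped with a warning.
            logger.warning("Skipping read-only page file: %s", name)
        elif name in embedded:
            selected.append(name)
        # Any other name does not exist in the PDF and is silently ignored.
    return selected


embedded_files = {"ComicInfo.xml", "MetronInfo.xml"}
print(removable(["page_001.png", "ComicInfo.xml", "missing.xml"], embedded_files))
# ['ComicInfo.xml']
```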

test() -> bool

Test whether the file is a valid PDF document.

RETURNS DESCRIPTION
bool

True if the file is a valid PDF, False otherwise.

TYPE: bool

Note

This method attempts to open the PDF with pymupdf to validate its structure, rather than only checking the file extension.
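
For contrast, a much weaker check than the one this method performs is to look at the file's leading bytes. The sketch below is not how test() works; `looks_like_pdf` is a hypothetical helper that only confirms the file starts with the `%PDF-` header marker, so a truncated or corrupt PDF could still pass it.

```python
import tempfile
from pathlib import Path


def looks_like_pdf(path: Path) -> bool:
    """Weak pre-check: True if the file begins with the PDF header marker."""
    try:
        with path.open("rb") as handle:
            return handle.read(5) == b"%PDF-"
    except OSError:
        # Missing or unreadable files are treated as "not a PDF".
        return False


with tempfile.TemporaryDirectory() as tmp:
    pdf_like = Path(tmp) / "comic.pdf"
    pdf_like.write_bytes(b"%PDF-1.7\n")
    renamed_zip = Path(tmp) / "archive.pdf"
    renamed_zip.write_bytes(b"PK\x03\x04")
    results = (looks_like_pdf(pdf_like), looks_like_pdf(renamed_zip))

print(results)  # (True, False)
```

A header check like this is cheap, but only opening the document, as test() does, says anything about its structure.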

write_file(archive_file: str, data: str | bytes) -> bool

Write an embedded file to the PDF.

PARAMETER DESCRIPTION
archive_file

The filename for the embedded file (e.g., 'ComicInfo.xml'). Cannot be a page file (page_NNN.png) as pages are read-only.

TYPE: str

data

The data to embed (string or bytes).

TYPE: str | bytes

RETURNS DESCRIPTION
bool

True if the file was successfully embedded, False otherwise.

Note
  • Only embedded files can be written (not PDF pages)
  • If the embedded file already exists, it will be replaced
  • String data is automatically encoded as UTF-8
  • Commonly used for ComicInfo.xml and MetronInfo.xml metadata

Warning

Attempting to write to a page file (page_NNN.png) will fail and log a warning, as pages are read-only virtual files.

Examples:

Python Console Session
>>> archiver = PdfArchiver(Path("comic.pdf"))
>>> xml_data = '<?xml version="1.0"?><ComicInfo>...</ComicInfo>'
>>> success = archiver.write_file("ComicInfo.xml", xml_data)
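
The handling of string data and page filenames described in the notes can be pictured with a minimal sketch. The helper `prepare_payload` is hypothetical and does not write to a PDF; it only shows the UTF-8 encoding step and the rejection of page filenames, under the assumption that the library applies equivalent rules.

```python
from __future__ import annotations

import re

PAGE_PATTERN = re.compile(r"^page_\d+\.png$")


def prepare_payload(archive_file: str, data: str | bytes) -> bytes | None:
    """Return the bytes to embed, or None if the target is a read-only page file."""
    if PAGE_PATTERN.match(archive_file):
        return None
    # Strings are encoded as UTF-8; bytes are passed through unchanged.
    return data.encode("utf-8") if isinstance(data, str) else data


print(prepare_payload("ComicInfo.xml", "<ComicInfo/>"))  # b'<ComicInfo/>'
print(prepare_payload("page_001.png", b"\x89PNG"))       # None
```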