PdfArchiver(path: Path)
Bases: Archiver
An archiver for PDF files with support for embedded metadata.
This class provides an interface for reading pages from PDF documents and managing embedded metadata files. Each page is treated as a separate PNG image file within the archive, and metadata files (like ComicInfo.xml and MetronInfo.xml) can be embedded as PDF attachments.
Pages are numbered sequentially with zero-padding (page_001.png, page_002.png, etc.).
The implementation uses pymupdf for PDF rendering and processing.
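The page naming scheme described above can be sketched as follows. `page_filename` is a hypothetical helper, not part of the documented API, and the behavior past page 999 is an assumption (the number simply grows beyond three digits here).

```python
def page_filename(page_number: int) -> str:
    """Return the virtual filename for a 1-based page number (hypothetical helper)."""
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    # Zero-pad to at least three digits: page_001.png, page_002.png, ...
    return f"page_{page_number:03d}.png"


names = [page_filename(n) for n in (1, 2, 12)]
print(names)  # ['page_001.png', 'page_002.png', 'page_012.png']
```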
| ATTRIBUTE | DESCRIPTION |
|---|---|
| path | The filesystem path to the PDF file. TYPE: Path |
Note
- PDF pages are read-only virtual files rendered at 150 DPI
- Metadata files can be embedded, read, and removed as PDF attachments
- copy_from_archive() is not supported for PDFs
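To get a feel for what 150 DPI means for the rendered page images, the sketch below converts a page size in PDF points (72 per inch by default) to approximate pixel dimensions. `rendered_size` is a hypothetical helper; the renderer's own rounding may differ by a pixel.

```python
POINTS_PER_INCH = 72  # default PDF user-space units per inch


def rendered_size(width_pt: float, height_pt: float, dpi: int = 150) -> tuple[int, int]:
    """Approximate pixel size of a page rendered at the given DPI (hypothetical helper)."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# US Letter is 612 x 792 points.
print(rendered_size(612, 792))  # (1275, 1650)
```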
Initialize a PdfArchiver with the provided path.
| PARAMETER | DESCRIPTION |
|---|---|
| path | The filesystem path to the PDF file. TYPE: Path |
Note
This constructor does not validate that the file exists or is a valid PDF document. Validation occurs when operations are performed on the document.
Functions
copy_from_archive(other_archive: Archiver) -> bool
Attempt to copy files from another archive to the PDF.
| PARAMETER | DESCRIPTION |
|---|---|
| other_archive | The source archive to copy files from. TYPE: Archiver |
| RETURNS | DESCRIPTION |
|---|---|
| False | This operation is not supported for PDF archives. TYPE: bool |
Note
This method logs a warning and returns False immediately. Copying entire archives to PDF is not supported because:
- PDF pages cannot be replaced with image files
- PDFs maintain their original page structure
- Only embedded metadata files can be written individually
To add metadata to a PDF, use write_file() instead:
>>> pdf.write_file("ComicInfo.xml", xml_data)
Warning
A warning will be logged indicating that the copy operation was attempted on a PDF archive.
Examples:
>>> pdf_archive = PdfArchiver(Path("target.pdf"))
>>> zip_archive = ZipArchiver(Path("source.cbz"))
>>> success = pdf_archive.copy_from_archive(zip_archive)
>>> print(f"Copy successful: {success}") # Will print: Copy successful: False
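Since whole-archive copies are rejected, one workaround is to copy only the metadata files one at a time. The sketch below assumes the source archive exposes the same `get_filename_list()` and `read_file()` methods documented on this page; `copy_metadata` and `InMemoryArchive` are hypothetical names used only to make the example runnable without a real PDF.

```python
from __future__ import annotations

from typing import Iterable

METADATA_NAMES = ("ComicInfo.xml", "MetronInfo.xml")


def copy_metadata(source, target, names: Iterable[str] = METADATA_NAMES) -> list[str]:
    """Copy the named files that exist in source; return the names written."""
    available = set(source.get_filename_list())
    copied = []
    for name in names:
        # Skip names the source lacks; record only successful writes.
        if name in available and target.write_file(name, source.read_file(name)):
            copied.append(name)
    return copied


class InMemoryArchive:
    """In-memory stand-in for an archiver, used only to exercise the sketch."""

    def __init__(self, files: dict[str, bytes] | None = None):
        self.files = dict(files or {})

    def get_filename_list(self) -> list[str]:
        return sorted(self.files)

    def read_file(self, archive_file: str) -> bytes:
        return self.files[archive_file]

    def write_file(self, archive_file: str, data: str | bytes) -> bool:
        self.files[archive_file] = data.encode("utf-8") if isinstance(data, str) else data
        return True


source = InMemoryArchive({"ComicInfo.xml": b"<ComicInfo/>", "001.jpg": b"\xff\xd8"})
target = InMemoryArchive()
copied = copy_metadata(source, target)
print(copied)  # ['ComicInfo.xml']
```

With real archives, `target` would be the PdfArchiver and `source` the other archiver; image files in the source are left untouched.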
get_filename_list() -> list[str]
Get a list of all files in the PDF (pages and embedded files).
| RETURNS | DESCRIPTION |
|---|---|
| list[str] | A list of filenames including virtual page files ('page_NNN.png', zero-padded, starting from 001) and embedded files (e.g., 'ComicInfo.xml', 'MetronInfo.xml'). |
| RAISES | DESCRIPTION |
|---|---|
| ArchiverReadError | If the PDF cannot be read. |
Examples:
>>> archiver = PdfArchiver(Path("document.pdf"))
>>> files = archiver.get_filename_list()
>>> print(files)
>>> # Output: ['page_001.png', 'page_002.png', 'ComicInfo.xml']
Note
- Page PNG files are virtual and generated on-demand when read
- Embedded files are actual PDF attachments
- The list is sorted with pages first, then embedded files alphabetically
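The ordering described in the note can be illustrated with a small sketch. `build_filename_list` is a hypothetical function that takes a page count and the attachment names as plain inputs; how the library actually gathers them from the PDF is not shown here.

```python
def build_filename_list(page_count: int, embedded_names: list[str]) -> list[str]:
    """Virtual page names first, then attachment names in alphabetical order."""
    pages = [f"page_{n:03d}.png" for n in range(1, page_count + 1)]
    return pages + sorted(embedded_names)


listing = build_filename_list(2, ["MetronInfo.xml", "ComicInfo.xml"])
print(listing)
# ['page_001.png', 'page_002.png', 'ComicInfo.xml', 'MetronInfo.xml']
```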
is_write_operation_expected() -> bool
Check if write operations are supported.
| RETURNS | DESCRIPTION |
|---|---|
| True | PDF files support writing embedded files (metadata). TYPE: bool |
Note
This method is used by the parent class to determine if write operations should be attempted. For PDF files, this returns True to allow embedding metadata files like ComicInfo.xml and MetronInfo.xml. Note that PDF pages themselves remain read-only virtual files.
read_file(archive_file: str) -> bytes
Read a file from the PDF (page or embedded file).
| PARAMETER | DESCRIPTION |
|---|---|
| archive_file | The filename within the archive. Can be a page file ('page_NNN.png', zero-padded page number, starting at 001) or an embedded file (e.g., 'ComicInfo.xml', 'MetronInfo.xml'). TYPE: str |
| RETURNS | DESCRIPTION |
|---|---|
| bytes | File data as bytes. For page files: PNG image data rendered at PAGE_DPI resolution. For embedded files: raw file content. |
| RAISES | DESCRIPTION |
|---|---|
| ArchiverReadError | If the file cannot be read. |
Examples:
>>> archiver = PdfArchiver(Path("document.pdf"))
>>> # Read a page
>>> image_data = archiver.read_file("page_001.png")
>>> # Read embedded metadata
>>> metadata = archiver.read_file("ComicInfo.xml")
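Because read_file() accepts both page names and embedded filenames, a caller or implementation needs some way to tell them apart. A minimal sketch, assuming page names match `page_<digits>.png` exactly; `page_index` is a hypothetical helper and the zero-based index it returns is an illustrative choice.

```python
import re
from typing import Optional

PAGE_NAME = re.compile(r"page_(\d+)\.png")


def page_index(archive_file: str) -> Optional[int]:
    """Return the zero-based page index for a page name, or None otherwise."""
    match = PAGE_NAME.fullmatch(archive_file)
    if match is None:
        return None  # treat as an embedded file name
    number = int(match.group(1))
    # Page numbering starts at 001, so page_000.png is not a valid page.
    return number - 1 if number >= 1 else None


print(page_index("page_001.png"))   # 0
print(page_index("ComicInfo.xml"))  # None
```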
remove_files(filename_list: list[str]) -> bool
Remove embedded files from the PDF.
| PARAMETER | DESCRIPTION |
|---|---|
| filename_list | A list of filenames to remove. Only embedded files can be removed (not page files like page_NNN.png). TYPE: list[str] |
| RETURNS | DESCRIPTION |
|---|---|
| bool | True if all existing embedded files were successfully removed, False if any error occurred. Returns True if the list is empty or contains only non-existent files. |
Note
- Only embedded files can be removed (page files are read-only)
- Non-existent files are silently ignored
- Page files (page_NNN.png) are skipped with a warning
- All removals are performed in a single transaction
Examples:
>>> archiver = PdfArchiver(Path("comic.pdf"))
>>> archiver.remove_files(["ComicInfo.xml", "MetronInfo.xml"])
>>> # Removes both metadata files if they exist
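The filtering rules in the note (pages skipped, missing names ignored) can be sketched as a pure function over the requested names and the set of existing attachments. `plan_removals` is hypothetical and does not touch a PDF; the real method also logs a warning for skipped pages.

```python
import re

PAGE_NAME = re.compile(r"page_\d+\.png")


def plan_removals(filename_list: list[str], embedded: set[str]) -> tuple[list[str], list[str]]:
    """Split requested names into attachments to remove and page names to skip."""
    to_remove, skipped_pages = [], []
    for name in filename_list:
        if PAGE_NAME.fullmatch(name):
            skipped_pages.append(name)  # pages are read-only
        elif name in embedded:
            to_remove.append(name)
        # Names that are neither pages nor existing attachments are ignored.
    return to_remove, skipped_pages


plan = plan_removals(
    ["ComicInfo.xml", "page_001.png", "missing.xml"],
    {"ComicInfo.xml", "MetronInfo.xml"},
)
print(plan)  # (['ComicInfo.xml'], ['page_001.png'])
```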
test() -> bool
Test whether the file is a valid PDF document.
| RETURNS | DESCRIPTION |
|---|---|
| bool | True if the file is a valid PDF, False otherwise. |
Note
This method attempts to open the PDF with pymupdf to validate its structure, not just checking the file extension.
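To illustrate why content-based checks matter more than the extension, the sketch below inspects the `%PDF-` header bytes. This is a much weaker stand-in for what the note describes (opening the document with pymupdf) and is not the library's implementation; `looks_like_pdf` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def looks_like_pdf(path: Path) -> bool:
    """Return True if the file starts with the PDF header marker (weak check)."""
    try:
        with path.open("rb") as handle:
            return handle.read(5) == b"%PDF-"
    except OSError:
        # Missing or unreadable files are treated as invalid.
        return False


with tempfile.TemporaryDirectory() as tmp:
    renamed_text = Path(tmp) / "notes.pdf"    # right extension, wrong content
    renamed_text.write_bytes(b"just some text")
    headed_file = Path(tmp) / "document.bin"  # wrong extension, PDF header
    headed_file.write_bytes(b"%PDF-1.7\n")
    results = (
        looks_like_pdf(renamed_text),
        looks_like_pdf(headed_file),
        looks_like_pdf(Path(tmp) / "missing.pdf"),
    )

print(results)  # (False, True, False)
```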
write_file(archive_file: str, data: str | bytes) -> bool
Write an embedded file to the PDF.
| PARAMETER | DESCRIPTION |
|---|---|
| archive_file | The filename for the embedded file (e.g., 'ComicInfo.xml'). Cannot be a page file (page_NNN.png) as pages are read-only. TYPE: str |
| data | The data to embed (string or bytes). TYPE: str \| bytes |
| RETURNS | DESCRIPTION |
|---|---|
| bool | True if the file was successfully embedded, False otherwise. |
Note
- Only embedded files can be written (not PDF pages)
- If the embedded file already exists, it will be replaced
- String data is automatically encoded as UTF-8
- Commonly used for ComicInfo.xml and MetronInfo.xml metadata
Warning
Attempting to write to a page file (page_NNN.png) will fail and log a warning, as pages are read-only virtual files.
Examples:
>>> archiver = PdfArchiver(Path("comic.pdf"))
>>> xml_data = '<?xml version="1.0"?><ComicInfo>...</ComicInfo>'
>>> success = archiver.write_file("ComicInfo.xml", xml_data)
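The input handling described in the notes (page names rejected, strings encoded as UTF-8) might look roughly like the sketch below. `prepare_payload` is a hypothetical helper that only normalizes the input; it does not embed anything in a PDF.

```python
import re
from typing import Optional, Union

PAGE_NAME = re.compile(r"page_\d+\.png")


def prepare_payload(archive_file: str, data: Union[str, bytes]) -> Optional[bytes]:
    """Return the bytes to embed, or None if the target is a read-only page name."""
    if PAGE_NAME.fullmatch(archive_file):
        return None  # pages cannot be written
    # Strings are encoded as UTF-8; bytes pass through unchanged.
    return data.encode("utf-8") if isinstance(data, str) else bytes(data)


print(prepare_payload("ComicInfo.xml", "<Title>Café</Title>"))
# b'<Title>Caf\xc3\xa9</Title>'
print(prepare_payload("page_001.png", b"\x89PNG"))  # None
```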