Nine Basic PDF Document Operations Using the PyPDF2 Library


This article will detail nine basic operations for PDF handling using the PyPDF2 library, along with practical code examples to help you get started quickly. 

PyPDF2 is a pure-Python library for reading, splitting, merging, cropping, and otherwise manipulating PDF documents, which makes it well suited to automating PDF processing tasks. The sections below walk through the nine basics in turn, each with a short example.

Installing PyPDF2

First, you need to install the PyPDF2 library. Open your command-line tool and run the following command:

pip install pypdf2

After installation, you can import the PyPDF2 library as follows:

from PyPDF2 import PdfReader, PdfWriter

Reading a PDF File

Reading a PDF file is the first step in any PDF processing task. Here’s a simple example:

# Import the necessary library
from PyPDF2 import PdfReader

# Create a PdfReader object
reader = PdfReader("example.pdf")

# Get the number of pages in the PDF
num_pages = len(reader.pages)
print(f"Number of pages: {num_pages}")

# Get the content of the first page
first_page = reader.pages[0]
text = first_page.extract_text()
print(f"First page text:\n{text}")

This code creates a PdfReader object to read the file named example.pdf, then retrieves the total number of pages and prints it. It then extracts the text from the first page and prints it.
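Extending this from the first page to the whole document is a small loop. The sketch below is one way to do it, assuming only that each page object exposes extract_text() as in the example above. The names extract_all_text and read_pdf_text and the blank-line separator are illustrative choices, not part of PyPDF2. Pages without a text layer (scanned images, for instance) typically yield no text, so the helper treats a falsy result as empty.

```python
def extract_all_text(pages, separator="\n\n"):
    """Join the text of every page; pages without text contribute ''."""
    chunks = []
    for page in pages:
        # Treat a falsy result (no extractable text) as an empty string
        chunks.append(page.extract_text() or "")
    return separator.join(chunks)


def read_pdf_text(path):
    """Hypothetical wrapper: open a PDF with PyPDF2 and extract all text."""
    # Imported here so extract_all_text stays usable without the library
    from PyPDF2 import PdfReader

    return extract_all_text(PdfReader(path).pages)
```

Calling read_pdf_text("example.pdf") would then return the text of every page as one string.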

Splitting a PDF File

Sometimes, we need to split a large PDF file into multiple smaller files. Here is an example of how to achieve that:

from PyPDF2 import PdfReader, PdfWriter

def split_pdf(input_pdf, output_prefix):
    # Create a PdfReader object
    reader = PdfReader(input_pdf)

    # Loop through each page
    for i in range(len(reader.pages)):
        writer = PdfWriter()

        # Add the current page
        writer.add_page(reader.pages[i])

        # Write the new PDF file
        with open(f"{output_prefix}_{i + 1}.pdf", "wb") as f:
            writer.write(f)

split_pdf("large_document.pdf", "part")

This code defines a function split_pdf, which accepts two parameters: the input PDF file name and the output prefix. For each page in the input PDF, it creates a new PDF file and saves it.
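One file per page is not always what you want. As a variation, the sketch below writes consecutive groups of pages to each output file. The helper names chunk_ranges and split_pdf_in_chunks and the chunk_size parameter are assumptions made for illustration; the PyPDF2 calls are the same ones used in split_pdf above.

```python
def chunk_ranges(num_pages, chunk_size):
    """Return (start, stop) index pairs covering num_pages in chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, num_pages))
        for start in range(0, num_pages, chunk_size)
    ]


def split_pdf_in_chunks(input_pdf, output_prefix, chunk_size):
    """Write consecutive groups of chunk_size pages to separate files."""
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(input_pdf)
    ranges = chunk_ranges(len(reader.pages), chunk_size)
    for number, (start, stop) in enumerate(ranges, start=1):
        writer = PdfWriter()
        # Copy pages start..stop-1 into this chunk
        for i in range(start, stop):
            writer.add_page(reader.pages[i])
        with open(f"{output_prefix}_{number}.pdf", "wb") as f:
            writer.write(f)
```

For a 10-page input and a chunk_size of 4, this would produce three files holding 4, 4, and 2 pages.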

Merging Multiple PDF Files

Merging multiple PDF files is also quite simple. Here’s an example:

from PyPDF2 import PdfReader, PdfWriter

def merge_pdfs(inputs, output):
    writer = PdfWriter()

    for input_pdf in inputs:
        reader = PdfReader(input_pdf)
        for page in reader.pages:
            writer.add_page(page)

    with open(output, "wb") as f:
        writer.write(f)

merge_pdfs(["doc1.pdf", "doc2.pdf"], "merged.pdf")

This function takes a list of multiple PDF file names as input and merges them into a single PDF file. Each page is added using the add_page method.
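Pages land in the merged file in the order the file names are listed, so the order of that list matters. Plain string sorting puts doc10.pdf before doc2.pdf, which is rarely the intent. The sort key below is a small sketch of "natural" ordering; natural_key is a hypothetical helper name and uses only the standard library.

```python
import re


def natural_key(name):
    """Sort key that compares digit runs numerically, so doc2 < doc10."""
    # re.split with a capture group alternates text and digit runs,
    # so the same positions always hold the same type when compared.
    return [
        int(part) if part.isdigit() else part.lower()
        for part in re.split(r"(\d+)", name)
    ]


names = ["doc10.pdf", "doc2.pdf", "doc1.pdf"]
ordered = sorted(names, key=natural_key)
```

The ordered list can then be passed straight to merge_pdfs, for example merge_pdfs(ordered, "merged.pdf").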

Rotating PDF Pages

If you need to adjust the orientation of certain pages in a PDF, you can do so with the page's rotate method. It takes a clockwise angle in multiples of 90 degrees; a negative angle rotates counterclockwise. (In recent PyPDF2 releases the older rotate_clockwise name is deprecated in favor of rotate.)

from PyPDF2 import PdfReader, PdfWriter

def rotate_pages(input_pdf, output_pdf, angle):
    # Only quarter turns are valid page rotations
    if angle % 90 != 0:
        raise ValueError("Angle must be a multiple of 90.")

    reader = PdfReader(input_pdf)
    writer = PdfWriter()

    for page in reader.pages:
        # Positive angles rotate clockwise, negative ones counterclockwise
        page.rotate(angle)
        writer.add_page(page)

    with open(output_pdf, "wb") as f:
        writer.write(f)

rotate_pages("example.pdf", "rotated.pdf", 90)

This function rotates all pages in the PDF by the specified angle (a multiple of 90 degrees, negative for counterclockwise) and saves the result to a new PDF file.
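When only certain pages need turning, the same loop can skip the rest. The sketch below assumes the page objects provide a rotate(angle) method that accepts multiples of 90 and modifies the page in place; rotate_selected and its 1-based page_numbers argument are illustrative names, not PyPDF2 API.

```python
def rotate_selected(pages, page_numbers, angle):
    """Rotate only the given 1-based page numbers; keep the page order."""
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90")
    wanted = set(page_numbers)
    result = []
    for number, page in enumerate(pages, start=1):
        if number in wanted:
            page.rotate(angle)  # assumed page method, rotates in place
        result.append(page)
    return result
```

In use, each returned page would be handed to the writer: for page in rotate_selected(reader.pages, [1, 3], 90): writer.add_page(page).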

Encrypting a PDF File

Encrypting a PDF file can protect its content from unauthorized access. Here’s a simple encryption example:

from PyPDF2 import PdfReader, PdfWriter

def encrypt_pdf(input_pdf, output_pdf, password):
    reader = PdfReader(input_pdf)
    writer = PdfWriter()

    for page in reader.pages:
        writer.add_page(page)

    # Set the password
    writer.encrypt(password)

    # Write the encrypted PDF file
    with open(output_pdf, "wb") as f:
        writer.write(f)

encrypt_pdf("example.pdf", "encrypted.pdf", "secret")

This code defines the encrypt_pdf function, which accepts three parameters: the input PDF file name, the output PDF file name, and the password. It adds each page from the input PDF to the writer object, then encrypts the PDF using the encrypt method, and writes the encrypted file to a new PDF.

Decrypting a PDF File

To read an encrypted PDF, you need to decrypt it first. Here’s an example:

from PyPDF2 import PdfReader, PdfWriter

def decrypt_pdf(input_pdf, output_pdf, password):
    reader = PdfReader(input_pdf)

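    # Check if the PDF is encrypted
    if reader.is_encrypted:
        try:
            # A wrong password does not raise; decrypt() returns 0 instead
            result = reader.decrypt(password)
        except Exception as e:
            print(f"Error: {e}")
            return
        if not result:
            print("Error: incorrect password")
            return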

    writer = PdfWriter()

    for page in reader.pages:
        writer.add_page(page)

    # Write the decrypted PDF file
    with open(output_pdf, "wb") as f:
        writer.write(f)

decrypt_pdf("encrypted.pdf", "decrypted.pdf", "secret")

This function attempts to decrypt the PDF using the provided password. If successful, it adds all pages to the writer object and writes the decrypted PDF to a new file.

Adding a Watermark

Adding a watermark to a PDF can help identify the ownership of the document or remind readers of copyright issues. Here’s an example of adding a watermark:

from PyPDF2 import PdfReader, PdfWriter

def add_watermark(input_pdf, watermark_pdf, output_pdf):
    reader = PdfReader(input_pdf)
    watermark_reader = PdfReader(watermark_pdf)
    writer = PdfWriter()

    # Get the watermark page
    watermark_page = watermark_reader.pages[0]

    for page in reader.pages:
        page.merge_page(watermark_page)
        writer.add_page(page)

    # Write the PDF with the watermark
    with open(output_pdf, "wb") as f:
        writer.write(f)

add_watermark("example.pdf", "watermark.pdf", "watermarked.pdf")

This code defines the add_watermark function, which takes three parameters: the input PDF file name, the watermark PDF file name, and the output PDF file name. It merges the first page of the watermark PDF onto each page of the input PDF and writes the result to a new file.

Practical Case: Batch Processing PDF Files

Suppose you have a folder with multiple PDF files and need to batch process them, such as adding a watermark, encrypting, or decrypting them. Here is an example of batch processing:

import os
from PyPDF2 import PdfReader, PdfWriter

def batch_process_pdf(input_folder, watermark_pdf, output_folder, password=None):
    # Create the output folder
    os.makedirs(output_folder, exist_ok=True)

    # Get all PDF files in the folder
    pdf_files = [f for f in os.listdir(input_folder) if f.endswith(".pdf")]

    for pdf_file in pdf_files:
        input_pdf_path = os.path.join(input_folder, pdf_file)
        output_pdf_path = os.path.join(output_folder, pdf_file)

        # Create a PdfReader object
        reader = PdfReader(input_pdf_path)
        writer = PdfWriter()

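        # Decrypt an encrypted input before touching its pages
        if reader.is_encrypted:
            if password is None or not reader.decrypt(password):
                print(f"Skipping {pdf_file}: missing or wrong password")
                continue

        # Load the watermark page, if a watermark was requested
        watermark_page = None
        if watermark_pdf is not None:
            watermark_page = PdfReader(watermark_pdf).pages[0]

        # Add every page exactly once, stamping the watermark when given
        for page in reader.pages:
            if watermark_page is not None:
                page.merge_page(watermark_page)
            writer.add_page(page)

        # Encrypt the output when a password was given
        if password is not None:
            writer.encrypt(password)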

        # Write the new PDF file
        with open(output_pdf_path, "wb") as f:
            writer.write(f)

batch_process_pdf("input_folder", "watermark.pdf", "output_folder", "secret")

This code defines a function batch_process_pdf, which takes four parameters: the input folder path, watermark PDF file name, output folder path, and password. It processes each PDF in the input folder by adding a watermark and applying encryption or decryption, and then saves the processed files to the output folder.
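The file listing step is worth a closer look, because matching on the lowercase ".pdf" ending misses files such as REPORT.PDF and would also pick up a directory whose name ends in .pdf. The sketch below is one alternative using pathlib from the standard library; find_pdfs is a hypothetical helper name.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the PDF files directly inside folder, sorted by path."""
    return sorted(
        path
        for path in Path(folder).iterdir()
        # Regular files only, with the extension compared case-insensitively
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Each returned Path can be passed to PdfReader after conversion with str(path), and path.name gives the file name to reuse in the output folder.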

This article covered installing PyPDF2 and using it to read, split, merge, rotate, encrypt, decrypt, and watermark PDF files, and closed with a batch-processing example that combines several of these steps. We hope these examples help you understand the library and automate your own PDF processing tasks.