PyPDF2 is a Python library for reading, splitting, merging, cropping, and otherwise manipulating PDF documents, which makes it well suited to automating PDF processing tasks. This article walks through nine basic operations with PyPDF2, each with a practical code example to help you get started quickly.
Installing PyPDF2
First, you need to install the PyPDF2 library. Open your command-line tool and run the following command:
pip install pypdf2
After installation, you can import the PyPDF2 library as follows:
from PyPDF2 import PdfReader, PdfWriter
Reading a PDF File
Reading a PDF file is the first step in any PDF processing task. Here’s a simple example:
# Import the necessary library
from PyPDF2 import PdfReader

# Create a PdfReader object
reader = PdfReader("example.pdf")

# Get the number of pages in the PDF
num_pages = len(reader.pages)
print(f"Number of pages: {num_pages}")

# Get the content of the first page
first_page = reader.pages[0]
text = first_page.extract_text()
print(f"First page text:\n{text}")
This code creates a PdfReader object to read the file named example.pdf, then retrieves the total number of pages and prints it. It then extracts the text from the first page and prints it.
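To go beyond the first page, the same calls can be applied in a loop. The following is a minimal sketch rather than part of the original example: the helper names extract_all_text and read_pdf_text are hypothetical, and it assumes that a page without a text layer (a scanned image, for instance) yields an empty or missing result.

```python
def extract_all_text(pages):
    """Return a list with the extracted text of each page.

    `pages` is any iterable of objects with an extract_text() method,
    such as PdfReader("example.pdf").pages.
    """
    texts = []
    for page in pages:
        # Assumption: treat an empty or missing result as "no text".
        texts.append(page.extract_text() or "")
    return texts


def read_pdf_text(path):
    """Hypothetical convenience wrapper: open a PDF and extract all text."""
    # Imported here so extract_all_text itself has no dependency.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return extract_all_text(reader.pages)
```

Calling read_pdf_text("example.pdf") would return one string per page, in page order, which can then be printed or searched.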
Splitting a PDF File
Sometimes, we need to split a large PDF file into multiple smaller files. Here is an example of how to achieve that:
from PyPDF2 import PdfReader, PdfWriter

def split_pdf(input_pdf, output_prefix):
    # Create a PdfReader object
    reader = PdfReader(input_pdf)

    # Loop through each page
    for i in range(len(reader.pages)):
        writer = PdfWriter()
        # Add the current page
        writer.add_page(reader.pages[i])

        # Write the new PDF file
        with open(f"{output_prefix}_{i + 1}.pdf", "wb") as f:
            writer.write(f)

split_pdf("large_document.pdf", "part")
This code defines a function split_pdf, which accepts two parameters: the input PDF file name and the output prefix. For each page in the input PDF, it creates a new single-page PDF file named with the prefix and the 1-based page number, such as part_1.pdf and part_2.pdf.
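One file per page is not always what you want. As a variation, the sketch below splits a document into chunks of several pages each. The names chunk_ranges and split_pdf_in_chunks, and the chunk_size parameter, are assumptions introduced for illustration; the PyPDF2 calls are the same ones used in the example above.

```python
def chunk_ranges(num_pages, chunk_size):
    """Return (start, end) index pairs covering num_pages in chunks.

    `end` is exclusive, so each pair can be passed straight to range().
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, num_pages))
        for start in range(0, num_pages, chunk_size)
    ]


def split_pdf_in_chunks(input_pdf, output_prefix, chunk_size):
    """Write one output file per chunk of `chunk_size` pages."""
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(input_pdf)
    ranges = chunk_ranges(len(reader.pages), chunk_size)
    for number, (start, end) in enumerate(ranges, start=1):
        writer = PdfWriter()
        for i in range(start, end):
            writer.add_page(reader.pages[i])
        with open(f"{output_prefix}_{number}.pdf", "wb") as f:
            writer.write(f)
```

Keeping the range calculation in its own function makes the page arithmetic easy to check without touching any files; the last chunk is simply shorter when the page count is not a multiple of chunk_size.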
Merging Multiple PDF Files
Merging multiple PDF files is also quite simple. Here’s an example:
from PyPDF2 import PdfReader, PdfWriter

def merge_pdfs(inputs, output):
    writer = PdfWriter()

    for input_pdf in inputs:
        reader = PdfReader(input_pdf)
        for page in reader.pages:
            writer.add_page(page)

    with open(output, "wb") as f:
        writer.write(f)

merge_pdfs(["doc1.pdf", "doc2.pdf"], "merged.pdf")
This function takes a list of PDF file names as input and merges them, in the order given, into a single PDF file. Each page is added using the add_page method.
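Because the pages are appended in the order of the input list, the order of the file names determines the layout of the merged document. When the names come from a directory listing, a plain alphabetical sort places doc10.pdf before doc2.pdf. The sketch below shows one way to sort numbered files naturally before merging; natural_key and merge_pdfs_sorted are hypothetical names, not part of PyPDF2.

```python
import re


def natural_key(name):
    """Sort key that compares runs of digits numerically.

    re.split with a capturing group alternates text and digit runs,
    so the odd positions are always the digit runs.
    """
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def merge_pdfs_sorted(inputs, output):
    """Merge the given PDFs in natural file-name order."""
    from PyPDF2 import PdfReader, PdfWriter

    writer = PdfWriter()
    for input_pdf in sorted(inputs, key=natural_key):
        reader = PdfReader(input_pdf)
        for page in reader.pages:
            writer.add_page(page)

    with open(output, "wb") as f:
        writer.write(f)
```

If the order is meaningful in some other way (by date, for example), passing an explicitly ordered list to the original merge_pdfs function remains the simplest option.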
Rotating PDF Pages
If you need to adjust the orientation of pages in a PDF, you can do so with the page's rotate method, which takes an angle in multiples of 90 degrees: positive values rotate clockwise and negative values counterclockwise. (Earlier releases offered rotate_clockwise for the same purpose; PyPDF2 2.x deprecates it in favor of rotate.)
from PyPDF2 import PdfReader, PdfWriter

def rotate_pages(input_pdf, output_pdf, angle):
    if angle not in (90, -90):
        raise ValueError("Angle must be 90 or -90.")

    reader = PdfReader(input_pdf)
    writer = PdfWriter()

    for page in reader.pages:
        # Positive angles rotate clockwise, negative angles counterclockwise
        page.rotate(angle)
        writer.add_page(page)

    with open(output_pdf, "wb") as f:
        writer.write(f)

rotate_pages("example.pdf", "rotated.pdf", 90)
This function rotates all pages in the PDF by the specified angle (either 90 or -90 degrees) and saves the result to a new PDF file.
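Often only a few pages (a landscape table, a scan fed in upside down) need rotating. The following sketch rotates just the listed pages and accepts any multiple of 90 degrees. It assumes the page.rotate(angle) method available in PyPDF2 2.x and later; normalize_angle, rotate_selected_pages, and the page_numbers parameter are hypothetical names used for illustration.

```python
def normalize_angle(angle):
    """Map any multiple of 90 to 0, 90, 180, or 270 degrees clockwise."""
    if angle % 90 != 0:
        raise ValueError("Angle must be a multiple of 90.")
    return angle % 360


def rotate_selected_pages(input_pdf, output_pdf, angle, page_numbers):
    """Rotate only the 1-based page numbers listed in `page_numbers`."""
    from PyPDF2 import PdfReader, PdfWriter

    clockwise = normalize_angle(angle)
    selected = set(page_numbers)

    reader = PdfReader(input_pdf)
    writer = PdfWriter()
    for number, page in enumerate(reader.pages, start=1):
        # Skip the call entirely when the net rotation is zero.
        if number in selected and clockwise:
            page.rotate(clockwise)
        writer.add_page(page)

    with open(output_pdf, "wb") as f:
        writer.write(f)
```

For example, rotate_selected_pages("example.pdf", "rotated.pdf", -90, [2, 5]) would turn pages 2 and 5 counterclockwise and copy every other page unchanged.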
Encrypting a PDF File
Encrypting a PDF file can protect its content from unauthorized access. Here’s a simple encryption example:
from PyPDF2 import PdfReader, PdfWriter

def encrypt_pdf(input_pdf, output_pdf, password):
    reader = PdfReader(input_pdf)
    writer = PdfWriter()

    for page in reader.pages:
        writer.add_page(page)

    # Set the password
    writer.encrypt(password)

    # Write the encrypted PDF file
    with open(output_pdf, "wb") as f:
        writer.write(f)

encrypt_pdf("example.pdf", "encrypted.pdf", "secret")
This code defines the encrypt_pdf function, which accepts three parameters: the input PDF file name, the output PDF file name, and the password. It adds each page from the input PDF to the writer object, then encrypts the PDF using the encrypt method, and writes the encrypted file to a new PDF.
Decrypting a PDF File
To read an encrypted PDF, you need to decrypt it first. Here’s an example:
from PyPDF2 import PdfReader, PdfWriter

def decrypt_pdf(input_pdf, output_pdf, password):
    reader = PdfReader(input_pdf)

    # Check if the PDF is encrypted
    if reader.is_encrypted:
        try:
            # decrypt() reports a wrong password through its return value
            # (0) rather than by raising an exception
            if reader.decrypt(password) == 0:
                print("Error: incorrect password")
                return
        except Exception as e:
            # Raised, for example, when the encryption method is unsupported
            print(f"Error: {e}")
            return

    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)

    # Write the decrypted PDF file
    with open(output_pdf, "wb") as f:
        writer.write(f)

decrypt_pdf("encrypted.pdf", "decrypted.pdf", "secret")
This function attempts to decrypt the PDF using the provided password. If successful, it adds all pages to the writer object and writes the decrypted PDF to a new file.
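The password check can be factored into a small helper so that other functions can reuse it. This is a sketch under one stated assumption: that decrypt() signals a wrong password by returning 0 (or a value equal to 0), as recent PyPDF2 releases do, rather than by raising. The name unlock is hypothetical.

```python
def unlock(reader, password):
    """Return True if `reader` is readable after trying `password`.

    `reader` is expected to behave like PyPDF2's PdfReader: it has an
    is_encrypted attribute and a decrypt() method.
    Assumption: decrypt() returns 0 when the password does not match.
    """
    if not reader.is_encrypted:
        # Nothing to do for an unencrypted document.
        return True
    return int(reader.decrypt(password)) != 0
```

A caller would create the reader with PdfReader("encrypted.pdf"), call unlock(reader, "secret"), and only iterate over reader.pages when the result is True.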
Adding a Watermark
Adding a watermark to a PDF can help identify the ownership of the document or remind readers of copyright issues. Here’s an example of adding a watermark:
from PyPDF2 import PdfReader, PdfWriter

def add_watermark(input_pdf, watermark_pdf, output_pdf):
    reader = PdfReader(input_pdf)
    watermark_reader = PdfReader(watermark_pdf)
    writer = PdfWriter()

    # Get the watermark page
    watermark_page = watermark_reader.pages[0]

    for page in reader.pages:
        page.merge_page(watermark_page)
        writer.add_page(page)

    # Write the PDF with the watermark
    with open(output_pdf, "wb") as f:
        writer.write(f)

add_watermark("example.pdf", "watermark.pdf", "watermarked.pdf")
This code defines the add_watermark function, which takes three parameters: the input PDF file name, the watermark PDF file name, and the output PDF file name. It merges the first page of the watermark PDF onto each page of the input PDF and writes the watermarked PDF to a new file.
Practical Case: Batch Processing PDF Files
Suppose you have a folder with multiple PDF files and need to batch process them, such as adding a watermark, encrypting, or decrypting them. Here is an example of batch processing:
import os
from PyPDF2 import PdfReader, PdfWriter

def batch_process_pdf(input_folder, watermark_pdf, output_folder, password=None):
    # Create the output folder
    os.makedirs(output_folder, exist_ok=True)

    # Load the watermark page once, if a watermark was given
    watermark_page = None
    if watermark_pdf is not None:
        watermark_reader = PdfReader(watermark_pdf)
        watermark_page = watermark_reader.pages[0]

    # Get all PDF files in the folder
    pdf_files = [f for f in os.listdir(input_folder) if f.lower().endswith(".pdf")]

    for pdf_file in pdf_files:
        input_pdf_path = os.path.join(input_folder, pdf_file)
        output_pdf_path = os.path.join(output_folder, pdf_file)

        # Create a PdfReader object
        reader = PdfReader(input_pdf_path)
        writer = PdfWriter()

        # Decrypt an encrypted input before reading its pages
        was_encrypted = reader.is_encrypted
        if was_encrypted:
            if password is None or reader.decrypt(password) == 0:
                print(f"Skipping {pdf_file}: missing or incorrect password")
                continue

        # Add each page once, with the watermark if requested
        for page in reader.pages:
            if watermark_page is not None:
                page.merge_page(watermark_page)
            writer.add_page(page)

        # Encrypt the output only if the input was not already encrypted
        if password is not None and not was_encrypted:
            writer.encrypt(password)

        # Write the new PDF file
        with open(output_pdf_path, "wb") as f:
            writer.write(f)

batch_process_pdf("input_folder", "watermark.pdf", "output_folder", "secret")
This code defines a function batch_process_pdf, which takes four parameters: the input folder path, watermark PDF file name, output folder path, and password. It processes each PDF in the input folder by adding a watermark and applying encryption or decryption, and then saves the processed files to the output folder.
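In a real batch, one damaged or unexpected file should not stop the whole run. The sketch below separates two concerns that the function above handles inline: finding the PDF files, and applying a processing step to each one while recording failures. Both find_pdfs and run_batch are hypothetical helpers that use only the standard library; the `process` argument stands for any function that takes a file path, such as a wrapper around the watermark or encryption functions shown earlier.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by name.

    The extension check is case-insensitive, so "REPORT.PDF" is included.
    """
    return sorted(
        (p for p in Path(folder).iterdir()
         if p.is_file() and p.suffix.lower() == ".pdf"),
        key=lambda p: p.name,
    )


def run_batch(paths, process):
    """Call process(path) for each path, collecting failures.

    Returns (succeeded, failed), where `failed` holds
    (path, exception) pairs.
    """
    succeeded, failed = [], []
    for path in paths:
        try:
            process(path)
        except Exception as exc:  # one bad file should not end the batch
            failed.append((path, exc))
        else:
            succeeded.append(path)
    return succeeded, failed
```

After a run such as run_batch(find_pdfs("input_folder"), process), the `failed` list can be printed or logged so that problem files are dealt with individually.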
This article introduces the basic operations for reading, splitting, merging, rotating, encrypting, decrypting, and adding watermarks to PDF files using the PyPDF2 library. Through practical code examples, we’ve shown how to automate PDF processing tasks with PyPDF2 to improve work efficiency. We hope these examples will help you better understand and apply the PyPDF2 library.