Python: Extract Text and Images from PDF Documents

PDF documents preserve the formatting and visual style of their content, but they are often difficult to edit, copy from, or search. Extracting the text and images from a PDF file lets you process that content or save it in other formats for further use. This article explains how to use Spire.PDF for Python to extract text and images from PDF documents in Python programs.

Install Spire.PDF for Python

This scenario requires Spire.PDF for Python and plum-dispatch v1.7.4. Both can be installed on Windows with the following pip command.

pip install Spire.PDF

If you are unsure how to install it, please refer to this tutorial: How to Install Spire.PDF for Python on Windows
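Before running the examples in this article, it can help to confirm that the package is importable. The following is a minimal sketch that uses only the standard library. The helper name is_module_available is an illustrative assumption and not part of Spire.PDF; the sketch also assumes the package is imported as spire.pdf, as in the examples in this article.

```python
import importlib.util


def is_module_available(module_name: str) -> bool:
    """Return True if the import system can locate the named module."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised, for example, when a parent package such as "spire" is missing
        return False


if __name__ == "__main__":
    if is_module_available("spire.pdf"):
        print("Spire.PDF for Python is available.")
    else:
        print("Spire.PDF for Python was not found; run: pip install Spire.PDF")
```

If the check fails right after installation, the pip command may have targeted a different Python interpreter than the one running the script.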

Extract Text from PDF Documents

Spire.PDF for Python provides the PdfPageBase.ExtractText() method, which extracts all text from a PDF page (including blank space) and returns it as a string. The detailed steps for extracting all text from a PDF document are as follows:

  • Create an object of the PdfDocument class.
  • Load a PDF document using the PdfDocument.LoadFromFile() method.
  • Iterate through the pages of the document, extract the text from each page using the PdfPageBase.ExtractText() method, and write it to a text file.

import os

from spire.pdf import *
from spire.pdf.common import *

# Make sure the output folder exists before writing to it
os.makedirs("output", exist_ok=True)

# Create an instance of the PdfDocument class
pdf = PdfDocument()

# Load the PDF document
pdf.LoadFromFile("Sample.pdf")

# Create a TXT file to save the extracted text
extractedText = open("output/ExtractedText.txt", "w", encoding="utf-8")

# Iterate through the pages of the document
for i in range(pdf.Pages.Count):
    # Get the page
    page = pdf.Pages.get_Item(i)
    # Extract text from the page
    text = page.ExtractText()
    # Write the text to the text file
    extractedText.write(text + "\n")

extractedText.close()
pdf.Close()
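When the text of many pages ends up in a single file, a page marker makes the result easier to navigate. The sketch below shows one way to do this with the standard library only. The function write_pages and the separator format are illustrative assumptions, and the sample strings merely stand in for the values returned by page.ExtractText().

```python
import os


def write_pages(page_texts, output_path):
    """Write one string per page to output_path, labelling each page."""
    folder = os.path.dirname(output_path)
    if folder:
        # Create the target folder if it does not exist yet
        os.makedirs(folder, exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as f:
        for number, text in enumerate(page_texts, start=1):
            f.write(f"--- Page {number} ---\n")
            # Trim trailing whitespace and leave one blank line between pages
            f.write(text.rstrip() + "\n\n")
    return output_path


if __name__ == "__main__":
    # Placeholder strings standing in for the results of page.ExtractText()
    sample = ["First page text", "Second page text"]
    write_pages(sample, "output/ExtractedText.txt")
```

The with statement closes the file even if writing fails partway through, which the explicit close() call in the example above does not guarantee.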

Extract Text from a Rectangular Area of a PDF Page

The PdfPageBase.ExtractText() method also supports extracting text from a rectangular area on a PDF page. The detailed steps are as follows:

  • Create an object of the PdfDocument class.
  • Load a PDF document using the PdfDocument.LoadFromFile() method.
  • Get a page using the PdfDocument.Pages.get_Item() method.
  • Extract the text from a rectangular area on the page using the PdfPageBase.ExtractText() method.
  • Save the extracted text to a text file.

import os

from spire.pdf import *
from spire.pdf.common import *

# Make sure the output folder exists before writing to it
os.makedirs("output", exist_ok=True)

# Create an object of PdfDocument class
pdf = PdfDocument()

# Load a PDF document
pdf.LoadFromFile("Sample.pdf")

# Get the first page
page = pdf.Pages.get_Item(0)

# Extract text from a rectangular area on the page
text = page.ExtractText(RectangleF(90.0, 220.0, 770.0, 130.0))

# Save the extracted text to a text file
extractedText = open("output/ExtractedTextArea.txt", "w", encoding="utf-8")
extractedText.write(text)
extractedText.close()
pdf.Close()
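The rectangle in the example above is given as four numbers. Assuming they mean x, y, width, and height (an assumption based on the values used, not confirmed by this article), it is sometimes easier to start from the corners of the area you measured. The helper below, rect_from_corners, is a hypothetical convenience function and not part of Spire.PDF.

```python
def rect_from_corners(left, top, right, bottom):
    """Convert corner coordinates into an (x, y, width, height) tuple."""
    if right < left or bottom < top:
        raise ValueError("right/bottom must not be smaller than left/top")
    return (float(left), float(top), float(right - left), float(bottom - top))


if __name__ == "__main__":
    x, y, width, height = rect_from_corners(90, 220, 860, 350)
    # Assumed usage with the example above:
    # text = page.ExtractText(RectangleF(x, y, width, height))
    print(x, y, width, height)
```

With the corners (90, 220) and (860, 350), the helper yields the same four values used in the example above: 90.0, 220.0, 770.0, 130.0.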

Extract All the Images from a PDF Document

Spire.PDF for Python also provides the PdfPageBase.ExtractImages() method to extract all the images from a PDF page and return them as a list. The detailed steps for extracting all the images from a PDF document are as follows:

  • Create an object of the PdfDocument class.
  • Load a PDF document using the PdfDocument.LoadFromFile() method.
  • Iterate through the pages in the document, extract the images from each page using the PdfPageBase.ExtractImages() method, and put them into a list.
  • Save the images in the list as PNG files.

import os

from spire.pdf import *
from spire.pdf.common import *

# Make sure the image output folder exists before saving to it
os.makedirs("output/Images", exist_ok=True)

# Create an instance of PdfDocument class
pdf = PdfDocument()

# Load the PDF document
pdf.LoadFromFile("Sample.pdf")

# Create a list to store the images
images = []

# Iterate through the pages in the document
for i in range(pdf.Pages.Count):
    # Get a page
    page = pdf.Pages.get_Item(i)
    # Extract the images from the page and store them in the created list
    for img in page.ExtractImages():
        images.append(img)

# Save the images in the list as PNG files, numbering them from 1
for i, image in enumerate(images, start=1):
    image.Save("output/Images/Image-{0:d}.png".format(i), ImageFormat.get_Png())

pdf.Close()
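When a document contains many images, zero-padded file names keep them in order in a file browser. The following sketch builds such names with the standard library only. The function numbered_image_paths and its default folder, prefix, and extension are illustrative assumptions that mirror the naming used in the example above.

```python
import os


def numbered_image_paths(count, folder="output/Images", prefix="Image", ext="png"):
    """Return zero-padded file paths for `count` images, creating the folder."""
    os.makedirs(folder, exist_ok=True)
    # Pad to at least two digits, or more if the count requires it
    width = max(2, len(str(count)))
    return [
        os.path.join(folder, f"{prefix}-{i:0{width}d}.{ext}")
        for i in range(1, count + 1)
    ]


if __name__ == "__main__":
    for path in numbered_image_paths(3):
        print(path)
```

Each generated path could then be passed to image.Save() together with ImageFormat.get_Png(), in the same way as the fixed name pattern in the example above.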

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to lift the functional limitations, please request a 30-day trial license.