Extract/Read

Retrieving the coordinates of text or images within a PDF document lets you quickly locate specific elements, which is valuable when extracting content. The same coordinates can be used to place annotations, marks, or stamps at the desired locations, enabling more advanced document processing and manipulation.

In this article, you will learn how to get the coordinates of specified text or a specified image in a PDF document using Spire.PDF for Python.

Install Spire.PDF for Python

This scenario requires Spire.PDF for Python and plum-dispatch v1.7.4. Both can be installed on Windows with the following pip command.

pip install Spire.PDF

If you are unsure how to install, please refer to this tutorial: How to Install Spire.PDF for Python on Windows

Coordinate System in Spire.PDF

When using Spire.PDF to process an existing PDF document, the origin of the coordinate system is located at the top left corner of the page. The X-axis extends horizontally from the origin to the right, and the Y-axis extends vertically downward from the origin.
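
For comparison, the PDF format's own default coordinate space puts the origin at the bottom left corner of the page with the Y-axis pointing up, so values coming from other tools may need to be flipped before they are compared with the ones described here. Below is a minimal sketch of that conversion, assuming both systems measure in the same units and the page is not rotated. The function name flip_y and the 842-point page height are illustrative only and are not part of Spire.PDF.

```python
def flip_y(y: float, page_height: float, object_height: float = 0.0) -> float:
    """Convert a Y value between a top-left-origin and a bottom-left-origin system.

    For a single point, leave object_height at 0. For a rectangle, pass its
    height so the result refers to the rectangle's edge nearest the new origin
    (top edge in a top-left system, bottom edge in a bottom-left system).
    The conversion is symmetric, so the same function works in both directions.
    """
    return page_height - y - object_height


# Hypothetical values: a point 100 units below the top of an 842-unit-tall page
print(flip_y(100.0, 842.0))
```

Because the formula is its own inverse, calling flip_y twice with the same page height and object height returns the original value.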

Get Coordinates of the Specified Text in PDF in Python

To find the coordinates of a specific piece of text within a PDF document, you must first use the PdfTextFinder.Find() method to locate all instances of the target text on a particular page. Once you have found these instances, you can then access the PdfTextFragment.Positions property to retrieve the precise (X, Y) coordinates for each instance of the text.

The steps to get coordinates of the specified text in PDF are as follows.

  • Create a PdfDocument object.
  • Load a PDF document from a specified path.
  • Get a specific page from the document.
  • Create a PdfTextFinder object.
  • Specify find options through PdfTextFinder.Options property.
  • Search for a string within the page using PdfTextFinder.Find() method.
  • Get a specific instance of the search results.
  • Get X and Y coordinates of the text through PdfTextFragment.Positions[0].X and PdfTextFragment.Positions[0].Y properties.
  • Python
from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Privacy Policy.pdf")

# Get a specific page
page = doc.Pages[0]

# Create a PdfTextFinder object
textFinder = PdfTextFinder(page)

# Specify find options
# (each assignment to Parameter replaces the previous value, so only one is set here)
findOptions = PdfTextFindOptions()
findOptions.Parameter = TextFindParameter.WholeWord
textFinder.Options = findOptions
 
# Search for the string "PRIVACY POLICY" within the page
findResults = textFinder.Find("PRIVACY POLICY") 

# Get the first instance of the results
result = findResults[0]

# Get X/Y coordinates of the found text
x = int(result.Positions[0].X)
y = int(result.Positions[0].Y)
print("The coordinates of the first instance of the found text are:", (x, y))

# Dispose resources
doc.Dispose()
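
The example above reads only the first match, and indexing the results would fail if the search found nothing. The following sketch shows one way to collect the coordinates of every match while tolerating an empty result. It runs on stand-in objects that merely mimic the Positions, X and Y attributes used above; the helper name collect_coordinates is hypothetical, and whether real search results behave exactly like these stand-ins is an assumption.

```python
from types import SimpleNamespace


def collect_coordinates(fragments):
    """Return the (x, y) of the first position of every fragment, or [] if none."""
    coords = []
    for fragment in fragments:
        positions = fragment.Positions
        if len(positions) == 0:
            # Skip fragments that report no position at all
            continue
        coords.append((positions[0].X, positions[0].Y))
    return coords


# Stand-in data shaped like search results (not real Spire.PDF objects)
fake_results = [
    SimpleNamespace(Positions=[SimpleNamespace(X=72.0, Y=90.5)]),
    SimpleNamespace(Positions=[SimpleNamespace(X=72.0, Y=310.25)]),
]

print(collect_coordinates(fake_results))
print(collect_coordinates([]))
```

Returning an empty list instead of raising lets the calling code decide how to report that the text was not found.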

Get Coordinates of the Specified Image in PDF in Python

Spire.PDF for Python provides the PdfImageHelper class, which allows users to extract image details from a specific page within a PDF file. By doing so, you can leverage the PdfImageInfo.Bounds property to retrieve the (X, Y) coordinates of an individual image.

The steps to get coordinates of the specified image in PDF are as follows.

  • Create a PdfDocument object.
  • Load a PDF document from a specified path.
  • Get a specific page from the document.
  • Create a PdfImageHelper object.
  • Get the image information from the page using PdfImageHelper.GetImagesInfo() method.
  • Get X and Y coordinates of a specific image through PdfImageInfo.Bounds property.
  • Python
from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Privacy Policy.pdf")

# Get a specific page 
page = doc.Pages[0]

# Create a PdfImageHelper object
imageHelper = PdfImageHelper()

# Get image information from the page
imageInformation = imageHelper.GetImagesInfo(page)

# Get X/Y coordinates of a specific image
x = int(imageInformation[0].Bounds.X)
y = int(imageInformation[0].Bounds.Y)
print("The coordinates of the specified image are:", (x, y))

# Dispose resources
doc.Dispose()
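
The X and Y values above identify one corner of the image. If the bounds also expose a width and a height (an assumption here, since the example above reads only X and Y), the other corner and the center can be derived with simple arithmetic. The sketch below uses a stand-in Bounds class rather than a real Spire.PDF object, and the helper name describe_bounds is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bounds:
    """Stand-in for a rectangle exposing X, Y, Width and Height."""
    X: float
    Y: float
    Width: float
    Height: float


def describe_bounds(b):
    """Return the top-left, bottom-right and center points of a rectangle.

    Assumes a top-left origin with Y increasing downward.
    """
    return {
        "top_left": (b.X, b.Y),
        "bottom_right": (b.X + b.Width, b.Y + b.Height),
        "center": (b.X + b.Width / 2, b.Y + b.Height / 2),
    }


# Hypothetical image bounds
info = describe_bounds(Bounds(X=50.0, Y=100.0, Width=200.0, Height=80.0))
print(info["center"])
```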

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.

Extracting text from a PDF document is a process that allows one to retrieve the textual content within a PDF file. PDFs, or Portable Document Format files, are widely used for their ability to preserve the formatting and layout of documents across different platforms. However, extracting text from a PDF can be necessary when you need to work with the text separately, such as analyzing data, conducting research, or converting it into another format. In this article, you will learn how to extract text from a PDF document in Python using Spire.PDF for Python.

Extract Text from a Particular Page in Python

The PdfTextExtractor class in Spire.PDF for Python allows you to extract text from a particular page, while the PdfTextExtractOptions class enables you to control the extraction process and define how the text will be extracted. The following are the steps to extract text from a certain page of a PDF document.

  • Create a PdfDocument object.
  • Load a PDF file using PdfDocument.LoadFromFile() method.
  • Get the specific page through PdfDocument.Pages[index] property.
  • Create a PdfTextExtractor object.
  • Create a PdfTextExtractOptions object, and set the IsExtractAllText property to True.
  • Extract text from the selected page using PdfTextExtractor.ExtractText() method.
  • Write the extracted text to a TXT file.
  • Python
from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile('C:/Users/Administrator/Desktop/Terms of service.pdf')

# Get a specific page
page = doc.Pages[1]

# Create a PdfTextExtractor object
textExtractor = PdfTextExtractor(page)

# Create a PdfTextExtractOptions object
extractOptions = PdfTextExtractOptions()

# Set IsExtractAllText to True
extractOptions.IsExtractAllText = True

# Extract text from the page keeping white spaces
text = textExtractor.ExtractText(extractOptions)

# Write text to a txt file (the "output" folder must already exist)
with open('output/TextOfPage.txt', 'w', encoding='utf-8') as file:
    file.write(text)

# Dispose resources
doc.Dispose()
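
Python's open() raises FileNotFoundError when the target folder is missing, and the platform default encoding may not be able to represent every character extracted from a PDF. A small helper along the following lines avoids both problems. It is a sketch using only the standard library; the name save_text is hypothetical and not part of Spire.PDF.

```python
from pathlib import Path


def save_text(text: str, path: str) -> Path:
    """Write text to path as UTF-8, creating missing parent folders first."""
    target = Path(path)
    # Create "output/" (or any deeper folder chain) if it does not exist yet
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target
```

With this in place, the final step of the example could be written as save_text(text, 'output/TextOfPage.txt').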

Extract Text from a Rectangle Area in Python

The PdfTextExtractOptions.ExtractArea property specifies a rectangle area from which the text will be extracted. The following are the steps to extract text from a rectangle area of a page using Spire.PDF for Python.

  • Create a PdfDocument object.
  • Load a PDF file using PdfDocument.LoadFromFile() method.
  • Get the specific page through PdfDocument.Pages[index] property.
  • Create a PdfTextExtractor object.
  • Create a PdfTextExtractOptions object, and specify the rectangle area through the ExtractArea property of it.
  • Extract text from the rectangle using PdfTextExtractor.ExtractText() method.
  • Write the extracted text to a TXT file.
  • Python
from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile('C:/Users/Administrator/Desktop/Terms of service.pdf')

# Get a specific page
page = doc.Pages[1]

# Create a PdfTextExtractor object
textExtractor = PdfTextExtractor(page)

# Create a PdfTextExtractOptions object
extractOptions = PdfTextExtractOptions()

# Set the rectangle area
extractOptions.ExtractArea = RectangleF(0.0, 100.0, 890.0, 80.0)

# Extract text from the rectangle area keeping white spaces
text = textExtractor.ExtractText(extractOptions)

# Write text to a txt file (the "output" folder must already exist)
with open('output/TextOfRectangle.txt', 'w', encoding='utf-8') as file:
    file.write(text)

# Dispose resources
doc.Dispose()
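
The rectangle in the example is given as four numbers. Assuming they follow the common (x, y, width, height) order measured from the top left origin, it is often easier to start from two opposite corner points and derive the width and height. The helper below is a plain-Python sketch of that arithmetic; rect_from_corners is a hypothetical name, and its result would still need to be passed to RectangleF.

```python
def rect_from_corners(x1, y1, x2, y2):
    """Return (x, y, width, height) for the rectangle spanned by two corner points.

    The corners may be given in any order; the result always starts at the
    corner with the smallest X and Y values.
    """
    x, y = min(x1, x2), min(y1, y2)
    return (float(x), float(y), float(abs(x2 - x1)), float(abs(y2 - y1)))


# A band spanning x = 0..890 and y = 100..180
print(rect_from_corners(0, 100, 890, 180))
```

Under the stated assumption, these corner points yield the same four values used in the example above.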

Extract Text from a PDF Document Using the Simple Extraction Strategy in Python

The methods above extract text line by line. The SimpleExtraction strategy instead keeps track of the current Y position of each string and inserts a line break into the output whenever the Y position changes. The following are the detailed steps.

  • Create a PdfDocument object.
  • Load a PDF file using PdfDocument.LoadFromFile() method.
  • Get the specific page through PdfDocument.Pages[index] property.
  • Create a PdfTextExtractor object.
  • Create a PdfTextExtractOptions object and set the IsSimpleExtraction property to True.
  • Extract text from the selected page using PdfTextExtractor.ExtractText() method.
  • Write the extracted text to a TXT file.
  • Python
from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile('C:/Users/Administrator/Desktop/Invoice.pdf')

# Get a specific page
page = doc.Pages[0]

# Create a PdfTextExtractor object
textExtractor = PdfTextExtractor(page)

# Create a PdfTextExtractOptions object
extractOptions = PdfTextExtractOptions()

# Set IsSimpleExtraction to True
extractOptions.IsSimpleExtraction = True

# Extract text from the page using SimpleExtraction strategy
text = textExtractor.ExtractText(extractOptions)

# Write text to a txt file (the "output" folder must already exist)
with open('output/SimplyExtraction.txt', 'w', encoding='utf-8') as file:
    file.write(text)

# Dispose resources
doc.Dispose()
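
To make the idea behind this strategy concrete, the sketch below joins a list of (y, text) fragments and starts a new line whenever the Y position changes. It only illustrates the behavior described above and is not the library's actual implementation; the function name join_by_y, the tolerance parameter, and the sample fragments are all assumptions made for the example.

```python
def join_by_y(fragments, tolerance=0.5):
    """Join (y, text) fragments, starting a new line whenever Y changes.

    Two fragments are treated as being on the same line when their Y values
    differ by no more than `tolerance`.
    """
    lines = []
    current_y = None
    for y, text in fragments:
        if current_y is None or abs(y - current_y) > tolerance:
            # Y position changed: begin a new output line
            lines.append(text)
            current_y = y
        else:
            # Same Y position: append to the current line
            lines[-1] += text
    return "\n".join(lines)


# Hypothetical fragments in reading order
fragments = [(100.0, "Invoice "), (100.0, "No. 42"), (120.0, "Total: "), (120.0, "$10")]
print(join_by_y(fragments))
```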

PDF documents have become a ubiquitous format for sharing and archiving data, including textual content and images. Extracting images from PDF documents can serve a multitude of purposes, from repurposing graphical content for presentations or reports to feeding these visuals into machine learning models for analysis and recognition tasks. What’s more, automating this process with Python offers users a streamlined method to efficiently retrieve images from PDF documents for further manipulation.

This article demonstrates how to leverage Spire.PDF for Python to extract images from PDF documents with step-by-step guides and code examples.

Extract All Images from a PDF Page with Python

Spire.PDF for Python provides the PdfImageHelper class to assist users in working with images in PDF documents, including operations such as deleting, replacing, and retrieving images. Developers can use the PdfImageHelper.GetImagesInfo(page: PdfPageBase) method to get the image information collection of a PDF page, and then use PdfImageInfo.Image.Save() method to save the images to files.

The detailed steps for extracting images from a PDF page are as follows:

  • Create an instance of PdfDocument class and load a PDF file using PdfDocument.LoadFromFile() method.
  • Create an instance of PdfImageHelper class.
  • Get the page for extracting images using PdfDocument.Pages.get_Item() method.
  • Get the image information of the page using PdfImageHelper.GetImagesInfo(page: PdfPageBase) method.
  • Iterate through the image information items and save the images to files using PdfImageInfo.Image.Save() method.
  • Python
from spire.pdf import PdfDocument, PdfImageHelper

# Create a PdfDocument instance
pdf = PdfDocument()

# Load a PDF file
pdf.LoadFromFile("Sample.pdf")

# Create a PdfImageHelper instance
imageHelper = PdfImageHelper()

# Get the last page of the document
page = pdf.Pages.get_Item(pdf.Pages.Count - 1)

# Get the image information of the page
imageInfo = imageHelper.GetImagesInfo(page)

# Iterate through the image information
for i in range(0, len(imageInfo)):
    # Save images to file
    imageInfo[i].Image.Save("output/PDFImages/Image" + str(i) + ".png")

# Release resources
pdf.Dispose()

Extract All Images from a PDF Document with Python

Developers can easily extract all images from a PDF document by iterating through each page and retrieving the images contained therein.

The detailed steps for extracting all images from a PDF document are as follows:

  • Create an instance of PdfDocument class and load a PDF file using PdfDocument.LoadFromFile() method.
  • Create an instance of PdfImageHelper class.
  • Iterate through the pages in the document:
    • Get the current page using PdfDocument.Pages.get_Item() method.
    • Get the image information of the page using PdfImageHelper.GetImagesInfo() method.
    • Iterate through the image information items and save the images to files using PdfImageInfo.Image.Save() method.
  • Python
from spire.pdf import PdfDocument, PdfImageHelper

# Create a PdfDocument instance
pdf = PdfDocument()

# Load a PDF file
pdf.LoadFromFile("Sample.pdf")

# Create a PdfImageHelper instance
imageHelper = PdfImageHelper()

# Iterate through the pages in the document
for i in range(0, pdf.Pages.Count):
    # Get the current page
    page = pdf.Pages.get_Item(i)
    # Get the image information of the page
    imageInfo = imageHelper.GetImagesInfo(page)
    # Iterate through the image information items
    for j in range(0, len(imageInfo)):
        # Save the current image to file
        imageInfo[j].Image.Save(f"output/PDFImages/Image{i}_{j}.png")

# Release resources
pdf.Close()
