Python: Get Coordinates of the Specified Text or Image in PDF
2024-05-21 01:58:08 Written by support iceblue
Retrieving the coordinates of text or images within a PDF document lets you quickly locate specific elements, which is valuable when extracting content from PDFs. It also makes it possible to add annotations, marks, or stamps at the desired locations in a PDF, enabling more advanced document processing and manipulation.
In this article, you will learn how to get coordinates of the specified text or image in a PDF document using Spire.PDF for Python.
- Get Coordinates of the Specified Text in PDF in Python
- Get Coordinates of the Specified Image in PDF in Python
Install Spire.PDF for Python
This scenario requires Spire.PDF for Python and plum-dispatch v1.7.4. Both can be installed on Windows with the following pip command.
pip install Spire.PDF
If you are unsure how to install, please refer to this tutorial: How to Install Spire.PDF for Python on Windows
Coordinate System in Spire.PDF
When using Spire.PDF to process an existing PDF document, the origin of the coordinate system is located at the top left corner of the page. The X-axis extends horizontally from the origin to the right, and the Y-axis extends vertically downward from the origin (shown as below).
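The top-left origin described above differs from the PDF format's own default user space, which by default starts at the bottom left corner with Y increasing upward. Coordinates read through Spire.PDF may therefore need converting before they are handed to a tool that follows that convention. A minimal sketch of the conversion, assuming an unrotated page whose height in points is known; the function name `to_bottom_left` is hypothetical and not part of Spire.PDF:

```python
def to_bottom_left(x: float, y_top: float, page_height: float) -> tuple[float, float]:
    """Convert a point measured from the top-left corner (Y down)
    to the same point measured from the bottom-left corner (Y up)."""
    return (x, page_height - y_top)


# An A4 page is roughly 842 points tall, so a point 100 points below
# the top edge sits 742 points above the bottom edge.
print(to_bottom_left(72.0, 100.0, 842.0))
```

The X value is unchanged because both conventions measure it from the left edge.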
Get Coordinates of the Specified Text in PDF in Python
To find the coordinates of a specific piece of text within a PDF document, you must first use the PdfTextFinder.Find() method to locate all instances of the target text on a particular page. Once you have found these instances, you can then access the PdfTextFragment.Positions property to retrieve the precise (X, Y) coordinates for each instance of the text.
The steps to get coordinates of the specified text in PDF are as follows.
- Create a PdfDocument object.
- Load a PDF document from a specified path.
- Get a specific page from the document.
- Create a PdfTextFinder object.
- Specify find options through PdfTextFinder.Options property.
- Search for a string within the page using PdfTextFinder.Find() method.
- Get a specific instance of the search results.
- Get X and Y coordinates of the text through PdfTextFragment.Positions[0].X and PdfTextFragment.Positions[0].Y properties.
- Python
from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Privacy Policy.pdf")

# Get a specific page
page = doc.Pages[0]

# Create a PdfTextFinder object
textFinder = PdfTextFinder(page)

# Specify find options
# Note: the second assignment replaces the first, so most likely only WholeWord is in effect
findOptions = PdfTextFindOptions()
findOptions.Parameter = TextFindParameter.IgnoreCase
findOptions.Parameter = TextFindParameter.WholeWord
textFinder.Options = findOptions

# Search for the string "PRIVACY POLICY" within the page
findResults = textFinder.Find("PRIVACY POLICY")

# Get the first instance of the results
result = findResults[0]

# Get X/Y coordinates of the found text
x = int(result.Positions[0].X)
y = int(result.Positions[0].Y)
print("The coordinates of the first instance of the found text are:", (x, y))

# Dispose resources
doc.Dispose()
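The example above reads only the first match and would raise an IndexError if the text were not found. The following sketch collects the coordinates of every match instead. It assumes the value returned by Find() can be iterated like a Python sequence and that each fragment has at least one entry in Positions. The helper name `fragment_coordinates` is hypothetical, and the demo uses stand-in objects rather than real Spire.PDF types:

```python
from types import SimpleNamespace


def fragment_coordinates(find_results):
    """Return the (x, y) of the first position of every found fragment.

    `find_results` is any sequence of objects exposing `Positions`, where
    each position has `X` and `Y`, as the example above assumes.
    """
    coordinates = []
    for fragment in find_results:
        first = fragment.Positions[0]
        coordinates.append((float(first.X), float(first.Y)))
    return coordinates


# Stand-in objects that mimic the attribute shape only; not Spire.PDF types.
fake_results = [
    SimpleNamespace(Positions=[SimpleNamespace(X=72.5, Y=90.25)]),
    SimpleNamespace(Positions=[SimpleNamespace(X=72.5, Y=410.0)]),
]
print(fragment_coordinates(fake_results))
print(fragment_coordinates([]))  # no matches gives an empty list
```

With a real document, `fragment_coordinates(textFinder.Find("PRIVACY POLICY"))` would be the corresponding call under the same assumptions.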
Get Coordinates of the Specified Image in PDF in Python
Spire.PDF for Python provides the PdfImageHelper class, which allows users to extract image details from a specific page within a PDF file. By doing so, you can leverage the PdfImageInfo.Bounds property to retrieve the (X, Y) coordinates of an individual image.
The steps to get coordinates of the specified image in PDF are as follows.
- Create a PdfDocument object.
- Load a PDF document from a specified path.
- Get a specific page from the document.
- Create a PdfImageHelper object.
- Get the image information from the page using PdfImageHelper.GetImagesInfo() method.
- Get X and Y coordinates of a specific image through PdfImageInfo.Bounds property.
- Python
from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Privacy Policy.pdf")

# Get a specific page
page = doc.Pages[0]

# Create a PdfImageHelper object
imageHelper = PdfImageHelper()

# Get image information from the page
imageInformation = imageHelper.GetImagesInfo(page)

# Get X/Y coordinates of a specific image
x = int(imageInformation[0].Bounds.X)
y = int(imageInformation[0].Bounds.Y)
print("The coordinates of the specified image are:", (x, y))

# Dispose resources
doc.Dispose()
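The X and Y values give only the top left corner of the image. If the full area the image occupies is needed, it could be derived from the same bounds object, as in the sketch below. The article only shows `Bounds.X` and `Bounds.Y`, so `Width` and `Height` are an assumption here (they are what a rectangle type would usually expose). The helper `bounds_to_box` is hypothetical:

```python
from types import SimpleNamespace


def bounds_to_box(bounds):
    """Return (left, top, right, bottom) for a rectangle-like object.

    Assumes `bounds` exposes X, Y, Width and Height, with Y measured
    downward from the top of the page.
    """
    left, top = float(bounds.X), float(bounds.Y)
    return (left, top, left + float(bounds.Width), top + float(bounds.Height))


# Stand-in for PdfImageInfo.Bounds; only the attribute names are mimicked.
fake_bounds = SimpleNamespace(X=50.0, Y=120.0, Width=200.0, Height=80.0)
print(bounds_to_box(fake_bounds))
```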
Apply for a Temporary License
If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.
Extracting text from a PDF document is a process that allows one to retrieve the textual content within a PDF file. PDFs, or Portable Document Format files, are widely used for their ability to preserve the formatting and layout of documents across different platforms. However, extracting text from a PDF can be necessary when you need to work with the text separately, such as analyzing data, conducting research, or converting it into another format. In this article, you will learn how to extract text from a PDF document in Python using Spire.PDF for Python.
- Extract Text from a Particular Page in Python
- Extract Text from a Rectangle Area in Python
- Extract Text from a PDF Document Using Simple Extraction Strategy in Python
Extract Text from a Particular Page in Python
The PdfTextExtractor class in Spire.PDF for Python allows you to extract text from a particular page, while the PdfTextExtractOptions class enables you to control the extraction process and define how the text will be extracted. The following are the steps to extract text from a certain page of a PDF document.
- Create a PdfDocument object.
- Load a PDF file using PdfDocument.LoadFromFile() method.
- Get the specific page through PdfDocument.Pages[index] property.
- Create a PdfTextExtractor object.
- Create a PdfTextExtractOptions object, and set the IsExtractAllText property to True.
- Extract text from the selected page using PdfTextExtractor.ExtractText() method.
- Write the extracted text to a TXT file.
- Python
from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile('C:/Users/Administrator/Desktop/Terms of service.pdf')

# Get a specific page (index 1 is the second page)
page = doc.Pages[1]

# Create a PdfTextExtractor object
textExtractor = PdfTextExtractor(page)

# Create a PdfTextExtractOptions object
extractOptions = PdfTextExtractOptions()

# Set IsExtractAllText to True
extractOptions.IsExtractAllText = True

# Extract text from the page keeping white spaces
text = textExtractor.ExtractText(extractOptions)

# Write text to a txt file (the output folder must already exist)
with open('output/TextOfPage.txt', 'w', encoding='utf-8') as file:
    file.write(text)
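The final step writes to a file inside an `output` folder, and Python's `open()` fails if that folder does not exist. A small helper like the one below can create the folder first and write the text as UTF-8, which also avoids encoding errors for non-ASCII text on Windows. The function name `save_text` is hypothetical and uses only the standard library:

```python
from pathlib import Path


def save_text(text: str, path: str) -> Path:
    """Write text to `path` as UTF-8, creating missing parent folders."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target
```

In the example above, `save_text(text, 'output/TextOfPage.txt')` could stand in for the `with open(...)` block.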
Extract Text from a Rectangle Area in Python
The PdfTextExtractOptions.ExtractArea property specifies a rectangle area from which the text will be extracted. The following are the steps to extract text from a rectangle area of a page using Spire.PDF for Python.
- Create a PdfDocument object.
- Load a PDF file using PdfDocument.LoadFromFile() method.
- Get the specific page through PdfDocument.Pages[index] property.
- Create a PdfTextExtractor object.
- Create a PdfTextExtractOptions object, and specify the rectangle area through the ExtractArea property of it.
- Extract text from the rectangle using PdfTextExtractor.ExtractText() method.
- Write the extracted text to a TXT file.
- Python
from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile('C:/Users/Administrator/Desktop/Terms of service.pdf')

# Get a specific page (index 1 is the second page)
page = doc.Pages[1]

# Create a PdfTextExtractor object
textExtractor = PdfTextExtractor(page)

# Create a PdfTextExtractOptions object
extractOptions = PdfTextExtractOptions()

# Set the rectangle area
extractOptions.ExtractArea = RectangleF(0.0, 100.0, 890.0, 80.0)

# Extract text from the rectangle area keeping white spaces
text = textExtractor.ExtractText(extractOptions)

# Write text to a txt file (the output folder must already exist)
with open('output/TextOfRectangle.txt', 'w', encoding='utf-8') as file:
    file.write(text)
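When the area of interest is known by two opposite corners, for instance from coordinates measured on the page, the rectangle values have to be derived first. The sketch below assumes the four arguments of RectangleF are x, y, width and height, in that order (the article does not state this explicitly), and that Y grows downward. The helper `rect_from_corners` is hypothetical:

```python
def rect_from_corners(x1, y1, x2, y2):
    """Return (x, y, width, height) for the rectangle spanned by two corners.

    The corners may be given in either order.
    """
    x, y = min(x1, x2), min(y1, y2)
    return (x, y, abs(x2 - x1), abs(y2 - y1))


# The rectangle from the example above, described by two of its corners.
print(rect_from_corners(0.0, 100.0, 890.0, 180.0))
```

Under that assumption, the result could be unpacked into the constructor, for example `RectangleF(*rect_from_corners(0.0, 100.0, 890.0, 180.0))`.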
Extract Text from a PDF Document Using Simple Extraction Strategy in Python
The methods above extract text line by line. With the SimpleExtraction strategy, the extractor keeps track of the current Y position of each string and inserts a line break into the output whenever the Y position changes. The following are the detailed steps.
- Create a PdfDocument object.
- Load a PDF file using PdfDocument.LoadFromFile() method.
- Get the specific page through PdfDocument.Pages[index] property.
- Create a PdfTextExtractor object.
- Create a PdfTextExtractOptions object and set the IsSimpleExtraction property to True.
- Extract text from the selected page using PdfTextExtractor.ExtractText() method.
- Write the extracted text to a TXT file.
- Python
from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile('C:/Users/Administrator/Desktop/Invoice.pdf')

# Get a specific page
page = doc.Pages[0]

# Create a PdfTextExtractor object
textExtractor = PdfTextExtractor(page)

# Create a PdfTextExtractOptions object
extractOptions = PdfTextExtractOptions()

# Set IsSimpleExtraction to True
extractOptions.IsSimpleExtraction = True

# Extract text from the page using SimpleExtraction strategy
text = textExtractor.ExtractText(extractOptions)

# Write text to a txt file (the output folder must already exist)
with open('output/SimplyExtraction.txt', 'w', encoding='utf-8') as file:
    file.write(text)
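The idea behind the strategy, starting a new line whenever the Y position changes, can be illustrated with plain Python. This is only a sketch of the behavior described in this section, not Spire.PDF's actual implementation. The function name `join_by_y` and the sample fragments are made up, and comparing Y values for exact equality is a simplification:

```python
def join_by_y(fragments):
    """Join (text, y) fragments, starting a new line whenever y changes."""
    lines = []
    current_y = None
    for text, y in fragments:
        if current_y is None or y != current_y:
            lines.append(text)   # Y changed: begin a new output line
            current_y = y
        else:
            lines[-1] += text    # same Y: continue the current line
    return "\n".join(lines)


fragments = [("Invoice ", 50.0), ("No. 42", 50.0), ("Total: ", 80.0), ("19.99", 80.0)]
print(join_by_y(fragments))
```

The two fragments at Y = 50 end up on one line and the two at Y = 80 on the next.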
PDF documents have become a ubiquitous format for sharing and archiving data, including textual content and images. Extracting images from PDF documents can serve a multitude of purposes, from repurposing graphical content for presentations or reports to feeding these visuals into machine learning models for analysis and recognition tasks. What’s more, automating this process with Python offers users a streamlined method to efficiently retrieve images from PDF documents for further manipulation.
This article demonstrates how to leverage Spire.PDF for Python to extract images from PDF documents with step-by-step guides and code examples.
Extract All Images from a PDF Page with Python
Spire.PDF for Python provides the PdfImageHelper class to assist users in working with images in PDF documents, including operations such as deleting, replacing, and retrieving images. Developers can use the PdfImageHelper.GetImagesInfo(page: PdfPageBase) method to get the image information collection of a PDF page, and then use PdfImageInfo.Image.Save() method to save the images to files.
The detailed steps for extracting images from a PDF page are as follows:
- Create an instance of PdfDocument class and load a PDF file using PdfDocument.LoadFromFile() method.
- Create an instance of PdfImageHelper class.
- Get the page for extracting images using PdfDocument.Pages.get_Item() method.
- Get the image information of the page using PdfImageHelper.GetImagesInfo(page: PdfPageBase) method.
- Iterate through the image information items and save the images to files using PdfImageInfo.Image.Save() method.
- Python
from spire.pdf import PdfDocument, PdfImageHelper

# Create a PdfDocument instance
pdf = PdfDocument()

# Load a PDF file
pdf.LoadFromFile("Sample.pdf")

# Create a PdfImageHelper instance
imageHelper = PdfImageHelper()

# Get the last page of the document
page = pdf.Pages.get_Item(pdf.Pages.Count - 1)

# Get the image information of the page
imageInfo = imageHelper.GetImagesInfo(page)

# Iterate through the image information
for i in range(0, len(imageInfo)):
    # Save images to file
    imageInfo[i].Image.Save("output/PDFImages/Image" + str(i) + ".png")

# Release resources
pdf.Dispose()
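The example saves into `output/PDFImages`, and it is not stated whether `Image.Save()` creates that folder when it is missing. A cautious variant is to create the folder up front and keep track of the files written, as in the sketch below. The helper `save_images` is hypothetical and relies only on each item exposing `Image.Save(path)`, as the example above does:

```python
from pathlib import Path


def save_images(image_infos, out_dir, prefix="Image"):
    """Save every image in `image_infos` to `out_dir` and return the paths.

    Each item is assumed to expose `Image.Save(path)`.
    """
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, info in enumerate(image_infos):
        path = folder / f"{prefix}{index}.png"
        info.Image.Save(str(path))
        saved.append(path)
    return saved
```

With the objects from the example above, the loop could be replaced by `save_images(imageInfo, "output/PDFImages")`.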
Extract All Images from a PDF Document with Python
Developers can easily extract all images from a PDF document by iterating through each page and retrieving the images contained therein.
The detailed steps for extracting all images from a PDF document are as follows:
- Create an instance of PdfDocument class and load a PDF file using PdfDocument.LoadFromFile() method.
- Create an instance of PdfImageHelper class.
- Iterate through the pages in the document:
  - Get the current page using PdfDocument.Pages.get_Item() method.
  - Get the image information of the page using PdfImageHelper.GetImagesInfo() method.
  - Iterate through the image information items and save the images to files using PdfImageInfo.Image.Save() method.
- Python
from spire.pdf import PdfDocument, PdfImageHelper

# Create a PdfDocument instance
pdf = PdfDocument()

# Load a PDF file
pdf.LoadFromFile("Sample.pdf")

# Create a PdfImageHelper instance
imageHelper = PdfImageHelper()

# Iterate through the pages in the document
for i in range(0, pdf.Pages.Count):
    # Get the current page
    page = pdf.Pages.get_Item(i)
    # Get the image information of the page
    imageInfo = imageHelper.GetImagesInfo(page)
    # Iterate through the image information items
    for j in range(0, len(imageInfo)):
        # Save the current image to file
        imageInfo[j].Image.Save(f"output/PDFImages/Image{i}_{j}.png")

# Release resources
pdf.Close()
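In the examples in this article, resources are released on the last line, so an exception raised earlier (for instance while saving an image) would skip that step. One way to guard against this is a small context manager that always calls `Dispose()`, the method used in the earlier examples. The name `disposing` is hypothetical, and the sketch works with any object that has a `Dispose()` method:

```python
from contextlib import contextmanager


@contextmanager
def disposing(resource):
    """Yield `resource` and call its Dispose() method on exit, even on error."""
    try:
        yield resource
    finally:
        resource.Dispose()
```

A document could then be handled as `with disposing(PdfDocument()) as pdf:` followed by the loading and extraction code in the indented block.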