PDF documents preserve the formatting and visual style of their content, but this makes them difficult to edit, copy from, or search. Extracting the text and images from a PDF file lets users process the content or save it in other formats for further use. This article explains how to use Spire.PDF for Python to extract text and images from PDF documents in Python programs.
- Extract All Text from a PDF Document
- Extract Text from a Rectangular Area of a PDF Page
- Extract All Images from a PDF Document
Install Spire.PDF for Python
This scenario requires Spire.PDF for Python and plum-dispatch v1.7.4. Both can be installed on Windows with the following pip command.
pip install Spire.PDF
If you are unsure how to install it, refer to this tutorial: How to Install Spire.PDF for Python on Windows
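After running the pip command, it can be useful to confirm that both packages are visible to the interpreter you intend to use. The following is a minimal sketch that relies only on the standard library; the helper name installed_version is hypothetical, and the distribution names are assumed to match the names used in this section.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Report the status of the two packages this article depends on.
for name in ("Spire.PDF", "plum-dispatch"):
    found = installed_version(name)
    print(f"{name}: {found if found else 'not installed'}")
```

If either line reports "not installed", the pip command was probably run against a different Python environment than the one executing the script.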
Extract All Text from a PDF Document
Spire.PDF for Python provides the PdfPageBase.ExtractText() method, which extracts all text from a PDF page (including whitespace) and returns it as a string. The detailed steps for extracting all text from a PDF document are as follows:
- Create an object of the PdfDocument class.
- Load a PDF document using the PdfDocument.LoadFromFile() method.
- Iterate through the pages of the document, extract the text from each page using the PdfPageBase.ExtractText() method, and write it to a text file.
- Python
from spire.pdf import *
from spire.pdf.common import *

# Create an instance of the PdfDocument class
pdf = PdfDocument()

# Load the PDF document
pdf.LoadFromFile("Sample.pdf")

# Create a TXT file to save the extracted text
extractedText = open("output/ExtractedText.txt", "w", encoding="utf-8")

# Iterate through the pages of the document
for i in range(pdf.Pages.Count):
    # Get the page
    page = pdf.Pages.get_Item(i)
    # Extract text from the page
    text = page.ExtractText()
    # Write the text to the text file
    extractedText.write(text + "\n")

extractedText.close()
pdf.Close()
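The sample above assumes that the output folder already exists and closes the text file manually. As a hedged variation, the same loop can be wrapped in a small helper that creates the folder first and uses a context manager, so the file is closed even if extraction fails partway through. The function name write_document_text is hypothetical; it only uses the Pages.Count, Pages.get_Item() and ExtractText() members shown in this section.

```python
from pathlib import Path


def write_document_text(pdf, out_path):
    """Write the text of every page of `pdf` to `out_path` and return the page count.

    `pdf` is expected to behave like a loaded PdfDocument, that is, to expose
    Pages.Count and Pages.get_Item(i), with pages offering ExtractText().
    """
    target = Path(out_path)
    # Create the output folder if it is missing (open() alone would fail).
    target.parent.mkdir(parents=True, exist_ok=True)

    page_count = pdf.Pages.Count
    with target.open("w", encoding="utf-8") as handle:
        for i in range(page_count):
            page = pdf.Pages.get_Item(i)
            handle.write(page.ExtractText() + "\n")
    return page_count


# Usage with Spire.PDF (requires the package to be installed):
#   from spire.pdf import *
#   pdf = PdfDocument()
#   pdf.LoadFromFile("Sample.pdf")
#   write_document_text(pdf, "output/ExtractedText.txt")
#   pdf.Close()
```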
Extract Text from a Rectangular Area of a PDF Page
The PdfPageBase.ExtractText() method also supports extracting text from a rectangular area on a PDF page. The detailed steps are as follows:
- Create an object of the PdfDocument class.
- Load a PDF document using the PdfDocument.LoadFromFile() method.
- Get a page using the PdfDocument.Pages.get_Item() method.
- Extract the text from a rectangular area on the page using the PdfPageBase.ExtractText() method.
- Save the extracted text to a text file.
- Python
from spire.pdf import *
from spire.pdf.common import *

# Create an object of PdfDocument class
pdf = PdfDocument()

# Load a PDF document
pdf.LoadFromFile("Sample.pdf")

# Get the first page
page = pdf.Pages.get_Item(0)

# Extract text from a rectangular area on the page
text = page.ExtractText(RectangleF(90.0, 220.0, 770.0, 130.0))

# Save the extracted text to a text file
extractedText = open("output/ExtractedTextArea.txt", "w", encoding="utf-8")
extractedText.write(text)
extractedText.close()
pdf.Close()
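The four numbers passed to RectangleF are easy to get wrong when the area was measured as two corners. The sketch below assumes that the arguments are x, y, width and height (the usual RectangleF convention), that values are in points (1/72 inch), and that y increases toward the bottom of the page; these assumptions should be checked against the Spire.PDF documentation. The helper names rect_from_corners and mm_to_points are hypothetical.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def mm_to_points(mm):
    """Convert millimetres to points, assuming 72 points per inch."""
    return mm * POINTS_PER_INCH / MM_PER_INCH


def rect_from_corners(left, top, right, bottom):
    """Turn two corners into (x, y, width, height), assuming y grows downward."""
    if right <= left or bottom <= top:
        raise ValueError("right must exceed left and bottom must exceed top")
    return (float(left), float(top), float(right - left), float(bottom - top))


# The area used in the sample above, expressed as corners.
x, y, width, height = rect_from_corners(90, 220, 860, 350)
print(x, y, width, height)

# With Spire.PDF installed, the values could then be passed on as:
#   text = page.ExtractText(RectangleF(x, y, width, height))
```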
Extract All Images from a PDF Document
Spire.PDF for Python also provides the PdfPageBase.ExtractImages() method to extract all the images from a PDF page and return them as a list. The detailed steps for extracting all the images from a PDF document are as follows:
- Create an object of the PdfDocument class.
- Load a PDF document using the PdfDocument.LoadFromFile() method.
- Iterate through the pages in the document, extract the images from each page using the PdfPageBase.ExtractImages() method, and put them into a list.
- Save the images in the list as PNG files.
- Python
from spire.pdf import *
from spire.pdf.common import *

# Create an instance of PdfDocument class
pdf = PdfDocument()

# Load the PDF document
pdf.LoadFromFile("Sample.pdf")

# Create a list to store the images
images = []

# Iterate through the pages in the document
for i in range(pdf.Pages.Count):
    # Get a page
    page = pdf.Pages.get_Item(i)
    # Extract the images from the page and store them in the created list
    for img in page.ExtractImages():
        images.append(img)

# Save the images in the list as PNG files
i = 0
for image in images:
    i += 1
    image.Save("output/Images/Image-{0:d}.png".format(i), ImageFormat.get_Png())

pdf.Close()
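The sample above numbers the images across the whole document, so the file names do not show which page an image came from, and saving fails if the output folder is missing. A possible variation, sketched under the assumption that the objects behave as in the sample (ExtractImages() returns an iterable of images with a Save(path, format) method), saves each image as soon as it is extracted and puts the page number in the file name. The function name save_page_images and the naming pattern are hypothetical.

```python
from pathlib import Path


def save_page_images(pdf, out_dir, image_format):
    """Save every image of every page and return the list of written paths.

    `image_format` is passed straight to Save(); with Spire.PDF this would be
    ImageFormat.get_Png(), as in the sample above.
    """
    out = Path(out_dir)
    # Create the output folder up front so Save() has somewhere to write.
    out.mkdir(parents=True, exist_ok=True)

    saved = []
    for page_index in range(pdf.Pages.Count):
        page = pdf.Pages.get_Item(page_index)
        for image_index, image in enumerate(page.ExtractImages(), start=1):
            # Page and image numbers are 1-based in the file name.
            target = out / f"Page-{page_index + 1}-Image-{image_index}.png"
            image.Save(str(target), image_format)
            saved.append(target)
    return saved


# Usage with Spire.PDF (requires the package to be installed):
#   from spire.pdf import *
#   pdf = PdfDocument()
#   pdf.LoadFromFile("Sample.pdf")
#   save_page_images(pdf, "output/Images", ImageFormat.get_Png())
#   pdf.Close()
```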
Apply for a Temporary License
If you'd like to remove the evaluation message from the generated documents, or to remove the feature limitations, please request a 30-day trial license.