Convert PDF to Excel in Python

2024-01-16 01:19:47

Converting PDF to Excel means extracting tabular data from a PDF document and turning it into an editable, structured spreadsheet. This makes the data much easier to work with, since MS Excel provides advanced features for calculation and analysis.

When a large number of PDF files need to be converted to Excel, you can automate the work with batch conversion in code, which saves time and effort. This article shows how to programmatically convert PDF to Excel in Python.

Python PDF to Excel Converter

To convert PDF to Excel in Python, we will use the Spire.PDF for Python library. It lets developers work with PDF files in Python programs: it supports creating PDFs, processing existing PDF files, and converting PDF to Word, images, Excel, HTML, and more.

Install the library from PyPI with the following pip command:

pip install Spire.PDF

How to Convert PDF to Excel in Python

Before we start, let's take a look at the core classes and methods for converting PDF files to Excel using the Spire.PDF for Python library.

  • PdfDocument class: Represents a PDF document model.
  • XlsxLineLayoutOptions class: Used to specify the conversion options to control how your PDF will be converted to Excel. The XlsxLineLayoutOptions class constructor accepts the following five parameters:
    • convertToMultipleSheet (bool): Specifies whether to convert each page to a separate worksheet in the same Excel file. If set to False, all pages in a PDF file will be converted to a single Excel worksheet.
    • rotatedText (bool): Specifies whether to display rotated text.
    • splitCell (bool): Specifies whether to convert text in a PDF cell (spanning more than two lines) to one Excel cell or to multiple cells.
    • wrapText (bool): Specifies whether to wrap text in an Excel cell.
    • overlapText (bool): Specifies whether to display overlapping text.
  • PdfDocument.ConvertOptions.SetPdfToXlsxOptions() method: Applies the conversion options.
  • PdfDocument.SaveToFile(string filename, FileFormat.XLSX) method: Saves the PDF to Excel XLSX format.
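Because the five constructor parameters are all positional booleans, a call such as XlsxLineLayoutOptions(False, True, False, True, False) can be hard to read. The sketch below is one possible way to give them names; LayoutFlags and as_args are hypothetical helpers invented for this illustration and are not part of Spire.PDF. The field order simply mirrors the parameter order listed above.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class LayoutFlags:
    """Hypothetical helper (not part of Spire.PDF) that names the five
    positional booleans in the order listed in this article."""
    convert_to_multiple_sheet: bool
    rotated_text: bool
    split_cell: bool
    wrap_text: bool
    overlap_text: bool

    def as_args(self):
        # astuple() returns the field values in declaration order
        return astuple(self)


flags = LayoutFlags(
    convert_to_multiple_sheet=False,  # all pages go to one worksheet
    rotated_text=True,
    split_cell=False,
    wrap_text=True,
    overlap_text=False,
)
print(flags.as_args())
```

With Spire.PDF installed, the tuple could then be unpacked into the constructor as XlsxLineLayoutOptions(*flags.as_args()).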

The following are the main steps showing how to convert PDF to Excel in Python.

  1. Install Spire.PDF for Python.
  2. Import the required modules.
  3. Create an object of the PdfDocument class.
  4. Load a PDF file through the PdfDocument.LoadFromFile() method.
  5. If you need to set the conversion options, create an object of the XlsxLineLayoutOptions class and pass the corresponding parameters to its constructor.
  6. Apply the conversion options through the PdfDocument.ConvertOptions.SetPdfToXlsxOptions() method.
  7. Call the PdfDocument.SaveToFile() method to convert PDF to Excel.

Convert PDF to Excel XLSX in Python

It is quite easy to convert PDF to Excel using Spire.PDF for Python. We just need to load a PDF file and then save it in XLSX format. Apart from the imports, only three core lines of code (create, load, save) are required for a simple conversion from PDF to Excel in Python.

  • Python
from spire.pdf.common import *
from spire.pdf import *

inputFile = "Invoice.pdf"
outputFile = "PdfToExcel.xlsx"

# Create a PdfDocument object
pdf = PdfDocument()

# Load a PDF document
pdf.LoadFromFile(inputFile)

# Save the PDF file to Excel XLSX format
pdf.SaveToFile(outputFile, FileFormat.XLSX)
pdf.Close()
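For the batch scenario mentioned in the introduction, the same conversion can be applied to every PDF in a folder. The following is a minimal sketch rather than a definitive implementation: plan_batch and convert_folder are hypothetical helper names, and the conversion itself is passed in as a callable so the folder logic can be tried without Spire.PDF installed. It assumes the input files use a lowercase .pdf extension.

```python
from pathlib import Path


def plan_batch(input_dir, output_dir):
    """Pair each *.pdf in input_dir with an .xlsx path in output_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    return [
        (pdf_path, output_dir / (pdf_path.stem + ".xlsx"))
        for pdf_path in sorted(input_dir.glob("*.pdf"))
    ]


def convert_folder(input_dir, output_dir, convert):
    """Run convert(pdf_path, xlsx_path) for every planned pair.

    `convert` is any callable that performs a single conversion.
    Returns a list of (pdf_path, exception) for files that failed,
    so one bad PDF does not stop the whole batch.
    """
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    failures = []
    for pdf_path, xlsx_path in plan_batch(input_dir, output_dir):
        try:
            convert(pdf_path, xlsx_path)
        except Exception as exc:
            failures.append((pdf_path, exc))
    return failures
```

With Spire.PDF installed, the convert callable can wrap the PdfDocument(), LoadFromFile(), SaveToFile() and Close() calls from the example above, passing str(pdf_path) and str(xlsx_path) as the file names.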


Convert a Multi-Page PDF to One Excel Sheet in Python

In addition to the simple conversion method, Spire.PDF for Python also lets us customize the conversion through the XlsxLineLayoutOptions class. As introduced above, setting the first parameter of its constructor, convertToMultipleSheet, to False converts all pages of a multi-page PDF to a single Excel sheet.

  • Python
from spire.pdf.common import *
from spire.pdf import *

inputFile = "Invoice Details.pdf"
outputFile = "PdfToExcelwithOptions.xlsx"

# Create a PdfDocument object
pdf = PdfDocument()

# Load a PDF document
pdf.LoadFromFile(inputFile)

# Create an XlsxLineLayoutOptions object to specify the conversion options
# Parameters: convertToMultipleSheet, rotatedText, splitCell, wrapText, overlapText
pdf.ConvertOptions.SetPdfToXlsxOptions(XlsxLineLayoutOptions(False, True, False, True, False))

# Save the PDF file to Excel xlsx format
pdf.SaveToFile(outputFile, FileFormat.XLSX)
pdf.Close()
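To check that the pages really ended up in a single worksheet, the output file can be inspected with the Python standard library alone, since an XLSX file is a ZIP package whose xl/workbook.xml part lists the worksheets. This is a rough sketch under stated assumptions: sheet_names is a hypothetical helper, and it assumes the workbook uses the standard SpreadsheetML main namespace.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard SpreadsheetML namespace (assumed for the converted workbook)
MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def sheet_names(xlsx_file):
    """Return the worksheet names declared in xl/workbook.xml.

    xlsx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(xlsx_file) as archive:
        root = ET.fromstring(archive.read("xl/workbook.xml"))
    return [sheet.get("name") for sheet in root.iter(MAIN_NS + "sheet")]
```

Calling sheet_names("PdfToExcelwithOptions.xlsx") after the conversion above should return a list with one entry if convertToMultipleSheet was set to False.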


Free License for PDF to Excel Converter

To use Spire.PDF for Python for PDF to Excel conversion without any watermarks or limitations, request a 1-month free temporary license.

Conclusion

This article gives detailed steps and code examples that demonstrate how to convert PDF to Excel using Python. By using the XlsxLineLayoutOptions class of Spire.PDF for Python, we can customize the PDF to Excel conversion options to achieve the desired result, such as converting a multi-page PDF to one Excel worksheet, wrapping text in the converted Excel cells, or showing or hiding rotated text.

Feel free to explore the other PDF processing and conversion features of the Spire.PDF for Python library using the documentation. If you run into any issues while testing, reach out to our technical support team via email or the forum.
