Python: Extract Form Field Values from PDF

PDF forms are commonly used to collect user information, and extracting form values programmatically allows for automated processing of submitted data, ensuring accurate data collection and analysis. After extraction, you can generate reports based on form field values or migrate them to other systems or databases. In this article, you will learn how to extract form field values from PDF with Python using Spire.PDF for Python.

Install Spire.PDF for Python

This scenario requires Spire.PDF for Python and plum-dispatch v1.7.4. Both can be installed on Windows with the following pip command.

pip install Spire.PDF

If you are unsure how to install it, please refer to this tutorial: How to Install Spire.PDF for Python on Windows
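
To confirm that the installation succeeded before running the extraction script, a quick check along the following lines may help. This is a minimal sketch that uses only the standard library; the helper name installed_version is hypothetical, and it assumes the distribution names match the names used with pip (Spire.PDF and plum-dispatch).

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: the distribution names are the same as the pip package names.
    for package in ("Spire.PDF", "plum-dispatch"):
        version = installed_version(package)
        print(f"{package}: {version if version else 'not installed'}")
```

If either package is reported as not installed, rerun the pip command above in the same Python environment that will execute the script.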

Extract Form Field Values from PDF with Python

Spire.PDF for Python supports various types of PDF form fields, including:

  • Text box field (represented by the PdfTextBoxFieldWidget class)
  • Check box field (represented by the PdfCheckBoxWidgetFieldWidget class)
  • Radio button field (represented by the PdfRadioButtonListFieldWidget class)
  • List box field (represented by the PdfListBoxWidgetFieldWidget class)
  • Combo box field (represented by the PdfComboBoxWidgetFieldWidget class)

Before extracting data from a PDF form, you need to determine the specific type of each form field, because each field class exposes its value through different properties (for example, Text for a text box and Checked for a check box). The following are the detailed steps.

  • Initialize an instance of the PdfDocument class.
  • Load a PDF document using the PdfDocument.LoadFromFile() method.
  • Get the form in the PDF document using the PdfDocument.Form property.
  • Create a list to store the extracted form field values.
  • Iterate through all fields in the PDF form.
  • Determine the types of the form fields, then get the names and values of the form fields using the corresponding properties.
  • Write the results to a text file.
from spire.pdf.common import *
from spire.pdf import *

inputFile = "Forms.pdf"
outputFile = "GetFormFieldValues.txt"

# Create a PdfDocument instance
pdf = PdfDocument()

# Load a PDF document
pdf.LoadFromFile(inputFile)

# Get the form from the PDF document
pdfform = pdf.Form
formWidget = PdfFormWidget(pdfform)

# Create a list to store the extracted field names and values
sb = []

# Iterate through all fields in the form
if formWidget.FieldsWidget.Count > 0:
    for i in range(formWidget.FieldsWidget.Count):
        field = formWidget.FieldsWidget.get_Item(i)

        # Get the name and value of the textbox field
        if isinstance(field, PdfTextBoxFieldWidget):
            textBoxField = field if isinstance(field, PdfTextBoxFieldWidget) else None
            name = textBoxField.Name
            value = textBoxField.Text
            sb.append("Textbox Name: " + name + "\n")
            sb.append("Textbox Value: " + value + "\n")

        # Get the name of the listbox field
        if isinstance(field, PdfListBoxWidgetFieldWidget):
            listBoxField = field if isinstance(field, PdfListBoxWidgetFieldWidget) else None
            name = listBoxField.Name
            sb.append("Listbox Name: " + name + "\n")

            # Get the items of the listbox field
            sb.append("Listbox Items: \n")
            items = listBoxField.Values
            for j in range(items.Count):
                item = items.get_Item(j)
                sb.append(item.Value + "\n")

            # Get the selected item of the listbox field
            selectedValue = listBoxField.SelectedValue
            sb.append("Listbox Selected Value: " + selectedValue + "\n")

        # Get the name of the combo box field
        if isinstance(field, PdfComboBoxWidgetFieldWidget):
            comBoxField = field if isinstance(field, PdfComboBoxWidgetFieldWidget) else None
            name = comBoxField.Name
            sb.append("Combobox Name: " + name + "\n")

            # Get the items of the combo box field
            sb.append("Combobox Items: \n")
            items = comBoxField.Values
            for j in range(items.Count):
                item = items.get_Item(j)
                sb.append(item.Value + "\n")

            # Get the selected item of the combo box field
            selectedValue = comBoxField.SelectedValue
            sb.append("Combobox Selected Value: " + selectedValue + "\n")

        # Get the name and selected item of the radio button field
        if isinstance(field, PdfRadioButtonListFieldWidget):
            radioBtnField = field if isinstance(field, PdfRadioButtonListFieldWidget) else None
            name = radioBtnField.Name
            selectedValue = radioBtnField.SelectedValue
            sb.append("Radio Button Name: " + name + "\n")
            sb.append("Radio Button Selected Value: " + selectedValue + "\n")

        # Get the name and status of the checkbox field
        if isinstance(field, PdfCheckBoxWidgetFieldWidget):
            checkBoxField = field if isinstance(field, PdfCheckBoxWidgetFieldWidget) else None
            name = checkBoxField.Name
            sb.append("Checkbox Name: " + name + "\n")

            # Report the checked state as Yes or No
            state = checkBoxField.Checked
            stateValue = "Yes" if state else "No"
            sb.append("If the checkBox is checked: " + stateValue + "\n")

# Write the results to a text file
with open(outputFile, 'w', encoding='UTF-8') as f2:
    for item in sb:
        f2.write(item)

# Release the resources held by the document
pdf.Close()
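
The script above stores formatted strings in sb, which suits a plain-text report. If the goal is to migrate the values to another system or database, it may be more convenient to collect one record per field and serialize the records as CSV or JSON. The following is a minimal sketch of that idea using only the standard library; the record layout (type, name, value), the sample data, and the helper names are assumptions made for illustration and are not part of Spire.PDF.

```python
import csv
import io
import json

# Hypothetical sample records. In practice, each record would be appended
# inside the field loop, next to (or instead of) the strings added to `sb`.
records = [
    {"type": "Textbox", "name": "Name", "value": "Jane Doe"},
    {"type": "Checkbox", "name": "Subscribe", "value": "Yes"},
    {"type": "Combobox", "name": "Country", "value": "Canada"},
]


def records_to_csv(rows):
    """Serialize a list of field records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["type", "name", "value"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def records_to_json(rows):
    """Serialize a list of field records to an indented JSON array."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


csv_text = records_to_csv(records)
json_text = records_to_json(records)
print(csv_text)
print(json_text)
```

Keeping the field type in each record lets a downstream consumer tell a checkbox state from a text value without parsing the label text.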

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.