Python: Extract Tables from Word Documents

2024-05-30 01:01:30 Written by  support iceblue

Word documents often contain valuable data in tables that can be used for reporting, data analysis, and record-keeping. Manually copying these tables into other formats is time-consuming and error-prone; automating the process with Python saves time and keeps the output accurate and consistent. Spire.Doc for Python gives access to the tables in a Word document, so their data can be written to more accessible and manageable files. This article demonstrates how to use Spire.Doc for Python to extract tables from Word documents and write them to text files and Excel worksheets.

Install Spire.Doc for Python

This scenario requires Spire.Doc for Python and plum-dispatch v1.7.4. Both can be installed on Windows with the following pip command.

pip install Spire.Doc

If you are unsure how to install it, please refer to: How to Install Spire.Doc for Python on Windows
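After running the pip command, it can help to confirm that the packages are actually present in the active environment. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical, and the distribution names are taken from the install instructions above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Distribution names as given in the install instructions above
for name in ("Spire.Doc", "plum-dispatch"):
    print(name, installed_version(name) or "not installed")
```

If either line reports "not installed", rerun the pip command in the same interpreter environment that will run the extraction scripts.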

Extract Tables from Word Documents to Text Files with Python

Spire.Doc for Python offers the Section.Tables property to retrieve a collection of tables within a section of a Word document. Then, developers can use the properties and methods under the ITable class to access the data in the tables and write it into a text file. This provides a convenient solution for converting Word document tables into text files.

The detailed steps for extracting tables from Word documents to text files are as follows:

  • Create an object of Document class and load a Word document using Document.LoadFromFile() method.
  • Iterate through the sections in the document and get the table collection of each section through Section.Tables property.
  • Iterate through the tables and create a string object for each table.
  • Iterate through the rows in each table and the cells in each row, get the text of each cell through TableCell.Paragraphs[].Text property, and add the cell text to the string.
  • Save each string to a text file.
Python:
from spire.doc import *
from spire.doc.common import *

# Create an instance of Document
doc = Document()

# Load a Word document
doc.LoadFromFile("Sample.docx")

# Loop through the sections
for s in range(doc.Sections.Count):
    # Get a section
    section = doc.Sections.get_Item(s)
    # Get the tables in the section
    tables = section.Tables
    # Loop through the tables
    for i in range(0, tables.Count):
        # Get a table
        table = tables.get_Item(i)
        # Initialize a string to store the table data
        tableData = ''
        # Loop through the rows of the table
        for j in range(0, table.Rows.Count):
            # Loop through the cells of the row
            for k in range(0, table.Rows.get_Item(j).Cells.Count):
                # Get a cell
                cell = table.Rows.get_Item(j).Cells.get_Item(k)
                # Get the text in the cell
                cellText = ''
                for para in range(cell.Paragraphs.Count):
                    paragraphText = cell.Paragraphs.get_Item(para).Text
                    cellText += (paragraphText + ' ')
                # Add the text to the string
                tableData += cellText
                if k < table.Rows.get_Item(j).Cells.Count - 1:
                    tableData += '\t'
            # Add a new line
            tableData += '\n'
    
        # Save the table data to a text file (the output/Tables folder must already exist)
        with open(f'output/Tables/WordTable_{s+1}_{i+1}.txt', 'w', encoding='utf-8') as f:
            f.write(tableData)
doc.Close()
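The string-building part of the loop above can be isolated and tested without a Word document. The sketch below is a simplified stand-in, not part of Spire.Doc: table_to_text is a hypothetical helper, and rows is assumed to be a nested list of cell strings already read from a table. It also collapses whitespace inside each cell, which removes the trailing space left by joining paragraphs and keeps stray tabs or line breaks in a cell from breaking the tab-delimited layout.

```python
import csv
import io


def table_to_text(rows):
    """Join cells with tabs and rows with newlines, like the extraction loop."""
    lines = []
    for row in rows:
        # Collapse whitespace runs so embedded tabs/newlines cannot split a cell
        cells = [" ".join(cell.split()) for cell in row]
        lines.append("\t".join(cells))
    return ("\n".join(lines) + "\n") if lines else ""


# Hypothetical cell values, including the trailing spaces the loop produces
rows = [["Name", "Qty "], ["Widget A ", "3 "]]
text = table_to_text(rows)

# The result can be read back with the standard csv module
parsed = list(csv.reader(io.StringIO(text), delimiter="\t"))
print(parsed)
```

Because the output is plain tab-delimited text, any tool that reads TSV files should be able to load the saved tables.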

Extract Tables from Word Documents to Excel Workbooks with Python

Developers can also utilize Spire.Doc for Python to retrieve table data and then use Spire.XLS for Python to write the table data into an Excel worksheet, thereby enabling the conversion of Word document tables into Excel workbooks.

Install Spire.XLS for Python via PyPI:

pip install Spire.XLS

The detailed steps for extracting tables from Word documents to Excel workbooks are as follows:

  • Create an object of Document class and load a Word document using Document.LoadFromFile() method.
  • Create an object of Workbook class and clear the default worksheets using Workbook.Worksheets.Clear() method.
  • Iterate through the sections in the document and get the table collection of each section through Section.Tables property.
  • Iterate through the tables and create a worksheet for each table using Workbook.Worksheets.Add() method.
  • Iterate through the rows in each table and the cells in each row, get the text of each cell through TableCell.Paragraphs[].Text property, and write the text to the worksheet using Worksheet.SetCellValue() method.
  • Save the workbook using Workbook.SaveToFile() method.
Python:
from spire.doc import *
from spire.doc.common import *
from spire.xls import *
from spire.xls.common import *

# Create an instance of Document
doc = Document()

# Load a Word document
doc.LoadFromFile('Sample.docx')

# Create an instance of Workbook
wb = Workbook()
wb.Worksheets.Clear()

# Loop through sections in the document
for i in range(doc.Sections.Count):
    # Get a section
    section = doc.Sections.get_Item(i)
    # Loop through tables in the section
    for j in range(section.Tables.Count):
        # Get a table
        table = section.Tables.get_Item(j)
        # Create a worksheet
        ws = wb.Worksheets.Add(f'Table_{i+1}_{j+1}')
        # Write the table to the worksheet
        for row in range(table.Rows.Count):
            # Get a row
            tableRow = table.Rows.get_Item(row)
            # Loop through cells in the row
            for cell in range(tableRow.Cells.Count):
                # Get a cell
                tableCell = tableRow.Cells.get_Item(cell)
                # Get the text in the cell
                cellText = ''
                for p in range(tableCell.Paragraphs.Count):
                    # Use a separate index name so the loop variable is not overwritten
                    paragraph = tableCell.Paragraphs.get_Item(p)
                    cellText = cellText + (paragraph.Text + ' ')
                # Write the cell text to the worksheet
                ws.SetCellValue(row + 1, cell + 1, cellText)

# Save the workbook
wb.SaveToFile('output/Tables/WordTableToExcel.xlsx', FileFormat.Version2016)
doc.Close()
wb.Dispose()
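The script above adds 1 to the row and cell loop indices when calling SetCellValue, which shifts the zero-based table positions to one-based worksheet coordinates. When checking the result in Excel, it can be handy to translate a table position into the familiar A1-style reference. The helper below, to_a1, is a hypothetical name introduced for illustration and does not depend on Spire.XLS.

```python
def to_a1(row_index, col_index):
    """Convert zero-based (row, column) loop indices to an A1-style cell reference."""
    if row_index < 0 or col_index < 0:
        raise ValueError("indices must be non-negative")
    col = col_index + 1  # one-based column number, as passed to the worksheet
    letters = ""
    while col > 0:
        # Column letters form a bijective base-26 numbering (A..Z, AA..AZ, ...)
        col, remainder = divmod(col - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return f"{letters}{row_index + 1}"


print(to_a1(0, 0))   # first cell of a table
print(to_a1(2, 27))  # third row, twenty-eighth column
```

For example, the first cell of each table lands in A1 of its worksheet, and the cell at loop indices (2, 27) lands in AB3.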

Apply for a Temporary License

To remove the evaluation message from the generated documents or to lift the functional limitations, please request a 30-day trial license.

Last modified on Thursday, 30 May 2024 01:11