
Wed Feb 07, 2024 4:02 pm

I am trying to use Python for the following.
I can already read a PDF and convert it to XLS, but I also want to read the PDF and do the following:
1. Get the position of the cells where certain text is found
2. Extract the tables in a PDF

I found examples in C# and Java but not in Python.
Can you please help me with this?

abhijit.deshpande
 
Posts: 4
Joined: Wed Feb 07, 2024 3:27 pm

Thu Feb 08, 2024 2:18 am

Hello,

Thank you for your inquiry.
Spire.PDF can return the location of specified text on a page, but not the table cell that contains it. Please refer to the following code to get the text position information:
Code:
from typing import List

from spire.pdf.common import *
from spire.pdf import *

inputFile = "input.pdf"

def AppendAllText(fname: str, text: List[str]):
    # Write each entry on its own line (mode "w" overwrites an existing file)
    with open(fname, "w", encoding="utf-8") as fp:
        for s in text:
            fp.write(s + "\n")

pdf = PdfDocument()
pdf.LoadFromFile(inputFile)
builder = []
for i in range(pdf.Pages.Count):
    page = pdf.Pages.get_Item(i)
    # Find every occurrence of the target text on this page
    result = page.FindText("certain text", TextFindParameter.none).Finds
    for find in result:
        # AppendAllText adds the line break, so none is appended here
        builder.append(find.Position.ToString())

fileName = "Extraction.txt"
AppendAllText(fileName, builder)
pdf.Close()
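
The loop above writes positions only, so matches from different pages cannot be told apart in the output file. As a rough sketch, each match could be formatted together with its page number. This assumes the position object exposes numeric X and Y values that you pass in yourself; `format_match` and `format_matches` are hypothetical helpers for illustration, not part of Spire.PDF, and the sample data is made up.

```python
from typing import List, Tuple


def format_match(page_index: int, x: float, y: float) -> str:
    """Return one readable line for a match; pages are shown 1-based."""
    return f"page {page_index + 1}: x={x:.2f}, y={y:.2f}"


def format_matches(matches: List[Tuple[int, float, float]]) -> List[str]:
    """Format a list of (page_index, x, y) tuples, keeping their order."""
    return [format_match(p, x, y) for p, x, y in matches]


# Stand-in data; in the loop above you would collect
# (i, <x of find.Position>, <y of find.Position>) instead (assumed attributes).
sample = [(0, 72.0, 144.5), (2, 300.25, 90.0)]
lines = format_matches(sample)
print("\n".join(lines))
```

The resulting list of strings can be passed to AppendAllText unchanged, since that function already adds one line break per entry.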

Spire.PDF for Python does not currently support extracting text from tables in PDF documents, but this feature is on our upgrade list. I will keep you informed once it is implemented.

Sincerely,
Annika
E-iceblue support team

Annika.Zhou
 
Posts: 1657
Joined: Wed Apr 07, 2021 2:50 am

Thu Feb 08, 2024 7:16 am

Thank you. Noted on the text extraction.
Will wait for the table extraction update.

abhijit.deshpande
 

Thu Feb 08, 2024 8:44 am

Hello,

You're welcome.
Please be assured that we will let you know as soon as the feature to extract table text is implemented.
Have a nice day.

Sincerely,
Annika
E-iceblue support team

Annika.Zhou
 

Sat May 11, 2024 5:39 am

Hello,

Thank you for your patience.
We have just released Spire.PDF for Python version 10.5.2, which supports extracting table text from PDF files. Please download this version and refer to the sample code below to test it.
Code:
from typing import List

from spire.pdf.common import *
from spire.pdf import *

def AppendAllText(fname: str, text: List[str]):
    # Write each entry on its own line (mode "w" overwrites an existing file)
    with open(fname, "w", encoding="utf-8") as fp:
        for s in text:
            fp.write(s + "\n")

doc = PdfDocument()
doc.LoadFromFile("input.pdf")

builder = []

extractor = PdfTableExtractor(doc)
for i in range(doc.Pages.Count):
    # Extract all tables found on page i
    tableLists = extractor.ExtractTable(i)
    if tableLists is not None and len(tableLists) > 0:
        for table in tableLists:
            row = table.Rows
            column = table.Columns

            for j in range(row):
                # Join the cells of one table row into a single output line
                cells = [table.GetText(j, k) for k in range(column)]
                builder.append(" ".join(cells))

fileName = "Extraction.txt"
AppendAllText(fileName, builder)
doc.Close()
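
Regarding the first question in this thread (finding the cell that holds certain text): once the cell texts are available through table.GetText(j, k), one possible approach is to keep them in a list of rows and search that grid for the row and column indices. The following is only a sketch under that assumption. `find_cells` is a hypothetical helper, not a Spire.PDF API, and the grid below is made-up stand-in data rather than real extraction output.

```python
from typing import List, Tuple


def find_cells(grid: List[List[str]], needle: str) -> List[Tuple[int, int]]:
    """Return (row, column) indices of cells whose text contains needle.

    The comparison is case-insensitive and matches substrings.
    """
    needle = needle.lower()
    return [
        (r, c)
        for r, row in enumerate(grid)
        for c, cell in enumerate(row)
        if needle in cell.lower()
    ]


# Stand-in for a grid filled from table.GetText(j, k) in the loop above.
grid = [
    ["Item", "Qty", "Price"],
    ["Widget", "2", "9.99"],
    ["Widget XL", "1", "19.99"],
]
hits = find_cells(grid, "widget")
print(hits)
```

Note that this returns table indices (row and column numbers), not page coordinates. Because the match is a substring test, searching for "9.99" would also hit the cell containing "19.99"; compare with `==` instead if exact cell matches are needed.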


Sincerely,
Annika
E-iceblue support team

Annika.Zhou
 
