Extract Tables and Find Text using Python

Wed Feb 07, 2024 4:02 pm

I am trying to use Python to work with PDF files.
I can already read a PDF and convert it to XLS, but I also want to read the PDF and do the following:
1. Get position of cells where it finds certain text
2. Extract tables in a PDF

I found examples in C# and Java but not in Python.
Could you please help me with this?

Thu Feb 08, 2024 2:18 am

Hello,

Thank you for your inquiry.
Spire.PDF can return the position of a specified piece of text on the page, but it cannot identify the table cell that contains that text. Please refer to the following code to get the text position information:

Code:

    from typing import List

    from spire.pdf.common import *
    from spire.pdf import *

    inputFile = "input.pdf"

    def AppendAllText(fname: str, text: List[str]):
        # Write each entry on its own line, encoded as UTF-8.
        with open(fname, "w", encoding="utf-8") as fp:
            for s in text:
                fp.write(s + "\n")

    pdf = PdfDocument()
    pdf.LoadFromFile(inputFile)

    builder = []
    for i in range(pdf.Pages.Count):
        page = pdf.Pages.get_Item(i)
        # Search this page and record the position of every match.
        result = page.FindText("certain text", TextFindParameter.none).Finds
        for find in result:
            builder.append(find.Position.ToString())

    AppendAllText("Extraction.txt", builder)
    pdf.Close()

In addition, Spire.PDF for Python does not currently support extracting text from tables in PDF documents, but this feature is on our upgrade list. I will keep you informed once it is implemented.

Sincerely,
Annika
E-iceblue support team

Thu Feb 08, 2024 7:16 am

Thank you. Noted on the text extraction.
Will wait for the table extraction update.

Thu Feb 08, 2024 8:44 am

Hello,

You're welcome.
Please be assured that we will let you know as soon as the feature to extract table text is implemented.
Have a nice day.

Sincerely,
Annika
E-iceblue support team

Sat May 11, 2024 5:39 am

Hello,

Thank you for your patience.
We have just released Spire.PDF for Python version 10.5.2, which supports extracting table text from PDF files. Please download this version and refer to the sample code below to test it.

Code:

    from typing import List

    from spire.pdf.common import *
    from spire.pdf import *

    def AppendAllText(fname: str, text: List[str]):
        # Write each entry on its own line, encoded as UTF-8.
        with open(fname, "w", encoding="utf-8") as fp:
            for s in text:
                fp.write(s + "\n")

    doc = PdfDocument()
    doc.LoadFromFile("input.pdf")

    builder = []
    extractor = PdfTableExtractor(doc)
    for i in range(doc.Pages.Count):
        # Extract every table found on page i.
        tableLists = extractor.ExtractTable(i)
        if tableLists is not None and len(tableLists) > 0:
            for table in tableLists:
                row = table.Rows
                column = table.Columns
                for j in range(row):
                    # Join the cells of one table row into a single output line.
                    cells = [table.GetText(j, k) for k in range(column)]
                    builder.append(" ".join(cells))

    AppendAllText("Extraction.txt", builder)
    doc.Close()
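
With table text available, the original request to find the cells that contain certain text could be approximated by searching the extracted cells and reporting their row and column indices within the table (not their coordinates on the page). The sketch below is only an illustration under assumptions: `FakeTable` is a hypothetical stand-in that mimics the `Rows`, `Columns` and `GetText` members used in the sample above so the snippet runs without a PDF, and `table_to_rows` and `find_cells` are helper names invented here, not part of Spire.PDF.

```python
from typing import List, Tuple


class FakeTable:
    """Hypothetical stand-in exposing the members the sample above uses."""

    def __init__(self, cells: List[List[str]]):
        self._cells = cells
        self.Rows = len(cells)
        self.Columns = len(cells[0]) if cells else 0

    def GetText(self, row: int, column: int) -> str:
        return self._cells[row][column]


def table_to_rows(table) -> List[List[str]]:
    # Copy every cell into a plain list of rows.
    return [
        [table.GetText(r, c) for c in range(table.Columns)]
        for r in range(table.Rows)
    ]


def find_cells(rows: List[List[str]], needle: str) -> List[Tuple[int, int]]:
    # Case-insensitive substring match; returns (row, column) index pairs.
    needle = needle.lower()
    return [
        (r, c)
        for r, row in enumerate(rows)
        for c, text in enumerate(row)
        if needle in (text or "").lower()
    ]


demo = FakeTable([["Item", "Price"], ["Apple", "1.20"], ["Pineapple", "3.50"]])
rows = table_to_rows(demo)
matches = find_cells(rows, "apple")
print(matches)
```

In real use, each `table` returned by `extractor.ExtractTable(i)` would be passed to `table_to_rows` in place of `demo`, assuming it exposes the same members as in the sample.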

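Since the goal in the first post was to get the data into a spreadsheet, the extracted cells could also be written as CSV rather than as plain text. This is a minimal sketch using only the Python standard library; `rows_to_csv` is a hypothetical helper name, and it assumes the cell text has already been collected into a list of rows (one list of strings per table row).

```python
import csv
import io
from typing import Iterable, List, Optional


def rows_to_csv(rows: Iterable[List[Optional[str]]]) -> str:
    # Serialize rows as CSV text; the csv module quotes cells when needed.
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if cell is None else cell.strip() for cell in row])
    return buffer.getvalue()


sample_rows = [["Item", "Price"], ["Apple, red", "1.20"]]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

The resulting text can be saved with a `.csv` extension and opened in a spreadsheet application. Cells containing commas are quoted automatically, which a space-separated text file would not handle.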
Sincerely,
Annika
E-iceblue support team