Spire.PDF is a professional PDF library applied to creating, writing, editing, handling and reading PDF files without any external dependencies. Get free and professional technical support for Spire.PDF for .NET, Java, Android, C++, Python.

Mon Dec 16, 2019 9:48 am

Hello,

As part of our workflow we modify almost every page in a ~6000-page PDF.
When profiling, I noticed that a lot of time is spent building PdfPageBase objects.

We have two steps:
1) Use PdfDocument.MergeFiles to merge about 6000 single-page PDF files (this is actually very fast, under 2 minutes!)
2) Do something with each page, for example print the page size, using the indexer: doc.Pages[i]

The first 1000 pages take about 20 seconds to access, and the second 1000 pages take almost a minute.
Each subsequent 1000 pages takes longer still.

Is there a faster way to build all page objects?

Regards,
Philipp Lange

Philipp Lange
 
Posts: 11
Joined: Fri Jun 14, 2019 9:24 am

Mon Dec 16, 2019 10:06 am

Hi,

Thanks for your inquiry.
To help us investigate your issue accurately, please provide the following information.
1. Your input PDF files.
2. The complete code you were using which could reproduce your issue directly.
3. The OS and Region information, e.g. Win7 64bit, China/Chinese.
4. The RAM information, such as 8GB.

You can upload them here or email them to us at support@e-iceblue.com.

Best wishes,
Amber
E-iceblue support team

Amber.Gu
 
Posts: 525
Joined: Tue Jun 04, 2019 3:16 am

Mon Dec 16, 2019 12:19 pm

Hello,

1) Any input PDFs can replicate the behaviour; I attached an example that can be used.
2) Here is a code snippet that replicates the behaviour. The "inputPdf" constant may need to be changed for your local filesystem.
Code:
using System;
using System.Linq;
using Spire.Pdf;

const string inputPdf = @"simple.pdf";
const int batches = 20;
const int batchSize = 300;
var pages = new PdfPageBase[batches * batchSize];

// Merge 6000 copies of the same input file into one document.
Console.WriteLine("graphical merge");
var doc = PdfDocument.MergeFiles(Enumerable.Repeat(inputPdf, batches * batchSize).ToArray());

// Access every page through the indexer and time each batch of 300 pages.
Console.WriteLine("starting first batch");
for (int batch = 0; batch < batches; batch++)
{
    var start = DateTime.UtcNow;
    for (int i = 0; i < batchSize; i++)
        pages[batch * batchSize + i] = doc.Pages[batch * batchSize + i];
    Console.WriteLine($"batch #{batch + 1}: {(DateTime.UtcNow - start).TotalSeconds:0.00s}");
}

Console.ReadKey();


When running the code on my machine (compiled as Debug/x64) I got this partial output:

Code:
graphical merge
starting first batch
batch #1: 2.12s
batch #2: 6.25s
batch #3: 10.19s
batch #4: 14.22s


This means that later pages are getting built much more slowly than earlier ones.

3)
- Windows 10 Enterprise (10.0.17134 Build 17134), 64-bit
- Region setting "Poland", Language setting "English (United States)"
4) 32GB
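
To put a number on the trend, the four batch timings from the partial output above can be compared pairwise. The sketch below uses only those four values; the projection to 20 batches is an extrapolation that assumes the trend stays linear, not a measurement.

```csharp
using System;
using System.Linq;

// Batch timings in seconds, copied from the partial output above.
double[] seconds = { 2.12, 6.25, 10.19, 14.22 };

// Difference between consecutive batches. A roughly constant step means the
// cost of a batch grows linearly with its position in the document.
double[] steps = seconds.Zip(seconds.Skip(1), (a, b) => b - a).ToArray();
double meanStep = steps.Average();
Console.WriteLine($"mean increase per batch: {meanStep:0.00}s");

// Extrapolation (assumption): if the same step held for all 20 batches,
// the total would be the sum of an arithmetic series.
const int batches = 20;
double projectedTotal = Enumerable.Range(0, batches)
                                  .Sum(k => seconds[0] + k * meanStep);
Console.WriteLine($"projected total for {batches} batches: {projectedTotal:0}s");
```

Each batch of 300 pages costs roughly 4 seconds more than the one before it, so the cost of a batch grows about linearly with its position and the total time grows about quadratically with the page count.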

I hope this is enough information.

Regards,
Philipp Lange

Philipp Lange
 
Posts: 11
Joined: Fri Jun 14, 2019 9:24 am

Tue Dec 17, 2019 2:54 am

Hi Philipp,

Thanks for your information.
I tested your file and did notice the issue you mentioned. I have logged it in our bug tracking system. Once there is any progress, we will inform you ASAP.
Sorry for the inconvenience caused.

Best wishes,
Amber
E-iceblue support team

Amber.Gu
 
Posts: 525
Joined: Tue Jun 04, 2019 3:16 am

Tue Dec 06, 2022 4:03 pm

Hello

I have an issue similar to the original poster's.

I have to convert a huuuge PDF (1GB, 7'500 pages) to images.
When I do this with a normal loop, the first 109 pages take 0.5 seconds per page.
So this would mean that for 7'500 pages it takes roughly an hour, which would be acceptable, considering the size of the PDF as well.

But unfortunately, the time increases immensely afterwards and it takes 8 seconds per page, leading to 12 hours for 7'500 pages.

I have created a test project using Spire.PDF for .Net V8.11.10 where I can reproduce this behavior.
For that, I first create a PDF with 7000 identical pages.
By measuring the time it takes per 100 pages, I can see that the later pages take longer to convert to images.
It has nothing to do with memory problems, because it doesn't matter if I first convert pages 0-100 and then 600-700 or first 600-700 and then 0-100.
The pages from 600-700 always take way longer than the ones from 0-100.
I also don't store the images; in a real scenario I would of course have to write them to files.

What I noticed is the following curious thing:

If I split the PDF into multiple smaller ones, then splitting pages 0-100 and 600-700 takes the same amount of time!
And then converting the smaller PDF to images also takes consistent time.
So, for a large PDF, it is actually faster to:

1. Read the PDF
2. Split it into multiple smaller ones
3. Read each small PDF and convert all pages into single images.
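
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the test project: the file name and chunk size are hypothetical, the end index passed to InsertPageRange is assumed to be inclusive, and SaveAsImage is only assumed to return a disposable object, since its exact return type can differ between Spire.PDF builds.

```csharp
using System;
using System.IO;
using Spire.Pdf;

const string inputPdf = "large.pdf"; // hypothetical input path
const int chunkSize = 100;           // hypothetical number of pages per small PDF

// Step 1: read the large PDF.
var source = new PdfDocument();
source.LoadFromFile(inputPdf);
int pageCount = source.Pages.Count;

for (int first = 0; first < pageCount; first += chunkSize)
{
    // Assumption: InsertPageRange treats the end index as inclusive.
    int last = Math.Min(first + chunkSize, pageCount) - 1;

    // Step 2: copy one page range into a small document and save it.
    string chunkPath = Path.Combine(Path.GetTempPath(), $"chunk_{first}.pdf");
    var chunk = new PdfDocument();
    chunk.InsertPageRange(source, first, last);
    chunk.SaveToFile(chunkPath);
    chunk.Close();

    // Step 3: reload the small document and convert each of its pages.
    var small = new PdfDocument();
    small.LoadFromFile(chunkPath);
    for (int i = 0; i < small.Pages.Count; i++)
    {
        // 'using var' relies only on the result being IDisposable.
        using var image = small.SaveAsImage(i);
        // In a real scenario the image would be written to a file here.
    }
    small.Close();
    File.Delete(chunkPath);
}

source.Close();
```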

It seems to me that PdfDocument.InsertPageRange accesses pages in a more performant way than PdfDocument.SaveAsImage does!

Could you please try to solve this issue so that it doesn't matter if the image is from the first or the 1134th page?
Because right now it's really ridiculously slow, as you can see from my screenshots in the attachments.

P.S: I have done the tests with a license applied. However, I've not included it in this test project, for obvious reasons ;)

RicoScheller
 
Posts: 37
Joined: Tue Jul 02, 2019 10:34 am

Wed Dec 07, 2022 9:59 am

Hello,

Thank you for describing your issue in such detail.
According to the test project you provided, I did notice your issue.
When you work directly on a document with more than 7000 pages, available memory keeps shrinking, which triggers garbage collection more and more often. During a collection, the garbage collection thread preempts resources and takes time, which degrades program performance. This is one reason the conversion time grows.

In addition, I have logged this issue in our bug tracking system with the ticket number SPIREPDF-5659. Our development team will investigate whether we can reduce the difference in conversion time between pages. Once there are any updates, I will inform you. Sorry for the inconvenience caused.

Sincerely
Abel
E-iceblue support team

Abel.He
 
Posts: 1010
Joined: Tue Mar 08, 2022 2:02 am

Wed Dec 07, 2022 2:06 pm

No, memory is not an issue.

As I said, I'm not storing the images returned from the method!

You can see in the attached screenshot that this project uses only 200 MB memory through the whole process.
And you can see neither my CPU nor my memory is fully used.

So, pages taking longer and longer has nothing to do with CPU performance or memory consumption!
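
For anyone who wants to check this on their own machine, one option is to log the garbage collection count and managed heap size next to each batch time. The sketch below uses only standard .NET calls; ProcessBatch is a hypothetical stand-in for the real per-batch work (for example, converting 100 pages to images).

```csharp
using System;
using System.Diagnostics;

// Hypothetical stand-in for the real per-batch work.
static void ProcessBatch(int batch)
{
    var buffer = new byte[1024 * 1024];
    buffer[0] = (byte)batch;
}

for (int batch = 0; batch < 5; batch++)
{
    int gen2Before = GC.CollectionCount(2);
    var watch = Stopwatch.StartNew();

    ProcessBatch(batch);

    watch.Stop();
    int gen2During = GC.CollectionCount(2) - gen2Before;
    long managedMb = GC.GetTotalMemory(forceFullCollection: false) / (1024 * 1024);

    // If the batch time climbs while the gen 2 count and heap size stay flat,
    // garbage collection is unlikely to be the cause of the slowdown.
    Console.WriteLine(
        $"batch {batch + 1}: {watch.Elapsed.TotalSeconds:0.00}s, " +
        $"gen2 GCs: {gen2During}, managed heap: {managedMb} MB");
}
```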

RicoScheller
 
Posts: 37
Joined: Tue Jul 02, 2019 10:34 am

Thu Dec 08, 2022 1:54 am

Hello,

Thanks for your feedback.
I did notice that pages at the end of large documents take longer to convert to images than pages at the front. I have reported this to our development team under issue number SPIREPDF-5659, and they will investigate further. Once there are any updates, I’ll inform you.

Sincerely
Abel
E-iceblue support team

Abel.He
 
Posts: 1010
Joined: Tue Mar 08, 2022 2:02 am

Fri Feb 03, 2023 9:33 am

Hello,

Thanks for your patience!
Glad to inform you that we just released Spire.PDF 9.2.2, which fixes SPIREPDF-5659.
Please download the new version from the following links to test.

Website download link: https://www.e-iceblue.cn/Downloads/Spire-PDF-NET.html
Nuget download link: https://www.nuget.org/packages/Spire.Pdf/9.2.2

Sincerely
Abel
E-iceblue support team

Abel.He
 
Posts: 1010
Joined: Tue Mar 08, 2022 2:02 am

Fri Feb 03, 2023 8:14 pm

Wow, that's great, thanks!

I can confirm that this works and helps tremendously.

RicoScheller
 
Posts: 37
Joined: Tue Jul 02, 2019 10:34 am

Mon Feb 06, 2023 2:43 am

Hello,

Thanks for your feedback.
If you have any issue in the future, just feel free to contact us.

Sincerely
Abel
E-iceblue support team

Abel.He
 
Posts: 1010
Joined: Tue Mar 08, 2022 2:02 am

Fri Mar 29, 2024 9:47 am

Hello Philipp,

Thanks for your patience!
Glad to inform you that we just released Spire.PDF Pack (Hot Fix) Version 10.3.16, which fixes the long construction time for PDF pages.
Please download the new version from the following links to test.

Website download link: https://www.e-iceblue.cn/Downloads/Spire-PDF-NET.html
Nuget download link: https://www.nuget.org/packages/Spire.Pdf/10.3.16

Sincerely
William
E-iceblue support team

William.Zhang
 
Posts: 454
Joined: Mon Dec 27, 2021 2:23 am
