Displaying items by tag: pdf net Extract Read

Thursday, 12 December 2013 07:44

Extract Image from PDF and Save it to New PDF File in C#

Extracting image from PDF is not a complex operation, but sometimes we need to save the image to new PDF file. So the article describes the simple c# code to extract image from PDF and save it to new PDF file through a professional PDF .NET library Spire.PDF.

First we need to complete the preparatory work before the procedure:

Download the Spire.PDF and install it on your machine.
Add the Spire.PDF.dll files as reference.
Open bin folder and select the three dll files under .NET 4.0.
Right click property and select properties in its menu.
Set the target framework as .NET 4.
Add Spire.PDF as namespace.

The following steps will show you how to do this with ease:

Step 1: Create a PDF document.

[C#]

PdfDocument doc = new PdfDocument();
doc.LoadFromFile(@"..\..\Sample.pdf");

Step 2: Extract the image from PDF document.

[C#]

doc.ExtractImages();

Step 3: Save image file.

[C#]

image.Save("image.png",System.Drawing.Imaging.ImageFormat.Png);
PdfImage image2 = PdfImage.FromFile(@"image.png");
PdfDocument doc2=new PdfDocument ();
PdfPageBase page=doc2.Pages.Add();

Step 4: Draw image to new PDF file.

[C#]

float width = image.Width * 0.75f;
float height = image.Height * 0.75f;
float x = (page.Canvas.ClientSize.Width - width) / 2;
page.Canvas.DrawImage(image2,x,60,width,height);
doc2.SaveToFile(@"..\..\Image.pdf");

Here is the whole code:

[C#]

static void Main(string[] args)
        {
            PdfDocument doc = new PdfDocument();
            doc.LoadFromFile(@"..\..\Sample.pdf");
            doc.ExtractImages();
            image.Save("image.png",System.Drawing.Imaging.ImageFormat.Png);
            PdfImage image2 = PdfImage.FromFile(@"image.png");
            PdfDocument doc2=new PdfDocument ();
            PdfPageBase page=doc2.Pages.Add();
            float width = image.Width * 0.75f;
            float height = image.Height * 0.75f;
            float x = (page.Canvas.ClientSize.Width - width) / 2;
            page.Canvas.DrawImage(image2,x,60,width,height);
            doc2.SaveToFile(@"..\..\Image.pdf");
        }

Here comes to the preview of the effect picture:

extract image from pdf

draw image to new pdf

Published in Extract/Read

C#: Extract Images from PDF Documents

Extracting images from PDFs is a common task for many users, whether it's for repurposing visuals in a presentation, archiving important graphics, or facilitating easier analysis. By mastering image extraction using C#, developers can enhance resource management and streamline their workflow.

In this article, you will learn how to extract images from individual PDF pages as well as from entire documents using C# and Spire.PDF for .NET.

Extract Images from a Specific PDF Page
Extract All Images from an Entire PDF Document

Install Spire.PDF for .NET

To begin with, you need to add the DLL files included in the Spire.PDF for.NET package as references in your .NET project. The DLL files can be either downloaded from this link or installed via NuGet.

Package Manager

PM> Install-Package Spire.PDF

Extract Images from a Specific PDF Page

The PdfImageHelper class in Spire.PDF for .NET is designed to help users manage images within PDF documents. It allows for various operations, such as deleting, replacing, and retrieving images.

To obtain information about the images on a specific PDF page, developers can utilize the PdfImageHelper.GetImagesInfo(PdfPageBase page) method. Once they have the image information, they can save the images to files using the PdfImageInfo.Image.Save() method.

The steps to extract images from a specific PDF page are as follows:

Create a PdfDocument object.
Load a PDF file using PdfDocument.LoadFromFile() method.
Get a specific page using PdfDocument.Pages[index] property.
Create a PdfImageHelper object.
Get the image information collection from the page using PdfImageHelper.GetImagesInfo() method.
Iterate through the image information collection and save each instance as a PNG file using PdfImageInfo.Image.Save() method.

C#

using Spire.Pdf;
using Spire.Pdf.Utilities;
using System.Drawing;

namespace ExtractImagesFromSpecificPage
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a PdfDocument object
            PdfDocument doc = new PdfDocument();

            // Load a PDF document
            doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pdf");

            // Get a specific page
            PdfPageBase page = doc.Pages[0];

            // Create a PdfImageHelper object
            PdfImageHelper imageHelper = new PdfImageHelper();

            // Get all image information from the page 
            PdfImageInfo[] imageInfos = imageHelper.GetImagesInfo(page);

            // Iterate through the image information
            for (int i = 0; i < imageInfos.Length; i++)
            {
                // Get a specific image information
                PdfImageInfo imageInfo = imageInfos[i];

                // Get the image
                Image image = imageInfo.Image;

                // Save the image to a png file
                image.Save("C:\\Users\\Administrator\\Desktop\\Extracted\\Image-" + i + ".png");
            }

            // Dispose resources
            doc.Dispose();
        }
    }
}

Extract images from a specific PDF page

Extract All Images from an Entire PDF Document

Now that you know how to extract images from a specific page, you can iterate through the pages in a PDF document and extract images from each page. This allows you to collect all the images contained in the document.

The steps to extract all images throughout an entire PDF document are as follows:

Create a PdfDocument object.
Load a PDF file using PdfDocument.LoadFromFile() method.
Create a PdfImageHelper object.
Iterate through the pages in the document.
- Get a specific page using PdfDocument.Pages[index] property.
- Get the image information collection from the page using PdfImageHelper.GetImagesInfo() method.
- Iterate through the image information collection and save each instance as a PNG file using PdfImageInfo.Image.Save() method.

C#

using Spire.Pdf;
using Spire.Pdf.Utilities;
using System.Drawing;

namespace ExtractAllImages
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a PdfDocument object
            PdfDocument doc = new PdfDocument();

            // Load a PDF document
            doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pdf");

            // Create a PdfImageHelper object
            PdfImageHelper imageHelper = new PdfImageHelper();

            // Declare an int variable
            int m = 0;

            // Iterate through the pages
            for (int i = 0; i < doc.Pages.Count; i++)
            {
                // Get a specific page
                PdfPageBase page = doc.Pages[i];

                // Get all image information from the page 
                PdfImageInfo[] imageInfos = imageHelper.GetImagesInfo(page);

                // Iterate through the image information
                for (int j = 0; j < imageInfos.Length; j++)
                {
                    // Get a specific image information
                    PdfImageInfo imageInfo = imageInfos[j];

                    // Get the image
                    Image image = imageInfo.Image;

                    // Save the image to a png file
                    image.Save("C:\\Users\\Administrator\\Desktop\\Extracted\\Image-" + m + ".png");
                    m++;
                }

            }

            // Dispose resources
            doc.Dispose();
        }
    }
}

Extract all images from an entire PDF document

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.

Published in Extract/Read

C#/VB.NET: Extract Text from PDF Documents

PDF documents are fixed in layout and do not allow users to perform modifications in them. To make the PDF content editable again, you can convert PDF to Word or extract text from PDF. In this article, you will learn how to extract text from a specific PDF page, how to extract text from a particular rectangle area, and how to extract text by SimpleTextExtractionStrategy in C# and VB.NET using Spire.PDF for .NET.

Extract Text from a Specified Page
Extract Text from a Rectangle
Extract Text using SimpleTextExtractionStrategy

Install Spire.PDF for .NET

To begin with, you need to add the DLL files included in the Spire.PDF for.NET package as references in your .NET project. The DLL files can be either downloaded from this link or installed via NuGet.

Package Manager

PM> Install-Package Spire.PDF

Extract Text from a Specified Page

The following are the steps to extract text from a certain page of a PDF document using Spire.PDF for .NET.

Create a PdfDocument object.
Load a PDF file using PdfDocument.LoadFromFile() method.
Get the specific page through PdfDocument.Pages[index] property.
Create a PdfTextExtractor object.
Create a PdfTextExtractOptions object, and set the IsExtractAllText property to true.
Extract text from the selected page using PdfTextExtractor.ExtractText() method.
Write the extracted text to a TXT file.

C#
VB.NET

using System;
using System.IO;
using Spire.Pdf;
using Spire.Pdf.Texts;

namespace ExtractTextFromPage
{
    class Program
    {
        static void Main(string[] args)
        {
            //Create a PdfDocument object
            PdfDocument doc = new PdfDocument();

            //Load a PDF file
            doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Terms of Service.pdf");

            //Get the second page
            PdfPageBase page = doc.Pages[1];
      
            //Create a PdfTextExtractot object
            PdfTextExtractor textExtractor = new PdfTextExtractor(page);

            //Create a PdfTextExtractOptions object
            PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();

            //Set isExtractAllText to true
            extractOptions.IsExtractAllText = true;

            //Extract text from the page
            string text = textExtractor.ExtractText(extractOptions);

            //Write to a txt file
            File.WriteAllText("Extracted.txt", text);
        }
    }
}

C#/VB.NET: Extract Text from PDF Documents

Extract Text from a Rectangle

The following are the steps to extract text from a rectangle area of a page using Spire.PDF for .NET.

Create a PdfDocument object.
Load a PDF file using PdfDocument.LoadFromFile() method.
Get the specific page through PdfDocument.Pages[index] property.
Create a PdfTextExtractor object.
Create a PdfTextExtractOptions object, and specify the rectangle area through the ExtractArea property of it.
Extract text from the rectangle using PdfTextExtractor.ExtractText() method.
Write the extracted text to a TXT file.

C#
VB.NET

using Spire.Pdf;
using Spire.Pdf.Texts;
using System.IO;
using System.Drawing;

namespace ExtractTextFromRectangleArea
{
    class Program
    {
        static void Main(string[] args)
        {
            //Create a PdfDocument object
            PdfDocument doc = new PdfDocument();

            //Load a PDF file
            doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Terms of Service.pdf");

            //Get the second page
            PdfPageBase page = doc.Pages[1];

            //Create a PdfTextExtractot object
            PdfTextExtractor textExtractor = new PdfTextExtractor(page);

            //Create a PdfTextExtractOptions object
            PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();

            //Set the rectangle area
            extractOptions.ExtractArea = new RectangleF(0, 0, 890, 170);

            //Extract text from the rectangle 
            string text = textExtractor.ExtractText(extractOptions);

            //Write to a txt file
            File.WriteAllText("Extracted.txt", text);
        }
    }
}

C#/VB.NET: Extract Text from PDF Documents

Extract Text using SimpleTextExtractionStrategy

The above methods extract text line by line. When extracting text using SimpleTextExtractionStrategy, it keeps track of the current Y position of each string and inserts a line break into the output if the Y position has changed. The following are the detailed steps.

Create a PdfDocument object.
Load a PDF file using PdfDocument.LoadFromFile() method.
Get the specific page through PdfDocument.Pages[index] property.
Create a PdfTextExtractor object.
Create a PdfTextExtractOptions object and set the IsSimpleExtraction property to true.
Extract text from the selected page using PdfTextExtractor.ExtractText() method.
Write the extracted text to a TXT file.

C#
VB.NET

using System.IO;
using Spire.Pdf;
using Spire.Pdf.Texts;

namespace SimpleExtraction
{
    class Program
    {
        static void Main(string[] args)
        {
            //Create a PdfDocument object
            PdfDocument doc = new PdfDocument();

            //Load a PDF file
            doc.LoadFromFile(@"C:\Users\Administrator\Desktop\Invoice.pdf");

            //Get the first page
            PdfPageBase page = doc.Pages[0];

            //Create a PdfTextExtractor object
            PdfTextExtractor textExtractor = new PdfTextExtractor(page);

            //Create a PdfTextExtractOptions object
            PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();

            //Set IsSimpleExtraction to true
            extractOptions.IsSimpleExtraction = true;

            //Extract text from the selected page 
            string text = textExtractor.ExtractText(extractOptions);

            //Write to a txt file
            File.WriteAllText("Extracted.txt", text);
        }
    }
}

C#/VB.NET: Extract Text from PDF Documents

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.

Published in Extract/Read