Java: Extract Images from PDF Documents
Extracting images from PDF documents is useful whenever visuals need to be reused outside the file. Graphic designers can source artwork, content creators can repurpose images for blogs or social media, and data analysts can pull specific graphics into reports. Doing this programmatically saves time when many pages or files are involved.
In this article, you will learn how to extract images from an individual PDF page as well as from an entire PDF document, using Spire.PDF for Java.
Install Spire.PDF for Java
First of all, you're required to add the Spire.Pdf.jar file as a dependency in your Java program. The JAR file can be downloaded from this link. If you use Maven, you can easily import the JAR file in your application by adding the following code to your project's pom.xml file.
<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>10.10.7</version>
    </dependency>
</dependencies>
Extract Images from a Specific PDF Page in Java
The PdfImageHelper class in Spire.PDF for Java is designed to facilitate image management within PDF documents. It enables users to perform several operations, such as deleting, replacing, and retrieving images.
To get information about the images on a specific PDF page, developers can use the PdfImageHelper.getImagesInfo(PdfPageBase page) method. Once they have this information, they can export the image data in widely used formats such as PNG and JPEG.
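PNG can store transparency, but JPEG cannot, so an image with an alpha channel usually needs to be flattened before it is written as JPEG. The following is a minimal sketch using only the standard javax.imageio and java.awt APIs. The class name JpegExportSketch and its two helper methods are hypothetical names chosen for illustration and are not part of Spire.PDF; the white background is likewise an arbitrary choice.

```java
import javax.imageio.ImageIO;
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

class JpegExportSketch {

    // Copies the image onto an opaque white RGB canvas so a JPEG writer can encode it
    static BufferedImage toOpaqueRgb(BufferedImage source) {
        BufferedImage rgb = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = rgb.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, rgb.getWidth(), rgb.getHeight());
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return rgb;
    }

    // Encodes the image as JPEG bytes; ImageIO.write returns false if no writer is found
    static byte[] encodeAsJpeg(BufferedImage source) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (!ImageIO.write(toOpaqueRgb(source), "jpg", out)) {
            throw new IOException("No JPEG writer available");
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an image obtained from a PDF page
        BufferedImage sample = new BufferedImage(16, 16, BufferedImage.TYPE_INT_ARGB);
        byte[] jpeg = encodeAsJpeg(sample);
        System.out.println("JPEG bytes: " + jpeg.length);
    }
}
```

In an extraction loop, the BufferedImage returned by PdfImageInfo.getImage() would take the place of the sample image.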
The steps to extract images from a specific PDF page using Java are as follows:
- Create a PdfDocument object.
- Load a PDF file using the PdfDocument.loadFromFile() method.
- Get a specific page using the PdfDocument.getPages().get(index) method.
- Create a PdfImageHelper object.
- Get the image information collection from the page using the PdfImageHelper.getImagesInfo() method.
- Iterate through the image information collection.
- Get a specific piece of image information.
- Get the image data from the image information using the PdfImageInfo.getImage() method.
- Write the image data as a PNG file using the ImageIO.write() method.
The following code demonstrates how to extract images from a particular page in a PDF document and save them in a specified folder.
- Java
import com.spire.pdf.*;
import com.spire.pdf.utilities.*;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

public class ExtractImagesFromPage {

    public static void main(String[] args) throws IOException {

        // Create a PdfDocument object
        PdfDocument doc = new PdfDocument();

        // Load a PDF document
        doc.loadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pdf");

        // Get a specific page
        PdfPageBase page = doc.getPages().get(0);

        // Create a PdfImageHelper object
        PdfImageHelper imageHelper = new PdfImageHelper();

        // Get all image information from the page
        PdfImageInfo[] imageInfos = imageHelper.getImagesInfo(page);

        // Iterate through the image information
        for (int i = 0; i < imageInfos.length; i++) {

            // Get a specific piece of image information
            PdfImageInfo imageInfo = imageInfos[i];

            // Get the image
            BufferedImage image = imageInfo.getImage();
            File file = new File(String.format("C:\\Users\\Administrator\\Desktop\\Extracted\\Image-%d.png", i));

            // Save the image file in PNG format
            ImageIO.write(image, "PNG", file);
        }

        // Dispose resources
        doc.dispose();
    }
}
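The example above writes into an Extracted folder, and ImageIO.write() will generally fail with an IOException if that folder does not exist yet. The following sketch shows one way to guard against this using only standard Java APIs. PngSaveSketch and savePng are hypothetical names introduced for illustration, and the folder used in main is an arbitrary temporary location.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class PngSaveSketch {

    // Creates the target folder if needed, then writes the image as Image-<index>.png
    static Path savePng(BufferedImage image, Path folder, int index) throws IOException {
        Files.createDirectories(folder);
        Path target = folder.resolve(String.format("Image-%d.png", index));
        if (!ImageIO.write(image, "png", target.toFile())) {
            throw new IOException("No PNG writer available for " + target);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an image obtained from a PDF page
        BufferedImage sample = new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB);
        Path folder = Files.createTempDirectory("pdf-images").resolve("Extracted");
        System.out.println("Saved " + savePng(sample, folder, 0));
    }
}
```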
Extract Images from an Entire PDF Document in Java
From the example above, you learned how to extract images from a specific page. By iterating through each page in the document and performing image extraction on every one, you can easily gather all images from the entire document.
The steps to extract images from an entire PDF document using Java are as follows:
- Create a PdfDocument object.
- Load a PDF file using the PdfDocument.loadFromFile() method.
- Create a PdfImageHelper object.
- Iterate through the pages in the document.
- Get a specific page using the PdfDocument.getPages().get(index) method.
- Get the image information collection from the page using the PdfImageHelper.getImagesInfo() method.
- Iterate through the image information collection and save each instance as a PNG file using the ImageIO.write() method.
The following code illustrates how to extract all images from a PDF document and save them in a specified folder.
- Java
import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;
import com.spire.pdf.utilities.PdfImageHelper;
import com.spire.pdf.utilities.PdfImageInfo;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

public class ExtractAllImages {

    public static void main(String[] args) throws IOException {

        // Create a PdfDocument object
        PdfDocument doc = new PdfDocument();

        // Load a PDF document
        doc.loadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pdf");

        // Create a PdfImageHelper object
        PdfImageHelper imageHelper = new PdfImageHelper();

        // Running counter used to give each output file a unique number
        int m = 0;

        // Iterate through the pages
        for (int i = 0; i < doc.getPages().getCount(); i++) {

            // Get a specific page
            PdfPageBase page = doc.getPages().get(i);

            // Get all image information from the page
            PdfImageInfo[] imageInfos = imageHelper.getImagesInfo(page);

            // Iterate through the image information
            for (int j = 0; j < imageInfos.length; j++) {

                // Get a specific piece of image information
                PdfImageInfo imageInfo = imageInfos[j];

                // Get the image
                BufferedImage image = imageInfo.getImage();
                File file = new File(String.format("C:\\Users\\Administrator\\Desktop\\Extracted\\Image-%d.png", m));
                m++;

                // Save the image file in PNG format
                ImageIO.write(image, "PNG", file);
            }
        }

        // Dispose resources
        doc.dispose();
    }
}
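The running counter m keeps file names unique, but the names no longer show which page an image came from. If that matters, the page and image indexes can be encoded in the name instead. The following is a small sketch of that idea; ImageNameSketch, imageFileName, and the naming pattern are hypothetical choices made for illustration.

```java
import java.util.Locale;

class ImageNameSketch {

    // Builds a one-based, zero-padded name from the zero-based loop indexes
    static String imageFileName(int pageIndex, int imageIndex) {
        return String.format(Locale.ROOT, "page-%03d-image-%03d.png",
                pageIndex + 1, imageIndex + 1);
    }

    public static void main(String[] args) {
        // Third image on the first page
        System.out.println(imageFileName(0, 2)); // prints "page-001-image-003.png"
    }
}
```

Zero padding keeps the files in page order when a folder is sorted by name.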
Apply for a Temporary License
If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.
Java: Extract Text from a PDF Document
Getting text out of PDFs can be a challenge, especially if you receive hundreds of PDF documents on a daily basis. At that volume, automating extraction becomes necessary: a program can process documents in bulk and apply the same extraction logic to every file, which avoids the transcription errors of manual copying. In this article, you will learn how to extract text from a searchable PDF document in Java using Spire.PDF for Java.
- Extract All Text from a Specified Page
- Extract Text from a Rectangle Area
- Extract Text Using SimpleTextExtractionStrategy
Install Spire.PDF for Java
First of all, you're required to add the Spire.Pdf.jar file as a dependency in your Java program. The JAR file can be downloaded from this link. If you use Maven, you can easily import the JAR file in your application by adding the following code to your project's pom.xml file.
<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>10.10.7</version>
    </dependency>
</dependencies>
Extract All Text from a Specified Page
Spire.PDF for Java provides the PdfTextExtractor class to extract text from a searchable PDF and the PdfTextExtractOptions class to control how the extraction is performed. When the PdfTextExtractor.extract() method is given a PdfTextExtractOptions object with its default settings, it extracts all text from the specified page. The detailed steps are as follows.
- Create a PdfDocument object.
- Load a PDF file using the PdfDocument.loadFromFile() method.
- Get the specific page using the PdfDocument.getPages().get() method.
- Create a PdfTextExtractor object.
- Create a PdfTextExtractOptions object and leave its settings at their defaults.
- Extract text from the selected page using the PdfTextExtractor.extract() method.
- Write the extracted text to a TXT file.
- Java
import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;
import com.spire.pdf.texts.PdfTextExtractOptions;
import com.spire.pdf.texts.PdfTextExtractor;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ExtractTextFromPage {

    public static void main(String[] args) throws IOException {

        //Create a PdfDocument object
        PdfDocument doc = new PdfDocument();

        //Load a PDF file
        doc.loadFromFile("C:\\Users\\Administrator\\Desktop\\Terms of Service.pdf");

        //Get the second page
        PdfPageBase page = doc.getPages().get(1);

        //Create a PdfTextExtractor object
        PdfTextExtractor textExtractor = new PdfTextExtractor(page);

        //Create a PdfTextExtractOptions object
        PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();

        //Extract text from the page
        String text = textExtractor.extract(extractOptions);

        //Write to a txt file
        Files.write(Paths.get("output/Extracted.txt"), text.getBytes());
    }
}
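The final step in the example writes to output/Extracted.txt with the platform default charset, so it depends on the output folder already existing and on the machine's encoding settings. The following sketch shows a slightly more defensive variant of that step using only standard Java APIs. TextOutputSketch and writeUtf8 are hypothetical names introduced for illustration, and UTF-8 is an assumed choice of encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class TextOutputSketch {

    // Creates missing parent folders, then writes the text as UTF-8
    static Path writeUtf8(Path target, String text) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the string returned by the text extractor
        Path written = writeUtf8(Paths.get("output", "Extracted.txt"), "Sample extracted text");
        System.out.println("Wrote " + written.toAbsolutePath());
    }
}
```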
Extract Text from a Rectangle Area
To restrict extraction to a rectangular area, use the setExtractArea() method of the PdfTextExtractOptions class. The following steps show you how to extract text from a rectangular area of a page using Spire.PDF for Java.
- Create a PdfDocument object.
- Load a PDF file using the PdfDocument.loadFromFile() method.
- Get the specific page using the PdfDocument.getPages().get() method.
- Create a PdfTextExtractor object.
- Create a PdfTextExtractOptions object and specify a rectangle area using its setExtractArea() method.
- Extract text from the rectangle area using the PdfTextExtractor.extract() method.
- Write the extracted text to a TXT file.
- Java
import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;
import com.spire.pdf.texts.PdfTextExtractOptions;
import com.spire.pdf.texts.PdfTextExtractor;

import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ExtractFromRectangleArea {

    public static void main(String[] args) throws IOException {

        //Create a PdfDocument object
        PdfDocument doc = new PdfDocument();

        //Load a PDF file
        doc.loadFromFile("C:\\Users\\Administrator\\Desktop\\Terms of Service.pdf");

        //Get the second page
        PdfPageBase page = doc.getPages().get(1);

        //Create a PdfTextExtractor object
        PdfTextExtractor textExtractor = new PdfTextExtractor(page);

        //Create a PdfTextExtractOptions object
        PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();

        //Set the option to extract text from a rectangle area
        Rectangle2D rectangle2D = new Rectangle2D.Float(0, 0, 890, 170);
        extractOptions.setExtractArea(rectangle2D);

        //Extract text from the specified area
        String text = textExtractor.extract(extractOptions);

        //Write to a txt file
        Files.write(Paths.get("output/Extracted.txt"), text.getBytes());
    }
}
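The four arguments of Rectangle2D.Float are x, y, width, and height. The article does not state the unit that setExtractArea() expects. The sketch below assumes PDF points (1/72 inch), the unit normally used for PDF page geometry, and shows how a rectangle measured in millimetres could be converted under that assumption. ExtractAreaSketch, mmToPoints, and areaFromMillimetres are hypothetical names introduced for illustration.

```java
import java.awt.geom.Rectangle2D;

class ExtractAreaSketch {

    static final float POINTS_PER_INCH = 72f;
    static final float MM_PER_INCH = 25.4f;

    // Converts millimetres to PDF points (assumed unit: 1 point = 1/72 inch)
    static float mmToPoints(float mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    // Builds a rectangle in points from a position and size given in millimetres
    static Rectangle2D.Float areaFromMillimetres(float xMm, float yMm,
                                                 float widthMm, float heightMm) {
        return new Rectangle2D.Float(
                mmToPoints(xMm), mmToPoints(yMm),
                mmToPoints(widthMm), mmToPoints(heightMm));
    }

    public static void main(String[] args) {
        // A 60 mm high strip across the full width of an A4 page (210 mm)
        Rectangle2D.Float area = areaFromMillimetres(0, 0, 210, 60);
        System.out.println(area.getWidth() + " x " + area.getHeight());
    }
}
```

Under this assumption, a 60 mm strip is roughly 170 points high, which is the height used in the example above.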
Extract Text Using SimpleTextExtractionStrategy
The methods above extract text line by line. SimpleTextExtractionStrategy instead keeps track of the current Y position of each string and inserts a line break into the output whenever the Y position changes. The detailed steps are as follows.
- Create a PdfDocument object.
- Load a PDF file using the PdfDocument.loadFromFile() method.
- Get the specific page using the PdfDocument.getPages().get() method.
- Create a PdfTextExtractor object.
- Create a PdfTextExtractOptions object and enable SimpleTextExtractionStrategy using its setSimpleExtraction() method.
- Extract text with this strategy using the PdfTextExtractor.extract() method.
- Write the extracted text to a TXT file.
- Java
import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;
import com.spire.pdf.texts.PdfTextExtractOptions;
import com.spire.pdf.texts.PdfTextExtractor;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ExtractUsingSimpleTextStrategy {

    public static void main(String[] args) throws IOException {

        //Create a PdfDocument object
        PdfDocument doc = new PdfDocument();

        //Load a PDF file
        doc.loadFromFile("C:\\Users\\Administrator\\Desktop\\Invoice.pdf");

        //Get the first page
        PdfPageBase page = doc.getPages().get(0);

        //Create a PdfTextExtractor object
        PdfTextExtractor textExtractor = new PdfTextExtractor(page);

        //Create a PdfTextExtractOptions object
        PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();

        //Set the option to extract text using SimpleExtraction strategy
        extractOptions.setSimpleExtraction(true);

        //Extract text using the simple extraction strategy
        String text = textExtractor.extract(extractOptions);

        //Write to a txt file
        Files.write(Paths.get("output/Extracted.txt"), text.getBytes());
    }
}
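The Y-position rule described in this section can be illustrated without the library. The sketch below is not Spire.PDF's implementation; it is a simplified model of the idea, in which each text fragment carries a Y coordinate and a line break is inserted whenever that coordinate differs from the previous fragment's. The Chunk class, the join method, and the sample coordinates are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class YPositionLineBreakSketch {

    // Hypothetical text fragment: a string plus the Y coordinate it was drawn at
    static final class Chunk {
        final String text;
        final float y;

        Chunk(String text, float y) {
            this.text = text;
            this.y = y;
        }
    }

    // Joins chunks in the given order, starting a new line whenever Y changes
    static String join(List<Chunk> chunks) {
        StringBuilder out = new StringBuilder();
        Float lastY = null;
        for (Chunk chunk : chunks) {
            if (lastY != null && Float.compare(lastY, chunk.y) != 0) {
                out.append('\n');
            }
            out.append(chunk.text);
            lastY = chunk.y;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
                new Chunk("Invoice ", 700f),
                new Chunk("No. 42", 700f),
                new Chunk("Total: 10.00", 680f));
        // The first two chunks share a Y value and stay on one line
        System.out.println(join(chunks));
    }
}
```

A real extractor would likely compare Y values with some tolerance rather than exact equality, since glyphs on the same visual line can differ slightly in position.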