Java: Extract Text from HTML

HTML (Hypertext Markup Language) has become one of the most commonly used text markup languages on the Internet, and nearly all web pages are created using HTML. While HTML contains numerous tags and formatting information, the most valuable content is typically the visible text. It is important to know how to extract the text content from an HTML file when users intend to utilize it for tasks such as editing, AI training, or storing in databases. This article will demonstrate how to extract text from HTML using Spire.Doc for Java within Java programs.

Install Spire.Doc for Java

First of all, you're required to add the Spire.Doc.jar file as a dependency in your Java program. The JAR file can be downloaded from this link. If you use Maven, you can easily import the JAR file in your application by adding the following code to your project's pom.xml file.

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.doc</artifactId>
        <version>12.11.0</version>
    </dependency>
</dependencies>
    

Extract Text from HTML File

Spire.Doc for Java supports loading HTML files using the Document.loadFromFile(fileName, FileFormat.Html) method. Then, users can use Document.getText() method to get the text that is visible in browsers and write it to a TXT file. The detailed steps are as follows:

  • Create an object of Document class.
  • Load an HTML file using Document.loadFromFile(fileName, FileFormat.Html) method.
  • Get the text of the HTML file using Document.getText() method.
  • Write the text to a TXT file.
  • Java
import com.spire.doc.Document;
import com.spire.doc.FileFormat;

import java.io.FileWriter;
import java.io.IOException;

public class ExtractTextFromHTML {
    public static void main(String[] args) throws IOException {

        //Create an object of Document class
        Document doc = new Document();

        //Load an HTML file
        doc.loadFromFile("Sample.html", FileFormat.Html);

        //Get text from the HTML file
        String text = doc.getText();

        //Write the text to a TXT file
        FileWriter fileWriter = new FileWriter("HTMLText.txt");
        fileWriter.write(text);
        fileWriter.close();
    }
}

HTML Web Page:

Java: Extract Text from HTML

Extracted Text:

Java: Extract Text from HTML

Extract Text from URL

To extract text from a URL, users need to create a custom method to retrieve the HTML file from the URL and then extract the text from it. The detailed steps are as follows:

  • Create an object of Document class.
  • Use the custom method readHTML() to get the HTML file from a URL and return the file path.
  • Load the HTML file using Document.loadFromFile(filename, FileFormat.Html) method.
  • Get the text from the HTML file using Document.getText() method.
  • Write the text to a TXT file.
  • Java
import com.spire.doc.Document;
import com.spire.doc.FileFormat;

import java.io.*;
import java.net.URL;
import java.net.URLConnection;

public class ExtractTextFromURL {
    public static void main(String[] args) throws IOException {
        //Create an object of Document
        Document doc = new Document();

        //Call the custom method to load the HTML file from a URL
        doc.loadFromFile(readHTML("https://aeon.co/essays/how-to-face-the-climate-crisis-with-spinoza-and-self-knowledge", "output.html"), FileFormat.Html);

        //Get the text from the HTML file
        String urlText = doc.getText();

        //Write the text to a TXT file
        FileWriter fileWriter = new FileWriter("URLText.txt");
        fileWriter.write(urlText);
    }

    public static String readHTML(String urlString, String saveHtmlFilePath) throws IOException {

        //Create an object of URL class
        URL url = new URL(urlString);

        //Open the URL
        URLConnection connection = url.openConnection();

        //Save the url as an HTML file
        BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF-8"));
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(saveHtmlFilePath), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(line);
            writer.newLine();
        }

        reader.close();
        writer.close();

        //Return the file path of the saved HTML file
        return saveHtmlFilePath;
    }
}

URL Web Page:

Java: Extract Text from HTML

Extracted Text:

Java: Extract Text from HTML

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.