When receiving or downloading a Word document from the Internet, you may sometimes need to extract content from the document for other purposes. In this article, you will learn how to programmatically extract text and images from a Word document using Spire.Doc for .NET.
Install Spire.Doc for .NET
To begin with, you need to add the DLL files included in the Spire.Doc for .NET package as references in your .NET project. The DLL files can be either downloaded from this link or installed via NuGet.
PM> Install-Package Spire.Doc
Extract Text from a Word Document
The following steps extract the text from a Word document and save it to a TXT file.
- Create a Document instance.
- Load a sample Word document using the Document.LoadFromFile() method.
- Create a StringBuilder instance.
- Get each paragraph of each section in the document.
- Get the text of each paragraph using the Paragraph.Text property, and append it to the StringBuilder instance using the StringBuilder.AppendLine() method.
- Create a new TXT file and write the extracted text to it using the File.WriteAllText() method.
using Spire.Doc;
using Spire.Doc.Documents;
using System.Text;
using System.IO;

namespace ExtractTextfromWord
{
    class ExtractText
    {
        static void Main(string[] args)
        {
            //Create a Document instance
            Document document = new Document();

            //Load a sample Word document
            document.LoadFromFile("input.docx");

            //Create a StringBuilder instance
            StringBuilder sb = new StringBuilder();

            //Extract text from Word and save to StringBuilder instance
            foreach (Section section in document.Sections)
            {
                foreach (Paragraph paragraph in section.Paragraphs)
                {
                    sb.AppendLine(paragraph.Text);
                }
            }

            //Create a new txt file to save the extracted text
            File.WriteAllText("Extract.txt", sb.ToString());
        }
    }
}
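When paragraph-level control is not needed, the same result can be sketched more briefly. The example below is a minimal sketch that assumes the installed Spire.Doc version exposes Document.GetText(), which returns the document's text as a single string. The class name ExtractAllText is hypothetical, and the file names are the same placeholders used above.

```csharp
using Spire.Doc;
using System.IO;

namespace ExtractTextfromWord
{
    class ExtractAllText
    {
        static void Main(string[] args)
        {
            //Create a Document instance and load the sample Word document
            Document document = new Document();
            document.LoadFromFile("input.docx");

            //Assumption: Document.GetText() returns the text of the whole document
            string text = document.GetText();

            //Write the extracted text to a TXT file
            File.WriteAllText("Extract.txt", text);
        }
    }
}
```

The paragraph loop remains the better fit when individual paragraphs need to be filtered or processed separately before they are written out.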
Extract Images from a Word Document
Below are detailed steps on how to extract all images from a Word document.
- Create a Document instance and load a sample Word document.
- Get each paragraph of each section in the document.
- Get each document object of a specific paragraph.
- Check whether the document object's type is Picture. If so, save the image to a file using the DocPicture.Image.Save(String, ImageFormat) method.
using Spire.Doc;
using Spire.Doc.Documents;
using Spire.Doc.Fields;
using System;

namespace ExtractImage
{
    class Program
    {
        static void Main(string[] args)
        {
            //Load a Word document
            Document document = new Document("input.docx");
            int index = 0;

            //Get each section of document
            foreach (Section section in document.Sections)
            {
                //Get each paragraph of section
                foreach (Paragraph paragraph in section.Paragraphs)
                {
                    //Get each document object of a specific paragraph
                    foreach (DocumentObject docObject in paragraph.ChildObjects)
                    {
                        //If the DocumentObjectType is picture, save it out of the document
                        if (docObject.DocumentObjectType == DocumentObjectType.Picture)
                        {
                            DocPicture picture = docObject as DocPicture;
                            picture.Image.Save(string.Format("image_{0}.png", index), System.Drawing.Imaging.ImageFormat.Png);
                            index++;
                        }
                    }
                }
            }
        }
    }
}
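The loops above inspect only the direct children of each section's paragraphs, so pictures nested deeper, for example inside table cells, may be missed. One possible way to reach them is a breadth-first walk over every document object. This is a sketch under stated assumptions, not the library's prescribed method: it assumes the ICompositeObject and IDocumentObject interfaces from the Spire.Doc.Interface namespace, and the class name and output file names are hypothetical.

```csharp
using Spire.Doc;
using Spire.Doc.Documents;
using Spire.Doc.Fields;
using Spire.Doc.Interface;
using System.Collections.Generic;

namespace ExtractImage
{
    class NestedImages
    {
        static void Main(string[] args)
        {
            //Load a Word document
            Document document = new Document("input.docx");
            int index = 0;

            //Assumption: Document implements ICompositeObject, so it can seed the queue
            Queue<ICompositeObject> nodes = new Queue<ICompositeObject>();
            nodes.Enqueue(document);

            //Visit every object in the document tree, level by level
            while (nodes.Count > 0)
            {
                ICompositeObject node = nodes.Dequeue();
                foreach (IDocumentObject child in node.ChildObjects)
                {
                    if (child.DocumentObjectType == DocumentObjectType.Picture)
                    {
                        //Save the picture if the cast succeeds
                        DocPicture picture = child as DocPicture;
                        if (picture != null)
                        {
                            picture.Image.Save(string.Format("nested_image_{0}.png", index), System.Drawing.Imaging.ImageFormat.Png);
                            index++;
                        }
                    }
                    else if (child is ICompositeObject)
                    {
                        //Queue containers (sections, tables, paragraphs) for a later pass
                        nodes.Enqueue(child as ICompositeObject);
                    }
                }
            }
        }
    }
}
```

A queue keeps the traversal iterative, which avoids deep recursion on documents with heavily nested tables.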
Apply for a Temporary License
If you'd like to remove the evaluation message from the generated documents, or to lift the functional limitations, you can request a 30-day trial license.