Python: Extract Comments from Word

Comments in Word documents are commonly used for collaborative review and feedback. They may contain text and images that guide improvements to the document. Extracting that text and those images lets you analyze the reviewers' feedback and understand the strengths, weaknesses, and suggestions they identified. In this article, we will demonstrate how to extract text and images from Word comments in Python using Spire.Doc for Python.

Install Spire.Doc for Python

This scenario requires Spire.Doc for Python and plum-dispatch v1.7.4. Both can be installed on Windows with the following pip command.

pip install Spire.Doc

If you are unsure how to install them, please refer to this tutorial: How to Install Spire.Doc for Python on Windows
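
To confirm that the installation succeeded, a quick check along the following lines may help. This is only a sketch: installed_version is a hypothetical helper name, and it assumes the package is registered under the same name used in the pip command (Spire.Doc).

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "Spire.Doc" is the distribution name used in the pip command above (assumed)
print(installed_version("Spire.Doc"))
```

If the call prints None, the package was most likely installed into a different Python environment than the one running the script.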

Extract Text from Word Comments in Python

You can easily retrieve the author and text of a Word comment using the Comment.Format.Author and Comment.Body.Paragraphs[index].Text properties provided by Spire.Doc for Python. The detailed steps are as follows.

  • Create an object of the Document class.
  • Load a Word document using the Document.LoadFromFile() method.
  • Create a list to store the extracted comment data.
  • Iterate through the comments in the document.
  • For each comment, iterate through the paragraphs of the comment body.
  • For each paragraph, get the text using the Comment.Body.Paragraphs[index].Text property.
  • Get the author of the comment using the Comment.Format.Author property.
  • Add the text and author of the comment to the list.
  • Save the content of the list to a text file.
from spire.doc import *
from spire.doc.common import *

# Create an object of the Document class
document = Document()
# Load a Word document containing comments
document.LoadFromFile("Comments.docx")

# Create a list to store the extracted comment data
comments = []

# Iterate through the comments in the document
for i in range(document.Comments.Count):
    comment = document.Comments[i]
    comment_text = ""

    # Iterate through the paragraphs in the comment body
    for j in range(comment.Body.Paragraphs.Count):
        paragraph = comment.Body.Paragraphs[j]
        comment_text += paragraph.Text + "\n"

    # Get the comment author
    comment_author = comment.Format.Author

    # Append the comment data to the list
    comments.append({
        "author": comment_author,
        "text": comment_text
    })

# Write the comment data to a file
with open("comment_data.txt", "w", encoding="utf-8") as file:
    for comment in comments:
        file.write(f"Author: {comment['author']}\nText: {comment['text']}\n\n")
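
Once the comments are collected, the list can also be analyzed directly instead of only being written to a file. The sketch below groups comment text by author. It assumes the same list-of-dictionaries shape ("author" and "text" keys) built by the script above; group_by_author and the sample data are hypothetical and used for illustration only.

```python
from collections import defaultdict


def group_by_author(comments):
    """Map each author to a list of their comment texts, with surrounding whitespace stripped."""
    grouped = defaultdict(list)
    for comment in comments:
        grouped[comment["author"]].append(comment["text"].strip())
    return dict(grouped)


# Sample data in the same shape as the `comments` list (illustrative only)
sample = [
    {"author": "Alice", "text": "Tighten the intro.\n"},
    {"author": "Bob", "text": "Add a source here.\n"},
    {"author": "Alice", "text": "Typo in the heading.\n"},
]

for author, texts in group_by_author(sample).items():
    print(f"{author}: {len(texts)} comment(s)")
```

Grouping this way makes it easier to see how much feedback each reviewer left before reading the individual comments.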


Extract Images from Word Comments in Python

To extract images from Word comments, iterate through the child objects in the paragraphs of each comment to find the DocPicture objects, get the image data using the DocPicture.ImageBytes property, and then save the image data to image files. The detailed steps are as follows.

  • Create an object of the Document class.
  • Load a Word document using the Document.LoadFromFile() method.
  • Create a list to store the extracted image data.
  • Iterate through the comments in the document.
  • For each comment, iterate through the paragraphs of the comment body.
  • For each paragraph, iterate through the child objects of the paragraph.
  • Check if the object is a DocPicture object.
  • If the object is a DocPicture, get the image data using the DocPicture.ImageBytes property and add it to the list.
  • Save the image data in the list to individual image files.
  • Python
# Import os explicitly for the path handling used when saving the images
import os

from spire.doc import *
from spire.doc.common import *
 
# Create an object of the Document class
document = Document()
# Load a Word document containing comments
document.LoadFromFile("Comments.docx")
 
# Create a list to store the extracted image data
images = []
 
# Iterate through the comments in the document
for i in range(document.Comments.Count):
    comment = document.Comments[i]
    # Iterate through the paragraphs in the comment body
    for j in range(comment.Body.Paragraphs.Count):
        paragraph = comment.Body.Paragraphs[j]
        # Iterate through the child objects in the paragraph
        for o in range(paragraph.ChildObjects.Count):
            obj = paragraph.ChildObjects[o]
            # Find the images
            if isinstance(obj, DocPicture):
                picture = obj
                # Get the image data and add it to the list
                data_bytes = picture.ImageBytes
                images.append(data_bytes)
 
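# Save the image data to image files
# Note: the .png extension is assumed; the embedded images may be in other formats
output_dir = "CommentImages"
# Create the output folder if it does not exist yet
os.makedirs(output_dir, exist_ok=True)
for i, image_data in enumerate(images):
    file_name = f"CommentImage-{i}.png"
    with open(os.path.join(output_dir, file_name), "wb") as image_file:
        image_file.write(image_data)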
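
The script above saves every image with a .png extension, although pictures embedded in comments may be stored in other formats. One possible refinement is to inspect the leading bytes of each image and choose the extension accordingly. The following is a minimal sketch, assuming the image data is a bytes-like object; guess_image_extension is a hypothetical helper, not part of Spire.Doc, and it recognizes only a few common formats.

```python
def guess_image_extension(data, default=".bin"):
    """Guess a file extension from the leading (magic) bytes of image data."""
    signatures = [
        (b"\x89PNG\r\n\x1a\n", ".png"),
        (b"\xff\xd8\xff", ".jpg"),
        (b"GIF87a", ".gif"),
        (b"GIF89a", ".gif"),
        (b"BM", ".bmp"),
    ]
    for magic, extension in signatures:
        if data.startswith(magic):
            return extension
    # Unknown signature: fall back to the caller-supplied default
    return default


# Example: build a file name for the first extracted image (dummy bytes here)
image_data = b"\xff\xd8\xff\xe0" + b"\x00" * 16
file_name = f"CommentImage-0{guess_image_extension(image_data, '.png')}"
print(file_name)
```

Passing '.png' as the default keeps the original behavior for any format the helper does not recognize.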


Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents or lift the functional limitations, please request a 30-day trial license.