Displaying items by tag: pdf Python Attachment

Monday, 19 February 2024 03:31

Python: Extract Attachments from a PDF Document

In addition to text and images, PDF files can also contain various types of attachments, such as documents, images, audio files, or other multimedia elements. Extracting attachments from PDF files allows users to retrieve and save the embedded content, enabling easy access and manipulation outside of the PDF environment. This process proves especially useful when dealing with PDFs that contain important supplementary materials, such as reports, spreadsheets, or legal documents.

In this article, you will learn how to extract attachments from a PDF document in Python using Spire.PDF for Python.

Extract Document-Level Attachments from PDF in Python
Extract Annotation Attachments from PDF in Python

Install Spire.PDF for Python

This scenario requires Spire.PDF for Python and plum-dispatch v1.7.4. They can be easily installed in your Windows through the following pip command.

Package Manager

pip install Spire.PDF

If you are unsure how to install, please refer to this tutorial: How to Install Spire.PDF for Python on Windows

Prerequisite Knowledge

There are generally two categories of attachments in PDF files: document-level attachments and annotation attachments. Below, you can find a table outlining the disparities between these two types of attachments and how they are represented in Spire.PDF.

Attachment type	Represented by	Definition
Document level attachment	PdfAttachment class	A file attached to a PDF at the document level won't appear on a page, but can be viewed in the "Attachments" panel of a PDF reader.
Annotation attachment	PdfAnnotationAttachment class	A file attached as an annotation can be found on a page or in the "Attachments" panel. An annotation attachment is shown as a paper clip icon on the page; reviewers can double-click the icon to open the file.

Extract Document-Level Attachments from PDF in Python

To retrieve document-level attachments in a PDF document, you can use the PdfDocument.Attachments property. Each attachment has a PdfAttachment.FileName property, which provides the name of the specific attachment, including the file extension. Additionally, the PdfAttachment.Data property allows you to access the attachment's data. To save the attachment to a specific folder, you can utilize the PdfAttachment.Data.Save() method.

The steps to extract document-level attachments from a PDF using Python are as follows.

Create a PdfDocument object.
Load a PDF file using PdfDocument.LoadFromFile() method.
Get a collection of attachments using PdfDocument.Attachments property.
Iterate through the attachments in the collection.
Get a specific attachment from the collection, and get the file name and data of the attachment using PdfAttachment.FileName property and PdfAttachment.Data property.
Save the attachment to a specified folder using PdfAttachment.Data.Save() method.

Python

from spire.pdf import *
from spire.pdf.common import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF file
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Attachments.pdf")

# Get the attachment collection from the document
collection = doc.Attachments

# Loop through the collection
if collection.Count > 0:
    for i in range(collection.Count):

        # Get a specific attachment
        attactment = collection.get_Item(i)

        # Get the file name and data of the attachment
        fileName= attactment.FileName
        data = attactment.Data

        # Save it to a specified folder
        data.Save("Output\\ExtractedFiles\\" + fileName)

doc.Close()

Python: Extract Attachments from a PDF Document

Extract Annotation Attachments from PDF in Python

The Annotations attachment is a page-based element. To retrieve annotations from a specific page, use the PdfPageBase.AnnotationsWidget property. You then need to determine if a particular annotation is an attachment. If it is, save it to the specified folder while retaining its original filename.

The following are the steps to extract annotation attachments from a PDF using Python.

Create a PdfDocument object.
Load a PDF file using PdfDocument.LoadFromFile() method.
Iterate though the pages in the document.
Get the annotations from a particular page using PdfPageBase.AnnotationsWidget property.
Iterate though the annotations, and determine if a specific annotation is an attachment annotation.
If it is, get the file name and data of the annotation using PdfAttachmentAnnotation.FileName property and PdfAttachmentAnnotation.Data property.
Save the annotated attachment to a specified folder.

Python

from spire.pdf import *
from spire.pdf.common import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF file
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\AnnotationAttachment.pdf")

# Iterate through the pages in the document
for i in range(doc.Pages.Count):

    # Get a specific page
    page = doc.Pages.get_Item(i)

    # Get the annotation collection of the page
    annotationCollection = page.AnnotationsWidget

    # If the page has annotations
    if annotationCollection.Count > 0:

        # Iterate through the annotations
        for j in range(annotationCollection.Count):

            # Get a specific annotation
            annotation = annotationCollection.get_Item(j)

            # Determine if the annotation is an attachment annotation
            if isinstance(annotation, PdfAttachmentAnnotationWidget):
              
                # Get the file name and data of the attachment
                fileName = annotation.FileName
                byteData = annotation.Data
                streamMs = Stream(byteData)

                # Save the attachment into a specified folder 
                streamMs.Save("Output\\ExtractedFiles\\" + fileName)

Python: Extract Attachments from a PDF Document

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.

Published in Attachment

Python: Remove Attachments from a PDF Document

The inclusion of attachments in a PDF can be useful for sharing related files or providing additional context and resources alongside the main document. However, there may be instances when you need to remove attachments from a PDF for reasons like reducing file size, protecting sensitive information, or simply decluttering the document. In this article, you will learn how to remove attachments from a PDF document in Python using Spire.PDF for Python.

Remove Document-Level Attachments from PDF in Python
Remove Annotation Attachments from PDF in Python

Install Spire.PDF for Python

This scenario requires Spire.PDF for Python and plum-dispatch v1.7.4. They can be easily installed in your Windows through the following pip command.

Package Manager

pip install Spire.PDF

If you are unsure how to install, please refer to this tutorial: How to Install Spire.PDF for Python on Windows

Prerequisite Knowledge

There are typically two types of attachments in PDF, document-level attachments and annotation attachments. The following table lists the differences between them and their representations in Spire.PDF.

Attachment type	Represented by	Definition
Document level attachment	PdfAttachment class	A file attached to a PDF at the document level won't appear on a page, but can be viewed in the "Attachments" panel of a PDF reader.
Annotation attachment	PdfAnnotationAttachment class	A file attached as an annotation can be found on a page or in the "Attachment" panel. An annotation attachment is shown as a paper clip icon on the page; reviewers can double-click the icon to open the file.

Remove Document-Level Attachments from PDF in Python

To obtain all document-level attachments of a PDF document, use the PdfDocument.Attachments property. Then, you can remove all of them using the Clear() method or selectively remove a specific attachment using the RemoveAt() method. The following are the steps to remove document-level attachments from PDF in Python.

Create a PdfDocument object.
Load a PDF document using PdfDocument.LoadFromFile() method.
Get the attachment collection from the document using PdfDocument.Attachments property.
Remove all attachments using PdfAttachmentCollection.Clear() method. To remove a specific attachment, use PdfAttachmentCollection.RemoveAt() method.
Save the changes to a different PDF file using PdfDocument.SaveToFile() method.

Python

from spire.pdf import *
from spire.pdf.common import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF file
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Attachments.pdf")

# Get the attachment collection from the document
attachments = doc.Attachments

# Remove all attachments
attachments.Clear()

# Remove a specific attachment
# attachments.RemoveAt(0)

# Save the changes to file
doc.SaveToFile("output/DeleteAttachments.pdf")

# Close the document
doc.Close()

Remove Annotation Attachments from PDF in Python

Annotations are page-based elements, and to retrieve all annotations from a document, you need to iterate through the pages and obtain the annotations from each page. Next, identify if a particular annotation is an attachment annotation, and finally remove it from the annotation collection using the RemoveAt() method.

The following are the steps to remove annotation attachments from PDF in Python.

Create a PdfDocument object.
Load a PDF document using PdfDocument.LoadFromFile() method.
Iterate through the pages in the document
Get the annotation collection from a specific page through PdfPageBase.AnnotationsWidget property.
Iterate through the annotations in the collection.
Determine if a specific annotation is an instance of PdfAttachmentAnnotationWidget.
Remove the attachment annotation using PdfAnnotationCollection.RemoveAt() method.
Save the changes to a different PDF file using PdfDocument.SaveToFile() method.

Python

from spire.pdf import *
from spire.pdf.common import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF file
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\AnnotationAttachment.pdf")

# Iterate through the pages in the document
for i in range(doc.Pages.Count):

    # Get annotation collection from a certain page
    annotationCollection = doc.Pages.get_Item(i).AnnotationsWidget

    if annotationCollection.Count > 0:

        # Iterate through the annotation in the collection
        for j in range(annotationCollection.Count):

            # Get a specific annotation
            annotation = annotationCollection.get_Item(j)

            # Determine if it is an attachment annotation
            if isinstance(annotation, PdfAttachmentAnnotationWidget):

                # Remove the annotation
                annotationCollection.RemoveAt(j)

# Save the changes to file
doc.SaveToFile("output/DeleteAnnotationAttachment.pdf")

# Close the document
doc.Close()

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.

Published in Attachment

Python: Attach Files to a PDF Document

By incorporating supplemental resources directly into the PDF, it consolidates all relevant information in a single file, making it easier to organize, share, and archive. This feature enables users to seamlessly share supporting documents, images, or multimedia elements, eliminating the need for separate file transfers or external links. It streamlines communication, improves efficiency, and ensures that recipients have convenient access to all the necessary resources within the PDF itself. In this article, you will learn how to attach files to a PDF document in Python with the help of Spire.PDF for Python.

Attach Files to a PDF Document in Python
Attach Files as Annotations in PDF in Python

Install Spire.PDF for Python

This scenario requires Spire.PDF for Python and plum-dispatch v1.7.4. They can be easily installed in your Windows through the following pip command.

Package Manager

pip install Spire.PDF

If you are unsure how to install, please refer to this tutorial: How to Install Spire.PDF for Python on Windows

Background Knowledge

There are generally two types of attachments in PDF, namely document level attachment and annotation attachment. Both are supported by Spire.PDF for Python. The table below lists the differences between them.

Attachment type	Represented by	Definition
Document level attachment	PdfAttachment class	A file attached to a PDF at the document level won't appear on a page, but can be viewed in the "Attachments" panel of a PDF reader.
Annotation attachment	PdfAnnotationAttachment class	A file attached as an annotation can be found on a page or in the "Attachment" panel. An Annotation attachment is shown as a paper clip icon on the page; reviewers can double-click the icon to open the file.

Attach Files to a PDF Document in Python

To attach files to PDFs at the document level, you first need to create a PdfAttachment object based on an external file, and then you can add it to a PDF document using the PdfDocument.Attachments.Add() method. The following are the detailed steps.

Create a PdfDocument object.
Load a PDF document using PdfDocument.LoadFromFile() method.
Create a PdfAttachment object based on an external file.
Add the attachment to the document using PdfDocument.Attachments.Add() method.
Save the document to a different PDF file using PdfDocument.SaveToFile() method.

Python

from spire.pdf import *
from spire.pdf.common import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a sample PDF file
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pdf")

# Create PdfAttachment objects based on external files
attachment_one = PdfAttachment("C:\\Users\\Administrator\\Desktop\\Data.xlsx")
attachment_two = PdfAttachment("C:\\Users\\Administrator\\Desktop\\logo.png")

# Add the attachments to PDF
doc.Attachments.Add(attachment_one)
doc.Attachments.Add(attachment_two)

# Save to file
doc.SaveToFile("output/Attachment.pdf")

Python: Attach Files to a PDF Document

Attach Files as Annotations in PDF in Python

An annotation attachment is represented by the PdfAttachmentAnnotation class. Create an instance of this class, specify its attributes such as bounds, flag and text, and then add it to a specified page using the PdfPageBase.AnnotationsWidget.Add() method.

Below are the steps to attach files as annotations in PDF using Spire.PDF for Python.

Create a PdfDocument object.
Load a PDF document using PdfDocument.LoadFromFile() method.
Get a specific page to add annotation through PdfDocument.Pages[] property.
Create a PdfAttachmentAnnotation object based on an external file.
Add the annotation attachment to the page using PdfPageBase.AnnotationsWidget.Add() method.
Save the document to a different PDF file using PdfDocument.SaveToFile() method.

Python

from spire.pdf import *
from spire.pdf.common import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a sample PDF file
doc.LoadFromFile("C:\\Users\\Administrator\\Desktop\\Input.pdf")

# Get a specific page
page = doc.Pages[1]

# Draw a string on PDF
str = "Here is the report:"
font = PdfTrueTypeFont("Times New Roman", 16.0, PdfFontStyle.Bold, True)
x = 50.0
y = doc.Pages[0].ActualSize.Height - 300.0
page.Canvas.DrawString(str, font, PdfBrushes.get_Blue(), x, y)

# Create a PdfAttachmentAnnotation object based on an external file
data = Stream("C:\\Users\\Administrator\\Desktop\\Data.xlsx")
size = font.MeasureString(str);
bounds = RectangleF((x + size.Width + 5.0), y, 10.0, 15.0)
annotation = PdfAttachmentAnnotation(bounds, "Report.docx", data);

# Set color, flag, icon and text of the annotation
annotation.Color = PdfRGBColor(Color.get_Blue())
annotation.Flags = PdfAnnotationFlags.Default
annotation.Icon = PdfAttachmentIcon.Graph
annotation.Text = "Click to open the file"

# Add the attachment annotation to PDF
page.AnnotationsWidget.Add(annotation)

# Save to file
doc.SaveToFile("output/AnnotationAttachment.pdf")

Python: Attach Files to a PDF Document

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.

Published in Attachment