Documentation: Adds further details and examples for redactions. #3259

Merged
merged 2 commits into from
Mar 13, 2024
91 changes: 50 additions & 41 deletions docs/page.rst
@@ -272,9 +272,14 @@ In a nutshell, this is what you can do with PyMuPDF:
:rtype: :ref:`Annot`
:returns: the created annotation. It is drawn with line (stroke) color red = (1, 0, 0), line width 1, fill color is supported.

---------

Redactions
~~~~~~~~~~~

.. method:: add_redact_annot(quad, text=None, fontname=None, fontsize=11, align=TEXT_ALIGN_LEFT, fill=(1, 1, 1), text_color=(0, 0, 0), cross_out=True)

**PDF only**: Add a redaction annotation. A redaction annotation identifies content to be removed from the document. Adding such an annotation is the first of two steps. It makes visible what will be removed in the subsequent step, :meth:`Page.apply_redactions`.

:arg quad_like,rect_like quad: specifies the (rectangular) area to be removed which is always equal to the annotation rectangle. This may be a :data:`rect_like` or :data:`quad_like` object. If a quad is specified, then the enveloping rectangle is taken.

@@ -316,6 +321,50 @@ In a nutshell, this is what you can do with PyMuPDF:

|history_end|


.. method:: apply_redactions(images=PDF_REDACT_IMAGE_PIXELS, graphics=PDF_REDACT_LINE_ART_IF_TOUCHED)

**PDF only**: Remove all **content** contained in any redaction rectangle on the page.

**This method applies and then deletes all redactions from the page.**

:arg int images: How to redact overlapping images. The default (2) blanks out overlapping pixels. `PDF_REDACT_IMAGE_NONE` (0) ignores, and `PDF_REDACT_IMAGE_REMOVE` (1) completely removes images overlapping any redaction annotation. Option `PDF_REDACT_IMAGE_REMOVE_UNLESS_INVISIBLE` (3) only removes images that are actually visible.

:arg int graphics: How to redact overlapping vector graphics (also called "line art" or "drawings"). The default (2) removes any overlapping vector graphics. `PDF_REDACT_LINE_ART_NONE` (0) ignores, and `PDF_REDACT_LINE_ART_IF_COVERED` (1) removes graphics fully contained in a redaction annotation.


:returns: `True` if at least one redaction annotation has been processed, `False` otherwise.
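As a rough illustration of how the two parameters combine, the sketch below builds the keyword arguments from two yes/no choices. The constant names are local stand-ins defined in the example itself, using the integer values listed in the parameter descriptions above, and `redaction_options` is a hypothetical helper, not part of PyMuPDF.

```python
# Local stand-ins for the option values listed in the parameter descriptions.
PDF_REDACT_IMAGE_NONE = 0      # leave overlapping images untouched
PDF_REDACT_IMAGE_PIXELS = 2    # blank out overlapping pixels (default)
PDF_REDACT_LINE_ART_NONE = 0   # leave overlapping vector graphics untouched
PDF_REDACT_LINE_ART_IF_TOUCHED = 2  # remove any overlapping graphics (default)


def redaction_options(touch_images=True, touch_graphics=True):
    """Return keyword arguments for apply_redactions (hypothetical helper)."""
    return {
        "images": PDF_REDACT_IMAGE_PIXELS if touch_images else PDF_REDACT_IMAGE_NONE,
        "graphics": (
            PDF_REDACT_LINE_ART_IF_TOUCHED if touch_graphics else PDF_REDACT_LINE_ART_NONE
        ),
    }
```

With a page object at hand, a call such as `page.apply_redactions(**redaction_options(touch_graphics=False))` would then remove text and blank out image pixels while leaving line art in place.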

.. note::
* Text contained in a redaction rectangle will be **physically** removed from the page (assuming :meth:`Document.save` with a suitable garbage option) and will no longer appear in e.g. text extractions or anywhere else. All redaction annotations will also be removed. Other annotations are unaffected.

* All overlapping links will be removed. If the rectangle of the link was covering text, then only the overlapping part of the text is removed. The same applies to images covered by link rectangles.

* The overlapping parts of **images** will be blanked-out for default option `PDF_REDACT_IMAGE_PIXELS` (changed in v1.18.0). Option 0 does not touch any images and 1 will remove any image with an overlap. Please be aware that there is a bug for option *PDF_REDACT_IMAGE_PIXELS = 2*: transparent images will be incorrectly handled!

* For option `images=PDF_REDACT_IMAGE_REMOVE` only this page's **references to the images** are removed - not necessarily the images themselves. Images are completely removed from the file only if they are no longer referenced at all (assuming suitable garbage collection options).

* For option `images=PDF_REDACT_IMAGE_PIXELS` a new image of format PNG is created, which the page will use in place of the original one. The original image is not deleted or replaced as part of this process, so other pages may still show the original. In addition, the new, modified PNG image currently is **stored uncompressed**. Do keep these aspects in mind when choosing the right garbage collection method and compression options during save.

* **Text removal** is done by character: A character is removed if its bbox has a **non-empty overlap** with a redaction rectangle (changed in MuPDF v1.17). Depending on the font properties and / or the chosen line height, deletion may occur for undesired text parts. Using :meth:`Tools.set_small_glyph_heights` with a *True* argument before text search may help to prevent this.

* Redactions are a simple way to replace single words in a PDF, or to just physically remove them. Locate the word "secret" using some text extraction or search method and insert a redaction using "xxxxxx" as replacement text for each occurrence.

- Be wary if the replacement is longer than the original -- this may lead to an awkward appearance, line breaks or no new text at all.

- For a number of reasons, the new text may not be positioned exactly on the same line as the old one -- especially true if the replacement font was not one of CJK or :ref:`Base-14-Fonts`.
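The word replacement described in the last bullet could be sketched roughly as follows. The function name `replace_word` and its defaults are assumptions made for this illustration; the sketch only relies on :meth:`Page.search_for`, :meth:`Page.add_redact_annot` and :meth:`Page.apply_redactions`, and expects a :ref:`Page` of an already opened PDF.

```python
def replace_word(page, word, replacement="xxxxxx"):
    """Redact every occurrence of `word` on `page` (hypothetical helper).

    Returns the number of occurrences that were marked for redaction.
    """
    # search_for yields one rectangle per occurrence found on the page
    hits = page.search_for(word)

    for rect in hits:
        # mark the occurrence and ask for replacement text to be inserted
        page.add_redact_annot(rect, text=replacement)

    # only touch the page if there is something to redact
    if hits:
        page.apply_redactions()

    return len(hits)
```

When saving afterwards, a call like `doc.save("out.pdf", garbage=3, deflate=True)` is one possible choice of garbage collection and compression settings; which values are suitable depends on the document.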

|history_begin|

* New in v1.16.11
* Changed in v1.16.12: The previous *mark* parameter is gone. Instead, the respective rectangles are filled with the individual *fill* color of each redaction annotation. If a *text* was given in the annotation, then :meth:`insert_textbox` is invoked to insert it, using parameters provided with the redaction.
* Changed in v1.18.0: added option for handling images that overlap redaction areas.
* Changed in v1.23.27: added option for removing graphics as well.

|history_end|

---------

.. method:: add_polyline_annot(points)

.. method:: add_polygon_annot(points)
@@ -511,46 +560,6 @@ In a nutshell, this is what you can do with PyMuPDF:

|history_end|

.. method:: delete_link(linkdict)

103 changes: 103 additions & 0 deletions docs/the-basics.rst
@@ -993,6 +993,9 @@ Tables can be found and extracted from any document :ref:`Page`.
There is also the `pdf2docx extract tables method`_ which is capable of table extraction if you prefer.


--------------------------


.. _The_Basics_Get_Page_Links:

Getting Page Links
@@ -1024,6 +1027,9 @@ Links can be extracted from a :ref:`Page` to return :ref:`Link` objects.
- :meth:`Page.first_link`


-----------------------------


.. _The_Basics_Get_All_Annotations:

Getting All Annotations from a Document
@@ -1050,6 +1056,103 @@ Annotations (:ref:`Annot`) on pages can be retrieved with the `page.annots()` me
- :meth:`Page.annots`


--------------------------



.. _The_Basics_Redacting:

Redacting content from a **PDF**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Redactions are a special type of annotation that marks an area of a document page for secure removal. Marking an area with a rectangle flags it for *redaction*; once the redaction is *applied*, the content in that area is securely removed.

For example, if we wanted to redact all instances of the name "Jane Doe" from a document, we could do the following:

.. raw:: html

<pre>
<code class="language-python" data-prismjs-copy="Copy">
import fitz

# Open the PDF document
doc = fitz.open('test.pdf')

# Iterate over each page of the document
for page in doc:
    # Find all instances of "Jane Doe" on the current page
    instances = page.search_for("Jane Doe")

    # Redact each instance of "Jane Doe" on the current page
    for inst in instances:
        page.add_redact_annot(inst)

    # Apply the redactions to the current page
    page.apply_redactions()

# Save the modified document
doc.save('redacted_document.pdf')

# Close the document
doc.close()
</code>
</pre>

Another example is redacting an area of a page while leaving any line art (i.e. vector graphics) within that area untouched. This is done by setting the `graphics` parameter as follows:


.. raw:: html

<pre>
<code class="language-python" data-prismjs-copy="Copy">
import fitz

# Open the PDF document
doc = fitz.open('test.pdf')

# Get the first page
page = doc[0]

# Define the area to redact
rect = [0, 0, 200, 200]

# Add a redaction annotation which will have a red fill color
page.add_redact_annot(rect, fill=(1, 0, 0))

# Apply the redactions to the current page, but ignore vector graphics
page.apply_redactions(graphics=0)

# Save the modified document
doc.save('redacted_document.pdf')

# Close the document
doc.close()
</code>
</pre>


.. warning::

Once a redacted version of a document is saved, the redacted content in the **PDF** is *irretrievable*.
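One way to sanity-check a saved file is to reopen it and search for the redacted text again. The sketch below assumes a document object that can be iterated page by page, as in the examples above; `leftover_hits` is a hypothetical helper name. It only looks at searchable text, so it says nothing about text that is part of an image.

```python
def leftover_hits(doc, needle):
    """List (page_index, hit_count) for pages where `needle` is still found."""
    leftovers = []
    for index, page in enumerate(doc):
        # search_for returns one rectangle per occurrence on the page
        hits = page.search_for(needle)
        if hits:
            leftovers.append((index, len(hits)))
    return leftovers
```

For the first example, reopening `redacted_document.pdf` and calling `leftover_hits(doc, "Jane Doe")` would be expected to give an empty list if every occurrence was found and redacted.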


.. note::

**Taking it further**

There are a few options for creating and applying redactions to a page. For full details of the parameters that control these options, refer to the API reference.

**API reference**

- :meth:`Page.add_redact_annot`

- :meth:`Page.apply_redactions`


--------------------------



.. _The Basics_Coverting_PDF_Documents:

Converting PDF Documents