This repository shows the result of my recent attempt to remove an annoying watermark that the professor placed over each page of the slides. I didn't really want to print them along with the watermark.
Watermarks in PDFs can be applied mainly in two ways:
- As a separate overlay or image: The watermark is added as an overlay over the PDF content.
- Embedded within the content: The watermark is embedded directly into the text or image layers, making it harder to remove.
If the watermark is an overlay, it can potentially be removed using PDF editing tools like pdftk, qpdf, pdfcpu, or graphical tools like LibreOffice Draw.
Let's review some methods to remove the watermark.
First, check whether the following pdfcpu command does the job:
$ pdfcpu watermark remove [-p(ages) selectedPages] inFile [outFile]
One simple strategy is:
- Rasterise the PDF into JPG or PNG images, using one of the many online/offline tools,
- Sample the watermark colour with Paint, GIMP or similar tools,
- Use a tool like XnView to bulk replace the watermark colour with the background colour, with a small tolerance, for all the page images,
- Merge the images back into a PDF using one of the many online/offline tools.
Although this is a very simple technique, it only works if the watermark is placed behind the content, not as a semi-transparent stamp on top of the content. In the latter case, there wouldn't be a consistent colour for the watermark, and even increasing the tolerance would affect the PDF content.
Another drawback of this method is that we lose information when rasterising the PDF in the first step. This means, for example, that the text in the resulting PDF can no longer be selected or copied.
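The colour replacement step can be sketched in pure Python to show what "replace with a small tolerance" means. This is only an illustration on a nested list of RGB tuples, not what XnView does internally; the function name replace_colour and the sample values are made up for the example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def replace_colour(
    pixels: List[List[Pixel]],
    target: Pixel,
    background: Pixel,
    tolerance: int = 10,
) -> List[List[Pixel]]:
    """Return a copy of the image with near-target pixels set to background."""

    def is_close(pixel: Pixel) -> bool:
        # A pixel matches when every channel is within the tolerance.
        return all(abs(c - t) <= tolerance for c, t in zip(pixel, target))

    return [
        [background if is_close(pixel) else pixel for pixel in row]
        for row in pixels
    ]


# Tiny 1x3 "page": black text, light-grey watermark, white background.
page = [[(0, 0, 0), (200, 200, 200), (255, 255, 255)]]
cleaned = replace_colour(page, target=(205, 205, 205), background=(255, 255, 255))
print(cleaned)  # [[(0, 0, 0), (255, 255, 255), (255, 255, 255)]]
```

With a semi-transparent stamp on top of the content, the watermark pixels blend with whatever lies underneath, so no single target colour and tolerance would catch them without also hitting the content.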
If automated tools don't remove the watermark, you have to manually analyse the PDF content and identify the object responsible for the watermark.
First, uncompress the PDF so that its internal structure becomes readable as text, undoing Flate (zlib/deflate) and other stream compressions. This can be achieved with one of the following commands:
$ qpdf --qdf --object-streams=disable inFile [outFile]
$ pdftk inFile output outFile uncompress
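To give an idea of what "undoing Flate" means, here is a rough sketch that inflates Flate-encoded streams with Python's zlib module. It is not a replacement for qpdf or pdftk: it ignores the /Length entry, other filters and object streams, and the helper name inflate_streams and the sample object are assumptions made for the demo.

```python
import re
import zlib
from typing import List

# Matches the raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes: bytes) -> List[bytes]:
    """Return the decompressed content of every Flate-encoded stream found."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        inflater = zlib.decompressobj()
        try:
            # Bytes after the end of the zlib data (the trailing newline)
            # end up in inflater.unused_data and are ignored here.
            data = inflater.decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data, e.g. a stream using another filter
        if inflater.eof:  # keep only streams that decompressed completely
            results.append(data)
    return results


# Synthetic object built for the demo, not taken from a real file.
content = b"BT /F1 12 Tf (CONFIDENTIAL) Tj ET"
sample = (
    b"4 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(content)
    + b"\nendstream\nendobj\n"
)
print(inflate_streams(sample))  # [b'BT /F1 12 Tf (CONFIDENTIAL) Tj ET']
```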
A PDF object is structured as follows:
<object_id> 0 obj
<<
/ColorSpace 9 0 R
/ShadingType 2
...
>>
endobj
We need to identify the watermark obj and replace its content with the empty dictionary << >>.
This part is trial and error, but we are not working blind, as a few tricks help:
- Work on a 1-page sample, possibly a page without images or other large binary objects, to get a clearer view of the PDF structure and identify the watermark more quickly,
- The string "watermark" or the actual watermark content may appear in the uncompressed version of the PDF,
- A watermark usually covers most of the page, so its object tends to have a large width and height. In my case, the entries /Width 720 and /Height 540 identified the watermark.
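The /Width and /Height hint can be turned into a quick search. The sketch below assumes an already uncompressed PDF and lists every object that declares both entries, largest area first, so the likely watermark shows up at the top. The helper name sized_objects and the object numbers in the sample are invented for the example.

```python
import re
from typing import List, Tuple

# Splits an uncompressed PDF into "<id> <gen> obj ... endobj" blocks.
OBJ_RE = re.compile(rb"(\d+) (\d+) obj(.*?)endobj", re.DOTALL)
WIDTH_RE = re.compile(rb"/Width\s+(\d+)")
HEIGHT_RE = re.compile(rb"/Height\s+(\d+)")


def sized_objects(pdf_bytes: bytes) -> List[Tuple[int, int, int]]:
    """Return (object_id, width, height) tuples, largest area first."""
    found = []
    for match in OBJ_RE.finditer(pdf_bytes):
        body = match.group(3)
        width = WIDTH_RE.search(body)
        height = HEIGHT_RE.search(body)
        if width and height:
            found.append(
                (int(match.group(1)), int(width.group(1)), int(height.group(1)))
            )
    return sorted(found, key=lambda item: item[1] * item[2], reverse=True)


# Made-up sample: a small icon and a page-sized image.
sample = (
    b"7 0 obj\n<< /Subtype /Image /Width 32 /Height 32 >>\nendobj\n"
    b"12 0 obj\n<< /Subtype /Image /Width 720 /Height 540 >>\nendobj\n"
)
print(sized_objects(sample))  # [(12, 720, 540), (7, 32, 32)]
```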
In my case, the culprit was the following:
In my experience, if you edit the PDF with text editors like VS Code, nano or Mousepad, the file gets corrupted, probably because the editor guesses a text encoding for what is really a binary file and re-encodes it on save. The only text editor I found that doesn't change the file encoding was Vim.
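One plausible mechanism for this corruption can be reproduced in a few lines. The sketch assumes the editor decodes the file as UTF-8 and replaces invalid bytes, which may not be exactly what each editor does; the file name sample.pdf and the byte string are made up. Working on raw bytes avoids the problem because nothing is ever decoded.

```python
from pathlib import Path
import tempfile

# Many PDF producers write a comment of high-bit bytes right after the header;
# these four bytes are not valid UTF-8.
raw = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj\n<< /Width 720 >>\nendobj\n"

# What a text editor that guesses UTF-8 may effectively do on save.
via_utf8 = raw.decode("utf-8", errors="replace").encode("utf-8")
print(via_utf8 == raw)  # False: the binary bytes were replaced

# Patching the bytes directly never decodes anything, so only the intended
# replacement changes.
with tempfile.TemporaryDirectory() as tmp:
    pdf_path = Path(tmp) / "sample.pdf"  # hypothetical file name
    pdf_path.write_bytes(raw)
    patched = pdf_path.read_bytes().replace(b"/Width 720", b"/Width 0  ")
    pdf_path.write_bytes(patched)
    print(pdf_path.read_bytes()[:15] == raw[:15])  # True: header bytes intact
```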
You could either use Vim, sed or related tools to manually patch the PDFs, or use a Python script to automate the job.
I have attached two scripts to this repository:
- cleaner.py - Patches all the PDF files in the CWD. Before launching the script, make sure you have placed the entries for your watermark, found in the previous step, inside the watermark_features list. This script uses the regular expression <<[^>]+>> to identify all the PDF dictionaries, i.e. object arguments, and filters them by keeping only those having all the watermark features.
- unlocker.py - Unlocks all the permissions for all the PDF files in the CWD. This requires pdftk to work.
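The dictionary-matching idea described for cleaner.py can be sketched as follows. This is not the script from the repository, only a minimal illustration of the same approach; the function name blank_watermark_dicts and the sample objects are assumptions.

```python
import re
from typing import Iterable

# The regular expression mentioned above: "<<", then no ">" characters, then ">>".
DICT_RE = re.compile(rb"<<[^>]+>>")


def blank_watermark_dicts(pdf_bytes: bytes, features: Iterable[bytes]) -> bytes:
    """Replace every dictionary containing all the features with << >>."""
    features = list(features)

    def blank_if_match(match: re.Match) -> bytes:
        dictionary = match.group(0)
        if all(feature in dictionary for feature in features):
            return b"<< >>"
        return dictionary  # leave unrelated dictionaries untouched

    return DICT_RE.sub(blank_if_match, pdf_bytes)


watermark_features = [b"/Width 720", b"/Height 540"]

# Made-up sample: only object 12 carries both watermark features.
sample = (
    b"7 0 obj\n<< /Width 32 /Height 32 >>\nendobj\n"
    b"12 0 obj\n<< /Width 720 /Height 540 >>\nendobj\n"
)
print(blank_watermark_dicts(sample, watermark_features).decode("latin-1"))
```

Because the pattern stops at the first ">", a dictionary with nested dictionaries is only matched up to the end of its first inner dictionary, so the features have to appear in that part. Replacing a dictionary also changes the file length, which shifts the byte offsets recorded in the cross-reference table; many readers seem to cope with this, but it may explain why some files end up corrupted.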
If you've got this far, you should see some results. In my case, however, different PDF files had slightly different watermark objects, so I reran the script several times with different watermark_features values for those files.
I also happened to find a file where the watermark object couldn't be removed without corrupting the whole file. What I did was to set the /Width property to 0 instead of removing the object.
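A minimal sketch of this fallback is shown below. Padding the replacement with spaces so that the file keeps its byte length is my own addition, not something the original fix relied on: the cross-reference table stores byte offsets, so keeping the length constant avoids shifting them. The function name zero_width is hypothetical.

```python
import re


def zero_width(pdf_bytes: bytes, width: int) -> bytes:
    """Replace '/Width <width>' with '/Width 0', padded to the same length."""
    # (?!\d) prevents matching a longer number such as 7200 when width is 720.
    pattern = re.compile(rb"/Width\s+" + str(width).encode("ascii") + rb"(?!\d)")

    def same_length_zero(match: re.Match) -> bytes:
        # bytes.ljust pads with ASCII spaces, which PDF treats as whitespace.
        return b"/Width 0".ljust(len(match.group(0)))

    return pattern.sub(same_length_zero, pdf_bytes)


original = b"12 0 obj\n<< /Width 720 /Height 540 >>\nendobj\n"
patched = zero_width(original, 720)
print(patched)
print(len(patched) == len(original))  # True
```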