Support filling text fields & rendering corresponding annots accordingly #35

Open
fredericoschardong opened this issue Sep 15, 2021 · 25 comments
Labels
enhancement New feature or request

Comments

@fredericoschardong
Contributor

Hi Matthias,

I would like to kindly ask for your guidance once again. I have a similar objective as reported in #6. Let me explain.

I have a PDF whose entire content is signed with pyHanko. Now, I would like to: (i) add a few strings to the signed pdf by x/y coordinates; and (ii) sign only whatever was added after the first signature.

Regarding (i), after looking at the documentation and source code, there doesn't seem to be a readily available class/method for writing strings to the pdf. If that is indeed the case, what would be the straightforward way of doing so? What comes to mind is to mimic what is done in this method.

Regarding (ii) I am not sure where to start :-)

@fredericoschardong
Contributor Author

Let me clarify a little further.

> Each signature in a PDF can contain only a single signing certificate, but there can be as many signature dictionaries as one wishes in a PDF, each one with its own associated ByteRange.

This is quoted from the PAdES specification.

I understand that the usual flow is to sign the ByteRange encompassing the entire document, including any previous signatures. For instance:
[image: fH1ZE]

I would like to abuse the specification's lack of detail regarding the ByteRange. As it stands, one can assume that the ByteRange can be anything. In my case, it should encompass only what was added (and perhaps the previous signature). Hopefully I am interpreting the specification correctly.

Thanks!

@MatthiasValvekens
Owner

MatthiasValvekens commented Sep 15, 2021

Hi Frederico, thank you for your interest in this project!

Let me first address your question about inserting text. Text processing in PDF is complicated, and there's no simple way to "just add some strings", unfortunately. Adding text to a PDF file involves making many choices, managing (potentially several) font resources, handling glyph positioning etc. PyHanko has the facilities to do (most) of that, but those APIs are pretty low-level and require some knowledge of the PDF spec to use. After all, pyHanko is not a general-purpose PDF manipulation library. But if you tell me what kind of thing you want to typeset, I might be able to help you along.

As for the problem of signing an update only: you're right that ByteRange can theoretically span any range of bytes. But that's arguably a design error in the specification, and all decent validators will reject signatures where the ByteRange doesn't conform to expectations. In fact, in ISO 32000-2, messing with ByteRanges is explicitly banned in PAdES signatures, and discouraged in general.
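
To make "conforms to expectations" concrete, here is a minimal sketch of the kind of check a validator might apply to a ByteRange: it has to start at offset 0, leave exactly one gap (where the signature's own Contents value sits), and run to the end of the signed revision. The function name and the exact rule set are assumptions for illustration only; real validators, pyHanko included, check more than this.

```python
from typing import Sequence


def byte_range_covers_revision(byte_range: Sequence[int], revision_length: int) -> bool:
    """Return True if a /ByteRange of the form [o1, l1, o2, l2] covers the
    whole signed revision except for a single gap (the /Contents value)."""
    if len(byte_range) != 4:
        return False
    o1, l1, o2, l2 = byte_range
    # The first segment must start at the beginning of the file.
    if o1 != 0 or l1 <= 0 or l2 <= 0:
        return False
    # The second segment must start after the first one ends; the gap in
    # between is where the signature bytes themselves are stored.
    if o2 <= o1 + l1:
        return False
    # The second segment must run to the very end of the signed revision.
    return o2 + l2 == revision_length


# Illustrative numbers only: a 1200-byte revision with the gap at 840..960.
print(byte_range_covers_revision([0, 840, 960, 240], 1200))
# A range that only covers "what was added" fails the check.
print(byte_range_covers_revision([840, 100, 960, 240], 1200))
```

A ByteRange that only spans the newly added bytes fails the first condition, which is why the partial-signature idea does not survive validation.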

What's the problem that you're trying to solve here, if I may ask? There might be a more conventional way to accomplish what you want :)

@fredericoschardong
Contributor Author

> Hi Frederico, thank you for your interest in this project!
>
> Let me first address your question about inserting text. Text processing in PDF is complicated, and there's no simple way to "just add some strings", unfortunately. Adding text to a PDF file involves making many choices, managing (potentially several) font resources, handling glyph positioning etc. PyHanko has the facilities to do (most) of that, but those APIs are pretty low-level and require some knowledge of the PDF spec to use. After all, pyHanko is not a general-purpose PDF manipulation library. But if you tell me what kind of thing you want to typeset, I might be able to help you along.

There are no hard constraints for the typesetting. Currently, I am using reportlab's drawString with a single font for the entire thing, namely setFont("Times-Roman", 12).

> As for the problem of signing an update only: you're right that ByteRange can theoretically span any range of bytes. But that's arguably a design error in the specification, and all decent validators will reject signatures where the ByteRange doesn't conform to expectations. In fact, in ISO 32000-2, messing with ByteRanges is explicitly banned in PAdES signatures, and discouraged in general.

Could you please point me to where in the ISO that prohibition is? I only have access to a Portuguese translation of ISO 32000-1 at the moment, but should get my hands on a copy of ISO 32000-2 shortly.

> What's the problem that you're trying to solve here, if I may ask? There might be a more conventional way to accomplish what you want :)

We have a large (few megabytes) single-page template PDF file, which is digitally signed with pyHanko. Then, we would like to fill the blanks in this template and digitally sign the added phrases. We hope to store the byte-range with the strings and signature bytes for millions of documents and re-create the final PDF file (template with signature + byte-range with strings and signature bytes regarding the byte-range with strings) on-demand. Our hope with this approach is to store a few bytes or Kbytes (byte-range + signature) instead of a few Mbytes (signed template + signed byte-range) per document.

@MatthiasValvekens
Owner

MatthiasValvekens commented Sep 16, 2021 via email

@fredericoschardong
Contributor Author

fredericoschardong commented Sep 16, 2021

> Anyway, allowing partial ByteRanges has security implications, so most validators don’t allow you to get away with that, regardless of whether the specification would theoretically permit it or not.

I see.

>   • Sign the filled template using an incremental update in the usual way, but discard the “common” part when writing to disk. When retrieving the file later, you can simply concatenate the streams.

I haven't thought about this. Would it just be a straightforward binary concatenation?

Nonetheless, if we disregard the partial ByteRange thing, we could sign the entire ByteRange of the signed template with our incremental updates and discard the signed template for storage. Thanks for the idea.

>   • Given the above, do you still need the template portion to be signed separately?

My beef with forms is the aesthetics of the form fields. Could the looks of form filling possibly be changed so that the filled form fields look like strings on the PDF? Moreover, how does a signature over form data work? From your explanation, I understand that only whatever is filled is signed, i.e., not the entire PDF document. If I got it right, then is there a considerable size difference between template filling using direct page content modification vs form filling? Since our goal is to reduce storage, perhaps we can live with ugly form-filled PDFs if they require just a few bytes of storage.

@MatthiasValvekens
Owner

> I haven't thought about this. Would it just be a straightforward binary concatenation?

Yes. Incremental updates work by simply appending the updated objects + an updated xref table to the end of the base file.
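
As a rough sketch of what that means for storage: since the signed file is the template bytes followed by the appended update, you can keep only the tail and glue it back on later. The function names below are made up for illustration, and the byte strings in the example are placeholders rather than real PDF data.

```python
def extract_update(template: bytes, signed: bytes) -> bytes:
    """Return only the bytes that the incremental update appended."""
    if not signed.startswith(template):
        raise ValueError("signed file is not an incremental update of the template")
    return signed[len(template):]


def rebuild(template: bytes, update: bytes) -> bytes:
    """Recreate the full signed file by plain binary concatenation."""
    return template + update


# Placeholder data standing in for a large template and a small update.
template = b"%PDF-1.7 ...template objects... %%EOF\n"
signed = template + b"...updated objects, xref, trailer... %%EOF\n"

update = extract_update(template, signed)
assert rebuild(template, update) == signed
```

The stored update has to be paired with the exact same template bytes it was created from, since the offsets in the appended xref table refer to positions in the combined file.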

> Could the looks of form filling possibly be changed so that the filled form fields look like strings on the pdf?

I'm not 100% sure to what degree this is implementation dependent, but if the last signature disables form field editing, I think it should be fine. Reason being that form field widgets (like pretty much all annotations) have an appearance stream (mandatory in PDF 2.0!) that can contain arbitrary graphics/text, etc. Typically, a viewer would only need to regenerate a form field appearance when editing form fields, so while the viewer might highlight the form field content in some way, you can choose how you want to render the actual field contents.
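
To give an idea of what such an appearance stream can contain, here is a small sketch that builds the content for a single line of text in one font. Everything about it is an assumption for illustration: the function names, the /F1 resource name (which would have to be mapped to a real font such as Times-Roman in the appearance's resource dictionary), the offsets, and the restriction to text that fits a single-byte encoding.

```python
def escape_pdf_string(text: str) -> bytes:
    """Escape a string for use as a PDF literal string, i.e. inside (...)."""
    raw = text.encode("latin-1")  # assumption: single-byte text only
    out = bytearray()
    for byte in raw:
        # Backslashes and parentheses must be escaped in literal strings.
        if byte in b"\\()":
            out += b"\\"
        out.append(byte)
    return bytes(out)


def text_appearance_stream(text: str, font_resource: str = "F1",
                           size: float = 12, x: float = 2, y: float = 4) -> bytes:
    """Build content stream operators that draw one line of text."""
    return b"BT /%s %g Tf %g %g Td (%s) Tj ET" % (
        font_resource.encode("ascii"), size, x, y, escape_pdf_string(text),
    )


print(text_appearance_stream("Jane (Doe)"))
```

Nothing in these operators marks the text as a form field, which is why the filled value can look like ordinary text on the page.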

> Moreover, how does a signature over form data work? From your explanation, I understand that only whatever is filled is signed, i.e., not the entire pdf document.

No. In a typical situation, the PDF processor that fills in the form updates the form field values using a regular incremental update (or a full save, depends on context) to override the actual form field objects. The digest for the new signature is computed over the entire file. Again, signing only parts of a PDF isn't really a thing, everything is done with incremental updates. Policing what is and isn't allowed in an incremental update to a signed document is the job of the signature validator. In fact, pyHanko implements a version of that too: see here.
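
As a simplified illustration of "the digest is computed over the entire file": the signer hashes every byte listed in the signature's ByteRange, which in a conforming signature is everything except the gap that holds the signature itself. The helper name and the toy data are assumptions; a real signer does this while writing the file and then wraps the digest in a CMS structure.

```python
import hashlib
from typing import Sequence


def digest_over_byte_range(pdf_bytes: bytes, byte_range: Sequence[int],
                           algorithm: str = "sha256") -> bytes:
    """Hash the segments named by a /ByteRange array [o1, l1, o2, l2, ...]."""
    md = hashlib.new(algorithm)
    # The array is a flat list of (offset, length) pairs.
    for offset, length in zip(byte_range[0::2], byte_range[1::2]):
        md.update(pdf_bytes[offset:offset + length])
    return md.digest()


# Toy data: 10 bytes, a 5-byte gap for the signature, then 10 more bytes.
data = b"A" * 10 + b"<SIG>" + b"B" * 10
digest = digest_over_byte_range(data, [0, 10, 15, 10])
assert digest == hashlib.sha256(b"A" * 10 + b"B" * 10).digest()
```

Because the earlier revisions are part of the hashed bytes, a later signature also covers the template and any previous signatures.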

> If I got it right, then is there a considerable size difference between template filling using direct page content modification vs form filling?

Not really. It's on the order of a few KBs at most, with the form-based ones taking up a tiny bit more space because they need to deal with form field objects as well as the new rendered text.

Anyway, you can use direct text editing to fill out your template in an incremental update if and only if the template itself isn't signed. Direct page content modification (even in an incremental update!) will cause validators (including Acrobat and pyHanko itself) to invalidate all earlier signatures during incremental update validation. Form field updates are typically permitted by default (unless the signature on the base file explicitly disallows them).

Complicating matters further: there's no clear standard (yet) defining what is and is not allowed in a post-signing incremental update, but page content edits are rejected by pretty much all validators that do incremental update diffs.


So, long story short:

  • If you don't require a signature on the template, but only on the final filled version, feel free to use direct page content edits if you prefer that.
  • If you require a signature on the template and on the filled version, you have no choice: using forms is effectively required.

In either case: both methods would use PDF incremental updates, so the trick of not writing the "common" template part to disk when creating the final signature works just fine either way.
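
A sketch of how that could look with pyHanko, assuming the template bytes are already in memory and that `signer` is a pyHanko signer object (for example one created with `signers.SimpleSigner.load(...)`). The function name and the default field name are hypothetical, the form filling step is left as a comment, and the pyHanko calls should be checked against the version you have installed.

```python
import io


def sign_and_extract_update(template_bytes: bytes, signer,
                            field_name: str = "Signature1") -> bytes:
    """Sign in an incremental update and return only the appended bytes."""
    # Imported here so the sketch can be loaded without pyHanko installed.
    from pyhanko.pdf_utils.incremental_writer import IncrementalPdfFileWriter
    from pyhanko.sign import signers

    writer = IncrementalPdfFileWriter(io.BytesIO(template_bytes))
    # ... fill in the form fields on `writer` here ...
    out = signers.sign_pdf(
        writer,
        signers.PdfSignatureMetadata(field_name=field_name),
        signer=signer,
    )
    signed = out.getvalue()
    # An incremental update leaves the original bytes untouched at the front.
    if not signed.startswith(template_bytes):
        raise ValueError("output does not start with the template bytes")
    return signed[len(template_bytes):]
```

Only the returned bytes need to be stored per document; the full file is the template followed by those bytes.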

@MatthiasValvekens MatthiasValvekens added the question Further information is requested label Sep 17, 2021
@fredericoschardong
Contributor Author

Dear Matthias,

Thank you! I will have to use forms then. Since pyHanko does not support form filling, do you have any recommendations for a Python library for that? Actually, anything that runs on Linux and that I could call from a shell would do the job.

@MatthiasValvekens
Owner

That's a good question... I believe there are a couple of Python libs out there capable of doing form filling, but the problem is in finding one that supports writing incremental updates (because otherwise our little scheme would fall apart). Outside the Python space, you certainly have options (iText, PDFBox, ...), but I'm not aware of any "batteries included" solution in Python.

Now, since signature fields are a special kind of form field, pyHanko does handle form fields at some basic level. Odds are that you can recycle some of that to handle text fields as well, but you'd still have to implement quite a few things yourself. Regardless, this is probably a good starting point.

Sorry that I don't have a better answer :(


Anyway, I'll add text field filling to my backlog. It's arguably a little out of scope, but it's probably not super hard to implement given what's there already (at least, I think I know what to do). I can't promise when I'll get to it, though :)

@fredericoschardong
Contributor Author

Sorry for the late reply.

> Anyway, I'll add text field filling to my backlog. It's arguably a little out of scope, but it's probably not super hard to implement given what's there already (at least, I think I know what to do). I can't promise when I'll get to it, though :)

That would be amazing! I will be the first one to test it :-)

@MatthiasValvekens MatthiasValvekens added enhancement New feature or request and removed question Further information is requested labels Oct 1, 2021
@MatthiasValvekens MatthiasValvekens changed the title from "Adding strings to pdf and signing them" to "Support filling text fields & rendering corresponding annots accordingly" Oct 1, 2021
@MatthiasValvekens
Owner

Alright. It's on my TODO list, and I've reworded/reclassified this issue accordingly. Thanks!

@stale

stale bot commented Nov 30, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions!

@stale stale bot added the stale label Nov 30, 2021
@MatthiasValvekens
Owner

Nope, not stale, I still plan to work on this (actually we probably don't need the stale bot anymore now that we have a discussion forum... I should do something about that at some point).

@stale stale bot closed this as completed Aug 13, 2022
@fredericoschardong
Contributor Author

fredericoschardong commented Aug 14, 2022 via email

@MatthiasValvekens
Owner

It is, it's just that there's always something more urgent to work on ;).

(Next time the stale bot complains, feel free to leave a comment to show you're still interested. That's basically the only reason why I have it active on enhancement requests. I should probably increase the timeouts, though...)

@stale stale bot removed the stale label Aug 14, 2022
@alex-eri

I have a document with two parts on one page. The first part is signed when the car leaves the lot: template + fuel level + odometer value. The second part is signed when the car arrives back: first part + new values.

I can do versioning like #35 (comment)

If I can't, I will put the "new values" inside the signature widget as a hack))

@unloder

unloder commented May 10, 2024

Hi @MatthiasValvekens, first of all, marvelous work on the pyHanko module! It has been a godsend for our project that greatly relies on digital signatures.

I am also interested in anything like the system described in this comment: #35 (comment)

We are trying to create a system where multiple signers can each make some incremental changes to a PDF, like adding text or putting check-marks, and then sign the file with their own digital signature, via incremental updates. Overall this is similar to what DocuSign does, but as a part of our project and integrated with other functionality. We do have the option to just make all of the changes and put our own digital signature at the end, but we are still experimenting and exploring the possibilities of giving each signer his own digital signature.

As I understand it, this can be done with PDF forms, but we are not using built-in PDF forms in favor of our own forms module that we use in the other parts of the project, so first we are trying to do this by updating the pages of the PDF directly, using pyHanko and PyPDF2.

  1. If we try stacking digital signatures on top of each other, without adding any other changes to the pdf, it works as expected, and all signatures show up as valid in Adobe.

  2. If we add anything via pdf_content.add_to_page and IncrementalPdfFileWriter.write_in_place, all signatures except the last one fail Adobe's validation, but Adobe does show the versions and changes done to the file correctly for each preceding signature.

  3. If we use the "hack" @alex-eri mentioned and add the same changes as a signature stamp for each digital signature, making the stamp cover the whole page, Adobe does read all signatures as valid and the versions as correct. But this looks to be a very sketchy approach, and it also adds a border and overlay which does not look as intended. We want to achieve something similar to this, but writing the changes separately and rendering the stamp visually separately on the page.

I understand that this is not the intended use-case for this module, but judging by how well Adobe handles case 3, and to some extent case 2, it "feels" like there could be a way to achieve this correctly.

Maybe there is now a more straightforward way to achieve something similar to this?

Or maybe @alex-eri has had some success in achieving the desired outcome?

Edit: I understand how this works a bit better now after some experimentation and reading up, stamps are added as Form XObjects, and the border around the stamp can be removed. The only thing that remains is that the xobject overlays the whole page, changes the cursor, and prevents the text behind it from being selected. Is this the inherent behavior of Form XObjects or can this be changed somehow?

@MatthiasValvekens
Owner

Hi @unloder,

Good points, let me clarify some things.

> If we add anything via pdf_content.add_to_page and IncrementalPdfFileWriter.write_in_place, this does not pass the adobe's validation for the signatures except the last one, but adobe does show the versions and changes done to the file correctly for each preceding signature.

This is expected, and will trip any validator that checks incremental changes. The reason why is boring, but instructive: PDF's graphics model is a page description language at heart, and the link between PDF graphics operators and what a human sees is not always 100% clear. In the general case, it is practically impossible to distinguish between "additions" and any other type of change to direct page content. Clearly, nobody wants to accept arbitrary changes to page content, so because of the nature of the PDF graphics model, they end up having to reject all page content changes.

As you note, using annotations (in particular, form fields) partially circumvents this. That's because annotations live "outside" the page content, so it's easier to treat them separately (this is not without its own set of risks, though: see https://pdf-insecurity.org/#attacks-on-pdf-certification-may-2021; and also https://itextpdf.com/blog/itext-news-technical-notes/attacks-pdf-certification-and-what-you-can-do-about-them). Signers can (to a degree) influence this behaviour by setting a DocMDP level.
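
For reference, the DocMDP level is a number from 1 to 3. The sketch below models what each level is generally understood to permit; the enum, the change categories and the function are made up for illustration and are not a validator. What a given validator actually accepts at each level varies, as discussed above.

```python
from enum import IntEnum


class DocMDPLevel(IntEnum):
    """The three DocMDP permission values defined in ISO 32000."""
    NO_CHANGES = 1  # no changes to the document are permitted
    FILL_FORMS = 2  # form filling, page template instantiation, signing
    ANNOTATE = 3    # everything in level 2, plus annotation changes


# Hypothetical change categories, each mapped to the lowest level allowing it.
_MIN_LEVEL = {
    "fill_form": DocMDPLevel.FILL_FORMS,
    "sign": DocMDPLevel.FILL_FORMS,
    "annotate": DocMDPLevel.ANNOTATE,
}


def change_permitted(level: DocMDPLevel, change: str) -> bool:
    """Return True if this simplified model allows the change at this level."""
    # Anything not listed, such as "page_content", is never permitted.
    required = _MIN_LEVEL.get(change)
    return required is not None and level >= required


print(change_permitted(DocMDPLevel.FILL_FORMS, "fill_form"))
print(change_permitted(DocMDPLevel.ANNOTATE, "page_content"))
```

Note that there is no level at which direct page content edits become acceptable, which matches the behaviour described for case 2.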

While we can discuss the merits of allowing/disallowing this kind of thing in a validator until the cows come home, it's a fact of life that this part of the PDF spec is extremely vague, and there's no reason to believe that that will change in the near future. Basically everyone in the industry agrees that the current situation is crappy, but there's no real consensus on how to fix things.... :)

On the generating side of things: pyHanko currently only supports stamping directly on the page (which breaks prior signatures) or generating signature appearances. It's in principle straightforward to allow it to fill in text fields as well, but there are some sharp edges there (relating to font handling and the way PDF deals with "variable text"). This makes it a tough feature to land, since I really don't have the bandwidth anymore to do systematic tests with different viewers etc. to ensure that they all consume pyHanko's files as intended. Nonetheless, I put some experimental code on this branch: https://github.com/MatthiasValvekens/pyHanko/tree/feature/basic-form-filling. Feel free to give that fill_text_field implementation a spin. Fair warning: the code breaks the variable text support on purpose, so editing the output could produce unexpected results.

Extending this to support things like checkboxes is possible but will involve more work (because checkboxes are a bit special in PDF forms). In principle, supporting annotations is also possible, but that will take significantly more time to land due to the sheer number of possible combinations.

To this point:

> Edit: I understand how this works a bit better now after some experimentation and reading up, stamps are added as Form XObjects, and the border around the stamp can be removed. The only thing that remains is that the xobject overlays the whole page, changes the cursor, and prevents the text behind it from being selected. Is this the inherent behavior of Form XObjects or can this be changed somehow?

This is not a property of form XObjects, no. Most of the observations you make are actually more about annotations vs. page content. If you include a form XObject directly on the page, it will behave (modulo transformations & resource handling) as if you'd directly injected the graphics stream into the page.
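
To illustrate, painting a form XObject from page content takes just a few operators: save the graphics state, set a transformation, invoke the XObject by its resource name, and restore the state. The helper below is a hypothetical sketch that only produces those operators; the /Stamp name is an assumption and would have to exist in the page's XObject resources.

```python
def invoke_form_xobject(resource_name: str, x: float, y: float,
                        scale: float = 1.0) -> bytes:
    """Build operators that paint a form XObject at (x, y) on a page."""
    # q/Q save and restore the graphics state, cm sets the transformation
    # matrix (uniform scale plus translation), and Do paints the XObject.
    return b"q %g 0 0 %g %g %g cm /%s Do Q" % (
        scale, scale, x, y, resource_name.encode("ascii"),
    )


print(invoke_form_xobject("Stamp", 72, 700))
```

Appending these operators to a page's content is a page content edit like any other, so it is subject to the same validation problem as case 2.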

Hope that helps.

@unloder

unloder commented May 13, 2024

Hi @MatthiasValvekens,
Thank you very much for such a quick and detailed response. I will need to do more research about the underlying structure of pdfs.
