BUG: NameObject doesn't handle all values correctly #2601
@Rak424 The 8-bit-only encoding will be done with Latin-1.
Hey @Rak424 and @pubpub-zz, we've been running the benchmarks on this branch, and it seems these changes significantly improve performance. To obtain those results, we installed CodSpeed on a fork synced with this repo. You can have a look at the full report here.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2601      +/-   ##
==========================================
- Coverage   94.92%   94.92%   -0.01%
==========================================
  Files          50       50
  Lines        8316     8309       -7
  Branches     1667     1665       -2
==========================================
- Hits         7894     7887       -7
  Misses        261      261
  Partials      161      161
☔ View full report in Codecov by Sentry.
If I read the detailed results correctly, this is a 4% difference on 1s. Can we really consider this a "significant improvement"?
The title indicates that this PR deals with incorrect handling. My concern is getting the correct result and preventing regressions.
Sorry for not explaining directly what I've done. I've seen several problems to solve in that part of the code, and some cases were clearly handled badly. Here is a simple example, but not the only one:
The way it was done was not optimal, so I decided to try to refactor it, and I'm not surprised by the performance improvement. A correct implementation should be able to do something like this, since according to the spec no encoding is defined on the Name object itself, but I know that change is not easy to make:
utf-8 is only for the specific case of creating a name from text, and it cannot handle all possible values, so I've added a fallback to latin1, which handles every value in one byte. I often need a few days or weeks and multiple tries to reach an optimal solution, so tell me if you see any problems or changes to make.
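To make the byte-coverage argument concrete, here is a minimal sketch using only the Python standard library. It is not pypdf code, and the helper name `decodable_single_bytes` is made up for illustration. It counts how many single byte values each codec accepts on its own and checks that latin-1 round-trips every byte value.

```python
def decodable_single_bytes(codec: str) -> int:
    """Count the byte values 0..255 that decode on their own with `codec`."""
    count = 0
    for value in range(256):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            continue
        count += 1
    return count


utf8_count = decodable_single_bytes("utf-8")      # only the ASCII range decodes
latin1_count = decodable_single_bytes("latin-1")  # every byte value decodes

# latin-1 maps byte N to code point N, so decoding then encoding is lossless.
all_bytes = bytes(range(256))
round_trip = all_bytes.decode("latin-1").encode("latin-1")

print(utf8_count, latin1_count, round_trip == all_bytes)  # 128 256 True
```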
For me, there is no error.
Remember that the current code provides, with NameObject.CHARSETS, a way to add or remove charsets from the possible solutions, and the default, based on other PRs, tries utf-8, gbk and then latin-1. All these cases come from actual PDFs. Have you experienced any actual issues with some PDF?
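As a rough illustration of the ordered fallback described here, the sketch below tries each charset in turn and reports which one succeeded. The `CHARSETS` tuple and `decode_name` function are hypothetical stand-ins that only mirror the order mentioned in this comment; they are not pypdf's actual implementation.

```python
from typing import Sequence, Tuple

# Hypothetical stand-in for the charset order described above; not pypdf code.
CHARSETS = ("utf-8", "gbk", "latin-1")


def decode_name(raw: bytes, charsets: Sequence[str] = CHARSETS) -> Tuple[str, str]:
    """Return (text, charset) for the first charset that can decode `raw`."""
    for charset in charsets:
        try:
            return raw.decode(charset), charset
        except UnicodeDecodeError:
            continue
    raise ValueError(f"no charset could decode {raw!r}")


print(decode_name(b"Type"))          # plain ASCII is already valid UTF-8
print(decode_name(b"\xc4\xe3"))      # invalid as UTF-8, valid as GBK
print(decode_name(b"DocuSign\xae"))  # only latin-1 accepts the lone 0xAE byte
```

Because latin-1 accepts every byte sequence, it has to be last in the list: anything placed after it would never be tried.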
You have to encode it with utf8, but utf8 doesn't allow all characters from 1 to 255 to be used as single bytes, and the current implementation of the name object only accepts str, not bytes, in its constructor.
NameObjects are actually string-like: the proof is the use of the encoding with the '#' character. The byte encoding is just a matter of file encoding.
So my workaround is to fall back to an encoding that allows all values in one byte; latin1 seems a good choice to me.
I don't know why gbk is present here, but it seems useless.
This fix was introduced years ago. I'm trying some archaeology to find it.
I don't have an example to show, but the current implementation doesn't respect the specification. I'm not talking about handling malformed PDFs. Don't hesitate to tell me if I've misunderstood something; the PDF format is very, very, very bad.
I agree that the current implementation does not match the specification: The historical approach in pypdf is more permissive, and as we are rewriting all the names with PdfWriter, there is no issue with readers.
@MartinThoma / @MasterOdin / @stefan6419846
I have to admit that this is hard for me to follow. If I understand it correctly, there are different interpretations of the PDF specs regarding name objects. In my understanding, names are just an arbitrary sequence of bytes. An application has to ensure consistent handling of them for matching purposes, but as they are mostly irrelevant to the user and primarily about the internal structure, there are no further rules except for predefined sequences, for example as part of dictionary keys. The newly introduced test sequences from ISO 32000-2:2020, Table 4 pass with the old implementation as well, so I do not see any real need to change anything here unless there is a real example where the current implementation fails and the new one does not.
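For readers following along, the '#' escaping of names as bytes can be sketched as below. This is a standalone illustration under my reading of the spec, not pypdf's code: `escape_name` and `unescape_name` are hypothetical helpers, and the null byte (not allowed in a name) is not checked.

```python
import re

# Bytes written as #XX here: '#', the PDF delimiter characters, and anything
# outside the printable range 0x21..0x7E (which also covers whitespace).
_DELIMITERS = b"()<>[]{}/%"


def escape_name(raw: bytes) -> bytes:
    """Serialize raw name bytes as a PDF name token with a leading solidus."""
    out = bytearray(b"/")
    for byte in raw:
        if byte < 0x21 or byte > 0x7E or byte == 0x23 or byte in _DELIMITERS:
            out += b"#%02X" % byte
        else:
            out.append(byte)
    return bytes(out)


def unescape_name(token: bytes) -> bytes:
    """Drop the leading solidus and turn each #XX back into the byte it encodes."""
    body = token[1:] if token.startswith(b"/") else token
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda match: bytes([int(match.group(1), 16)]),
        body,
    )


print(unescape_name(b"/Lime#20Green"))  # b'Lime Green'
print(escape_name(b"DocuSign\xae"))     # b'/DocuSign#AE'
```

Working on bytes at this level needs no charset at all; a charset only comes into play when the name has to be shown as, or built from, text.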
To be clearer, here are two examples that can lead to hard-to-detect problems:
@Rak424 Can you provide an actual PDF and code that shows an issue, rather than only unit-level cases?
Hi, about example 2, from the spec:
It's not a huge problem, since it's easy for any parser to handle, but it could be a problem with an extremely strict parser. The main problem is with example 1: So, it means that if you append a PDF, these kinds of values will be modified. On a signed PDF where you add revocation information, this creates a modification even if you don't modify the document.
@Rak424 If we change the code to keep the bytes internally, b'DocuSign\xae' will be stored, but without changing the arguments, NameObject('DocuSign®') will internally become b'DocuSign\xc2\xae', and matching will no longer work. Up to now, you have not reported any actual PDFs that fail. Without one, I would consider your report an acceptable deviation of pypdf from the PDF specification.
Do you understand that b'DocuSign\xc2\xae' is not b'DocuSign\xae'?
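The difference between the two byte strings can be reproduced with plain Python, independent of pypdf. This is only a sketch of the encoding behaviour under discussion; the variable names are illustrative.

```python
raw = b"DocuSign\xae"               # name bytes as they might appear in a file

text = raw.decode("latin-1")        # 'DocuSign®'
as_utf8 = text.encode("utf-8")      # what a UTF-8 based writer would emit
as_latin1 = text.encode("latin-1")  # what a latin-1 based writer would emit

print(as_utf8)    # b'DocuSign\xc2\xae'
print(as_latin1)  # b'DocuSign\xae'
print(as_utf8 == raw, as_latin1 == raw)  # False True
```

Only the latin-1 round trip reproduces the original bytes; the UTF-8 path adds a 0xC2 byte, which is the modification being discussed for appended or signed documents.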
I do, if you are aiming for full compliance with the spec. However, as said, I do not consider that both can be met at the same time. Also, we need a way to build a NameObject from an
Please provide a PDF file showing an issue.
A name object can contain any character from 1 to 255, so it's not correct to use utf-8.