UTF-8 Encoding #133

nim-odoo · 2017-11-28T15:05:52Z

Hello,

I am a bit confused by these lines: https://github.com/jseutter/ofxparse/blob/master/ofxparse/ofxparse.py#L85-L95
These are used to determine the headers of the file, and more particularly the encoding.

However, a typical header is:

<?xml version="1.0" encoding="UTF-8"?>
<?OFX OFXHEADER="200" VERSION="202" SECURITY="NONE" OLDFILEUID="NONE" NEWFILEUID="NONE"?>

This method won't parse the encoding="UTF-8" header. Therefore, the encoding will (always?) be considered as ASCII in https://github.com/jseutter/ofxparse/blob/master/ofxparse/ofxparse.py#L111-L117

You can imagine that if the OFX file contains non-ASCII characters, the reading will fail. As a workaround, I added encoding:UTF-8 on top of the file. But that seems odd...

Thanks
cc @qdp-odoo

The text was updated successfully, but these errors were encountered:

jseutter · 2017-11-29T02:55:22Z

I think you are correct. I don’t see any test cases that test this behavior.

Back when I started this library pretty much no producers were generating Ofx 2.2 and the code shows this. I wonder if nowadays it would be better to try parsing as xml first and if that errors out, switch to the sgml parsing. This would give us proper encoding detection for free for 2.2 and newer files.

At any rate, thanks for bringing this up. :)

-Jerry

alexis-via · 2021-05-17T21:52:37Z

I confirm that the bug is still present today in ofxparse 0.20 and in master.

My OFX files starts with

<?xml version="1.0" encoding="utf-8"?>
<?OFX OFXHEADER="200" VERSION="220" SECURITY="NONE" OLDFILEUID="NONE" NEWFILEUID="NONE"?>
<OFX>
  <SIGNONMSGSRSV1>

When I read the OFX 2.2.0 spec here https://www.ofx.net/downloads/OFX%202.2.pdf
it says in section 2.2 page 34-35:

"The standard XML declaration must come first. This Processing Instruction (PI) includes an option to specify the version of XML being used, and may include options to show such things as the encoding declaration, and the standalone status of the document."

"The OFX declaration must come next in the file. This PI identifies the contents as an Open Financial Exchange file and provides the version number of the Open Financial Exchange declaration itself (not the version number of the contents). The Open Financial Exchange PI contains the following attributes:
OFXHEADER
VERSION
SECURITY
OLDFILEUID
NEWFILEUID"

So it's clear that, with OFX 2.2, the encoding of the file must be read from the first line that should look like

<?xml version="1.0" encoding="UTF-8"?>

and not from the second line "<?OFX OFXHEADER=" that doesn't have an ENCODING attribute any more.

With the current version of ofxparse, my OFX file fails because it contains

<NAME>Direction Générale des Finances </NAME>

but, as the file is encoded in utf-8 but read as ascii by ofxparse, I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3636: ordinal not in range(128)

I'm ready to contribute to fix this bug. Any advice for the best way to fix it would be welcomed.

alexis-via · 2021-05-17T22:12:35Z

When looking at the test suite under the directory "tests/fixtures", I can find 4 files in OFX 2.x that start with

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

(anzcc.ofx, error_message.ofx, multiple_accounts2.ofx and multiple_accounts.ofx)
but they only contain ASCII caracters, there aren't any accentuated character in them. That's why the tests work.

The first problem is to fix the method read_headers() https://github.com/jseutter/ofxparse/blob/master/ofxparse/ofxparse.py#L86
On an OFX 2.x file, this second line of code of this method will set head_data as an empty byte ; so, at the end of this method, self.headers is an empty OrderedDict. Then, the method handle_encoding() uses self.headers to read the encoding and fallsback on ascii.

My ideas: the current methods read_headers() and handle_encoding() work well for OFX 1.x but not for OFX 2.x. Should we try to patch them to work on both, or should we have separate methods for OFX 2.x and use some code to detect if the file is OFX 1.x or 2.x ?

alexis-via · 2021-05-17T23:22:39Z

I made a first prototype here: https://people.via.ecp.fr/~alexis/ofxparse-proto-bug-133.diff

This patch only changes the method read_headers(): if the file starts with '<?xml ', I use lxml to parse the ofx file and get the encoding. @jseutter What do you think about this approach ?

jseutter · 2021-05-31T02:53:08Z

I'm okay with this approach. Although I try to avoid adding dependencies, I think it is warranted in this case. I haven't run your code but the way I read it, it should still work for 1.x files. Is this the case?

If you have time to add a test case that highlights the broken (and now fixed) behavior I would be grateful. If not, I'll try to add one myself.

alexis-via · 2021-05-31T16:39:01Z

@jseutter Thanks for your feedback. It should work with ofx v1 and v2. I'll prepare a PR with a test with a v2 file.

dev590t · 2023-07-06T21:47:34Z

the PR is merged to master? I have the same problem

alexis-via · 2023-07-06T22:23:56Z

You can try my patch: https://people.via.ecp.fr/~alexis/ofxparse-proto-bug-133.diff

dev590t · 2023-07-08T16:16:52Z

The header of my ofx file is:

OFXHEADER:100
DATA:OFXSGML
VERSION:102
SECURITY:NONE
ENCODING:UTF-8
CHARSET:NONE
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:NONE
<OFX>...

and ofxparser fails on é char.

Thank you @alexis-via, After patching your patch, that give me other errors. I'm not sure if it is from my ofx file or no.

After modify the header to:

OFXHEADER:100
DATA:OFXSGML
VERSION:102
SECURITY:NONE
ENCODING:USASCII
CHARSET:1252
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:NONE
<OFX>
...

The ofxparser runs without the patch.

Why don't submit your PR if the patch solve the issue for some case? The project seems unmaintained since 2021.

dev590t mentioned this issue Jul 6, 2023

Cannot process UTF-8 files with characters outside the 256 range #169

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UTF-8 Encoding #133

UTF-8 Encoding #133

nim-odoo commented Nov 28, 2017

jseutter commented Nov 29, 2017

alexis-via commented May 17, 2021

alexis-via commented May 17, 2021

alexis-via commented May 17, 2021

jseutter commented May 31, 2021

alexis-via commented May 31, 2021

dev590t commented Jul 6, 2023

alexis-via commented Jul 6, 2023

dev590t commented Jul 8, 2023 •

edited

Loading

UTF-8 Encoding #133

UTF-8 Encoding #133

Comments

nim-odoo commented Nov 28, 2017

jseutter commented Nov 29, 2017

alexis-via commented May 17, 2021

alexis-via commented May 17, 2021

alexis-via commented May 17, 2021

jseutter commented May 31, 2021

alexis-via commented May 31, 2021

dev590t commented Jul 6, 2023

alexis-via commented Jul 6, 2023

dev590t commented Jul 8, 2023 • edited Loading

dev590t commented Jul 8, 2023 •

edited

Loading