Special characters in GEF-file raise UnicodeDecodeError #6

Open
martijnkriebel opened this issue Aug 5, 2022 · 4 comments

Comments

@martijnkriebel

Dutch GEF-files may contain special characters, for example the umlaut in the word "coördinatensysteem". This raises the UnicodeDecodeError below when parsing the file, which traces back to codecs.py. Replacing the "ö" with a regular "o" solves the issue.

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Input In [19], in <cell line: 5>()

      4 cpt_gef = GefCpt()
----> 5 cpt_gef.read(path)
      6 cpt_gef.coordinates

File c:\ProgramData\Anaconda3\envs\geolib_new\lib\site-packages\geolib_plus\cpt_base_model.py:220, in AbstractCPT.read(self, filepath)
    217     raise FileNotFoundError(filepath)
    219 cpt_reader = self.get_cpt_reader()
--> 220 cpt_data = cpt_reader.read_file(filepath)
    221 for cpt_key, cpt_value in cpt_data.items():
    222     setattr(self, cpt_key, cpt_value)

File c:\ProgramData\Anaconda3\envs\geolib_new\lib\site-packages\geolib_plus\gef_cpt\gef_file_reader.py:165, in GefFileReader.read_file(self, filepath)
    164 def read_file(self, filepath: Path) -> dict:
--> 165     return self.read_gef(gef_file=filepath)

File c:\ProgramData\Anaconda3\envs\geolib_new\lib\site-packages\geolib_plus\gef_cpt\gef_file_reader.py:174, in GefFileReader.read_gef(self, gef_file, fct_a)
    172 # read gef file
    173 with open(gef_file, "r") as f:
--> 174     data = f.readlines()
    176 # search NAP
    177 idx_nap = GefFileReader.get_line_index_from_data_starts_with(
    178     code_string=r"#ZID=", data=data
    179 )

File c:\ProgramData\Anaconda3\envs\geolib_new\lib\codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
    319 def decode(self, input, final=False):
    320     # decode input (taking the buffer into account)
    321     data = self.buffer + input
--> 322     (result, consumed) = self._buffer_decode(data, self.errors, final)
    323     # keep undecoded input until the next call
    324     self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 626: invalid start byte
@EleniSmyrniou
Collaborator

EleniSmyrniou commented Aug 26, 2022

I am not sure whether a GEF file with Dutch characters would work. For GEF reading we use the fields described in the "Geotechnical exchange format for cpt-data".
GEF-CPT.pdf
If you attach the original GEF file, I can take a closer look.

@ghost

ghost commented Dec 23, 2022

Hi Martijn,

The CUR standard clearly states that a GEF file should contain only characters from the ASCII character set (128 characters).

The GEF file is parsed as UTF-8, the most widely used encoding on the web, which can represent characters from all languages. For compatibility, the original 128 ASCII characters map to the same bytes in UTF-8.

Your GEF file is probably encoded in cp1252 (ANSI), an extension that adds extra characters used in Western European languages. Unfortunately, these special characters map to different bytes in UTF-8 and cp1252, because cp1252 is a single-byte encoding and UTF-8 is a multi-byte encoding. In fact, the byte for ö in windows-1252 (0xf6) never occurs in valid UTF-8. That is what causes the problem; otherwise you would just get the wrong character instead of an error.
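
To make the byte-level difference concrete, here is a small sketch (the variable names are illustrative only, not part of GEOLIB+) that encodes the word from the issue in both encodings and then tries to decode the cp1252 bytes as UTF-8:

```python
# The word from the issue, written with an escape for o-umlaut (U+00F6).
text = "co\u00f6rdinatensysteem"

cp1252_bytes = text.encode("cp1252")  # o-umlaut becomes the single byte 0xF6
utf8_bytes = text.encode("utf-8")     # o-umlaut becomes the two bytes 0xC3 0xB6

print(cp1252_bytes)
print(utf8_bytes)

# Decoding the cp1252 bytes as UTF-8 fails at the 0xF6 byte,
# which is not a valid start byte of a UTF-8 sequence.
try:
    cp1252_bytes.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc
    print(exc)
```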

An easy fix is to open the GEF file in Notepad (Kladblok) and save it as 'UTF-8'. The file will then probably parse correctly, including the ö.
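
The same re-save can be scripted. This is a minimal sketch that assumes the source file really is cp1252; `resave_as_utf8` and its parameters are hypothetical names, not part of GEOLIB+:

```python
from pathlib import Path


def resave_as_utf8(src, dst, source_encoding="cp1252"):
    """Read src as source_encoding (an assumption) and write dst as UTF-8."""
    raw = Path(src).read_bytes()
    text = raw.decode(source_encoding)
    # Write bytes rather than text so existing line endings are kept as-is.
    Path(dst).write_bytes(text.encode("utf-8"))
```

Writing to a separate `dst` keeps the original file untouched in case the assumed source encoding turns out to be wrong.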

Another fix to try (in Python) is to decode the file as 'utf-8'; if this fails, catch the error, decode the file using cp1252, and then re-encode it as utf-8.

with open('file.gef', 'rb') as fp:
    try:
        file_as_string = fp.read().decode('utf-8')
        # everything alright send file to GEOLIB+
    except UnicodeDecodeError:
        # File is probably cp1252 with special character, convert to utf-8
        file_as_string = fp.read().decode('cp1252')
        file_as_bytes_utf_8 = file_as_string.encode('utf-8')

@martijnkriebel
Author

Hi Maarten,

Thanks for the detailed explanation! The funny part is that the #DATAFORMAT header of the GEF file says it's ASCII-encoded, as specified in the standard, even though it's clearly not 😄

I remember trying to change the file encoding, but failed back then and switched to a different approach for the project that didn't involve this code. Somehow I currently cannot reproduce the error I initially got, even though I'm parsing the same GEF file which is ANSI-encoded and contains the ö-character. If I encounter the same problem another time I'll try your solutions!
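
One possible explanation for the error not reproducing, offered as an assumption rather than a diagnosis: the traceback shows `open(gef_file, "r")` without an `encoding` argument, in which case Python picks the encoding from the locale (or UTF-8 when UTF-8 mode is enabled). The same file can therefore decode without error in one environment and fail in another. A quick way to check what a given environment would use:

```python
import locale
import sys

# Encoding that open() uses for text files when no encoding= is passed.
default_encoding = locale.getpreferredencoding(False)

# True when Python's UTF-8 mode is enabled (for example via -X utf8 or PYTHONUTF8=1).
utf8_mode = bool(sys.flags.utf8_mode)

print(default_encoding, utf8_mode)
```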

@MattBrst
Contributor

MattBrst commented Oct 1, 2024

I encounter the same problem with GEF files created with the A.P. van den Berg software. These may contain the degree sign (as in degrees Celsius) and the character ö in the Dutch word for coordinate. When I use chardet to determine the encoding, it points to ISO-8859-1.
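
For the two characters mentioned here, ISO-8859-1 and cp1252 agree, so either name decodes them identically; the two encodings only differ for bytes 0x80 to 0x9F. A small sketch with made-up sample bytes (not taken from a real GEF file):

```python
# Sample bytes: degree sign (0xB0), "C", and a word containing o-umlaut (0xF6).
raw = b"\xb0C co\xf6rdinaat"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")
same = as_cp1252 == as_latin1

# The encodings diverge in the 0x80-0x9F range, for example at byte 0x80.
byte_80_cp1252 = b"\x80".decode("cp1252")      # euro sign, U+20AC
byte_80_latin1 = b"\x80".decode("iso-8859-1")  # control character, U+0080

print(same, repr(byte_80_cp1252), repr(byte_80_latin1))
```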

The code snippet above does not work in my case. I rewrote it to the following to get it working properly:

try:
    with open('file.gef', 'rb') as fp:
        file_as_string = fp.read().decode('utf-8')
        # everything alright send file to GEOLIB+
except UnicodeDecodeError:
    # File is probably cp1252 with special character, convert to utf-8
    with open('file.gef', 'rb') as fp:
        file_as_string = fp.read().decode('cp1252')
        file_as_bytes_utf_8 = file_as_string.encode('utf-8')
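
The reason the earlier snippet does not work appears to be that its `except` branch calls `fp.read()` a second time on a file that has already been read to the end, which returns empty bytes instead of the file content. A sketch that avoids reopening the file by reading the bytes once and decoding them twice if needed (`decode_gef_bytes` and `read_gef_text` are hypothetical helper names, and the cp1252 fallback is an assumption about the file's origin):

```python
import io


def decode_gef_bytes(raw, fallback_encoding="cp1252"):
    """Decode as UTF-8 first; fall back to an assumed legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback_encoding)


def read_gef_text(path, fallback_encoding="cp1252"):
    """Read the file once as bytes, then decode with the fallback logic."""
    with open(path, "rb") as fp:
        raw = fp.read()
    return decode_gef_bytes(raw, fallback_encoding)


# Why a second read() yields nothing: the stream position is already at the end.
stream = io.BytesIO(b"co\xf6rd")
first_read = stream.read()
second_read = stream.read()
print(first_read, second_read)
```

Note that a UTF-8 failure only shows the file is not UTF-8; it does not prove the file is cp1252, so the fallback remains a guess.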

MattBrst added a commit to MattBrst/imod-qgis that referenced this issue Oct 1, 2024
If the GEF file cannot be read with the default encoding (UTF8) it will fall back onto the cp1252 encoding. This helps to accept GEF files as commonly produced by dutch suppliers.

For more description of the problem, see: Deltares/GEOLib-Plus#6 (comment)
JoerivanEngelen pushed a commit to Deltares/imod-qgis that referenced this issue Oct 28, 2024
If the GEF file cannot be read with the default encoding (UTF8) it will fall back onto the cp1252 encoding. This helps to accept GEF files as commonly produced by dutch suppliers.

For more description of the problem, see: Deltares/GEOLib-Plus#6 (comment)