Special characters in GEF-file raise UnicodeDecodeError #6

martijnkriebel · 2022-08-05T08:13:52Z

Dutch GEF-files may contain special characters, for example the umlaut in the word "coördinatensysteem". This raises the UnicodeDecodeError below when parsing the file, which traces back to codecs.py. Replacing the "ö" with a regular "o" solves the issue.

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Input In [19], in <cell line: 5>()

      4 cpt_gef = GefCpt()
----> 5 cpt_gef.read(path)
      6 cpt_gef.coordinates

File c:\ProgramData\Anaconda3\envs\geolib_new\lib\site-packages\geolib_plus\cpt_base_model.py:220, in AbstractCPT.read(self, filepath)
    217     raise FileNotFoundError(filepath)
    219 cpt_reader = self.get_cpt_reader()
--> 220 cpt_data = cpt_reader.read_file(filepath)
    221 for cpt_key, cpt_value in cpt_data.items():
    222     setattr(self, cpt_key, cpt_value)

File c:\ProgramData\Anaconda3\envs\geolib_new\lib\site-packages\geolib_plus\gef_cpt\gef_file_reader.py:165, in GefFileReader.read_file(self, filepath)
    164 def read_file(self, filepath: Path) -> dict:
--> 165     return self.read_gef(gef_file=filepath)

File c:\ProgramData\Anaconda3\envs\geolib_new\lib\site-packages\geolib_plus\gef_cpt\gef_file_reader.py:174, in GefFileReader.read_gef(self, gef_file, fct_a)
    172 # read gef file
    173 with open(gef_file, "r") as f:
--> 174     data = f.readlines()
    176 # search NAP
    177 idx_nap = GefFileReader.get_line_index_from_data_starts_with(
    178     code_string=r"#ZID=", data=data
    179 )

File c:\ProgramData\Anaconda3\envs\geolib_new\lib\codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
    319 def decode(self, input, final=False):
    320     # decode input (taking the buffer into account)
    321     data = self.buffer + input
--> 322     (result, consumed) = self._buffer_decode(data, self.errors, final)
    323     # keep undecoded input until the next call
    324     self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 626: invalid start byte

EleniSmyrniou · 2022-08-26T08:27:44Z

I am not sure whether a GEF file with Dutch characters would work. For reading GEF files we use the fields described in the "Geotechnical exchange format for cpt-data".
GEF-CPT.pdf
If you attach the original gef file then I can take a closer look.

ghost · 2022-12-23T13:24:36Z

Hi Martijn,

The CUR standard clearly states that a GEF file should consist only of characters in the ASCII character set (only 128 characters).

The GEF file is parsed as UTF-8, the most widely used encoding on the web, which can represent characters from all languages. For obvious compatibility reasons, the original 128 ASCII characters map to the same bytes in UTF-8.

Your GEF file is probably encoded in cp1252 (ANSI), an extension that adds some extra characters used in Western European languages. Unfortunately, these special characters map to different byte(s) in UTF-8 and cp1252, because cp1252 is a single-byte encoding and UTF-8 is a multi-byte encoding. In fact, the byte for ö in 'windows-1252' (0xf6) is not a valid byte in UTF-8. That is what causes the problem; otherwise you would just get the wrong character out instead of an error.

An easy fix for you is to open the GEF file in Notepad (Kladblok) and save it as 'UTF-8'. The GEF file will then probably parse correctly, including the ö.

Another fix to try (in Python) is to decode the file as 'utf-8'; if this fails, catch the error, decode the file as cp1252, and then re-encode it as utf-8.
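The byte-level difference described above can be checked with a few lines of standard-library Python. This is only an illustration; the helper name `decodes_as_utf8` is made up for this sketch and is not part of GEOLIB+.

```python
# 'ö' is a single byte in cp1252 but two bytes in UTF-8.
char = "ö"
cp1252_bytes = char.encode("cp1252")  # b'\xf6'
utf8_bytes = char.encode("utf-8")     # b'\xc3\xb6'


def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if raw is valid UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The lone cp1252 byte is rejected by the UTF-8 decoder, which is the
# error shown in the traceback.
cp1252_is_valid_utf8 = decodes_as_utf8(cp1252_bytes)

# The opposite mix-up does not raise; it silently yields wrong characters.
mojibake = utf8_bytes.decode("cp1252")
```

Here `cp1252_is_valid_utf8` is `False`, while `mojibake` is the two-character string "Ã¶" rather than an error.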
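For more than a handful of files, the same re-save can be scripted. The sketch below assumes the source file really is cp1252; the function name and the paths in the usage note are placeholders, not anything provided by GEOLIB+. It works on raw bytes so that line endings are left untouched.

```python
from pathlib import Path


def convert_cp1252_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a cp1252 text file as UTF-8 (hypothetical helper)."""
    raw = src.read_bytes()
    # Decoding fails loudly if the file contains one of the few bytes
    # that cp1252 leaves undefined.
    text = raw.decode("cp1252")
    dst.write_bytes(text.encode("utf-8"))
```

Usage would look like `convert_cp1252_to_utf8(Path("file.gef"), Path("file_utf8.gef"))`; writing to a new file keeps the original intact in case the encoding assumption turns out to be wrong.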

with open('file.gef', 'rb') as fp:
    try:
        file_as_string = fp.read().decode('utf-8')
        # everything alright send file to GEOLIB+
    except UnicodeDecodeError:
        # File is probably cp1252 with special character, convert to utf-8
        file_as_string = fp.read().decode('cp1252')
        file_as_bytes_utf_8 = file_as_string.encode('utf-8')

martijnkriebel · 2022-12-27T10:14:11Z

Hi Maarten,

Thanks for the detailed explanation! The funny part is that the #DATAFORMAT header of the GEF file says it's ASCII-encoded, as specified in the standard, even though it's clearly not 😄

I remember trying to change the file encoding, but failed back then and switched to a different approach for the project that didn't involve this code. Somehow I currently cannot reproduce the error I initially got, even though I'm parsing the same GEF file which is ANSI-encoded and contains the ö-character. If I encounter the same problem another time I'll try your solutions!

MattBrst · 2024-10-01T09:21:44Z

I encounter the same problem with GEF files created with the software of A.P. van den Berg. These may contain the degree sign for 'degrees Celsius' and the character ö in the Dutch word for coordinate. When I use chardet to determine the encoding, it points to ISO-8859-1.

The code snippet above does not work in my case: after the failed UTF-8 attempt the file has already been read to the end, so the second fp.read() returns no data. I rewrote it as follows to get it working properly:
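A chardet result of ISO-8859-1 is compatible with a cp1252 fallback for these characters: the two encodings only differ in the byte range 0x80-0x9F, and both ö (0xf6) and the degree sign (0xb0) lie outside it. The sample bytes below are made up for illustration.

```python
# Made-up sample containing a degree sign (0xb0) and an o-umlaut (0xf6).
raw = b"20 \xb0C, co\xf6rdinatensysteem"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")
same_text = as_cp1252 == as_latin1

# The encodings only disagree for bytes 0x80-0x9F: cp1252 puts printable
# characters such as the euro sign there, ISO-8859-1 has control codes.
euro_cp1252 = b"\x80".decode("cp1252")
ctrl_latin1 = b"\x80".decode("iso-8859-1")


def can_decode(data: bytes, encoding: str) -> bool:
    """Return True if data decodes without error (hypothetical helper)."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Python's cp1252 codec leaves a few bytes (for example 0x81) undefined,
# whereas ISO-8859-1 accepts every byte value.
cp1252_accepts_0x81 = can_decode(b"\x81", "cp1252")
latin1_accepts_0x81 = can_decode(b"\x81", "iso-8859-1")
```

One consequence is that a cp1252 fallback can still raise on rare bytes, while an ISO-8859-1 fallback never raises but may produce control characters instead of the intended symbol.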

try:
    with open('file.gef', 'rb') as fp:
        file_as_string = fp.read().decode('utf-8')
        # everything alright send file to GEOLIB+
except UnicodeDecodeError:
    # File is probably cp1252 with special character, convert to utf-8
    with open('file.gef', 'rb') as fp:
        file_as_string = fp.read().decode('cp1252')
        file_as_bytes_utf_8 = file_as_string.encode('utf-8')

If the GEF file cannot be read with the default encoding (UTF-8), the reader falls back to the cp1252 encoding. This helps to accept GEF files as commonly produced by Dutch suppliers. For a fuller description of the problem, see: Deltares/GEOLib-Plus#6 (comment)
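The fallback described here can be sketched as a small helper that reads the bytes once and tries each encoding in turn. This is not the code from the merged pull request; the function name and the default encoding order are assumptions made for illustration.

```python
from typing import Sequence


def decode_with_fallback(
    raw: bytes, encodings: Sequence[str] = ("utf-8", "cp1252")
) -> str:
    """Decode raw with the first encoding that succeeds (hypothetical helper)."""
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError as exc:
            # Remember the failure and try the next candidate.
            last_error = exc
    # Every candidate failed: surface the most recent decoding error.
    raise last_error
```

With a file on disk this would be called as `decode_with_fallback(Path("file.gef").read_bytes())`. Trying UTF-8 first matters, because cp1252 accepts almost any byte sequence and would silently mis-decode genuine UTF-8 content.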

MattBrst mentioned this issue Oct 1, 2024

Add secondary encoding (cp1252) to GEF reader #36

Merged

MattBrst mentioned this issue Oct 1, 2024

Add secondary encoding (cp1252) to GEF reader Deltares/imod-qgis#83

Merged
