Character encoding problems in full-text indexing #1089

ojlyytinen · 2017-05-30T10:07:17Z

I created a simple Word document with this content "Aaa bbb åäö ÅÄÖ – aaa bbb." and tried to upload it to a standard Hyrax 1.0.1 installation. I got an error with a lengthy stack trace, which I've copied to https://gist.github.com/ojlyytinen/80d678701abbcc15a003a3f3386c7439.

See also the discussion at hydra-tech at https://groups.google.com/forum/#!topic/hydra-tech/G61sJUym8VA.

I did some research into the cause of this and I believe the problem is that file_set.to_solr['all_text_timv'].encoding is ASCII-8BIT. When serialising that to json, the conversion to UTF-8 fails. The contents are actually UTF-8 encoded, it's just that it's got the wrong encoding set. If you do .force_encoding('UTF-8'), you get the correct string back with all the international characters intact. (See output of a Rails console session below)

Looking further into this, the reason the encoding is ASCII-8BIT is that files read from Fedora get that encoding if they contain any characters outside 7-bit ASCII. The full-text file contains the international characters, so it comes back from Fedora as ASCII-8BIT. That string then gets added to the Solr document, and converting it to UTF-8 JSON fails and produces the error.
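
The failure can be reproduced in plain Ruby without Fedora or Solr. This is only a minimal sketch: it uses String#encode as a stand-in for the conversion the JSON serialiser attempts, and the variable names are made up for the illustration.

```ruby
# The UTF-8 bytes of "åäö", tagged as binary (ASCII-8BIT) like the
# content that comes back in the console session.
raw = "\u00E5\u00E4\u00F6".b

begin
  raw.encode(Encoding::UTF_8)
  transcoded = true
rescue Encoding::UndefinedConversionError
  # Bytes above 0x7F have no defined mapping from binary to UTF-8.
  transcoded = false
end

# Relabelling changes no bytes; it only tells Ruby how to read them.
relabelled = raw.dup.force_encoding(Encoding::UTF_8)

# Pure 7-bit ASCII transcodes from binary without complaint.
ascii = "abc".b.encode(Encoding::UTF_8)

puts transcoded
puts relabelled.valid_encoding?
puts ascii.encoding
```

This would also explain why plain ASCII content never triggers the error: every byte in it has a defined conversion.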

I think the way to fix this is to set the encoding correctly in ActiveFedora when loading the file from Fedora, based on the charset given in the mime_type. Then Hyrax could explicitly set the mime_type of the full-text document to "text/plain; charset=utf-8" and get the contents back in the right encoding.

f = ActiveFedora::File.new
f.content = 'abc'
f.save
f.reload
# 7-bit ASCII contents work fine and come back as UTF-8
f.content # => "abc"
f.content.encoding # => #<Encoding:UTF-8> 
f.content = 'åäö'
f.content.encoding # => #<Encoding:UTF-8>
f.save
f.reload
# With international characters, the contents come back as ASCII-8BIT but the bytes are the UTF-8 encoding of the string
f.content # => "\xC3\xA5\xC3\xA4\xC3\xB6"
f.content.encoding # => #<Encoding:ASCII-8BIT>
f.content.force_encoding('UTF-8') # => "åäö"
f.mime_type # => "text/plain"
# mime_type could be set to contain the charset
f.mime_type = 'text/plain; charset=utf-8'
f.content = 'öäå'
f.save
f.reload
f.content # => "\xC3\xB6\xC3\xA4\xC3\xA5"
f.content.encoding # => #<Encoding:ASCII-8BIT>
f.mime_type # => "text/plain;charset=UTF-8"
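
The idea of picking the encoding from the charset in mime_type could look roughly like the sketch below. The helper apply_charset is hypothetical and is not part of ActiveFedora; it only illustrates the approach on plain strings, and it deliberately leaves the bytes untouched when no charset (or an unknown one) is given.

```ruby
# Hypothetical helper, not an ActiveFedora API: tag raw bytes with the
# charset named in a MIME type string.
def apply_charset(bytes, mime_type)
  # Pull the charset parameter out of e.g. "text/plain;charset=UTF-8".
  charset = mime_type.to_s[/;\s*charset\s*=\s*"?([^";\s]+)"?/i, 1]
  return bytes unless charset

  encoding = begin
    Encoding.find(charset)
  rescue ArgumentError
    nil # unknown charset name: keep the bytes as they are
  end
  encoding ? bytes.dup.force_encoding(encoding) : bytes
end

# Same bytes as in the session above, tagged as binary.
body = "\u00F6\u00E4\u00E5".b

with_charset    = apply_charset(body, 'text/plain;charset=UTF-8')
without_charset = apply_charset(body, 'text/plain')

puts with_charset.encoding
puts without_charset.encoding
```

Relabelling rather than transcoding matches what force_encoding does in the session: the bytes are already correct, only the tag is wrong.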

ghost · 2017-05-30T12:55:04Z

Just to add, this is also a problem when editing the metadata of an existing FileSet (demonstrated in Hyku). So the file may upload without a problem, but subsequent changes to the FileSet cause something to be retrieved or saved that triggers the error. Presumably it is the extracted text in Solr, as Olli notes above.

jcoyne · 2017-06-15T16:38:22Z

Related:
samvera/ldp#84
rsolr/rsolr#184
samvera/active_fedora#1258
samvera/hydra-derivatives#166

Specifically for the extracted text. Fixes #1089

mjgiarlo · 2017-06-16T21:59:26Z

@geekscruff @ojlyytinen Can y'all confirm the fix from #1227 works for you?

ghost · 2017-06-19T09:50:14Z

@mjgiarlo I was having problems editing filesets where the attached file was PDF, even though they had uploaded fine. I can confirm that I'm not seeing that with the latest hyrax 👍

ojlyytinen · 2017-06-19T13:53:46Z

Yes, this seems to be working now. .content in ActiveFedora::File has the right encoding if it's set in mime_type and the full_text_timv field gets set correctly in Solr.

mjgiarlo · 2017-06-19T15:24:16Z

@geekscruff @ojlyytinen Thank you both!

mjgiarlo added bug dependency HyBox ready ready labels May 31, 2017

mjgiarlo added this to the 2.0.0 milestone May 31, 2017

aaron-collier mentioned this issue Jun 12, 2017

Add configuration option to disable full text extract #1205

Merged

jcoyne added in progress HyBox ready and removed HyBox ready in progress labels Jun 13, 2017

jcoyne self-assigned this Jun 15, 2017

jcoyne added in progress and removed HyBox ready labels Jun 15, 2017

jcoyne added a commit that referenced this issue Jun 15, 2017

Update dependencies which saves and retreives charset

031912d

Specifically for the extracted text. Fixes #1089

jcoyne mentioned this issue Jun 15, 2017

Update dependencies which saves and retreives charset #1227

Merged

jcoyne added ready and removed ready labels Jun 15, 2017

jcoyne added a commit that referenced this issue Jun 15, 2017

Update dependencies which saves and retreives charset

cabb295

Specifically for the extracted text. Fixes #1089

mjgiarlo closed this as completed in #1227 Jun 16, 2017

mjgiarlo removed in progress ready labels Jun 16, 2017

dheles pushed a commit to dheles/hyrax that referenced this issue Jul 5, 2017

Update dependencies which saves and retreives charset

a349c86

Specifically for the extracted text. Fixes samvera#1089

moseshll mentioned this issue Mar 6, 2018

Character encoding problem in Aboutware mlibrary/heliotrope#1560

Closed

conorom mentioned this issue Jun 8, 2022

All file original_name values report encoding as ASCII-8BIT #5670

Open
