Character encoding problems in full-text indexing #1089

Closed
ojlyytinen opened this issue May 30, 2017 · 6 comments

@ojlyytinen
Contributor

I created a simple Word document with the content "Aaa bbb åäö ÅÄÖ – aaa bbb." and tried to upload it to a standard Hyrax 1.0.1 installation. I got an error with a lengthy stack trace, which I've copied to https://gist.github.com/ojlyytinen/80d678701abbcc15a003a3f3386c7439.

See also the discussion at hydra-tech at https://groups.google.com/forum/#!topic/hydra-tech/G61sJUym8VA.

I did some research into the cause of this, and I believe the problem is that file_set.to_solr['all_text_timv'].encoding is ASCII-8BIT. When that string is serialised to JSON, the conversion to UTF-8 fails. The contents are actually UTF-8 encoded; the string just has the wrong encoding set. If you do .force_encoding('UTF-8'), you get the correct string back with all the international characters intact. (See the output of a Rails console session below.)

Looking further into this, the reason the encoding is ASCII-8BIT is that files read from Fedora get that encoding if they contain any characters outside 7-bit ASCII. The full-text file contains the international characters, so its contents come back from Fedora as ASCII-8BIT. That string then gets added to the Solr document, and converting it to UTF-8 JSON fails and produces the error.
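The failure described above can be reproduced in plain Ruby, without Fedora, Solr, or Hyrax. This is only a minimal sketch: `retag_as_utf8` and `transcodable_to_utf8?` are hypothetical helper names made up for illustration, and the binary string stands in for what `f.content` returns after a reload. The exact exception raised inside Hyrax's JSON serialisation may differ from the one shown here.

```ruby
require 'json'

# Hypothetical helper: relabel bytes that are known to be UTF-8.
# force_encoding changes only the encoding tag, never the bytes.
def retag_as_utf8(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8)
end

# Hypothetical helper: true if Ruby can transcode the string to UTF-8.
def transcodable_to_utf8?(str)
  str.encode(Encoding::UTF_8)
  true
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  false
end

utf8_text   = "\u00E5\u00E4\u00F6" # the characters åäö as a UTF-8 string
from_fedora = utf8_text.b          # same bytes, but tagged ASCII-8BIT

# Transcoding ASCII-8BIT to UTF-8 fails for any byte above 0x7F, because
# Ruby has no mapping from "arbitrary binary byte" to a Unicode character.
puts transcodable_to_utf8?(from_fedora)

# Relabelling instead of transcoding recovers the original string.
fixed = retag_as_utf8(from_fedora)
puts fixed == utf8_text
puts JSON.generate('all_text_timv' => fixed)
```

Pure 7-bit content such as `"abc".b` transcodes without error, which matches the observation that only files with non-ASCII characters trigger the problem.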

I think the way to fix this is to set the encoding correctly in ActiveFedora when loading the file from Fedora, based on the charset given in the mime_type. Hyrax could then explicitly set the mime_type of the full-text document to "text/plain; charset=utf-8" and get the contents back in the right encoding.

f = ActiveFedora::File.new
f.content = 'abc'
f.save
f.reload
# 7-bit ASCII contents work fine and come back as UTF-8
f.content # => "abc"
f.content.encoding # => #<Encoding:UTF-8> 
f.content = 'åäö'
f.content.encoding # => #<Encoding:UTF-8>
f.save
f.reload
# With international characters, the contents come back as ASCII-8BIT but the bytes are the UTF-8 encoding of the string
f.content # => "\xC3\xA5\xC3\xA4\xC3\xB6"
f.content.encoding # => #<Encoding:ASCII-8BIT>
f.content.force_encoding('UTF-8') # => "åäö"
f.mime_type # => "text/plain"
# mime_type could be set to contain the charset
f.mime_type = 'text/plain; charset=utf-8'
f.content = 'öäå'
f.save
f.reload
f.content # => "\xC3\xB6\xC3\xA4\xC3\xA5"
f.content.encoding # => #<Encoding:ASCII-8BIT>
f.mime_type # => "text/plain;charset=UTF-8"
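The proposal of picking the encoding from the mime_type's charset could look roughly like the sketch below. It is not ActiveFedora's actual code: `charset_from_mime_type` and `content_with_declared_encoding` are hypothetical names, and falling back to the untouched bytes when the charset is missing, unknown, or does not match the data is an assumption made for this example.

```ruby
# Hypothetical helper: pull the charset parameter out of a MIME type string,
# e.g. "text/plain;charset=UTF-8" or "text/plain; charset=utf-8".
def charset_from_mime_type(mime_type)
  match = mime_type.to_s.match(/;\s*charset\s*=\s*"?([^";\s]+)"?/i)
  match && match[1]
end

# Hypothetical helper: tag the raw bytes with the declared encoding.
# Returns the bytes unchanged if no usable charset is declared or if the
# bytes are not valid in the declared encoding.
def content_with_declared_encoding(bytes, mime_type)
  charset = charset_from_mime_type(mime_type)
  return bytes if charset.nil?

  encoding = begin
    Encoding.find(charset) # raises ArgumentError for unknown names
  rescue ArgumentError
    nil
  end
  return bytes if encoding.nil?

  tagged = bytes.dup.force_encoding(encoding)
  tagged.valid_encoding? ? tagged : bytes
end

raw = "\u00F6\u00E4\u00E5".b # UTF-8 bytes of öäå, tagged ASCII-8BIT
text = content_with_declared_encoding(raw, 'text/plain;charset=UTF-8')
puts text.encoding
puts text == "\u00F6\u00E4\u00E5"
```

The bytes are relabelled with `force_encoding` rather than transcoded with `encode`, since the stored data is already in the declared charset and only the tag is wrong.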
@ghost

ghost commented May 30, 2017

Just to add, this is also a problem when editing the metadata of an existing FileSet (demonstrated in Hyku). So the file may upload without a problem, but subsequent changes to the FileSet cause something to be retrieved or saved that triggers the error, presumably the extracted text in Solr, as Olli notes above.

@jcoyne
Member

jcoyne commented Jun 15, 2017

jcoyne added a commit that referenced this issue Jun 15, 2017
Specifically for the extracted text.

Fixes #1089
@jcoyne jcoyne added ready and removed ready labels Jun 15, 2017
@mjgiarlo
Member

@geekscruff @ojlyytinen Can y'all confirm the fix from #1227 works for you?

@ghost

ghost commented Jun 19, 2017

@mjgiarlo I was having problems editing filesets where the attached file was PDF, even though they had uploaded fine. I can confirm that I'm not seeing that with the latest hyrax 👍

@ojlyytinen
Contributor Author

Yes, this seems to be working now. .content in ActiveFedora::File has the right encoding if the charset is set in mime_type, and the all_text_timv field gets set correctly in Solr.

@mjgiarlo
Member

@geekscruff @ojlyytinen Thank you both!
