Start replacing our custom stringex with our own class #4613

Draft: wants to merge 4 commits into base: main
55 changes: 55 additions & 0 deletions app/models/orangelight/sort_normalize.rb
@@ -0,0 +1,55 @@
# frozen_string_literal: true

# Stringex (upstream):
# It normalizes and romanizes everything.

# Stringex/forked:
# Alphabetic Presentation Forms (Latin ligatures) FB00–FB06
# Halfwidth and Fullwidth Forms (fullwidth Latin letters) FF00–FF5E
# ##### OTHER SCRIPTS #####
# Combining Diacritical Marks, 0300-036F
# Greek, 0384-03CE
# Cyrillic, 0400-045F
# Armenian, 0531-0587

# It normalizes Latin, Greek, Cyrillic, and Armenian characters.
# It does not romanize Greek, Cyrillic, or Armenian.

# For all other scripts it does not normalize (for example, Chinese).

# ============================

# Unidecode:
# It normalizes everything and romanizes everything.

# ============================

# Orangelight::SortNormalize:
# For Latin characters it normalizes some of them; we haven't covered all of them.
# - Maybe we will use Unidecode for Latin if needed.
# It normalizes Greek without romanizing it.
# It normalizes Cyrillic but has a bug.
# It normalizes Armenian, but we're not very confident in the result.

# For all other languages we keep them as they are.

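class Orangelight::SortNormalize
  # Returns a sort key: diacritics stripped, em dashes turned into spaces,
  # punctuation and symbols removed, and case folded.
  def normalize(string)
    normalize_greek_characters(
      remove_diacritics(string)
        .gsub(/—/, ' ')
        .gsub(/[\p{P}\p{S}]/, '')
        .downcase(:fold)
    )
  end

  private

  # Decomposes the string (NFD) and deletes combining marks from the
  # Combining Diacritical Marks, Supplement, and Half Marks blocks.
  def remove_diacritics(string)
    diacritic_combining_characters = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')
    decomposed_version = string.unicode_normalize(:nfd)
    decomposed_version.tr(diacritic_combining_characters, '')
  end

  # Maps the final sigma to the regular sigma so both sort together.
  def normalize_greek_characters(string)
    string.tr('ς', 'σ')
  end
end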
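For review purposes, the steps in `normalize` can be tried outside Rails with a minimal standalone sketch. The method name `sort_key` and the constant `COMBINING_MARKS` are hypothetical and not part of this PR; the sketch uses a regular expression over the same three combining-mark ranges instead of `tr`, which is assumed to be equivalent for these inputs.

```ruby
# frozen_string_literal: true

# Hypothetical standalone mirror of Orangelight::SortNormalize#normalize.
# Same combining-mark ranges as the PR, expressed as a character class.
COMBINING_MARKS = /[\u0300-\u036F\u1DC0-\u1DFF\uFE20-\uFE2F]/

def sort_key(string)
  string.unicode_normalize(:nfd)     # split base letters from combining marks
        .gsub(COMBINING_MARKS, '')   # drop the combining marks
        .gsub('—', ' ')              # em dash separates subject subdivisions
        .gsub(/[\p{P}\p{S}]/, '')    # drop remaining punctuation and symbols
        .downcase(:fold)             # Unicode case folding (ß becomes ss)
        .tr('ς', 'σ')                # final sigma sorts like regular sigma
end

puts sort_key('Vilaça, Aparecida, 1958-. Ficções amazônicas')
puts sort_key('帝妃春ßK')
```

Characters outside the handled ranges, such as the Chinese text in the second call, pass through unchanged apart from punctuation removal and case folding.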
30 changes: 30 additions & 0 deletions spec/models/orangelight/sort_normalize_spec.rb
@@ -0,0 +1,30 @@
# frozen_string_literal: true

require 'rails_helper'

RSpec.describe Orangelight::SortNormalize do
  it 'removes punctuation and replaces em dashes with spaces' do
    normalizer = described_class.new
    expect(normalizer.normalize('World War, 1939-1945—Occupied territories—Pictorial works')).to eq 'world war 19391945 occupied territories pictorial works'
    expect(normalizer.normalize('Ζουργός, Ισίδωρος, 1964-')).to eq 'ζουργοσ ισιδωροσ 1964' # Note that this uses the incorrect sigma (if we were displaying, we would use Iσίδωρος)
    expect(normalizer.normalize('دراسات. علوم الادارية.')).to eq 'دراسات علوم الادارية'
  end

  it 'folds the German double s into two lower case s characters' do
    normalizer = described_class.new
    expect(normalizer.normalize('程士廉. 帝妃春ßK')).to eq '程士廉 帝妃春ssk'
  end

  it 'removes latin diacritics' do
    normalizer = described_class.new
    expect(normalizer.normalize('Şengönül, Fatma Betül. Kent diplomasisi')).to eq 'sengonul fatma betul kent diplomasisi'
    expect(normalizer.normalize('Vilaça, Aparecida, 1958-. Ficções amazônicas')).to eq 'vilaca aparecida 1958 ficcoes amazonicas'
    # NOTE: 'Ø' (U+00D8) has no canonical decomposition, so NFD-based diacritic
    # removal leaves it as 'ø'; this expectation likely needs an explicit mapping.
    expect(normalizer.normalize('Ødegård, Guro. Ungdommen')).to eq 'odegard guro ungdommen'
  end

  it 'normalizes Cyrillic characters' do
    normalizer = described_class.new
    expect(normalizer.normalize('Қайранбай, Жұмабай Қожақынұлы. Жұлдызжирен')).to eq 'қайранбай жұмабай қожақынұлы жұлдызжирен'
  end

  it 'normalizes Armenian characters' do
    normalizer = described_class.new
    expect(normalizer.normalize('Քոչար՝ Երվանդ, 1899-1979. Works')).to eq 'քոչար երվանդ 18991979 works'
  end
end
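Two of the spec inputs depend on how NFD treats specific letters, which can be checked directly. The sketch below is illustrative only: the helper names `decomposes?` and `codepoints_after_nfd` are hypothetical. Whether the Cyrillic result is the bug mentioned in the class comment is an assumption, not something the PR states.

```ruby
# frozen_string_literal: true

# True when NFD splits the character into a base letter plus combining marks.
def decomposes?(char)
  char.unicode_normalize(:nfd) != char
end

# Code points of the NFD form, formatted as U+XXXX strings.
def codepoints_after_nfd(char)
  char.unicode_normalize(:nfd).codepoints.map { |cp| format('U+%04X', cp) }
end

p decomposes?("\u00E5")          # å: true (a + combining ring above)
p decomposes?("\u00D8")          # Ø: false (no canonical decomposition)
p codepoints_after_nfd("\u0439") # й: ["U+0438", "U+0306"] (и + combining breve)
```

Because U+0306 falls inside the 0300–036F range that `remove_diacritics` strips, `й` would come out as `и`, while `Ø` would survive as `ø` after case folding. If those letters should be handled differently, one option is to recompose or special-case them after stripping.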