Normalize punctuation on input #1599
Comments
If it helps, here is the list of characters we systematically correct in our systems because we have found them in our records. It is still in Python 2, and copy and paste may not have preserved some of the characters, but the comments may help:

    # -*- coding: utf-8 -*-
    bad_chars = {
        u'\t': u' ',
        u' ': u' ',         # key may have been lost in copy and paste (a no-op as shown)
        # u'': u'',         # Macintosh newline char? (key lost in copy and paste)
        u'\u00a0': u' ',    # Unicode 0xA0, NO-BREAK SPACE
        u'\u200e': u' ',    # Unicode 0x200E, LEFT-TO-RIGHT MARK
        u'‘': u"'",         # Unicode 0x2018, LEFT SINGLE QUOTATION MARK
        u'’': u"'",         # Unicode 0x2019, RIGHT SINGLE QUOTATION MARK
        u'´': u"'",         # Unicode 0xB4, ACUTE ACCENT
        u'′': u"'",         # Unicode 0x2032, PRIME
        u'`': u"'",         # Unicode 0x60, GRAVE ACCENT
        u'\222': u"'",      # Unicode 0x92, PRIVATE USE TWO (right single quote in Windows-1252)
        u'“': u'"',
        u'”': u'"',
        u'<<': u'«',        # appeared twice; a variant may have been lost in copy and paste
        u'>>': u'»',        # appeared twice; a variant may have been lost in copy and paste
        u'l.l': u'l·l',
        u'l•l': u'l·l',
        u'l\225l': u'l·l',  # 0x95 is the bullet in Windows-1252
        # u'': u'·',        # key lost in copy and paste; an empty key would insert '·' between all characters
        u'–': u'-',         # Unicode 0x2013, EN DASH
        u'—': u'-',         # Unicode 0x2014, EM DASH
        u'‐': u'-',         # Unicode 0x2010, HYPHEN
    }

    def replace_bad_chars(line):
        # Apply every replacement in turn; keys may be longer than one character.
        for bad_char, good_char in bad_chars.items():
            line = line.replace(bad_char, good_char)
        return line
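A list like the one above has to come from somewhere, so here is a minimal Python 3 sketch of how candidate characters might be found in existing records before deciding what to correct. The function name `suspicious_chars` and the choice of Unicode categories are assumptions for illustration, not part of the original comment or of Muscat. Legitimate characters such as `«` or `·` will also show up, so the result is meant for manual review, not automatic replacement.

```python
import unicodedata
from collections import Counter

# Unicode general categories worth a manual look (an assumed selection):
# quotes (Pi, Pf), dashes (Pd), other punctuation (Po), modifier symbols (Sk),
# space separators (Zs), format characters (Cf), and control characters (Cc).
SUSPECT_CATEGORIES = {"Pi", "Pf", "Pd", "Po", "Sk", "Zs", "Cf", "Cc"}


def suspicious_chars(records):
    """Count non-ASCII characters that fall into the suspect categories."""
    counts = Counter()
    for record in records:
        for ch in record:
            if ord(ch) > 0x7F and unicodedata.category(ch) in SUSPECT_CATEGORIES:
                counts[ch] += 1
    return counts


if __name__ == "__main__":
    sample = ["La Seu d’Urgell", "Zürich", "Au sein des alarmes l’amour"]
    for ch, n in suspicious_chars(sample).most_common():
        print("U+%04X %s: %d" % (ord(ch), unicodedata.name(ch, "<unnamed>"), n))
```

Letters with diacritics such as ü have category `Ll` and are never reported, which matches the requirement that they stay untouched.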
Some characters need to be normalized (smart quotes vs. apostrophes), but some need to be allowed (u vs. ü). Currently, ’ and ' are read as different punctuation marks. This causes mismatches in city names in Institutions:
La Seu d’Urgell https://rism.online/institutions/30079707
La Seu d'Urgell https://rism.online/institutions/30005481
and duplicates in Titles/Texts:
https://muscat.rism.info/admin/standard_titles?utf8=%E2%9C%93&q%5Btitle_equals%5D=Au+sein+des+alarmes+l%E2%80%99amour+a+des+charmes&commit=Filter&order=id_desc
Au sein des alarmes l’amour a des charmes
Au sein des alarmes l'amour a des charmes
This arises especially when copying from websites or during data imports. The problem has been solved for searching (see #622) but not on the input side.
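To make the mismatch concrete, here is a small Python sketch (the helper name `differing_chars` is made up for this example). It shows that the two spellings of the city name compare as unequal, names the code points that differ, and checks that standard Unicode NFKC normalization does not merge them, since U+2019 has no compatibility mapping to the ASCII apostrophe.

```python
import unicodedata

a = "La Seu d’Urgell"   # typographic apostrophe, e.g. pasted from a website
b = "La Seu d'Urgell"   # ASCII apostrophe, as typed on a keyboard


def differing_chars(x, y):
    """Return (name in x, name in y) for each position where x and y differ."""
    return [
        (unicodedata.name(p), unicodedata.name(q))
        for p, q in zip(x, y)
        if p != q
    ]


print(a == b)
print(differing_chars(a, b))
# NFKC leaves U+2019 untouched, so it does not unify the two spellings.
print(unicodedata.normalize("NFKC", a) == b)
```

This is why an explicit replacement list is needed on the input side instead of relying on a generic Unicode normalization form alone.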
I can think of the following:
’ and '
" " and “ ”
- – — (dash, n-dash, m-dash)
For the dashes, only one is needed (the dash, I think?) in the standardized fields.
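The list above could be sketched in Python as a single translation table. This is only an illustration under assumptions: the names `PUNCTUATION_MAP` and `normalize_punctuation` are invented here, the left single quotation mark is included although the list only shows the right one, and all dashes are mapped to the plain hyphen-minus as suggested.

```python
# Map each typographic variant to its plain ASCII counterpart.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK (assumed; not in the list above)
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u2013": "-",   # EN DASH
    "\u2014": "-",   # EM DASH
})


def normalize_punctuation(text):
    """Replace smart quotes and long dashes; leave all other characters alone."""
    return text.translate(PUNCTUATION_MAP)


if __name__ == "__main__":
    print(normalize_punctuation("Au sein des alarmes l’amour a des charmes"))
```

Because only the listed code points are in the table, letters such as ü pass through unchanged.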
What about spaces? Sometimes they act strangely (Excel doesn't always read the spaces as spaces), but I can't describe it further than that.
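One possible explanation, and it is only a guess, is that some of these "spaces" are other Unicode space characters such as the no-break space (U+00A0). Under that assumption, a sketch of a whitespace cleanup in Python could look like this; `normalize_spaces` is a hypothetical name, and collapsing repeated spaces and trimming the ends are extra choices not requested in the issue.

```python
import re
import unicodedata


def normalize_spaces(text):
    """Turn tabs and Unicode space separators into plain spaces, then tidy up."""
    chars = []
    for ch in text:
        # Category "Zs" covers the ordinary space, NO-BREAK SPACE, and similar.
        if ch == "\t" or unicodedata.category(ch) == "Zs":
            chars.append(" ")
        else:
            chars.append(ch)
    # Collapse runs of spaces and drop leading/trailing ones (an assumed policy).
    return re.sub(r" {2,}", " ", "".join(chars)).strip()


if __name__ == "__main__":
    print(repr(normalize_spaces("La\u00a0Seu  d'Urgell\t")))
```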
This is most important for the fields that are linked to authority files, not everywhere (for example, not in notes fields).