Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clean_names inserts unwanted underscore "_" in between NON-ASCII characters #268

Closed
hidekoji opened this issue Feb 9, 2019 · 8 comments · Fixed by #340
Closed

clean_names inserts unwanted underscore "_" in between NON-ASCII characters #268

hidekoji opened this issue Feb 9, 2019 · 8 comments · Fixed by #340
Labels
bug seeking comments Users and any interested parties should please weigh in - this is in a discussion phase!
Milestone

Comments

@hidekoji
Copy link

hidekoji commented Feb 9, 2019

Bug reports

Version

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)

janitor 1.1.1

Description

When you have non-ascii column names, clean_name inserts unwanted underscore in between NON-ASCII characters.

Below explains how unwanted underscore is inserted in between NON-ASCII characters.

df <- data.frame(`介護_看護_女` =  c(1))
colnames(df)
#> [1] "介護_看護_女"
df_cleaned <- janitor::clean_names(df)
colnames(df_cleaned)
#> [1] "介_護_看_護_女"

Created on 2019-02-08 by the reprex package (v0.2.1)

For this example, clean_names shoule not put "_" in between characters like "介_護_看_護_女" but should return "介護_看護_女".

@hidekoji
Copy link
Author

I looked into https://github.com/sfirke/janitor/blob/master/R/make_clean_names.R and it seems transliterations argument is hardcoded as c("Latin-ASCII"). Could it be a reason for the issue?

new_names <- old_names %>%
    gsub("'", "", .) %>% # remove single quotation marks
    gsub("\"", "", .) %>% # remove double quotation marks
    gsub("%", ".percent_", .) %>% # starting with "." as a workaround, to make
    # ".percent" a valid name. The "." will be replaced in the call to to_any_case
    # via the preprocess argument anyway.
    gsub("#", ".number_", .) %>%
    gsub("^[[:space:][:punct:]]+", "", .) %>% # remove leading spaces & punctuation
    make.names(.) %>%
    # Handle dots, multiple underscores, case conversion, string transliteration
    # Parsing option 4 removes underscores around numbers, #153
    snakecase::to_any_case(.,
      case = case, sep_in = "\\.",
      transliterations = c("Latin-ASCII"), parsing_option = 1,
      numerals = "asis"
    )
  
  # Handle duplicated names - they mess up dplyr pipelines
  # This appends the column number to repeated instances of duplicate variable names
  dupe_count <- vapply(seq_along(new_names), function(i) {
    sum(new_names[i] == new_names[1:i])
  }, integer(1))

@sfirke
Copy link
Owner

sfirke commented Feb 14, 2019

Thanks for reporting! I agree that underscores should not be inserted as in your example. I'm not sure the best way to fix this, but welcome ideas. I think we put transliterations = c("Latin-ASCII") to ensure the conversion of say, ü to u.

@sfirke sfirke added bug seeking comments Users and any interested parties should please weigh in - this is in a discussion phase! labels Feb 14, 2019
@billdenney
Copy link
Collaborator

It looks like setting parsing_option=0 of snakecase::to_any_case() fixes the issue, so exposing that could help:

snakecase::to_any_case("介護_看護_女",
                       case = "snake", sep_in = "\\.",
                       transliterations = "Latin-ASCII", parsing_option = 0,
                       numerals = "asis"
)

@sfirke
Copy link
Owner

sfirke commented Mar 15, 2019

What would a fix for this look like? Bill, I agree that parsing_option = 0 fixes this, though I expect the typical user prefers the parsing of CamelCase text they get with parsing_option = 1. If there's a way to meet user needs like @hidekoji points out while preserving current behavior by default, I'd like that.

I thought about creating some interface to the parsing_option argument like parsing = TRUE but I think we might as well just let people pass through to that argument directly? In case they want to do parsing_option = 3 or something? And then document parsing_option along the lines of:

This is passed through as an argument to snakecase::to_any_case to allow for finer control of how breaks between words are parsed. See ?snakecase::to_any_case for complete documentation. parsing_option = 1 preserves the historical behavior of janitor. To avoid the insertion of underscores between non-ASCII characters, consider parsing_option = 0.

@Tazinho, do you have advice or preference, as to how we help users avoid this behavior?

@Tazinho
Copy link
Contributor

Tazinho commented Mar 15, 2019

@sfirke I think the parsing_option is currently the only option. Note that I am also planning a general overhaul of this argument to allow low level specification of the parsing. There is already an issue describing the details Tazinho/snakecase#163 However, the implementation might be a bit technical and it might take longer till I find the time for this change.

@sfirke
Copy link
Owner

sfirke commented Mar 7, 2020

Hrm does #340 close this too? Or if it doesn't or can't, and we determine we can't address this issue, we could close this. Of course if this one could be addressed and we just are unable to do it right now, it can stay open. @billdenney I'll stop tagging you in stuff now 😆

@sfirke sfirke added this to the v2.0 milestone Mar 7, 2020
@billdenney
Copy link
Collaborator

In #340, I'm working to expose all of the options from snakecase::to_any_case(). It is currently there with the version on my computer but not the version in the PR.

It's difficult to test this in detail because make.names() which is used in the process of clean_names() is locale-dependent. So, I get Unicode-to-ASCII transliteration with make.names(). That points to an underlying issue in base R with the fact that Unicode characters are considered to be eligible as part of variable names in locale-specific ways. I'll also make the make.names() part optional. I think that will be the closest that we can get to a real fix for this with the locale-specific nature of make.names().

@sfirke
Copy link
Owner

sfirke commented Mar 11, 2020

That sounds great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug seeking comments Users and any interested parties should please weigh in - this is in a discussion phase!
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants