clean_names inserts unwanted underscore "_" in between NON-ASCII characters #268

hidekoji · 2019-02-09T01:25:35Z

Bug reports

Version

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)

janitor 1.1.1

Description

When you have non-ascii column names, clean_name inserts unwanted underscore in between NON-ASCII characters.

Below explains how unwanted underscore is inserted in between NON-ASCII characters.

df <- data.frame(`介護_看護_女` =  c(1))
colnames(df)
#> [1] "介護_看護_女"
df_cleaned <- janitor::clean_names(df)
colnames(df_cleaned)
#> [1] "介_護_看_護_女"

^{Created on 2019-02-08 by the reprex package (v0.2.1)}

For this example, clean_names shoule not put "_" in between characters like "介_護_看_護_女" but should return "介護_看護_女".

The text was updated successfully, but these errors were encountered:

hidekoji · 2019-02-10T04:47:39Z

I looked into https://github.com/sfirke/janitor/blob/master/R/make_clean_names.R and it seems transliterations argument is hardcoded as c("Latin-ASCII"). Could it be a reason for the issue?

new_names <- old_names %>%
    gsub("'", "", .) %>% # remove single quotation marks
    gsub("\"", "", .) %>% # remove double quotation marks
    gsub("%", ".percent_", .) %>% # starting with "." as a workaround, to make
    # ".percent" a valid name. The "." will be replaced in the call to to_any_case
    # via the preprocess argument anyway.
    gsub("#", ".number_", .) %>%
    gsub("^[[:space:][:punct:]]+", "", .) %>% # remove leading spaces & punctuation
    make.names(.) %>%
    # Handle dots, multiple underscores, case conversion, string transliteration
    # Parsing option 4 removes underscores around numbers, #153
    snakecase::to_any_case(.,
      case = case, sep_in = "\\.",
      transliterations = c("Latin-ASCII"), parsing_option = 1,
      numerals = "asis"
    )
  
  # Handle duplicated names - they mess up dplyr pipelines
  # This appends the column number to repeated instances of duplicate variable names
  dupe_count <- vapply(seq_along(new_names), function(i) {
    sum(new_names[i] == new_names[1:i])
  }, integer(1))

sfirke · 2019-02-14T18:38:42Z

Thanks for reporting! I agree that underscores should not be inserted as in your example. I'm not sure the best way to fix this, but welcome ideas. I think we put transliterations = c("Latin-ASCII") to ensure the conversion of say, ü to u.

billdenney · 2019-02-14T19:31:14Z

It looks like setting parsing_option=0 of snakecase::to_any_case() fixes the issue, so exposing that could help:

snakecase::to_any_case("介護_看護_女",
                       case = "snake", sep_in = "\\.",
                       transliterations = "Latin-ASCII", parsing_option = 0,
                       numerals = "asis"
)

sfirke · 2019-03-15T02:31:05Z

What would a fix for this look like? Bill, I agree that parsing_option = 0 fixes this, though I expect the typical user prefers the parsing of CamelCase text they get with parsing_option = 1. If there's a way to meet user needs like @hidekoji points out while preserving current behavior by default, I'd like that.

I thought about creating some interface to the parsing_option argument like parsing = TRUE but I think we might as well just let people pass through to that argument directly? In case they want to do parsing_option = 3 or something? And then document parsing_option along the lines of:

This is passed through as an argument to snakecase::to_any_case to allow for finer control of how breaks between words are parsed. See ?snakecase::to_any_case for complete documentation. parsing_option = 1 preserves the historical behavior of janitor. To avoid the insertion of underscores between non-ASCII characters, consider parsing_option = 0.

@Tazinho, do you have advice or preference, as to how we help users avoid this behavior?

Tazinho · 2019-03-15T09:26:04Z

@sfirke I think the parsing_option is currently the only option. Note that I am also planning a general overhaul of this argument to allow low level specification of the parsing. There is already an issue describing the details Tazinho/snakecase#163 However, the implementation might be a bit technical and it might take longer till I find the time for this change.

sfirke · 2020-03-07T05:11:33Z

Hrm does #340 close this too? Or if it doesn't or can't, and we determine we can't address this issue, we could close this. Of course if this one could be addressed and we just are unable to do it right now, it can stay open. @billdenney I'll stop tagging you in stuff now 😆

billdenney · 2020-03-11T01:36:02Z

In #340, I'm working to expose all of the options from snakecase::to_any_case(). It is currently there with the version on my computer but not the version in the PR.

It's difficult to test this in detail because make.names() which is used in the process of clean_names() is locale-dependent. So, I get Unicode-to-ASCII transliteration with make.names(). That points to an underlying issue in base R with the fact that Unicode characters are considered to be eligible as part of variable names in locale-specific ways. I'll also make the make.names() part optional. I think that will be the closest that we can get to a real fix for this with the locale-specific nature of make.names().

sfirke · 2020-03-11T01:39:04Z

That sounds great.

sfirke added bug seeking comments Users and any interested parties should please weigh in - this is in a discussion phase! labels Feb 14, 2019

billdenney mentioned this issue Feb 20, 2019

Function to just make everything plain text #273

Closed

Tazinho mentioned this issue May 20, 2019

Small improvements on make_clean_names() and clean_names() #302

Closed

sfirke added this to the v2.0 milestone Mar 7, 2020

billdenney mentioned this issue Mar 11, 2020

clean_names rewrite #340

Merged

billdenney closed this as completed in #340 Mar 12, 2020

sfirke mentioned this issue Mar 12, 2020

Tweaks to clean_names documentation and maybe tests #344

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

clean_names inserts unwanted underscore "_" in between NON-ASCII characters #268

clean_names inserts unwanted underscore "_" in between NON-ASCII characters #268

hidekoji commented Feb 9, 2019 •

edited

Loading

hidekoji commented Feb 10, 2019

sfirke commented Feb 14, 2019

billdenney commented Feb 14, 2019

sfirke commented Mar 15, 2019

Tazinho commented Mar 15, 2019 •

edited

Loading

sfirke commented Mar 7, 2020

billdenney commented Mar 11, 2020

sfirke commented Mar 11, 2020

clean_names inserts unwanted underscore "_" in between NON-ASCII characters #268

clean_names inserts unwanted underscore "_" in between NON-ASCII characters #268

Comments

hidekoji commented Feb 9, 2019 • edited Loading

Bug reports

Version

Description

hidekoji commented Feb 10, 2019

sfirke commented Feb 14, 2019

billdenney commented Feb 14, 2019

sfirke commented Mar 15, 2019

Tazinho commented Mar 15, 2019 • edited Loading

sfirke commented Mar 7, 2020

billdenney commented Mar 11, 2020

sfirke commented Mar 11, 2020

hidekoji commented Feb 9, 2019 •

edited

Loading

Tazinho commented Mar 15, 2019 •

edited

Loading