Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clean_names rewrite #340

Merged
merged 30 commits into from
Mar 12, 2020
Merged

Conversation

billdenney
Copy link
Collaborator

@billdenney billdenney commented Mar 4, 2020

I'm vying to win the award for most issues closed with a single PR! :)

Fixes #268
Fixes #271
Fixes #283
Fixes #331
Fixes #339
Fixes #316
Fixes #204
Fixes #252
Replaces #338
Replaces #254
Replaces #303

This rewrites make_clean_names() and clean_names() to address many issues, with the general goal of increasing flexibility and decreasing locale-dependence.

Description

  • All arguments to make_clean_names() are now accessible from clean_names()
  • The tbl_graph method is now included in clean_names()
  • General replacements are available for any text rather than just having the fixed replacements for "'%#
  • stringi and stringr were used with Unicode-aware regular expression matching to be less locale-dependent.
  • Hopefully, the only locale-dependent part is the use of make.names(), and that could possibly become optional or be removed to remove locale-dependence entirely. (Removing it would remove functionality, so if removed, it would need to be replaced with a locale-independent version.)

JosiahParry and others added 24 commits October 26, 2018 12:23
no error message I see, behavior is bizarre, removing from testing for now
# Conflicts:
#	NEWS.md
#	tests/testthat/test-clean-names.R
@codecov
Copy link

codecov bot commented Mar 4, 2020

Codecov Report

Merging #340 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@          Coverage Diff          @@
##           master   #340   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files          26     26           
  Lines         977    996   +19     
=====================================
+ Hits          977    996   +19
Impacted Files Coverage Δ
R/clean_names.R 100% <100%> (ø) ⬆️
R/make_clean_names.R 100% <100%> (ø) ⬆️

@sfirke
Copy link
Owner

sfirke commented Mar 4, 2020

Awesome, I will review as soon as I can. If make.names() is locale dependent then I'd love to achieve the same thing in a more locale-independent way. I appreciate you calling out guiding principles here, that change would feel in line with those too.

I just merged #338 (a) because they submitted it already before we got to this point (b) to welcome a new contributor. But it looks like that causes a conflict with your pull request, sorry to make more work for you Bill.

@Tazinho
Copy link
Contributor

Tazinho commented Mar 4, 2020

@billdenney @sfirke Hi guys, I believe some parts of the body of make_clean_names() can be even shorter (if you want to break some backword compatibility) and it could probably close some more issues. As I was not following the whole discussion regarding the rewrite of make_clean_names() the following code peace might need to be adjusted at some smaller parts like the "Latin-ASCII"as default transliteration (note snakecase also allows the same tranliteration chains as stringi does). Anyway, I wanted to contribute this as I believe it would help a lot for maintenance when snakecase::to_any_case() handles as much as possible of make_clean_names() features (and it also gives me much more user feedback for snakecase :).

Some of the advantages of the following would be

  • no make.names() is involved. (though one might need to test this for the special character behaviour as well before publishing this. Note I was not reading the whole discussion about the ASCII translation and reasons for this, but understood that it's somehow related to the locale and snakecase uses stringr internally and therefore typically only english locale.) EDIT: I always forget that we really need make.names() or another mechanism if we want syntacticly correct names (from help(make.names)"A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as ".2way" are not valid, and neither are the reserved words.").
  • the boilerplate code for the translation is included in the snakecase call and the transliterations argument would fix feature request: custom translation of symbols such as % # in clean_names() #316
  • the unique_sep would fix clean_names() creates duplicate names #251 and also reduce the boilerplate code for adding the suffixes. (Note that this is a small breaking change as the counter for duplicates is different in make.unique (which is what snakecase uses internally) compared to the current implementation in janitor).
library(snakecase)

string <- c("", "", "", 
            "bla #", "bla %", "so'so",
            "tatsächlich\"Liebe")

make_clean_names <- function(string, case = "snake", transliterations = NULL) {
  
  if(is.null(transliterations)) {
    transliterations <- c("#" = "number", "%" = "percentage", "'" = "", "\"" = "")
  }
  
  snakecase::to_any_case(string, 
                       case = case, 
                       sep_in = "[^[:alnum:]]",
                       parsing_option = 1L,
                       abbreviations = names(transliterations),
                       transliterations = c("Latin-ASCII", transliterations),
                       numerals = "asis",
                       unique_sep = "duplicated",
                       empty_fill = "V"
                      )
}

make_clean_names(string)
# [1] "V"                 "V_duplicated_1"    "V_duplicated_2"    "bla_number"       
# [5] "bla_percentage"    "soso"         "tatsachlich_liebe"

If you think this is interesting, especially the transliterations, just let me know, because then I would look rather soon into solving Tazinho/snakecase#186 and Tazinho/snakecase#181 as these might be dealbreakers.

@billdenney
Copy link
Collaborator Author

@Tazinho, thanks for the suggestions!

If we can punt some of the additional work to snakecase, I'd be all for it! You're correct about the issue with iconv() relating to the fact that it is locale-dependent. And, for general ASCII conversion, I want to make it not locale-dependent.

There are a few features of current make_clean_names() that your suggested version does not include such as dropping initial spaces and punctuation, and ensuring that the resulting name is usable without quoting it. (I.e. being able to use my_data$clean_name instead of my_data$` clean_name` if the input had a name of " clean name"-- note the initial space).

While I like transliterations occurring within snakecase, an underlying issue comes with things like an input of "% decrease" where if we do the transliteration after dropping initial punctuation and spaces, we would get "decrease" instead of "percent_decrease". So, I'm not sure that we can move transliterations to snakecase as we have a few different goals here.

Overall, I'm not sure that we can really shorten it as much as you suggest while maintaining all the features that we have. That said-- any amount we can simplify is good, so let's think about it a bit more. (Though we have to have it tied up by the 14th, so that CRAN can be appeased.)

@Tazinho
Copy link
Contributor

Tazinho commented Mar 4, 2020

Hmm, I believe these are very strong points that you mention. I always underestimate how tricky this can become.

If we can punt some of the additional work to snakecase, I'd be all for it! You're correct about the issue with iconv() relating to the fact that it is locale-dependent. And, for general ASCII conversion, I want to make it not locale-dependent.

If you have some specific testcases I'd be happy to add them to snakecase. This might be nice if we can get at some point around make.names(). (Not sure if this has potential to help then, but maybe yes, I would be happy to try).

There are a few features of current make_clean_names() that your suggested version does not include such as dropping initial spaces and punctuation, and ensuring that the resulting name is usable without quoting it. (I.e. being able to use my_data$clean_name instead of my_data$` clean_name` if the input had a name of " clean name"-- note the initial space).

I think when we have sep_in = "[^[:alnum:]]" the issue with punctuation in the beginning should be handled. (snakecase trims it's output and the resulting string will contain only alphanumerics, underscores and possibly any unvalid things that the user might supply in transliterations, so one should also add a test to allow only alphanumerics and underscores in the element slots of the transliterations argument). The only problem I see here is that it doesn't safe us from digits in the beginning and reserverd words (help(Reserved)) so one still might need some boilerplate code in the end, i.e. like

string[string != make.names(string)] <- paste("x_", string[string != make.names(string)])

The case of x_ could depend on the supplied value for the case argument (and the value x_ could be itself an argument to make_clean_names()).

While I like transliterations occurring within snakecase, an underlying issue comes with things like an input of "% decrease" where if we do the transliteration after dropping initial punctuation and spaces, we would get "decrease" instead of "percent_decrease". So, I'm not sure that we can move transliterations to snakecase as we have a few different goals here.

I am not sure how mature the abbreviations argument in snakecase is but here it works as it should and if we drop the preprocessing step in make_clean_names() this would also work in make_clean_names() (if this feature is really wanted). See the last element of string.

string <- c("", "", "", 
            "bla #", "bla %", "so'so",
            "tatsächlich\"Liebe", "# decrease")
make_clean_names(string)
# [1] "V"                 "V_DUPLICATED_1"    "V_DUPLICATED_2"    "bla_number"       
# [5] "bla_percentage"    "soso"         "tatsachlich_liebe" "number_decrease" 

In summary, I'd change the main part of the suggestion to

make_clean_names <- function(string, case = "snake", transliterations = NULL) {
  
  # check if transliterations contains unallowed characters (non alphanumeric or non "_")
  if(!is.null(transliterations) & any(str_detect(transliterations, pattern = "[^[:alnum:]^_]"))) {
    stop("`transliterations` must only contain characters, digits and underscores")
  }
  
  if(is.null(transliterations)) {
    transliterations <- c("#" = "number", "%" = "percentage", "'" = "", "\"" = "")
  }
  
  string <- snakecase::to_any_case(string, 
                                   case = case, 
                                   sep_in = "[^[:alnum:]]",
                                   parsing_option = 1L,
                                   abbreviations = names(transliterations),
                                   transliterations = c("Latin-ASCII", transliterations),
                                   numerals = "asis",
                                   # unique_sep = "_duplicated_",
                                   empty_fill = to_any_case("V", case = case) 
  )
  
  # the case conversion for empty_fill, prefix and possibly sep in `make.unique` possibly needs a bit
  # more thought...
  prefix <- to_any_case("x", case = case, postfix = "_")
  string[string != make.names(string)] <- paste0(prefix, string[string != make.names(string)])
  # As the last line can introduce duplications, I switched the duplications part again out of
  # snakecase::to_any_case().
  make.unique(string, sep = "_duplicated_")
}
string <- string <- c("", "", "", 
                      "bla #", "bla %", "so'so",
                      "tatsächlich\"Liebe",
                      "# decrease", "x_123", "123")

make_clean_names(string)
# [1] "v"                  "v_duplicated_1"     "v_duplicated_2"     "bla_number"        
# [5] "bla_percentage"     "soso"               "tatsachlich_liebe"  "number_decrease"   
# [9] "x_123"              "x_123_duplicated_1"

Overall, I'm not sure that we can really shorten it as much as you suggest while maintaining all the features that we have. That said-- any amount we can simplify is good, so let's think about it a bit more. (Though we have to have it tied up by the 14th, so that CRAN can be appeased.)

I am also not sure. Also about the timeline. But if you like the approach and have some feedback I will be happy to put some more time in this (and try to fix some issues in snakecase with higher priority).

@Tazinho
Copy link
Contributor

Tazinho commented Mar 5, 2020

I already see another issue with my suggestion popping up. The "so'so" gets split by snakecase (this is correct behaviour which I oversaw) and so, we get an unwanted case conversion here. I think without overthinking this, my suggestion probably doesn't have potential to help ATM.

make_clean_names(string, case = "small_camel")
# [1] "v"                "v_duplicated_1"   "v_duplicated_2"   "blaNumber"       
# [5] "blaPercentage"    "soSo"             "tatsachlichLiebe" "numberDecrease"  
# [9] "x123"             "x_123"  

Copy link
Owner

@sfirke sfirke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I asked some questions, everything looks good I just want to make sure I get it all and that all the i's are dotted and t's crossed (feels like a funny phrase for this function - I should say all the umlauts are un-dotted).

NEWS.md Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
R/clean_names.R Show resolved Hide resolved
R/make_clean_names.R Show resolved Hide resolved
R/make_clean_names.R Outdated Show resolved Hide resolved
tests/testthat/test-clean-names.R Show resolved Hide resolved
tests/testthat/test-clean-names.R Outdated Show resolved Hide resolved
tests/testthat/test-clean-names.R Outdated Show resolved Hide resolved
tests/testthat/test-clean-names.R Outdated Show resolved Hide resolved
@billdenney
Copy link
Collaborator Author

Other issues that will be resolved with this PR:
Fix #268
Fix #271
Fix #283

…-rewrite

# Conflicts:
#	DESCRIPTION
#	NAMESPACE
#	NEWS.md
#	R/clean_names.R
#	tests/testthat/test-clean-names.R
…anitor into clean_names-rewrite

# Conflicts:
#	tests/testthat/test-clean-names.R
@billdenney
Copy link
Collaborator Author

The tests that are now failing are, I think locale-specific. That gives me pause for how testable these are overall.

@billdenney
Copy link
Collaborator Author

@sfirke, I'm not sure what else I can do for the locale-specific tests that I've now separated out in test-clean-names.R. If this version doesn't work on Travis, we may need to skip these tests and try to fix them in 2.0.1.

I've spent a good chunk of time going down Unicode/locale rabbit holes and my main take-away is that the interaction of Unicode and locales is a horrible mess (and no two OSes have the same locales available).

@sfirke
Copy link
Owner

sfirke commented Mar 12, 2020

You are a saint for digging into that mess! I will take a look at this in its current state and then yeah, we can always comment out tests failing for locale-specific reasons 😁

Copy link
Owner

@sfirke sfirke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@sfirke
Copy link
Owner

sfirke commented Mar 12, 2020

Hey-hey, this version passes on Travis! 🎉 I'm good with merging it in, please go ahead.

Did you end up withholding any tests? If there are any tricky ones that you want to retain with this code, you could add them but comment them out.

@sfirke
Copy link
Owner

sfirke commented Mar 12, 2020

Not about the actual code: I could see adding a test each for the issues being closed like #268 , #271 , #283 etc. and for the ones that are new features, like #271 and #283 , add that functionality to the examples of clean_names and/or make_clean_names and describe in NEWS. So people know about and use the new powers! 🤺

Let's not have that stop the merging of this PR though, and if you've had enough of clean_names the last couple of weeks I could take over this aspect 😄

@billdenney
Copy link
Collaborator Author

I think that I added tests for all but #204, but all to_any_case arguments are now available, so that should work, too.

I will admit, I'm tiring of the morass of Unicode and locales that I've been working through on this. If you can take on the examples and other parts, that would definitely be appreciated.

Merging now!

@billdenney billdenney merged commit f0425ee into sfirke:master Mar 12, 2020
@billdenney billdenney deleted the clean_names-rewrite branch March 7, 2022 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment