clean_names rewrite #340
per advice from Gabor
no error message I see, behavior is bizarre, removing from testing for now
Merge branch 'master' of https://github.com/sfirke/janitor
# Conflicts: # NEWS.md # tests/testthat/test-clean-names.R
Codecov Report
@@         Coverage Diff          @@
##         master   #340    +/-   ##
=====================================
  Coverage   100%   100%
=====================================
  Files        26     26
  Lines       977    996    +19
=====================================
+ Hits        977    996    +19
Awesome, I will review as soon as I can. I just merged #338 (a) because they submitted it already before we got to this point and (b) to welcome a new contributor. But it looks like that causes a conflict with your pull request; sorry to make more work for you, Bill.
@billdenney @sfirke Hi guys, I believe some parts of the body of Some of the advantages of the following would be
library(snakecase)
string <- c("", "", "",
"bla #", "bla %", "so'so",
"tatsächlich\"Liebe")
make_clean_names <- function(string, case = "snake", transliterations = NULL) {
if(is.null(transliterations)) {
transliterations <- c("#" = "number", "%" = "percentage", "'" = "", "\"" = "")
}
snakecase::to_any_case(string,
case = case,
sep_in = "[^[:alnum:]]",
parsing_option = 1L,
abbreviations = names(transliterations),
transliterations = c("Latin-ASCII", transliterations),
numerals = "asis",
unique_sep = "duplicated",
empty_fill = "V"
)
}
make_clean_names(string)
# [1] "V" "V_duplicated_1" "V_duplicated_2" "bla_number"
# [5] "bla_percentage" "soso" "tatsachlich_liebe"
If you think this is interesting, especially the transliterations, just let me know, because then I would look into solving Tazinho/snakecase#186 and Tazinho/snakecase#181 rather soon, as these might be dealbreakers.
@Tazinho, thanks for the suggestions! If we can punt some of the additional work to snakecase, I'd be all for it! You're correct about the issue with There are a few features of current While I like transliterations occurring within snakecase, an underlying issue comes with things like an input of Overall, I'm not sure that we can really shorten it as much as you suggest while maintaining all the features that we have. That said, any amount we can simplify is good, so let's think about it a bit more. (Though we have to have it tied up by the 14th, so that CRAN can be appeased.)
Hmm, I believe these are very strong points that you mention. I always underestimate how tricky this can become.
If you have some specific testcases I'd be happy to add them to snakecase. This might be nice if we can get at some point around
I think when we have
string[string != make.names(string)] <- paste0("x_", string[string != make.names(string)])
The case of
I am not sure how mature the abbreviations argument in snakecase is, but here it works as it should, and if we drop the preprocessing step in
string <- c("", "", "",
"bla #", "bla %", "so'so",
"tatsächlich\"Liebe", "# decrease")
make_clean_names(string)
# [1] "V" "V_DUPLICATED_1" "V_DUPLICATED_2" "bla_number"
# [5] "bla_percentage" "soso" "tatsachlich_liebe" "number_decrease"
In summary, I'd change the main part of the suggestion to
make_clean_names <- function(string, case = "snake", transliterations = NULL) {
# check whether transliterations contains disallowed characters (anything other than alphanumerics or "_")
if(!is.null(transliterations) && any(stringr::str_detect(transliterations, pattern = "[^[:alnum:]_]"))) {
stop("`transliterations` must only contain letters, digits and underscores")
}
if(is.null(transliterations)) {
transliterations <- c("#" = "number", "%" = "percentage", "'" = "", "\"" = "")
}
string <- snakecase::to_any_case(string,
case = case,
sep_in = "[^[:alnum:]]",
parsing_option = 1L,
abbreviations = names(transliterations),
transliterations = c("Latin-ASCII", transliterations),
numerals = "asis",
# unique_sep = "_duplicated_",
empty_fill = snakecase::to_any_case("V", case = case)
)
# the case conversion for empty_fill, prefix and possibly sep in `make.unique` needs a bit
# more thought...
prefix <- snakecase::to_any_case("x", case = case, postfix = "_")
string[string != make.names(string)] <- paste0(prefix, string[string != make.names(string)])
# As the previous line can introduce duplicates, I moved the de-duplication step back out of
# snakecase::to_any_case().
make.unique(string, sep = "_duplicated_")
}
string <- c("", "", "",
"bla #", "bla %", "so'so",
"tatsächlich\"Liebe",
"# decrease", "x_123", "123")
make_clean_names(string)
# [1] "v" "v_duplicated_1" "v_duplicated_2" "bla_number"
# [5] "bla_percentage" "soso" "tatsachlich_liebe" "number_decrease"
# [9] "x_123" "x_123_duplicated_1"
I am also not sure. Also about the timeline. But if you like the approach and have some feedback I will be happy to put some more time in this (and try to fix some issues in snakecase with higher priority).
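The last two steps of the suggestion above (prefixing syntactically invalid names, then de-duplicating) can be tried in isolation with base R only. This is a minimal sketch, not the code that ended up in janitor; `fix_names` is a hypothetical helper name used only for illustration.

```r
# Hypothetical helper (not part of janitor or snakecase): prefix names that
# make.names() would have to alter, then de-duplicate with a readable separator.
fix_names <- function(x, prefix = "x_", sep = "_duplicated_") {
  invalid <- x != make.names(x)          # e.g. names that start with a digit
  x[invalid] <- paste0(prefix, x[invalid])
  make.unique(x, sep = sep)              # prefixing can create new duplicates
}

fix_names(c("x_123", "123", "bla_number"))
# [1] "x_123"              "x_123_duplicated_1" "bla_number"
```

Running `make.unique()` after the prefixing step matters here: "123" only collides with "x_123" once the prefix has been added.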
I already see another issue with my suggestion popping up. The
Looks great! I asked some questions. Everything looks good; I just want to make sure I get it all and that all the i's are dotted and t's crossed (feels like a funny phrase for this function; I should say all the umlauts are un-dotted).
…-rewrite # Conflicts: # DESCRIPTION # NAMESPACE # NEWS.md # R/clean_names.R # tests/testthat/test-clean-names.R
…anitor into clean_names-rewrite # Conflicts: # tests/testthat/test-clean-names.R
The tests that are now failing are, I think, locale-specific. That gives me pause for how testable these are overall.
@sfirke, I'm not sure what else I can do for the locale-specific tests that I've now separated out in test-clean-names.R. If this version doesn't work on Travis, we may need to skip these tests and try to fix them in 2.0.1. I've spent a good chunk of time going down Unicode/locale rabbit holes and my main take-away is that the interaction of Unicode and locales is a horrible mess (and no two OSes have the same locales available).
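One way to keep locale-sensitive checks from failing on machines with a different locale is to guard them on the session's character encoding. The following is a rough base R sketch of that idea, not what test-clean-names.R actually does; `in_utf8_locale` and `run_if_utf8` are hypothetical names.

```r
# Hypothetical helper: TRUE when the session's character-type locale is UTF-8.
in_utf8_locale <- function() isTRUE(l10n_info()[["UTF-8"]])

# Hypothetical wrapper: evaluate a locale-sensitive check only in UTF-8
# sessions; otherwise report that it was skipped instead of failing.
run_if_utf8 <- function(check) {
  if (!in_utf8_locale()) {
    message("Skipping locale-sensitive check in locale: ",
            Sys.getlocale("LC_CTYPE"))
    return(invisible(NA))
  }
  check()
}

# Pure-ASCII input behaves the same in every locale, so this check is safe anywhere.
run_if_utf8(function() identical(make.names("bla_number"), "bla_number"))
```

Inside a testthat file, the same condition could be passed to `testthat::skip_if_not()` so the skip shows up in the test report.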
You are a saint for digging into that mess! I will take a look at this in its current state and then yeah, we can always comment out tests failing for locale-specific reasons 😁
Looks good!
Hey-hey, this version passes on Travis! 🎉 I'm good with merging it in, please go ahead. Did you end up withholding any tests? If there are any tricky ones that you want to retain with this code, you could add them but comment them out.
Not about the actual code: I could see adding a test each for the issues being closed like #268, #271, #283 etc. and for the ones that are new features, like #271 and #283, add that functionality to the examples of Let's not have that stop the merging of this PR though, and if you've had enough of
I think that I added tests for all but #204, but all to_any_case arguments are now available, so that should work, too. I will admit, I'm tiring of the morass of Unicode and locales that I've been working through on this. If you can take on the examples and other parts, that would definitely be appreciated. Merging now!
I'm vying to win the award for most issues closed with a single PR! :)
Fixes #268
Fixes #271
Fixes #283
Fixes #331
Fixes #339
Fixes #316
Fixes #204
Fixes #252
Replaces #338
Replaces #254
Replaces #303
This rewrites make_clean_names() and clean_names() to address many issues, with the general goal of increasing flexibility and decreasing locale-dependence.
Description
make_clean_names() are now accessible from clean_names()
tbl_graph method is now included in clean_names()
"'%#
stringi and stringr were used with Unicode-aware regular expression matching to be less locale-dependent.
make.names(), and that could possibly become optional or be removed to remove locale-dependence entirely. (Removing it would remove functionality, so if removed, it would need to be replaced with a locale-independent version.)
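The Unicode-aware matching mentioned in the description can be illustrated with stringi alone. This is a minimal sketch that assumes the stringi package is installed; `ascii_words` is a hypothetical helper for illustration, not the code in this PR.

```r
library(stringi)

# Hypothetical helper: transliterate to ASCII, then collapse separators.
ascii_words <- function(x) {
  # ICU's "Latin-ASCII" transform does not depend on the session locale.
  x <- stri_trans_general(x, "Latin-ASCII")
  # Replace each run of characters that are neither letters (\p{L}) nor
  # numbers (\p{N}) with a single underscore.
  stri_replace_all_regex(x, "[^\\p{L}\\p{N}]+", "_")
}

ascii_words("tats\u00e4chlich\"Liebe")
# [1] "tatsachlich_Liebe"
```

Because the character classes are Unicode properties rather than POSIX classes, the result should not change with the locale, which is the property the description is aiming for.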