clean_names rewrite #340

billdenney · 2020-03-04T18:28:56Z

I'm vying to win the award for most issues closed with a single PR! :)

Fixes #268
Fixes #271
Fixes #283
Fixes #331
Fixes #339
Fixes #316
Fixes #204
Fixes #252
Replaces #338
Replaces #254
Replaces #303

This rewrites make_clean_names() and clean_names() to address many issues, with the general goal of increasing flexibility and decreasing locale-dependence.

Description

All arguments to make_clean_names() are now accessible from clean_names()
The tbl_graph method is now included in clean_names()
General replacements are now available for any text, rather than only the fixed replacements for the characters " ' % #
stringi and stringr are now used for Unicode-aware regular expression matching, which reduces locale dependence.
Hopefully, the only locale-dependent part is the use of make.names(), and that could possibly become optional or be removed to remove locale-dependence entirely. (Removing it would remove functionality, so if removed, it would need to be replaced with a locale-independent version.)
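
As a rough illustration of the locale-independent transliteration idea described above (not the code in this PR), the sketch below converts an accented string to ASCII with stringi's "Latin-ASCII" transform. It assumes stringi is installed and is guarded so that it is skipped otherwise; the variable names are made up for the example.

```r
# Sketch only: runs the transliteration if stringi is available.
if (requireNamespace("stringi", quietly = TRUE)) {
  accented <- "tats\u00e4chlich"  # "tatsächlich", written with an escape
  # ICU-based transliteration; does not depend on the session locale
  ascii <- stringi::stri_trans_general(accented, "Latin-ASCII")
  print(ascii)
  # [1] "tatsachlich"
}
```

Unlike iconv(), whose transliteration results can differ between platforms and locales, the ICU transform is applied the same way everywhere.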

codecov · 2020-03-04T19:09:05Z

Codecov Report

Merging #340 into master will not change coverage.
The diff coverage is 100%.

@@          Coverage Diff          @@
##           master   #340   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files          26     26           
  Lines         977    996   +19     
=====================================
+ Hits          977    996   +19

Impacted Files	Coverage Δ
R/clean_names.R	`100% <100%> (ø)`	⬆️
R/make_clean_names.R	`100% <100%> (ø)`	⬆️

sfirke · 2020-03-04T19:13:52Z

Awesome, I will review as soon as I can. If make.names() is locale dependent then I'd love to achieve the same thing in a more locale-independent way. I appreciate you calling out guiding principles here, that change would feel in line with those too.

I just merged #338 (a) because they submitted it already before we got to this point (b) to welcome a new contributor. But it looks like that causes a conflict with your pull request, sorry to make more work for you Bill.

Tazinho · 2020-03-04T21:47:03Z

@billdenney @sfirke Hi guys, I believe some parts of the body of make_clean_names() can be even shorter (if you are willing to break some backward compatibility), and that could probably close some more issues. As I was not following the whole discussion regarding the rewrite of make_clean_names(), the following piece of code might need adjusting in some smaller parts, such as "Latin-ASCII" as the default transliteration (note that snakecase allows the same transliteration chains as stringi does). Anyway, I wanted to contribute this because I believe it would help a lot with maintenance if snakecase::to_any_case() handled as many of the make_clean_names() features as possible (and it also gives me much more user feedback for snakecase :).

Some of the advantages of the following would be

no make.names() is involved. (Though one might need to test this for the special character behaviour as well before publishing. Note that I did not read the whole discussion about the ASCII translation and the reasons for it, but I understood that it is somehow related to the locale; snakecase uses stringr internally and therefore typically only the English locale.) EDIT: I always forget that we really need make.names() or another mechanism if we want syntactically correct names (from help(make.names): "A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as ".2way" are not valid, and neither are the reserved words.").
the boilerplate code for the translation is included in the snakecase call, and the transliterations argument would fix "feature request: custom translation of symbols such as % # in clean_names()" (#316).
the unique_sep argument would fix "clean_names() creates duplicate names" (#251) and also reduce the boilerplate code for adding the suffixes. (Note that this is a small breaking change, as the counter for duplicates in make.unique(), which is what snakecase uses internally, differs from the current implementation in janitor.)
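
For readers less familiar with the two base R helpers mentioned in the list above, here is a small sketch of what they do on their own. It uses only base R; the input vectors are made up for illustration and are not taken from janitor's tests.

```r
# make.names() rewrites anything that is not a syntactically valid name.
candidates <- c(".2way", "if", "1abc", "ok_name")
syntactic <- make.names(candidates)
syntactic
# [1] "X.2way"  "if."     "X1abc"   "ok_name"

# make.unique() leaves the first occurrence alone and numbers repeats from 1.
deduplicated <- make.unique(c("a", "a", "a"), sep = "_")
deduplicated
# [1] "a"   "a_1" "a_2"
```

The numbering shown by make.unique() is the counter behaviour that the last bullet flags as a possible breaking change.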

library(snakecase)

string <- c("", "", "", 
            "bla #", "bla %", "so'so",
            "tatsächlich\"Liebe")

make_clean_names <- function(string, case = "snake", transliterations = NULL) {
  
  if(is.null(transliterations)) {
    transliterations <- c("#" = "number", "%" = "percentage", "'" = "", "\"" = "")
  }
  
  snakecase::to_any_case(string, 
                       case = case, 
                       sep_in = "[^[:alnum:]]",
                       parsing_option = 1L,
                       abbreviations = names(transliterations),
                       transliterations = c("Latin-ASCII", transliterations),
                       numerals = "asis",
                       unique_sep = "duplicated",
                       empty_fill = "V"
                      )
}

make_clean_names(string)
# [1] "V"                 "V_duplicated_1"    "V_duplicated_2"    "bla_number"       
# [5] "bla_percentage"    "soso"         "tatsachlich_liebe"

If you think this is interesting, especially the transliterations, just let me know, because then I would look rather soon into solving Tazinho/snakecase#186 and Tazinho/snakecase#181 as these might be dealbreakers.

billdenney · 2020-03-04T22:00:13Z

@Tazinho, thanks for the suggestions!

If we can punt some of the additional work to snakecase, I'd be all for it! You're correct about the issue with iconv() relating to the fact that it is locale-dependent. And, for general ASCII conversion, I want to make it not locale-dependent.

There are a few features of current make_clean_names() that your suggested version does not include such as dropping initial spaces and punctuation, and ensuring that the resulting name is usable without quoting it. (I.e. being able to use my_data$clean_name instead of my_data$` clean_name` if the input had a name of " clean name"-- note the initial space).

While I like transliterations occurring within snakecase, an underlying issue comes with things like an input of "% decrease" where if we do the transliteration after dropping initial punctuation and spaces, we would get "decrease" instead of "percent_decrease". So, I'm not sure that we can move transliterations to snakecase as we have a few different goals here.
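
The ordering problem can be seen in a few lines of base R. This is only a sketch: replace_symbols(), strip_leading() and to_snake() are hypothetical stand-ins for the real steps, not functions from janitor or snakecase.

```r
# Hypothetical stand-ins for the three steps under discussion.
replace_symbols <- function(x) gsub("%", "percent ", x, fixed = TRUE)
strip_leading <- function(x) sub("^[[:space:][:punct:]]+", "", x)
to_snake <- function(x) gsub("[^[:alnum:]]+", "_", tolower(trimws(x)))

# Replace symbols first, then strip leading punctuation: the "%" survives as a word.
replace_first <- to_snake(strip_leading(replace_symbols("% decrease")))
replace_first
# [1] "percent_decrease"

# Strip leading punctuation first: the "%" is gone before it can be replaced.
strip_first <- to_snake(replace_symbols(strip_leading("% decrease")))
strip_first
# [1] "decrease"
```

So whichever component performs the replacement has to run before any step that drops leading punctuation.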

Overall, I'm not sure that we can really shorten it as much as you suggest while maintaining all the features that we have. That said-- any amount we can simplify is good, so let's think about it a bit more. (Though we have to have it tied up by the 14th, so that CRAN can be appeased.)

Tazinho · 2020-03-04T23:43:17Z

Hmm, I believe these are very strong points that you mention. I always underestimate how tricky this can become.

> If we can punt some of the additional work to snakecase, I'd be all for it! You're correct about the issue with iconv() relating to the fact that it is locale-dependent. And, for general ASCII conversion, I want to make it not locale-dependent.

If you have some specific test cases, I'd be happy to add them to snakecase. That might be nice if it lets us get around make.names() at some point. (I'm not sure whether this can help, but I would be happy to try.)

> There are a few features of current make_clean_names() that your suggested version does not include such as dropping initial spaces and punctuation, and ensuring that the resulting name is usable without quoting it. (I.e. being able to use my_data$clean_name instead of my_data$` clean_name` if the input had a name of " clean name"-- note the initial space).

I think that with sep_in = "[^[:alnum:]]" the issue with punctuation at the beginning should be handled. (snakecase trims its output, and the resulting string will contain only alphanumerics, underscores and possibly any invalid things that the user might supply in transliterations, so one should also add a check that allows only alphanumerics and underscores in the values of the transliterations argument.) The only problem I see here is that it doesn't save us from digits at the beginning and from reserved words (help(Reserved)), so one might still need some boilerplate code at the end, e.g.

string[string != make.names(string)] <- paste0("x_", string[string != make.names(string)])

The case of x_ could depend on the supplied value for the case argument (and the value x_ could be itself an argument to make_clean_names()).
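
A minimal, self-contained sketch of that prefixing step, using only base R. The helper name prefix_invalid() and its prefix argument are hypothetical and only illustrate the idea of making the prefix configurable.

```r
# Prefix every element that make.names() would have to change.
prefix_invalid <- function(x, prefix = "x_") {
  invalid <- x != make.names(x)
  # paste0() rather than paste(): paste("x_", "123") would give "x_ 123"
  x[invalid] <- paste0(prefix, x[invalid])
  x
}

fixed <- prefix_invalid(c("123", "if", "bla_number"))
fixed
# [1] "x_123"      "x_if"       "bla_number"
```

Names that start with a digit and reserved words both get the prefix, while names that are already syntactically valid pass through unchanged.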

> While I like transliterations occurring within snakecase, an underlying issue comes with things like an input of "% decrease" where if we do the transliteration after dropping initial punctuation and spaces, we would get "decrease" instead of "percent_decrease". So, I'm not sure that we can move transliterations to snakecase as we have a few different goals here.

I am not sure how mature the abbreviations argument in snakecase is but here it works as it should and if we drop the preprocessing step in make_clean_names() this would also work in make_clean_names() (if this feature is really wanted). See the last element of string.

string <- c("", "", "", 
            "bla #", "bla %", "so'so",
            "tatsächlich\"Liebe", "# decrease")
make_clean_names(string)
# [1] "V"                 "V_DUPLICATED_1"    "V_DUPLICATED_2"    "bla_number"       
# [5] "bla_percentage"    "soso"         "tatsachlich_liebe" "number_decrease"

In summary, I'd change the main part of the suggestion to

make_clean_names <- function(string, case = "snake", transliterations = NULL) {
  
  # check whether transliterations contains disallowed characters (anything other than alphanumerics or "_")
  if(!is.null(transliterations) && any(stringr::str_detect(transliterations, pattern = "[^[:alnum:]_]"))) {
    stop("`transliterations` must only contain letters, digits and underscores")
  }
  
  if(is.null(transliterations)) {
    transliterations <- c("#" = "number", "%" = "percentage", "'" = "", "\"" = "")
  }
  
  string <- snakecase::to_any_case(string, 
                                   case = case, 
                                   sep_in = "[^[:alnum:]]",
                                   parsing_option = 1L,
                                   abbreviations = names(transliterations),
                                   transliterations = c("Latin-ASCII", transliterations),
                                   numerals = "asis",
                                   # unique_sep = "_duplicated_",
                                   empty_fill = to_any_case("V", case = case) 
  )
  
  # the case conversion for empty_fill, prefix and possibly sep in `make.unique` possibly needs a bit
  # more thought...
  prefix <- to_any_case("x", case = case, postfix = "_")
  string[string != make.names(string)] <- paste0(prefix, string[string != make.names(string)])
  # As the last line can introduce duplications, I switched the duplications part again out of
  # snakecase::to_any_case().
  make.unique(string, sep = "_duplicated_")
}
string <- c("", "", "", 
            "bla #", "bla %", "so'so",
            "tatsächlich\"Liebe",
            "# decrease", "x_123", "123")

make_clean_names(string)
# [1] "v"                  "v_duplicated_1"     "v_duplicated_2"     "bla_number"        
# [5] "bla_percentage"     "soso"               "tatsachlich_liebe"  "number_decrease"   
# [9] "x_123"              "x_123_duplicated_1"
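
The validation of the replacement values can be tried out in isolation with base R's grepl(). This is a sketch of the check only, with a hypothetical helper name, and it assumes the intent is "letters, digits and underscores only".

```r
# TRUE for any value containing something other than letters, digits or "_".
has_disallowed <- function(x) grepl("[^[:alnum:]_]", x)

checked <- has_disallowed(c("number", "per cent", "a_b", "", "50%"))
checked
# [1] FALSE  TRUE FALSE FALSE  TRUE

# Pitfall: a caret that is not the first character inside the brackets is a
# literal, so this pattern silently accepts "^" as an allowed character.
caret_allowed <- grepl("[^[:alnum:]^_]", "a^b")
caret_rejected <- grepl("[^[:alnum:]_]", "a^b")
c(caret_allowed, caret_rejected)
# [1] FALSE  TRUE
```

Empty replacement values such as "" pass the check, which matters because the defaults map the quote characters to an empty string.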

> Overall, I'm not sure that we can really shorten it as much as you suggest while maintaining all the features that we have. That said-- any amount we can simplify is good, so let's think about it a bit more. (Though we have to have it tied up by the 14th, so that CRAN can be appeased.)

I am also not sure. Also about the timeline. But if you like the approach and have some feedback I will be happy to put some more time in this (and try to fix some issues in snakecase with higher priority).

Tazinho · 2020-03-05T00:12:16Z

I already see another issue with my suggestion popping up. The "so'so" gets split by snakecase (this is correct behaviour, which I overlooked), so we get an unwanted case conversion here. Without overthinking this, I think my suggestion probably doesn't have the potential to help at the moment.

make_clean_names(string, case = "small_camel")
# [1] "v"                "v_duplicated_1"   "v_duplicated_2"   "blaNumber"       
# [5] "blaPercentage"    "soSo"             "tatsachlichLiebe" "numberDecrease"  
# [9] "x123"             "x_123"

sfirke

Looks great! I asked some questions. Everything looks good; I just want to make sure I get it all and that all the i's are dotted and t's crossed (feels like a funny phrase for this function - I should say all the umlauts are un-dotted).

billdenney · 2020-03-11T02:03:15Z

Other issues that will be resolved with this PR:
Fix #268
Fix #271
Fix #283

billdenney · 2020-03-11T20:01:56Z

The tests that are now failing are, I think, locale-specific. That gives me pause about how testable these are overall.

billdenney · 2020-03-11T20:31:13Z

@sfirke, I'm not sure what else I can do for the locale-specific tests that I've now separated out in test-clean-names.R. If this version doesn't work on Travis, we may need to skip these tests and try to fix them in 2.0.1.

I've spent a good chunk of time going down Unicode/locale rabbit holes and my main take-away is that the interaction of Unicode and locales is a horrible mess (and no two OSes have the same locales available).

sfirke · 2020-03-12T01:17:33Z

You are a saint for digging into that mess! I will take a look at this in its current state and then yeah, we can always comment out tests failing for locale-specific reasons 😁

sfirke

Looks good!

sfirke · 2020-03-12T02:34:27Z

Hey-hey, this version passes on Travis! 🎉 I'm good with merging it in, please go ahead.

Did you end up withholding any tests? If there are any tricky ones that you want to retain with this code, you could add them but comment them out.

sfirke · 2020-03-12T02:41:34Z

Not about the actual code: I could see adding a test for each of the issues being closed, like #268, #271, #283, etc., and, for the ones that are new features, like #271 and #283, adding that functionality to the examples of clean_names and/or make_clean_names and describing it in NEWS. So people know about and use the new powers! 🤺

Let's not have that stop the merging of this PR though, and if you've had enough of clean_names the last couple of weeks I could take over this aspect 😄

billdenney · 2020-03-12T03:25:31Z

I think that I added tests for all but #204, but all to_any_case arguments are now available, so that should work, too.

I will admit, I'm tiring of the morass of Unicode and locales that I've been working through on this. If you can take on the examples and other parts, that would definitely be appreciated.

Merging now!

JosiahParry and others added 24 commits October 26, 2018 12:23

create hacky sf method for

697673f

make lightweight clean names method for sf objects using make_clean_n…

dda068a

…ames

convert clean_names to a method. Create sf method

f97e8e0

update documentation

bc348c4

add tests for sf method

c4d40ca

add sf to suggests

6f2581e

requireNamespace() test for sf

3c8310f

add S3 generic utilities to register the method upon pkg loading

f4fdc4a

per advice from Gabor

reset default argument of case to show all possibilities

54989d9

add tibble and sf as suggests, formatting changes from roxygen update

33644c3

address sf dependencies causing travis to fail on linux

7d49c2c

try to get rgdal issues fixed to pass on Linux

91d57ca

oldrel + linux is failing b/c of geo dependency issues ... so remove it

56e72f7

no error message I see, behavior is bizarre, removing from testing for now

update master of my fork

7ca1590

Merge branch 'master' of https://github.com/sfirke/janitor

create simple tbl_graph method for clean_names

3b0ba44

create tests for tbl_graph method

c818bc6

add tidygraph as suggested

4da6587

update news.md to include working on tbl_graph objects

c7dea6b

add a check for tbl_graph class on said method

20ea546

update documentation

d0ad188

Starting point for clean_names rewrite

d7dfe62

Merge branch 'clean_tbl_graph' into clean_names-rewrite

525ac4a

# Conflicts: # NEWS.md # tests/testthat/test-clean-names.R

Rewrite make_clean_names to make it less locale-dependent

a986de1

sf tests work without loading the package

a6663a7

Merge branch 'master' into clean_names-rewrite

ec582e9

sfirke reviewed Mar 7, 2020

sfirke added this to the v1.3 milestone Mar 7, 2020

This was referenced Mar 7, 2020

clean names to proper names? #271

Closed

clean_names inserts unwanted underscore "_" in between NON-ASCII characters #268

Closed

billdenney mentioned this pull request Mar 7, 2020

Truncate the length of variable names #332

Open

Pass arguments to snakecase::to_any_case() and allow the user not to …

0d83ef5

…use make.names()

billdenney mentioned this pull request Mar 11, 2020

clean_names: support abbreviations argument of snakecase::to_any_case? #283

Closed

billdenney added 3 commits March 11, 2020 15:17

Merge remote-tracking branch 'remotes/origin/master' into clean_names…

f52aa7e

…-rewrite # Conflicts: # DESCRIPTION # NAMESPACE # NEWS.md # R/clean_names.R # tests/testthat/test-clean-names.R

Merge branch 'clean_names-rewrite' of https://github.com/billdenney/j…

26e3bb2

…anitor into clean_names-rewrite # Conflicts: # tests/testthat/test-clean-names.R

Address code review comments

5e04ff3

Attempt to separate and clean the locale-specific tests

a6d9af7

sfirke approved these changes Mar 12, 2020

sfirke mentioned this pull request Mar 12, 2020

improve readability of make_clean_names #303

Closed

billdenney merged commit f0425ee into sfirke:master Mar 12, 2020

sfirke mentioned this pull request Mar 12, 2020

Tweaks to clean_names documentation and maybe tests #344

Closed

billdenney deleted the clean_names-rewrite branch March 7, 2022 15:52
