Function to just make everything plain text #273

Closed
sfirke opened this issue Feb 20, 2019 · 15 comments · Fixed by #366

@sfirke
Owner

sfirke commented Feb 20, 2019

Like clean_names but for dealing with encoding of character variables. I postulate that most users don't care about encoding, they don't want to know about it, they just hate that their results in RMarkdown look like Students who earn<U+202F>As<U+202F>in my classes.

Is there a function(s) we can wrap that will put everything in plain text and make stuff like curly quotes (#91), narrow spaces, etc. just go away?

@sfirke added the "question" and "seeking comments" labels Feb 20, 2019
@sfirke
Owner Author

sfirke commented Feb 20, 2019

Would it be as simple as mutating all character columns with stringi::stri_trans_general("latin-ascii")? https://stackoverflow.com/a/9935242/4470365
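To make the "mutate all character columns" idea concrete, here is a minimal sketch in base R plus stringi. The helper name `chars_to_ascii` and the toy data frame are made up for illustration; only `stringi::stri_trans_general()` and the "Latin-ASCII" transform come from the discussion.

```r
library(stringi)

# Hypothetical helper (not a janitor function): transliterate every
# character column of a data frame and leave other columns untouched.
chars_to_ascii <- function(dat) {
  is_chr <- vapply(dat, is.character, logical(1))
  dat[is_chr] <- lapply(dat[is_chr], stri_trans_general, id = "Latin-ASCII")
  dat
}

# Toy data: curly double quotes in a character column, plus an integer column.
df <- data.frame(
  grade = c("\u201cA\u201d", "B"),
  n = c(3L, 5L),
  stringsAsFactors = FALSE
)

out <- chars_to_ascii(df)
stri_enc_isascii(out$grade)  # check whether each value is now pure ASCII
```

Factor columns are skipped by this sketch; whether they should be converted too is a design decision the thread does not settle.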

@billdenney
Collaborator

I use clean_names() in the hope that it will turn all non-ASCII characters into something easily usable in ASCII. I know that not all users have the same use case (as was recently shown in #268, and in general for languages with non-ASCII character sets).

Maybe it could be ascii_names()?

Related to this, perhaps some way to save the names for use in proper_names() (#271) down the line would help? (The way to save the names is probably for the user to store them in a variable for use down the line.)

@rgknight
Collaborator

Is this for the contents of character columns or for names?

Would it do anything to html in character columns?

@sfirke
Owner Author

sfirke commented Feb 21, 2019 via email

@sfirke
Owner Author

sfirke commented Mar 15, 2019

I'm still excited about this!

This function would take a character string [or vector] and return it with non-ASCII characters converted. For instance:

> x <- c("it\u2009s", "it\u201Bs")
> x
[1] "it s" "it‛s"
> y <- stringi::stri_trans_general(x, "latin-ascii")
> y
[1] "it s" "it's"
> Encoding(x)
[1] "UTF-8" "UTF-8"
> Encoding(y)
[1] "unknown" "unknown" # = ASCII?

Is it as simple as stringi::stri_trans_general(x, "latin-ascii")? It could be more complex if it covered use cases or edge cases I'm not thinking of, but even just wrapping this one-liner as a function like make_plain_text or make_ascii_text would be highly useful I think. I've been there as a beginner, wondering why my strings aren't matching, then I find a curly quote but have no idea what to do next.
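A minimal sketch of what such a wrapper might look like, using one of the names floated above. This is an illustration under the assumption that the function does nothing more than call `stri_trans_general()`; it is not an implementation that exists in janitor.

```r
# Hypothetical wrapper, assuming it only needs to call the stringi one-liner.
make_ascii_text <- function(x) {
  stopifnot(is.character(x))
  stringi::stri_trans_general(x, id = "Latin-ASCII")
}

typed  <- "it's"        # straight apostrophe typed in R
pasted <- "it\u2019s"   # curly apostrophe, e.g. pasted from a word processor

typed == pasted                   # the two strings differ
make_ascii_text(pasted) == typed  # they match after transliteration
```

This is the beginner scenario described above: two strings that look the same but fail to match until the curly quote is converted.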

@sfirke sfirke added this to the v1.2 milestone Apr 17, 2019
@sfirke
Owner Author

sfirke commented Apr 17, 2019

Wondering if I can squeeze this in for v1.2 this week. It's such a simple function I should be able to. The only risk I see is that it's insufficient / not useful and becomes cruft that clutters the namespace. But I think it could be gold in some situations.

@sfirke sfirke self-assigned this Apr 17, 2019
@sfirke sfirke removed this from the v1.2 milestone Apr 20, 2019
@sfirke
Owner Author

sfirke commented Apr 20, 2019

Still think this is worthy, can't pull it off today/tomorrow.

@sfirke
Owner Author

sfirke commented Apr 23, 2019

Hrm, this doesn't remove the Unicode replacement character \ufffd or \u000a0, for instance:

> x <- "hi \ufffd \u000a0 bye"
> x
[1] "hi � \n0 bye"
> stri_trans_general(x, "latin-ascii")
[1] "hi � \n0 bye"
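Two things seem worth noting about this transcript. First, R reads "\u000a0" as \u000a (a newline) followed by the character 0, which matches the "\n0" in the printed output; a non-breaking space would be written "\u00a0". Second, characters with no Latin-ASCII mapping, such as \ufffd, pass through unchanged. One possible fallback, sketched here as an assumption rather than anything decided in this thread, is to drop whatever is still non-ASCII after transliteration. The helper name `to_ascii_strict` is hypothetical.

```r
library(stringi)

# Hypothetical helper (not part of janitor): transliterate first, then
# delete anything that still has no ASCII representation.
to_ascii_strict <- function(x) {
  y <- stri_trans_general(x, id = "Latin-ASCII")
  # With sub = "", iconv() removes bytes it cannot convert to ASCII.
  iconv(y, from = "UTF-8", to = "ASCII", sub = "")
}

x <- "hi \ufffd \u00a0 bye"
res <- to_ascii_strict(x)
stri_enc_isascii(res)  # check that nothing non-ASCII is left
```

Deleting characters is lossy, so a stricter step like this probably belongs behind an explicit opt-in rather than in the default behaviour.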

@hlynurhallgrims

This is maybe a bit tangential, but within a discussion of something "like clean_names" it's hopefully OK to voice this.

I personally use a (very) slightly modified version of the clean_names() function to clean character vectors that I use when programmatically creating filenames based on various parameters. I use it a lot, so I was wondering if there's maybe value in having a "sibling" function to clean_names within the janitor package that does (almost) exactly the same thing, but for character vectors instead of names?

library(magrittr)  # provides the %>% pipe used below

clean_string <- function(string, case = c("snake", "lower_camel", "upper_camel",
                                          "screaming_snake", "lower_upper", "upper_lower", "all_caps",
                                          "small_camel", "big_camel", "parsed", "mixed"))
{
  case <- match.arg(case)

  string %>%
    gsub("'", "", .) %>%                        # drop single quotes
    gsub("\"", "", .) %>%                       # drop double quotes
    gsub("%", ".percent_", .) %>%               # spell out percent signs
    gsub("#", ".number_", .) %>%                # spell out number signs
    gsub("^[[:space:][:punct:]]+", "", .) %>%   # strip leading spaces/punctuation
    snakecase::to_any_case(case = case, sep_in = "\\.", transliterations = c("Latin-ASCII"),
                           parsing_option = 4)
}

@billdenney
Collaborator

@hlynurhallgrims, I think that you’re looking for make_clean_names().

@hlynurhallgrims

@billdenney, thanks for the reply. The difference between the suggestion and make_clean_names is that the latter avoids both duplicates and values that start with a numeral. That's what I'm proposing be left out of a sibling function for everyday string cleaning.

@statzhero

For what it's worth, I had some horrible experiences with the encodings of various hyphens and dashes.

@billdenney
Collaborator

FYI, I've been working on an unrelated problem today, and I learned what I think is the best way to convert to ASCII:
stringi::stri_trans_general(x, id="Any-Latin;Greek-Latin;Latin-ASCII")

I'm adding that to the documentation of make_clean_names(), and will close the issue with it. I think that covers the need without being overly complex.
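As a rough illustration of why the chained id might matter, the sketch below compares it with plain "Latin-ASCII" on a few strings. The sample strings are made up for this example; the expectation is that the Greek and Cyrillic inputs only become ASCII once "Any-Latin" has converted them to Latin script.

```r
library(stringi)

# Made-up sample: curly quotes, accented Latin, Greek, and Cyrillic text.
x <- c(
  "\u201ccurly\u201d",
  "na\u00efve caf\u00e9",
  "\u0395\u03bb\u03bb\u03ac\u03b4\u03b1",
  "\u041c\u043e\u0441\u043a\u0432\u0430"
)

latin_only <- stri_trans_general(x, id = "Latin-ASCII")
chained    <- stri_trans_general(x, id = "Any-Latin;Greek-Latin;Latin-ASCII")

# Compare which elements are pure ASCII after each transform.
data.frame(
  latin_only_ascii = stri_enc_isascii(latin_only),
  chained_ascii    = stri_enc_isascii(chained)
)
```

Transforms in a compound id are applied left to right, so the script conversion runs before "Latin-ASCII" strips the remaining diacritics.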

@sfirke
Owner Author

sfirke commented Apr 11, 2020 via email

@sfirke
Owner Author

sfirke commented Feb 12, 2021

I just zapped some pesky curly quotes using your recommendation above, Bill. It may not be in janitor, officially, but that to-ASCII transliteration command rules 😎
