Function to just make everything plain text #273

sfirke · 2019-02-20T15:05:39Z

Like clean_names but for dealing with encoding of character variables. I postulate that most users don't care about encoding, they don't want to know about it, they just hate that their results in RMarkdown look like Students who earn<U+202F>As<U+202F>in my classes.

Is there a function(s) we can wrap that will put everything in plain text and make stuff like curly quotes (#91), narrow spaces, etc. just go away?

sfirke · 2019-02-20T15:10:12Z

Would it be as simple as mutating all character columns with stringi::stri_trans_general("latin-ascii")? https://stackoverflow.com/a/9935242/4470365
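
A minimal sketch of that idea, assuming the stringi package is installed. The helper name `ascii_character_columns` and the sample data frame are made up for illustration; neither is part of janitor.

```r
# Hypothetical helper: transliterate every character column of a data frame
# to ASCII and leave all other columns untouched.
ascii_character_columns <- function(dat) {
  is_chr <- vapply(dat, is.character, logical(1))
  dat[is_chr] <- lapply(
    dat[is_chr],
    stringi::stri_trans_general,
    id = "Latin-ASCII"
  )
  dat
}

# Illustrative data: an accented letter and a curly apostrophe
df <- data.frame(
  id = 1:2,
  note = c("caf\u00e9", "it\u2019s"),
  stringsAsFactors = FALSE
)

out <- ascii_character_columns(df)
out$note
```

Only columns that are already character vectors are changed; factors and numeric columns pass through as they are.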

billdenney · 2019-02-20T15:58:51Z

I use clean_names() in the hope that it will turn all non-ASCII characters into something easily usable in ASCII. I know that not all users have the same use case (as was recently shown in #268, and in general for languages with non-ASCII character sets).

Maybe it could be ascii_names()?

Related to this, perhaps some way to save the names for use in proper_names() (#271) down the line would help? (The way to save the names is probably for the user to just store them in a variable for use down the line.)

rgknight · 2019-02-20T17:48:59Z

Is this for the contents of character columns or for names?

Would it do anything to html in character columns?

sfirke · 2019-02-21T03:21:58Z

This is for the content of character columns. The text values themselves. As to html, I dunno. Keep it? Just nuke anything that's not plain text (I don't even know the right terms for encoding).

sfirke · 2019-03-15T03:19:17Z

I'm still excited about this!

This function would take a character string [or vector] and return it with non-ASCII characters converted. For instance:

> x <- c("it\u2009s", "it\u201Bs")
> x
[1] "it s" "it‛s"
> y <- stringi::stri_trans_general(x, "latin-ascii")
> y
[1] "it s" "it's"
> Encoding(x)
[1] "UTF-8" "UTF-8"
> Encoding(y)
[1] "unknown" "unknown" # = ASCII?

Is it as simple as stringi::stri_trans_general(x, "latin-ascii")? It could be more complex if it covered use cases or edge cases I'm not thinking of, but even just wrapping this one-liner as a function like make_plain_text or make_ascii_text would be highly useful I think. I've been there as a beginner, wondering why my strings aren't matching, then I find a curly quote but have no idea what to do next.
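
On the `# = ASCII?` question in the transcript above: R never marks pure-ASCII strings with a declared encoding, so `Encoding()` reports "unknown" for them. A more direct check is `stringi::stri_enc_isascii()`. A small sketch, assuming stringi is installed; the inputs are illustrative and differ slightly from the ones in the transcript.

```r
# Illustrative inputs: a curly apostrophe and an accented letter
x <- c("it\u2019s", "caf\u00e9")
y <- stringi::stri_trans_general(x, "Latin-ASCII")

# Strings written with \u escapes are marked as UTF-8
Encoding(x)

# Pure-ASCII strings carry no encoding mark, hence "unknown"
Encoding(y)

# Direct test: does each string contain only ASCII characters?
before <- stringi::stri_enc_isascii(x)
after <- stringi::stri_enc_isascii(y)
```

So "unknown" does not prove a string is ASCII on its own (a mis-declared native string could also be unmarked), which is why the explicit check is the safer one.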

sfirke · 2019-04-17T03:26:43Z

Wondering if I can squeeze this in for v1.2 this week. It's such a simple function I should be able to. The only risk I see is that it's insufficient / not useful and becomes cruft that clutters the namespace. But I think it could be gold in some situations.

sfirke · 2019-04-20T17:13:21Z

Still think this is worthy, can't pull it off today/tomorrow.

sfirke · 2019-04-23T16:05:44Z

Hrm, this doesn't remove the Unicode replacement character \ufffd or \u000a0, for instance:

> x <- "hi \ufffd \u000a0 bye"
> x
[1] "hi � \n0 bye"
> stri_trans_general(x, "latin-ascii")
[1] "hi � \n0 bye"
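
A note on the transcript above: R's `\u` escape reads at most four hex digits, so `"\u000a0"` is parsed as `\u000a` (a newline) followed by the character "0", which matches the `\n0` in the printed output. The no-break space is `"\u00a0"`. For characters that should simply be dropped or swapped, an explicit replacement step can be sketched as follows. The helper name `strip_extras` and the choice of replacements are assumptions for illustration, not anything janitor provides.

```r
# "\u000a0" is two code points: 10 (newline) and 48 (the digit "0")
parsed <- utf8ToInt("\u000a0")

# Hypothetical helper: drop U+FFFD and turn no-break spaces into plain spaces.
# With vectorize_all = FALSE, each pattern is applied in turn to every string.
strip_extras <- function(s) {
  stringi::stri_replace_all_fixed(
    s,
    pattern = c("\ufffd", "\u00a0"),
    replacement = c("", " "),
    vectorize_all = FALSE
  )
}

cleaned <- strip_extras("hi \ufffd\u00a0bye")
```

Here `cleaned` keeps the original space after "hi" plus the space that replaces the no-break space, so two plain spaces separate the words.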

hlynurhallgrims · 2019-05-15T15:27:09Z

This is maybe a bit tangential, but within a discussion of something "like clean_names" it's hopefully OK to voice this.

I personally use a (very) slightly modified version of the clean_names() function to clean character vectors that I use when programmatically creating filenames based on various parameters. I use it a lot, so I was wondering if there's maybe value in having a "sibling" function to clean_names within the janitor package that does (almost) exactly the same thing, but for character vectors instead of names?

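library(magrittr)  # provides the %>% pipe used below

# Clean an arbitrary character vector in the style of clean_names()
clean_string <- function(string,
                         case = c("snake", "lower_camel", "upper_camel",
                                  "screaming_snake", "lower_upper", "upper_lower",
                                  "all_caps", "small_camel", "big_camel",
                                  "parsed", "mixed")) {
  case <- match.arg(case)

  string %>%
    gsub("'", "", .) %>%                       # drop single quotes
    gsub("\"", "", .) %>%                      # drop double quotes
    gsub("%", ".percent_", .) %>%              # spell out percent signs
    gsub("#", ".number_", .) %>%               # spell out number signs
    gsub("^[[:space:][:punct:]]+", "", .) %>%  # trim leading spaces and punctuation
    # Convert case; "." acts as a word separator, non-ASCII is transliterated
    snakecase::to_any_case(case = case, sep_in = "\\.",
                           transliterations = c("Latin-ASCII"),
                           parsing_option = 4)
}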

billdenney · 2019-05-15T15:48:48Z

@hlynurhallgrims, I think that you’re looking for make_clean_names().

hlynurhallgrims · 2019-05-15T16:14:30Z

@billdenney, thanks for the reply. The difference between the suggestion and make_clean_names is that the latter avoids both duplicates and values that start with a numeral. That's what I'm proposing be left out of a sibling function for everyday string cleaning.

statzhero · 2020-02-27T18:48:20Z

For what it's worth, I had some horrible experiences with the encodings of various hyphens and dashes.

billdenney · 2020-04-11T17:53:24Z

FYI, I've been working on an unrelated problem today, and I learned what I think is the best way to convert to ASCII:
stringi::stri_trans_general(x, id="Any-Latin;Greek-Latin;Latin-ASCII")

I'm adding that to the documentation of make_clean_names(), and will close the issue with it. I think that covers the need without being overly complex.
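
A short sketch of how that chained transliterator might be wrapped, assuming stringi is installed. The function name `to_ascii` is hypothetical, and filtering the IDs against `stringi::stri_trans_list()` is an added precaution for systems where some transliterators are unavailable; it is not presented as janitor's own code.

```r
# Hypothetical wrapper around the chained transliterator suggested above
to_ascii <- function(x) {
  wanted <- c("Any-Latin", "Greek-Latin", "Latin-ASCII")
  # Keep only the transliterators this ICU build actually offers
  available <- intersect(wanted, stringi::stri_trans_list())
  if (length(available) == 0) {
    stop("None of the requested transliterators are available")
  }
  stringi::stri_trans_general(x, id = paste(available, collapse = ";"))
}

# Illustrative inputs: curly double quotes, Greek letters, an accented letter
x <- c("\u201cquoted\u201d", "\u03b1\u03b2\u03b3", "na\u00efve")
out <- to_ascii(x)

# Flag anything that is still not ASCII after the conversion
still_non_ascii <- !stringi::stri_enc_isascii(out)
```

Checking `still_non_ascii` afterwards is a cheap way to spot characters, such as the replacement character U+FFFD mentioned earlier in the thread, that the transliterators leave alone.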

sfirke · 2020-04-11T21:07:54Z

Bonus! I look forward to using this.

sfirke · 2021-02-12T20:57:01Z

I just zapped some pesky curly quotes using your recommendation above, Bill. It may not be in janitor, officially, but that to-ASCII transliteration command rules 😎

sfirke added the "question" and "seeking comments" labels Feb 20, 2019

sfirke added this to the v1.2 milestone Apr 17, 2019

sfirke self-assigned this Apr 17, 2019

sfirke removed this from the v1.2 milestone Apr 20, 2019

billdenney added a commit to billdenney/janitor that referenced this issue Apr 11, 2020

Document a way to convert to ASCII (Fix sfirke#273)

6603795

billdenney mentioned this issue Apr 11, 2020

Fix issue with stringi transliterator availability #366

Merged

billdenney closed this as completed in #366 Apr 11, 2020
