fix sooo many typos in factor_foo
btupper committed Oct 1, 2024
1 parent 240dc46 commit 81c7b18
Showing 20 changed files with 295 additions and 1,336 deletions.
@@ -1,8 +1,7 @@
{
"hash": "acdfe7652612c804bb60e750bf274952",
"hash": "42c81fb9545f2f807153f895a5ec1c87",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: \"Factor foo\"\nformat: html\ndescription: \"Introduction to the cefi package\"\nauthor: \"Ben Tupper\"\ndate: \"2024-09-27\"\ncategories:\n - R-code\n - analysis\n---\n\n![https://clipart-library.com/clipart/6cr5qM4oi.htm](6cr5qM4oi.jpg)\nHave you ever been frustrated by `factors` in R. `factors` are vectors where elements have been grouped into categories which are called \"levels\". Recently we had a discussion about what makes `factors` sometimes seem opaque. One thing we agreed is that the nomenclature: \"factors\" and \"levels\" aren't as intuitive as other names might be such as \"categoricals\" and \"groups\" (or \"categories\"). Fortunately, as rose by any other name smells as sweet.\n\nMany operations in data science manipulations depend upon factored (categorical! grouped!) data. In R this is very obvious when splitting data sets, plotting when color by group and when performing by-group statistics. \n\nThe [forcatsR package](https://forcats.tidyverse.org/) from the [tidyverse](https://tidyverse.org/) does a masterful job of helping users navigate with factors. But there's no harm in looking to the base R utilities to gain a better handle of factors.\n\n## Factoring characater vectors\n\nHere we have a vector of strings (characters!) This the most obvious case - it just makes sense right out of the box. 
We can ask R to group these which it does readily in alphabetical order.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx = c(\"dog\", \"dog\", \"cat\", \"cat\", \"cat\", \"dog\", \"bird\", \"dog\", \"bird\")\nfx = factor(x)\nfx\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] dog dog cat cat cat dog bird dog bird\nLevels: bird cat dog\n```\n\n\n:::\n:::\n\n\nYou can get a vector of the levels.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlevels(fx)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"bird\" \"cat\" \"dog\" \n```\n\n\n:::\n:::\n\n\nYou can count the number of levels in the factor.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnlevels(fx)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3\n```\n\n\n:::\n:::\n\n\n### Get the level per element \n\nNow this gets a little trickier. Suppose you wanted to know what level (group? category?) each element belongs to. R can tell you which **index** into the levels vector. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.numeric(fx)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3 3 2 2 2 3 1 3 1\n```\n\n\n:::\n:::\n\n\nWhoa! Say what?\n\nWell, R is telling us that the first two elements in fx belong to the level 3 group - which is \"dog\". The next three elements belong to the \"cat\" level which is the second level. Did you catch that?\n\n### Specify you own order\n\nWhat if you want the order to be dogs, cats and then birds? Just specify those as the `levels` argument.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfx = factor(x, levels = c(\"dog\", \"cat\", \"bird\"))\nfx\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] dog dog cat cat cat dog bird dog bird\nLevels: dog cat bird\n```\n\n\n:::\n:::\n\n\n## Factoring characater integer vectors\n\nEqually intuitive is the idea behind factoring integer vectors. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx = c(3L, 0L, 0L, 3L, 9L, 9L, 0L)\nfx = factor(x)\nfx\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3 0 0 3 9 9 0\nLevels: 0 3 9\n```\n\n\n:::\n:::\n\nHere you can see that the levels (groups) are 0, 3 and 9. But if we ask for the levels you'll see that internally R is helding them as charcaters (strings)!\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlevels(fx)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"0\" \"3\" \"9\"\n```\n\n\n:::\n:::\n\nThat's just the way R handles it - it maintains the groupings (levels) as characters which are the most intuitive categorical data types.\n\nSo what happens when you ask for the fatcors `as.numeric()`?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.numeric(fx)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 2 1 1 2 3 3 1\n```\n\n\n:::\n:::\n\n\nOh, it's the indices again, just like with the anuimal example above.\n\n## Factoring characater real-number vectors\n\nSo, you should be pausing here and thinking about how R will make character grouping levels if we feed is real-numbers (not whole integers). We'll provide 6 real numbers and then see what it does...\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx = c(3.14, 2.19, 3.2, 2.0001, 0.0001, 0)\nfx = factor(x)\nfx\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.14 2.19 3.2 2.0001 1e-04 0 \nLevels: 0 1e-04 2.0001 2.19 3.14 3.2\n```\n\n\n:::\n:::\n\n\nOh, it makes one grouping level for each input value. Well, that sort of makes sense, but also brings one the realization that factoring real numbers doesn't have much value? 
\n\nWhat you can do to group real number is use `cut()`.\n\n### Use `cut()` on real numbers\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfx = cut(x, c(0,1,2,3,4), include.lowest = TRUE)\nfx\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] (3,4] (2,3] (3,4] (2,3] [0,1] [0,1]\nLevels: [0,1] (1,2] (2,3] (3,4]\n```\n\n\n:::\n:::\n\n\n\nWell, this makes a bit of sense since we are cutting into groups 0-1, 1-2, 2-3, and 3-4.\n\nThe square bracket mean \"inclusive\" `[` while the `(` means \"exclusive\" boundaries.\n\nSo, let's see the what we can know about the levels.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnlevels(fx)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 4\n```\n\n\n:::\n\n```{.r .cell-code}\nlevels(fx)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"[0,1]\" \"(1,2]\" \"(2,3]\" \"(3,4]\"\n```\n\n\n:::\n:::\n\n\nOnce again, the levels (groupings) are returned to us as strings. We could specify our own.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfx = cut(x, c(0,1,2,3,4), include.lowest = TRUE, labels = c(\"almost none\", \"low\", \"medium\", \"high\"))\nfx\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] high medium high medium almost none almost none\nLevels: almost none low medium high\n```\n\n\n:::\n:::\n\n\nThis is different than what we have seen before - in thas case the actual values have been changed to the grouping label we provided. This provides a mechanism for you to transform real numeric data to labels quickly. \n\nAnd can we get back to the numeric index mapping?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.numeric(fx)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 4 3 4 3 1 1\n```\n\n\n:::\n:::\n\nYup!\n\n### Summary\n\n`factor()` provides a means for grouping elements in a vector - they work most intuitively with character and integer vectors. Use `cut()` to do similar groupings using real numbers.",
"markdown": "---\ntitle: \"Factor foo\"\nformat: html\ndescription: \"Introduction to factors\"\nauthor: \"Ben Tupper\"\ndate: \"2024-09-30\"\ncategories:\n - R-code\n - analysis\n---\n\n\n![From https://clipart-library.com/clipart/6cr5qM4oi.htm](6cr5qM4oi.jpg){.lightbox}\n\n## Fooey!\n\nHave you ever been frustrated by `factors` in R? `factors` are vectors where elements have been grouped into categories which are called \"levels\". Recently we had a discussion about what makes `factors` sometimes seem opaque. One thing we agreed upon is that the nomenclature (\"factors\" and \"levels\") isn't as intuitive as other names might be, such as \"categoricals\" and \"groups\" (or \"categories\"). Fortunately, a rose by any other name smells as sweet.\n\nMany data science manipulations depend upon factored (categorical! grouped!) data. In R this is very obvious when splitting data sets, coloring plots by group and performing by-group statistics. \n\nThe [forcats R package](https://forcats.tidyverse.org/) from the [tidyverse](https://tidyverse.org/) does a masterful job of helping users work with factors. But there's no harm in looking to the base R utilities to get a better handle on factors.\n\n## Factoring character vectors\n\nHere we have a vector of strings (characters!). This is the most obvious case - it just makes sense right out of the box. We can ask R to group these (factor them!) 
which it does readily in alphabetical order.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx = c(\"dog\", \"dog\", \"cat\", \"cat\", \"cat\", \"dog\", \"bird\", \"dog\", \"bird\")\nfx = factor(x)\nfx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] dog dog cat cat cat dog bird dog bird\nLevels: bird cat dog\n```\n:::\n:::\n\n\nYou can get a vector of the levels.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlevels(fx)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"bird\" \"cat\" \"dog\" \n```\n:::\n:::\n\n\nYou can count the number of levels in the factor.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnlevels(fx)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3\n```\n:::\n:::\n\n\n### Get the level per element \n\nNow this gets a little trickier. Suppose you wanted to know what level (group? category?) each element belongs to. R can tell you each element's **index** into the levels vector. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.numeric(fx)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3 3 2 2 2 3 1 3 1\n```\n:::\n:::\n\n\nWhoa! Say what?\n\nWell, R is telling us that the first two elements in `fx` belong to the level 3 group - which is \"dog\". The next three elements belong to the \"cat\" level which is the 2nd level. Did you catch that?\n\n### Specify your own order\n\nWhat if you want the order to be dogs, cats and then birds? Just specify those as the `levels` argument.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfx = factor(x, levels = c(\"dog\", \"cat\", \"bird\"))\nfx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] dog dog cat cat cat dog bird dog bird\nLevels: dog cat bird\n```\n:::\n:::\n\n\n## Factoring integer vectors\n\nEqually intuitive is the idea behind factoring integer vectors. Note that we indicate to R that we are specifying integers with the trailing \"L\" after each number. 
The \"L\" comes from \"long integer\" which has its own [history](https://www.techopedia.com/definition/24004/long-integer).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx = c(3L, 0L, 0L, 3L, 9L, 9L, 0L)\nfx = factor(x)\nfx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3 0 0 3 9 9 0\nLevels: 0 3 9\n```\n:::\n:::\n\nHere you can see that the levels (groups) are 0, 3 and 9. But if we ask for the levels you'll see that internally R is holding them as characters (strings)!\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlevels(fx)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"0\" \"3\" \"9\"\n```\n:::\n:::\n\nThat's just the way R handles it - it maintains the groupings (levels) as characters which are the most intuitive categorical data types.\n\nSo what happens when you call `as.numeric()` on the factor?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.numeric(fx)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2 1 1 2 3 3 1\n```\n:::\n:::\n\n\nOh, it's the indices again, just like with the animal example above.\n\n## Factoring real-number vectors\n\nSo, you should be pausing here and thinking about how R will make character grouping levels if we feed it real numbers (not whole integers). We'll provide 6 real numbers and then see what it does...\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx = c(3.14, 2.19, 3.2, 2.0001, 0.0001, 0)\nfx = factor(x)\nfx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.14 2.19 3.2 2.0001 1e-04 0 \nLevels: 0 1e-04 2.0001 2.19 3.14 3.2\n```\n:::\n:::\n\n\nOh, it makes one grouping level for each input value. Well, that sort of makes sense, but also brings one to the realization that factoring real numbers doesn't have much value. What's the point of grouping if R makes a group for every element in the vector?\n\nWhat you can do to group real numbers is use `cut()`.\n\n### Use `cut()` on real numbers\n\n`cut()` divides a set of real numbers into groups based upon boundaries (aka \"breaks\"). 
We'll take the same collection of real numbers and cut them into groups: 0-1, 1-2, 2-3, 3-4 where the right hand boundary of each group is inclusive (and, because of `include.lowest = TRUE`, so is the left hand boundary of the first group).\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfx = cut(x, c(0,1,2,3,4), include.lowest = TRUE)\nfx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] (3,4] (2,3] (3,4] (2,3] [0,1] [0,1]\nLevels: [0,1] (1,2] (2,3] (3,4]\n```\n:::\n:::\n\n\n\nWell, 4 groups just like we specified! This makes a bit of sense since we are cutting into groups 0-1, 1-2, 2-3, and 3-4.\n\nA square bracket (`[` or `]`) marks an \"inclusive\" boundary while a parenthesis (`(` or `)`) marks an \"exclusive\" boundary.\n\nSo, let's see what we can learn about the levels.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnlevels(fx)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 4\n```\n:::\n\n```{.r .cell-code}\nlevels(fx)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"[0,1]\" \"(1,2]\" \"(2,3]\" \"(3,4]\"\n```\n:::\n:::\n\n\nOnce again, the levels (groupings) are returned to us as characters. We could specify our own special group names using the `labels` argument.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfx = cut(x, c(0,1,2,3,4), include.lowest = TRUE, labels = c(\"almost none\", \"low\", \"medium\", \"high\"))\nfx\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] high medium high medium almost none almost none\nLevels: almost none low medium high\n```\n:::\n:::\n\n\nThis is different from what we have seen before - in this case the actual values have been changed to the grouping label we provided. This provides a mechanism for you to transform real numeric data to labels quickly. \n\nAnd can we get back to the numeric index mapping?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.numeric(fx)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 4 3 4 3 1 1\n```\n:::\n:::\n\nYup!\n\n### Summary\n\n`factor()` provides a means for grouping elements in a vector - it works most intuitively with character and integer vectors. Use `cut()` to do similar groupings using real numbers.",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
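The `cut()` passage in the revised text turns on which ends of each interval are closed. A minimal base R sketch, reusing the `x` vector and breaks from the document, shows how the `right` and `include.lowest` arguments control this. The variable names `breaks`, `right_closed`, `left_closed` and `no_lowest` are illustrative and do not appear in the commit.

```r
# Same values and break points as in the document.
x <- c(3.14, 2.19, 3.2, 2.0001, 0.0001, 0)
breaks <- c(0, 1, 2, 3, 4)

# Default right = TRUE: each interval is closed on the right, (a, b].
# include.lowest = TRUE also closes the first interval on the left,
# so the value 0 lands in [0,1].
right_closed <- cut(x, breaks, include.lowest = TRUE)
levels(right_closed)
# [1] "[0,1]" "(1,2]" "(2,3]" "(3,4]"

# right = FALSE: each interval is closed on the left, [a, b).
# Here include.lowest = TRUE closes the last interval on the right.
left_closed <- cut(x, breaks, right = FALSE, include.lowest = TRUE)
levels(left_closed)
# [1] "[0,1)" "[1,2)" "[2,3)" "[3,4]"

# Without include.lowest, 0 falls outside (0,1] and becomes NA.
no_lowest <- cut(x, breaks)
is.na(no_lowest)
# [1] FALSE FALSE FALSE FALSE FALSE  TRUE
```

Choosing `right = FALSE` is a matter of convention; what matters is that a value sitting exactly on a break belongs to one group only, and that the outermost break is covered when the data can reach it.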