Identifying non-data columns #1023

mattansb · 2024-10-06T10:33:08Z

In two recent bayestestR PRs (easystats/bayestestR#673 & easystats/bayestestR#672), some of the printing methods were changed to allow arbitrary columns in the resulting data frame objects. This was done by setting a new attribute called idvars that contains the names of the columns that don't hold the "statistical" information, but rather information used to identify the rows.

I've found this to be better and more stable than keeping track of all possible column names, and it is very flexible.

I wonder if this should be used across easystats? We can have several "classes" of such attributes: idvars, grouping-vars, etc.

These could be useful if "detected" by the various formatting and printing methods in parameters (which is why I've opened this issue here, but feel free to move it elsewhere) and in insight (maybe also datawizard, correlation and modelbased?).

WDYT? @easystats/core-team

strengejacke · 2024-10-06T11:33:21Z

Do you have an example that makes clear how this affects printing or how we have to change methods?

mattansb · 2024-10-07T12:07:33Z

Example data frame of CI results:

results <- data.frame(Parameter = c("q", "w"), 
                      CI = c(0.95, 0.95), 
                      CI_low = c(-1.87971309451912, 0.0409341466147453),
                      CI_high = c(2.15779289064407, 0.992114187916741),
                      method = "🦆")

results
#>   Parameter   CI      CI_low   CI_high method
#> 1         q 0.95 -1.87971309 2.1577929      🦆
#> 2         w 0.95  0.04093415 0.9921142      🦆

The old code for printing was something like this - it only kept relevant columns (e.g., the "method" column will be dropped):

OLD_format_ci <- function(x, ...) {
  # Keep only columns we want to show:
  i_keep <- colnames(x) %in% c("Parameter", "CI", "CI_low", "CI_high")
  
  insight::format_table(x[,i_keep, drop = FALSE])
}

OLD_format_ci(results)
#>   Parameter        95% CI
#> 1         q [-1.88, 2.16]
#> 2         w [ 0.04, 0.99]

However, if you need more than one column to identify a row, this breaks, because you can't store all of this information nicely in a single Parameter column:

results <- data.frame(Xval = c(1, 3),
                      Zlev = c("q", "w"), 
                      CI = c(0.95, 0.95), 
                      CI_low = c(-1.87971309451912, 0.0409341466147453),
                      CI_high = c(2.15779289064407, 0.992114187916741),
                      method = "🦆")

results
#>   Xval Zlev   CI      CI_low   CI_high method
#> 1    1    q 0.95 -1.87971309 2.1577929      🦆
#> 2    3    w 0.95  0.04093415 0.9921142      🦆

OLD_format_ci(results)
#>          95% CI
#> 1 [-1.88, 2.16]
#> 2 [ 0.04, 0.99]

Instead you need a more flexible method:

NEW_format_ci <- function(x, ...) {
  # Keep only columns we want to show:
  i_keep <- colnames(x) %in% c(attr(x, "idvars"), "CI", "CI_low", "CI_high")
  
  insight::format_table(x[,i_keep, drop = FALSE])
}

# Set the idvars attribute:
attr(results, "idvars") <- c("Xval", "Zlev")

NEW_format_ci(results)
#>   Xval Zlev        95% CI
#> 1 1.00    q [-1.88, 2.16]
#> 2 3.00    w [ 0.04, 0.99]
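
One thing to watch with this approach: `attr(x, "idvars")` returns `NULL` when the attribute was never set, so `NEW_format_ci()` as written would then drop the `Parameter` column of older-style results as well. Below is a minimal sketch of a fallback, using base R only. The helper name `get_idvars()` and its `default` argument are hypothetical and not part of any easystats package.

```r
# Hypothetical helper: return the id columns of a results data frame.
# Falls back to `default` (here "Parameter") when no "idvars" attribute is set.
get_idvars <- function(x, default = "Parameter") {
  idvars <- attr(x, "idvars", exact = TRUE)
  if (is.null(idvars)) {
    idvars <- default
  }
  # Only return names that actually exist as columns of `x`
  intersect(idvars, colnames(x))
}

# Older-style result: rows identified by a single "Parameter" column
old_style <- data.frame(Parameter = c("q", "w"), CI = 0.95)

# Newer-style result: rows identified by several columns, declared via "idvars"
new_style <- data.frame(Xval = c(1, 3), Zlev = c("q", "w"), CI = 0.95)
attr(new_style, "idvars") <- c("Xval", "Zlev")

get_idvars(old_style)
#> [1] "Parameter"
get_idvars(new_style)
#> [1] "Xval" "Zlev"
```

With something like this, the column filter could use `c(get_idvars(x), "CI", "CI_low", "CI_high")`, so objects created before the attribute existed would still print as before.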

We can extend this to also include columns that "group" rows:

NEW_print_ci_html <- function(x, ...) {
  # Keep only columns we want to show:
  i_keep <- colnames(x) %in% c(attr(x, "idvars"), "CI", "CI_low", "CI_high")
  
  x_fmt <- insight::format_table(x[,i_keep, drop = FALSE])
  
  insight::print_html(x_fmt, by = attr(x, "groupvars"))
}

results_grouped <- cbind(A = rep(c("a1", "a2"), each = 2), rbind(results, results))
results_grouped
#>    A Xval Zlev   CI      CI_low   CI_high method
#> 1 a1    1    q 0.95 -1.87971309 2.1577929      🦆
#> 2 a1    3    w 0.95  0.04093415 0.9921142      🦆
#> 3 a2    1    q 0.95 -1.87971309 2.1577929      🦆
#> 4 a2    3    w 0.95  0.04093415 0.9921142      🦆

attr(results_grouped, "idvars") <- c("A", "Xval", "Zlev")
attr(results_grouped, "groupvars") <- c("A")

NEW_print_ci_html(results_grouped)

| Xval | Zlev | 95% CI |
|------|------|--------|
| a1 | | |
| 1.00 | q | [-1.88, 2.16] |
| 3.00 | w | [ 0.04, 0.99] |
| a2 | | |
| 1.00 | q | [-1.88, 2.16] |
| 3.00 | w | [ 0.04, 0.99] |

Created on 2024-10-07 with reprex v2.1.1

strengejacke · 2024-10-07T12:43:44Z

OK, I see. I think this is something that needs to be handled in the packages' format() methods: in insight, only the "final" data frame is processed, and usually no filtering or column selection is done there.

We should then decide on the attributes' names. If I look at your code changes, you would suggest the attribute idvars for those columns that should also be included in the output, in addition to the default columns, right?

mattansb · 2024-10-27T20:31:55Z

> I think this is something that needs to be handled in the packages' format() methods

Yes, given the way things are set up now. But perhaps this can be directly adapted into insight::format_table() or insight::export_table() at some point.
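
To make the idea of a shared mechanism a bit more concrete, here is a rough sketch of a generic column selector in base R. It is not how insight::format_table() or insight::export_table() currently work. The function name `select_display_columns()` and the `stat_cols` argument are hypothetical, and the attribute names `idvars` and `groupvars` are simply the ones used in the examples above.

```r
# Hypothetical helper, base R only: choose which columns to display.
# `stat_cols` are the statistical columns a given method knows about;
# id and grouping columns are read from attributes instead of being hard-coded.
select_display_columns <- function(x, stat_cols) {
  id_cols <- attr(x, "idvars", exact = TRUE)
  group_cols <- attr(x, "groupvars", exact = TRUE)
  keep <- unique(c(group_cols, id_cols, stat_cols))
  # Preserve the original column order of `x`
  colnames(x)[colnames(x) %in% keep]
}

# Toy data for illustration only
results <- data.frame(
  A = c("a1", "a2"),
  Xval = c(1, 3),
  Zlev = c("q", "w"),
  CI = 0.95,
  CI_low = c(-1.88, 0.04),
  CI_high = c(2.16, 0.99),
  method = "x"
)
attr(results, "idvars") <- c("A", "Xval", "Zlev")
attr(results, "groupvars") <- "A"

shown <- select_display_columns(results, stat_cols = c("CI", "CI_low", "CI_high"))
shown
```

Here `shown` should contain every column except `method`, in the original column order. Each package's format() method would then only need to supply its own `stat_cols`.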

strengejacke added the Feature idea 🔥 New feature or request label Oct 9, 2024

strengejacke added this to the Release 1.0.0 milestone Oct 9, 2024
