Identifying non-data columns #1023

Open
mattansb opened this issue Oct 6, 2024 · 4 comments
Labels
Feature idea 🔥 New feature or request

@mattansb
Member

mattansb commented Oct 6, 2024

In recent bayestestR PRs (easystats/bayestestR#673 & easystats/bayestestR#672), some of the printing methods were changed to allow arbitrary columns in the resulting data frame objects. This was done by setting a new attribute called idvars that contains the names of the columns that don't hold the "statistical" information, but rather information used to identify the rows.

I've found this to be better and more stable than keeping track of all possible column names, and very flexible.

I wonder if this should be used across easystats? We could have several "classes" of such attributes: idvars, grouping-vars, etc.

These could be useful if "detected" by the various formatting and printing methods in parameters (which is why I've opened this issue here, but feel free to move it elsewhere) and in insight (maybe also datawizard, correlation and modelbased?).
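A minimal sketch of what such "detection" could look like, using base R only. The helper name display_columns() and its stat_cols argument are hypothetical and not part of any easystats package; only the idvars attribute comes from the PRs mentioned above:

```r
# Hypothetical helper: which columns should a formatting method keep?
# `idvars` is the attribute described above; `stat_cols` are the
# "statistical" columns a given method already knows about.
display_columns <- function(x, stat_cols) {
  id_cols <- attr(x, "idvars", exact = TRUE) # NULL if the attribute is not set
  # intersect() keeps the column order of `x` and ignores names not in `x`
  intersect(colnames(x), c(id_cols, stat_cols))
}

res <- data.frame(Xval = c(1, 3),
                  Zlev = c("q", "w"),
                  CI_low = c(-1.88, 0.04),
                  CI_high = c(2.16, 0.99),
                  method = "some method")
attr(res, "idvars") <- c("Xval", "Zlev")

# Columns a CI formatting method would keep for this object:
kept <- display_columns(res, stat_cols = c("CI", "CI_low", "CI_high"))
```

Objects without the attribute would simply fall back to the known statistical columns, so a method written this way should not need to change for existing outputs.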

WDYT? @easystats/core-team

@strengejacke
Member

Do you have an example that makes clear how this affects printing or how we have to change methods?

@mattansb
Member Author

mattansb commented Oct 7, 2024

Example data frame of CI results:

results <- data.frame(Parameter = c("q", "w"), 
                      CI = c(0.95, 0.95), 
                      CI_low = c(-1.87971309451912, 0.0409341466147453),
                      CI_high = c(2.15779289064407, 0.992114187916741),
                      method = "🦆")

results
#>   Parameter   CI      CI_low   CI_high method
#> 1         q 0.95 -1.87971309 2.1577929      🦆
#> 2         w 0.95  0.04093415 0.9921142      🦆

The old code for printing was something like this - it only kept the relevant columns (e.g., the "method" column is dropped):

OLD_format_ci <- function(x, ...) {
  # Keep only columns we want to show:
  i_keep <- colnames(x) %in% c("Parameter", "CI", "CI_low", "CI_high")
  
  insight::format_table(x[,i_keep, drop = FALSE])
}

OLD_format_ci(results)
#>   Parameter        95% CI
#> 1         q [-1.88, 2.16]
#> 2         w [ 0.04, 0.99]

However, if you need more than one column to identify a row, this breaks, because you can't store all of this information nicely in a single Parameter column:

results <- data.frame(Xval = c(1, 3),
                      Zlev = c("q", "w"), 
                      CI = c(0.95, 0.95), 
                      CI_low = c(-1.87971309451912, 0.0409341466147453),
                      CI_high = c(2.15779289064407, 0.992114187916741),
                      method = "🦆")

results
#>   Xval Zlev   CI      CI_low   CI_high method
#> 1    1    q 0.95 -1.87971309 2.1577929      🦆
#> 2    3    w 0.95  0.04093415 0.9921142      🦆

OLD_format_ci(results)
#>          95% CI
#> 1 [-1.88, 2.16]
#> 2 [ 0.04, 0.99]

Instead you need a more flexible method:

NEW_format_ci <- function(x, ...) {
  # Keep only columns we want to show:
  i_keep <- colnames(x) %in% c(attr(x, "idvars"), "CI", "CI_low", "CI_high")
  
  insight::format_table(x[,i_keep, drop = FALSE])
}

# Set the idvars attribute:
attr(results, "idvars") <- c("Xval", "Zlev")

NEW_format_ci(results)
#>   Xval Zlev        95% CI
#> 1 1.00    q [-1.88, 2.16]
#> 2 3.00    w [ 0.04, 0.99]

We can extend this to also include columns that "group" rows:

NEW_print_ci_html <- function(x, ...) {
  # Keep only columns we want to show:
  i_keep <- colnames(x) %in% c(attr(x, "idvars"), "CI", "CI_low", "CI_high")
  
  x_fmt <- insight::format_table(x[,i_keep, drop = FALSE])
  
  insight::print_html(x_fmt, by = attr(x, "groupvars"))
}

results_grouped <- cbind(A = rep(c("a1", "a2"), each = 2), rbind(results, results))
results_grouped
#>    A Xval Zlev   CI      CI_low   CI_high method
#> 1 a1    1    q 0.95 -1.87971309 2.1577929      🦆
#> 2 a1    3    w 0.95  0.04093415 0.9921142      🦆
#> 3 a2    1    q 0.95 -1.87971309 2.1577929      🦆
#> 4 a2    3    w 0.95  0.04093415 0.9921142      🦆

attr(results_grouped, "idvars") <- c("A", "Xval", "Zlev")
attr(results_grouped, "groupvars") <- c("A")

NEW_print_ci_html(results_grouped)
Xval  Zlev  95% CI
a1
1.00  q     [-1.88, 2.16]
3.00  w     [ 0.04, 0.99]
a2
1.00  q     [-1.88, 2.16]
3.00  w     [ 0.04, 0.99]

Created on 2024-10-07 with reprex v2.1.1

@strengejacke
Member

ok, I see. I think this is something that needs to be handled in the packages' format() methods - in insight, only the "final" data frame is processed, no filtering/column-selection is usually done there.

We should then decide on the attributes' names. Looking at your code changes, you would suggest the attribute idvars for those columns that should be included in the output in addition to the default columns, right?

strengejacke added the Feature idea 🔥 New feature or request label Oct 9, 2024
strengejacke added this to the Release 1.0.0 milestone Oct 9, 2024
@mattansb
Member Author

I think this is something that needs to be handled in the packages' format() methods

Yes, the way things are set up now. But perhaps this can be directly adapted into insight::format_table() or insight::export_table() at some point.
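As a rough sketch of that idea, and not an existing insight API: the column selection could live in one shared wrapper that then hands the reduced data frame to whatever formatter is used. The name format_with_idvars() and its stat_cols and formatter arguments are hypothetical:

```r
# Hypothetical wrapper: do the column selection in one shared place,
# then pass the reduced data frame on to any formatter function.
format_with_idvars <- function(x, stat_cols, formatter = identity, ...) {
  # Keep the identifying columns (if the attribute is set) plus the
  # statistical columns the caller asks for; everything else is dropped.
  keep <- colnames(x) %in% c(attr(x, "idvars", exact = TRUE), stat_cols)
  formatter(x[, keep, drop = FALSE], ...)
}

res <- data.frame(Xval = c(1, 3),
                  Zlev = c("q", "w"),
                  CI_low = c(-1.88, 0.04),
                  CI_high = c(2.16, 0.99),
                  method = "some method")
attr(res, "idvars") <- c("Xval", "Zlev")

# With the default `identity` formatter this only drops the "method" column:
out <- format_with_idvars(res, stat_cols = c("CI_low", "CI_high"))
```

In practice, formatter could be insight::format_table, as in the NEW_format_ci() example above; the default identity is used here only to keep the sketch free of package dependencies.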
