data formatting, internal and external #5

Open
topepo opened this issue Aug 18, 2022 · 5 comments
@topepo
Collaborator

topepo commented Aug 18, 2022

Just some thoughts about data structures...

This will be much more informed when we have examples of more complex experiments and complex instrument results.

"External format"

This is the shape of the data as the user has it.

There are a few ways that the data could be formatted by the user. I'll use the tidyr terminology of "longer" and "wider".

Wider would be where the wavelength values are common across samples and the intensity data are in columns. The number of rows probably represents the number of samples in the data. The meats data in the modeldata package is formatted like this:

> meats %>% relocate(water, fat, protein)
# A tibble: 215 × 103
   water   fat protein x_001 x_002 x_003 x_004 x_005 x_006 x_007 x_008 x_009 x_010
   <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  60.5  22.5    16.7  2.62  2.62  2.62  2.62  2.62  2.62  2.62  2.62  2.63  2.63
 2  46    40.1    13.5  2.83  2.84  2.84  2.85  2.85  2.86  2.86  2.87  2.87  2.88
 3  71     8.4    20.5  2.58  2.58  2.59  2.59  2.59  2.59  2.59  2.60  2.60  2.60
 4  72.8   5.9    20.7  2.82  2.82  2.83  2.83  2.83  2.83  2.83  2.84  2.84  2.84
 5  58.3  25.5    15.5  2.79  2.79  2.79  2.79  2.80  2.80  2.80  2.80  2.81  2.81
 6  44    42.7    13.7  3.01  3.02  3.02  3.03  3.03  3.04  3.04  3.05  3.06  3.06
 7  44    42.7    13.7  2.99  2.99  3.00  3.01  3.01  3.02  3.02  3.03  3.04  3.04
 8  69.3  10.6    19.3  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.54  2.54
 9  61.4  19.9    17.7  3.27  3.28  3.29  3.29  3.30  3.31  3.31  3.32  3.33  3.33
10  61.4  19.9    17.7  3.40  3.41  3.41  3.42  3.43  3.43  3.44  3.45  3.46  3.47
# … with 205 more rows, and 90 more variables: x_011 <dbl>, x_012 <dbl>,
#   x_013 <dbl>, x_014 <dbl>, x_015 <dbl>, x_016 <dbl>, x_017 <dbl>, x_018 <dbl>,
#   x_019 <dbl>, x_020 <dbl>, x_021 <dbl>, x_022 <dbl>, x_023 <dbl>, x_024 <dbl>,
#   x_025 <dbl>, x_026 <dbl>, x_027 <dbl>, x_028 <dbl>, x_029 <dbl>, x_030 <dbl>,
#   x_031 <dbl>, x_032 <dbl>, x_033 <dbl>, x_034 <dbl>, x_035 <dbl>, x_036 <dbl>,
#   x_037 <dbl>, x_038 <dbl>, x_039 <dbl>, x_040 <dbl>, x_041 <dbl>, x_042 <dbl>,
#   x_043 <dbl>, x_044 <dbl>, x_045 <dbl>, x_046 <dbl>, x_047 <dbl>, x_048 <dbl>, …

A longer version would be where there is a column for the wavelength (or some frequency-type index) and another column for the outcome (e.g. intensity, absorption, etc).

For the meats data, that would look like

> meat_longer
# A tibble: 21,500 × 6
   water   fat protein sample intensity wavelength
   <dbl> <dbl>   <dbl>  <int>     <dbl>      <dbl>
 1  60.5  22.5    16.7      1      2.62          1
 2  60.5  22.5    16.7      1      2.62          2
 3  60.5  22.5    16.7      1      2.62          3
 4  60.5  22.5    16.7      1      2.62          4
 5  60.5  22.5    16.7      1      2.62          5
 6  60.5  22.5    16.7      1      2.62          6
 7  60.5  22.5    16.7      1      2.62          7
 8  60.5  22.5    16.7      1      2.62          8
 9  60.5  22.5    16.7      1      2.63          9
10  60.5  22.5    16.7      1      2.63         10
# … with 21,490 more rows

We should be able to work with data in either format.
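As a rough illustration of moving between the two shapes, here is a minimal sketch on a made-up three-wavelength tibble. The `toy_*` objects are invented for this example, and the zero-padded `x_###` naming is only assumed to mirror the meats columns.

```r
library(dplyr)
library(tidyr)

# Toy wide data (invented): one row per sample, one column per wavelength index.
toy_wide <- tibble(
  sample = 1:2,
  water  = c(60.5, 46.0),
  x_001  = c(2.62, 2.83),
  x_002  = c(2.62, 2.84),
  x_003  = c(2.62, 2.84)
)

# Wider -> longer: one row per sample/wavelength combination.
toy_longer <- toy_wide %>%
  pivot_longer(
    starts_with("x_"),
    names_to = "wavelength",
    names_prefix = "x_",
    names_transform = list(wavelength = as.integer),
    values_to = "intensity"
  )

# Longer -> wider: rebuild the zero-padded column names, then spread them out.
toy_back <- toy_longer %>%
  mutate(name = sprintf("x_%03d", wavelength)) %>%
  select(-wavelength) %>%
  pivot_wider(names_from = name, values_from = intensity)
```

The round trip only works this cleanly because every sample shares the same wavelength grid; with ragged grids the wide form would fill in `NA` values.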

"Internal format"

Internal to the recipe, the longer format is better, but we probably want to store the data in a more compact way.

For each unique combination of the non-measurement columns, we should nest the spectroscopy data into a single list-column element.

For the meat data (in longer format), that would be

> meat_grouped
# A tibble: 215 × 5
   water   fat protein sample      .measurements
   <dbl> <dbl>   <dbl>  <int> <list<tibble[,2]>>
 1  60.5  22.5    16.7      1          [100 × 2]
 2  46    40.1    13.5      2          [100 × 2]
 3  71     8.4    20.5      3          [100 × 2]
 4  72.8   5.9    20.7      4          [100 × 2]
 5  58.3  25.5    15.5      5          [100 × 2]
 6  44    42.7    13.7      6          [100 × 2]
 7  44    42.7    13.7      7          [100 × 2]
 8  69.3  10.6    19.3      8          [100 × 2]
 9  61.4  19.9    17.7      9          [100 × 2]
10  61.4  19.9    17.7     10          [100 × 2]
# … with 205 more rows

The rows again reflect the total number of samples and .measurements is a list column with the assay results:

> meat_grouped$.measurements[[1]]
# A tibble: 100 × 2
   intensity wavelength
       <dbl>      <dbl>
 1      2.62          1
 2      2.62          2
 3      2.62          3
 4      2.62          4
 5      2.62          5
 6      2.62          6
 7      2.62          7
 8      2.62          8
 9      2.63          9
10      2.63         10
# … with 90 more rows

We could have an initial function that makes this conversion, something like step_spectra_collect(outcome, index) (I think that we could have step names that start with step_spectra_* or something).
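To make the idea concrete, here is a minimal sketch of what such a conversion could do, written as a plain function rather than a recipe step. `collect_measurements()` is a hypothetical name invented for this example and is not part of any package; the toy data are made up as well.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not an existing API): nest the outcome and index
# columns of long-format data into a `.measurements` list column. All other
# columns are treated as identifiers and define the rows of the result.
collect_measurements <- function(data, outcome, index) {
  data %>%
    nest(.measurements = c({{ outcome }}, {{ index }}))
}

# Toy long-format data (invented): two samples, three wavelengths each.
toy_longer <- tibble(
  sample     = rep(1:2, each = 3),
  water      = rep(c(60.5, 46.0), each = 3),
  wavelength = rep(1:3, times = 2),
  intensity  = c(2.62, 2.62, 2.62, 2.83, 2.84, 2.84)
)

toy_grouped <- collect_measurements(toy_longer, intensity, wavelength)

# Each element of the list column holds one sample's measurements.
toy_grouped$.measurements[[1]]

# unnest() goes back to the longer format.
toy_restored <- unnest(toy_grouped, .measurements)
```

Treating every non-measurement column as an identifier is a design assumption of this sketch; a real step would probably need the user to be able to say which columns identify a sample.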

Here's some example code to go between formats for two examples:

library(janitor)
library(tidymodels)

# ------------------------------------------------------------------------------

tidymodels_prefer()
theme_set(theme_bw())

# ------------------------------------------------------------------------------

data(meats)

meat_longer <-
  meats %>%
  mutate(sample = row_number()) %>%
  pivot_longer(c(starts_with("x_")), names_to = "name", values_to = "intensity") %>%
  mutate(wavelength = as.numeric(gsub("x_", "", name))) %>%
  select(-name)

meat_grouped <-
  meat_longer %>%
  group_by(water, fat, protein, sample) %>% 
  group_nest(.key = ".measurements") %>% 
  arrange(sample)

# ------------------------------------------------------------------------------

load(url("https://github.com/topepo/FES/blob/master/Data_Sets/Pharmaceutical_Manufacturing_Monitoring/small_scale.RData?raw=true"))

pharma_longer <-
  small_scale %>%
  clean_names() %>% 
  # batch_sample is dropped here, so it cannot be used for grouping below
  select(-batch_sample) %>% 
  pivot_longer(c(starts_with("x")), names_to = "name", values_to = "intensity") %>%
  mutate(wavelength = as.numeric(gsub("x", "", name))) %>%
  select(-name)

pharma_grouped <-
  pharma_longer %>%
  # group only by columns that still exist in pharma_longer
  group_by(batch_id, sample, glucose) %>% 
  group_nest(.key = ".measurements") %>% 
  arrange(sample)

@JamesHWade
Owner

A first pass at this is addressed by #7. I'm sure we can make it a lot better but it "works." Feedback is more than welcome since I'm still very much in "learning" mode for recipes.

@JamesHWade JamesHWade moved this from 🏗 In progress to 👀 In review in A recipes extension for measurement data Jan 4, 2023
@topepo
Collaborator Author

topepo commented Sep 11, 2023

A friend and I were working on a data set like this the other day, prompting me to get off my 🍑 a bit on this.

Would it make sense to:

  • Have two different recipe steps that collate the data: one for wide inputs and another for long inputs?
  • Use a common step prefix in the package. So maybe step_spectra_input_wide() and step_spectra_input_long() (then later things like step_spectra_{baseline subtract technique} and so on)? Tab-complete has been very helpful for recipe step names.
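A rough sketch of how the two entry points could converge on one internal format is below. `input_wide()` and `input_long()` are hypothetical plain functions, not recipe steps, and the standardized inner column names `location` and `value` are assumptions made only for this example.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for a "long input" step: rename the user's columns to
# assumed standard names, then nest them into `.measurements`.
input_long <- function(data, outcome, index) {
  data %>%
    rename(value = {{ outcome }}, location = {{ index }}) %>%
    nest(.measurements = c(location, value))
}

# Hypothetical stand-in for a "wide input" step: pivot the measurement
# columns to long form first, then nest in the same way.
input_wide <- function(data, cols, prefix) {
  data %>%
    pivot_longer(
      {{ cols }},
      names_to = "location",
      names_prefix = prefix,
      names_transform = list(location = as.numeric),
      values_to = "value"
    ) %>%
    nest(.measurements = c(location, value))
}

# The same toy measurements (invented) in both external shapes.
wide <- tibble(sample = 1:2, x_001 = c(2.62, 2.83), x_002 = c(2.62, 2.84))
long <- tibble(
  sample     = c(1L, 1L, 2L, 2L),
  wavelength = c(1, 2, 1, 2),
  intensity  = c(2.62, 2.62, 2.83, 2.84)
)

from_wide <- input_wide(wide, starts_with("x_"), prefix = "x_")
from_long <- input_long(long, intensity, wavelength)

# Both inputs are meant to end up in the same nested structure.
all.equal(from_wide, from_long)
```

With this layout, any later step would only need to know about `.measurements`, regardless of which input function was used.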

@topepo
Collaborator Author

topepo commented Sep 11, 2023

Hmm. Is "spectra" too specific?

@JamesHWade
Owner

I like wide vs long for function names. "Spectra" is a bit too specific. It works for a lot of the acronym soup of measurement science (e.g., NMR, MS, IR, UV/VIS) but misses on others (e.g., chromatogram, thermogram). Are step_measure_input_wide() and step_measure_input_long() too generic?

@topepo
Collaborator Author

topepo commented Sep 15, 2023

"Are step_measure_input_wide() and step_measure_input_long() too generic?"

Nope!

I'll work on a PR and then another to re-do the data into long and wide formats.
