data formatting, internal and external #5

topepo · 2022-08-18T19:36:50Z

Just some thoughts about data structures...

This will be much more informed when we have examples of more complex experiments and complex instrument results.

"External format"

This is the shape of the data as the user has it.

There are a few ways that the data could be formatted by the user. I'll use the tidyr terminology of "longer" and "wider".

Wider would be where the wavelength values are common across samples and the intensity data are in columns, one column per wavelength. The number of rows probably represents the number of samples in the data. The meats data in the modeldata package is formatted like this:

> meats %>% relocate(water, fat, protein)
# A tibble: 215 × 103
   water   fat protein x_001 x_002 x_003 x_004 x_005 x_006 x_007 x_008 x_009 x_010
   <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  60.5  22.5    16.7  2.62  2.62  2.62  2.62  2.62  2.62  2.62  2.62  2.63  2.63
 2  46    40.1    13.5  2.83  2.84  2.84  2.85  2.85  2.86  2.86  2.87  2.87  2.88
 3  71     8.4    20.5  2.58  2.58  2.59  2.59  2.59  2.59  2.59  2.60  2.60  2.60
 4  72.8   5.9    20.7  2.82  2.82  2.83  2.83  2.83  2.83  2.83  2.84  2.84  2.84
 5  58.3  25.5    15.5  2.79  2.79  2.79  2.79  2.80  2.80  2.80  2.80  2.81  2.81
 6  44    42.7    13.7  3.01  3.02  3.02  3.03  3.03  3.04  3.04  3.05  3.06  3.06
 7  44    42.7    13.7  2.99  2.99  3.00  3.01  3.01  3.02  3.02  3.03  3.04  3.04
 8  69.3  10.6    19.3  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.53  2.54  2.54
 9  61.4  19.9    17.7  3.27  3.28  3.29  3.29  3.30  3.31  3.31  3.32  3.33  3.33
10  61.4  19.9    17.7  3.40  3.41  3.41  3.42  3.43  3.43  3.44  3.45  3.46  3.47
# … with 205 more rows, and 90 more variables: x_011 <dbl>, x_012 <dbl>,
#   x_013 <dbl>, x_014 <dbl>, x_015 <dbl>, x_016 <dbl>, x_017 <dbl>, x_018 <dbl>,
#   x_019 <dbl>, x_020 <dbl>, x_021 <dbl>, x_022 <dbl>, x_023 <dbl>, x_024 <dbl>,
#   x_025 <dbl>, x_026 <dbl>, x_027 <dbl>, x_028 <dbl>, x_029 <dbl>, x_030 <dbl>,
#   x_031 <dbl>, x_032 <dbl>, x_033 <dbl>, x_034 <dbl>, x_035 <dbl>, x_036 <dbl>,
#   x_037 <dbl>, x_038 <dbl>, x_039 <dbl>, x_040 <dbl>, x_041 <dbl>, x_042 <dbl>,
#   x_043 <dbl>, x_044 <dbl>, x_045 <dbl>, x_046 <dbl>, x_047 <dbl>, x_048 <dbl>, …

A longer version would have one column for the wavelength (or some other frequency-type index) and another column for the outcome (e.g., intensity or absorption).

For the meats data, that would look like

> meat_longer
# A tibble: 21,500 × 6
   water   fat protein sample intensity wavelength
   <dbl> <dbl>   <dbl>  <int>     <dbl>      <dbl>
 1  60.5  22.5    16.7      1      2.62          1
 2  60.5  22.5    16.7      1      2.62          2
 3  60.5  22.5    16.7      1      2.62          3
 4  60.5  22.5    16.7      1      2.62          4
 5  60.5  22.5    16.7      1      2.62          5
 6  60.5  22.5    16.7      1      2.62          6
 7  60.5  22.5    16.7      1      2.62          7
 8  60.5  22.5    16.7      1      2.62          8
 9  60.5  22.5    16.7      1      2.63          9
10  60.5  22.5    16.7      1      2.63         10
# … with 21,490 more rows

We should be able to work with data in either format.

"Internal format"

Internal to the recipe, the longer format is better, but we probably want to store the data in a more compact way.

For each unique combination of the non-measurement columns, we should nest the spectroscopy data into a single list-column entry.

For the meat data (in longer format), that would be

> meat_grouped
# A tibble: 215 × 5
   water   fat protein sample      .measurements
   <dbl> <dbl>   <dbl>  <int> <list<tibble[,2]>>
 1  60.5  22.5    16.7      1          [100 × 2]
 2  46    40.1    13.5      2          [100 × 2]
 3  71     8.4    20.5      3          [100 × 2]
 4  72.8   5.9    20.7      4          [100 × 2]
 5  58.3  25.5    15.5      5          [100 × 2]
 6  44    42.7    13.7      6          [100 × 2]
 7  44    42.7    13.7      7          [100 × 2]
 8  69.3  10.6    19.3      8          [100 × 2]
 9  61.4  19.9    17.7      9          [100 × 2]
10  61.4  19.9    17.7     10          [100 × 2]
# … with 205 more rows

The rows again reflect the total number of samples and .measurements is a list column with the assay results:

> meat_grouped$.measurements[[1]]
# A tibble: 100 × 2
   intensity wavelength
       <dbl>      <dbl>
 1      2.62          1
 2      2.62          2
 3      2.62          3
 4      2.62          4
 5      2.62          5
 6      2.62          6
 7      2.62          7
 8      2.62          8
 9      2.63          9
10      2.63         10
# … with 90 more rows

We could have an initial function that makes this conversion, something like step_spectra_collect(outcome, index). (I think that we could have step names that start with step_spectra_* or something.)
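As a rough sketch of what such a conversion could do for long-format input, here is a plain tidyr version. The helper name collect_measurements() and the toy data are assumptions made up for illustration; this is not a recipes step and not an existing function in any package.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not part of any package): nest the outcome and index
# columns of a long-format data frame into a `.measurements` list column.
# All remaining columns identify the sample.
collect_measurements <- function(data, outcome, index) {
  data %>%
    nest(.measurements = c({{ outcome }}, {{ index }}))
}

# Toy long-format data, made up for illustration: 2 samples x 3 wavelengths
toy_longer <- tibble(
  sample     = rep(1:2, each = 3),
  water      = rep(c(60.5, 46), each = 3),
  wavelength = rep(1:3, times = 2),
  intensity  = c(2.62, 2.62, 2.63, 2.83, 2.84, 2.84)
)

toy_grouped <- collect_measurements(toy_longer, intensity, wavelength)
toy_grouped
```

Each row of toy_grouped is one sample, and toy_grouped$.measurements[[1]] holds that sample's intensity and wavelength values.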

Here's some example code to go between formats for two examples:

library(janitor)
library(tidymodels)

# ------------------------------------------------------------------------------

tidymodels_prefer()
theme_set(theme_bw())

# ------------------------------------------------------------------------------

data(meats)

meat_longer <-
  meats %>%
  mutate(sample = row_number()) %>%
  pivot_longer(c(starts_with("x_")), names_to = "name", values_to = "intensity") %>%
  mutate(wavelength = as.numeric(gsub("x_", "", name))) %>%
  select(-name)

meat_grouped <-
  meat_longer %>%
  group_by(water, fat, protein, sample) %>% 
  group_nest(.key = ".measurements") %>% 
  arrange(sample)

# ------------------------------------------------------------------------------

load(url("https://github.com/topepo/FES/blob/master/Data_Sets/Pharmaceutical_Manufacturing_Monitoring/small_scale.RData?raw=true"))

pharma_longer <-
  small_scale %>%
  clean_names() %>% 
  select(-batch_sample) %>% 
  pivot_longer(c(starts_with("x")), names_to = "name", values_to = "intensity") %>%
  mutate(wavelength = as.numeric(gsub("x", "", name))) %>%
  select(-name)

pharma_grouped <-
  pharma_longer %>%
  # batch_sample was dropped above, so it cannot be used as a grouping column
  group_by(batch_id, sample, glucose) %>% 
  group_nest(.key = ".measurements") %>% 
  arrange(sample)
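The code above goes from wider to longer to nested. For completeness, here is a minimal sketch of the reverse direction, assuming the nested layout described earlier. The toy data are made up for illustration and stand in for something like meat_grouped.

```r
library(dplyr)
library(tidyr)

# Toy nested data, made up for illustration (stands in for meat_grouped)
toy_grouped <- tibble(
  sample = 1:2,
  water  = c(60.5, 46),
  .measurements = list(
    tibble(intensity = c(2.62, 2.62, 2.63), wavelength = 1:3),
    tibble(intensity = c(2.83, 2.84, 2.84), wavelength = 1:3)
  )
)

# Nested -> longer: one row per sample/wavelength combination
toy_longer <- toy_grouped %>% unnest(.measurements)

# Longer -> wider: one column per wavelength, zero-padded like the meats data
toy_wider <-
  toy_longer %>%
  mutate(name = sprintf("x_%03d", wavelength)) %>%
  select(-wavelength) %>%
  pivot_wider(names_from = name, values_from = intensity)
```

Being able to round-trip like this suggests the nested format loses no information relative to either external format.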


JamesHWade · 2023-01-04T21:47:46Z

A first pass at this is addressed by #7. I'm sure we can make it a lot better but it "works." Feedback is more than welcome since I'm still very much in "learning" mode for recipes.

topepo · 2023-09-11T21:19:35Z

A friend and I were working on a data set like this the other day, prompting me to get off my 🍑 a bit on this.

Would it make sense to:

1. Have two different recipe steps that collate the data: one for wide inputs and another for long inputs?
2. Use a common step prefix in the package? Maybe step_spectra_input_wide() and step_spectra_input_long(), and later things like step_spectra_{baseline subtract technique} and so on. Tab-complete has been very helpful for recipe step names.
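To make the wide-input case concrete, here is a hedged sketch of the collation as a plain function rather than a recipe step. The name collate_wide(), its arguments, and the toy data are all hypothetical and only illustrate the idea.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper, not an existing recipes step: collate wide-format
# measurement columns into a `.measurements` list column.
collate_wide <- function(data, cols, prefix = "x_") {
  data %>%
    # Temporary row id so samples with identical non-measurement values
    # are not merged when nesting
    mutate(.row = row_number()) %>%
    pivot_longer({{ cols }}, names_to = "location", values_to = "value") %>%
    # Recover the numeric index from the column name, e.g. "x_002" -> 2
    mutate(location = as.numeric(sub(prefix, "", location, fixed = TRUE))) %>%
    nest(.measurements = c(value, location)) %>%
    select(-.row)
}

# Toy wide data, made up for illustration: two samples sharing the same
# `water` value
toy_wide <- tibble(
  water = c(44, 44),
  x_001 = c(3.01, 2.99),
  x_002 = c(3.02, 2.99)
)

toy_nested <- collate_wide(toy_wide, starts_with("x_"))
```

The temporary row id matters because two samples can share every non-measurement value (rows 6 and 7 of the meats data do), and without it their measurements would be nested together.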

topepo · 2023-09-11T21:21:58Z

Hmm. Is "spectra" too specific?

JamesHWade · 2023-09-12T00:33:22Z

I like wide vs long for function names. "Spectra" is a bit too specific. It works for a lot of the acronym soup of measurement science (e.g., NMR, MS, IR, UV/VIS) but misses on others (e.g., chromatogram, thermogram). Are step_measure_input_wide() and step_measure_input_long() too generic?

topepo · 2023-09-15T21:25:44Z

"Are step_measure_input_wide() and step_measure_input_long() too generic?"

Nope!

I'll work on a PR and then another to re-do the data into long and wide formats.

JamesHWade moved this to 📋 Backlog in A recipes extension for measurement data Jan 3, 2023

JamesHWade added this to A recipes extension for measurement data Jan 3, 2023

JamesHWade moved this from 📋 Backlog to 🏗 In progress in A recipes extension for measurement data Jan 3, 2023

JamesHWade self-assigned this Jan 3, 2023

JamesHWade moved this from 🏗 In progress to 👀 In review in A recipes extension for measurement data Jan 4, 2023
