Handling common data formats #29

adamkucharski · 2024-11-30T08:55:25Z

Thanks for putting this useful tool together!

A couple of thoughts, having tried with a few pre-existing datasets:

Raw lab data are often stored in wide format (for easy viewing of values in Excel etc.). This typically involves either an ID and time value, then columns for biomarkers (like this flu dataset from Fonville et al, Science, 2014), or an ID then columns that concatenate year and biomarker. I wonder if there's scope to allow either wide or long data to be input? The latter layout is harder to automate, of course. Below is an example for wrangling DENV/ZIKV data from Henderson et al, eLife, 2020:

library(dplyr)
library(tidyr)

# Load the data from CSV
data_in <- read.csv("https://raw.githubusercontent.com/hendersonad/zika-sero-pacific/refs/heads/master/data/dset3-fiji-neutralizationassay.csv")

# Reshape the data into long format
long_data <- data_in |> 
  select(id, starts_with("D")) |> 
  pivot_longer(
    cols = starts_with("D"),
    names_to = "column",
    values_to = "value"
  ) |> 
  mutate(
    biomarker = sub("s\\d+", "", column),          # Drop the "s<yy>" suffix to keep the biomarker (e.g., D1, D2)
    year = paste0("20", sub("D\\ds", "", column)), # Drop the "D<n>s" prefix and expand the two-digit year
    value = ifelse(value == -Inf, 0, value)        # Replace -Inf with 0
  ) |>
  select(id, year, biomarker, value)

# View the transformed data
print(long_data)

# Write new CSV
write.csv(long_data,"data_seroviz.csv")
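
For the simpler first layout (an ID and a time column, then one column per biomarker), the reshape needs no string parsing. A minimal sketch, assuming made-up column names and values rather than the real Fonville et al data:

```r
library(tidyr)

# Hypothetical wide table: one row per individual and sampling year,
# one column per biomarker (names and values are invented for illustration)
wide_data <- data.frame(
  id = c(1, 1, 2),
  year = c(2012, 2013, 2012),
  H3N2_A = c(40, 80, 20),
  H3N2_B = c(10, NA, 160)
)

# Every column other than id and year is treated as a biomarker
long_data_simple <- pivot_longer(
  wide_data,
  cols = -c(id, year),
  names_to = "biomarker",
  values_to = "value"
)

print(long_data_simple)
```

Selecting "everything except the ID and time columns" is one way this case could be automated, provided the user tells the app which two columns hold the ID and the time.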

I notice that the current app doesn't handle NA or -Inf values (which were in the above raw datasets – the Ha Nam one also includes * for missing entries). Maybe it would be useful to allow the user to define a value to represent 'missing'? Or, easier, tell them it has to be given as NA? I notice the issue "Allow omission of values outside detection limits" #26 is already looking at undetectable titres.
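
As a rough sketch of the "user-defined missing value" idea, something like the helper below could map a set of markers to NA before plotting. The function name `normalise_missing` and its default markers are hypothetical, not part of the app, and whether -Inf should count as missing or as an undetectable titre is a separate decision (see #26):

```r
# Hypothetical helper: convert user-specified 'missing' markers to NA
normalise_missing <- function(x, missing_tokens = c("NA", "*", "-Inf", "")) {
  x_chr <- trimws(as.character(x))
  # Anything matching a marker becomes a true missing value
  x_chr[x_chr %in% missing_tokens] <- NA_character_
  as.numeric(x_chr)
}

# Example column as it might arrive from a raw CSV read as text
raw_values <- c("2.3", "*", "-Inf", "NA", "5")
clean_values <- normalise_missing(raw_values)
print(clean_values)
```

If the data are read with `read.csv()`, its `na.strings` argument (for example `na.strings = c("NA", "*")`) does a similar job at load time.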

For the above DENV/ZIKV data, I also got this warning in the app:

Some traces generated warnings
all:
    pseudoinverse used at 2013
    neighborhood radius 4.02
    reciprocal condition number 1.0336e-16
    There are other near singularities as well. 4.0804

I'm guessing this is instability in the smoothing spline? This may be a common issue for sparse data, so it could be worth making the warning more informative for less technical users?
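
The wording of the messages above matches warnings emitted by R's `stats::loess()`, so one possible approach is to catch them and substitute a plainer message. This is only a sketch: it assumes the app smooths with `loess()` (which I haven't confirmed), and `fit_with_friendly_warning` and the example data are hypothetical:

```r
# Assumption: the smoother is stats::loess(); the app's actual code may differ
fit_with_friendly_warning <- function(formula, data, ...) {
  messages <- character(0)
  fit <- withCallingHandlers(
    loess(formula, data = data, ...),
    warning = function(w) {
      # Collect the technical message and stop it reaching the user as-is
      messages <<- c(messages, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  unstable <- any(grepl("pseudoinverse|singularit|condition number", messages))
  if (unstable) {
    warning(
      "The smoothed trend may be unreliable: there are too few distinct ",
      "time points for the chosen smoothing window.",
      call. = FALSE
    )
  }
  fit
}

# Invented sparse data: many samples but only a few distinct years
set.seed(1)
sparse <- data.frame(
  year = rep(2013:2017, each = 6),
  value = rnorm(30, mean = 4)
)
fit <- fit_with_friendly_warning(value ~ year, data = sparse)
```

The original messages are kept in `messages`, so they could still be shown in an expandable "details" section for more technical users.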
