Handling common data formats #29

Open
adamkucharski opened this issue Nov 30, 2024 · 0 comments

@adamkucharski

Thanks for putting this useful tool together!

A couple of thoughts, having tried with a few pre-existing datasets:

  1. Raw lab data are often stored in wide format (for easy viewing of values in Excel etc.). This typically involves either an ID and time value followed by columns for biomarkers (like this flu dataset from Fonville et al, Science, 2014), or an ID followed by columns that concatenate year and biomarker. I wonder if there's scope to allow either wide or long data as input? The latter layout is harder to automate, of course. Below is an example of wrangling DENV/ZIKV data from Henderson et al, eLife, 2020:
library(dplyr)
library(tidyr)

# Load the data from CSV
data_in <- read.csv("https://raw.githubusercontent.com/hendersonad/zika-sero-pacific/refs/heads/master/data/dset3-fiji-neutralizationassay.csv")

# Reshape the data into long format
long_data <- data_in |> 
  select(id, starts_with("D")) |> 
  pivot_longer(
    cols = starts_with("D"),
    names_to = "column",
    values_to = "value"
  ) |> 
  mutate(
    biomarker = sub("s\\d+", "", column),   # Extract the biomarker (e.g., D1, D2)
    year = paste0("20", sub("D\\ds", "", column)), # Extract and format the year
    value = ifelse(value == -Inf, 0, value) # Replace -Inf with 0
  ) |>
  select(id, year, biomarker, value)

# View the transformed data
print(long_data)

# Write new CSV
write.csv(long_data, "data_seroviz.csv", row.names = FALSE)
  2. I notice that the current app doesn't handle NA or -Inf values (both appear in the raw datasets above; the Ha Nam one also uses * for missing entries). Maybe it would be useful to let the user define a value that represents 'missing'? Or, more simply, tell them that missing values have to be given as NA? I notice issue Allow omission of values outside detection limits #26 is already looking at undetectable titres.
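
To make the two suggestions above a bit more concrete, here is a rough sketch of how the simpler wide layout (ID, time, then one column per biomarker) could be reshaped while mapping user-defined missing markers to NA. Everything in it is an assumption for illustration: `wide_to_long()` is a hypothetical helper rather than anything in the app, the column names and toy values are made up, and treating -Inf as missing is only one option (issue #26 may prefer to handle it as below the detection limit).

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not part of the app): reshape "ID + time + one column
# per biomarker" data to long format and map missing-value markers to NA.
wide_to_long <- function(wide, id_col = "id", time_col = "year",
                         missing_tokens = c("*", "-Inf", "")) {
  wide |>
    pivot_longer(
      cols = !all_of(c(id_col, time_col)),
      names_to = "biomarker",
      values_to = "value",
      # Read every biomarker column as text first so that columns holding
      # markers such as "*" can be combined with purely numeric columns
      values_transform = list(value = as.character)
    ) |>
    mutate(
      value = trimws(value),
      value = ifelse(value %in% missing_tokens, NA_character_, value),
      value = as.numeric(value)
    )
}

# Toy data, made up for illustration only
wide <- data.frame(
  id = c(1, 1, 2),
  year = c(2013, 2015, 2013),
  H1N1 = c("40", "*", "80"),
  H3N2 = c(20, -Inf, 160)
)

long <- wide_to_long(wide)
print(long)
```

The `missing_tokens` argument is where a user-defined 'missing' value would plug in. Any other non-numeric entry would still produce a coercion warning from `as.numeric()`, which seems like reasonable behaviour for unexpected input.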

For the above DENV/ZIKV data, I also got this warning in the app:

Some traces generated warnings
all:
    pseudoinverse used at 2013
    neighborhood radius 4.02
    reciprocal condition number 1.0336e-16
    There are other near singularities as well. 4.0804

I'm guessing this is instability in the smoothing spline? This is probably a common issue with sparse data, so it might be worth making the warning more informative for less technical users.
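
For what it's worth, the wording of these messages matches what R's `loess()` emits, so the smoother may be loess rather than a spline (I haven't checked the app's code, so that is an assumption). Under that assumption, a minimal sketch of catching the raw warnings and swapping in a plainer note could look like the following. `fit_loess_with_note()` and the note text are hypothetical, and the sparse toy data are made up.

```r
# Hypothetical wrapper (not the app's actual code): fit a loess curve, collect
# any numerical warnings, and replace them with one plain-language note.
fit_loess_with_note <- function(x, y, span = 0.75) {
  raw_warnings <- character(0)
  fit <- withCallingHandlers(
    loess(y ~ x, span = span),
    warning = function(w) {
      # Keep the technical message for debugging, but do not show it as-is
      raw_warnings <<- c(raw_warnings, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  note <- if (length(raw_warnings) > 0) {
    sprintf(
      "The smoothed line may be unreliable: only %d distinct time points are available.",
      length(unique(x))
    )
  } else {
    NA_character_
  }
  list(fit = fit, note = note, raw_warnings = raw_warnings)
}

# Sparse toy data: many samples but only three distinct years, which is the
# kind of input likely to trigger near-singularity warnings
x_sparse <- rep(c(2013, 2015, 2017), each = 10)
y_sparse <- rep(c(1, 2, 3), each = 10) + rep(seq(-0.5, 0.5, length.out = 10), 3)
res_sparse <- fit_loess_with_note(x_sparse, y_sparse)
print(res_sparse$note)
```

The raw messages are kept in `raw_warnings`, so they could still be shown behind a "details" toggle for users who want them.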
