Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Working list of things to accomplish #1

Open
mech3132 opened this issue Jan 30, 2020 · 9 comments
Open

Working list of things to accomplish #1

mech3132 opened this issue Jan 30, 2020 · 9 comments

Comments

@mech3132
Copy link
Member

Hi tick peeps,
Here is our new repo!
You'll notice that I explicitly made a gitignore file here-- when you clone this repo you'll have that gitignore file. I'm not sure if we want to keep that or not? Let me know.

Below, we can make comments on things we want to work on, and any warnings to others you might want to include.

@mech3132
Copy link
Member Author

Figure out that if you put any of your documents in the .gitignore file (both on the remote repo and locally on your computer), the .gitignore file WILL push, but your file will not. I've set it so all *.RProj are not pushed or pulled-- sound good for all of you?

@mech3132
Copy link
Member Author

mech3132 commented Feb 6, 2020

Melissa- data orange, determine useful metadata

  • awful format for ticks stage/sex [ merged abundance and prevalence ]— how in the world do individual observations relate to each other
  • plots of variables

Wynne- make maps

  • summarize sites that have borellia— all via maps????????
  • Plot by site— how do different plots differ across time?
  • Are plots highly variable within site?
  • What are zero tick plots? Are these interesting zeros, or silly zeros?

Brendan- rodents

  • download rodents; see how it looks through time across space

Matt - vegetation

@mech3132
Copy link
Member Author

Request:
After thinking about it a bit more, could we actually alter the naming scheme a little? So that numbers always indicate the order that they should be run?
IE If you have 01 in the name, you do NOT need any other scripts to be run first. If you have 02 in the name, you will need at least one (or maybe all) 01 script(s) already run, but do NOT need any 03's run yet-- etc etc.

I know there was previously some logic to the "TickPathogen" line and "TickAbundance" line, but I still find myself getting confused by "TickMerged_01", which actually requires both TickPathogen_01 (Downloading) and TickPathogen_02 (Data clean up) to be run.

Although obviously if there are strong opinions against this, I am willing to be flexible ;)

@WynneMoss
Copy link
Member

Yes totally! Feel free to re-name any of those files. Agree, it got a little weird once there were multiple scripts for downloading.

@mech3132
Copy link
Member Author

Personal goals for this week:

  • Finish plotting tick pathogen vs plot vs time.
  • Standardize and sanitize data to include only pathogen abundances that are meaninful/useful
  • Determine what distributions are appropriate for estimating tick density and prevalence-- should we be using normal distributions? Some kind of aggregated model? Look at data and fit distributions to see what it looks like.

@WynneMoss
Copy link
Member

Hi all! Here's a reminder of what we talked about today (and thanks @scelmendorf for many of these ideas!):

To-dos:

  • Create tick abundance dataset(s) for use in simple models (@mech3132 working on this) to put in data_derived folder. This will include a training dataset and a testing dataset for model validation. It may include one table for ALL ticks (grouped) as well as a table where the two focal tick species are listed separately. Downstream users of this dataset will have to account for effort (drag area) either using poisson offsets or dividing to get density. @mech3132 feel free to chime in with any other peculiarities we should be aware of!

  • Variance partitioning model to partition variation in (1) total tick abundance (2) IXODES tick abundance (3) AMBAME tick abundance. Variables included would be site, plot, year, month, nlcd, and nlcd nested within site. Any others I forgot? @matthewbitters is working on this. We hadn't decided what life stages to use but probably need to keep them separate to make them coherent with other models. e.g.
    perhaps need to run a separate model for each species x lifestage combination. Might also be useful to have additional model where all species and lifestages are grouped? But open to changing this.

  • Time series model to explain the same (total tick abundances of each species-lifestage) @WynneMoss to work on this. Full disclosure I have no idea what I'm doing. I think seeing whether abundances of each of these lifestages can predict abundance in the next time step? I have some homework to do on time series and tick life history...

  • Random forest model or some other machine learning tool to predict abundances. Maybe pull from SDM techniques. Same predictor variables as variance partitioning model? Again, out of my league here but @bkhobart is on it.

The next goal is to explore the accuracy of predictions made by these three models on the test dataset.

@scelmendorf
Copy link

@WynneMoss I am definitely not an expert in time-series but some places you could start are (1) Just a random walk model - kind of like a null model. Chaper 8 in Dieze's forecasting book has some examples and there is code here:https://github.com/EcoForecast/EF_Activities/blob/master/Exercise_06_StateSpace.Rmd (since I know you love JAGS). Or, one simple way to add in a seasonal component - which seemed somewhat prominent in the summary plots we looked at today is Gavin Simpson has a whole mess of stuff about using gams to model seasonal time-series (+code) on fromthebottomoftheheap blog (and there is a paper as well on how to go hierarchical: https://peerj.com/articles/6876/)

@mech3132
Copy link
Member Author

mech3132 commented Feb 23, 2020

Hi all,

[updated 23feb2020]
I posted the script to get time, domain, site, and 'random' validition/test sets. (TickMerged_02_DataFiltering.R).

Here is a short description of each of the files. There are notes in the main folder that says "TickMerged_02_DataFiltering_NOTES.txt" that is my filtering process and notes. You might want to read sections 6 and 9, since those are the most important changes.

./data_derived/MASTER_all_tck_data_merged.csv

  • A complete set of all tick field, taxonomy and pathogen data.
  • Each line is ONE TICK. Many lines are missing taxonomy or pathogen data, since these files were not complete to begin with.
  • If you plan on using this and filtering yourself, here are a few things to look out for:
  • taxonRank is the lowest taxonomy level any tick was ID'd to. Therefore, if you want only ticks ID'd to species, you can filter taxonRank=="species".
  • acceptedTaxonID is shorthand for all species, except: IXOSP==family Ixodidae; IXOSP2==order Ixodidae; and IXOSPP1==multiple possible species in genus Ixodes. Be aware of this if you are filtering by species.

For all subset data (./data_derived/subset_data/) the following is true:

  • TIME: the validation set are the months May, June, July from year 2018 across all sites.
  • DOMAIN: the validation set is domain==D06.
  • SITE: the validation set are sites SCBI, DSNY, STEI, DELA, DEJU (each from a different domain)
  • RANDOM: the validation set is a random 30% of all samples

Details about specific subsets:

./data_derived/subset_data/ticks_taxonomy_collapsed/

  • All files in this folder ignore tick taxonomy, but counts have been adjusted to match taxonomy counts as described in section 6

./data_derived/subset_data/ticks_taxonomy_separated/ ** Not recommended **

  • I strongly discourage using this dataset, unless you read my NOTES file carefully.
  • There is taxonomy included here as acceptedTaxonID. BUT, please read section 9 about how estimatedCount was calculated. In short, some samples only ID'd 1 or 2 ticks out of hundreds, so estimatedCount uses existing tick IDs in a sample and extrapolates to the whole sample-- if you find one IXOSCA and one AMBAME out of 500 ticks, then you will have 250 IXOSCA and 250 AMBAME. If you prefer to use original counts, use rawCount
  • Also, please be aware that acceptedTaxonID includes categories like IXOSP, IXOSP2, and IXOSPP1, which are not at the species level-- they are at the family, order, and genus levels, respectively.

./data_derived/subset_data/ticks_IXOSCA/ and ./data_derived/subset_data/ticks_AMBAME/

  • Please review section 9 for how estimatedCount was calculated.
  • I included 2 extra columns of data for you: rawCount and allSpeciesTotal. rawCount is the number of IXOSCA or AMBAME recorded in each sample WITHOUT adjustments described in section 9. The column allSpeciesTotal are the counts for all species of ticks, including IXOSCA/AMBAME. You may want to remove this column when running models, since the density of total ticks will obviously be correlated with density of IXOSCA/AMBAME, and may artificially improve the performance of your model.

Good luck, and let me know you find any bugs/problems!

M

@mech3132
Copy link
Member Author

mech3132 commented Apr 16, 2020

FYI: wrote a short function called "stepGAM" that finds the AIC value for a set of models, in case anyone finds it useful. After it runs, you can sort by AIC, REML, or Dev.expl. You can set "constant" predictors (that are common throughout all models). There's also a progress bar.
#dat = the tick dataset
#predictors = a vector of predictors (e.g. c("s(dayOfYear)","nlcdClass","s(plotID,bs='re')")). Will test ALL combination of predictors
#response = response variable (e.g. "positiveCounts")
#constant_pred = keep this predictor in all combos; such as, "offset(log(numberTested))"
#family = normal, binomial, etc

Function

stepGAM <- function(dat, predictors, response, constant_pred, family) {
n_rows <- sum(sapply(1:length(predictors),FUN=function(x){ncol(combn(predictors,m=x))} ))
allAIC <- setNames(data.frame(matrix(nrow=n_rows, ncol=(3+length(predictors)))),nm = c("AIC","REML","Dev.expl",predictors))
i <- 1
pb <- txtProgressBar(title = "progress bar", min = 0,
max = n_rows, width = 100)
for ( p in 1:length(predictors) ) {
combos <- combn(predictors, m=p)
for ( c in 1:ncol(combos) ) {
frml.temp <- paste(c(paste0(response," ~ ",constant_pred),combos[,c]), collapse="+")
gam.temp <- gam(as.formula(frml.temp)
, data=dat
, method="REML"
, family=family)
pred.temp <- (colnames(allAIC) %in% combos[,c])[-c(1,2,3)]
allAIC[i,"AIC"] <- gam.temp$aic
allAIC[i,"REML"] <- gam.temp$gcv.ubre[1]
allAIC[i,"Dev.expl"] <- summary(gam.temp)$dev.expl
allAIC[i, predictors] <- pred.temp
i <- i+1
Sys.sleep(0.1)
setTxtProgressBar(pb, i, label=paste( round(i/total*100, 0),
"% done"))
}
}
close(pb)
return(allAIC)
}

Example

allAIC <- stepGAM(dat = tck_borrelia_adj, predictors = allPred, response = "borrPresent", constant_pred = "offset(log(numberTested))", family = binomial)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants