Working list of things to accomplish #1

mech3132 · 2020-01-30T22:01:52Z

Hi tick peeps,
Here is our new repo!
You'll notice that I explicitly made a gitignore file here-- when you clone this repo you'll have that gitignore file. I'm not sure if we want to keep that or not? Let me know.

Below, we can make comments on things we want to work on, and any warnings to others you might want to include.

mech3132 · 2020-01-30T22:07:29Z

Figure out that if you put any of your documents in the .gitignore file (both on the remote repo and locally on your computer), the .gitignore file WILL push, but your file will not. I've set it so all *.RProj are not pushed or pulled-- sound good for all of you?

mech3132 · 2020-02-06T17:35:45Z

Melissa- data orange, determine useful metadata

awful format for ticks stage/sex [ merged abundance and prevalence ]— how in the world do individual observations relate to each other
plots of variables

Wynne- make maps

summarize sites that have borellia— all via maps????????
Plot by site— how do different plots differ across time?
Are plots highly variable within site?
What are zero tick plots? Are these interesting zeros, or silly zeros?

Brendan- rodents

download rodents; see how it looks through time across space

Matt - vegetation

mech3132 · 2020-02-12T16:04:09Z

Request:
After thinking about it a bit more, could we actually alter the naming scheme a little? So that numbers always indicate the order that they should be run?
IE If you have 01 in the name, you do NOT need any other scripts to be run first. If you have 02 in the name, you will need at least one (or maybe all) 01 script(s) already run, but do NOT need any 03's run yet-- etc etc.

I know there was previously some logic to the "TickPathogen" line and "TickAbundance" line, but I still find myself getting confused by "TickMerged_01", which actually requires both TickPathogen_01 (Downloading) and TickPathogen_02 (Data clean up) to be run.

Although obviously if there are strong opinions against this, I am willing to be flexible ;)

WynneMoss · 2020-02-12T16:26:38Z

Yes totally! Feel free to re-name any of those files. Agree, it got a little weird once there were multiple scripts for downloading.

mech3132 · 2020-02-13T18:20:24Z

Personal goals for this week:

Finish plotting tick pathogen vs plot vs time.
Standardize and sanitize data to include only pathogen abundances that are meaninful/useful
Determine what distributions are appropriate for estimating tick density and prevalence-- should we be using normal distributions? Some kind of aggregated model? Look at data and fit distributions to see what it looks like.

WynneMoss · 2020-02-20T19:17:52Z

Hi all! Here's a reminder of what we talked about today (and thanks @scelmendorf for many of these ideas!):

To-dos:

Create tick abundance dataset(s) for use in simple models (@mech3132 working on this) to put in data_derived folder. This will include a training dataset and a testing dataset for model validation. It may include one table for ALL ticks (grouped) as well as a table where the two focal tick species are listed separately. Downstream users of this dataset will have to account for effort (drag area) either using poisson offsets or dividing to get density. @mech3132 feel free to chime in with any other peculiarities we should be aware of!
Variance partitioning model to partition variation in (1) total tick abundance (2) IXODES tick abundance (3) AMBAME tick abundance. Variables included would be site, plot, year, month, nlcd, and nlcd nested within site. Any others I forgot? @matthewbitters is working on this. We hadn't decided what life stages to use but probably need to keep them separate to make them coherent with other models. e.g.
perhaps need to run a separate model for each species x lifestage combination. Might also be useful to have additional model where all species and lifestages are grouped? But open to changing this.
Time series model to explain the same (total tick abundances of each species-lifestage) @WynneMoss to work on this. Full disclosure I have no idea what I'm doing. I think seeing whether abundances of each of these lifestages can predict abundance in the next time step? I have some homework to do on time series and tick life history...
Random forest model or some other machine learning tool to predict abundances. Maybe pull from SDM techniques. Same predictor variables as variance partitioning model? Again, out of my league here but @bkhobart is on it.

The next goal is to explore the accuracy of predictions made by these three models on the test dataset.

scelmendorf · 2020-02-20T20:02:37Z

@WynneMoss I am definitely not an expert in time-series but some places you could start are (1) Just a random walk model - kind of like a null model. Chaper 8 in Dieze's forecasting book has some examples and there is code here:https://github.com/EcoForecast/EF_Activities/blob/master/Exercise_06_StateSpace.Rmd (since I know you love JAGS). Or, one simple way to add in a seasonal component - which seemed somewhat prominent in the summary plots we looked at today is Gavin Simpson has a whole mess of stuff about using gams to model seasonal time-series (+code) on fromthebottomoftheheap blog (and there is a paper as well on how to go hierarchical: https://peerj.com/articles/6876/)

mech3132 · 2020-02-23T02:33:35Z

Hi all,

[updated 23feb2020]
I posted the script to get time, domain, site, and 'random' validition/test sets. (TickMerged_02_DataFiltering.R).

Here is a short description of each of the files. There are notes in the main folder that says "TickMerged_02_DataFiltering_NOTES.txt" that is my filtering process and notes. You might want to read sections 6 and 9, since those are the most important changes.

./data_derived/MASTER_all_tck_data_merged.csv

A complete set of all tick field, taxonomy and pathogen data.
Each line is ONE TICK. Many lines are missing taxonomy or pathogen data, since these files were not complete to begin with.
If you plan on using this and filtering yourself, here are a few things to look out for:
taxonRank is the lowest taxonomy level any tick was ID'd to. Therefore, if you want only ticks ID'd to species, you can filter taxonRank=="species".
acceptedTaxonID is shorthand for all species, except: IXOSP==family Ixodidae; IXOSP2==order Ixodidae; and IXOSPP1==multiple possible species in genus Ixodes. Be aware of this if you are filtering by species.

For all subset data (./data_derived/subset_data/) the following is true:

TIME: the validation set are the months May, June, July from year 2018 across all sites.
DOMAIN: the validation set is domain==D06.
SITE: the validation set are sites SCBI, DSNY, STEI, DELA, DEJU (each from a different domain)
RANDOM: the validation set is a random 30% of all samples

Details about specific subsets:

./data_derived/subset_data/ticks_taxonomy_collapsed/

All files in this folder ignore tick taxonomy, but counts have been adjusted to match taxonomy counts as described in section 6

./data_derived/subset_data/ticks_taxonomy_separated/ ** Not recommended **

I strongly discourage using this dataset, unless you read my NOTES file carefully.
There is taxonomy included here as acceptedTaxonID. BUT, please read section 9 about how estimatedCount was calculated. In short, some samples only ID'd 1 or 2 ticks out of hundreds, so estimatedCount uses existing tick IDs in a sample and extrapolates to the whole sample-- if you find one IXOSCA and one AMBAME out of 500 ticks, then you will have 250 IXOSCA and 250 AMBAME. If you prefer to use original counts, use rawCount
Also, please be aware that acceptedTaxonID includes categories like IXOSP, IXOSP2, and IXOSPP1, which are not at the species level-- they are at the family, order, and genus levels, respectively.

./data_derived/subset_data/ticks_IXOSCA/ and ./data_derived/subset_data/ticks_AMBAME/

Please review section 9 for how estimatedCount was calculated.
I included 2 extra columns of data for you: rawCount and allSpeciesTotal. rawCount is the number of IXOSCA or AMBAME recorded in each sample WITHOUT adjustments described in section 9. The column allSpeciesTotal are the counts for all species of ticks, including IXOSCA/AMBAME. You may want to remove this column when running models, since the density of total ticks will obviously be correlated with density of IXOSCA/AMBAME, and may artificially improve the performance of your model.

Good luck, and let me know you find any bugs/problems!

M

mech3132 · 2020-04-16T21:08:35Z

FYI: wrote a short function called "stepGAM" that finds the AIC value for a set of models, in case anyone finds it useful. After it runs, you can sort by AIC, REML, or Dev.expl. You can set "constant" predictors (that are common throughout all models). There's also a progress bar.
#dat = the tick dataset
#predictors = a vector of predictors (e.g. c("s(dayOfYear)","nlcdClass","s(plotID,bs='re')")). Will test ALL combination of predictors
#response = response variable (e.g. "positiveCounts")
#constant_pred = keep this predictor in all combos; such as, "offset(log(numberTested))"
#family = normal, binomial, etc

Function

stepGAM <- function(dat, predictors, response, constant_pred, family) {
n_rows <- sum(sapply(1:length(predictors),FUN=function(x){ncol(combn(predictors,m=x))} ))
allAIC <- setNames(data.frame(matrix(nrow=n_rows, ncol=(3+length(predictors)))),nm = c("AIC","REML","Dev.expl",predictors))
i <- 1
pb <- txtProgressBar(title = "progress bar", min = 0,
max = n_rows, width = 100)
for ( p in 1:length(predictors) ) {
combos <- combn(predictors, m=p)
for ( c in 1:ncol(combos) ) {
frml.temp <- paste(c(paste0(response," ~ ",constant_pred),combos[,c]), collapse="+")
gam.temp <- gam(as.formula(frml.temp)
, data=dat
, method="REML"
, family=family)
pred.temp <- (colnames(allAIC) %in% combos[,c])[-c(1,2,3)]
allAIC[i,"AIC"] <- gam.temp$aic
allAIC[i,"REML"] <- gam.temp$gcv.ubre[1]
allAIC[i,"Dev.expl"] <- summary(gam.temp)$dev.expl
allAIC[i, predictors] <- pred.temp
i <- i+1
Sys.sleep(0.1)
setTxtProgressBar(pb, i, label=paste( round(i/total*100, 0),
"% done"))
}
}
close(pb)
return(allAIC)
}

Example

allAIC <- stepGAM(dat = tck_borrelia_adj, predictors = allPred, response = "borrPresent", constant_pred = "offset(log(numberTested))", family = binomial)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Working list of things to accomplish #1

Working list of things to accomplish #1

mech3132 commented Jan 30, 2020

mech3132 commented Jan 30, 2020

mech3132 commented Feb 6, 2020

mech3132 commented Feb 12, 2020

WynneMoss commented Feb 12, 2020

mech3132 commented Feb 13, 2020

WynneMoss commented Feb 20, 2020

scelmendorf commented Feb 20, 2020

mech3132 commented Feb 23, 2020 •

edited

Loading

mech3132 commented Apr 16, 2020 •

edited

Loading

Working list of things to accomplish #1

Working list of things to accomplish #1

Comments

mech3132 commented Jan 30, 2020

mech3132 commented Jan 30, 2020

mech3132 commented Feb 6, 2020

mech3132 commented Feb 12, 2020

WynneMoss commented Feb 12, 2020

mech3132 commented Feb 13, 2020

WynneMoss commented Feb 20, 2020

scelmendorf commented Feb 20, 2020

mech3132 commented Feb 23, 2020 • edited Loading

mech3132 commented Apr 16, 2020 • edited Loading

Function

Example

mech3132 commented Feb 23, 2020 •

edited

Loading

mech3132 commented Apr 16, 2020 •

edited

Loading