From 89983825744030fc7eaad8dd9bd486d82ca9f9ba Mon Sep 17 00:00:00 2001
From: JannisGottwald <53615516+JannisGottwald@users.noreply.github.com>
Date: Fri, 16 Sep 2022 12:37:17 +0300
Subject: [PATCH 1/5] Delete html directory
---
 .../Model_training_tuning_and_validation.html | 1479 ----------
 html/bat_data_HGAM_tutorial.html              | 2393 -----------------
 ...-Tutorial-for-activity-classification.html | 2082 --------------
 3 files changed, 5954 deletions(-)

diff --git a/html/Model_training_tuning_and_validation.html b/html/Model_training_tuning_and_validation.html
deleted file mode 100644
index 432906c..0000000

Random Forest model training, tuning and validation for activity classification

Introduction

-

Small movements of tagged animals result in discernible variations in the strength of the received signal (Cochran et al. (1965); Kjos and Cochran (1970)) that reflect changes in the angle and distance between the transmitter and receiver. Kays et al. (2011) proposed a method for automatically classifying active and passive behaviour based on a threshold difference in the signal strength of successive VHF signals recorded by a customised automatic radio-tracking system. However, machine learning (ML) algorithms are optimised for the recognition of complex patterns in a dataset and are typically robust against factors that influence signal propagation, such as changes in temperature and humidity, physical contact with conspecifics and/or multipath signal propagation (Alade (2013)). Accordingly, a ML model trained with a dataset encompassing the possible diversity of signal patterns related to active and passive behaviour can be expected to perform at least as well as a threshold-based approach. In this work, we built on the methodology of Kays et al. (2011) by calibrating two random forest models (one for data coming from only one receiver and one for data coming from at least two receivers), based on millions of data points representing the behaviours of multiple tagged individuals of two temperate bat species (Myotis bechsteinii, Nyctalus leisleri).

-

The method was tested by applying it to independent data from bats, humans, and a bird species and then comparing the results with those obtained using the threshold-based approach of Kays et al. (2011).

-

In order to make our work comprehensible, code and data are made available to all interested parties here (https://doi.org/10.17192/fdr/81). This resource contains the following steps:

-
  • Training two models in conjunction with forward feature selection
  • Parameter tuning
  • Validation using three independent data sets
  • Comparison to a threshold-based approach as proposed in Kays et al. (2011)

But before we get started:

-
-

Why Random Forest?

-

Although deep learning methods have been successfully applied to several ecological problems where large amounts of data are available (Christin, Hervet, and Lecomte (2019)), we use a random forest model for the following reasons:

-
    1. Developing a (supervised) deep learning method requires considerable effort for selecting an appropriate neural network architecture, choosing an appropriate framework to implement the neural network, and training, validating, testing, and refining the neural network (Christin, Hervet, and Lecomte (2019)).
    2. Essentially, we have to solve a binary classification task based on tabular data; in this setting, tree ensemble methods such as random forests seem to have clear advantages - they are less computationally intensive, easy to implement, robust, and at least as performant as deep learning (Shwartz-Ziv and Armon (2022)).
    3. In a large study comparing 179 classifiers applied to the 121 classification data sets of the UCI repository, random forests were the best classifiers in over 90% of the cases (Fernández-Delgado et al. (2014)).
-
-
-
-

Model training and tuning

-

For model training and tuning we use the caret R package (Kuhn 2008). For the forward feature selection we use the CAST R package developed by Meyer et al. (2018).

-

Additional packages needed are: randomForest, ranger, doParallel, MLeval, data.table, dplyr, plyr

-

Load packages

-
library(caret); library(randomForest); library(ranger); library(doParallel); library(MLeval); library(CAST); library(data.table); library(dplyr); library(plyr)
-
-

Data for model training

-

Only one antenna is necessary to classify VHF signals into active vs. passive states (Kays et al. 2011). However, agreement between receivers of the same station provides additional information and can improve the reliability of the classification. Our ground-truth dataset was balanced by randomly down-sampling the activity class with the most data to the amount of data contained in the class with the least data. The balanced dataset was then split into 50% training data and 50% test data for data originating from one receiver. The same procedure was used for data derived from the signals of two receivers, resulting in two training and two test datasets. From a total of 3,243,753 VHF signals, 124,898 signals were assigned to train the two-receiver model and 294,440 signals to train the one-receiver model (Table 1). A minimal sketch of this balancing and splitting procedure is shown below.
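The following is a minimal sketch (not the original preprocessing script) of how such a balanced 50/50 split can be produced with caret; the object groundtruth and its Class column are placeholders:

#minimal sketch of class balancing and a stratified 50/50 split with caret
library(caret)

set.seed(10)
#down-sample the majority class so both classes contain the same number of signals
balanced <- downSample(x = groundtruth[, setdiff(names(groundtruth), "Class")],
                       y = groundtruth$Class, yname = "Class")

#stratified 50% train / 50% test split
idx <- createDataPartition(balanced$Class, p = 0.5, list = FALSE)
batsTrain <- balanced[idx, ]
batsTest  <- balanced[-idx, ]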

-
-
-

Feature selection

-

Since not all variables are equally important to the model and some may even be misleading, we performed a forward feature selection on 50% of the training data. The forward feature selection algorithm implemented in the R package CAST (Meyer et al. (2018)) selects the best pair out of all possible two-variable combinations by evaluating the performance of a k-fold cross-validation (CV). The algorithm then iteratively increases the number of predictors until adding further variables no longer improves performance.

-
-

1 Receiver model

-
######################### 1 Receiver ########################
-
-
-data_1<-readRDS("model_tunig_and_validation/data/batsTrain_1_receiver.rds")
-
-table(data_1$Class)
-
## 
-##  active passive 
-##  294173  294173
-
##############FFS############################
-
-predictors<-names(data_1[, -ncol(data_1)])
-
-
-cl<-makeCluster(10)
-
-registerDoParallel(cl)
-
-ctrl <- trainControl(## 10-fold CV
-  method = "cv",
-  number = 10)
-
-
-#run ffs model with Leave Location out CV
-set.seed(10)
-
-ffsmodel <- ffs(predictors=data_1[,predictors],response = data_1$Class,method="rf",
-                metric="Kappa",
-                tuneLength = 1,
-                trControl=ctrl,
-                verbose = TRUE)
-
-ffsmodel$selectedvars
-
-
-saveRDS(ffsmodel, "model_tunig_and_validation/models/m_r1.rds")
-
-stopCluster(cl)
-
-
-

Results feature selection

-

Red dots display two-variable combinations; dots coloured from yellow to pink represent models to each of which a further variable has been added. Dots with a black border mark the optimal variable combination of the respective iteration.

-
m1<-readRDS("model_tunig_and_validation/models/m_r1.rds")
-
-print(m1)
-
## Random Forest 
-## 
-## 588346 samples
-##      7 predictor
-##      2 classes: 'active', 'passive' 
-## 
-## No pre-processing
-## Resampling: Cross-Validated (10 fold) 
-## Summary of sample sizes: 529512, 529512, 529511, 529512, 529511, 529511, ... 
-## Resampling results:
-## 
-##   Accuracy   Kappa    
-##   0.9631628  0.9263257
-## 
-## Tuning parameter 'mtry' was held constant at a value of 2
-
print(plot_ffs(m1))
-

-
-
-

Variable importance 1 receiver model

-
plot(varImp(m1))
-

-
-
-

2 Receivers model

-
######################### 2 Receivers ########################
-
-data_2<-readRDS("model_tunig_and_validation/data/batsTrain_2_receivers.rds")
-
-table(data_2$Class)
-
## 
-##  active passive 
-##  110274  110274
-
predictors<-names(data_2[, -ncol(data_2)])
-
-##############FFS#####
-
-cl<-makeCluster(10)
-
-registerDoParallel(cl)
-
-ctrl <- trainControl(## 10-fold CV
-  method = "cv",
-  number = 10)
-
-#run ffs model
-set.seed(10)
-
-ffsmodel <- ffs(predictors=data_2[,predictors],response = data_2$Class,method="rf",
-                metric="Kappa",
-                tuneLength = 1,
-                trControl=ctrl,
-                verbose = TRUE)
-
-ffsmodel$selectedvars
-
-saveRDS(ffsmodel, "model_tunig_and_validation/models/m_r2.rds")
-
-stopCluster(cl)
-
-
-

Results feature selection

-

Red dots display two-variable combinations; dots coloured from yellow to pink represent models to each of which a further variable has been added. Dots with a black border mark the optimal variable combination of the respective iteration.

-
m2<-readRDS("model_tunig_and_validation/models/m_r2.rds")
-print(m2)
-
## Random Forest 
-## 
-## 220548 samples
-##      8 predictor
-##      2 classes: 'active', 'passive' 
-## 
-## No pre-processing
-## Resampling: Cross-Validated (10 fold) 
-## Summary of sample sizes: 198494, 198493, 198494, 198494, 198492, 198492, ... 
-## Resampling results:
-## 
-##   Accuracy   Kappa    
-##   0.9740011  0.9480022
-## 
-## Tuning parameter 'mtry' was held constant at a value of 2
-
print(plot_ffs(m2))
-

-
-
-

Variable importance 2 receivers model

-
plot(varImp(m2))
-

-
-
-
-

Model tuning

-

Random forest is an algorithm that is far less tunable than other algorithms such as support vector machines (Probst, Wright, and Boulesteix (2019)) and is known to provide good results with the default settings of existing software packages (Fernández-Delgado et al. (2014)). Even though the expected performance gain is low, tuning the parameter mtry provides the biggest average improvement of the AUC (0.006) (Probst et al. 2018). Mtry is defined as the number of randomly drawn candidate variables out of which each split is selected when growing a tree. Here we reduce the predictor variables to those selected by the forward feature selection and iteratively increase the number of randomly drawn candidate variables from 1 to the total number of selected variables. Other parameters, such as the number of trees, are held constant at the default settings of the packages used.

-
-

Tuning mtry on the 1 receiver model

-
#reduce to ffs variables
-predictors<-names(data_1[, c(m1$selectedvars, "Class")])
-batsTune<-data_1[, predictors]
-
-#tune the number of candidate variables evaluated per split - number of trees is 1000
-ctrl <- trainControl(## 10-fold CV
-  method = "cv",
-  number = 10,
-  verboseIter = TRUE
-)
-
-tunegrid <- expand.grid(
-  mtry = 1:(length(predictors)-1),                                  # mtry specified here
-  splitrule = "gini",
-  min.node.size = 10
-)
-
-tuned_model <- train(Class~.,
-                    data=batsTune,
-                    method='ranger',  # ranger accepts the splitrule/min.node.size grid; method='rf' tunes mtry only
-                    metric='Kappa',
-                    tuneGrid=tunegrid,
-                    num.trees=1000,
-                    trControl=ctrl)
-
-saveRDS(tuned_model,"model_tunig_and_validation/models/m_r1_tuned.rds")
-
-
-

Results model tuning 1 receiver

-
m1_tuned<-readRDS("model_tunig_and_validation/models/m_r1_tuned.rds")
-
-print(m1_tuned)
-
## Random Forest 
-## 
-## 588346 samples
-##      7 predictor
-##      2 classes: 'active', 'passive' 
-## 
-## No pre-processing
-## Resampling: Cross-Validated (10 fold) 
-## Summary of sample sizes: 529510, 529512, 529511, 529511, 529511, 529512, ... 
-## Resampling results across tuning parameters:
-## 
-##   mtry  Accuracy   Kappa    
-##   1     0.9591601  0.9183202
-##   2     0.9616518  0.9233036
-##   3     0.9619646  0.9239291
-##   4     0.9618371  0.9236742
-##   5     0.9615039  0.9230079
-##   6     0.9610569  0.9221139
-##   7     0.9602819  0.9205637
-## 
-## Tuning parameter 'splitrule' was held constant at a value of gini
-## 
-## Tuning parameter 'min.node.size' was held constant at a value of 10
-## Kappa was used to select the optimal model using the largest value.
-## The final values used for the model were mtry = 3, splitrule = gini
-##  and min.node.size = 10.
-
-
-

Tuning mtry on the 2 receivers model

-
#reduce to ffs variables
-predictors<-names(data_2[, c(m2$selectedvars, "Class")])
-batsTune<-data_2[, predictors]
-
-#tune the number of candidate variables evaluated per split - number of trees is 1000
-ctrl <- trainControl(## 10-fold CV
-  method = "cv",
-  number = 10,
-  verboseIter = TRUE
-)
-
-
-tunegrid <- expand.grid(
-  mtry = 1:(length(predictors)-1),                                  # mtry specified here
-  splitrule = "gini",
-  min.node.size = 10
-)
-tuned_model_2 <- train(Class~.,
-                     data=batsTune,
-                     method='ranger',  # ranger accepts the splitrule/min.node.size grid
-                     metric='Kappa',
-                     tuneGrid=tunegrid,
-                     num.trees=1000,
-                     trControl=ctrl)
-print(tuned_model_2)
-
-
-saveRDS(tuned_model_2,"model_tunig_and_validation/models/m_r2_tuned.rds")
-
-
-

Results model tuning 2 receivers

-
m2_tuned<-readRDS("model_tunig_and_validation/models/m_r2_tuned.rds")
-print(m2_tuned)
-
## Random Forest 
-## 
-## 220548 samples
-##      8 predictor
-##      2 classes: 'active', 'passive' 
-## 
-## No pre-processing
-## Resampling: Cross-Validated (10 fold) 
-## Summary of sample sizes: 198494, 198494, 198493, 198492, 198493, 198494, ... 
-## Resampling results across tuning parameters:
-## 
-##   mtry  Accuracy   Kappa    
-##   1     0.9719608  0.9439215
-##   2     0.9724187  0.9448374
-##   3     0.9717975  0.9435951
-##   4     0.9712988  0.9425976
-##   5     0.9710041  0.9420081
-##   6     0.9707139  0.9414277
-##   7     0.9703285  0.9406569
-##   8     0.9702605  0.9405209
-## 
-## Tuning parameter 'splitrule' was held constant at a value of gini
-## 
-## Tuning parameter 'min.node.size' was held constant at a value of 10
-## Kappa was used to select the optimal model using the largest value.
-## The final values used for the model were mtry = 2, splitrule = gini
-##  and min.node.size = 10.
-
-
-

Results and discussion model training and tuning

-

Both models (based on data from one receiver and from two receivers) achieved very high performance metrics (Kappa, Accuracy), with slightly better results for the two-receiver model. Tuning the mtry parameter did not increase performance, which indicates that the default settings are a good choice for our use case.

-
-
-
-
-

Model evaluation

-

To validate model performance and its applicability to species with movement behaviour different from bats (e.g. movement speed), we generated three different data sets:

  1. We put 50% of our bat data aside
  2. We collected ground-truth data of a tagged middle spotted woodpecker
  3. We simulated different movement intensities by humans carrying transmitters through the forest

-

In this section we test how well the models perform in terms of different performance metrics such as F-score, Accuracy and ROC-AUC.

-
-

Bats

-

We first take a look at the 50% test data that was put aside for evaluation. Here we actually perform the prediction using the two trained models. For the woodpecker and human walk data sets we will use already predicted data that has been processed by the scripts validation_woodpecker and validation_human_activity.

-
-

Test data collected by one Receiver

-
# Testdata 1 receiver
-Test_1<-readRDS("model_tunig_and_validation/data/batsTest_1_receiver.rds")
-print(table(Test_1$Class))
-
## 
-##  active passive 
-##  294172  294172
-
# Default names as expected in Caret
-Test_1$obs<-factor(Test_1$Class)
-
-#get binary prediction
-pred1<-predict(m1, Test_1)
-Test_1$pred<-factor(pred1)
-
-#probabilities
-prob<-predict(m1, Test_1, type="prob")
-Test_1<-cbind(Test_1, prob)
-
#calculate roc-auc 
-
-roc1 <- MLeval::evalm(data.frame(prob, Test_1$obs))
-saveRDS(roc1, "model_tunig_and_validation/results/roc_1receiver.rds")
-
-
-

Performance metrics for 1-receiver test data of bats

-
#create confusion matrix
-cm_r1<- confusionMatrix(factor(Test_1$pred), factor(Test_1$Class))
-print(cm_r1)
-
## Confusion Matrix and Statistics
-## 
-##           Reference
-## Prediction active passive
-##    active  282864   10218
-##    passive  11308  283954
-##                                           
-##                Accuracy : 0.9634          
-##                  95% CI : (0.9629, 0.9639)
-##     No Information Rate : 0.5             
-##     P-Value [Acc > NIR] : < 2.2e-16       
-##                                           
-##                   Kappa : 0.9268          
-##                                           
-##  Mcnemar's Test P-Value : 1.15e-13        
-##                                           
-##             Sensitivity : 0.9616          
-##             Specificity : 0.9653          
-##          Pos Pred Value : 0.9651          
-##          Neg Pred Value : 0.9617          
-##              Prevalence : 0.5000          
-##          Detection Rate : 0.4808          
-##    Detection Prevalence : 0.4981          
-##       Balanced Accuracy : 0.9634          
-##                                           
-##        'Positive' Class : active          
-## 
-
print(cm_r1$byClass)
-
##          Sensitivity          Specificity       Pos Pred Value 
-##            0.9615599            0.9652652            0.9651360 
-##       Neg Pred Value            Precision               Recall 
-##            0.9617018            0.9651360            0.9615599 
-##                   F1           Prevalence       Detection Rate 
-##            0.9633447            0.5000000            0.4807800 
-## Detection Prevalence    Balanced Accuracy 
-##            0.4981473            0.9634126
-
-
-

ROC-AUC for 1-receiver test data of bats

-
#
-twoClassSummary(Test_1, lev = levels(Test_1$obs))
-
##       ROC      Sens      Spec 
-## 0.9942587 0.9615599 0.9652652
-
  roc1 <- readRDS("model_tunig_and_validation/results/roc_1receiver.rds")
-print(roc1$roc)
-

-
-
-

Bats: Test data collected by two receivers

-
#get model
-
-#two receivers
-Test_2<-readRDS("model_tunig_and_validation/data/batsTest_2_receivers.rds")
-
-table(Test_2$Class)
-
## 
-##  active passive 
-##  110273  110273
-
Test_2$obs<-Test_2$Class
-#get binary prediction
-pred2<-predict(m2, Test_2)
-Test_2$pred<-pred2
-#probabilities
-prob2<-predict(m2, Test_2, type="prob")
-Test_2<-cbind(Test_2, prob2)
-
#calculate roc-auc
-roc2 <- MLeval::evalm(data.frame(prob2, Test_2$obs))
-
-saveRDS(roc2, "model_tunig_and_validation/results/roc_2receivers.rds")
-
-
-

Performance metrics for 2-receiver test data of bats

-
cm_r2<- confusionMatrix(factor(Test_2$pred), factor(Test_2$obs))
-print(cm_r2)
-
## Confusion Matrix and Statistics
-## 
-##           Reference
-## Prediction active passive
-##    active  107571    2745
-##    passive   2702  107528
-##                                           
-##                Accuracy : 0.9753          
-##                  95% CI : (0.9746, 0.9759)
-##     No Information Rate : 0.5             
-##     P-Value [Acc > NIR] : <2e-16          
-##                                           
-##                   Kappa : 0.9506          
-##                                           
-##  Mcnemar's Test P-Value : 0.5693          
-##                                           
-##             Sensitivity : 0.9755          
-##             Specificity : 0.9751          
-##          Pos Pred Value : 0.9751          
-##          Neg Pred Value : 0.9755          
-##              Prevalence : 0.5000          
-##          Detection Rate : 0.4877          
-##    Detection Prevalence : 0.5002          
-##       Balanced Accuracy : 0.9753          
-##                                           
-##        'Positive' Class : active          
-## 
-
print(cm_r2$byClass)
-
##          Sensitivity          Specificity       Pos Pred Value 
-##            0.9754972            0.9751072            0.9751169 
-##       Neg Pred Value            Precision               Recall 
-##            0.9754876            0.9751169            0.9754972 
-##                   F1           Prevalence       Detection Rate 
-##            0.9753070            0.5000000            0.4877486 
-## Detection Prevalence    Balanced Accuracy 
-##            0.5001950            0.9753022
-
-
-

ROC-AUC for 2-receiver test data of bats

-
#
-twoClassSummary(Test_2, lev = levels(Test_2$obs))  # Test_2 holds the obs/prob columns (data_2 does not)
-roc2 <- readRDS("model_tunig_and_validation/results/roc_2receivers.rds")
-
-print(roc2$roc)
-
-
-
-

Woodpecker

-
#two receivers
-wp<-readRDS("model_tunig_and_validation/data/woodpecker_groundtruth.rds")
-
-wp$obs<-as.factor(wp$observed)
-wp$pred<-as.factor(wp$prediction)
-
-

Performance metrics woodpecker

-
#create confusion matrix
-cm_wp<- confusionMatrix(wp$pred, wp$obs)
-print(cm_wp)
-
## Confusion Matrix and Statistics
-## 
-##           Reference
-## Prediction active passive
-##    active    8309      31
-##    passive    432    7969
-##                                           
-##                Accuracy : 0.9723          
-##                  95% CI : (0.9697, 0.9748)
-##     No Information Rate : 0.5221          
-##     P-Value [Acc > NIR] : < 2.2e-16       
-##                                           
-##                   Kappa : 0.9447          
-##                                           
-##  Mcnemar's Test P-Value : < 2.2e-16       
-##                                           
-##             Sensitivity : 0.9506          
-##             Specificity : 0.9961          
-##          Pos Pred Value : 0.9963          
-##          Neg Pred Value : 0.9486          
-##              Prevalence : 0.5221          
-##          Detection Rate : 0.4963          
-##    Detection Prevalence : 0.4982          
-##       Balanced Accuracy : 0.9734          
-##                                           
-##        'Positive' Class : active          
-## 
-
print(cm_wp$byClass)
-
##          Sensitivity          Specificity       Pos Pred Value 
-##            0.9505777            0.9961250            0.9962830 
-##       Neg Pred Value            Precision               Recall 
-##            0.9485776            0.9962830            0.9505777 
-##                   F1           Prevalence       Detection Rate 
-##            0.9728939            0.5221313            0.4963264 
-## Detection Prevalence    Balanced Accuracy 
-##            0.4981781            0.9733514
-
-
-

ROC-AUC Woodpecker

-
print(twoClassSummary(wp, lev = levels(wp$obs)))
-
##       ROC      Sens      Spec 
-## 0.9982197 0.9505777 0.9961250
-
roc_wp <- MLeval::evalm(data.frame(wp[, c("active", "passive")], wp$obs), plots=c("r"))
-

-
#print(roc_wp$roc)
-
-
-
-

Human activity

-
#two receivers
-hm<-readRDS("model_tunig_and_validation/data/human_walk_groundtruth.rds")
-hm$obs<-factor(hm$observation)
-hm$pred<-factor(hm$prediction)
-
-

Performance metrics human activity

-
#create confusion matrix
-cm_hm<- confusionMatrix(factor(hm$pred), factor(hm$obs))
-print(cm_hm)
-
## Confusion Matrix and Statistics
-## 
-##           Reference
-## Prediction active passive
-##    active   25787     280
-##    passive    717    5870
-##                                           
-##                Accuracy : 0.9695          
-##                  95% CI : (0.9675, 0.9713)
-##     No Information Rate : 0.8117          
-##     P-Value [Acc > NIR] : < 2.2e-16       
-##                                           
-##                   Kappa : 0.9028          
-##                                           
-##  Mcnemar's Test P-Value : < 2.2e-16       
-##                                           
-##             Sensitivity : 0.9729          
-##             Specificity : 0.9545          
-##          Pos Pred Value : 0.9893          
-##          Neg Pred Value : 0.8911          
-##              Prevalence : 0.8117          
-##          Detection Rate : 0.7897          
-##    Detection Prevalence : 0.7983          
-##       Balanced Accuracy : 0.9637          
-##                                           
-##        'Positive' Class : active          
-## 
-
print(cm_hm$byClass)
-
##          Sensitivity          Specificity       Pos Pred Value 
-##            0.9729475            0.9544715            0.9892584 
-##       Neg Pred Value            Precision               Recall 
-##            0.8911492            0.9892584            0.9729475 
-##                   F1           Prevalence       Detection Rate 
-##            0.9810352            0.8116617            0.7897042 
-## Detection Prevalence    Balanced Accuracy 
-##            0.7982789            0.9637095
-
-
-

ROC-AUC human activity

-
twoClassSummary(hm, lev = levels(hm$obs))
-
##       ROC      Sens      Spec 
-## 0.9902507 0.9729475 0.9544715
-
roc_hm <- MLeval::evalm(data.frame(hm[, c("active", "passive")], hm$obs),plots=c("r"))
-

-
#print(roc_hm$roc)
-
-
-
-

Results random-forest model validation

-

Regardless of whether the models were tested on independent test data from bats or on data from other species (human, woodpecker), the performance metrics were always close to their maxima.

-
-
-
-

Comparison to a threshold based approach

-

The results of the ML-based approach were compared with those of a threshold-based approach (Kays et al. 2011) by calculating the difference in the signal strength between successive signals for all three test datasets (bats, bird, humans). We applied a threshold of 4 dB, which was deemed appropriate to optimally separate active and passive behaviours in previous studies (Holland et al. (2011)). In addition, the optimize function of the R package stats (R Core Team, 2021) was used to identify the value of the signal strength difference that separated the training dataset into active and passive with the highest accuracy. This value was also applied to all three test datasets.

-
-

Bats

-
-

Threshold optimisation

-

To find the threshold value that maximises accuracy (the data is balanced) when separating the data into active and passive, we first calculated the signal strength difference of consecutive signals in the complete bat data set, then separated it into 50% balanced test and train data, and finally used the optimize function from the R stats package to determine the best threshold.

-
#get all bat data
-trn<-fread("model_tunig_and_validation/data/train_2020_2021.csv")
-
-#calculate signal strength difference per station
-dtrn<-plyr::ldply(unique(trn$station), function(x){
-  
-  tmp<-trn[trn$station==x,]
-  tmp<-tmp[order(tmp$timestamp),]
-  tmp<-tmp%>%group_by(ID)%>%
-    mutate(Diff = abs(max_signal - lag(max_signal)))
-  return(tmp)
-  })
-
-##data clean up
-dtrn<-dtrn[!is.na(dtrn$Diff),]
-dtrn<-dtrn[!(dtrn$behaviour=="active" & dtrn$Diff==0),]
-
-##factorize
-dtrn$behaviour<-as.factor(dtrn$behaviour)
-table(dtrn$behaviour)
-
## 
-##  active passive 
-##  513831 2654868
-
#balance data
-set.seed(10)
-
-tdown<-downSample(x = dtrn,
-                  y = dtrn$behaviour)
-
-#create 50% train and test
-
-trainIndex <-createDataPartition(tdown$Class, p = .5, 
-                                 list = FALSE, 
-                                 times = 1)
-
-dtrn <- tdown[ trainIndex,]
-dtst  <- tdown[-trainIndex,]
-
-#optimize separation value based on accuracy (remember the data is balanced)
-
-value<-dtrn$Diff
-group<-dtrn$behaviour
-
-accuracy = Vectorize(function(th) mean(c("passive", "active")[(value > th) + 1] == group))
-ac<-optimize(accuracy, c(min(value, na.rm=TRUE), max(value, na.rm=TRUE)), maximum=TRUE)
-
-ac$maximum
-
## [1] 1.088167
-
-
-

Performance metrics based on optimized threshold

-
#classify data by optimized value
-dtst$pred<-NA
-dtst$pred[dtst$Diff>ac$maximum]<-"active"
-dtst$pred[dtst$Diff<=ac$maximum]<-"passive"
-
-#calc confusion matrix
-dtst$pred<-factor(dtst$pred)
-cm<-confusionMatrix(factor(dtst$Class), factor(dtst$pred))
-
-print(cm)
-
## Confusion Matrix and Statistics
-## 
-##           Reference
-## Prediction active passive
-##    active  198976   57939
-##    passive  81121  175794
-##                                           
-##                Accuracy : 0.7294          
-##                  95% CI : (0.7281, 0.7306)
-##     No Information Rate : 0.5451          
-##     P-Value [Acc > NIR] : < 2.2e-16       
-##                                           
-##                   Kappa : 0.4587          
-##                                           
-##  Mcnemar's Test P-Value : < 2.2e-16       
-##                                           
-##             Sensitivity : 0.7104          
-##             Specificity : 0.7521          
-##          Pos Pred Value : 0.7745          
-##          Neg Pred Value : 0.6842          
-##              Prevalence : 0.5451          
-##          Detection Rate : 0.3872          
-##    Detection Prevalence : 0.5000          
-##       Balanced Accuracy : 0.7312          
-##                                           
-##        'Positive' Class : active          
-## 
-
print(cm$byClass)
-
##          Sensitivity          Specificity       Pos Pred Value 
-##            0.7103825            0.7521146            0.7744818 
-##       Neg Pred Value            Precision               Recall 
-##            0.6842497            0.7744818            0.7103825 
-##                   F1           Prevalence       Detection Rate 
-##            0.7410486            0.5451161            0.3872409 
-## Detection Prevalence    Balanced Accuracy 
-##            0.5000000            0.7312485
-
-
-

Performance metrics based on 4 dB threshold from the literature

-
#4 dB value from the literature
-dtst$pred<-NA
-dtst$pred[dtst$Diff>4]<-"active"
-dtst$pred[dtst$Diff<=4]<-"passive"
-
-dtst$pred<-factor(dtst$pred)
-cm<-confusionMatrix(dtst$Class, dtst$pred)
-print(cm)
-
## Confusion Matrix and Statistics
-## 
-##           Reference
-## Prediction active passive
-##    active  102253  154662
-##    passive  35342  221573
-##                                           
-##                Accuracy : 0.6302          
-##                  95% CI : (0.6289, 0.6315)
-##     No Information Rate : 0.7322          
-##     P-Value [Acc > NIR] : 1               
-##                                           
-##                   Kappa : 0.2604          
-##                                           
-##  Mcnemar's Test P-Value : <2e-16          
-##                                           
-##             Sensitivity : 0.7431          
-##             Specificity : 0.5889          
-##          Pos Pred Value : 0.3980          
-##          Neg Pred Value : 0.8624          
-##              Prevalence : 0.2678          
-##          Detection Rate : 0.1990          
-##    Detection Prevalence : 0.5000          
-##       Balanced Accuracy : 0.6660          
-##                                           
-##        'Positive' Class : active          
-## 
-
print(cm$byClass)
-
##          Sensitivity          Specificity       Pos Pred Value 
-##            0.7431447            0.5889218            0.3980032 
-##       Neg Pred Value            Precision               Recall 
-##            0.8624370            0.3980032            0.7431447 
-##                   F1           Prevalence       Detection Rate 
-##            0.5183798            0.2677831            0.1990016 
-## Detection Prevalence    Balanced Accuracy 
-##            0.5000000            0.6660333
-
-
-
-

Woodpecker

-

Since the activity observations are not continuous but the signal recording on the tRackIT stations is, we first have to calculate the signal strength difference on the raw data and then match it to the ground-truth observations.

-
#list raw signals
-wp<-list.files("model_tunig_and_validation/data/woodpecker_raw/variables/", full.names = TRUE)
-
-
-#calculate signal strength difference
-wp_tst<-plyr::ldply(wp, function(x){
-  
-  tmp<-fread(x)
-  tmp<-tmp[order(tmp$timestamp),]
-  tmp<-tmp%>%mutate(Diff = abs(max_signal - lag(max_signal)))
-  return(tmp)
-})
-
-wp_tst$timestamp<-lubridate::with_tz(wp_tst$timestamp, "CET")
-
-#get observations and merge by timestamp
-
-wp_gtruth<-readRDS("model_tunig_and_validation/data/woodpecker_groundtruth.rds")
-
-wp_tst<-merge(wp_gtruth, wp_tst, all.x = TRUE)
-
-

Performance metrics based on optimized threshold

-
wp_tst$pred<-NA
-wp_tst$pred[wp_tst$Diff>ac$maximum]<-"active"
-wp_tst$pred[wp_tst$Diff<=ac$maximum]<-"passive"
-
-wp_tst$pred<-factor(wp_tst$pred)
-wp_tst$observed<-factor(wp_tst$observed)
-
-cm<-confusionMatrix(factor(wp_tst$observed), factor(wp_tst$pred))
-
-print(cm)
-
## Confusion Matrix and Statistics
-## 
-##           Reference
-## Prediction active passive
-##    active    8191    3822
-##    passive    590    7691
-##                                           
-##                Accuracy : 0.7826          
-##                  95% CI : (0.7769, 0.7883)
-##     No Information Rate : 0.5673          
-##     P-Value [Acc > NIR] : < 2.2e-16       
-##                                           
-##                   Kappa : 0.5757          
-##                                           
-##  Mcnemar's Test P-Value : < 2.2e-16       
-##                                           
-##             Sensitivity : 0.9328          
-##             Specificity : 0.6680          
-##          Pos Pred Value : 0.6818          
-##          Neg Pred Value : 0.9288          
-##              Prevalence : 0.4327          
-##          Detection Rate : 0.4036          
-##    Detection Prevalence : 0.5919          
-##       Balanced Accuracy : 0.8004          
-##                                           
-##        'Positive' Class : active          
-## 
-
print(cm$byClass)
-
##          Sensitivity          Specificity       Pos Pred Value 
-##            0.9328095            0.6680274            0.6818447 
-##       Neg Pred Value            Precision               Recall 
-##            0.9287526            0.6818447            0.9328095 
-##                   F1           Prevalence       Detection Rate 
-##            0.7878234            0.4326895            0.4036168 
-## Detection Prevalence    Balanced Accuracy 
-##            0.5919484            0.8004185
-
-
-

Performance metrics based on 4 dB threshold from the literature

-
#evaluate with 4 dB value from the literature
-wp_tst$pred<-NA
-wp_tst$pred[wp_tst$Diff>4]<-"active"
-wp_tst$pred[wp_tst$Diff<=4]<-"passive"
-
-wp_tst$pred<-factor(wp_tst$pred)
-cm<-confusionMatrix(wp_tst$observed, wp_tst$pred)
-print(cm)
-
## Confusion Matrix and Statistics
-## 
-##           Reference
-## Prediction active passive
-##    active    3669    8344
-##    passive    191    8090
-##                                           
-##                Accuracy : 0.5794          
-##                  95% CI : (0.5726, 0.5862)
-##     No Information Rate : 0.8098          
-##     P-Value [Acc > NIR] : 1               
-##                                           
-##                   Kappa : 0.2449          
-##                                           
-##  Mcnemar's Test P-Value : <2e-16          
-##                                           
-##             Sensitivity : 0.9505          
-##             Specificity : 0.4923          
-##          Pos Pred Value : 0.3054          
-##          Neg Pred Value : 0.9769          
-##              Prevalence : 0.1902          
-##          Detection Rate : 0.1808          
-##    Detection Prevalence : 0.5919          
-##       Balanced Accuracy : 0.7214          
-##                                           
-##        'Positive' Class : active          
-## 
-
print(cm$byClass)
-
##          Sensitivity          Specificity       Pos Pred Value 
-##            0.9505181            0.4922721            0.3054191 
-##       Neg Pred Value            Precision               Recall 
-##            0.9769352            0.3054191            0.9505181 
-##                   F1           Prevalence       Detection Rate 
-##            0.4622945            0.1902040            0.1807924 
-## Detection Prevalence    Balanced Accuracy 
-##            0.5919484            0.7213951
-
-
-
-

Humans

-

Human activity observations are also not continuous, so we have to calculate the signal strength difference for each individual on the raw data.

-
hm_dirs<-list.dirs("model_tunig_and_validation/data/human_raw/", full.names = TRUE)
-hm_dirs<-hm_dirs[grep("variables", hm_dirs)]
-hm_tst<-plyr::ldply(hm_dirs, function(d){
-  
-  fls<-list.files(d, full.names = TRUE)
-  
-  tmp_dat<-plyr::ldply(fls, function(x){
-  
-  tmp<-fread(x)
-  tmp<-tmp[order(tmp$timestamp),]
-  tmp<-tmp%>%mutate(Diff = abs(max_signal - lag(max_signal)))
-  return(tmp)
-})
-  
-  return(tmp_dat)})
-
-#get observations and merge
-hm_gtruth<-readRDS("model_tunig_and_validation/data/human_walk_groundtruth.rds")
-hm_tst<-merge(hm_gtruth, hm_tst, all.x = TRUE)
-hm_tst<-hm_tst[!duplicated(hm_tst$timestamp),]
-
-

Performance metrics based on optimized threshold

-
#evaluate based on optimized threshold
-hm_tst$pred<-NA
-hm_tst$pred[hm_tst$Diff>ac$maximum]<-"active"
-hm_tst$pred[hm_tst$Diff<=ac$maximum]<-"passive"
-
-hm_tst$pred<-factor(hm_tst$pred)
-hm_tst$observed<-factor(hm_tst$observation)
-
-cm<-confusionMatrix(hm_tst$observed, hm_tst$pred)
-
-print(cm)
-
## Confusion Matrix and Statistics
-## 
-##           Reference
-## Prediction active passive
-##    active    9613    2292
-##    passive    143    2030
-##                                           
-##                Accuracy : 0.827           
-##                  95% CI : (0.8207, 0.8333)
-##     No Information Rate : 0.693           
-##     P-Value [Acc > NIR] : < 2.2e-16       
-##                                           
-##                   Kappa : 0.5282          
-##                                           
-##  Mcnemar's Test P-Value : < 2.2e-16       
-##                                           
-##             Sensitivity : 0.9853          
-##             Specificity : 0.4697          
-##          Pos Pred Value : 0.8075          
-##          Neg Pred Value : 0.9342          
-##              Prevalence : 0.6930          
-##          Detection Rate : 0.6828          
-##    Detection Prevalence : 0.8456          
-##       Balanced Accuracy : 0.7275          
-##                                           
-##        'Positive' Class : active          
-## 
-
print(cm$byClass)
-
##          Sensitivity          Specificity       Pos Pred Value 
-##            0.9853424            0.4696900            0.8074759 
-##       Neg Pred Value            Precision               Recall 
-##            0.9341924            0.8074759            0.9853424 
-##                   F1           Prevalence       Detection Rate 
-##            0.8875860            0.6929962            0.6828385 
-## Detection Prevalence    Balanced Accuracy 
-##            0.8456457            0.7275162
-
#print(cm$table)
-
-
-

Performance metrics based on 4 dB threshold from the literature

-
#evaluate based on 4 dB value from the literature 
-
-hm_tst$pred<-NA
-hm_tst$pred[hm_tst$Diff>4]<-"active"
-hm_tst$pred[hm_tst$Diff<=4]<-"passive"
-
-hm_tst$pred<-factor(hm_tst$pred)
-cm<-confusionMatrix(hm_tst$observed, hm_tst$pred)
-
-print(cm)
-
## Confusion Matrix and Statistics
-## 
-##           Reference
-## Prediction active passive
-##    active    4851    7054
-##    passive      8    2165
-##                                           
-##                Accuracy : 0.4984          
-##                  95% CI : (0.4901, 0.5067)
-##     No Information Rate : 0.6549          
-##     P-Value [Acc > NIR] : 1               
-##                                           
-##                   Kappa : 0.1736          
-##                                           
-##  Mcnemar's Test P-Value : <2e-16          
-##                                           
-##             Sensitivity : 0.9984          
-##             Specificity : 0.2348          
-##          Pos Pred Value : 0.4075          
-##          Neg Pred Value : 0.9963          
-##              Prevalence : 0.3451          
-##          Detection Rate : 0.3446          
-##    Detection Prevalence : 0.8456          
-##       Balanced Accuracy : 0.6166          
-##                                           
-##        'Positive' Class : active          
-## 
-
print(cm$byClass)
-
##          Sensitivity          Specificity       Pos Pred Value 
-##            0.9983536            0.2348411            0.4074759 
-##       Neg Pred Value            Precision               Recall 
-##            0.9963185            0.4074759            0.9983536 
-##                   F1           Prevalence       Detection Rate 
-##            0.5787402            0.3451485            0.3445802 
-## Detection Prevalence    Balanced Accuracy 
-##            0.8456457            0.6165973
-
print(cm$table)
-
##           Reference
-## Prediction active passive
-##    active    4851    7054
-##    passive      8    2165
-
-
-
-
-

Comparison of the threshold based approach and the random-forest Model

-

When the threshold is calibrated on an adequate training data set, the approach is generally able to separate active and passive behaviour, but its performance metrics (F1 = 0.74, 0.79, 0.89; bats, woodpecker, human) are roughly 10 to 20 points worse and more variable than those of our random forest model (F1 = 0.97, 0.97, 0.98; bats, woodpecker, human). With F-scores between 0.46 and 0.58, the threshold value proposed in the literature performed considerably worse.
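For reference, the F-score used here is the F1-score, the harmonic mean of precision and recall: F1 = 2 * (Precision * Recall) / (Precision + Recall). It ranges from 0 (worst) to 1 (perfect precision and recall).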

-

Since only the bat test data set is balanced, while the woodpecker data is slightly imbalanced and the human activity data set is highly imbalanced, let's also take a look at a metric that takes the class distribution into account:

-

Cohen’s kappa is defined as:

-

K = (p_0 - p_e) / (1 - p_e)

-

where p_0 is the overall accuracy of the model and p_e is the expected agreement between the model predictions and the actual class values if the agreement happened purely by chance.

-

Cohen’s kappa is always less than or equal to 1. Values of 0 or less indicate that the classifier is no better than chance. Landis and Koch (1977) provide a way to characterise the values: according to their scheme, a value < 0 indicates no agreement, 0–0.20 slight agreement, 0.21–0.40 fair agreement, 0.41–0.60 moderate agreement, 0.61–0.80 substantial agreement, and 0.81–1 almost perfect agreement.
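As a quick worked example (with made-up numbers, not from our data), kappa can be computed by hand from a 2x2 confusion matrix:

#worked example with illustrative numbers (not from our data)
tab <- matrix(c(40, 10, 5, 45), nrow = 2, byrow = TRUE,
              dimnames = list(pred = c("active", "passive"),
                              obs  = c("active", "passive")))

n  <- sum(tab)
p0 <- sum(diag(tab)) / n                      #observed accuracy: 0.85
pe <- sum(rowSums(tab) * colSums(tab)) / n^2  #chance agreement: 0.50
(p0 - pe) / (1 - pe)                          #kappa = 0.7, i.e. substantial agreement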

-

Kappa values based on the 4 dB separation value from the literature ranged between 0.17 (humans) and 0.26 (bats), i.e. slight to fair agreement. For the optimized threshold, Kappa values were considerably better in all cases (0.46, 0.58, 0.53; bats, woodpecker, humans), i.e. moderate agreement. However, even the best Kappa value for the threshold-based approach only showed moderate agreement, while all Kappa values based on the random forest model showed an almost perfect agreement (0.94, 0.94, 0.90; bats, woodpecker, humans).

-
-
-

REFERENCES

-
-
Alade, Michael Olusope. 2013. “Investigation of the Effect of Ground and Air Temperature on Very High Frequency Radio Signals.” Journal of Information Engineering and Applications 3: 16–21.

Christin, Sylvain, Éric Hervet, and Nicolas Lecomte. 2019. “Applications for Deep Learning in Ecology.” Methods Ecol. Evol. 10 (10): 1632–44.

Cochran, W W, D W Warner, J R Tester, and V B Kuechle. 1965. “Automatic Radio-Tracking System for Monitoring Animal Movements.” Bioscience 15 (2): 98–100.

Fernández-Delgado, Manuel, Eva Cernadas, Senén Barro, and Dinani Amorim. 2014. “Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?” J. Mach. Learn. Res. 15 (1): 3133–81.

Holland, Richard A, Christoph F J Meyer, Elisabeth K V Kalko, Roland Kays, and Martin Wikelski. 2011. “Emergence Time and Foraging Activity in Pallas’ Mastiff Bat, Molossus Molossus (Chiroptera: Molossidae) in Relation to Sunset/Sunrise and Phase of the Moon.” Acta Chiropt. 13 (2): 399–404.

Kays, Roland, Sameer Tilak, Margaret Crofoot, Tony Fountain, Daniel Obando, Alejandro Ortega, Franz Kuemmeth, et al. 2011. “Tracking Animal Location and Activity with an Automated Radio Telemetry System in a Tropical Rainforest.” Comput. J. 54 (12): 1931–48.

Kjos, Charles G, and William W Cochran. 1970. “Activity of Migrant Thrushes as Determined by Radio-Telemetry.” Wilson Bull., 225–26.

Landis, J R, and G G Koch. 1977. “An Application of Hierarchical Kappa-Type Statistics in the Assessment of Majority Agreement Among Multiple Observers.” Biometrics 33 (2): 363–74.

Meyer, Hanna, Christoph Reudenbach, Tomislav Hengl, Marwan Katurji, and Thomas Nauss. 2018. “Improving Performance of Spatio-Temporal Machine Learning Models Using Forward Feature Selection and Target-Oriented Validation.” Environmental Modelling & Software 101 (March): 1–9.

Probst, Philipp, Marvin N Wright, and Anne-Laure Boulesteix. 2019. “Hyperparameters and Tuning Strategies for Random Forest.” Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 9 (3): e1301.

Shwartz-Ziv, Ravid, and Amitai Armon. 2022. “Tabular Data: Deep Learning Is Not All You Need.” Inf. Fusion 81 (May): 84–90.

diff --git a/html/bat_data_HGAM_tutorial.html b/html/bat_data_HGAM_tutorial.html
deleted file mode 100644
index 58b371f..0000000

Timing of activity probability in Bechstein’s and Leisler’s bats: Data wrangling and checks procedure (Updated: )

Introduction

-

This tutorial shows how probability-of-activity curves can be extracted from the VHF signal data after the classification of 1-min intervals into active or passive states. I’m using Hierarchical Generalized Additive Models (HGAMs) following the Pedersen et al. 2021 article.

-
-

Packages required

-

Before you start, make sure that you have the following packages installed: tidyverse, data.table, lubridate, hms, mgcv, gratia, itsadug, mgcViz, ggthemes, viridis

-

Additionally, we will use functionalities from the tRackIT R package, which is not hosted on CRAN yet. The tRackIT R package also uses dependencies that are not published on CRAN, so before we install the package, we install these dependencies.

-
library(remotes)
-library(devtools)
-Sys.setenv("R_REMOTES_NO_ERRORS_FROM_WARNINGS" = "true") 
-#fast rolling windows 
-install_github("andrewuhl/RollingWindow")
-#fortran based triangulation algorithms
-install_github("barryrowlingson/telemetr")
-

Load the packages

-
library(tidyverse); library(data.table)
-library(lubridate); library(hms)
-library(mgcv); library(gratia); #library(itsadug)
-library(mgcViz); library(ggthemes); library(viridis)
-library(tRackIT)
-
-
-
-

1. Data importation and wrangling

-

The tRackIT-Tutorial-for-activity-classification shows the complete workflow to get from raw signals to activity classifications with a 1-minute resolution. We will skip that part here and start with the classified data of bats that have been tracked from 2018 to 2021 (`ecological_case_study/bats_for_activity`). There is now one file per individual, containing nothing but the timestamp, the activity class per timestamp and the ID. We will need some more information, such as timing in relation to sunset and sunrise as well as species, sex etc. To add it, we will use two functionalities of the tRackIT package: the time_of_day() function calculates the time difference to sunset and sunrise etc., and the add.ID.Info() function adds meta-information for the individual to the data. This section of the tutorial can only be executed if the “ecological_case_study_bat_activity” directory has been downloaded. You can skip this part and continue with the next chunk of code.

-
#set coordinates for sunset/sunrise calculation
-
-Lat<-50.844868
-Lon<-8.663649
-
-#Loop through years 
-for(y in c(2018, 2019, 2020, 2021)){
-#print(y)
-#get projectfile per year
-pr<-getProject(projroot=paste0("D:/data_ms_activity_classification/ecological_case_study_bat_activity/data/bats_for_activity/",y, "/"))
-
-#remove ids with no data at all
-pr$tags<-pr$tags[pr$tags$ID!="Mbec180518_2",]
-pr$tags<-pr$tags[pr$tags$ID!="mbec150155_m",]
-
-#loop through individuals
-for(id in pr$tags$ID){
-  
-  #print(id)
-  
-  anml<-getAnimal(projList = pr, animalID = id)
-  #get activity classification aggregated to 1 Minute
-  fls<-list.files(anml$path$classification,pattern="agg", full.names = TRUE)
-  
-  if(length(fls)>0){
-  
-  data<-data.table::fread(fls[1])
-  
-  #calculate daytime infos
-  data<-time_of_day(data = data, Lat=Lat, Lon = Lon, tcol = "timestamp", tz = "CET", activity_period = "nocturnal")
-  
-  #add id info
-data<-add.ID.Info( data=data, animal=anml)
-
-#number of tracking days
-data$n_days<-as.numeric(abs(difftime(as.Date(data$start_datetime), as.Date(data$stop_datetime))))+1
-
-#binary activity classification
-data$activity<-ifelse(data$prediction=="active", 1,0)  
-
-#correction of typo
-data$species[data$species=="mbec"]<-"Mbec"
-  
- data.table::fwrite(data, paste0("D:/data_ms_activity_classification/ecological_case_study_bat_activity/data/all_bats_aggregated/",anml$meta$animalID, "_",as.character(y), "_aggregated_1_Minute.csv"))
-    
-      }
-}}
-
-
-#merge all data 
-  fls<-list.files("bat_data_HGAM_tutorial/all_bats_aggregated/", full.names = TRUE)
-data<-plyr::ldply(fls, function(x){data.table::fread(x)})
-
-data<-data[data$species=="Mbec" | data$species=="Nlei",]
-
-
-
-#get info which ids should be excluded
-check_ids<-data.table::fread("bat_data_HGAM_tutorial/bats_inspect_id.csv")
-excld<-check_ids[check_ids$exclude_individual=="Y",]
-
-#exclude ids from data
-data<-data[!(data$ID %in% excld$ID),]
-
-#account for rep state transition
-
-df_rep<-read.csv("bat_data_HGAM_tutorial/df_rep_state_transition.csv")
-for (id in unique(df_rep$ID)){
-  
-  if (df_rep$rep2[df_rep$ID==id]=="pregnant"){
-    
-    data$rep.state[data$ID==id & data$year==df_rep$year[df_rep$ID==id] & data$timestamp>=df_rep$start_rep1[df_rep$ID==id]]<-"lactating"}
-  
-  if (df_rep$rep2[df_rep$ID==id]=="post-lactating"){
-    
-    data$rep.state[data$ID==id & data$year==df_rep$year[df_rep$ID==id] & data$timestamp>=df_rep$start_rep2[df_rep$ID==id]]<-"post-lactating"}
-  
-}
-
-
-
-#save data
-data.table::fwrite(data,"bat_data_HGAM_tutorial/all_bats_aggregated.csv")
-
-

1.1 Data importation

-
df_1min <- fread("bat_data_HGAM_tutorial/all_bats_aggregated.csv", stringsAsFactors = T) # import data
-
-
-glimpse(df_1min)
-
## Rows: 818,398
-## Columns: 24
-## $ timestamp      <dttm> 2020-05-15 20:14:00, 2020-05-15 20:15:00, 2020-05-15 2~
-## $ prediction     <fct> active, active, active, active, active, active, active,~
-## $ ID             <fct> h146474, h146474, h146474, h146474, h146474, h146474, h~
-## $ date           <date> 2020-05-15, 2020-05-15, 2020-05-15, 2020-05-15, 2020-0~
-## $ sunset_date    <date> 2020-05-15, 2020-05-15, 2020-05-15, 2020-05-15, 2020-0~
-## $ hour           <int> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22,~
-## $ sunrise_date   <date> 2020-05-16, 2020-05-16, 2020-05-16, 2020-05-16, 2020-0~
-## $ sunset         <dttm> 2020-05-15 19:08:35, 2020-05-15 19:08:35, 2020-05-15 1~
-## $ sunrise        <dttm> 2020-05-16 03:34:14, 2020-05-16 03:34:14, 2020-05-16 0~
-## $ time_to_rise   <dbl> -7.337226, -7.320559, -7.303893, -7.287226, -7.270559, ~
-## $ time_to_set    <dbl> 1.090028, 1.106695, 1.123362, 1.140028, 1.156695, 1.173~
-## $ start_datetime <dttm> 2020-05-15 20:14:00, 2020-05-15 20:14:00, 2020-05-15 2~
-## $ stop_datetime  <dttm> 2020-05-29 21:58:00, 2020-05-29 21:58:00, 2020-05-29 2~
-## $ year           <int> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2~
-## $ ydate          <int> 136, 136, 136, 136, 136, 136, 136, 136, 136, 136, 136, ~
-## $ month          <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5~
-## $ t              <dbl> 0, 60, 120, 180, 240, 300, 360, 420, 480, 540, 600, 660~
-## $ species        <fct> Mbec, Mbec, Mbec, Mbec, Mbec, Mbec, Mbec, Mbec, Mbec, M~
-## $ sex            <fct> w, w, w, w, w, w, w, w, w, w, w, w, w, w, w, w, w, w, w~
-## $ age            <fct> ad, ad, ad, ad, ad, ad, ad, ad, ad, ad, ad, ad, ad, ad,~
-## $ weight         <dbl> 8.2, 8.2, 8.2, 8.2, 8.2, 8.2, 8.2, 8.2, 8.2, 8.2, 8.2, ~
-## $ rep.state      <fct> pregnant, pregnant, pregnant, pregnant, pregnant, pregn~
-## $ n_days         <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,~
-## $ activity       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1~
-
-
-

1.2 Data wrangling

-
df_1min$species<-as.character(df_1min$species)
-df_1min$species[df_1min$species=="Mbec"]<-"M.bechsteinii"
-df_1min$species[df_1min$species=="Nlei"]<-"N.leisleri"
-df_1min$species<-as.factor(df_1min$species)
-
-df_1min$month_f <- factor(month(df_1min$date))
-df_1min$year_f <- factor(df_1min$year)
-df_1min$date <- date(df_1min$date)
-df_1min$hour <- hour(df_1min$timestamp)
-df_1min$week <- week(df_1min$timestamp)
-
-min_set <- min(df_1min$time_to_set)
-max_set <- max(df_1min$time_to_set)
-df_1min$start.event <- ifelse(df_1min$time_to_set==min_set,T,F) # Identify start of the time series 
-df_1min$ydate_f <- as.factor(df_1min$ydate)
-df_1min$date_f <- as.factor(df_1min$date)
-
-K.time_of_day <- length(unique(df_1min$time_to_rise))
-
-df_1min <- df_1min %>% data.frame()
-

Exclude retagged individuals and extract sample sizes

-
df_1min<- df_1min[!(df_1min$ID =="Nlei20211" & df_1min$ydate >= 200),]
-df_1min<- df_1min[!(df_1min$ID =="h146487" & df_1min$ydate >= 150),]
-
-# Sample sizes
-df_1min %>%
-  group_by(species) %>% 
-  summarise(nID = n_distinct(ID),
-            nObs = n(),
-            meanDays = mean(n_days))
-
## # A tibble: 2 x 4
-##   species         nID   nObs meanDays
-##   <fct>         <int>  <int>    <dbl>
-## 1 M.bechsteinii    52 577977     18.4
-## 2 N.leisleri       20 204443     21.3
-
df_1min %>%
-  summarise(nID = n_distinct(ID),
-            nObs = n(),
-            meanDays = mean(n_days))
-
##   nID   nObs meanDays
-## 1  72 782420 19.15002
-

NOTE Some individuals were tagged twice within the same year. We want to avoid these situations and reduce the sampling period to the first tagging event. The full details of this analysis can be found under filename, section 0.2.

-
-
-

1.3 Visual inspection

-

We can visually inspect the data by plotting the probability of activity over the time of day. Here, we calculate this probability over time intervals of 15 minutes for easier visualization.

-
df_15min <- df_1min %>%
-  #filter(species == "M.bechsteinii" | species == "N.leisleri") %>%
-  mutate(interval = as_hms(floor_date(timestamp, unit = "15minutes"))) %>%
-  group_by(ID, species, year, ydate, hour, interval) %>%
-  summarise(n_intervals = length(activity), 
-            n_active = length(activity[activity == 1]),
-            n_passive = length(activity[activity == 0]), 
-            time_to_set = mean(time_to_set))  # calculate average time to sunset for that 15 minute interval
-
-df_15min %>%
-  ggplot(aes(x = time_to_set, y = n_active/n_intervals, color = species)) + 
-  geom_point(alpha = 0.1) +
-  geom_smooth() + scale_color_wsj() + theme_bw(14) + 
-  facet_wrap(~species) + geom_hline(yintercept = 0.5,linetype = "dashed") + 
-  ylim(0, 1) + ylab("Activity probability (15 min interval)") +
-  xlab("Time to sunset (h)")
-

-

We can also inspect each tagged individual in the same way

-
df_15min %>%
-  filter(species == "N.leisleri") %>%
-  ggplot(aes(x = time_to_set, y = n_active/n_intervals)) + 
-  geom_point(alpha = 0.2) +
-  geom_smooth() + scale_color_wsj() + theme_bw(14) + 
-  facet_wrap(~ID) + geom_hline(yintercept = 0.5,linetype = "dashed") + 
-  ylim(0, 1) + ylab("Activity probability (15 min interval)") +
-  xlab("Time to sunset (h)")
-

-
df_15min %>%
-  filter(species == "M.bechsteinii") %>%
-  ggplot(aes(x = time_to_set, y = n_active/n_intervals)) + 
-  geom_point(alpha = 0.2) +
-  geom_smooth() + scale_color_wsj() + theme_bw(14) + 
-  facet_wrap(~ID) + geom_hline(yintercept = 0.5,linetype = "dashed") + 
-  ylim(0, 1) + ylab("Activity probability (15 min interval)") +
-  xlab("Time to sunset (h)")
-

-
-
-
-

2. HGAM for species comparison of daily activity patterns

-

We fit a Hierarchical Generalized Additive Model to compare whether Bechstein’s and Leisler’s bats differ significantly in their daily activity patterns. We assume that the probability of activity is a non-linear function of the time of day, here centered around the time of sunset (t = 0). We use a binomial error term since our activity column is a string of 0s and 1s (i.e. the bat is marked as either passive or active for that minute).

-
-

2.1 Fit models

-

We use circular spline functions to constrain the activity probability to be equal at 00:00 and 23:59 (argument bs = "cc"). We also need to account for the fact that individuals were measured repeatedly, but in different years of monitoring. The simplest way to do that is to include individual and date as random intercepts with the argument bs = "re". Note that there are many flavors of specifying random effects with HGAMs, which apply more or less penalty to the random effects and allow them to deviate from population-level trends. Here, we are mainly interested in species differences and assume that there is a general time-of-activity function that is species specific. Note that more complex random effect structures gave functionally similar results and did not affect our conclusions. Pedersen et al. 2021 provide a great and exhaustive tutorial on the different ways to approach random effects within the GAM framework. There are two other arguments that require our attention. First, we need to account for the fact that observations are likely to be highly autocorrelated because they are taken at 1-min intervals. This autocorrelation value has to be set manually, and we show our procedure for investigating it in our analysis R script (see filename, section 1.1). Second, we need to select the degree of complexity of our smoothing terms, k. After inspection with the k.check function, k = 120 was the highest complexity we could set without overfitting the data (see filename, section 1.2 for details).
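As a rough sketch of this selection procedure (assuming df_1min as prepared above; the full version lives in the analysis script), rho and k could be chosen like this:

```r
# Minimal sketch: estimate the AR(1) parameter from a pilot fit without
# autocorrelation correction, then check the basis dimension k.
library(mgcv)
pilot <- bam(activity ~ s(time_to_set, bs = "cc", k = 120) + s(ID, bs = "re"),
             data = df_1min, method = "fREML", discrete = TRUE,
             family = "binomial")
r1 <- acf(resid(pilot), plot = FALSE)$acf[2]  # lag-1 autocorrelation of the residuals
k.check(pilot)                                # k-index/edf diagnostics per smooth
```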

-

To test for species differences in activity patterns, we fit two models. Model M0 assumes that both species share the same global activity pattern, while model M1 allows each species to follow its own trend (argument by = species within the spline function).

-

Set autocorrelation (r1) and basis dimension (k) of the smooth term

-
r1 <- 0.5765725
-k <- 120
-

Fit model M0

-
fit.gam.M0 <- bam(activity ~ 
-                    s(time_to_set, bs = c("cc"), k = k) +
-                    s(ID, bs = "re") + s(date_f, bs = "re"), 
-                  rho = r1, AR.start = start.event,
-                  data = df_1min, 
-                  method = "fREML", discrete=T, family = "binomial", 
-                  knots=list(time_to_set=c(min_set, max_set))) # knots must be named after the smoothed covariate
-summary(fit.gam.M0)
-
## 
-## Family: binomial 
-## Link function: logit 
-## 
-## Formula:
-## activity ~ s(time_to_set, bs = c("cc"), k = k) + s(ID, bs = "re") + 
-##     s(date_f, bs = "re")
-## 
-## Parametric coefficients:
-##             Estimate Std. Error z value Pr(>|z|)    
-## (Intercept)  -1.7383     0.1339  -12.98   <2e-16 ***
-## ---
-## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-## 
-## Approximate significance of smooth terms:
-##                   edf Ref.df  Chi.sq p-value    
-## s(time_to_set)  51.84    118 4061286  <2e-16 ***
-## s(ID)           67.07     71 1976322  <2e-16 ***
-## s(date_f)      280.26    306 3084458  <2e-16 ***
-## ---
-## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-## 
-## R-sq.(adj) =   0.45   Deviance explained = 38.8%
-## fREML = 8.5139e+05  Scale est. = 1         n = 782420
-
#draw(fit.gam.M0)
-

Fit model M1

-
fit.gam.M1 <- bam(activity ~ 
-                    s(time_to_set, bs = c("cc"), k = k, by = species) +
-                    s(ID, bs = "re") + s(date_f, bs = "re"), 
-                  rho = r1, AR.start = start.event,
-                  data = df_1min, 
-                  method = "fREML", discrete=T, family = "binomial", 
-                  knots=list(time_to_set=c(min_set, max_set)))
-summary(fit.gam.M1)
-
## 
-## Family: binomial 
-## Link function: logit 
-## 
-## Formula:
-## activity ~ s(time_to_set, bs = c("cc"), k = k, by = species) + 
-##     s(ID, bs = "re") + s(date_f, bs = "re")
-## 
-## Parametric coefficients:
-##             Estimate Std. Error z value Pr(>|z|)    
-## (Intercept)  -1.7399     0.1309  -13.29   <2e-16 ***
-## ---
-## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-## 
-## Approximate significance of smooth terms:
-##                                        edf Ref.df  Chi.sq p-value    
-## s(time_to_set):speciesM.bechsteinii  47.23    118 3571851  <2e-16 ***
-## s(time_to_set):speciesN.leisleri     45.82    118  188033  <2e-16 ***
-## s(ID)                                66.32     71 1824189  <2e-16 ***
-## s(date_f)                           282.01    306 2421802  <2e-16 ***
-## ---
-## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-## 
-## R-sq.(adj) =  0.468   Deviance explained = 40.4%
-## fREML = 8.4389e+05  Scale est. = 1         n = 782420
-
#draw(fit.gam.M1)
-

Compare M0 and M1

-
anova(fit.gam.M0, fit.gam.M1)
-
## Analysis of Deviance Table
-## 
-## Model 1: activity ~ s(time_to_set, bs = c("cc"), k = k) + s(ID, bs = "re") + 
-##     s(date_f, bs = "re")
-## Model 2: activity ~ s(time_to_set, bs = c("cc"), k = k, by = species) + 
-##     s(ID, bs = "re") + s(date_f, bs = "re")
-##   Resid. Df Resid. Dev    Df Deviance
-## 1    781994     578790               
-## 2    781942     563613 52.41    15177
-
AIC(fit.gam.M0, fit.gam.M1) %>% 
-  mutate(delta.AIC = AIC - min(AIC)) %>% arrange(delta.AIC)
-
##                  df      AIC delta.AIC
-## fit.gam.M1 443.2209 248309.1      0.00
-## fit.gam.M0 400.6479 263401.1  15092.01
-
-
-

2.2 Plot models

-
# Calculate hourly interval data
-df_1h <- df_1min %>%
-  mutate(interval = as_hms(floor_date(timestamp, unit = "60minutes"))) %>%
-  group_by(ID, species, year, ydate, hour, interval) %>%
-  summarise(n_intervals = length(activity), 
-            n_active = length(activity[activity == 1]),
-            n_passive = length(activity[activity == 0]), 
-            time_to_set = mean(time_to_set))  # calculate average time to sunset for that hourly interval
-df_1h$p_act <- df_1h$n_active/df_1h$n_intervals
-
-fit.values <- evaluate_smooth(fit.gam.M1, "s(time_to_set)", n = 244,
-                              overall_uncertainty = T,
-                              unconditional = T)
-draw(fit.values)
-

-
b0 <- coef(fit.gam.M1)[1]
-Fig5 <- ggplot(data = fit.values, 
-                        aes(x = time_to_set, y = plogis(est+b0), 
-                            color = species, group = species))
-Fig5a <- Fig5 +
-  geom_point(data = df_1h, alpha = .1,
-             aes(x = time_to_set, 
-                 y = p_act)) +
-  geom_ribbon(aes(ymin = plogis(est+b0 - 2 * se) ,
-                  ymax = plogis(est+b0 + 2 * se)), 
-              fill = "grey", color = "grey") +
-  geom_line(size = .5) + 
-  geom_hline(yintercept = 0.5, linetype = "dashed") +
-  scale_color_wsj() + theme_bw(14) +
-  xlab("") + 
-  ylab("Activity probability \n") + 
-  ylim(0, 1) +
-  theme(legend.position = c(.1,.745))
-Fig5a
-

-
fit.delta <- difference_smooths(fit.gam.M1, "s(time_to_set)", n = 244)
-
-Fig5b <- ggplot(data = fit.delta, 
-                    aes(x = time_to_set, y = diff)) +
-  geom_ribbon(aes(ymin = lower,
-                  ymax = upper), color = "grey",
-              alpha = 0.3) +
-  geom_line() + geom_hline(yintercept = 0, linetype = "dashed") +
-  theme_bw(14) + theme(legend.position = "none") +
-  xlab("Time since sunset (h)") + 
-  ylab("Activity difference \n (M.bechsteinii - N.leisleri)")
-
-Fig5 <- Fig5a / Fig5b
-Fig5
-

-
-
-

2.3 Extract timing of activity values

-

The following section shows how different aspects of the daily activity patterns can be extracted from the HGAM output. We focus here on determining the timing of the onset and end of the daily activity period, the timing of peak activity, and the intensity of activity during the night.

-
# Onset of activity: when does p(activity) first exceed 0.5?
-# End of activity: when does p(activity) last exceed 0.5?
-fit.values %>% group_by(species) %>% 
-  filter(plogis(est+b0) > .5) %>% 
-  summarise(a.onset = as_hms(min(time_to_set)*60),
-            a.end = as_hms(max(time_to_set)*60))
-
## # A tibble: 2 x 3
-##   species       a.onset       a.end        
-##   <fct>         <time>        <time>       
-## 1 M.bechsteinii 00'30.879867" 07'10.972640"
-## 2 N.leisleri    00'12.125519" 07'23.475539"
-
# Peak activity: what are the two highest values for p(activity)?
-fit.values %>% group_by(species) %>% 
-  filter(plogis(est+b0) == max(plogis(est+b0))) %>% 
-  summarise(peak.a = max(plogis(est+b0)),
-            peak.a.low = plogis(est+b0-2*se),
-            peak.a.up = plogis(est+b0+2*se))
-
## # A tibble: 2 x 4
-##   species       peak.a peak.a.low peak.a.up
-##   <fct>          <dbl>      <dbl>     <dbl>
-## 1 M.bechsteinii  0.770      0.751     0.788
-## 2 N.leisleri     0.822      0.798     0.845
-
# Peak activity timing: time of day with maximum p(activity)
-fit.values %>% group_by(species) %>% 
-  filter(plogis(est+b0) == max(plogis(est+b0))) %>% 
-  group_by(species) %>%
-  summarise(t.peak = as_hms(time_to_set*60))
-
## # A tibble: 2 x 2
-##   species       t.peak       
-##   <fct>         <time>       
-## 1 M.bechsteinii 01'33.394363"
-## 2 N.leisleri    00'37.131317"
-
# Activity density: area under the curve when p(activity) > 0.5
-fit.values %>% group_by(species) %>% 
-  filter(plogis(est+b0) > .5) %>% 
-  summarise(auc = bayestestR::auc(time_to_set, plogis(est+b0), 
-                                  method = "spline"),
-            auc.low = bayestestR::auc(time_to_set, plogis(est+b0-2*se), 
-                                  method = "spline"),
-            auc.up = bayestestR::auc(time_to_set, plogis(est+b0+2*se), 
-                                  method = "spline"))
-
## # A tibble: 2 x 4
-##   species         auc auc.low auc.up
-##   <fct>         <dbl>   <dbl>  <dbl>
-## 1 M.bechsteinii  4.70    4.56   4.83
-## 2 N.leisleri     3.42    3.23   3.62
-
-
-
-

3. HGAM for comparing activity patterns across reproductive periods

-

We can also use HGAMs to compare the timing of daily activity depending on the reproductive status of individuals within each species. The models used here are very similar to those in section 2, with the main difference that we now compare a model that assumes a common activity pattern for all statuses (model 0) with one that allows each reproductive status to have its own smoothing function (model 1).

-

Exclude individuals of unknown reproductive status

-
df_1min %>% filter(rep.state != "unknown") %>% 
-  group_by(species) %>% 
-  summarise(nID = n_distinct(ID),
-            nObs = n())
-
## # A tibble: 2 x 3
-##   species         nID   nObs
-##   <fct>         <int>  <int>
-## 1 M.bechsteinii    35 384536
-## 2 N.leisleri       19 203261
-
df_1min %>% filter(rep.state != "unknown") %>% 
-  group_by(species, rep.state) %>% 
-  summarise(nID = n_distinct(ID),
-            nObs = n())
-
## # A tibble: 8 x 4
-## # Groups:   species [2]
-##   species       rep.state        nID   nObs
-##   <fct>         <fct>          <int>  <int>
-## 1 M.bechsteinii lactating         13 102076
-## 2 M.bechsteinii non.repro          4  68685
-## 3 M.bechsteinii post-lactating    12 135582
-## 4 M.bechsteinii pregnant          13  78193
-## 5 N.leisleri    lactating         11  71560
-## 6 N.leisleri    non.repro          2  23296
-## 7 N.leisleri    post-lactating     7  58889
-## 8 N.leisleri    pregnant           6  49516
-
# Remove individuals of unknown status
-df_1min.Mb <- df_1min %>% filter(species == "M.bechsteinii" & rep.state != "unknown") %>% droplevels()
-df_1min.Nl <- df_1min %>% filter(species == "N.leisleri" & rep.state != "unknown") %>% droplevels()
-
-

3.1 Bechstein’s bats

-
fit.gam.Mb.0 <- bam(activity ~ 
-                      s(time_to_set, bs = c("cc"), k = k) +
-                      s(ID, bs = "re") + #, k = 25
-                      s(date_f, bs = "re"), #k = 173 
-                    rho = r1, AR.start = start.event,
-                    data = df_1min.Mb, 
-                    method = "fREML", discrete=T, family = "binomial", 
-                    knots=list(time_to_set=c(min_set, max_set)))
-summary(fit.gam.Mb.0)
-
## 
-## Family: binomial 
-## Link function: logit 
-## 
-## Formula:
-## activity ~ s(time_to_set, bs = c("cc"), k = k) + s(ID, bs = "re") + 
-##     s(date_f, bs = "re")
-## 
-## Parametric coefficients:
-##             Estimate Std. Error z value Pr(>|z|)    
-## (Intercept)  -1.6686     0.1146  -14.57   <2e-16 ***
-## ---
-## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-## 
-## Approximate significance of smooth terms:
-##                   edf Ref.df Chi.sq  p-value    
-## s(time_to_set)  45.50    118 573547  < 2e-16 ***
-## s(ID)           28.39     35  41968  0.00532 ** 
-## s(date_f)      226.32    252  63224 6.82e-05 ***
-## ---
-## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-## 
-## R-sq.(adj) =  0.519   Deviance explained = 44.5%
-## fREML = 4.0871e+05  Scale est. = 1         n = 384536
-
### M1: difference in reproductive status on average & in smooth ----
-fit.gam.Mb.1 <- bam(activity ~ rep.state +
-                    s(time_to_set, bs = c("cc"), k = k, 
-                      by = rep.state) +
-                    s(ID, bs = "re") + s(date_f, bs = "re"), 
-                  rho = r1, AR.start = start.event,
-                  data = df_1min.Mb, 
-                  method = "fREML", discrete=T, family = "binomial", 
-                  knots=list(time_to_set=c(min_set, max_set)))
-summary(fit.gam.Mb.1)
-
## 
-## Family: binomial 
-## Link function: logit 
-## 
-## Formula:
-## activity ~ rep.state + s(time_to_set, bs = c("cc"), k = k, by = rep.state) + 
-##     s(ID, bs = "re") + s(date_f, bs = "re")
-## 
-## Parametric coefficients:
-##                         Estimate Std. Error z value Pr(>|z|)    
-## (Intercept)              -1.2606     0.1477  -8.532  < 2e-16 ***
-## rep.statenon.repro       -0.7385     0.2799  -2.639  0.00832 ** 
-## rep.statepost-lactating  -0.8867     0.1325  -6.692  2.2e-11 ***
-## rep.statepregnant        -0.1065     0.1996  -0.534  0.59357    
-## ---
-## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-## 
-## Approximate significance of smooth terms:
-##                                           edf Ref.df Chi.sq p-value    
-## s(time_to_set):rep.statelactating       33.82    118  77911 < 2e-16 ***
-## s(time_to_set):rep.statenon.repro       30.96    118 159447 < 2e-16 ***
-## s(time_to_set):rep.statepost-lactating  38.59    118 159446 < 2e-16 ***
-## s(time_to_set):rep.statepregnant        31.39    118  58687 < 2e-16 ***
-## s(ID)                                   26.75     34  17596  0.0648 .  
-## s(date_f)                              224.82    252  44575 7.5e-06 ***
-## ---
-## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-## 
-## R-sq.(adj) =  0.526   Deviance explained = 45.3%
-## fREML = 4.07e+05  Scale est. = 1         n = 384536
-
AIC(fit.gam.Mb.0, fit.gam.Mb.1) %>% 
-  mutate(delta.AIC = AIC - min(AIC)) %>% arrange(delta.AIC)
-
##                    df      AIC delta.AIC
-## fit.gam.Mb.1 392.2189 106210.7     0.000
-## fit.gam.Mb.0 301.9440 109867.9  3657.224
-
-
-

3.2 Leisler’s bats

-
fit.gam.Nl.0 <- bam(activity ~ 
-                      s(time_to_set, bs = c("cc"), k = k) +
-                      s(ID, bs = "re") + s(date_f, bs = "re"), 
-                    rho = r1, AR.start = start.event,
-                    data = df_1min.Nl, 
-                    method = "fREML", discrete=T, family = "binomial", 
-                    knots=list(time_to_set=c(min_set, max_set)))
-summary(fit.gam.Nl.0)
-
## 
-## Family: binomial 
-## Link function: logit 
-## 
-## Formula:
-## activity ~ s(time_to_set, bs = c("cc"), k = k) + s(ID, bs = "re") + 
-##     s(date_f, bs = "re")
-## 
-## Parametric coefficients:
-##             Estimate Std. Error z value Pr(>|z|)    
-## (Intercept)  -1.7416     0.3772  -4.617 3.89e-06 ***
-## ---
-## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-## 
-## Approximate significance of smooth terms:
-##                   edf Ref.df  Chi.sq p-value    
-## s(time_to_set)  45.71    118  401373  <2e-16 ***
-## s(ID)           16.94     18 2697157  <2e-16 ***
-## s(date_f)      114.41    126 4172289  <2e-16 ***
-## ---
-## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-## 
-## R-sq.(adj) =   0.35   Deviance explained = 32.8%
-## fREML = 2.1573e+05  Scale est. = 1         n = 203261
-
fit.gam.Nl.1 <- bam(activity ~ rep.state +
-                      s(time_to_set, bs = c("cc"), k = k,
-                        by = rep.state) +
-                      s(ID, bs = "re") + s(date_f, bs = "re"), 
-                    rho = r1, AR.start = start.event,
-                    data = df_1min.Nl, 
-                    method = "fREML", discrete=T, family = "binomial", 
-                    knots=list(time_to_set=c(min_set, max_set)))
-summary(fit.gam.Nl.1)
-
## 
-## Family: binomial 
-## Link function: logit 
-## 
-## Formula:
-## activity ~ rep.state + s(time_to_set, bs = c("cc"), k = k, by = rep.state) + 
-##     s(ID, bs = "re") + s(date_f, bs = "re")
-## 
-## Parametric coefficients:
-##                         Estimate Std. Error z value Pr(>|z|)    
-## (Intercept)              -1.3364     0.5211  -2.564 0.010336 *  
-## rep.statenon.repro        1.0943     1.3571   0.806 0.420023    
-## rep.statepost-lactating   0.7711     0.2588   2.980 0.002887 ** 
-## rep.statepregnant        -2.6379     0.6794  -3.883 0.000103 ***
-## ---
-## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-## 
-## Approximate significance of smooth terms:
-##                                           edf Ref.df Chi.sq p-value    
-## s(time_to_set):rep.statelactating       36.84    118 146915  <2e-16 ***
-## s(time_to_set):rep.statenon.repro       28.29    118  17017  <2e-16 ***
-## s(time_to_set):rep.statepost-lactating  36.58    118  46074  <2e-16 ***
-## s(time_to_set):rep.statepregnant        36.46    118  58619  <2e-16 ***
-## s(ID)                                   15.94     17 904539   0.022 *  
-## s(date_f)                              112.07    126 765022  <2e-16 ***
-## ---
-## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-## 
-## R-sq.(adj) =  0.367   Deviance explained = 34.3%
-## fREML = 2.1435e+05  Scale est. = 1         n = 203261
-
AIC(fit.gam.Nl.0, fit.gam.Nl.1) %>% 
-  mutate(delta.AIC = AIC - min(AIC)) %>% arrange(delta.AIC)
-
##                    df      AIC delta.AIC
-## fit.gam.Nl.1 272.7888 54313.97     0.000
-## fit.gam.Nl.0 178.7082 57287.66  2973.691
-
-
-

3.3 Plot figure

-
fit.values.Mb <- evaluate_smooth(fit.gam.Mb.1, "s(time_to_set)", n = 244,
-                                 overall_uncertainty = T,
-                                 unconditional = T)
-
-fit.values.Nl <- evaluate_smooth(fit.gam.Nl.1, "s(time_to_set)", n = 244,
-                                 overall_uncertainty = T,
-                                 unconditional = T)
-
-b0 <- coef(fit.gam.Mb.1)[1]
-Fig6a <- ggplot(data = fit.values.Mb, 
-       aes(x = time_to_set, y = plogis(est+b0), 
-           fill = rep.state, group = rep.state)) +
-  geom_ribbon(aes(ymin = plogis(est+b0 - 2 * se) ,
-                  ymax = plogis(est+b0 + 2 * se)),
-              alpha = .5) +
-  geom_line(size = .5) + 
-  geom_hline(yintercept = 0.5, linetype = "dashed") +
-  scale_fill_wsj() + theme_bw(14) +
-  #facet_wrap(~rep.state) +
-  xlab("time since sunset (h)") + 
-  ylab("Activity probability \n") + 
-  ylim(0, 1) +
-  #theme(legend.position = "none") +
-  theme(legend.position = c(.15,.745),
-        legend.title=element_blank()) +
-  ggtitle("M.bechsteinii")
-
-b0 <- coef(fit.gam.Nl.1)[1]
-Fig6b <- ggplot(data = fit.values.Nl, 
-       aes(x = time_to_set, y = plogis(est+b0), 
-           fill = rep.state, group = rep.state)) +
-  geom_ribbon(aes(ymin = plogis(est+b0 - 2 * se) ,
-                  ymax = plogis(est+b0 + 2 * se)),
-              alpha = .5) +
-  geom_line(size = .5) + 
-  geom_hline(yintercept = 0.5, linetype = "dashed") +
-  scale_fill_wsj() + theme_bw(14) +
-  #facet_wrap(~rep.state) +
-  xlab("time since sunset (h)") + 
-  ylab("Activity probability \n") + 
-  ylim(0, 1) +
-  #theme(legend.position = "none") +
-  theme(legend.position = "none") +
-  ggtitle("N.leisleri")
-
-Fig6 <- Fig6a + Fig6b
-Fig6
-

-
diff --git a/html/tRackIT-Tutorial-for-activity-classification.html b/html/tRackIT-Tutorial-for-activity-classification.html
deleted file mode 100644
index e003316..0000000
--- a/html/tRackIT-Tutorial-for-activity-classification.html
+++ /dev/null
@@ -1,2082 +0,0 @@
-Activity classification with the tRackIT R-Package

The tRackIT package

-

The tRackIT R package provides a range of functionalities for processing data recorded with automatic telemetry systems. It is specifically tailored to data recorded with one of the sensors from the tRackIT ecosystem (Höchst & Gottwald et al 2021, Gottwald and Lampe et al. 2020, Gottwald 2019), but can also be used for other systems. This tutorial will first explain the basic functionalities in terms of project and tagged individual management. Subsequently, the processing steps for classification of VHF signals into fundamental behaviours with the help of specially trained machine-learning models (Gottwald et al. 2022) will be presented. The dataset used in this tutorial was created to test the transferability of our model trained on bat movements to observed movements of a middle spotted woodpecker. See chapter validation and the corresponding publication for details. This work is conducted within the Natur 4.0 | Sensing Biodiversity project (https://www.uni-marburg.de/de/fb19/natur40). Please download the tutorial data from (doifjgkug).

-
-
-

Package installation

-

The tRackIT R-package uses dependencies that are not published on CRAN, so we install these dependencies before installing the package itself.

- -
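(The rendered code chunk was stripped from this page; the following is reconstructed from the package vignette, rmd/Vignette_tRackIT.Rmd, added later in this patch:)

```r
# Install the non-CRAN dependencies, as in the package vignette:
library(remotes)
Sys.setenv("R_REMOTES_NO_ERRORS_FROM_WARNINGS" = "true")
remotes::install_github("Nature40/telemetr")        # localization methods (fortran backend)
devtools::install_github("andrewuhl/RollingWindow") # fast C++ rolling windows
```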

Now we can install the actual package from the Nature40 GitHub page.

- -
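(Again reconstructed from the vignette later in this patch:)

```r
devtools::install_github("Nature40/tRackIT")
```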

Additional packages needed are: caret, randomForest, data.tree, plyr, ggplot2, lubridate, data.table and tidyverse.

-

Load packages

- -
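(The load chunk was also stripped; it presumably amounts to:)

```r
library(tRackIT)
# helper packages listed above:
library(caret); library(randomForest); library(data.tree); library(plyr)
library(ggplot2); library(lubridate); library(data.table); library(tidyverse)
```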

After the package has been installed, please store the two model files from the downloaded data folder modelling/models (m_r1.rds, m_r2.rds) in the extdata folder of the tRackIT-package.

-
-
-

Initialize a project

-

For reasons of interchangeability of tRackIT projects, the workflow requires the initialisation of a project. The initProject() function creates a project folder (projroot) with various subfolders for products and reference data, as well as a project file containing a list in which all paths are stored, a data.frame with the coordinates of the tRackIT stations and a data.frame with information on the tagged individuals. In addition, the path to the raw data as created by the software running on the stations, which may be stored on an external hard disk and does not need to be integrated into the project, can be specified in the initProject() function. The EPSG code of the stations' coordinate system needs to be provided; coordinates will be transformed to lat/lon.

-

Since data on tagged individuals and stations may be stored in tables of differing layouts, the names of the columns containing the most important information must be passed to the initProject() function. Columns that are not available need to be set to NULL (e.g. weight_col=NULL, o_col=NULL).

- -
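(A reconstruction of the stripped chunk, taken from the package vignette added later in this patch, which sets up the same woodpecker project:)

```r
# Station metadata and a minimal tag table, as in the vignette:
stations <- data.table::fread("data/reference_data/stations_mof_2021.csv")
tag_df   <- data.frame(id = "woodpecker", freq = 150050,
                       start = "2021-06-10", end = "2021-06-14")
projroot <- paste0(getwd(), "/")

test_project <- initProject(projroot = projroot, logger_data_raw = ".",
                            tags = tag_df, id_col = "id", start_col = "start",
                            end_col = "end", freq_col = "freq",
                            stations = stations, s_col = "station",
                            x_col = "X", y_col = "Y", r_col = "receiver",
                            o_col = "orientation", epsg = 4326, tz = "CET")
```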

The created folder structure looks like this

- -
##                     levelName
-## 1  projroot                  
-## 2   ¦--results               
-## 3   ¦--R                     
-## 4   ¦   ¦--scripts           
-## 5   ¦   °--fun               
-## 6   °--data                  
-## 7       ¦--batch_awk         
-## 8       ¦--calibration_curves
-## 9       ¦--catalogues        
-## 10      ¦--correction_values 
-## 11      ¦--individuals       
-## 12      ¦--logger_data_csv   
-## 13      ¦--models            
-## 14      ¦--param_lst         
-## 15      °--reference_data
-
-
-

Initializing individuals

-

The initAnimal() function creates a subdirectory in projroot/data/individuals/ with different subdirectories for data-processing products, as well as an animal file containing a list of metadata such as species, sex, age or reproductive state. Again, if any piece of metadata is not available, set it to NULL.

- -
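(The chunk was stripped from this page; the call below uses the argument names from the package vignette added later in this patch, with metadata values matching the animal file printed in the next section:)

```r
woodpecker <- initAnimal(projList = test_project, projroot = projroot,
                         saveAnml = TRUE, animalID = "woodpecker",
                         species = "Dendrocopos major", sex = "male", age = "ad",
                         weight = "20g", rep.state = "breeding", freq = 150050,
                         start = "2021-06-10", end = "2021-06-14",
                         duration_min = 0.012, duration_max = 0.025)
```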

Folder structure looks like this:

- -
##                                levelName
-## 1  projroot                             
-## 2   °--data                             
-## 3       °--individuals                  
-## 4           °--woodpecker               
-## 5               ¦--bearings             
-## 6               ¦--bearings_filtered    
-## 7               ¦--calibrated           
-## 8               ¦--classification       
-## 9               ¦--filtered             
-## 10              ¦--filtered_awk         
-## 11              ¦--gps_timematch        
-## 12              ¦--imputed              
-## 13              ¦--logger_timematch     
-## 14              ¦--station_timematch    
-## 15              ¦--triangulations       
-## 16              ¦--variables            
-## 17              °--woodpecker_idFile.rds
-
-

Get project and individuals

-

The work steps described above only have to be carried out once. Afterwards, the project file can simply be read into the working environment with the function getProject(), and each individual with the function getAnimal().

- - -
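(Reconstructed from the vignette later in this patch; printing the meta list produces the output below:)

```r
test_project <- getProject(projroot = projroot)
woodpecker   <- getAnimal(projroot = projroot, animalID = "woodpecker")
woodpecker$meta
```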
## $animalID
-## [1] "woodpecker"
-## 
-## $species
-## [1] "Dendrocopos major"
-## 
-## $sex
-## [1] "male"
-## 
-## $age
-## [1] "ad"
-## 
-## $weight
-## [1] "20g"
-## 
-## $rep.state
-## [1] "breeding"
-## 
-## $freq
-## [1] "150050"
-## 
-## $start
-## [1] "2021-06-10"
-## 
-## $end
-## [1] "2021-06-14"
-## 
-## $duration_min
-## [1] "0.012"
-## 
-## $duration_max
-## [1] "0.025"
-
-
-
-

Raw data processing

-

In this step, the raw data recorded on the stations is combined into one file per station. NOTE: This part of the tutorial is only valid for data recorded with tRackIT stations. Stations of other designs may have a different data structure. On tRackIT stations, new csv files are created for the current run every time the station reboots. Over the course of one recording period, a large number of individual files can be created, which are to be combined in the next step and saved within the project structure. The read.logger.data.tRackIT() function does this for us. In order for it to do this, the path to the folder containing the raw data must have been specified in the initProject() function (argument logger_data_raw=) and the data must be present in the folder in the following structure: logger_data_raw/name-of-station/radiotracking/files.csv.

- -

This step, too, only needs to be carried out once per project (or once per data collection within a project).

-
-
-

Processing of individuals for activity classification

-

Processing of individuals involves

-
  • Filtering
  • Calculation of variables for ML classification
  • Classification
  • Aggregation
-
-

Filtering

-

Individuals are filtered from the raw data compiled in the previous step by frequency, signal length, and start and end of the tagging period using the filter.tRackIT() function.

- -
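(Reconstructed from the vignette later in this patch, which spells the function with an underscore, filter_tRackIT():)

```r
filter_tRackIT(animal = woodpecker, freq_er = 2,
               path_to_data = test_project$path$csv,
               d_min = woodpecker$meta$duration_min,
               d_max = woodpecker$meta$duration_max,
               start = woodpecker$meta$start, end = woodpecker$meta$end,
               freq = woodpecker$meta$freq)
```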
-
-

Calculation of variables for ML classification

-

Here we use the activity.vars.tRackIT() function for ML-variable calculation. It initially divides the dataset of each station into 5-minute intervals. For each interval, it selects the receiver with the most data entries out of 4 possible receivers. If available, the receiver with the second-most entries is also selected. It then calculates various predictor variables that mirror the variation of the signal strength over time by applying rolling windows to the filtered VHF data recorded by the tRackIT stations. In order to smooth out potentially distracting fluctuations in the signal, a Hampel filter as well as a mean and a max filter are applied to the raw data of the main receiver in a rolling window of +/- 10 data entries. For the raw data as well as the smoothed data, the variance, standard deviation, kurtosis, skewness and sum of squares are calculated, also in a rolling window of +/- 10 data points.

- -
-
-

Classification

-

Here we use the activity.vars.predict.tRackIT() function to apply the trained models to the data created in the previous step. The two models (m_r1.rds, m_r2.rds, both > 1 kb) from the directory tRackIT_activity/models need to be stored in the extdata directory of the tRackIT package.

- -

The classified files are stored in wood_pecker/data/individuals/woodpecker/classification/. Let's take a look:

- -

-
-
-

Aggregation

-

Finally, the data can be aggregated by choosing the most frequent class value in a given time interval using the function activity.aggregate.tRackIT(). Here we chose an interval of 1 minute. The aggregated data is stored in projroot/data/individuals/woodpecker/classification/.

- -
-
-

Validation

-

To validate the applicability to birds, we attached a transmitter to the back of a middle spotted woodpecker. Next, we set up a daylight variant of our custom-made video recorder units - “BatRack” units - in front of its nesting tree to automatically record videos of its behavior (Gottwald & Lampe et al. 2021; https://nature40.github.io/BatRack/ (vid 2)). BatRacks consisted of a VHF antenna and a video unit connected to a Raspberry Pi mini-computer. The camera unit was automatically triggered by the VHF signal of the woodpecker's transmitter and started recording when the signal strength exceeded a threshold of -60 dBW, i.e. when the bird flew close to its nesting tree and the BatRack system. A typical recorded sequence consisted of flying, hopping up the stem and a very short feeding sequence in which the bird sat still at the entrance of the nest. Since the feeding sequence was in most cases shorter than 3 consecutive VHF signals, we classified all signals that were simultaneously recorded by one or more of the tRackIT stations as active. In order to generate sufficient inactive sequences, we sampled 1000 random data points from signals recorded by one or more tRackIT stations each night between 0 and 2 a.m. over four consecutive nights during the BirdRack observation period. Let's see how well our model performs:

- -
## 
-##  active passive 
-##    8310     431
- -
## 
-##  active passive 
-##      30    8770
- -
##          Sensitivity          Specificity       Pos Pred Value 
-##            0.9506921            0.9965909            0.9964029 
-##       Neg Pred Value            Precision               Recall 
-##            0.9531573            0.9964029            0.9506921 
-##                   F1           Prevalence       Detection Rate 
-##            0.9730109            0.4983182            0.4737472 
-## Detection Prevalence    Balanced Accuracy 
-##            0.4754575            0.9736415
-

## Confusion matrix

- -
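(A sketch of how the metrics above can be produced with caret, mirroring the validation Rmd added later in this patch; it assumes a data.frame wp with factor columns pred and obs:)

```r
cm_wp <- caret::confusionMatrix(wp$pred, wp$obs)
print(cm_wp$table)    # the confusion matrix
print(cm_wp$byClass)  # sensitivity, specificity, precision, F1, ...
```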

-
- - - - - - - - - - - - - - - - From e09e185e98b20c99c372510f5a50593c0b372380 Mon Sep 17 00:00:00 2001 From: JannisGottwald <53615516+JannisGottwald@users.noreply.github.com> Date: Fri, 16 Sep 2022 12:37:40 +0300 Subject: [PATCH 2/5] Add files via upload --- rmd/Model_training_tuning_and_validation.Rmd | 754 +++++++++++++++++++ rmd/Vignette_tRackIT.Rmd | 349 +++++++++ rmd/bat_data_HGAM_tutorial.Rmd | 513 +++++++++++++ 3 files changed, 1616 insertions(+) create mode 100644 rmd/Model_training_tuning_and_validation.Rmd create mode 100644 rmd/Vignette_tRackIT.Rmd create mode 100644 rmd/bat_data_HGAM_tutorial.Rmd diff --git a/rmd/Model_training_tuning_and_validation.Rmd b/rmd/Model_training_tuning_and_validation.Rmd new file mode 100644 index 0000000..189294a --- /dev/null +++ b/rmd/Model_training_tuning_and_validation.Rmd @@ -0,0 +1,754 @@ +--- +title: "Radom Forest model training, tuning and validation for activity classification" +author: + affiliation: Philipps-University of Marburg, Environmental informatics + corresponding author email: + name: Jannis Gottwald +date: "`r format(Sys.time(), '%d %m, %y')`" +bibliography: "references.bib" +output: + html_document: + highlight: tango + theme: united + toc: yes + toc_depth: 4 + toc_float: yes + pdf_document: + toc: yes + toc_depth: '4' + word_document: + toc: yes + toc_depth: '4' +editor_options: + markdown: + wrap: 72 +--- + +```{r setup, include = F} +knitr::opts_chunk$set(echo = T, message = F, warning = F, tidy = T, + fig.width=12, fig.height=12) +``` + +```{r klippy, echo=FALSE, include=TRUE} +klippy::klippy() # Allow copy code options in html file +getwd() +``` + + + +## Introduction + +Small movements of tagged animals result in discernible variations in the strength of the received signal (@Cochran1965-zk; @Kjos1970-nw) that reflect changes in the angle and distance between the transmitter and receiver. @Kays2011-kb proposed a method for automatically classifying active and passive behaviour based on a threshold difference in the signal strength of successive VHF signals recorded by a customised automatic radio-tracking system. However, machine learning (ML) algorithms are optimised for the recognition of complex patterns in a dataset and are typically robust against factors that influence signal propagation, such as changes in temperature and humidity, physical contact with conspecifics and/or multipath signal propagation (@Alade2013-oi). Accordingly, a ML model trained with a dataset encompassing the possible diversity of signal patterns related to active and passive behaviour can be expected to perform at least as well as a threshold-based approach. +In this work, we built on the methodology of Kays et al. (2011) by calibrating two random forest models (1 for data comming from only one receiver and one for data coming from at least two receivers), based on millions of data points representing the behaviours of multiple tagged individuals of two temperate bat species (Myotis bechsteinii, Nyctalus leisleri). + +The method was tested by applying it to independent data from bats, humans, and a bird species and then comparing the results with those obtained using the threshold-based approach of Kays et al. (2011) applying a threshold value of 2.5 dBw signalstrength difference suggested by [Schofield et al 2018](https://doi.org/10.1642/AUK-17-229.1). + +In order to make our work comprehensible, code and data are made available to all interested parties. 
Data for model training can be found[here](https://data.uni-marburg.de/handle/dataumr/149). Data for evaluation is stored [here](https://data.uni-marburg.de/handle/dataumr/151). + +This resource contains the following steps: + ++ Training two models in conjunction with forward feature selection ++ Parameter tuning ++ Validation using three independent data sets ++ Comparison to a threshold based approach as proposed in Kays et al. 2011 + + +But before we get started: + + + +### Why Random Forest? + +Although deep learning methods have been successfully applied to several ecological problems where large amounts of data are available (@Christin2019-xq), we use a random forest model due to the following reasons: + ++ (a) developing a (supervised) deep learning method requires considerable effort for selecting an appropriate neural network architecture, choosing an appropriate framework to implement the neural network and training, validating, testing, and refining the neural network (@Christin2019-xq) ++ (b) essentially, we have to solve a binary classification task based on tabular data; in this setting, tree ensemble methods such as random forests seem to have clear advantages - they are less computationally intensive, easy to implement, robust, and at least as performant as deep learning (@Shwartz-Ziv2022-eu) ++ (c) in a large study comparing 179 classifiers applied to the 121 classification data sets of the UCI repository, random forests are the best classifiers in over 90% of the cases (@Fernandez-Delgado2014-te). + +## Model training and tuning + +For model training and tuning we use the `caret` R-Package (Kuhn 2008). For the forward feature selection we use the `CAST` R-Package developed by Meyer et al. 2018. + + +Additional packages needed are: `randomForest`,`ranger`, `doParallel` , `MLeval`, `data.table`, `dplyr`, `plyr` + +Load packages +```{r message=FALSE} +library(caret); library(randomForest);library(ranger); library(doParallel);library(MLeval);library(CAST);library(data.table);library(dplyr);library(plyr) + +``` + + +### Data for model training +Only one antenna is necessary to classify VHF signals into active vs. passive states (Kays et al. 2011). However, agreement between receivers of the same station provides additional information and can improve the reliability of the classification. +Our groundtruth dataset was balanced by randomly down-sampling the activity class with the most data to the amount of data contained by the class with the least data. These balanced datasets were then split into 50% training data and 50% test data for data originating from one receiver. The same procedure was used for data derived from the signals of two receivers, resulting in two training and two test datasets. From a total of 3,243,753 VHF signals, 124,898 signals were assigned to train the two-receiver model and 294,440 signals to train the one-receiver model (Table 1). + +### Feature selection +Since not all variables are equally important to the model and some may even be misleading,we performed a forward feature selection on 50% of the training data. The forward feature selection algorithm implemented in the R package CAST (@Meyer2018-rj) selects the best pair of all possible two variable combinations by evaluating the performance of a k-fold cross-validation (CV). The algorithm iteratively increases the number of predictors until no improvement of the performance is achieved by adding further variables. 
+ + +#### 1 Receiver model +```{r} + +# get data and check class distribution +data_1<-readRDS("model_tuning/data/batsTrain_1_receiver.rds") + +table(data_1$Class) + +``` + + +```{r eval = FALSE} + +#forward feature selection + +predictors<-names(data_1[, -ncol(data_1)]) + + +cl<-makeCluster(10) + +registerDoParallel(cl) + +ctrl <- trainControl(## 10-fold CV + method = "cv", + number = 10) + + +#run ffs model with 10-fold CV +set.seed(10) + +ffsmodel <- ffs(predictors=data_1[,predictors],response = data_1$Class,method="rf", + metric="Kappa", + tuneLength = 1, + trControl=ctrl, + verbose = TRUE) + +ffsmodel$selectedvars + + +saveRDS(ffsmodel, "model_tunig/models/m_r1.rds") + +stopCluster(cl) + +``` + + + +#### Results feature selection + +Red dots display two-variables combinations, dots with the colors from yellow to pink stand for models to each of which another variable has been added. Dots with a black border mark the optimal variable combination in the respective iteration. + +````{r message=FALSE} + +m1<-readRDS("model_tuning/models/m_r1.rds") + +print(m1) + +print(plot_ffs(m1)) + +```` + +#### Variable importance 1 receiver model + +``` {r messages=FALSE} +plot(varImp(m1)) +``` + +#### 2 Receivers model + +```{r} + +#get data and check class distribution +data_2<-readRDS("model_tuning/data/batsTrain_2_receivers.rds") + +table(data_2$Class) +``` + +```{r eval = FALSE} + +predictors<-names(data_2[, -ncol(data_2)]) + + +cl<-makeCluster(10) + +registerDoParallel(cl) + +ctrl <- trainControl(## 10-fold CV + method = "cv", + number = 10) + +#run ffs model +set.seed(10) + +ffsmodel <- ffs(predictors=data_2[,predictors],response = data_2$Class,method="rf", + metric="Kappa", + tuneLength = 1, + trControl=ctrl, + verbose = TRUE) + +ffsmodel$selectedvars + +saveRDS(ffsmodel, "model_tuning/models/m_r2.rds") + +stopCluster(cl) + +``` + +#### Results feature selection + +Red dots display two-variables combinations, dots with the colors from yellow to pink stand for models to each of which another variable has been added. Dots with a black border mark the optimal variable combination in the respective iteration. + +``` {r messages=FALSE} + +m2<-readRDS("model_tuning/models/m_r2.rds") +print(m2) +print(plot_ffs(m2)) + +``` + +#### Variable importance 2 receivers model + +``` {r messages=FALSE} +plot(varImp(m2)) +``` + +### Model tuning + +Random forest is an algorithm which is far less tunable than other algorithms such as support vector machines (@Probst2019-hl) and is known to provide good results in the default settings of existing software packages (Fernández-Delgado et al., +2014). +Even though the performance gain is still low, tuning the parameter mtry provides the biggest average improvement of the AUC (0.006) (Probst et al.2018). Mtry is defined as the number of randomly drawn candidate variables out of which each split is selected when growing a tree. Here we reduce the existing predictor variables to those selected by the forward feature selection and iteratively increase the number of randomly drawn candidate variables from 1 to the total number of selcted variables. Other parameters, such as the number of trees are held constant according to default settings in the packages used. 
+ +#### Tuning mtry on the 1 receiver model + +```{r eval = FALSE} + +#reduce to ffs variables +predictors<-names(data_1[, c(m1$selectedvars, "Class")]) +batsTune<-data_1[, predictors] + +#tune number of variable evaluated per tree- number of trees is 500 +ctrl <- trainControl(## 10-fold CV + method = "cv", + number = 10, + verboseIter = TRUE) + ) + +tunegrid <- expand.grid( + mtry = 1:(length(predictors)-1), # mtry specified here + splitrule = "gini" + ,min.node.size = 10 +) + +tuned_model <- train(Class~., + data=batsTune, + method='rf', + metric='Kappa', + tuneGrid=tunegrid, + ntree=1000, + trControl=ctrl) + +saveRDS(tuned_model,"model_tunig/models/m_r1_tuned.rds") +```` + + + +#### Results model tuning 1 receiver + + +````{r messages=FALSE} + +m1_tuned<-readRDS("model_tuning/models/m_r1_tuned.rds") + +print(m1_tuned) + +```` + + + +#### Tuning mtry on the 2 receivers model + + +```{r eval = FALSE} + + +#reduce to ffs variables +predictors<-names(data_2[, c(m2$selectedvars, "Class")]) +batsTune<-data_2[, predictors] + +#tune number of variable evaluated per tree- number of trees is 1000 +ctrl <- trainControl(## 10-fold CV + method = "cv", + number = 10, + verboseIter = TRUE +) + + +tunegrid <- expand.grid( + mtry = 1:(length(predictors)-1), # mtry specified here + splitrule = "gini" + ,min.node.size = 10 +) +tuned_model_2 <- train(Class~., + data=batsTune, + method='rf', + metric='Kappa', + tuneGrid=tunegrid, + ntree=1000, + trControl=ctrl) +print(tuned_model_2) + + +saveRDS(tuned_model_2,"model_tuning/models/m_r2_tuned.rds") + + +``` + +#### Results model tuning 2 receivers + +``` {r messages=FALSE} +m2_tuned<-readRDS("model_tuning/models/m_r2_tuned.rds") +print(m2_tuned) +``` + + +#### Results and discussion model training and tuning + +Both models ( based on data from 1 receiver and 2 receivers) had very high performance metrics (Kappa, Accuracy) with slightly better results for the 2 receivers model.Tuning the mtry parameter did not increase the performance which indicates that for our use case default settings are a good choice. + +## Model evaluation + +For the Validation of the model performance and applicability to species with different movement behaviour (speed etc. than bats) we generated three different data sets. ++ 1. We put 50% of our bat data aside ++ 2. We collected ground truth data of a tagged medium spotted woodpecker ++ 3. We simulated different movement intensities by humans carrying transmitters through the forest + +In this section we will test how well the models perform in terms of different performance metrics such as F-Score, Accuracy, ROC-AUC + +### Bats + +We first take a look at te 50% test data that has been put aside for evaluation. Here we actually perform the prediction using the two trained models. For the woodpecker and human walk data set we will use already predicted data that has been processed by script `validation_woodpecker` and `validation_human_activity`. 
+ +#### Test data collected by one Receiver + +``` {r} + +# Testdata 1 receiver +Test_1<-readRDS("validation/bats/data/batsTest_1_receiver.rds") +print(table(Test_1$Class)) + +# Default names as expected in Caret +Test_1$obs<-factor(Test_1$Class) + +#get binary prediction +pred1<-predict(m1, Test_1) +Test_1$pred<-factor(pred1) + +#probabilities +prob<-predict(m1, Test_1, type="prob") +Test_1<-cbind(Test_1, prob) + +```` + +````{r eval=FALSE} + +#calculate roc-auc + +roc1 <- MLeval::evalm(data.frame(prob, Test_1$obs)) +saveRDS(roc1, "validation/bats/results/roc_1receiver.rds") + + +```` + + +#### Performance metrics for 1-receiver test data of bats + +````{r} +#create confusion matrix +cm_r1<- confusionMatrix(factor(Test_1$pred), factor(Test_1$Class)) +print(cm_r1) +print(cm_r1$byClass) + +```` + + + +#### ROC-AUC for 1-receiver test data of bats + +````{r} +# + +#twoClassSummary(Test_1, lev = levels(Test_1$obs)) + roc1 <- readRDS("validation/bats/results/roc_1receiver.rds") +print(roc1$roc) +```` + + + +#### Bats: Test data collected by two receivers + + +``` {r messages=FALSE} +#two receivers +Test_2<-readRDS("validation/bats/data/batsTest_2_receivers.rds") + +table(Test_2$Class) + +Test_2$obs<-Test_2$Class +#get binary prediction +pred2<-predict(m2, Test_2) +Test_2$pred<-pred2 +#probabilities +prob2<-predict(m2, Test_2, type="prob") +Test_2<-cbind(Test_2, prob2) + + +```` + +````{r eval=FALSE} +#calculate roc-auc +roc2 <- MLeval::evalm(data.frame(prob2, Test_2$obs)) + +saveRDS(roc2, "validation/bats/results/roc_2receivers.rds") + + +```` + +#### Performance metrics for 2-receiver test data of bats + +````{r} +cm_r2<- confusionMatrix(factor(Test_2$pred), factor(Test_2$obs)) +print(cm_r2) +print(cm_r2$byClass) + +```` + + + +#### ROC-AUC for 2-receiver test data of bats + +```` {r, eval=FALSE} +# +#twoClassSummary(data_2, lev = levels(data_2$obs)) +roc2 <- readRDS("validation/bats/results/roc_2receivers.rds") + +print(roc2$roc) + +```` + +### Woodpecker + +``` {r messages=FALSE} +#two receivers +wp<-readRDS("validation/woodpecker/data/woodpecker_groundtruth.rds") + +wp$obs<-as.factor(wp$observed) +wp$pred<-as.factor(wp$prediction) + +``` + + +#### Performance metrics woodpecker + +```` {r messages=FALSE} +#create confusion matrix +cm_wp<- confusionMatrix(wp$pred, wp$obs) +print(cm_wp) +print(cm_wp$byClass) + + +```` + +#### ROC-AUC Woodpecker + +```` {r messages=FALSE} +print(twoClassSummary(wp, lev = levels(wp$obs))) +roc_wp <- MLeval::evalm(data.frame(wp[, c("active", "passive")], wp$obs), plots=c("r")) + +#print(roc_wp$roc) +```` + + +### Human activity + +``` {r messages=FALSE} +#two receivers +hm<-readRDS("validation/human/data/human_walk_groundtruth.rds) +hm$obs<-factor(hm$observation) +hm$pred<-factor(hm$prediction) + +``` + + +#### Performance metrics human activity + +````{r messages=FALSE} +#create confusion matrix +cm_hm<- confusionMatrix(factor(hm$pred), factor(hm$obs)) +print(cm_hm) +print(cm_hm$byClass) + +```` + +#### ROC-AUC human activity + +````{r messages=FALSE} +twoClassSummary(hm, lev = levels(hm$obs)) +roc_hm <- MLeval::evalm(data.frame(hm[, c("active", "passive")], hm$obs),plots=c("r")) +#print(roc_hm$roc) +```` + +### Results random-forest model validation + +Regardless of whether the models were tested on independent test data from bats or on data from other species (human, woodpecker), the performance metrics were always close to their maxima. 
+ + +## Comparison to a threshold based approach + +The results of the ML-based approach were compared with those of a threshold-based approach (Kays et al. 2011)by calculating the difference in the signal strength between successive signals for all three test datasets (bats, bird, humans). We applied a threshold of 2.5 dB which was deemed appropriate to optimally separate active and passive behaviours in previous [studies](https://doi.org/10.1642/AUK-17-229.1) . In addition, the optimize-function of the R-package stats (R Core Team, 2021) was used to identify the value of the signal strength difference that separated the training dataset into active and passive with the highest accuracy. This value was also applied to all three test datasets. + +### Bats + +#### Threshold optimisation + +To find the threshold value that optimizes the accuracy (data is balanced) when separating the data into active and passive, we first calculated the signal strength difference of consecutive signals in the complete bats data set, than separated 50 % balanced test and train data and finally used the optimize function from the R base package to determine the best threshold. + +````{r messages=FALSE} +#get all bat data +trn<-fread("validation/bats/data/train_2020_2021.csv") + +#calculate signal strength difference per station +dtrn<-plyr::ldply(unique(trn$station), function(x){ + + tmp<-trn[trn$station==x,] + tmp<-tmp[order(tmp$timestamp),] + tmp<-tmp%>%group_by(ID)%>% + mutate(Diff = abs(max_signal - lag(max_signal))) + return(tmp) + }) + +##data clean up +dtrn<-dtrn[!is.na(dtrn$Diff),] +dtrn<-dtrn[!(dtrn$behaviour=="active" & dtrn$Diff==0),] + +##factorize +dtrn$behaviour<-as.factor(dtrn$behaviour) +table(dtrn$behaviour) + +#balance data +set.seed(10) + +tdown<-downSample(x = dtrn, + y = dtrn$behaviour) + +#create 50% train and test + +trainIndex <-createDataPartition(tdown$Class, p = .5, + list = FALSE, + times = 1) + +dtrn <- tdown[ trainIndex,] +dtst <- tdown[-trainIndex,] + +#optimize seperation value based on accuracy (remeber data is balanced) + +value<-dtrn$Diff +group<-dtrn$behaviour + +accuracy = Vectorize(function(th) mean(c("passive", "active")[(value > th) + 1] == group)) +ac<-optimize(accuracy, c(min(value, na.rm=TRUE), max(value, na.rm=TRUE)), maximum=TRUE) + +ac$maximum +```` + + +#### Performance metrics based on optimized threshold + +````{r messages=FALSE} + +#classify data by optimized value +dtst$pred<-NA +dtst$pred[dtst$Diff>ac$maximum]<-"active" +dtst$pred[dtst$Diff<=ac$maximum]<-"passive" + +#calc confusion matrix +dtst$pred<-factor(dtst$pred) +cm<-confusionMatrix(factor(dtst$Class), factor(dtst$pred)) + +print(cm) +print(cm$byClass) +```` + + +#### Performance metrics based on 2.5 dB threshold from the literature + +````{r messages=FALSE} +#2.5 dB value from the literature +dtst$pred<-NA +dtst$pred[dtst$Diff>2.5]<-"active" +dtst$pred[dtst$Diff<=2.5]<-"passive" + +dtst$pred<-factor(dtst$pred) +cm<-confusionMatrix(dtst$Class, dtst$pred) +print(cm) +print(cm$byClass) +```` + +### Woodpecker + +Since activity observations are not continuous but signal recording on the tRackIT-Stations is, we first have to calculate the signal strength difference on the raw data and than match it to the ground truth observations + +````{r messages=FALSE} + +#list raw signals +wp<-list.files("validation/woodpecker/data/raw/", full.names = TRUE) + + +#calculate signal strength difference +wp_tst<-plyr::ldply(wp, function(x){ + + tmp<-fread(x) + tmp<-tmp[order(tmp$timestamp),] + tmp<-tmp%>%mutate(Diff = 
abs(max_signal - lag(max_signal))) + return(tmp) +}) + +wp_tst$timestamp<-lubridate::with_tz(wp_tst$timestamp, "CET") + +#get observations and merge by timestamp + +wp_gtruth<-readRDS("validation/woodpecker/data/woodpecker_groundtruth.rds") + +wp_tst<-merge(wp_gtruth, wp_tst, all.x = TRUE) + +```` + + +#### Performance metrics based on optimized threshold + + +````{r messages=FALSE} + +wp_tst$pred<-NA +wp_tst$pred[wp_tst$Diff>ac$maximum]<-"active" +wp_tst$pred[wp_tst$Diff<=ac$maximum]<-"passive" + +wp_tst$pred<-factor(wp_tst$pred) +wp_tst$observed<-factor(wp_tst$observed) + +cm<-confusionMatrix(factor(wp_tst$observed), factor(wp_tst$pred)) + +print(cm) +print(cm$byClass) + +```` + + + +#### Performance metrics based on 2.5 dB threshold from the literature + +````{r messages=FALSE} + +#evaluate with 2.5 dB value from the literature +wp_tst$pred<-NA +wp_tst$pred[wp_tst$Diff>2.5]<-"active" +wp_tst$pred[wp_tst$Diff<=2.5]<-"passive" + +wp_tst$pred<-factor(wp_tst$pred) +cm<-confusionMatrix(wp_tst$observed, wp_tst$pred) +print(cm) +print(cm$byClass) + +```` + +### Humans + +Human activity observations are also not continuous so we have to calc signal strength diff for each individual on the raw data + +````{r messages=FALSE} + +hm_dirs<-list.dirs("validation/human/data/", full.names = TRUE) +hm_dirs<-hm_dirs[grep("raw", hm_dirs)] +hm_tst<-plyr::ldply(hm_dirs, function(d){ + + fls<-list.files(d, full.names = TRUE) + + tmp_dat<-plyr::ldply(fls, function(x){ + + tmp<-fread(x) + tmp<-tmp[order(tmp$timestamp),] + tmp<-tmp%>%mutate(Diff = abs(max_signal - lag(max_signal))) + return(tmp) +}) + + return(tmp_dat)}) + +#get obesrvations and merge +hm_gtruth<-readRDS("validation/human/data/human_walk_groundtruth.rds") +hm_tst<-merge(hm_gtruth, hm_tst, all.x = TRUE) +hm_tst<-hm_tst[!duplicated(hm_tst$timestamp),] + +```` + + + +#### Performance metrics based on optimized threshold + +````{r messages=FALSE} + +#evaluate based on optimized threshold +hm_tst$pred<-NA +hm_tst$pred[hm_tst$Diff>ac$maximum]<-"active" +hm_tst$pred[hm_tst$Diff<=ac$maximum]<-"passive" + +hm_tst$pred<-factor(hm_tst$pred) +hm_tst$observed<-factor(hm_tst$observation) + +cm<-confusionMatrix(hm_tst$observed, hm_tst$pred) + +print(cm) +print(cm$byClass) +#print(cm$table) + + +```` + + + +#### Performance metrics based on 4 dB threshold from the literature + + +````{r messages =FALSE} +#evaluate based on 2.5 dB value from the literature + +hm_tst$pred<-NA +hm_tst$pred[hm_tst$Diff>2.5]<-"active" +hm_tst$pred[hm_tst$Diff<=2.5]<-"passive" + +hm_tst$pred<-factor(hm_tst$pred) +cm<-confusionMatrix(hm_tst$observed, hm_tst$pred) + +print(cm) +print(cm$byClass) +print(cm$table) + +```` +## Comparison of the threshold based approach and the random-forest Model + +When calibrating the threshold based approach on an adequate train data set,ii is generally able to separate active and passive behavior but performance metrics (F1=0.79, 0.78, 0.88; bats, woodpecker, human) are between 10 and 20 points worth and more variable than our random forest model (F1= 0.97, 0.97, 0.98; bats,woodpecker,human). With F-scores between 0.46 and 0.58 the threshold value proposed in the literature performed significantly worth. 
+ +Since only the test data set of the bats is balanced but the woodpecker data is slightly imbalanced and the human activity data set is highly imbalanced lets also take a look at a metric that takes the data distribution into account: + +Cohen’s kappa is defined as: + +`K=(p_0-p_e)/(1-p_e)` + +where p_0 is the overall accuracy of the model and p_e is the measure of the agreement between the model predictions and the actual class values as if happening by chance. + +Cohen’s kappa is always less than or equal to 1. Values of 0 or less, indicate that the classifier is not better than chance. @Landis1977-re provide a way to characterize values. According to their scheme a value < 0 is indicating no agreement , 0–0.20 slight agrement, 0.21–0.40 fair agreement, 0.41–0.60 moderate agreement, 0.61–0.80 substantial agreement , and 0.81–1 as almost perfect agreement. + +Kappa values based on the 2.5 dB separation value from the literature ranged between 0.3 (humans) and 0.38 (woodpecker), i.e. a fair agreement. For the optimized threshold Kappa values were significantly better in all cases (0.46, 0.58, 0.53; bats, woodpecker, humans); i.e. moderate agreement. However, even the best Kappa value for the threshold based approach only showed a moderate agreement while all Kappa values based on the random-forest model showed an almost perfect agreement ( 0.94, 0.94, 0.90 ; bats, woodpecker, humans ). + + +# REFERENCES + + diff --git a/rmd/Vignette_tRackIT.Rmd b/rmd/Vignette_tRackIT.Rmd new file mode 100644 index 0000000..e2a578f --- /dev/null +++ b/rmd/Vignette_tRackIT.Rmd @@ -0,0 +1,349 @@ +--- +title: "Processing radio-tracking data with the tRackIT R-Package" +author: "Jannis Gottwald" +date: "`r Sys.Date()`" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Vignette Title} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r setup, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + +The tRackIT R-Package provides functionalities for the processing of data recorded in local automatic radio-tracking studies. It is specifically tailored to data recorded with one of the sensors from the [tRackIT ecosystem](https://dx.doi.org/10.18420/informatik2021-035) (tRackIT-Stations, BatRack), but can also be used for other systems. +The functionalities provided in the package cover project and individual management, raw signal data processing and the generation of high-level information such as the [calculation of locations]( https://doi.org/10.1111/2041-210X.13294) and the [classification of behavioral states](https://doi.org/10.1101/2022.03.22.485147) based on pattern in the recorded vhf-signals. It provides a default data structure to guarantee easy exchangeability of data and analysis scripts between scientists. + +Note that filtering of raw data relies on tags, that can be distinguished by frequency and not by codes. If you want to work with coded tags, the data must be filtered beforehand. This may change in future versions of the package. + +## Getting startet + +The package uses functionalities from the [telemetr](https://github.com/barryrowlingson/telemetr) R-Package developed by Barry Rowlingson. It provides all methods for the localization of a transmitter described in [this article](https://doi.org/10.2307/1268030) using fortran in the background. 
+
+
+# REFERENCES
+
+
diff --git a/rmd/Vignette_tRackIT.Rmd b/rmd/Vignette_tRackIT.Rmd
new file mode 100644
index 0000000..e2a578f
--- /dev/null
+++ b/rmd/Vignette_tRackIT.Rmd
@@ -0,0 +1,349 @@
+---
+title: "Processing radio-tracking data with the tRackIT R-Package"
+author: "Jannis Gottwald"
+date: "`r Sys.Date()`"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Processing radio-tracking data with the tRackIT R-Package}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r setup, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>"
+)
+```
+
+The tRackIT R-Package provides functionalities for the processing of data recorded in local automatic radio-tracking studies. It is specifically tailored to data recorded with one of the sensors from the [tRackIT ecosystem](https://dx.doi.org/10.18420/informatik2021-035) (tRackIT-Stations, BatRack), but can also be used with other systems.
+The functionalities provided in the package cover project and individual management, raw signal data processing and the generation of high-level information such as the [calculation of locations](https://doi.org/10.1111/2041-210X.13294) and the [classification of behavioral states](https://doi.org/10.1101/2022.03.22.485147) based on patterns in the recorded VHF signals. It provides a default data structure to guarantee easy exchangeability of data and analysis scripts between scientists.
+
+Note that the filtering of raw data relies on tags that can be distinguished by frequency, not by codes. If you want to work with coded tags, the data must be filtered beforehand. This may change in future versions of the package.
+
+## Getting started
+
+The package uses functionalities from the [telemetr](https://github.com/barryrowlingson/telemetr) R-Package developed by Barry Rowlingson, which provides all methods for the localization of a transmitter described in [this article](https://doi.org/10.2307/1268030), using Fortran in the background. To make use of these dependencies, however, some adjustments to the package had to be made, which is why the version used in the tRackIT R-package is hosted under the Nature40 GitHub account. Before the tRackIT package can be installed, please install the telemetr package as follows:
+
+```{r eval=FALSE}
+library(remotes)
+
+Sys.setenv("R_REMOTES_NO_ERRORS_FROM_WARNINGS" = "true")
+
+remotes::install_github("Nature40/telemetr")
+```
+
+
+We also make use of very fast C++-based [rolling windows](https://github.com/andrewuhl/RollingWindow), which are not hosted on CRAN yet. Please install the package as follows:
+
+```{r eval=FALSE}
+devtools::install_github("andrewuhl/RollingWindow")
+```
+
+
+Now you can install the tRackIT R-package:
+
+```{r eval=FALSE}
+devtools::install_github("Nature40/tRackIT")
+```
+
+```{r}
+library(tRackIT)
+```
+
+## Test data and models
+
+To check out the functionalities of the package using the package vignette, we recommend downloading the [test data](https://doi.org/10.17192/fdr/104) and [trained models](https://doi.org/10.17192/fdr/79) for activity classification. The models need to be unzipped and stored in the extdata folder of the installed tRackIT-package. Store the test data in your working directory and unzip them.
+
+## Set up a project
+
+First, set your working directory to `your/path/tRackIT_test_data`.
+
+```{r, eval=FALSE}
+setwd("your/path/tRackIT_test_data")
+```
+
+To initialize a project with the tRackIT R-Package, some meta information has to be provided, such as the spatial positions of your tracking stations, the pointing directions of your antennas and information about the tagged individuals. You can read the provided meta data for the tracking stations from `data/reference_data/`.
+
+```{r}
+#get data
+stations<-data.table::fread("data/reference_data/stations_mof_2021.csv")
+
+#check data structure
+head(stations)
+```
+
+
+Since we are only processing one individual, we create the meta data manually.
+
+```{r}
+#create data.frame
+tag_df<-data.frame(id="woodpecker", freq=150050, start="2021-06-10", end="2021-06-14")
+```
+
+Now let's initialize a tRackIT-project using the `tRackIT::initProject()` function. For the individuals (tag_df) we have to specify the column names for the ID (id_col), the start of the tagging period (start_col), the end of the tagging period (end_col) and the frequency of the transmitter in kHz (freq_col). For the stations metadata we have to specify the columns containing the name of the station (s_col), the x coordinates (x_col), the y coordinates (y_col), the identifier of the receiving device (r_col) and the orientation of the antennas (o_col). Additionally, the EPSG code of the station coordinates and the time zone of the study are mandatory.
+
+```{r}
+projroot<-paste0(getwd(), "/")
+
+#init Project
+test_project<-initProject(projroot = projroot,
+                          logger_data_raw=".",
+                          tags=tag_df,
+                          id_col="id",
+                          start_col="start",
+                          end_col="end",
+                          freq_col="freq",
+                          stations=stations,
+                          s_col="station",
+                          x_col="X",
+                          y_col="Y",
+                          r_col="receiver",
+                          o_col="orientation",
+                          epsg=4326,
+                          tz="CET")
+```
+
+Now we have created a tRackIT-project. Check out the root folder of the project to get familiar with the data structure. There is also a `_projectFile.rds` that stores all the information of the project. You can access this file using the `getProject()` function, which makes the exchange of projects much easier.
+
+
+```{r eval=FALSE}
+#get the project
+test_project<-getProject(projroot = projroot)
+
+#filepath list
+test_project$path
+
+#stations
+test_project$stations
+
+#tags
+test_project$tags
+
+#epsg and tz
+test_project$epsg
+test_project$tz
+```
+
+
+Now we initialize an individual. You can provide further information such as sex, reproductive state, age and weight in the meta data of the individual. The duration arguments (in seconds) specify the expected length of the signals and are necessary to separate signals from noise.
+
+
+```{r}
+woodpecker<-initAnimal(projList = test_project,
+                       projroot = projroot,
+                       saveAnml = TRUE,
+                       animalID = "woodpecker",
+                       species = "Dendrocoptes medius",
+                       sex = "m",
+                       age = "adult",
+                       weight = 36,
+                       rep.state = "breeding",
+                       freq = test_project$tags$frequency[1],
+                       start = test_project$tags$start[1],
+                       end = test_project$tags$end[1],
+                       duration_min = 0.012,
+                       duration_max = 0.4)
+```
+
+
+Again, a .rds file with all the information is stored and can be accessed via the `getAnimal()` function.
+
+
+```{r eval=FALSE}
+#get id
+woodpecker<-getAnimal(projroot = projroot, animalID = "woodpecker")
+
+#meta
+woodpecker$meta
+
+#filepath list
+woodpecker$path
+```
+
+## Process data
+
+As a starting point, the tRackIT-package expects one .csv file per tracking station with raw and unfiltered recordings. The files you want to process should all be in one folder. The natural folder for that would be:
+
+```{r}
+test_project$path$csv
+```
+
+However, you can specify any other folder. In a first step, we want to filter these files by frequency, signal duration and runtime of the transmitter. Make sure that there is one file per station and that the files have the following columns:
+
+* `timestamp`: time of the received signal; expected format "%Y-%m-%d %H:%M:%S" (e.g. 1970-01-29 15:30:22)
+* `duration`: signal length of the recorded signal
+* `frequency`: frequency in kHz (e.g. 150150.2)
+* `max`: the maximum signal strength in dBW of the recorded signal (peak)
+* `receiver`: name of the receiving device (usually 0, 1, 2, 3 for a 4-antenna station)
+* `station`: name of the station as a character string
+
+```{r eval=FALSE}
+filter_tRackIT(animal = woodpecker, freq_er = 2, path_to_data = test_project$path$csv,
+               d_min = woodpecker$meta$duration_min, d_max = woodpecker$meta$duration_max,
+               start = woodpecker$meta$start, end = woodpecker$meta$end,
+               freq = woodpecker$meta$freq)
+```
+
+Filtered files are stored here:
+
+```{r}
+woodpecker$path$filtered
+```
+
+The filtered data can now be used for activity classification and position tracking.
+
+### Activity classification
+
+Here, we present the processing steps for the classification of VHF signals into fundamental behaviours with the help of specially trained machine-learning models (Gottwald et al. 2022). The dataset used in this tutorial was created to test the transferability of our model, trained on bat movements, to the observed movements of a middle spotted woodpecker. See the validation chapter and the corresponding publication for details.
+
+#### variable calculation
+
+The trained models expect certain variables to be provided. We can calculate them with the `activity_vars_tRackIT()` function. To facilitate the use of these functionalities with other systems, the names of the columns containing the timestamp (t_col), the signal strength (s_col) and the receiving device (r_col) can be provided.
+
+
+```{r eval=FALSE}
+#calculate variables for activity classification
+activity_vars_tRackIT(animal = woodpecker, t_col = "timestamp", s_col = "max", r_col = "receiver", tz = "UTC")
+```
+
+
+#### activity classification
+
+Once the variables are calculated, you can classify them using our trained models. Make sure the [models](https://doi.org/10.17192/fdr/79) are stored in the extdata folder of the installed tRackIT-package.
+
+```{r}
+#use trained models for activity classification
+data<-activity_predict_tRackIT(animal = woodpecker, get_data = TRUE)
+```
+
+Let's have a look:
+
+```{r out.width="100%"}
+library(ggplot2)
+library(lubridate)
+#cut data to BirdRack observation period
+data<-data[data$timestamp>="2021-06-10 06:00:00" &
+             data$timestamp<="2021-06-14 04:00:00",]
+#convert to local timezone
+data$timestamp<-lubridate::with_tz(data$timestamp, tzone=test_project$tz)
+
+#prepare night and day shading in plot
+data$date<-as.Date(data$timestamp)
+data$hour<-hour(data$timestamp)
+data$start<-as.POSIXct(paste0(data$date, " 05:15:00"), tz="CET")
+data$end<-as.POSIXct(paste0(data$date, " 21:30:00"), tz="CET")
+data$hour_shade<-ifelse(data$hour >= 21 | data$hour <= 5, "night", "day")
+data$hour_shade[data$date=="2021-06-14"]<-"day" #observation ends at 04:00 on the last day; leave it unshaded
+
+##plot
+ggplot() +
+  geom_rect(data = data, aes(xmin = start, xmax = end, ymin = min(max), ymax = max(max),
+                             fill = factor(hour_shade))) +
+  scale_fill_manual(values = c("white", "grey90"), guide = "none") +
+  geom_point(data, mapping = aes(x = timestamp, y = max, color = prediction)) +
+  ggthemes::scale_color_wsj() +
+  ggtitle("Woodpecker activity") +
+  ylab("signal strength in dBW")
+```
+
+We are looking at 648777 data points.
+You can aggregate the data to a coarser temporal resolution using the `activity_aggregate_tRackIT()` function, which selects the most frequent value per time slot. Here we aggregate the data to 1-minute bins.
+
+```{r out.width="100%"}
+data<-activity_aggregate_tRackIT(animal = woodpecker, avalue = 1, tzone = "UTC", get_data = TRUE)
+
+data<-data[data$timestamp>="2021-06-10 06:00:00" &
+             data$timestamp<="2021-06-14 04:00:00",]
+#convert to local timezone
+data$timestamp<-lubridate::with_tz(data$timestamp, tzone=test_project$tz)
+
+#prepare night and day shading in plot
+data$date<-as.Date(data$timestamp)
+data$hour<-hour(data$timestamp)
+data$start<-as.POSIXct(paste0(data$date, " 05:15:00"), tz="CET")
+data$end<-as.POSIXct(paste0(data$date, " 21:30:00"), tz="CET")
+data$hour_shade<-ifelse(data$hour >= 21 | data$hour <= 5, "night", "day")
+data$hour_shade[data$date=="2021-06-14"]<-"day" #observation ends at 04:00 on the last day; leave it unshaded
+
+##plot
+ggplot() +
+  geom_rect(data = data, aes(xmin = start, xmax = end, ymin = min(max), ymax = max(max),
+                             fill = factor(hour_shade))) +
+  scale_fill_manual(values = c("white", "grey90"), guide = "none") +
+  geom_point(data, mapping = aes(x = timestamp, y = max, color = prediction)) +
+  ggthemes::scale_color_wsj() +
+  ggtitle("Woodpecker activity") +
+  ylab("signal strength in dBW")
+```
+
+
+### Position tracking
+
+The principle of position calculation with the tRackIT-system is similar to its manual counterpart. First, we calculate bearings per station using the algorithm provided in Gottwald et al. 2019, and then make use of the different position calculation algorithms described [here](https://doi.org/10.2307/1268030).
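+
+To build some intuition for what these estimators do, here is a toy sketch (made-up coordinates and bearings in projected units, not part of the test data) of the simplest case: intersecting two bearings from two stations. The package itself relies on the telemetr implementations of the Lenth (1981) estimators rather than this naive intersection.
+
+```{r eval=FALSE}
+#two hypothetical stations and one bearing each (degrees clockwise from north)
+st <- data.frame(x = c(0, 1000), y = c(0, 0), bearing = c(45, 315))
+theta <- st$bearing * pi / 180
+d <- cbind(sin(theta), cos(theta))  #unit direction vectors
+#solve st1 + t1 * d1 == st2 + t2 * d2 for the intersection point
+A <- cbind(d[1, ], -d[2, ])
+t <- solve(A, c(st$x[2] - st$x[1], st$y[2] - st$y[1]))
+c(st$x[1], st$y[1]) + t[1] * d[1, ]  #estimated position: (500, 500)
+```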
+
+Before we start, let's reduce the data to a shorter time period; otherwise these processing steps would take too long.
+
+```{r eval=FALSE}
+#create a directory to store time-filtered data
+dir.create("data/individuals/woodpecker/filtered_one_day/")
+
+filtered<-list.files(woodpecker$path$filtered, full.names = TRUE)
+
+lapply(filtered, function(x){
+  tmp<-data.table::fread(x)
+  tmp<-tmp[tmp$timestamp>="2021-06-10 08:00:00" & tmp$timestamp<="2021-06-10 18:00:00",]
+
+  if(nrow(tmp)>0){
+    data.table::fwrite(tmp, paste0("data/individuals/woodpecker/filtered_one_day/", basename(x)))
+  }})
+```
+
+Let's start with the bearing calculation. We first list all the filtered files and conduct time matching between the different receivers per station.
+
+```{r eval=FALSE}
+filtered<-list.files("data/individuals/woodpecker/filtered_one_day/", full.names = TRUE)
+
+lapply(filtered, function(x) time_match_logger_tRackIT(animal=woodpecker, path_to_data = x))
+```
+
+The time-matched data is now used to calculate the bearings according to the algorithm presented in Gottwald et al. 2019.
+
+```{r eval=FALSE}
+matched<-list.files(woodpecker$path$logger_timematch, full.names = TRUE)
+
+lapply(matched, function(x) calc_bearings_tRackIT(path_to_data = x, dbLoss = 21, animal = woodpecker, projList = test_project))
+```
+
+To detect outliers in the bearings caused by missing signals or noise, we use a Hampel filter that replaces outliers with the median within a sliding window of 20 seconds.
+
+```{r eval=FALSE}
+bearings<-list.files(woodpecker$path$bearings, full.names = TRUE)
+
+lapply(bearings, function(x) hampel_time(path_to_data = x, col = "linear", k = 20, animal = woodpecker))
+```
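+
+The exact implementation lives in `hampel_time()`; purely as an illustration of the underlying idea (a sketch with assumed numeric input, not the package code), a Hampel-type filter can be written as:
+
+```{r eval=FALSE}
+hampel_sketch <- function(x, k = 10, n_sig = 3) {
+  #rolling median and rolling MAD over a window of 2k + 1 samples
+  med <- zoo::rollapply(x, width = 2 * k + 1, FUN = median, fill = NA)
+  dev <- zoo::rollapply(x, width = 2 * k + 1,
+                        FUN = function(w) 1.4826 * median(abs(w - median(w))),
+                        fill = NA)
+  out <- !is.na(med) & abs(x - med) > n_sig * dev  #flag outliers
+  x[out] <- med[out]                               #replace them with the window median
+  x
+}
+```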
+
+Now the positions can be calculated using the different methods described in Lenth 1981. Here we accept all bearings within a time window (`tw`) of 5 seconds for the calculation of positions.
+
+```{r eval=FALSE}
+tRackIT(animal = woodpecker, projList = test_project, tw = 5, be_col = "bearings_filtered", path_to_directory = woodpecker$path$bearings_filtered)
+```
+
+Since there is always noise in the data, and sharp or flat intersections between bearings may cause positions with high spatial errors, we apply some filtering to the data.
+
+```{r eval=FALSE}
+filter.triangs.tRackIT(animal = woodpecker, projList = test_project, rerange = 400, nbi = 0, speed = 50, tz = "UTC", filter_speed = TRUE, filter_distance = TRUE, filter_biang = TRUE, tw = 5)
+```
+
+Let's have a look:
+
+```{r message=F, warning=FALSE, fig.width=4}
+triangs<-data.table::fread(paste0(woodpecker$path$triangulations, "/triangulations_time_window_5_filtered.csv"))
+
+sp::coordinates(triangs) <- c("x", "y")
+sp::proj4string(triangs) <- sp::CRS(paste0("+init=epsg:", test_project$epsg)) # WGS 84
+
+mapview::mapview(triangs)@map
+```
+
+The woodpecker's nesting tree (red dot in the map) is located close to the narrow part of the location distribution.
+
+![Map of the Marburg Open Forest](H:/projects/repositories/MS_activity_classification/tRackIT_test_data/Map.jpg){width=30%}
+
diff --git a/rmd/bat_data_HGAM_tutorial.Rmd b/rmd/bat_data_HGAM_tutorial.Rmd
new file mode 100644
index 0000000..29db7da
--- /dev/null
+++ b/rmd/bat_data_HGAM_tutorial.Rmd
@@ -0,0 +1,513 @@
+---
+title: "Timing of activity probability in Bechstein's and Leisler's bats: Data wrangling and checks procedure (Updated: )"
+author:
+  name: Raphaël Royauté, Jannis Gottwald
+  corresponding author email:
+  affiliation: Senckenberg Biodiversity and Climate Research Centre
+date: "`r format(Sys.time(), '%d %m, %y')`"
+output:
+  html_document:
+    toc: true # table of contents true
+    toc_depth: 4 # up to 4 depths of headings (specified by #, ##, ### and ####)
+    toc_float: true # floating TOC
+    #number_sections: true ## if you want numbered sections at each table header
+    theme: united # themes
+    highlight: tango # specifies the syntax highlighting style
+editor_options:
+  markdown:
+    wrap: 72
+---
+
+```{r setup, include = F}
+knitr::opts_chunk$set(echo = T, message = F, warning = F, tidy = T,
+                      fig.width=12, fig.height=12)
+```
+
+
+## Introduction
+
+This tutorial shows how probability-of-activity curves can be extracted from the VHF signal data after the classification of 1-min intervals into active or passive states. I'm using Hierarchical Generalized Additive Models (HGAMs) following the [Pedersen et al. 2019](https://peerj.com/articles/6876/) article. The data needed for the execution of this tutorial can be downloaded [here](https://doi.org/10.17192/fdr/108).
+
+### Packages required
+
+Before you start, make sure that you have the following packages installed: `tidyverse`, `data.table`, `lubridate`, `hms`, `mgcv`, `gratia`, `itsadug`, `mgcViz`, `ggthemes`, `viridis`, `patchwork`, `bayestestR`
+
+Additionally, we will use functionalities from the `tRackIT` R package, which is not hosted on CRAN yet. Please check the README in the [tRackIT](https://github.com/Nature40/tRackIT) GitHub repository for installation instructions.
+
+Load the packages
+
+```{r}
+library(tidyverse); library(data.table)
+library(lubridate); library(hms)
+library(mgcv); library(gratia); #library(itsadug)
+library(mgcViz); library(ggthemes); library(viridis)
+library(patchwork) # used below to combine ggplots (Fig5a / Fig5b, Fig6a + Fig6b)
+library(tRackIT)
+```
+
+## 1. Data importation and wrangling
+
+The tRackIT-package [GitHub page](https://nature40.github.io/tRackIT/) shows the complete workflow to get from raw signals to activity classifications at a 1-minute resolution. We will skip that part here and start with the classified data of bats that have been tracked from 2018 to 2021 (`all_bats_aggregated`). There is one file per individual, containing the timestamp, the activity class per timestamp and the ID. We will need some more information, such as the timing in relation to sunset and sunrise as well as species, sex etc. To add it, we will use two functionalities of the tRackIT-package: the **time_of_day()** function calculates time differences to sunset, sunrise etc., and the **add.ID.Info()** function adds the meta information of the individual to the data. This section of the tutorial can only be executed if the [ecological_case_study_bat_activity](https://data.uni-marburg.de/handle/dataumr/152) directory has been downloaded. It is >270 GB and can't be downloaded directly via the user interface; check the README of the repository for instructions.
+However, you can skip that part and continue with the next chunk of code using the already processed [data](https://doi.org/10.17192/fdr/108).
+
+```{r, eval=FALSE}
+#set coordinates for sunset/sunrise calculation
+Lat<-50.844868
+Lon<-8.663649
+
+#loop through years
+for(y in c(2018, 2019, 2020, 2021)){
+  #get projectfile per year
+  pr<-getProject(projroot=paste0("bats_for_activity/", y, "/"))
+
+  #remove ids with no data at all
+  pr$tags<-pr$tags[pr$tags$ID!="Mbec180518_2",]
+  pr$tags<-pr$tags[pr$tags$ID!="mbec150155_m",]
+
+  #loop through individuals
+  for(id in pr$tags$ID){
+
+    anml<-getAnimal(projList = pr, animalID = id)
+    #get activity classification aggregated to 1 minute
+    activity_files<-list.files(anml$path$classification, pattern="agg", full.names = TRUE)
+
+    if(length(activity_files)>0){
+
+      activity_1Min<-data.table::fread(activity_files[1])
+
+      #calculate daytime infos
+      activity_1Min<-time_of_day(data = activity_1Min, Lat=Lat, Lon = Lon, tcol = "timestamp", tz = "CET", activity_period = "nocturnal")
+
+      #add id info
+      activity_1Min<-add.ID.Info(data=activity_1Min, animal=anml)
+
+      #number of tracking days
+      activity_1Min$n_days<-as.numeric(abs(difftime(as.Date(activity_1Min$start_datetime), as.Date(activity_1Min$stop_datetime))))+1
+
+      #binary activity classification
+      activity_1Min$activity<-ifelse(activity_1Min$prediction=="active", 1, 0)
+
+      #correction of typo
+      activity_1Min$species[activity_1Min$species=="mbec"]<-"Mbec"
+
+      data.table::fwrite(activity_1Min, paste0("all_bats_aggregated/", anml$meta$animalID, "_", as.character(y), "_aggregated_1_Minute.csv"))
+    }
+  }
+}
+
+#merge all data
+all_aggregated_files<-list.files("all_bats_aggregated/", full.names = TRUE)
+
+all_bats<-plyr::ldply(all_aggregated_files, function(x){data.table::fread(x)})
+
+all_bats<-all_bats[all_bats$species=="Mbec" | all_bats$species=="Nlei",]
+
+#get info on which ids should be excluded
+check_ids<-data.table::fread("bats_inspect_id.csv")
+excld<-check_ids[check_ids$exclude_individual=="Y",]
+
+#exclude ids from data
+all_bats<-all_bats[!(all_bats$ID %in% excld$ID),]
+
+#account for rep state transitions
+df_rep<-read.csv("df_rep_state_transition.csv")
+for (id in unique(df_rep$ID)){
+
+  if (df_rep$rep2[df_rep$ID==id]=="pregnant"){
+    all_bats$rep.state[all_bats$ID==id & all_bats$year==df_rep$year[df_rep$ID==id] & all_bats$timestamp>=df_rep$start_rep1[df_rep$ID==id]]<-"lactating"}
+
+  if (df_rep$rep2[df_rep$ID==id]=="post-lactating"){
+    all_bats$rep.state[all_bats$ID==id & all_bats$year==df_rep$year[df_rep$ID==id] & all_bats$timestamp>=df_rep$start_rep2[df_rep$ID==id]]<-"post-lactating"}
+}
+
+#save data
+data.table::fwrite(all_bats, "all_bats_aggregated.csv")
+```
+
+### 1.1 Data importation
+
+```{r}
+df_1min <- fread("all_bats_aggregated.csv", stringsAsFactors = T) # import data
+
+glimpse(df_1min)
+```
+
+### 1.2 Data wrangling
+
+```{r}
+df_1min$species<-as.character(df_1min$species)
+df_1min$species[df_1min$species=="Mbec"]<-"M.bechsteinii"
+df_1min$species[df_1min$species=="Nlei"]<-"N.leisleri"
+df_1min$species<-as.factor(df_1min$species)
+
+df_1min$month_f <- factor(month(df_1min$date))
+df_1min$year_f <- factor(df_1min$year)
+df_1min$date <- date(df_1min$date)
+df_1min$hour <- hour(df_1min$timestamp)
+df_1min$week <- week(df_1min$timestamp)
+
+min_set <- min(df_1min$time_to_set)
+max_set <- max(df_1min$time_to_set)
+df_1min$start.event <- ifelse(df_1min$time_to_set==min_set, T, F) # identify start of the time series
+df_1min$ydate_f <- as.factor(df_1min$ydate)
+df_1min$date_f <- as.factor(df_1min$date)
+
+K.time_of_day <- length(unique(df_1min$time_to_rise))
+
+df_1min <- df_1min %>% data.frame()
+```
+
+Exclude retagged individuals and extract sample sizes:
+
+```{r}
+df_1min<- df_1min[!(df_1min$ID =="Nlei20211" & df_1min$ydate >= 200),]
+df_1min<- df_1min[!(df_1min$ID =="h146487" & df_1min$ydate >= 150),]
+
+# Sample sizes
+df_1min %>%
+  group_by(species) %>%
+  summarise(nID = n_distinct(ID),
+            nObs = n(),
+            meanDays = mean(n_days))
+df_1min %>%
+  summarise(nID = n_distinct(ID),
+            nObs = n(),
+            meanDays = mean(n_days))
+```
+
+*NOTE* Some individuals were tagged twice within the same year. We want to avoid these situations and reduce the sampling period to the first tagging event. The full details of this analysis can be found under `filename`, section 0.2.
+
+### 1.3 Visual inspection
+We can visually inspect the data by plotting the probability of activity over the time of day. Calculating this probability in 15-minute intervals makes the visualization easier.
+
+```{r}
+df_15min <- df_1min %>%
+  #filter(species == "M.bechsteinii" | species == "N.leisleri") %>%
+  mutate(interval = as_hms(floor_date(timestamp, unit = "15minutes"))) %>%
+  group_by(ID, species, year, ydate, hour, interval) %>%
+  summarise(n_intervals = length(activity),
+            n_active = length(activity[activity == 1]),
+            n_passive = length(activity[activity == 0]),
+            time_to_set = mean(time_to_set)) # calculate average time to sunset for that 15-minute interval
+
+df_15min %>%
+  ggplot(aes(x = time_to_set, y = n_active/n_intervals, color = species)) +
+  geom_point(alpha = 0.1) +
+  geom_smooth() + scale_color_wsj() + theme_bw(14) +
+  facet_wrap(~species) + geom_hline(yintercept = 0.5, linetype = "dashed") +
+  ylim(0, 1) + ylab("Activity probability (15 min interval)") +
+  xlab("Time to sunset (h)")
+```
+
+We can also inspect each tagged individual in the same way:
+```{r}
+df_15min %>%
+  filter(species == "N.leisleri") %>%
+  ggplot(aes(x = time_to_set, y = n_active/n_intervals)) +
+  geom_point(alpha = 0.2) +
+  geom_smooth() + scale_color_wsj() + theme_bw(14) +
+  facet_wrap(~ID) + geom_hline(yintercept = 0.5, linetype = "dashed") +
+  ylim(0, 1) + ylab("Activity probability (15 min interval)") +
+  xlab("Time to sunset (h)")
+
+df_15min %>%
+  filter(species == "M.bechsteinii") %>%
+  ggplot(aes(x = time_to_set, y = n_active/n_intervals)) +
+  geom_point(alpha = 0.2) +
+  geom_smooth() + scale_color_wsj() + theme_bw(14) +
+  facet_wrap(~ID) + geom_hline(yintercept = 0.5, linetype = "dashed") +
+  ylim(0, 1) + ylab("Activity probability (15 min interval)") +
+  xlab("Time to sunset (h)")
+```
+
+
+## 2. HGAM for species comparison of daily activity patterns
+We fit a Hierarchical Generalized Additive Model to compare whether Bechstein's and Leisler's bats differ significantly in their daily activity patterns. We assume that the probability of activity is a non-linear function of the time of day, here centered around the time of sunset (t = 0).
+We use a binomial error term since our activity column is a series of 0s and 1s (i.e. the bat is marked as either passive or active for that minute).
+
+### 2.1 Fit models
+We use cyclic ("circular") spline functions to constrain the activity probability to be equal at 00:00 and 23:59 (argument `bs = "cc"`). We also need to account for the fact that individuals were measured repeatedly, but in different years of monitoring. The simplest way to do that is to include individual and date as random intercepts with the argument `bs = "re"`.
+Note that there are many flavors for specifying random effects with HGAMs, which apply more or less penalty to the random effects and allow them to deviate from population-level trends. Here, we are mainly interested in species differences and assume that there is a general time-of-activity function that is species specific. Note that more complex random effect structures gave functionally similar results and did not affect conclusions. [Pedersen et al. 2019](https://peerj.com/articles/6876/) provide a great and exhaustive tutorial on the different ways to approach random effects within the GAM framework.
+There are two other arguments that require our attention. First, we need to account for the fact that observations are likely to be highly autocorrelated because they are taken at 1-min intervals. The autocorrelation parameter (`rho`) has to be set manually, and we show our procedure for investigating autocorrelation in our analysis R script (see `filename`, section 1.1). Next, we need to select the degree of complexity of our smoothing terms: `k`. After inspection with the `k.check` function, setting k to 120 was the highest complexity we could set without overfitting the data (see `filename`, section 1.2 for details).
+
+To test for species differences in activity patterns, we fit two models. Model M0 assumes that both species have the same global activity pattern, while model M1 allows both species to follow their own trend (argument `by = species` within the spline function).
+
+
+Set the autocorrelation (`r1`) and basis dimension (`k`) of the smooth term:
+
+```{r}
+r1 <- 0.5765725
+k <- 120
+```
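+
+As an aside, and only as a hedged sketch (our actual procedure is documented in the analysis script referenced above), a common heuristic for a starting value of `rho` is the lag-1 autocorrelation of the residuals of a model fitted without an AR term:
+
+```{r eval=FALSE}
+#sketch only: fit once without the AR(1) term, then use the lag-1
+#autocorrelation of the residuals as a candidate value for r1
+fit.noAR <- bam(activity ~ s(time_to_set, bs = "cc", k = k) +
+                  s(ID, bs = "re") + s(date_f, bs = "re"),
+                data = df_1min, method = "fREML",
+                discrete = TRUE, family = "binomial")
+acf(resid(fit.noAR), plot = FALSE)$acf[2]
+```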
+Fit model M0:
+
+```{r}
+fit.gam.M0 <- bam(activity ~
+                    s(time_to_set, bs = c("cc"), k = k) +
+                    s(ID, bs = "re") + s(date_f, bs = "re"),
+                  rho = r1, AR.start = start.event,
+                  data = df_1min,
+                  method = "fREML", discrete=T, family = "binomial",
+                  knots=list(time_to_set=c(min_set, max_set)))
+summary(fit.gam.M0)
+#draw(fit.gam.M0)
+```
+
+Fit model M1:
+```{r}
+fit.gam.M1 <- bam(activity ~
+                    s(time_to_set, bs = c("cc"), k = k, by = species) +
+                    s(ID, bs = "re") + s(date_f, bs = "re"),
+                  rho = r1, AR.start = start.event,
+                  data = df_1min,
+                  method = "fREML", discrete=T, family = "binomial",
+                  knots=list(time_to_set=c(min_set, max_set)))
+summary(fit.gam.M1)
+#draw(fit.gam.M1)
+```
+
+Compare M0 and M1:
+
+```{r}
+anova(fit.gam.M0, fit.gam.M1)
+
+AIC(fit.gam.M0, fit.gam.M1) %>%
+  mutate(delta.AIC = AIC - min(AIC)) %>% arrange(delta.AIC)
+```
+
+### 2.2 Plot models
+
+```{r}
+# Calculate hourly interval data
+df_1h <- df_1min %>%
+  mutate(interval = as_hms(floor_date(timestamp, unit = "60minutes"))) %>%
+  group_by(ID, species, year, ydate, hour, interval) %>%
+  summarise(n_intervals = length(activity),
+            n_active = length(activity[activity == 1]),
+            n_passive = length(activity[activity == 0]),
+            time_to_set = mean(time_to_set)) # calculate average time to sunset for that 60-minute interval
+df_1h$p_act <- df_1h$n_active/df_1h$n_intervals
+
+fit.values <- evaluate_smooth(fit.gam.M1, "s(time_to_set)", n = 244,
+                              overall_uncertainty = T,
+                              unconditional = T)
+draw(fit.values)
+
+b0 <- coef(fit.gam.M1)[1]
+Fig5 <- ggplot(data = fit.values,
+               aes(x = time_to_set, y = plogis(est+b0),
+                   color = species, group = species))
+Fig5a <- Fig5 +
+  geom_point(data = df_1h, alpha = .1,
+             aes(x = time_to_set,
+                 y = p_act)) +
+  geom_ribbon(aes(ymin = plogis(est+b0 - 2 * se),
+                  ymax = plogis(est+b0 + 2 * se)),
+              fill = "grey", color = "grey") +
+  geom_line(size = .5) +
+  geom_hline(yintercept = 0.5, linetype = "dashed") +
+  scale_color_wsj() + theme_bw(14) +
+  xlab("") +
+  ylab("Activity probability \n") +
+  ylim(0, 1) +
+  theme(legend.position = c(.1,.745))
+Fig5a
+
+fit.delta <- difference_smooths(fit.gam.M1, "s(time_to_set)", n = 244)
+
+Fig5b <- ggplot(data = fit.delta,
+                aes(x = time_to_set, y = diff)) +
+  geom_ribbon(aes(ymin = lower,
+                  ymax = upper), color = "grey",
+              alpha = 0.3) +
+  geom_line() + geom_hline(yintercept = 0, linetype = "dashed") +
+  theme_bw(14) + theme(legend.position = "none") +
+  xlab("Time since sunset (h)") +
+  ylab("Activity difference \n (M.bechsteinii - N.leisleri)")
+
+Fig5 <- Fig5a / Fig5b
+Fig5
+```
+
+### 2.3 Extract timing of activity values
+The following section shows how different aspects of the daily activity patterns can be extracted from the HGAM output. We focus here on determining the timing of the onset and end of the daily activity period, the timing of peak activity and the intensity of activity during the night.
+```{r}
+# Onset of activity: when does p(activity) > 0.5 first
+# End of activity: when does p(activity) > 0.5 last
+fit.values %>% group_by(species) %>%
+  filter(plogis(est+b0) > .5) %>%
+  summarise(a.onset = as_hms(min(time_to_set)*60),
+            a.end = as_hms(max(time_to_set)*60))
+# Peak activity: what is the highest value for p(activity)?
+fit.values %>% group_by(species) %>%
+  filter(plogis(est+b0) == max(plogis(est+b0))) %>%
+  summarise(peak.a = max(plogis(est+b0)),
+            peak.a.low = plogis(est+b0-2*se),
+            peak.a.up = plogis(est+b0+2*se))
+
+# Peak activity timing: time of day with maximum p(activity)
+fit.values %>% group_by(species) %>%
+  filter(plogis(est+b0) == max(plogis(est+b0))) %>%
+  group_by(species) %>%
+  summarise(t.peak = as_hms(time_to_set*60))
+
+# Activity density: area under the curve when p(activity) > 0.5
+fit.values %>% group_by(species) %>%
+  filter(plogis(est+b0) > .5) %>%
+  summarise(auc = bayestestR::auc(time_to_set, plogis(est+b0),
+                                  method = "spline"),
+            auc.low = bayestestR::auc(time_to_set, plogis(est+b0-2*se),
+                                      method = "spline"),
+            auc.up = bayestestR::auc(time_to_set, plogis(est+b0+2*se),
+                                     method = "spline"))
+```
+
+## 3. HGAM for comparing activity patterns across reproductive periods
+We can also use HGAMs to compare the timing of daily activity depending on the reproductive status of individuals within each species. The models used here are very similar to those used in section 2, with the main difference that we are now comparing a model that assumes a common activity pattern for all statuses (model 0) with one that allows reproductive statuses to have differing smoothing functions.
+
+Exclude individuals of unknown reproductive status:
+```{r}
+df_1min %>% filter(rep.state != "unknown") %>%
+  group_by(species) %>%
+  summarise(nID = n_distinct(ID),
+            nObs = n())
+
+df_1min %>% filter(rep.state != "unknown") %>%
+  group_by(species, rep.state) %>%
+  summarise(nID = n_distinct(ID),
+            nObs = n())
+
+# Remove individuals of unknown status
+df_1min.Mb <- df_1min %>% filter(species == "M.bechsteinii" & rep.state != "unknown") %>% droplevels()
+df_1min.Nl <- df_1min %>% filter(species == "N.leisleri" & rep.state != "unknown") %>% droplevels()
+```
+
+### 3.1 Bechstein's bats
+
+```{r}
+fit.gam.Mb.0 <- bam(activity ~
+                      s(time_to_set, bs = c("cc"), k = k) +
+                      s(ID, bs = "re") + #, k = 25
+                      s(date_f, bs = "re"), #k = 173
+                    rho = r1, AR.start = start.event,
+                    data = df_1min.Mb,
+                    method = "fREML", discrete=T, family = "binomial",
+                    knots=list(time_to_set=c(min_set, max_set)))
+summary(fit.gam.Mb.0)
+
+### M1: difference in reproductive status on average & in smooth ----
+fit.gam.Mb.1 <- bam(activity ~ rep.state +
+                      s(time_to_set, bs = c("cc"), k = k,
+                        by = rep.state) +
+                      s(ID, bs = "re") + s(date_f, bs = "re"),
+                    rho = r1, AR.start = start.event,
+                    data = df_1min.Mb,
+                    method = "fREML", discrete=T, family = "binomial",
+                    knots=list(time_to_set=c(min_set, max_set)))
+summary(fit.gam.Mb.1)
+
+AIC(fit.gam.Mb.0, fit.gam.Mb.1) %>%
+  mutate(delta.AIC = AIC - min(AIC)) %>% arrange(delta.AIC)
+```
+
+
+
+### 3.2 Leisler's bats
+```{r}
+fit.gam.Nl.0 <- bam(activity ~
+                      s(time_to_set, bs = c("cc"), k = k) +
+                      s(ID, bs = "re") + s(date_f, bs = "re"),
+                    rho = r1, AR.start = start.event,
+                    data = df_1min.Nl,
+                    method = "fREML", discrete=T, family = "binomial",
+                    knots=list(time_to_set=c(min_set, max_set)))
+summary(fit.gam.Nl.0)
+
+fit.gam.Nl.1 <- bam(activity ~ rep.state +
+                      s(time_to_set, bs = c("cc"), k = k,
+                        by = rep.state) +
+                      s(ID, bs = "re") + s(date_f, bs = "re"),
+                    rho = r1, AR.start = start.event,
+                    data = df_1min.Nl,
+                    method = "fREML", discrete=T, family = "binomial",
+                    knots=list(time_to_set=c(min_set, max_set)))
+summary(fit.gam.Nl.1)
+
+AIC(fit.gam.Nl.0, fit.gam.Nl.1) %>%
+  mutate(delta.AIC = AIC - min(AIC)) %>% arrange(delta.AIC)
+```
+
+### 3.3 Plot figure
+```{r}
+fit.values.Mb <- evaluate_smooth(fit.gam.Mb.1, "s(time_to_set)", n = 244,
+                                 overall_uncertainty = T,
+                                 unconditional = T)
+
+fit.values.Nl <- evaluate_smooth(fit.gam.Nl.1, "s(time_to_set)", n = 244,
+                                 overall_uncertainty = T,
+                                 unconditional = T)
+
+b0 <- coef(fit.gam.Mb.1)[1]
+Fig6a <- ggplot(data = fit.values.Mb,
+                aes(x = time_to_set, y = plogis(est+b0),
+                    fill = rep.state, group = rep.state)) +
+  geom_ribbon(aes(ymin = plogis(est+b0 - 2 * se),
+                  ymax = plogis(est+b0 + 2 * se)),
+              alpha = .5) +
+  geom_line(size = .5) +
+  geom_hline(yintercept = 0.5, linetype = "dashed") +
+  scale_fill_wsj() + theme_bw(14) +
+  #facet_wrap(~rep.state) +
+  xlab("time since sunset (h)") +
+  ylab("Activity probability \n") +
+  ylim(0, 1) +
+  #theme(legend.position = "none") +
+  theme(legend.position = c(.15,.745),
+        legend.title=element_blank()) +
+  ggtitle("M.bechsteinii")
+
+b0 <- coef(fit.gam.Nl.1)[1]
+Fig6b <- ggplot(data = fit.values.Nl,
+                aes(x = time_to_set, y = plogis(est+b0),
+                    fill = rep.state, group = rep.state)) +
+  geom_ribbon(aes(ymin = plogis(est+b0 - 2 * se),
+                  ymax = plogis(est+b0 + 2 * se)),
+              alpha = .5) +
+  geom_line(size = .5) +
+  geom_hline(yintercept = 0.5, linetype = "dashed") +
+  scale_fill_wsj() + theme_bw(14) +
+  #facet_wrap(~rep.state) +
+  xlab("time since sunset (h)") +
+  ylab("Activity probability \n") +
+  ylim(0, 1) +
+  #theme(legend.position = "none") +
+  theme(legend.position = "none") +
+  ggtitle("N.leisleri")
+
+Fig6 <- Fig6a + Fig6b
+Fig6
+```
+

From c9e42e0b6a5fbc2e99a33de9f5cfa49ee90fb46e Mon Sep 17 00:00:00 2001
From: JannisGottwald <53615516+JannisGottwald@users.noreply.github.com>
Date: Tue, 20 Sep 2022 10:51:09 +0300
Subject: [PATCH 3/5] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 1bff8d3..fa5189c 100644
--- a/README.md
+++ b/README.md
@@ -30,7 +30,7 @@ devtools::install_github("Nature40/tRackIT")
 
 ## Test data and models
 
-To check out the functionalities of the package using the package vignette, we recommend downloading the test data and [trained models]( https://doi.org/10.17192/fdr/79) for activity classification. Models need to be unzipped and stored in the extdata folder of the installed tRackIT-package.
+To check out the functionalities of the package using the package vignette, we recommend downloading the [test data](https://data.uni-marburg.de/handle/dataumr/172) and [trained models]( https://doi.org/10.17192/fdr/79) for activity classification. Models need to be unzipped and stored in the extdata folder of the installed tRackIT-package. We also provide the following [tutorials] (https://github.com/Nature40/tRackIT/tree/main/rmd) describing the workflow for [model tuning and evaluation](https://nature40.github.io/tRackIT_activity_classification_model_tuning_and_evaluation/) for activity classification. You can also check the [reproducible script](https://nature40.github.io/tRackIt_activity_ecological_case_study/) for the case study analysis shown in the paper.

From f483bfa82285d4c8fec61c23c62a95fbe8cbcb0e Mon Sep 17 00:00:00 2001
From: JannisGottwald <53615516+JannisGottwald@users.noreply.github.com>
Date: Tue, 20 Sep 2022 10:51:35 +0300
Subject: [PATCH 4/5] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index fa5189c..bf1b4bf 100644
--- a/README.md
+++ b/README.md
@@ -30,7 +30,7 @@ devtools::install_github("Nature40/tRackIT")
 
 ## Test data and models
 
-To check out the functionalities of the package using the package vignette, we recommend downloading the [test data](https://data.uni-marburg.de/handle/dataumr/172) and [trained models]( https://doi.org/10.17192/fdr/79) for activity classification. Models need to be unzipped and stored in the extdata folder of the installed tRackIT-package. We also provide the following [tutorials] (https://github.com/Nature40/tRackIT/tree/main/rmd) describing the workflow for [model tuning and evaluation](https://nature40.github.io/tRackIT_activity_classification_model_tuning_and_evaluation/) for activity classification. You can also check the [reproducible script](https://nature40.github.io/tRackIt_activity_ecological_case_study/) for the case study analysis shown in the paper.
+To check out the functionalities of the package using the package vignette, we recommend downloading the [test data](https://data.uni-marburg.de/handle/dataumr/172) and [trained models]( https://doi.org/10.17192/fdr/79) for activity classification. Models need to be unzipped and stored in the extdata folder of the installed tRackIT-package. We also provide the following [tutorials](https://github.com/Nature40/tRackIT/tree/main/rmd) describing the workflow for [model tuning and evaluation](https://nature40.github.io/tRackIT_activity_classification_model_tuning_and_evaluation/) for activity classification. You can also check the [reproducible script](https://nature40.github.io/tRackIt_activity_ecological_case_study/) for the case study analysis shown in the paper.

From d9d9afd81ab6715979024e5679efe05981cb40bc Mon Sep 17 00:00:00 2001
From: JannisGottwald <53615516+JannisGottwald@users.noreply.github.com>
Date: Tue, 20 Sep 2022 10:52:42 +0300
Subject: [PATCH 5/5] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index bf1b4bf..d76f6fe 100644
--- a/README.md
+++ b/README.md
@@ -28,7 +28,7 @@ devtools::install_github("Nature40/tRackIT")
 
 ```
 
-## Test data and models
+## Test data, models and tutorials
 
 To check out the functionalities of the package using the package vignette, we recommend downloading the [test data](https://data.uni-marburg.de/handle/dataumr/172) and [trained models]( https://doi.org/10.17192/fdr/79) for activity classification. Models need to be unzipped and stored in the extdata folder of the installed tRackIT-package. We also provide the following [tutorials](https://github.com/Nature40/tRackIT/tree/main/rmd) describing the workflow for [model tuning and evaluation](https://nature40.github.io/tRackIT_activity_classification_model_tuning_and_evaluation/) for activity classification. You can also check the [reproducible script](https://nature40.github.io/tRackIt_activity_ecological_case_study/) for the case study analysis shown in the paper.