p8105_mtp_zdz2101.Rmd

---
title: "p8105_mtp_zdz2101"
author: "Zelos Zhu"
date: "10/22/2018"
output: github_document
---

##Load Packages
```{r Load Libraries, messages = FALSE}
library(tidyverse)
library(readr)
library(wordcountaddin)
library(patchwork)
wordcountaddin::text_stats("p8105_mtp_zdz2101.Rmd")
```

##Read Data/Tidying
```{r Reading in data and tidying, message = FALSE}
accelero_data <- read_csv("data/p8105_mtp_data.csv") %>%
  gather(., key = "activity", value = "activity_value", 3:1442) %>%
  mutate(day = factor(day, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")),
         day_id = (7 * week) - (7 - as.numeric(day)),
         minute = as.numeric(str_replace_all(activity, "activity.","")),
         hour = floor((minute - 1)/60)) %>%
  select(day_id, week, day, hour, activity, minute, activity_value) %>%
  arrange(week,day)

sample_n(accelero_data, 10) #show day_id works

#day_id is just a linear transform of week and day, sunday week 1 is day 1, sunday week 2 is day 8, etc.
#hour is made by minute: 1-60 are midnight, 61-120 are 1am, etc. I used another clever trick to transform hour to be 0-23
```

This dataset has `r nrow(accelero_data)` observations and `r ncol(accelero_data)` variables. The number of observations is expected as there is 1 reading/minute, 60 minutes/hour, 24 hours/day, 7 days/week, and 47 weeks worth of accelerometer readings (product of these 4 numbers gets you `r nrow(accelero_data)`). Variables are: 

1) **day_id**: day of study, assuming week starts on sunday
2) **week**: week of study 
3) **day**: weekday, recoded to factor
4) **hour**: hour of day (military time)
5) **activity**: essentially is the column names of the original file, used as "key" for the gather function, left here to "preserve original observations"
6) **minute**:  minute of day
7) **activity_value**: accelerometer activity reading

#Aggregate data by day and discover some trends
```{r Day Aggregate, fig.height = 16, fig.width = 20, message = FALSE}
day_aggregate_df <- accelero_data %>%
  group_by(day_id, week, day) %>%
  summarize(total_activity = sum(activity_value),
            mean = mean(activity_value),
            min = quantile(activity_value, 0),
            q1 = quantile(activity_value, 0.25),
            median = quantile(activity_value, 0.50),
            q3 = quantile(activity_value, 0.75),
            max = quantile(activity_value, 1)
            )
  
unreasonable_days <- day_aggregate_df$day_id[which(day_aggregate_df$total_activity == 1440)]

day_aggregate_plot <- day_aggregate_df %>%
  filter(total_activity != 1440) %>% #---------remove days w/ invalid readings
  ggplot(., aes(x = day_id, y = total_activity)) + 
  geom_point(aes(color = day), alpha = 0.5) + 
  geom_smooth() +
  ylab("Total Activity") +
  xlab("Day of Study") + 
  theme(legend.position = "bottom") +
  ggtitle("Total activity of each day") + 
  theme(axis.text = element_text(size = 16),
        axis.title = element_text(size = 18),
        plot.title = element_text(size = 20, face = "bold"),
        legend.title=element_text(size=14), 
        legend.text=element_text(size=14))

by_day_plot <- day_aggregate_df %>%
  filter(total_activity != 1440) %>%
  ggplot(., aes(x = day_id, y = total_activity, color = day)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth() + 
  ylab("Total Activity") +
  xlab("Day of Study") + 
  theme(legend.position = "none") +
  ggtitle("Total activity measure by weekdays") +
  theme(axis.text = element_text(size = 16),
        axis.title = element_text(size = 18),
        plot.title = element_text(size = 20, face = "bold"),
        strip.text = element_text(size = 14)) +
  facet_wrap(~day, nrow = 2)

day_aggregate_plot + by_day_plot
```

Plotting total activity by day, as shown on the left, we see an upward trend in activity throughout the course of the study. It dips around day 150, but the trend overall throughout 47 weeks is still positive. Based on the figure on the right, the patient's activity still increases over time regardless of what day it is suggesting weekday does not play a role in our findings. It could be argued the patient's activity on Thursdays looks stagnant. 

I discovered 18 days: day_id's `r unreasonable_days` where total activity is "1440" meaning there was no activity over the entirety of those days. I suspect the patient was not wearing the accelerometer or accelerometer malfunction, so I excluded these days from figures.

#Aggregate by hour/24 hour profiles
```{r Aggregate by hour/"activity profiles", fig.height = 16, fig.width = 20}
hour_aggregate <- accelero_data %>%
  filter(!day_id %in% unreasonable_days) %>%
  group_by(day_id, week, day, hour) %>%
  summarize(hourly_activity_total = sum(activity_value),
            hourly_activity_mean = mean(activity_value)) %>%
  mutate(sleep = ifelse(hour %in% c(0:6,23), "asleep", "awake"))

hourly_profiles <- ggplot(hour_aggregate, aes(x = as.factor(hour), y = hourly_activity_total, color = sleep)) +
  geom_boxplot() + 
  xlab("Hour of the Day") + 
  ylab("Total Activity") +
  ggtitle("Total activity based on hour-by-hour profiles") +
  theme(axis.text = element_text(size = 16),
        axis.title = element_text(size = 18),
        plot.title = element_text(size = 20, face = "bold"),
        strip.text = element_text(size = 14))

hourly_profiles_by_day <- ggplot(hour_aggregate, aes(x = as.factor(hour), y = hourly_activity_total, color = sleep)) +
  geom_boxplot() + 
  xlab("Hour of the Day") + 
  ylab("Total Activity") +
  ggtitle("Hour-by-hour profiles by day") +
  theme(axis.text = element_text(size = 16),
        axis.title = element_text(size = 18),
        plot.title = element_text(size = 20, face = "bold"),
        strip.text = element_text(size = 14)) +
  facet_wrap(~day, nrow = 2)

hourly_profiles/hourly_profiles_by_day
```

It seems logical to get a sense of when our patient is awake and doing things. Assuming 8 hours of sleep, I made the 8 hours with the lowest median total activity readings a different color to signify "asleep" hours. In general, he starts his day around 7-8am, ramps up activity which stabilizes from 10am until 8pm, then "winds down" his day and probably asleep by 11pm-12am. This routine does not seem to change too much by day of the week either though there is a handful of Friday nights and Sunday afternoons where he is clocking in some high activity. 

#Additional Exploratory Stuff
```{r Finding peak times, fig.height = 18, fig.width = 16}
top_one_percent <- quantile(accelero_data$activity_value, 0.99)

peaks_df<- accelero_data %>%
  filter(activity_value > top_one_percent) %>%
  select(day_id, hour, minute, activity_value) %>%
  group_by(day_id) %>%
  arrange(day_id, minute)

peak_times_plot <- ggplot(peaks_df, aes(x = minute, y = activity_value, color = as.factor(day_id))) +
  geom_line() + 
  scale_x_continuous(breaks = seq(1,1440,60), labels = 0:23) + 
  xlab("Hour of Day") +
  ylab("Activity Value") +
  ggtitle("Spaghetti plot of Top 1% of Accelerometer Readings") +
  theme(axis.text = element_text(size = 16),
        axis.title = element_text(size = 18),
        plot.title = element_text(size = 20, face = "bold"),
        strip.text = element_text(size = 14)) +
  geom_hline(yintercept =  2500, color = "black", linetype = "dashed") +
  theme(legend.position="none")

exercise_df <- peaks_df %>%
  filter(activity_value > 2500) %>%
  mutate(study = ifelse(day_id > 165, "Second Half", "First Half"))

study_half_counts <- exercise_df %>%
  group_by(study) %>%
  count()

exercise_hyp_plot <- ggplot(exercise_df, aes(x = minute, y = activity_value, color = study)) + 
  geom_jitter(alpha = 0.5) + 
  scale_x_continuous(breaks = seq(1,1440,60), labels = 0:23) + 
  xlab("Hour of Day") +
  ylab("Activity Value") + 
  ggtitle("Scatter plot of top readings based on two periods") +
  theme(axis.text = element_text(size = 16),
        axis.title = element_text(size = 18),
        plot.title = element_text(size = 20, face = "bold"),
        strip.text = element_text(size = 14)) +
  facet_grid(~study)

peak_times_plot/exercise_hyp_plot
```

Attempting to track exercise, I assumed the highest 1% of activity is during exercise. It was clear from the top figure there was still "noise" below the activity value ~2500; each "spaghetti" line was meant to distinguish a particular day. So I filtered for readings above 2500 (arbitrary, easily adjustable). The bottom panels show scatter plots of activity against time of day where the left is data from the first half of the study, days 1 - 165, and the right is data from the second half of the study, days 165 - 329. 

I suspect the patient likes to exercise during three periods: 11-1pm, 3-6pm or around 8pm, the 3-6pm slot becoming **much** more popular as time passed. The second half plot has much more points, generally higher too, than the first which may suggest the patient is exercising more frequently and intensely over time.