-
Notifications
You must be signed in to change notification settings - Fork 0
/
p8105_mtp_zdz2101.Rmd
179 lines (152 loc) · 8.74 KB
/
p8105_mtp_zdz2101.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
---
title: "p8105_mtp_zdz2101"
author: "Zelos Zhu"
date: "10/22/2018"
output: github_document
---
##Load Packages
```{r Load Libraries, messages = FALSE}
library(tidyverse)
library(readr)
library(wordcountaddin)
library(patchwork)
wordcountaddin::text_stats("p8105_mtp_zdz2101.Rmd")
```
##Read Data/Tidying
```{r Reading in data and tidying, message = FALSE}
accelero_data <- read_csv("data/p8105_mtp_data.csv") %>%
gather(., key = "activity", value = "activity_value", 3:1442) %>%
mutate(day = factor(day, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")),
day_id = (7 * week) - (7 - as.numeric(day)),
minute = as.numeric(str_replace_all(activity, "activity.","")),
hour = floor((minute - 1)/60)) %>%
select(day_id, week, day, hour, activity, minute, activity_value) %>%
arrange(week,day)
sample_n(accelero_data, 10) #show day_id works
#day_id is just a linear transform of week and day, sunday week 1 is day 1, sunday week 2 is day 8, etc.
#hour is made by minute: 1-60 are midnight, 61-120 are 1am, etc. I used another clever trick to transform hour to be 0-23
```
This dataset has `r nrow(accelero_data)` observations and `r ncol(accelero_data)` variables. The number of observations is expected as there is 1 reading/minute, 60 minutes/hour, 24 hours/day, 7 days/week, and 47 weeks worth of accelerometer readings (product of these 4 numbers gets you `r nrow(accelero_data)`). Variables are:
1) **day_id**: day of study, assuming week starts on sunday
2) **week**: week of study
3) **day**: weekday, recoded to factor
4) **hour**: hour of day (military time)
5) **activity**: essentially is the column names of the original file, used as "key" for the gather function, left here to "preserve original observations"
6) **minute**: minute of day
7) **activity_value**: accelerometer activity reading
#Aggregate data by day and discover some trends
```{r Day Aggregate, fig.height = 16, fig.width = 20, message = FALSE}
day_aggregate_df <- accelero_data %>%
group_by(day_id, week, day) %>%
summarize(total_activity = sum(activity_value),
mean = mean(activity_value),
min = quantile(activity_value, 0),
q1 = quantile(activity_value, 0.25),
median = quantile(activity_value, 0.50),
q3 = quantile(activity_value, 0.75),
max = quantile(activity_value, 1)
)
unreasonable_days <- day_aggregate_df$day_id[which(day_aggregate_df$total_activity == 1440)]
day_aggregate_plot <- day_aggregate_df %>%
filter(total_activity != 1440) %>% #---------remove days w/ invalid readings
ggplot(., aes(x = day_id, y = total_activity)) +
geom_point(aes(color = day), alpha = 0.5) +
geom_smooth() +
ylab("Total Activity") +
xlab("Day of Study") +
theme(legend.position = "bottom") +
ggtitle("Total activity of each day") +
theme(axis.text = element_text(size = 16),
axis.title = element_text(size = 18),
plot.title = element_text(size = 20, face = "bold"),
legend.title=element_text(size=14),
legend.text=element_text(size=14))
by_day_plot <- day_aggregate_df %>%
filter(total_activity != 1440) %>%
ggplot(., aes(x = day_id, y = total_activity, color = day)) +
geom_point(alpha = 0.5) +
geom_smooth() +
ylab("Total Activity") +
xlab("Day of Study") +
theme(legend.position = "none") +
ggtitle("Total activity measure by weekdays") +
theme(axis.text = element_text(size = 16),
axis.title = element_text(size = 18),
plot.title = element_text(size = 20, face = "bold"),
strip.text = element_text(size = 14)) +
facet_wrap(~day, nrow = 2)
day_aggregate_plot + by_day_plot
```
Plotting total activity by day, as shown on the left, we see an upward trend in activity throughout the course of the study. It dips around day 150, but the trend overall throughout 47 weeks is still positive. Based on the figure on the right, the patient's activity still increases over time regardless of what day it is suggesting weekday does not play a role in our findings. It could be argued the patient's activity on Thursdays looks stagnant.
I discovered 18 days: day_id's `r unreasonable_days` where total activity is "1440" meaning there was no activity over the entirety of those days. I suspect the patient was not wearing the accelerometer or accelerometer malfunction, so I excluded these days from figures.
#Aggregate by hour/24 hour profiles
```{r Aggregate by hour/"activity profiles", fig.height = 16, fig.width = 20}
hour_aggregate <- accelero_data %>%
filter(!day_id %in% unreasonable_days) %>%
group_by(day_id, week, day, hour) %>%
summarize(hourly_activity_total = sum(activity_value),
hourly_activity_mean = mean(activity_value)) %>%
mutate(sleep = ifelse(hour %in% c(0:6,23), "asleep", "awake"))
hourly_profiles <- ggplot(hour_aggregate, aes(x = as.factor(hour), y = hourly_activity_total, color = sleep)) +
geom_boxplot() +
xlab("Hour of the Day") +
ylab("Total Activity") +
ggtitle("Total activity based on hour-by-hour profiles") +
theme(axis.text = element_text(size = 16),
axis.title = element_text(size = 18),
plot.title = element_text(size = 20, face = "bold"),
strip.text = element_text(size = 14))
hourly_profiles_by_day <- ggplot(hour_aggregate, aes(x = as.factor(hour), y = hourly_activity_total, color = sleep)) +
geom_boxplot() +
xlab("Hour of the Day") +
ylab("Total Activity") +
ggtitle("Hour-by-hour profiles by day") +
theme(axis.text = element_text(size = 16),
axis.title = element_text(size = 18),
plot.title = element_text(size = 20, face = "bold"),
strip.text = element_text(size = 14)) +
facet_wrap(~day, nrow = 2)
hourly_profiles/hourly_profiles_by_day
```
It seems logical to get a sense of when our patient is awake and doing things. Assuming 8 hours of sleep, I made the 8 hours with the lowest median total activity readings a different color to signify "asleep" hours. In general, he starts his day around 7-8am, ramps up activity which stabilizes from 10am until 8pm, then "winds down" his day and probably asleep by 11pm-12am. This routine does not seem to change too much by day of the week either though there is a handful of Friday nights and Sunday afternoons where he is clocking in some high activity.
#Additional Exploratory Stuff
```{r Finding peak times, fig.height = 18, fig.width = 16}
top_one_percent <- quantile(accelero_data$activity_value, 0.99)
peaks_df<- accelero_data %>%
filter(activity_value > top_one_percent) %>%
select(day_id, hour, minute, activity_value) %>%
group_by(day_id) %>%
arrange(day_id, minute)
peak_times_plot <- ggplot(peaks_df, aes(x = minute, y = activity_value, color = as.factor(day_id))) +
geom_line() +
scale_x_continuous(breaks = seq(1,1440,60), labels = 0:23) +
xlab("Hour of Day") +
ylab("Activity Value") +
ggtitle("Spaghetti plot of Top 1% of Accelerometer Readings") +
theme(axis.text = element_text(size = 16),
axis.title = element_text(size = 18),
plot.title = element_text(size = 20, face = "bold"),
strip.text = element_text(size = 14)) +
geom_hline(yintercept = 2500, color = "black", linetype = "dashed") +
theme(legend.position="none")
exercise_df <- peaks_df %>%
filter(activity_value > 2500) %>%
mutate(study = ifelse(day_id > 165, "Second Half", "First Half"))
study_half_counts <- exercise_df %>%
group_by(study) %>%
count()
exercise_hyp_plot <- ggplot(exercise_df, aes(x = minute, y = activity_value, color = study)) +
geom_jitter(alpha = 0.5) +
scale_x_continuous(breaks = seq(1,1440,60), labels = 0:23) +
xlab("Hour of Day") +
ylab("Activity Value") +
ggtitle("Scatter plot of top readings based on two periods") +
theme(axis.text = element_text(size = 16),
axis.title = element_text(size = 18),
plot.title = element_text(size = 20, face = "bold"),
strip.text = element_text(size = 14)) +
facet_grid(~study)
peak_times_plot/exercise_hyp_plot
```
Attempting to track exercise, I assumed the highest 1% of activity is during exercise. It was clear from the top figure there was still "noise" below the activity value ~2500; each "spaghetti" line was meant to distinguish a particular day. So I filtered for readings above 2500 (arbitrary, easily adjustable). The bottom panels show scatter plots of activity against time of day where the left is data from the first half of the study, days 1 - 165, and the right is data from the second half of the study, days 165 - 329.
I suspect the patient likes to exercise during three periods: 11-1pm, 3-6pm or around 8pm, the 3-6pm slot becoming **much** more popular as time passed. The second half plot has much more points, generally higher too, than the first which may suggest the patient is exercising more frequently and intensely over time.