-
Notifications
You must be signed in to change notification settings - Fork 1
/
bipolar_study_subject_selection_abcd5.0.py
309 lines (247 loc) · 15.6 KB
/
bipolar_study_subject_selection_abcd5.0.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
# ---
# jupyter:
# jupytext:
# formats: ipynb,py:percent
# text_representation:
# extension: .py
# format_name: percent
# format_version: '1.3'
# jupytext_version: 1.14.1
# kernelspec:
# display_name: Python 3 (ipykernel)
# language: python
# name: python3
# ---
# %% [markdown]
# The purpose of this notebook is to select a reasonable collection of subjects for a dimensional neuroimaging study of the onset of BP symptoms.
# %%
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
# %% [markdown]
# The tabular ABCD data is obtained by logging in and using the download link at the [this NDA study page](https://nda.nih.gov/study.html?id=2147). Extract the zip file.
# %%
data_basedir = Path("/home/ebrahim/data/abcd/abcd-5.0-tabular-data-extracted")
# %% [markdown]
# For the 5.0 release, the data dictionary has moved to [this nifty web explorer](https://data-dict.abcdstudy.org/). For the purpose of my explorations I found it easiest to select nothing in the graph (in order to apply no filters) then select the checkbox to check *all* variables on the right. Then use the tool to download a csv data dictionary for all the variables:
# %%
data_dictionary_path = Path("/home/ebrahim/data/abcd/abcd-5.0-data-dictionary.csv")
data_dictionary = pd.read_csv(data_dictionary_path, index_col=['table_name','var_name'])
# %% [markdown]
# I think that the imaging metadata table `fmriresults01.txt` is still downloaded the way as with ABCD 4.0 (creating a data package and using the NDA download manager or a similar tool). It's mentioned [here](https://wiki.abcdstudy.org/release-notes/imaging/overview.html) in the 5.0 release notes, but no file by the name `fmriresults01.txt` is to be found in the extracted tabular data csv files.
# %%
dmri_info_table_path = Path("/home/ebrahim/data/abcd/Package_1217694/fmriresults01.txt")
# %% [markdown]
# # Load tables
# %%
# Demographic data
demo = pd.read_csv(data_basedir/"core/abcd-general/abcd_p_demo.csv")
ldemo = pd.read_csv(data_basedir/"core/abcd-general/abcd_y_lt.csv",index_col=['src_subject_id','eventname'],parse_dates=['interview_date'])
# raw survey data for bipolar
bp= pd.read_csv(data_basedir/"core/mental-health/mh_p_ksads_bp.csv", low_memory=False) # parent survey
bp_youth = pd.read_csv(data_basedir/"core/mental-health/mh_y_ksads_bip.csv", low_memory=False) # youth survey
# mental health data
mh = pd.read_csv(data_basedir/"core/mental-health/mh_y_ksads_ss.csv")
# DMRI imaging data
dmri = pd.read_csv(dmri_info_table_path,sep='\t',header=0,skiprows=[1],parse_dates=['interview_date'])
# %% [markdown]
# In the 4.0 notebook we removed a couple of subject keys from the tables because their 2 and 3 year follow ups occured around the same interview age. This may have been an error that was corrected in 5.0 because it's no longer the case when we look in the table `ldemo`.
# %% [markdown]
# # Subject Selection
# %% [markdown]
# There are up to 3 time points for these interviews:
# %%
bp.eventname.value_counts()
# %% [markdown]
# Look at counts for the 1-year follow-up data... it looks like release 5.0 dropped a lot of the 1-year follow-up data that was in 4.0! That's for the raw bipolar interview questions. Something must have gone wrong in the 5.0 release of the raw KSADS interview questions here. Same issue in the youth survey:
# %%
bp_youth.eventname.value_counts()
# %% [markdown]
# But the general diagnostics table appears to be okay still:
# %%
mh.eventname.value_counts()
# %% [markdown]
# In fact we see here that there are up to *five* time points. Somehow we have raw answers to KSADS questions in for only baseline and 2-year follow-up (and in rare cases also 1-year follow-up), while we have the KSADS symptoms and diagnostics across all five time points.
#
# I'll use the KSADS diagnostics table `mh` instead of the KSADS raw answers tables `bp,bp_youth` to do subject selection this time, since something is wrong with the raw answers tables.
# %% [markdown]
# Our largest number of subjects to select is the number who have mental health info from at least one time point:
# %%
mh.src_subject_id.nunique()
# %% [markdown]
# Or we could only keep the subjects who did at least 4 of the mental health interviews:
# %%
sum(mh.groupby('src_subject_id').apply(lambda x : len(x)) >= 4)
# %% [markdown]
# We can instead focus on subjects who had any BP symptom as indicated by KSADS diagnistics, with a sampling of the same number of healthy controls. Let's look what are the bipolar-related columns in the KSADS diagnostics table:
# %%
mh_dd = data_dictionary.loc['mh_y_ksads_ss']
for i,r in mh_dd[mh_dd.var_label.str.contains("Bipolar", case=False)].iterrows():
print(f"{i}: {r.var_label}")
# %% [markdown]
# Hand-pick based on whether they seem to indicate that some sort of BP may be present:
# %%
bp_elementnames = [
'ksads_2_207_t', # Symptom - Psychomotor Agitation in Bipolar Disorder, Present
'ksads_2_208_t', # Symptom - Psychomotor Agitation in Bipolar Disorder, Past
'ksads_2_215_t', # Symptom - Impairment in functioning due to bipolar, Present
'ksads_2_216_t', # Symptom - Impairment in functioning due to bipolar, Past
'ksads_2_217_t', # Symptom - Hospitalized due to Bipolar Disorder, Present
'ksads_2_218_t', # Symptom - Hospitalized due to Bipolar Disorder, Past
'ksads_2_830_t', # Diagnosis - Bipolar I Disorder, current episode manic (F31.1x)
'ksads_2_831_t', # Diagnosis - Bipolar I Disorder, current episode depressed, F31.3x
'ksads_2_832_t', # Diagnosis - Bipolar I Disorder, currently hypomanic F31.0
'ksads_2_833_t', # Diagnosis - Bipolar I Disorder, most recent past episode manic (F31.1x)
'ksads_2_834_t', # Diagnosis - Bipolar I Disorder, most recent past episode depressed (F31.1.3x)
'ksads_2_835_t', # Diagnosis - Bipolar II Disorder, currently hypomanic F31.81
'ksads_2_836_t', # Diagnosis - Bipolar II Disorder, currently depressed F31.81
'ksads_2_837_t', # Diagnosis - Bipolar II Disorder, most recent past hypomanic F31.81
'ksads_2_838_t', # Diagnosis - Unspecified Bipolar and Related Disorder, current (F31.9)
'ksads_2_839_t', # Diagnosis - Unspecified Bipolar and Related Disorder, PAST (F31.9)
'ksads2_2_803_t', # Diagnosis - Other Specified Bipolar and Related Disorder (F31.9) Meets criteria for Bipolar II disorder except duration criteria
'ksads2_2_931_t', # Diagnosis - Bipolar II Disorder, Currently Depressed F31.81
'ksads2_2_932_t', # Diagnosis - Other Specified Bipolar and Related Disorder (F31.9) Meets criteria for Bipolar II disorder except never had major depressive episode
'ksads2_2_933_t', # Diagnosis - Other Specified Bipolar and Related Disorder (F31.9) Recurrent manic or hypomanic episodes of shorter duration than minimum criteria
'ksads2_2_936_t', # Bipolar II Disorder, Currently Depressed (in partial remission) F31.81
'ksads2_2_937_t', # Bipolar I Disorder, Currently Depressed (in partial remission) F31.3x
'ksads2_2_802_t', # Diagnosis - Bipolar II Disorder, most recent episode hypomanic F31.81
'ksads2_2_193_t', # Symptom - Psychomotor Agitation in Bipolar Disorder, Present
'ksads2_2_194_t', # Symptom - Psychomotor Agitation in Bipolar Disorder, Past
'ksads2_2_201_t', # Symptom - Impairment in functioning due to bipolar, Present
'ksads2_2_202_t', # Symptom - Impairment in functioning due to bipolar, Past
'ksads2_2_203_t', # Symptom - Hospitalized due to Bipolar Disorder, Present
'ksads2_2_204_t', # Symptom - Hospitalized due to Bipolar Disorder, Past
'ksads2_2_798_t', # Diagnosis - Bipolar I Disorder, current episode manic (F31.1x)
'ksads2_2_799_t', # Diagnosis - Bipolar I Disorder, current episode depressed, F31.3x
'ksads2_2_800_t', # Diagnosis - Bipolar I Disorder, most recent episode manic (F31.9)
'ksads2_2_801_t', # Diagnosis - Bipolar II Disorder, currently hypomanic F31.81
]
# %% [markdown]
# Hand-pick a subset based on whether they are more directly indicative of a bipolar diagnosis:
# %%
bp_elementnames2 = [
'ksads_2_830_t', # Diagnosis - Bipolar I Disorder, current episode manic (F31.1x)
'ksads_2_831_t', # Diagnosis - Bipolar I Disorder, current episode depressed, F31.3x
'ksads_2_832_t', # Diagnosis - Bipolar I Disorder, currently hypomanic F31.0
'ksads_2_833_t', # Diagnosis - Bipolar I Disorder, most recent past episode manic (F31.1x)
'ksads_2_834_t', # Diagnosis - Bipolar I Disorder, most recent past episode depressed (F31.1.3x)
'ksads_2_835_t', # Diagnosis - Bipolar II Disorder, currently hypomanic F31.81
'ksads_2_836_t', # Diagnosis - Bipolar II Disorder, currently depressed F31.81
'ksads_2_837_t', # Diagnosis - Bipolar II Disorder, most recent past hypomanic F31.81
'ksads2_2_931_t', # Diagnosis - Bipolar II Disorder, Currently Depressed F31.81
'ksads2_2_936_t', # Bipolar II Disorder, Currently Depressed (in partial remission) F31.81
'ksads2_2_937_t', # Bipolar I Disorder, Currently Depressed (in partial remission) F31.3x
'ksads2_2_802_t', # Diagnosis - Bipolar II Disorder, most recent episode hypomanic F31.81
'ksads2_2_798_t', # Diagnosis - Bipolar I Disorder, current episode manic (F31.1x)
'ksads2_2_799_t', # Diagnosis - Bipolar I Disorder, current episode depressed, F31.3x
'ksads2_2_800_t', # Diagnosis - Bipolar I Disorder, most recent episode manic (F31.9)
'ksads2_2_801_t', # Diagnosis - Bipolar II Disorder, currently hypomanic F31.81
]
# %% [markdown]
# Here is the number of subjects that at some time had a "1" in each of the two respective sets of columns:
# %%
mh['some_bp_thing'] = mh[bp_elementnames].apply(lambda x : (x==1).any(), axis=1)
mh['some_bp_thing2'] = mh[bp_elementnames2].apply(lambda x : (x==1).any(), axis=1)
print(mh.groupby('src_subject_id').some_bp_thing.agg(func='any').sum())
print(mh.groupby('src_subject_id').some_bp_thing2.agg(func='any').sum())
# %% [markdown]
# That last one seems like a good number to take: 878 subjects. Let's now restrict to the subset that also have good longitudinal data.
# %%
# indicates whether a given *subject* had at least one interview that contained a positive diagnostic
subject_positive_bp = mh.groupby('src_subject_id').some_bp_thing2.agg(func='any')
# indicates whether a given subject did at least four of the 5 mental health interviews
subject_interview_complete = mh.groupby('src_subject_id').apply(lambda x : len(x)) >= 4
# indicates whether a given subject did at least two scans
subject_scans_complete = dmri.groupby(['subjectkey']).apply(lambda x : x.interview_age.nunique()) == 2
# %% [markdown]
# If we want to include the subjects that satisfy all three of these criteria, then here's how many subjects we get:
# %%
(subject_positive_bp & subject_interview_complete & subject_scans_complete).sum()
# %% [markdown]
# However after looking at the sorts of subjects we get from this and when the BP indicators show up... a large number seem to only have BP indicated at the baseline interview and then never again at a later interview. Let's strengthen our filter to only include those who had some BP indication beyond just the baseline interview. It just seems not believable to have the prevalence be so much higher at baseline; that was likely a case of overdiagnosing early on. We may later want to restrict the filtering even more to only include those whose BP indication comes from KSADS 2, so from the 3-year follow-up onward, since KSADS 2 is known to be better at not overdiagnosing.
# %%
subject_positive_bp_beyond_baseline = mh[mh.eventname!="baseline_year_1_arm_1"].groupby('src_subject_id').some_bp_thing2.agg(func='any')
# %%
subject_positive_bp_beyond_baseline.sum()
# %%
(subject_positive_bp_beyond_baseline & subject_interview_complete & subject_scans_complete).sum()
# %%
subject_inclusion_criteria = pd.concat(
{'positive_bp':subject_positive_bp_beyond_baseline,
'interview_complete':subject_interview_complete,
'scans_complete':subject_scans_complete
}, axis=1)
subject_inclusion_criteria = subject_inclusion_criteria.replace(np.nan, False)
# %% [markdown]
# If we then want to include the same number of healthy controls, then we can take a random sample that is stratified based on the demographics of these 190. (Or should it be stratified based on the demographics of the full abcd study cohort?)
# %% [markdown]
# # COVID-19 Lockdown Effects
#
# ## Data Collection Timing
# Let's now look at the interview dates for those included subjects.
# %%
included_subjects = subject_inclusion_criteria[subject_inclusion_criteria.all(axis=1)].index
# %%
time_index_mapping = {
'baselineYear1Arm1' : 0,
'2YearFollowUpYArm1' : 1,
'4YearFollowUpYArm1' : 2,
}
dmri['time_index'] = dmri.derived_files.apply(lambda x : time_index_mapping[x.split('/')[-1].split('_')[1]])
# %%
def year_and_month_to_month_offset(year, month, day):
return (year-2018)*12 + month + day/30.5
dmri_second_scan_included_subjects = dmri[dmri.subjectkey.isin(included_subjects) & (dmri.time_index==1)]
dmri_second_scan_included_unique_subjects = dmri_second_scan_included_subjects.groupby('subjectkey').apply(lambda x : x.iloc[0])
month_offsets = dmri_second_scan_included_unique_subjects.interview_date.apply(lambda x : year_and_month_to_month_offset(x.year,x.month, x.day))
lockdown_month_offset = year_and_month_to_month_offset(2020,3,20)
# %%
bins = range(8,38,2)
plt.figure(figsize=(4,2))
plt.hist(month_offsets, bins=bins)
plt.plot([lockdown_month_offset,lockdown_month_offset],[0,100], label="Lockdown")
plt.xticks(ticks = bins)
plt.ylim(0,25)
plt.yticks(ticks=range(0,30,10))
plt.xlabel("Months since Jan 2018")
plt.title("2-Year Follow-Up Scan Dates (ABCD 5.0)")
plt.legend()
plt.show()
# %%
(month_offsets > lockdown_month_offset).value_counts()
# %%
dmri[dmri.subjectkey.isin(included_subjects)].groupby('subjectkey').time_index.agg(lambda x : (x!=1).all()).sum()
# %% [markdown]
# This is great. Of the 190 selected BP subjects, 61 of them had their second scan after lockdown measures, and 108 of them before. (And 21 of them did not have a 2-year follow-up scan at all, but just had baseline and 4-year follow-up ones.) That means we can _ignore_ lockdown effects by restricting to those 108 and it's not a serious loss at all. And we can also _observe_ lockdown effects if we want to, because 61 is a decent number of positive post-lockdown cases to have.
# %% [markdown]
# Let's now check the same for the mental health interview data.
# %%
time_index_mapping = {
'baseline_year_1_arm_1' : 0,
'1_year_follow_up_y_arm_1' : 1,
'2_year_follow_up_y_arm_1' : 2,
'3_year_follow_up_y_arm_1' : 3,
'4_year_follow_up_y_arm_1' : 4,
}
mh['time_index'] = mh.eventname.apply(lambda x : time_index_mapping[x])
def join_interview_date_from_ldemo_table(df):
return df.join(ldemo[['interview_date']], on=['src_subject_id','eventname'])
mh_time1_included_subjects = mh[mh.src_subject_id.isin(included_subjects) & (mh.time_index==1)]
mh_time1_included_subjects = join_interview_date_from_ldemo_table(mh_time1_included_subjects)
month_offsets1 = mh_time1_included_subjects.interview_date.apply(lambda x : year_and_month_to_month_offset(x.year,x.month, x.day))
mh_time2_included_subjects = mh[mh.src_subject_id.isin(included_subjects) & (mh.time_index==2)]
mh_time2_included_subjects = join_interview_date_from_ldemo_table(mh_time2_included_subjects)
month_offsets2 = mh_time2_included_subjects.interview_date.apply(lambda x : year_and_month_to_month_offset(x.year,x.month, x.day))
# %%
bins = range(8,38,2)
plt.figure(figsize=(4,2))
plt.hist(month_offsets2, bins=bins)
plt.plot([lockdown_month_offset,lockdown_month_offset],[0,100], label="Lockdown")
plt.xticks(ticks = bins)
plt.ylim(0,25)
plt.yticks(ticks=range(0,30,10))
plt.xlabel("Months since Jan 2018")
plt.title("2-Year Follow-Up Interview Dates (ABCD 5.0)")
plt.legend()
plt.show()