---
title: Project Description
---
For the project you will create, in groups of 2 or 3, a thorough analysis of a particular dataset using a multiple regression model.
There are three final deliverables for the project:
* A final report describing your problem, the analysis you conducted, and your conclusions (submitted as a group)
* Supplementary materials with R output and diagnostic plots for your model (submitted as a group)
* A group dynamic report (submitted individually)
The report and supplementary materials should both be generated by an R markdown document, and I will ask you to submit both the R markdown document and the resulting knitted pdf.
### Deadlines
* **5:00 PM Tues Nov 12:** Group memberships. [Fill out this Google form](https://forms.gle/Q1QnBwyYaQsnETn37) with your preference for group members. Groups should have 2 or 3 members. Only one person per proposed group needs to do this. If you don't have a group, just fill out the form and let me know that. I will assign you to a group. **Note that I reserve the right to shuffle group members and you may not get your preferred group!**
* **5:00 PM Sat Nov 16:** Project proposals and description of data sources. Submit by uploading to google drive and sending me an email (cc'ing all group members) letting me know you have done that.
* **5:00 PM Tue Nov 26:** At this checkpoint, you should have written at least a couple of paragraphs introducing the problem you are working on, read in your data set, and made at least 2 plots of the data. Submit by uploading to google drive and sending me an email (cc'ing all group members) letting me know you have done that.
* **5:00 PM Sat Dec 07:** At this checkpoint, you should have fit a multiple regression model, checked residual diagnostics, and taken steps to address any evident problems. Submit by uploading to google drive and sending me an email (cc'ing all group members) letting me know you have done that.
* **11:59 AM (just before noon) Tue Dec 17:** Final submission of R markdown file and pdf for report and supplementary materials. Submit by uploading to google drive and sending me an email (cc'ing all group members) letting me know you have done that.
* **11:59 AM (just before noon) Tue Dec 17:** Group dynamic report by email, **sent only to me**.
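
The two checkpoints above (read in and plot the data, then fit a model and check residual diagnostics) can be sketched roughly as follows. This is only an illustration: it uses R's built-in `mtcars` data as a stand-in for your own data set, and the choice of response (`mpg`) and explanatory variables (`wt`, `hp`, `am`) is an assumption made for the example.

```r
# Stand-in data: the built-in mtcars data set. Your project will use its own data,
# typically read in with something like read.csv().
cars <- mtcars
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

# First checkpoint: at least two plots of the data.
plot(mpg ~ wt, data = cars, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
boxplot(mpg ~ am, data = cars, xlab = "Transmission", ylab = "Miles per gallon")

# Second checkpoint: fit a multiple regression model ...
fit <- lm(mpg ~ wt + hp + am, data = cars)
summary(fit)

# ... and check residual diagnostics.
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(resid(fit))
qqline(resid(fit))
```

Note that `mtcars` has only 32 rows, so it is far smaller than the data set you should aim for in your own project.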
### Project Proposals
You must propose two distinct projects. This is so that if one of them isn't feasible, we can go with the other. Put your preferred project first. Each proposed project should have the following three elements:
1. A question that you find interesting and which may be addressed (at least in part) through the analysis of data. Your question should be complex enough that there are at least 3 explanatory variables to consider. Your response variable should be quantitative. Recent projects have considered the following questions:
\begin{itemize}
\item How is the state's murder rate related to its demographics and social characteristics?
\item How is the percentage of Massachusetts high school seniors going on to four-year colleges influenced by town and school characteristics?
\item What association is there between air pollution and mortality?
\item How can we predict real estate prices in Massachusetts?
\end{itemize}
2. A data set that can be used to answer the question you posed in part 1. I'm looking for either a link to the data set, or an attached spreadsheet or similar file.
3. A description of which specific variable in your data set will be your response (this variable should be quantitative!!) and which will be your explanatory variables.
Your proposals do not have to be extensive! A paragraph or two for each proposal is fine. I just need enough detail to decide if your proposed project is feasible.
Count on brainstorming a few serious ideas before you can groom one of them into a mature proposal.
For the most part, the choice of topic is left up to you. Try to pick something that's interesting yet substantial and worth studying. Please try to avoid time series data, since we have not studied models to handle it.
### Finding Data
Finding the right data to answer your particular question is part of your responsibility for this assignment. Public data sets are available from hundreds of different websites, on virtually any topic. You might not be able to find the exact data that you want -- but you should be able to find data that is relevant to your topic. You may also want to refine your research question so that it can be more clearly addressed by the data that you found. But be creative! Go find the data that you want!
Below is a list of places to get started, but this list should be considered grossly non-exhaustive:
* Gapminder http://www.gapminder.org
* Data.gov https://www.data.gov/
* StatLib at Carnegie Mellon http://lib.stat.cmu.edu/datasets/
* U.S. Bureau of Labor Statistics https://www.bls.gov/
* U.S. Census Bureau https://www.census.gov/
* University of Edinburgh http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html
* Comprehensive Epidemiologic Data Resource (CEDR) https://oriseapps.orau.gov/cedr/
* Kaggle: https://www.kaggle.com/
* UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/
* Google public data repository: https://www.google.com/publicdata/directory
* World Health Organization data: http://www.who.int/gho/database/en/
* Centers for Disease Control and Prevention data: http://www.cdc.gov/datastatistics/
Keep the following in mind as you select your topic and dataset:
* You need to have enough data to make meaningful inferences. There is no magic number of individuals required for all projects. But aim for at least 200 individuals and make sure there are at least 20 individuals in each category of each of your categorical variables (if you have any).
* You need to measure a quantitative outcome, with at least two other variables included in the dataset (ideally at least one of which is quantitative). This will allow you to use multiple linear regression for your primary analyses. While categorical outcome data are interesting and important, we will only discuss methods for categorical response variables briefly in this course.
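
A quick way to check a candidate data set against these guidelines is sketched below. The function `check_project_data()` is a hypothetical helper written for this example (it is not part of any package), and the `toy` data frame is made-up data used only to show the output.

```r
# Hypothetical helper: summarizes the sample-size guidelines above for one data frame.
check_project_data <- function(data, response, min_n = 200, min_per_level = 20) {
  # Treat factor and character columns as categorical variables.
  is_cat <- vapply(data, function(col) is.factor(col) || is.character(col), logical(1))
  # Smallest group size within each categorical variable.
  smallest <- vapply(data[is_cat], function(col) as.integer(min(table(col))), integer(1))
  list(
    response_is_quantitative = is.numeric(data[[response]]),
    enough_rows = nrow(data) >= min_n,
    small_categories = as.character(names(smallest)[smallest < min_per_level])
  )
}

# Made-up example: 240 rows, one categorical variable with a rare level ("other").
set.seed(1)
toy <- data.frame(
  price = rnorm(240, mean = 300, sd = 50),
  sqft = rnorm(240, mean = 1500, sd = 300),
  type = rep(c("house", "condo", "other"), times = c(150, 80, 10))
)
check_project_data(toy, response = "price")
```

Here `small_categories` lists the categorical variables that have at least one level with fewer than 20 individuals; in the toy data that is `type`, because only 10 rows are labeled `"other"`.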
### Guidelines for the final report
Overall, the project report should be written in clear, concise prose. No R code should be shown (I will show you how to hide R code that is in an R markdown document so that it runs but is not displayed in the knitted document).
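
One common way to hide code, sketched here as an assumption rather than the required approach, is to set the knitr chunk option `echo = FALSE` globally in a setup chunk near the top of the R markdown file. The `message` and `warning` settings shown are optional extras.

```r
# Place inside the first (setup) code chunk of the R markdown document.
# echo = FALSE still runs the code in every chunk but leaves the code itself
# out of the knitted output; message and warning suppress printed messages and warnings.
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
```

The same option can also be set for a single chunk by adding `echo = FALSE` to that chunk's header.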
We will use a structure that is similar to a standard scientific report, though your write up will likely be somewhat shorter than a typical journal article. Please follow the structure below:
1. Title
2. Summary: an introduction to the problem you are addressing, brief description of the methods you consider, and summary of the results. Aim for 1 paragraph.
3. Data: a brief summary of key features of the dataset. You should define each variable that will be used (to the level that it is possible to do this, given the information provided by your data source). Also include a few plots showing a few key insights about the data set. Note that there will probably not be enough space to present every plot you make during the course of conducting your analysis; you will have to select a small number of the most informative plots to include. These plots should be briefly discussed in the text. At least a few sentences of context and description of the dataset should be included (how were the data collected? What was measured?), and the number of observations in the data set should be stated. Aim for about 1-2 pages. There should be enough detail that the scope of conclusions from your analysis can be assessed.
4. Methods: a description of the statistical model used in your analysis. Describe any transformations or other special things you had to do. Aim for a page or less.
5. Results: a presentation of your results. This should include a paragraph or two stating the results of the analysis with minimal interpretation. Aim for less than a page.
6. Discussion: summarize your work, its limitations, and possible future steps/improvements. Address the answers to the problem you outlined in your summary and the scope of your conclusions. This can be a page or two.
7. References: cite all sources in a standard format.
Items 1 through 6 above will probably require between 5 and 10 pages, including figures and tables. Please do not go over 10 pages. If your report looks like it will be fewer than 5 pages, please run it by me to make sure you're discussing everything in enough detail. You should not change the font size or margins from the defaults for R markdown documents.
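
As a rough illustration of the numbers the Data, Methods, and Results sections draw on, the sketch below fits a model with a transformed response and pulls out a coefficient table with confidence intervals. The built-in `mtcars` data, the log transformation, and the chosen variables are all assumptions for the example, not a recommendation for your project.

```r
# Stand-in data and model; the log transformation of the response is only an example.
fit_log <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Number of observations used in the fit (to state in the Data section).
n_obs <- nobs(fit_log)

# Coefficient estimates, standard errors, t statistics, and p-values.
coef_table <- summary(fit_log)$coefficients

# 95% confidence intervals for the coefficients.
ci <- confint(fit_log, level = 0.95)

# Combine estimates, standard errors, and interval endpoints, rounded for presentation.
results <- round(cbind(coef_table[, c("Estimate", "Std. Error")], ci), 4)
results
```

A table like `results` could then be formatted for the report with `knitr::kable()`.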
### Group Dynamic Report
Ideally, all group members would be equally involved in, able to contribute to, and committed to the project. In reality, it doesn't always work that way. I'd like to reward people fairly for their efforts in this group endeavor, because it's inevitable that there will be variation in how high a priority people put on this class and how much effort they put into this project.
To this end I will ask each of you (individually) to describe how well (or how poorly!) your project group worked together and shared the load. Also give some specific comments describing each member's overall effort. Were there certain group members who really put out exceptional effort and deserve special recognition? Conversely, were there group members who really weren't carrying their own weight? And then, at the end of your assessment, estimate the percentage of the total amount of work/effort done by each member. (Be sure your percentages sum to 100\%!)
For example, suppose you have 3 group members: X, Y and Z. In the (unlikely) event that each member contributed equally, you could assign:
* 33.3\% for member X,
* 33.3\% for member Y, and
* 33.3\% for member Z
Or in case person Z did twice as much work as each other member, you could assign:
* 25\% for member X,
* 25\% for member Y, and
* 50\% for member Z
Or if member Y didn't really do much at all, you could assign:
* 45\% for member X,
* 10\% for member Y, and
* 45\% for member Z
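
If it helps, the arithmetic behind these examples can be sketched in R: assign each member a relative effort weight and rescale so the percentages sum to 100. The weights below reproduce the second example (Z did twice as much as X and Y); the variable names are just for illustration.

```r
# Relative effort: Z did twice as much work as X and as Y.
effort <- c(X = 1, Y = 1, Z = 2)

# Rescale the weights so they sum to 100 percent.
percent <- 100 * effort / sum(effort)
percent

# Confirm that the percentages sum to 100.
stopifnot(isTRUE(all.equal(sum(percent), 100)))
```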
I'll find a fair way to synthesize the (possibly conflicting) assessments within each group. And eventually I'll find a way to fairly incorporate this assessment of effort and cooperation in each individual's overall grade. Don't pressure one another to give everyone glowing reports unless it's warranted, and don't feel pressured to share your reports with one another. Just be fair to yourselves and to one another. Let me know if you have any questions or if you run into any problems.
**Because I will be accounting for relative effort of the group members, it is critical that you communicate with each other about expectations and give each other a chance to contribute!** If you are highly motivated, resist the urge to do the whole project yourself; ask your group members to contribute in specific ways. On the other hand, if you are busy, don't just step back and let others do the work -- be in touch with your group members about specific ways you would like to contribute and your time line. Communicate with each other early and often.
### Grading and Assessment Criteria
The project grade makes up 10% of the final grade for the class. Here are some things I'll be considering:
* Technical Mastery: Do you demonstrate that you understand the methods you are using? Does the submitted R code work correctly? Can I knit the submitted R markdown files to generate the submitted pdf files?
* Writing: How effectively does the written report communicate the goals, procedures, and results of the study? Are the claims adequately discussed and supported? How well is the report structured and organized (this should not be a problem if you follow the structure I laid out!)? Are all of the figures and tables numbered and appropriately referenced? Does the writing style enhance what the group is trying to communicate? How well is the report edited?
* Design: Are the variables chosen appropriately and defined clearly, and is it clear how they were measured/observed? Is there sufficient data to make meaningful conclusions?
* Statistical Analysis: Are the chosen analyses appropriate for the variables/relationships under investigation, and are the assumptions underlying these analyses met? Do the analyses involve fitting and interpreting a multiple regression model? Are the analyses carried out correctly? Was the appropriateness of the model assessed using diagnostic plots? Is there an effective mix of graphical, numerical, and inferential analyses?
* Conclusions: Are the stated conclusions supported and justified by the analysis? Can the effects of lurking variables be controlled for (if not, is that discussed as a limitation of the analysis)? Is the scope of conclusions properly addressed?