- June 22: Discuss project ideas with instructional team
    - Past student projects
    - Public data sources
    - Data science competitions: Kaggle, DrivenData, CrowdANALYTIX, TunedIT, InnoCentive
- June 24: Project question and dataset
    - Project question by Jason Knobloch
    - Project question by Jennifer Lambert
    - Project question by Alex Lee
- July 13: First project presentation
    - First presentation by Chandler McCann
    - First presentation by Nathan Danielsen
- July 27: Draft paper
- August 3: Peer review
- August 10/12: Final project presentation and paper
    - Final presentation by Austin Brown
    - Final paper by Kerry Jones
The final project should represent significant original work applying data science techniques to an interesting problem. Final projects are completed individually, but you should talk frequently with your instructor and classmates about yours.
Address a data-related problem in your professional field or a field you're passionate about. If you have a strong interest in the subject matter, you'll create a better project and it will be a lot more fun for you!
Here's a collection of past projects from GA Data Science students that may help to stimulate your thinking. You're welcome to use public data or private data, though with private data, you'll have to be careful about what you release. Competing in a Kaggle competition (including past competitions) is also a project option, in which case the data will be provided for you.
By June 22, you should talk with a member of the instructional team about your project idea(s). We can help you choose between different ideas, advise you on the appropriate scope for your project, and check that your project question can reasonably be answered using the data science tools and techniques taught in the course. (There is nothing you have to turn in for this milestone.)
By June 24, create a GitHub repository for your project. It should include a short write-up that answers these questions:
- What is the question you hope to answer?
- What data are you planning to use to answer that question?
- What do you know about the data so far?
- Why did you choose this topic?
On July 13, you'll give a short presentation to the class about the work you have done so far and your plans for the rest of the project. Your presentation should use slides (or a similar format). Your slides, code, data, and visualizations should be included in your GitHub repository. Here are some questions that you should address in your presentation:
- What data have you gathered, and how did you gather it?
- Which areas of the data have you cleaned, and which areas still need cleaning?
- What steps have you taken to explore the data?
- What insights have you gained from your exploration?
- Will you be able to answer your question with this data, or do you need to gather more data (or adjust your question)?
- How might you use modeling to answer your question?
- Please submit a link to your repository (with slides) no later than 6pm on Monday. I'll be copying your slides to my computer before class begins. Please don't Slack your materials to me unless you are having problems with GitHub.
- Everyone will be presenting from my computer, so your slides should be in a format that can be easily read on any computer (PDF, PowerPoint, Google Slides, IPython Notebook).
- You will have exactly 6 minutes to present, followed by 1 minute of questions.
- Tell your story in an engaging fashion.
- Make sure your project question is crystal clear to every person in the room in the first minute.
- It is critical that you practice delivering your presentation and time yourself.
- If you find that your presentation is longer than 6 minutes, the solution is not to speak more quickly. Instead, focus your presentation around the most interesting aspects of your project.
If it's not practical to include your entire dataset in your GitHub repository, you should link to your data source and provide a sample of the data. (GitHub blocks files larger than 100 MB and recommends keeping repositories under 1 GB.) If your data is private, you can either include an "anonymized" version of your data or create a private GitHub repository.
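One way to produce the data sample described above is sketched here, assuming your data is a single CSV file with a header row. The function names, the default sample size of 1000 rows, and the file paths in the usage note are all hypothetical, and the size check uses the 100 MB per-file figure mentioned above.

```python
import csv
import os
import random

# Per-file limit cited above (assumed here to mean 100 * 1024 * 1024 bytes).
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024


def fits_on_github(path, limit=GITHUB_FILE_LIMIT_BYTES):
    """Return True if the file at `path` is smaller than the per-file limit."""
    return os.path.getsize(path) < limit


def write_sample(src_path, dst_path, n_rows=1000, seed=0):
    """Copy the header plus a random sample of up to `n_rows` rows.

    Returns the number of data rows written.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        # Reservoir sampling: one pass over the file, memory bounded by n_rows.
        for i, row in enumerate(reader):
            if i < n_rows:
                sample.append(row)
            else:
                j = rng.randint(0, i)  # inclusive on both ends
                if j < n_rows:
                    sample[j] = row
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(sample)
    return len(sample)
```

For example, `if not fits_on_github("data/full.csv"): write_sample("data/full.csv", "data/sample.csv")` would commit only the sample, with a link to the full source in your write-up. Note that the sampled rows are not written in their original order.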
A draft of your project paper is due on July 27, along with the data, well-commented code, and visualizations. It should be written with a technical audience in mind. Your paper should include the following components:
- Problem statement and hypothesis
- Description of your dataset and how it was obtained
- Description of any pre-processing steps you took
- What you learned from exploring the data, including visualizations
- How you chose which features to use in your analysis
- Details of your modeling process, including how you selected your models and validated them
- Your challenges and successes
- Possible extensions or business applications of your project
- Conclusions and key learnings
Your peers and instructional team will be providing feedback. However, the paper should stand on its own and should not depend on the reader remembering your first presentation. The easier your paper is to follow, the more useful feedback you will receive! Also, if your reviewers can actually run your code on the provided data, they will be able to give you better feedback on your code.
You will provide project feedback to two of your peers, according to the peer review guidelines.
Your project repository on GitHub should contain the following:
- Project paper: any format (PDF, Markdown, etc.)
- Presentation slides: any format except for Keynote (PDF, PowerPoint, Google Slides, IPython Notebook, etc.)
- Code: commented Python scripts, and any other code you used in the project
- Visualizations: integrated into your paper and/or slides
- Data: data files in "raw" or "processed" format
- Data dictionary (aka "code book"): description of each variable, including units
- Please submit a link to your repository (with slides) by 6pm on the day you are presenting.
- Regardless of which day you are presenting, your repository should also contain the other required project components by 6pm on the last day of class.
- You will have exactly 12 minutes to present, followed by 2 minutes of questions. Practice your presentation and time yourself!
- Your presentation should start with a recap of the key information from the previous presentation (including your project question), but you should spend the majority of your presentation discussing what has happened since then.
- If your presentation is too long, focus it around the most interesting aspects of your project, rather than trying to include every last detail.
- Tell your story in an engaging fashion.
- You are welcome to invite your friends and family members to attend.
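
For the data dictionary listed among the required repository contents, a starting table can be generated from the data itself and then completed by hand. The sketch below assumes a CSV file small enough to read into memory. The function names and the `TODO` placeholders are hypothetical, and the type guess is a rough heuristic, so units and descriptions still have to be written by you.

```python
import csv


def infer_type(values):
    """Guess 'int', 'float', or 'str' from a list of non-empty strings."""
    if not values:
        return "unknown"
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "str"


def data_dictionary_skeleton(csv_path):
    """Return the lines of a Markdown table with one row per CSV column."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or []
        rows = list(reader)
    lines = [
        "| Variable | Type | Missing | Example | Units | Description |",
        "| --- | --- | --- | --- | --- | --- |",
    ]
    for name in fieldnames:
        # Treat empty strings and short rows (None) as missing values.
        values = [r[name] for r in rows if r[name] not in (None, "")]
        missing = len(rows) - len(values)
        example = values[0] if values else ""
        lines.append(
            f"| {name} | {infer_type(values)} | {missing} | {example} | TODO | TODO |"
        )
    return lines
```

Calling `print("\n".join(data_dictionary_skeleton("data/sample.csv")))` (the path is a placeholder) prints a table you can paste into your write-up and fill in.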