This article collects public knowledge on how to start learning the field commonly referred to as Data Science. We address what is called DS, which spheres and disciplines it covers, and suggest recipes for taking the first steps toward getting familiar with the field.
Data Science is a term describing everything related to processing, storing and mining information. It spans many disciplines, courses and spheres of our life. A general understanding (a helicopter overview) can be gained by reading Google's brilliant web comic (~5 min) and Vas3k's great post Machine Learning for Everyone (~15 min).
There is no way to be taught to be a data scientist, but you can learn how to become one yourself. There is no single right way, but there is a way that a number of data scientists have adopted, and it relies on online courses (MOOCs). With this article we aim to answer these common questions:
- Where to start with data science?
- How to become a data scientist?
We look up to the awesome resource Teach Yourself CS and aim to provide useful and actionable insights on how to gain the skills and knowledge needed to get onto the data-driven path. However, if you don't like our guide, some alternatives are included in the appendix.
We want to provide the most inclusive and open information, so we do not explicitly distinguish skills between what one would call different job specialisations: ML Engineers, Data Engineers, Deep Learning specialists, Data Analysts, etc.
We assume that you have at least some programming background. If not, you can turn to the aforementioned Teach Yourself CS to learn the basics of programming and algorithms.
Then you can get at least an overview of these topics, ideally by studying the suggested courses and/or watching the videos. If you need a learning path, you can turn to the one suggested by our fellow community member or to any of the alternative articles listed at the end of the article.
Why it matters: you need to know the fundamental math to understand what's happening at the low level.
Book: Introduction to Applied Linear Algebra – Vectors, Matrices, and Least Squares
Web page: The Matrix Calculus You Need For Deep Learning
Courses:
- Single variable calculus
- Differentiation calculus
- Integration calculus
- Calculus applications
- Linear algebra course
- Probability theory
Playlists:
- Mathematics for Machine Learning: Linear Algebra by Imperial College London
- Discrete Math (Full Course: Sets, Logic, Proofs, Probability, Graph Theory, etc) by Dr. Trefor Bazett
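As a small illustration of how the math above shows up in practice, here is a minimal sketch of least-squares line fitting, one of the topics named in the title of the linear algebra book listed above. The function name `fit_line` and the sample data are made up for this example, and the code uses only the Python standard library.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimising the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution of the normal equations for a single feature.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
```

With more than one feature the same idea is usually expressed with matrices, which is where the linear algebra courses come in.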
Why it matters: you need to know how to code.
Courses list: Open Source Society University
Why it matters: research can go wrong if you don't check for fundamental flaws.
Courses:
Why it matters: these are the general concepts of how computers can generalize.
Courses:
- ODS Machine Learning Course course page, Kaggle Intro post
- Andrew Ng’s Machine Learning
- CS229 @ Stanford
- COMS W4995 Applied Machine Learning
- Google's crash-course on ML
- (Recommended for coders with at least 1 year of experience) Introduction to Machine Learning for Coders with an intro post
Why it matters: neural networks are sometimes unreasonably effective.
Courses:
- Neural Networks for Machine Learning
- Practical Deep Learning for Coders, v3 with intro blog post
- Deep Learning with Catalyst
Why it matters: NLP lets you detect sentiment, extract knowledge, and perform search and machine translation.
Courses:
- CS224d: Deep Learning for Natural Language Processing from Stanford
- A Code-First Introduction to Natural Language Processing by fast.ai
- Coursera NLP specialization
Why it matters: CV lets you classify images, segment them, identify objects and process visual information.
Course: CS231n: Convolutional Neural Networks for Visual Recognition from Stanford
Why it matters: Reinforcement Learning (RL) covers self-driving / autonomous vehicles as well as any other agents acting in an environment.
Courses:
Why it matters: graphs are one of the best ways to model relationships in your data (friendships, particle interactions, object positions, etc.).
Courses:
Surveys:
- Graph Neural Networks: A Review of Methods and Applications
- A Comprehensive Survey on Graph Neural Networks
Practice:
Data Engineering is about turning Data Science work, from thoughts and insights to research, into a production project. It means using computers efficiently and building reliable distributed architectures to perform data conversion, ETL, and batch and stream processing.
Books:
- Designing Data-Intensive Applications
- DevOps Handbook
- Release It!
- Microservices Patterns
- Streaming Data
- Clean Architecture
- Data Science at the Command Line
- Machine Learning Bookcamp
Blogs:
- Confluent
- Distributed systems for fun and profit
- Netflix TechBlog
- AWS Big Data Blog
- Top 50 Statistics Blogs of 2019
Courses:
- Big Data Analysis with Scala and Spark
- MIT 6.824: Distributed Systems
- Big Data Specialization
- Importing Data in Python (Part 1, Part 2)
Certifications:
There are some common pitfalls and 'hacks' that any data scientist will encounter. Below is a cherry-picked collection of great articles on these matters:
Online book on how to work on feature engineering.
Link: https://bookdown.org/max/FES/
Now some #entrylevel material, which still might be useful to review, because repetitio est mater studiorum.
Link: https://www.analyticsvidhya.com/blog/2016/06/bayesian-statistics-beginners-simple-english/
The article not only gives a great explanation of what a #pvalue is, but also shows how it works and how it can be used to draw correct conclusions.
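To make the idea of a p-value concrete, here is a minimal sketch of a permutation test. It is not taken from the linked article: the function name, the sample data and the choice of the absolute difference in means as the test statistic are all assumptions made for illustration. The p-value is estimated as the share of random relabelings that produce a difference at least as large as the observed one.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical samples: identical groups versus clearly separated groups.
p_same = permutation_p_value([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
p_apart = permutation_p_value(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [101.0, 102.0, 103.0, 104.0, 105.0, 106.0],
)
```

A small value such as `p_apart` says that a difference this large would be rare if the labels did not matter. It does not say how likely the hypothesis itself is.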
Play with three interactive visualizations and develop your intuition for optimizing model parameters.
Link: https://www.deeplearning.ai/ai-notes/optimization/
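For readers who prefer code to visualizations, the core loop behind optimizing parameters can be sketched as plain gradient descent. This is a toy example under stated assumptions, not the method shown on the linked page: the function being minimized, `(w - 3) ** 2`, and the learning rate are chosen only for illustration.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    w = start
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w


# Minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_min = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
```

Trying a much larger learning rate (for example 1.1) makes the iterates diverge, which is the kind of behaviour the interactive visualizations let you explore.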
Great intro into #statistics basics.
Link: https://freakonometrics.hypotheses.org/57649
Fine-tuning and feature extraction with PyTorch
Link: https://medium.com/analytics-vidhya/transfer-learning-in-pytorch-f7736598b1ed
If you're getting started in Data Science, you need to start with the basic building block of Neural Networks: a Perceptron. To understand what it is, there's this good link to get started with.
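As a rough companion to that reading, a perceptron can be sketched in a few lines of plain Python. The function names, the step activation and the AND-gate training data below are assumptions chosen for illustration, not code from the linked resource.

```python
def predict(weights, bias, inputs):
    """Step activation: output 1 if the weighted sum is positive, else 0."""
    activation = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > 0 else 0


def train_perceptron(samples, learning_rate=1.0, epochs=20):
    """Classic perceptron rule: nudge weights by the prediction error."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# The logical AND function is linearly separable, so the rule converges.
AND_GATE = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_GATE)
predictions = [predict(weights, bias, inputs) for inputs, _ in AND_GATE]
```

A single perceptron cannot learn functions that are not linearly separable, such as XOR, which is one motivation for stacking units into multi-layer networks.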
Time series are data whose points have timestamps. Some might think that #timeseries are mostly used in algorithmic trading, but they are often used in malware detection, network data analysis or any other field dealing with a flow of time-labeled data. These two resources provide a deep yet easy #introduction to #TS analysis.
Github: https://github.com/akshaykapoor347/Time-series-modeling-basics
Data Camp presentation: https://s3.amazonaws.com/assets.datacamp.com/production/course_5702/slides/chapter3.pdf
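One of the simplest operations on such time-ordered data is smoothing with a moving average. The sketch below is only an illustration and is not taken from either resource. The function name and the sample values are hypothetical.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive points."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical readings taken at evenly spaced timestamps.
smoothed = moving_average([1, 2, 3, 4, 5], window=3)
```

The output is shorter than the input by `window - 1` points, because the first full window ends at the third reading.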
It is better to study the Kalman filter in advance, because knowing about it can save lots of time.
Github: Link
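To give a feel for what the filter does, here is a minimal one-dimensional sketch that tracks a roughly constant value from repeated measurements. It is a simplified illustration rather than the code from the linked repository. The state model (the value stays constant) and the noise variances are assumptions picked for the example.

```python
def kalman_1d(measurements, process_var=1e-4, measurement_var=0.25,
              initial_estimate=0.0, initial_var=1.0):
    """Scalar Kalman filter for a value assumed to stay constant."""
    estimate, variance = initial_estimate, initial_var
    estimates = []
    for z in measurements:
        # Predict: the value is assumed unchanged, uncertainty grows a bit.
        variance += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)
        variance *= (1 - gain)
        estimates.append(estimate)
    return estimates


# Hypothetical sensor that keeps reporting 5.0, starting from a guess of 0.0.
estimates = kalman_1d([5.0] * 50)
```

The gain starts high, so early measurements move the estimate a lot, and shrinks as the filter becomes more certain.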
Exploratory Data Analysis is the stage of finding out the distribution of the data, its volume, the number of missing values and all the other characteristics of the available dataset.
Part 1: https://towardsdatascience.com/hitchhikers-guide-to-exploratory-data-analysis-6e8d896d3f7e
Part 2: https://towardsdatascience.com/hitchhikers-guide-to-exploratory-data-analysis-part-2-36ab72201e1d
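The first pass of such an analysis can be sketched with the standard library alone. The column values below are hypothetical and `None` is assumed to mark a missing value. In practice a library such as Pandas is commonly used for this.

```python
from statistics import mean, median


def summarize_column(values):
    """Report volume, missing values and basic distribution statistics."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
    }


# Hypothetical "age" column with two missing entries.
ages = [23, 35, None, 41, 29, None, 35]
summary = summarize_column(ages)
```

A large gap between mean and median, or a high share of missing values, is usually the first hint that a column needs closer inspection.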
Great introduction and tutorial, with code in PyTorch and TensorFlow.
Nice notes on softmax cross entropy loss and how to implement it in numpy.
Link: https://deepnotes.io/softmax-crossentropy
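As a rough sketch of what such an implementation involves, the version below computes softmax and the cross-entropy loss for a single example. Note that it uses only Python's `math` module rather than NumPy, to stay dependency-free, and it is not the code from the linked notes. Subtracting the largest logit before exponentiating is a common trick to avoid overflow.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability that softmax assigns to the target class."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return log_sum_exp - logits[target_index]


probs = softmax([2.0, 1.0, 0.1])
loss_uniform = cross_entropy([0.0, 0.0], target_index=0)
loss_large = cross_entropy([1000.0, 1000.0], target_index=1)
```

Computing the loss directly from the logits, instead of taking the log of the softmax output, keeps the result finite even for very large scores.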
Make sure you save the link (or this message) to show it to people without a strong technical background, for it is one of the best and clearest explanations there is.
Link: https://cloud.google.com/products/ai/ml-comic-1/
This book provides basic knowledge about using NumPy, Pandas, Matplotlib and Scikit-Learn with Jupyter Notebook, for beginners starting from scratch. The link below leads to a repository containing the entire book and training materials.
Link: https://github.com/jakevdp/PythonDataScienceHandbook
A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn and TensorFlow.