A data pipeline to extract Reddit data from r/dataengineering.
Output is a Google Data Studio report, providing insight into the official Data Engineering subreddit.
The project was motivated by an interest in Data Engineering and the kinds of Q&A found on its official subreddit.
It also provided a good opportunity to develop skills and experience with a range of tools. As such, the project is more complex than strictly necessary, utilising dbt, Airflow, Docker and cloud-based storage.
- Extract data using the Reddit API (see the sketch after this list)
- Load into AWS S3
- Copy into AWS Redshift
- Transform using dbt
- Create PowerBI or Google Data Studio Dashboard
- Orchestrate with Airflow in Docker
- Create AWS resources with Terraform
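
To give a feel for the first two steps, here is a minimal sketch assuming `praw` for the Reddit API and `boto3` for S3. The credentials, bucket name, field list and file naming are illustrative placeholders, not the project's actual configuration.

```python
"""Minimal sketch of the extract-and-load steps.
Assumes praw for the Reddit API and boto3 for S3; all names below
(bucket, key prefix, env vars) are illustrative, not the repo's config."""
import csv
import datetime
import os

import boto3   # AWS SDK for Python
import praw    # Reddit API wrapper

# Reddit API credentials, assumed to be supplied via environment variables
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="reddit-pipeline (by u/your_username)",
)

# Pull the day's top posts from r/dataengineering
fields = ["id", "title", "score", "num_comments", "created_utc"]
rows = [
    [getattr(post, field) for field in fields]
    for post in reddit.subreddit("dataengineering").top(time_filter="day", limit=100)
]

# Write the posts to a local CSV named after today's date
filename = f"{datetime.date.today()}.csv"
with open(filename, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(fields)
    writer.writerows(rows)

# Upload the CSV to S3 so Redshift can later COPY it from the bucket
s3 = boto3.client("s3")
s3.upload_file(filename, "my-reddit-bucket", f"raw/{filename}")
```

From there, the Redshift load is typically a `COPY ... FROM 's3://...'` statement issued against the cluster, after which dbt handles the transformations.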
Follow the steps below to set up the pipeline. I've tried to explain each step where I can. Feel free to make improvements/changes.
NOTE: This was developed using an M1 MacBook Pro. If you're on Windows or Linux, you may need to amend certain components if you run into issues.
As AWS offers a free tier, this shouldn't cost you anything unless you amend the pipeline to extract large amounts of data or keep the infrastructure running for more than two months. However, please check the AWS Free Tier limits, as these may change.
First, clone the repository into your home directory:
```bash
git clone https://github.com/AnMol12499/Reddit-Analytics-Integration-Platform.git
```
To begin using the project, follow these steps:
- Overview
- Reddit API Configuration
- AWS Account
- Infrastructure with Terraform
- Configuration Details
- Docker & Airflow
- dbt
- Dashboard
- Final Notes & Termination
- Improvements
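
To give a feel for the orchestration step, the sketch below shows how the extract, S3 upload, and Redshift copy could be wired into a daily Airflow DAG (assuming Airflow 2.x). The task names and placeholder callables are illustrative, not the repo's actual DAG.

```python
"""Illustrative Airflow 2.x DAG wiring the pipeline steps together.
The callables below are placeholders standing in for the repo's
extraction/load scripts, not its actual modules."""
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_to_csv():
    """Placeholder: pull posts from r/dataengineering via the Reddit API and write a CSV."""
    ...


def upload_csv():
    """Placeholder: push the day's CSV to the S3 bucket."""
    ...


def run_copy_statement():
    """Placeholder: issue a Redshift COPY to load the S3 file into the warehouse."""
    ...


with DAG(
    dag_id="reddit_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # run once per day
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_reddit_data", python_callable=extract_to_csv)
    load = PythonOperator(task_id="upload_to_s3", python_callable=upload_csv)
    copy = PythonOperator(task_id="copy_to_redshift", python_callable=run_copy_statement)

    extract >> load >> copy  # linear dependency: extract, then load, then copy
```

Downstream of the copy task, the dbt transformations can be triggered as a further task (for example via a BashOperator running `dbt run`), with the dashboard reading from the transformed Redshift tables.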
Project Structure: The repository includes directories for infrastructure (Terraform), configuration (AWS and Airflow), data extraction (Python scripts), and optional steps such as dbt and BI tool integration (an illustrative layout is sketched below).
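
The directory names below are purely illustrative rather than the repository's actual layout; they simply map the components listed above onto a typical structure:

```
terraform/     # infrastructure: S3 bucket, Redshift cluster, IAM roles
airflow/
  dags/        # pipeline DAG plus the Python extraction and load scripts
config/        # AWS, Reddit API and pipeline configuration
dbt/           # optional dbt models for the transformation layer
```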
Customization: Feel free to customize the project by modifying configurations, adding new data sources, or integrating additional tools as needed.