A full pipeline for downloading, cleaning and enriching the history of planetpython.org

itielshwartz/python-station-backend

Python station backend

About

  • The backend behind python-station.

  • A full data pipeline that scrapes http://planetpython.org

  • Output: every GitHub (Python) project featured in the history of planetpython.org.

  • Also includes data enrichment using the GitHub, Reddit and Hacker News APIs.

How does it work?

  1. Download the pages from a planetpython.org clone

  2. Use BeautifulSoup to transform each raw page into posts

  3. Use the GitHub API to get basic project data (and filter out non-Python projects)

  4. Use PRAW (Reddit), the Hacker News API and GitHub Trending to enrich the data

  5. Show the data using GitHub Pages + Vue.js
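
Steps 2 and 3 above boil down to finding GitHub links inside each post and keeping only the Python projects. The following is a minimal sketch of that idea, not the code in this repository: it swaps BeautifulSoup for the standard-library `html.parser` so it runs without dependencies, the names `LinkCollector`, `github_repos` and `keep_python` are hypothetical, and it assumes each metadata dict carries the `language` field that the GitHub REST API returns for a repository.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def github_repos(post_html):
    """Return the sorted, unique 'owner/repo' names linked from a post."""
    collector = LinkCollector()
    collector.feed(post_html)
    repos = set()
    for href in collector.hrefs:
        parsed = urlparse(href)
        if parsed.netloc.lower() not in ("github.com", "www.github.com"):
            continue
        # A repository URL has at least two path segments: /owner/repo
        parts = [p for p in parsed.path.split("/") if p]
        if len(parts) >= 2:
            repos.add(f"{parts[0]}/{parts[1]}")
    return sorted(repos)


def keep_python(repo_metadata):
    """Keep only repositories whose primary language is Python.

    Assumption: each dict has the 'language' key found in GitHub's
    repository JSON.
    """
    return [repo for repo in repo_metadata if repo.get("language") == "Python"]
```

Note that this sketch treats any two-segment github.com path as a repository, so non-repository pages would need extra filtering in practice.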

How to run?

  • Clone the project
  • python3 -m venv ./venv && source venv/bin/activate && pip install -r requirements.txt
  • venv/bin/python pipeline.py --pages-to-download 5
  • To download Reddit data, fill in your Reddit credentials in requests_utils.py
  • If you hit the rate limit on your GitHub requests, fill in your GitHub credentials in requests_utils.py
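
Authenticated GitHub API requests get a much higher rate limit than anonymous ones, which is why credentials help here. As a rough illustration only: the sketch below builds request headers from a personal access token. The function name `github_headers` and the `GITHUB_TOKEN` environment variable are assumptions for this example; the project itself expects credentials in requests_utils.py, whose layout is not shown here.

```python
import os


def github_headers(token=None):
    """Build headers for GitHub REST API requests.

    Without a token the request is anonymous and subject to the lower
    unauthenticated rate limit.
    """
    headers = {"Accept": "application/vnd.github+json"}
    # GITHUB_TOKEN is a hypothetical variable name chosen for this sketch.
    token = token or os.environ.get("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

The resulting dict can be passed as the `headers` argument of an HTTP client call.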

Pipeline Flow chart

+-------------------+
| Download Pages    |
+---------+---------+
          |
+---------v---------+
|Transform to Posts |
+---------+---------+
          |
+---------v---------+
|Extract projects   |
+---------+---------+
          |
+---------v---------+
|Enrich Using APIs  |
+---------+---------+
          |
+---------v----------+
|Deploy Using GitHub |
| Pages              |
+--------------------+
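
The flow chart describes a linear chain in which each stage consumes the previous stage's output. One way to sketch that shape, assuming placeholder stages rather than the real functions in pipeline.py (every stage name and return value below is hypothetical, and the deploy step is left out):

```python
from functools import reduce


def run_pipeline(stages, initial):
    """Feed the output of each stage into the next, in order."""
    return reduce(lambda data, stage: stage(data), stages, initial)


# Placeholder stages: hypothetical stand-ins, not the project's own code.
def download_pages(page_count):
    return [f"<html>page {i}</html>" for i in range(page_count)]


def transform_to_posts(pages):
    return [{"html": page} for page in pages]


def extract_projects(posts):
    return [{"post": post, "repo": None} for post in posts]


def enrich(projects):
    return [{**project, "enriched": True} for project in projects]


STAGES = [download_pages, transform_to_posts, extract_projects, enrich]
```

Calling `run_pipeline(STAGES, 5)` would mirror `--pages-to-download 5`: the integer enters the first stage and a list of enriched project dicts comes out of the last.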

Development

Want to contribute? Great! Feel free to open a PR or an issue :)

License

MIT - Free Software, Hell Yeah!
