This is a web scraper that collects course information from the Syllabus Search Database at Waseda University.
- Python 3, version >= 3.6.2.
- pip3 (package manager for Python3)
- jq
- MongoDB shell version >= 3.6.0.
- Robo 3T (Optional but recommended)
NOTE: Currently, this guide is written for Mac users. The general procedure also applies to Windows users, but some of the commands may differ.
Installing Python 3, the MongoDB shell, and Robo 3T is straightforward, so those steps are not covered here.
After installing Python 3, run the command below in your terminal/command line to check that it is available. Note that the command is python3, not python.
python3 --version
Run the next command to check if pip3 is available.
pip3 --version
If you have further questions, follow the detailed guide here and replace every python and pip command with python3 and pip3.
Install virtualenv for Python 3 by running
pip3 install virtualenv
Create a folder that will be used as a virtual environment for this project.
mkdir wsl-scraper-venv
Initialize and activate the environment.
virtualenv wsl-scraper-venv
source wsl-scraper-venv/bin/activate
Clone this project into the virtual environment folder, and install dev dependencies.
cd wsl-scraper-venv
git clone https://github.com/wasetime/waseda-syllabus-scraper.git
cd waseda-syllabus-scraper
pip3 install -r requirements-dev.txt
Run the following command to check if MongoDB shell is available.
mongo --version
You should see an output like MongoDB shell version v3.6.0.
Remember that you must start the database process (mongod) before scraping so that the scraped data can be exported to MongoDB.
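Before kicking off a scrape, you can quickly check whether anything is listening on MongoDB's default port. This helper is a sketch and not part of the project; the host and port are assumptions (MongoDB's default is localhost:27017).

```python
import socket

def mongod_is_listening(host="localhost", port=27017, timeout=1.0):
    """Return True if something is accepting TCP connections at host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if mongod_is_listening():
    print("mongod is reachable; safe to start scraping")
else:
    print("mongod is not reachable; start the database process first")
```

If the check fails, start mongod and re-run it before scraping.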
All the schools are listed in data/academics.json. You can specify the schools you want to scrape in data/academics_to_scrape.json.
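The schema of these JSON files is not documented here, so the contents below are a hypothetical illustration only; check data/academics.json in the repo for the real format. The sketch just shows the load-and-inspect step:

```python
import json

# Hypothetical contents of data/academics_to_scrape.json; the real schema
# may differ -- consult data/academics.json in the repo for the actual format.
sample = '''
[
  "School of Political Science and Economics",
  "School of Law"
]
'''

schools = json.loads(sample)
for school in schools:
    print("will scrape:", school)
```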
Type the following commands inside your terminal:
cd server
bash scrape.sh
Remember to set the environment variable PYTHONIOENCODING=UTF-8 in PyCharm so that UTF-8 characters are displayed properly instead of as \uXXXX escape sequences.
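To see why the setting matters: when output falls back to ASCII-style escaping, Japanese course titles are printed as escape sequences instead of readable text. A quick illustration:

```python
title = "早稲田大学"  # a UTF-8 string, as scraped course titles would be

# What you want to see in the console:
print(title)

# What you get when characters are escaped instead of rendered:
print(ascii(title))
```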
Depending on the target you selected, the scraping process may take a few minutes.
After the scraping finishes, you can deactivate the virtual environment by typing the following command
deactivate
You can use the mongo shell (pure CLI) or Robo 3T (which provides a great GUI) to verify that the data you are interested in was scraped and stored correctly in MongoDB.
The Waseda syllabus database only provides data related to courses. To obtain classroom and building information, we extract and group them into separate collections. This can be done using MongoDB's Aggregation Framework.
This project contains an aggregate.js
file that automates the entire aggregation process. However, you need to change some variables inside it before starting.
Currently, there is no written guide for this section, but you can follow the comments in aggregate.js
to tweak and customize your own aggregation process.
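Since there is no written guide, here is the core idea of the grouping step sketched in plain Python. The document shape and field names (classroom, building) below are assumptions for illustration, not the project's actual schema; aggregate.js remains the authoritative source.

```python
from collections import defaultdict

# Hypothetical course documents, roughly as the scraper might store them.
courses = [
    {"title": "Algorithms", "classroom": "55N-101", "building": "55N"},
    {"title": "Databases",  "classroom": "55N-102", "building": "55N"},
    {"title": "Calculus",   "classroom": "14-201",  "building": "14"},
]

# Group classrooms by building, mimicking a $group stage on the building field.
buildings = defaultdict(set)
for course in courses:
    buildings[course["building"]].add(course["classroom"])

for name, rooms in sorted(buildings.items()):
    print(name, sorted(rooms))
```

In the real pipeline the same grouping happens inside MongoDB, with the results written out to separate collections.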
Type the following command inside your terminal to load the aggregation script with the mongo shell.
mongo localhost:27017/syllabus /path/to/aggregate.js
It should return true
if the aggregation is successful.
If you have obtained the desired results, congratulations! If you ran into trouble while scraping or aggregating the data, feel free to submit an issue. :)
This project is deployed on a remote server, and uses continuous integration and deployment
with the help of Travis CI. You can learn more about the setup process at
continuous_deployment.md
Unfortunately, there are currently no unit tests to ensure code quality.
This project is deployed on a remote server, which is set up to scrape,
aggregate, and export the data automatically to a remote database.
You can learn more about the setup process at
automated_scraping.md
- Python3 - The language used.
- Scrapy - The scraping framework used.
- MongoDB - The database used for storing results.
Submit an issue or a pull request!
- Oscar Wang - Initial work - OscarWang114
- Taihei Sato - Add a new url - tsato815
This project is licensed under the MIT License - see the LICENSE.md file for details