A project for scraping student/advisor relationships from the Math Genealogy website.
- Install a recent version of Python. These instructions were verified with Python 3.7.7; other versions may behave differently. You can check the currently installed version with
python --version
- Create a virtual environment in the root directory of the project:
python -m venv venv
- Activate the virtual environment with
source venv/bin/activate
- Upgrade pip:
pip install --upgrade pip
- Install the project as an editable package by running the following from the project root:
pip install -e .
- Install the additional development dependencies:
pip install -r requirements.development.txt
- If you do not already have Docker installed, follow the official Docker installation instructions.
- In the project root directory, create a file called .env with the following contents:
export ENVIRONMENT="dev"
export POSTGRES_CONNECTION_DEV="postgresql://postgres:postgres@localhost:5432/postgres"
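To make the meaning of the two variables concrete, here is a minimal Python sketch that picks the connection URL matching ENVIRONMENT and splits it into its parts. The helper name `connection_settings` is hypothetical, and the naming rule `POSTGRES_CONNECTION_<ENVIRONMENT in upper case>` is an assumption inferred from the two variables in this file; the project's own code may read them differently.

```python
import os
from urllib.parse import urlsplit


def connection_settings(environ=os.environ):
    """Return the parts of the connection URL selected by ENVIRONMENT.

    Hypothetical helper for illustration; not part of the project.
    """
    env = environ["ENVIRONMENT"]  # "dev" in the .env file
    # Assumption: the variable is named POSTGRES_CONNECTION_<ENVIRONMENT upper-cased>.
    url = environ[f"POSTGRES_CONNECTION_{env.upper()}"]
    parts = urlsplit(url)
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


# Same values as the .env file, passed as a plain dict so the sketch
# runs even when the variables have not been exported yet.
example = {
    "ENVIRONMENT": "dev",
    "POSTGRES_CONNECTION_DEV": "postgresql://postgres:postgres@localhost:5432/postgres",
}
print(connection_settings(example))
```

Calling `connection_settings()` with no argument reads the real environment, so it only works in a shell where the .env file has been sourced.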
- Run a PostgreSQL database server in a Docker container by running the following command in the project root directory:
docker compose up --build -d
- Check that docker compose ran correctly with
docker ps
You should see two containers running: math-genealogy-scraper-pgadmin-1 and math-genealogy-scraper-postgres-1.
- In a new terminal, check that you can connect to the database by running the following command in the project root directory:
docker compose exec postgres psql -U postgres
- Check that the database is in a clean state, with no extra tables, by running the following at the psql prompt:
\l
\c postgres
\dt
You should not see any tables with "student" or "advisor" in the name.
- Keep the psql terminal running; you will need it in a minute. When you are done, you can exit the psql prompt with
\q
- cd into the math_genealogy/backend directory and run the following command:
alembic upgrade head
- In your psql CLI, check that several new tables were created with
\dt
- cd into math_genealogy/scrapers and run the following command:
scrapy crawl math_genealogy
- Let the scraper run for a little bit.
- You can check on the progress by querying the number of mathematicians and student-advisor relationships in the database:
SELECT COUNT(*) FROM mathematicians;
SELECT COUNT(*) FROM student_advisor;
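If you would rather check progress from a script than from psql, the same two counts can be fetched in Python. The sketch below is an illustration, not part of the project: `count_rows` is a hypothetical helper that works with any DB-API 2.0 connection, and the demo uses an in-memory SQLite database with made-up columns as a stand-in so that it runs without the Docker container.

```python
import sqlite3

# Table names taken from the two queries above.
TABLES = ("mathematicians", "student_advisor")


def count_rows(conn, tables=TABLES):
    """Return {table: row count} for any DB-API 2.0 connection."""
    cur = conn.cursor()
    counts = {}
    for table in tables:
        # Table names come from the fixed tuple above, never from user input.
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        counts[table] = cur.fetchone()[0]
    cur.close()
    return counts


# Stand-in database: the column layout here is invented for the demo
# and does not reflect the schema created by the Alembic migrations.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE mathematicians (id INTEGER PRIMARY KEY)")
demo.execute("CREATE TABLE student_advisor (student_id INTEGER, advisor_id INTEGER)")
demo.executemany("INSERT INTO mathematicians (id) VALUES (?)", [(1,), (2,), (3,)])
demo.executemany("INSERT INTO student_advisor VALUES (?, ?)", [(2, 1), (3, 1)])
print(count_rows(demo))
```

Against the real database you would pass a PostgreSQL connection instead, for example one opened with `psycopg2.connect(...)` using the URL from the .env file. That assumes psycopg2 is installed in your virtual environment, which these instructions do not state.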