- query_git.py
- parallelize_query.py
- download.py
- parallelize_download.py
- extract_data.py
- parallelize_extract.py
- analysis_notebooks
- data
    - json
    - notebooks
    - repos
- logs
- csv
Querying GitHub requires at least one personal access token. Tokens can be generated in GitHub's developer settings; for more information, see the [GitHub API documentation](https://developer.github.com/v3/auth/).

Access tokens should be saved as environment variables prefixed with `GITHUB_TOKEN`. For instance: `export GITHUB_TOKEN3="..."`.
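As a point of reference, tokens saved this way can be picked up from the environment as in the sketch below; the helper is illustrative and not code from this repository.

```python
import os

# Collect every environment variable whose name starts with GITHUB_TOKEN
# (GITHUB_TOKEN1, GITHUB_TOKEN2, ...). Illustrative helper, not repo code.
def collect_github_tokens():
    return [value for name, value in sorted(os.environ.items())
            if name.startswith("GITHUB_TOKEN") and value]

if __name__ == "__main__":
    tokens = collect_github_tokens()
    print(f"found {len(tokens)} GitHub token(s)")
```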
```
# in parallel across multiple github tokens
# will run with nohup, progress saved in query_{#}.log
python3 parallelize_query.py min max [--update]

# -- or --

# all at once on one github token
python3 query_git.py min max [--update]
```
- `min` and `max` are limits for the file sizes (in bytes) that will be queried. Files on GitHub range from 0 to 100,000,000 bytes (100 MB).
- Adding the `--update` flag looks for new or updated notebooks in the given size range. Without this flag, the program will not search size ranges that have already been searched.
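For illustration, a single size-range query against the GitHub code search API might look like the sketch below. The query string, including the use of "nbformat" as the required search term, is an assumption and not necessarily what query_git.py sends.

```python
import os
import requests

# Illustrative sketch of one size-range query against the GitHub code search
# API. The query string (searching for "nbformat" in .ipynb files of a given
# size) is an assumption, not necessarily what query_git.py does.
def search_notebooks(min_size, max_size, token):
    response = requests.get(
        "https://api.github.com/search/code",
        params={
            "q": f"nbformat extension:ipynb size:{min_size}..{max_size}",
            "per_page": 100,
        },
        headers={"Authorization": f"token {token}"},
    )
    response.raise_for_status()
    return response.json()  # dict with "total_count" and "items"

if __name__ == "__main__":
    token = os.environ["GITHUB_TOKEN1"]  # any of the GITHUB_TOKEN* variables
    results = search_notebooks(0, 1000, token)
    print(results["total_count"], "matching files between 0 and 1000 bytes")
```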
```
# if query was run in parallel, use process.py to combine
python3 process.py [--update]
```
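A combining step of this kind typically concatenates the per-worker output files; the sketch below uses hypothetical file names (query_results_*.json under data/json), since the actual names written by the query scripts are not specified here.

```python
import glob
import json

# Merge per-worker query results into a single file. The input and output
# file names are hypothetical; the real ones produced by the query step may differ.
combined = []
for path in sorted(glob.glob("data/json/query_results_*.json")):
    with open(path) as f:
        combined.extend(json.load(f))

with open("data/json/query_results_combined.json", "w") as f:
    json.dump(combined, f)

print(f"combined {len(combined)} records")
```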
```
# in parallel across multiple github tokens
# will run with nohup, progress saved in download_{#}.log
python3 parallelize_download.py [--local]

# -- or --

# all at once on one github token
python3 download.py
```
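The per-notebook download amounts to fetching a raw file URL and writing it into data/notebooks under the naming scheme used below; in this sketch, the raw_url argument and the token header are assumptions for illustration.

```python
import requests

# Fetch one notebook's raw content and save it as data/notebooks/nb_{index}.ipynb.
# raw_url and the use of a token header are assumptions for illustration.
def download_notebook(raw_url, index, token):
    response = requests.get(
        raw_url,
        headers={"Authorization": f"token {token}"},
        timeout=30,
    )
    response.raise_for_status()
    path = f"data/notebooks/nb_{index}.ipynb"
    with open(path, "wb") as f:
        f.write(response.content)
    return path
```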
Data downloaded from the archive at https://library.ucsd.edu/dc/object/bb2733859v can be converted with `convert.py`:

```
python3 convert.py
```

- Assumes the structure described above, with `convert.py` in the root directory, the three downloaded CSVs in `csv`, and all downloaded notebooks (`nb_0.ipynb`, `nb_1.ipynb`, etc.) in `data/notebooks`.
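A quick way to confirm that layout before running convert.py is sketched below; only the counts and locations stated above are checked, and no particular CSV file names are assumed.

```python
import glob

# Check the layout convert.py expects: three CSVs in csv/ and the downloaded
# notebooks (nb_0.ipynb, nb_1.ipynb, ...) in data/notebooks/.
csv_files = glob.glob("csv/*.csv")
notebooks = glob.glob("data/notebooks/nb_*.ipynb")

assert len(csv_files) == 3, f"expected 3 CSVs in csv/, found {len(csv_files)}"
assert notebooks, "no notebooks found in data/notebooks/"
print(f"{len(csv_files)} CSVs and {len(notebooks)} notebooks in place")
```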
```
# in parallel across 10 workers (does not rely on GitHub tokens)
# will run with nohup, progress saved in extract_{#}.log
python3 parallelize_extract.py

# -- or --

# all at once
python3 extract_data.py
```
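The parallel variant spreads the notebooks over 10 local worker processes rather than API tokens; the pattern looks roughly like the sketch below, where the per-notebook extract function is a stand-in for whatever extract_data.py actually computes.

```python
import glob
import json
from multiprocessing import Pool

def extract(path):
    """Stand-in extraction step: count the code cells in one notebook."""
    try:
        with open(path) as f:
            nb = json.load(f)
    except (OSError, json.JSONDecodeError):
        return path, None
    cells = nb.get("cells", [])
    return path, sum(1 for c in cells if c.get("cell_type") == "code")

if __name__ == "__main__":
    paths = sorted(glob.glob("data/notebooks/nb_*.ipynb"))
    with Pool(processes=10) as pool:  # 10 workers, mirroring parallelize_extract.py
        results = pool.map(extract, paths)
    print(f"processed {len(results)} notebooks")
```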