DAYMAN PROJECT
--------------

What is it?
-----------
The Dayman Project scrapes more than 19,000 XML pages from the Chilean House of Representatives' open data portal (opendata.congreso.cl) and analyzes the data, producing a loyalty index and graphics of the analyzed data.

Documentation
-------------
Every Python and R file has explanatory comments with details for each function.

OUTPUT
------
The project outputs several files. The important ones, which are part of the report, are:

- CSV files (8): the loyalty index for every representative, for every period, by coalition and by party.
- Rplots.pdf: the report with the graphs of the analyzed data.

Dependencies
------------
Python (install manually):
- Beautiful Soup 4

R (no need to install these yourself; just follow the usage instructions):
- ggplot2
- reshape
- gridExtra
- rjson

USAGE
-----
Instructors' usage on the VM:

1. Run: python CongressDataAnalysis.py
2. Run: sudo R -f code_installpackages.R
3. Run: R -f code_final2.R

General instructions:

1. Run: python scraper.py (will take several hours)
2. Run: python CongressDataAnalysis.py
3. Run: sudo R -f code_installpackages.R
4. Run: R -f code_final2.R

Files
-----
Python: 2 files.

- scraper.py: scrapes the Chilean Congress Open Data Portal, retrieves the information, and produces JSON files, which we have already run and uploaded for you. If you want to run it yourself, this file must be run first and takes several hours to finish (a rough illustration of this kind of scraping is given under EXAMPLE SKETCHES below).
- CongressDataAnalytics.py: analyzes the scraped data and generates several JSON files that are later used by R to plot graphs. It also generates 8 CSV files with the loyalty index for every representative, for each period, by party and by coalition (see EXAMPLE SKETCHES below for one possible formulation of such an index).

CSV: 8 files. Lists of representatives with their loyalty index, by party and by coalition, for each period.

- 2 files per period: one for coalition, one for party. Columns: Lastnames, Names, Party, Score.

JSON: files created by the Python scripts for the data analysis. No need to do anything with them; they are used by the Python and R files.

- 8 files pushed to Git, created by scraper.py.
- 15 files created by CongressDataAnalytics.py. Created by the instructors.

PDF:

- 1 PDF created by R with the report and graphs analyzing the data.

PNG:

- 12 files generated by R, used for the PDF report. They can be ignored or checked individually.

R: 8 files.

- 2 are used: code_installpackages.R and code_final2.R (see usage for details).
- 6 are not used: they are for testing purposes and for running each data analysis by itself.

R DOCUMENTATION
---------------
Recommended order to run (from the command line, type "R -f <filename>"):

    sudo R -f code_installpackages.R   (only once)
    R -f code_final.R                  (or any other file you want to run)

Files:

- code_installpackages.R: run this file first and only once to install the packages ggplot2, reshape, rjson, and gridExtra.

Each of the following R files produces plots that are saved to a PNG file:

- code_coalitioncluster.R
- code_partycluster.R
- code_unity.R
- code_nthvote.R
- code_participationfrequency.R

- code_final.R calls each of the plotting R files.
- code_final2.R contains the copied-and-pasted code of the plotting R files instead of calling them (this is because it also produces a PDF file that combines all the plots).
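EXAMPLE SKETCHES
----------------
These sketches are not part of the repository; they only illustrate the ideas described above.

The first is a minimal sketch of the kind of scraping scraper.py performs. The endpoint URL and the XML tag names (ballot, deputy, option) are hypothetical placeholders; the real URLs and schema on opendata.congreso.cl may differ.

    # Hypothetical sketch of a roll-call scraper; the URL and tag names are assumptions.
    import json
    import urllib.request
    from bs4 import BeautifulSoup  # Beautiful Soup 4

    VOTE_URL = "http://opendata.congreso.cl/votes/{vote_id}.xml"  # placeholder URL

    def scrape_vote(vote_id):
        """Download one roll-call XML page and return its ballots as a plain dict."""
        with urllib.request.urlopen(VOTE_URL.format(vote_id=vote_id)) as resp:
            soup = BeautifulSoup(resp.read(), "html.parser")  # "xml" also works if lxml is installed
        return {
            "vote_id": vote_id,
            "ballots": [
                {"deputy": b.find("deputy").get_text(strip=True),
                 "option": b.find("option").get_text(strip=True)}
                for b in soup.find_all("ballot")
            ],
        }

    if __name__ == "__main__":
        # The real run covers more than 19,000 pages; this only fetches a handful.
        votes = [scrape_vote(i) for i in range(1, 11)]
        with open("votes_sample.json", "w") as f:
            json.dump(votes, f, ensure_ascii=False, indent=2)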
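The README does not state how the loyalty index is computed. A common definition is the share of roll calls on which a representative votes with the majority of their party (or coalition); the sketch below assumes that definition, and CongressDataAnalytics.py may well use a different formula.

    # Assumed definition: loyalty = fraction of roll calls on which a deputy votes
    # with the majority of their own party. Not necessarily the formula used by
    # CongressDataAnalytics.py.
    import csv
    from collections import Counter, defaultdict

    def loyalty_index(ballots):
        """ballots: iterable of dicts with keys 'deputy', 'party', 'vote_id', 'option'."""
        ballots = list(ballots)

        # Majority option of each party on each roll call.
        by_party_vote = defaultdict(Counter)
        for b in ballots:
            by_party_vote[(b["party"], b["vote_id"])][b["option"]] += 1
        majority = {key: c.most_common(1)[0][0] for key, c in by_party_vote.items()}

        # Per-deputy share of ballots that match the party majority.
        agree, total = Counter(), Counter()
        for b in ballots:
            total[b["deputy"]] += 1
            if b["option"] == majority[(b["party"], b["vote_id"])]:
                agree[b["deputy"]] += 1
        return {deputy: agree[deputy] / total[deputy] for deputy in total}

    def write_scores(scores, path="loyalty_sample.csv"):
        """Write a simplified two-column CSV (the real output has Lastnames, Names, Party, Score)."""
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["Deputy", "Score"])
            for deputy, score in sorted(scores.items()):
                writer.writerow([deputy, round(score, 3)])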
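Finally, the full pipeline under "General instructions" can be driven from a single Python script such as the one below (not part of the repository; it just shells out to the commands listed in the usage section).

    # Convenience wrapper around the commands listed under "General instructions".
    import subprocess

    STEPS = [
        ["python", "scraper.py"],            # several hours; skip if the JSON files are already present
        ["python", "CongressDataAnalysis.py"],
        ["sudo", "R", "-f", "code_installpackages.R"],
        ["R", "-f", "code_final2.R"],
    ]

    for step in STEPS:
        print("Running:", " ".join(step))
        subprocess.run(step, check=True)  # stop the pipeline if any step fails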