Pull requests to update population-city dataset #7
Conversation
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
@judeleonard where are you using it?
@judeleonard you imported BeautifulSoup, but I couldn't find where you are using it.
BeautifulSoup is used here https://github.com/judeleonard/population-city/blob/6f193c14dc59db3af4e6b78971b1202bb1d0fc9b/scripts/scraper.py#L39 as the parser flavour when reading the HTML table with pandas.
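Roughly, the pattern is the following (a minimal sketch, not the exact scraper.py code; the URL is a placeholder, and pandas is assumed to delegate the parsing to BeautifulSoup via flavor="bs4"):

import pandas as pd
from io import StringIO
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get("https://data.un.org/")  # placeholder URL, not necessarily the scraper's actual target

# pandas hands the HTML off to BeautifulSoup when flavor="bs4", so bs4 is
# exercised even though BeautifulSoup is never called by name here.
tables = pd.read_html(StringIO(driver.page_source), flavor="bs4")
first_table = tables[0]  # the first table on the page
driver.quit()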
@judeleonard am I correct that you are parsing the HTML table on the page? Do you think that is the right approach versus downloading the data as CSV?
Based on the initial task to update the data, I thought about fetching the first table where the update happened and then using it to update each of the files in the data directory.
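A rough sketch of that update step (the file path, column names, and example row are placeholders, not the repository's real schema):

import pandas as pd

# first_table stands in for the freshly scraped UN table; the columns and the
# single example row are hypothetical placeholders.
first_table = pd.DataFrame(
    {"Country or Area": ["Exampleland"], "City": ["Example City"], "Year": [2023], "Value": [123456]}
)

existing = pd.read_csv("data/population.csv")  # placeholder path
updated = pd.concat([existing, first_table], ignore_index=True)

# Keep only the newest row for each (country, city, year) combination.
updated = updated.drop_duplicates(subset=["Country or Area", "City", "Year"], keep="last")
updated.to_csv("data/population.csv", index=False)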
Hi @anuveyatsu, what is the update on this task? I also forgot to mention that the API endpoint on the UN website for fetching the data was blocked by an SSL certificate issue, so I settled for Selenium for this task. Could you possibly assign me another task? I would be more than happy to pick that up. Thanks!
@judeleonard I'm not sure about "the API endpoint on the UN website for fetching the data was blocked by an SSL certificate issue", could you elaborate on that? We normally wouldn't fetch HTML pages if we can access the API. My initial comment was also about that: why would one parse HTML when there is a cleaner way to fetch the data? Also, this PR's title is not very descriptive to me.
@anuveyatsu After checking through the website, I initially thought I could easily extract the link behind the download button on the page, but that was not the case. So I went through their API documentation to understand how it could be used with the requests library to retrieve the data. Below is a sample of my code that throws an SSL error each time I request the data.

import requests
from io import StringIO
import pandas as pd

url = "https://data.un.org/ws/rest/data/DF_UNData_UNFCC"
headers = {"Accept": "text/csv"}

# The SSLError is raised here, before any response comes back.
response = requests.get(url, headers=headers)

# The request asks for CSV, so parse the body as CSV rather than JSON.
data = pd.read_csv(StringIO(response.text))
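For completeness, a sketch of how the certificate failure might be diagnosed or worked around, assuming it is a verification error on the client side rather than the endpoint being blocked (the CA-bundle file name is a placeholder):

import certifi
import requests

url = "https://data.un.org/ws/rest/data/DF_UNData_UNFCC"
headers = {"Accept": "text/csv"}

# Point requests at the certifi CA bundle explicitly; requests normally uses it by
# default, but a misconfigured environment (e.g. REQUESTS_CA_BUNDLE) can override it.
response = requests.get(url, headers=headers, verify=certifi.where())

# If the server sends an incomplete certificate chain, a bundle that includes the
# missing intermediate certificate can be passed instead ("un-ca-bundle.pem" is a
# placeholder file name, not a real file in this repository).
# response = requests.get(url, headers=headers, verify="un-ca-bundle.pem")

response.raise_for_status()
print(response.text[:500])  # preview the start of the CSV payload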
@anuveyatsu could you please review my PR for issue #6?