-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Monkey #31
base: master
Are you sure you want to change the base?
Monkey #31
Conversation
Updating scrapers before introducing keywords funcionality
surveys = self.client.get_survey_lists() | ||
if 'data' not in surveys: | ||
raise ValueError( | ||
f"Surveys not found for search string {search_string}, data returned: {surveys}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd change this to f"Surveys not found from client. data returned: {surveys}"
as this part of the code doesn't do anything with the search_string yet
|
||
def get_survey_questions(self, survey_id: str) -> Dict[str, str]: | ||
"""Get the questions associated with the row number in a survey.""" | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this passing, do you have the questions somewhere?
"""Get id of respondent """ | ||
"""Note: multiple projects per respondent""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you make this into one note?
eg:
"""
Get id of respondent.
Note: multiple projects per respondent
"""
This'll make the Note a part of the docstring and visible for everyone who's mousing over the function
def get_questions_dict(self, survey_details): | ||
questions = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could use a docstring:
"""Get questions and their answers.
Returns a Dict: {question_id : answer}
"""
def get_respondent_data(self, responses, survey_id, respondent_id, club_data_dict, entity) -> str: | ||
"""get data entities per respondent""" | ||
for page in responses[survey_id][respondent_id]['pages']: | ||
if page['id'] == '146876559': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this specific number? could use a documentation at least, or turn it into a variable if it could change later down the line
if len(question["headings"]) > 0: | ||
project_data_ids[question['id']] = question['headings'][0].get('heading', "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it always the first heading of question["headings"]
that contains important data, and never the second?
return club_data | ||
|
||
def get_project_data_ids(self, survey_details: dict): | ||
"""Get ids of the answers to questions on projects.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not too sure what this returns - The dict has the format
{
"question_id" : "information_from_heading"
}
the docstring says Get ids of the answers to questions on projects.
Is the question_id
then also the id of the answer? What's the heading?
for x in unique_values: | ||
ids_per_value = [key for key, value in project_data_ids.items() if value == x] | ||
ids_per_value.sort() | ||
if len(ids_per_value) > 1: | ||
split_project_dict[x] = ids_per_value | ||
return split_project_dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What this does is it flips the keys and values in project_data_ids
and if there are duplicate values, it makes a new dict where the initial value
is the new key, and the new value is a list of the keys that belong to that value.
However, if there's 1 or less values, this just returns an empty dict
project_list.append({ | ||
'id' : 'respondent_id'+ str(i), | ||
'title' : self.get_project_data(responses, survey_id, respondent_id, split_project_data_dict, project_number = i, entity = 'Project Title'), | ||
'description' : self.get_project_data(responses, survey_id, respondent_id, split_project_data_dict, project_number = i, entity = 'Describe your project (at least 300 words)<br><br><em>- Context (What is the dilemma that the project is trying to tackle? Why is it important for this neighbourhood/group of people/the country?</em><br><em>- Activities (What did you do?)</em><br><em>- Results (What did you achieve? What did you create, produce, accomplish? Try to include numbers, if possible).</em><br><em>- Impact (What changed in the community? What did you learn yourself or as a team? Did you meet your own expectations)?</em>'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried that this is a very long string for an entity
except Exception as e: | ||
self.logger.exception('Failed to get projects from current page') | ||
|
||
self.write_to_file(projects, str(search_string + respondent_id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This'll write a separate file per respondent ; so sometimes 1 and sometimes 4 projects. I think we were using a one file per scrape methodology, so I'd say do something like create a projects list at the start of the scrape() method: projects = []
and then in the try: change it to
resp_projects = self.process_response(responses, survey_details, survey_id, respondent_id)
projects.extend(resp_projects)
responses = self.client.get_all_pages_response(survey_id) | ||
for response in responses: | ||
if not response.get("data", []): | ||
raise ValueError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will throw an error for the first item in the list that does not have data
param. Is this the intended behaviour?
If not, you could do something like this:
for response in responses:
data = response.get('data', [])
for response_data in data:
respondent_id = response_data.get("id", "")
responses_dict[survey_id][respondent_id] = response_data
Some issues I didnt know how to fix: