Chicago: bugfixes #25
Conversation
Looks good to me! What's your current thinking on the omnibus relation?
We need to specify a relationship between bills, so I'm having it look for language that makes it clear we're dealing with an omnibus relationship (the two markers I saw were "sundry" and "miscellaneous"). If it finds one, the omnibus version "replaces" the old version. I'm having it ignore related bills when the relationship isn't obvious, though I don't think I saw any examples of that. I'm far from a municipal government expert, so go ahead and make changes if you think they're needed.
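A rough sketch of that heuristic, for illustration only (function and variable names here are placeholders, not the code in this branch):

OMNIBUS_WORDS = ('sundry', 'miscellaneous')

def omnibus_relation_type(note):
    """Return 'replaces' when a related-bill note uses omnibus language, else None to skip it."""
    if any(word in note.lower() for word in OMNIBUS_WORDS):
        return 'replaces'
    return None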
Oh, also: since it ran for many hours, I think it's looking back over all of history. It might be worth writing something that prevents that, because it's never going to import anything into OCD if it never finishes.
👍 on needing a way to avoid scraping the whole history every time. @paultag and I have talked a little bit about that. Right now, AFAIK, pupa does not know what already exists in the DB, which would seem to be necessary in order to have an update-only mode.
Pupa knows about sessions; is there a reason we're not using them?
We could do sessions. That would help a little bit going forward. However, everything currently in Legistar is part of the same session.
That could definitely be fine. The only reason I did anything about it is that the scraper was failing locally for me: related_bills requires a relation type, and the "pending" type that was in there before is not an accepted value. So you'll just have to be sure you can store the information needed for the post-scrape reconciliation using an appropriate relation type.
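For reference, a minimal sketch of recording the relation once a valid type is known, assuming pupa's Bill.add_related_bill(identifier, legislative_session, relation_type) signature; the identifiers, session, and classification below are placeholders:

from pupa.scrape import Bill

bill = Bill(identifier='O2014-0001', legislative_session='2011',
            title='Sundry omnibus ordinance', classification='ordinance')
bill.add_related_bill(identifier='O2013-9999',
                      legislative_session='2011',
                      relation_type='replaces')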
@fgregg can we bucket by year?
@rshorey yeah, I have a PR for the pending type (pending may not be the best name; the idea is that we can't figure out what the relation type is until after we finish scraping): opencivicdata/python-opencivicdata#23. @paultag we can't really bucket by year. There is no end-of-year expiration. Practically speaking, if nothing has happened to a bill in over a year it's unlikely that anything will, but there are exceptions. Things do get shaken out of committee sometimes.
Two ideas:
@fgregg, sounds like you've given it much more thought than I have, so I'd say you should probably ditch my PR and go with yours. Although if you haven't found it yet, you might want to pull out my fix for the onclick problem in the Legistar scraper - I won't be offended if you just dump that into your branch.
I have no PR. I think this is good to merge.
This all works for me. 👍 @paultag, there are a couple of deeper issues that we've talked about in this PR. They should not block, but they should be addressed eventually. Some of them already have issues filed.
Until we figure out smart updates, which honestly will be a while, I would strongly urge you to schedule this as a weekly scrape, not a nightly scrape.
A little more context: it takes about 48 hours for us to scrape the current site. Actually, it should get a lot better soon, because a new legislative session starts in mid-May. That will give us some time to figure out the smart scraping. Also, @datamade will definitely help make the needed changes to pupa to support omnibus bills and smart updates. Let me know what the best way to coordinate that is.
I'll kick off a run today!
💯
Talking about smart updates here: opencivicdata/pupa#169
Is this still running?
@fgregg It was; it died with a traceback:
04:36:57 INFO pupa: save bill O2014-4150 in 2011 as bill_473248ae-e99c-11e4-896f-0242ac11008e.json
Traceback (most recent call last):
File "/usr/local/bin/pupa", line 9, in
load_entry_point('pupa==0.4.1', 'console_scripts', 'pupa')()
File "/opt/sunlightfoundation.com/pupa/pupa/cli/__main__.py", line 71, in main
subcommands[args.subcommand].handle(args, other)
File "/opt/sunlightfoundation.com/pupa/pupa/cli/commands/update.py", line 224, in handle
report['scrape'] = self.do_scrape(juris, args, scrapers)
File "/opt/sunlightfoundation.com/pupa/pupa/cli/commands/update.py", line 123, in do_scrape
report[scraper_name] = scraper.do_scrape(**scrape_args)
File "/opt/sunlightfoundation.com/pupa/pupa/scrape/base.py", line 102, in do_scrape
for obj in self.scrape(**kwargs) or []:
File "/opt/sunlightfoundation.com/scrapers-us-municipal/chicago/bills.py", line 104, in scrape
bill_session = self.session(legislation_summary['Intro\xa0Date'])
File "/opt/sunlightfoundation.com/scrapers-us-municipal/chicago/bills.py", line 17, in session
tzinfo=pytz.timezone(self.timezone)) :
TypeError: unorderable types: str() < datetime.datetime()
PRs welcome :)
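One possible fix, sketched here with an assumed Legistar date format (the real session() helper would still need to localize the parsed value the same way as its boundary datetimes):

from datetime import datetime

def parse_intro_date(value):
    """Return a datetime whether Legistar handed us a string or an already-parsed datetime."""
    if isinstance(value, str):
        return datetime.strptime(value.strip(), '%m/%d/%Y')  # assumed Legistar date format
    return value

With that, session() could call parse_intro_date(legislation_summary['Intro\xa0Date']) before the comparison, so it compares datetime to datetime rather than str to datetime.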
K, working on this now. Pretty weird.
Would you be open to just scraping the past six months or so until we get the incremental stuff figured out?
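Until incremental updates land, one stopgap sketch (the date format and six-month window are assumptions) is to skip anything whose intro date falls before a cutoff inside the scrape loop:

from datetime import datetime, timedelta

CUTOFF = datetime.now() - timedelta(days=180)  # roughly the last six months

def recent_enough(intro_date_string):
    """True if a bill's intro date falls inside the scrape window."""
    return datetime.strptime(intro_date_string.strip(), '%m/%d/%Y') >= CUTOFF  # assumed format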
Still haven't been able to get back that far, as I'm getting timeout errors. @rshorey @paultag, how do you suggest I handle those?
Traceback (most recent call last):
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 374, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.4/http/client.py", line 1147, in getresponse
response.begin()
File "/usr/lib/python3.4/http/client.py", line 351, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.4/http/client.py", line 313, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/lib/python3.4/socket.py", line 371, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.4/ssl.py", line 746, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.4/ssl.py", line 618, in read
v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/adapters.py", line 370, in send
timeout=timeout
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 597, in urlopen
_stacktrace=sys.exc_info()[2])
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/util/retry.py", line 245, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/packages/six.py", line 310, in reraise
raise value
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 544, in urlopen
body=body, headers=headers)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 376, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 304, in _raise_timeout
raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
requests.packages.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='chicago.legistar.com', port=443): Read timed out. (read timeout=60)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/fgregg/public/municipal-scrapers-us/.env/bin/pupa", line 9, in <module>
load_entry_point('pupa==0.4.1', 'console_scripts', 'pupa')()
File "/home/fgregg/public/municipal-scrapers-us/.env/src/pupa/pupa/cli/__main__.py", line 71, in main
subcommands[args.subcommand].handle(args, other)
File "/home/fgregg/public/municipal-scrapers-us/.env/src/pupa/pupa/cli/commands/update.py", line 224, in handle
report['scrape'] = self.do_scrape(juris, args, scrapers)
File "/home/fgregg/public/municipal-scrapers-us/.env/src/pupa/pupa/cli/commands/update.py", line 123, in do_scrape
report[scraper_name] = scraper.do_scrape(**scrape_args)
File "/home/fgregg/public/municipal-scrapers-us/.env/src/pupa/pupa/scrape/base.py", line 102, in do_scrape
for obj in self.scrape(**kwargs) or []:
File "/home/fgregg/public/municipal-scrapers-us/chicago/bills.py", line 118, in scrape
bill, votes = self.addDetails(bill, legislation_summary['url'])
File "/home/fgregg/public/municipal-scrapers-us/chicago/bills.py", line 227, in addDetails
votes = self.addBillHistory(bill, history_table)
File "/home/fgregg/public/municipal-scrapers-us/chicago/bills.py", line 167, in addBillHistory
result, votes = self.extractVotes(action_detail_url)
File "/home/fgregg/public/municipal-scrapers-us/chicago/bills.py", line 127, in extractVotes
action_detail_page = self.lxmlize(action_detail_url)
File "/home/fgregg/public/municipal-scrapers-us/chicago/legistar.py", line 18, in lxmlize
entry = self.get(url).text
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 477, in get
return self.request('GET', url, **kwargs)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/__init__.py", line 270, in request
**kwargs)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/cache.py", line 66, in request
resp = super(CachingSession, self).request(method, url, **kwargs)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/__init__.py", line 92, in request
return super(ThrottledSession, self).request(method, url, **kwargs)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/__init__.py", line 177, in request
raise exception_raised
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/__init__.py", line 157, in request
resp = super(RetrySession, self).request(method, url, **kwargs)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 465, in request
resp = self.send(prep, **send_kwargs)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 573, in send
r = adapter.send(request, **kwargs)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/adapters.py", line 433, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='chicago.legistar.com', port=443): Read timed out. (read timeout=60)
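The read timeouts above come from scrapelib, which pupa's Scraper builds on, so one mitigation sketch is to raise the timeout and allow a few retries; this assumes scrapelib still exposes the timeout, retry_attempts, and retry_wait_seconds settings:

import scrapelib

s = scrapelib.Scraper(retry_attempts=3, retry_wait_seconds=10)  # retry transient failures
s.timeout = 120  # raise the read timeout from the 60 seconds shown in the log above
# response = s.get('https://chicago.legistar.com/Legislation.aspx')

In a pupa scraper the same attributes could be set on self, since the scraper class inherits from scrapelib.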
@fgregg, I'm working on changing the timeframe for the legislation search (well, I've done it and am testing locally, but it's still quite slow). While trying to run it, I noticed there's a bug in the event scraper's date/time parsing. I'll turn that off in my pull request to prevent it from crashing every time, but can you look into it when you get a chance, please?
I'm trying to track down that issue but can't reproduce. Can you send me the traceback?
@fgregg I'll try to get you that. In the meantime, do you know if there's an easy way to limit the date range? I'm trying to pass a parameter to "ctl00$ContentPlaceHolder1$lstYearsAdvanced", which is the field I found by poking around in their POST parameters, but it doesn't seem to limit the data in my scrape the way it does in the interface. I was thinking you might have another idea, given your familiarity with the site.
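Legistar is an ASP.NET WebForms app, so a filter field usually only takes effect if the POST also echoes back the page's hidden state fields (__VIEWSTATE, __EVENTVALIDATION) and names the control in __EVENTTARGET. A hedged sketch of that pattern (the URL, option value, and control behavior are assumptions, not verified against the Chicago site):

import lxml.html
import requests

SEARCH_URL = 'https://chicago.legistar.com/Legislation.aspx'  # assumed search page

session = requests.Session()
page = lxml.html.fromstring(session.get(SEARCH_URL).text)

payload = {field.name: field.value or ''
           for field in page.xpath('//input[@type="hidden"]')
           if field.name}  # __VIEWSTATE and friends
payload.update({
    '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$lstYearsAdvanced',
    'ctl00$ContentPlaceHolder1$lstYearsAdvanced': 'This Year',   # assumed option label
})
results = session.post(SEARCH_URL, data=payload)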
@fgregg I think the event problem is on this page - maybe that greyed-out one? Again, I'm not super familiar with this code.
Use CRITICAL level for Sentry logging
The Chicago scraper failed locally for me (and has now been running for 12+ hours, so I haven't seen it finish). @fgregg, feel free to ditch this PR if it overlaps with work you've done.