
Chicago: bugfixes #25

Merged (2 commits) Apr 21, 2015

Conversation

@rshorey (Contributor) commented Apr 7, 2015

The Chicago scraper failed locally for me (and it's now been running for 12+ hours, so I haven't seen it finish). @fgregg, feel free to ditch this PR if it overlaps with work you've done.

@fgregg (Contributor) commented Apr 7, 2015

Looks good to me! What's your current thinking on the omnibus relation?

@rshorey (Contributor, PR author) commented Apr 7, 2015

We need to specify a relationship between bills, so I'm just having it look for language that makes it pretty clear we're talking about an omnibus relationship (the two cues I saw were "sundry" and "miscellaneous"). If it finds that, the omnibus version "replaces" the old version. I'm having it ignore related bills when the relationship isn't obvious, but I don't think I saw any examples of that. I'm far from a municipal government expert, though, so go ahead and make changes if you think they're needed.
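
A minimal sketch of the heuristic described above (not the exact code in this PR; the variable names and the commented add_related_bill usage are assumptions based on pupa's Bill model):

OMNIBUS_HINTS = ('sundry', 'miscellaneous')

def omnibus_relation(related_text):
    """Return 'replaces' when the related-bill text clearly signals an omnibus bill."""
    text = related_text.lower()
    if any(hint in text for hint in OMNIBUS_HINTS):
        return 'replaces'
    return None  # relationship isn't obvious, so skip it rather than guess

# Hypothetical usage inside the scrape loop:
# relation = omnibus_relation(related_title)
# if relation:
#     bill.add_related_bill(related_identifier, legislative_session=session,
#                           relation_type=relation)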

@rshorey (Contributor, PR author) commented Apr 7, 2015

Oh, also: since it ran for many hours, I think it's looking back over all of history. It might be worth writing something so it doesn't do that, because it's never going to import anything into OCD if it never finishes.

@fgregg (Contributor) commented Apr 7, 2015

👍 on needing a way of not scraping the whole history every time. @paultag and I have talked a little bit about that. Right now, AFAIK, pupa does not know what already exists in the DB, which would seem to be necessary in order to have an update-only mode.

@paultag (Contributor) commented Apr 7, 2015

Pupa knows about sessions; is there a reason we're not using them?

@fgregg (Contributor) commented Apr 7, 2015

We could do sessions. That would help a little bit going forward. However, everything that is currently in Legistar is part of the same session.

@fgregg (Contributor) commented Apr 7, 2015

@rshorey, @paultag and I have been talking about dealing with omnibus bills as a post-scrape step. What are your thoughts on that?

@rshorey (Contributor, PR author) commented Apr 7, 2015

That could definitely be fine. The only reason I did anything about it is that the scraper was failing locally for me: related_bills requires a relation type, and the "pending" type that was in there before was not an acceptable one. You'll just have to be sure you can store the information needed to do the post-scrape reconciliation with an appropriate relation type.
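
For reference, a hedged sketch of where the relation type comes in with pupa's Bill model; the identifiers, session, and title below are made up, the constructor arguments follow pupa's Bill model as I understand it (not necessarily pupa 0.4.1 exactly), and the allowed relation_type values are, as far as I know, 'companion', 'prior-session', 'replaced-by', and 'replaces', which is why 'pending' was rejected:

from pupa.scrape import Bill

bill = Bill('O2014-0001', legislative_session='2011',
            title='Example omnibus ordinance', classification='ordinance')
# Each related bill must carry an explicit, allowed relation type.
bill.add_related_bill('O2013-9999', legislative_session='2011', relation_type='replaces')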

@paultag (Contributor) commented Apr 7, 2015

@fgregg can we bucket by year?

@fgregg (Contributor) commented Apr 7, 2015

@rshorey yeah, I have a PR for the pending type (pending may not be the best name; the idea is that we can't figure out what the relation type is until after we finish scraping): opencivicdata/python-opencivicdata#23

@paultag we can't really bucket by year. There is no end-of-year expiration.

Practically speaking, if nothing has happened to a bill in over a year it's unlikely that anything will happen to it, but there are exceptions. Things do get shaken out of committee sometimes.

@fgregg (Contributor) commented Apr 7, 2015

Two ideas:
1. We could leverage the Legistar API, which has a last-updated search parameter. But we would still need to know when we last scraped (which pupa does not currently seem to have a way to know).
2. On the website, the search results seem to be returned in last-updated order. We could stop scraping once a window of 100 bills on the website matches a window of 100 bills in the DB (sketched below). This seems like the most generic solution, since other cities do not have a Legistar API.
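
A rough sketch of idea 2, purely to illustrate the stopping condition; scraped_bill_ids and known_recent_ids are hypothetical stand-ins for the site's last-updated-ordered listing and the identifiers already stored in the DB:

WINDOW = 100

def bills_to_update(scraped_bill_ids, known_recent_ids):
    """Yield bill ids until a full window of consecutive ids is already known to the DB."""
    recent = []
    for bill_id in scraped_bill_ids:
        yield bill_id
        recent.append(bill_id)
        if len(recent) > WINDOW:
            recent.pop(0)
        if len(recent) == WINDOW and all(b in known_recent_ids for b in recent):
            break  # nothing further back has changed, so stop early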

@rshorey (Contributor, PR author) commented Apr 7, 2015

@fgregg, it sounds like you've given this much more thought than I have, so I'd say you should probably ditch my PR and go with yours. Although if you haven't found it yet, you might want to pull out my fix for the onclick problem in the Legistar scraper - I won't be offended if you just dump that into your branch.

@fgregg (Contributor) commented Apr 8, 2015

I have no PR. I think this is good to merge.

@rshorey (Contributor, PR author) commented Apr 8, 2015

OK, @paultag, I think you're the one with merge power here.

@fgregg you'll maybe want to revert back to the "pending" thing if/when we decide to go that direction on OCD.

@fgregg (Contributor) commented Apr 21, 2015

This all works for me. 👍

@paultag, there are a couple of deeper issues that we've talked about in this PR. They should not block, but should be addressed eventually. Some of these already have issues filed.

Until we figure out smart updates, which honestly will be a while, I would strongly urge you to schedule this as a weekly scrape, not a nightly scrape.

@fgregg (Contributor) commented Apr 21, 2015

A little more context.

It takes about 48 hours for us to scrape the current site. It should actually get a lot better real soon, because we are about to have a new legislative session in mid-May. So that will give us some time to figure out the smart scraping.

Also, @datamade will definitely help make the needed changes to pupa to support omnibus bills and smart updates. LMK what the best way to coordinate that is.

paultag added a commit that referenced this pull request Apr 21, 2015
@paultag paultag merged commit 01f477d into opencivicdata:master Apr 21, 2015
@paultag (Contributor) commented Apr 21, 2015

I'll kick off a run today!

@fgregg (Contributor) commented Apr 21, 2015

💯

@fgregg (Contributor) commented Apr 21, 2015

Talking about smart updates here: opencivicdata/pupa#169

@fgregg (Contributor) commented Apr 23, 2015

This still running?

@paultag (Contributor) commented Apr 23, 2015

@fgregg It was, but it died with a traceback:

04:36:57 INFO pupa: save bill O2014-4150 in 2011 as bill_473248ae-e99c-11e4-896f-0242ac11008e.json
Traceback (most recent call last):
  File "/usr/local/bin/pupa", line 9, in 
    load_entry_point('pupa==0.4.1', 'console_scripts', 'pupa')()
  File "/opt/sunlightfoundation.com/pupa/pupa/cli/__main__.py", line 71, in main
    subcommands[args.subcommand].handle(args, other)
  File "/opt/sunlightfoundation.com/pupa/pupa/cli/commands/update.py", line 224, in handle
    report['scrape'] = self.do_scrape(juris, args, scrapers)
  File "/opt/sunlightfoundation.com/pupa/pupa/cli/commands/update.py", line 123, in do_scrape
    report[scraper_name] = scraper.do_scrape(**scrape_args)
  File "/opt/sunlightfoundation.com/pupa/pupa/scrape/base.py", line 102, in do_scrape
    for obj in self.scrape(**kwargs) or []:
  File "/opt/sunlightfoundation.com/scrapers-us-municipal/chicago/bills.py", line 104, in scrape
    bill_session = self.session(legislation_summary['Intro\xa0Date'])
  File "/opt/sunlightfoundation.com/scrapers-us-municipal/chicago/bills.py", line 17, in session
    tzinfo=pytz.timezone(self.timezone)) :
TypeError: unorderable types: str() < datetime.datetime()

PRs welcome :)
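
A guess at the kind of fix this traceback calls for (I haven't seen chicago/bills.py beyond the lines quoted above): the Legistar "Intro Date" comes back as a string, so it needs to be parsed into a timezone-aware datetime before it can be compared against the session boundary datetimes. The date format used here is an assumption.

from datetime import datetime
import pytz

def parse_intro_date(raw, timezone='America/Chicago', fmt='%m/%d/%Y'):
    """Turn a Legistar 'Intro Date' string into an aware datetime for comparisons."""
    naive = datetime.strptime(raw.strip(), fmt)  # fmt is assumed, not verified against the site
    return pytz.timezone(timezone).localize(naive)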

@fgregg (Contributor) commented Apr 23, 2015

OK, working on this now; it's pretty weird.


@fgregg (Contributor) commented Apr 23, 2015

Would you be open to just scraping the past six months or so until we get the incremental stuff figured out?

@fgregg (Contributor) commented Apr 25, 2015

Still haven't been able to get back that far, as I'm getting timeout errors. @rshorey @paultag, how do you suggest I handle those?

Traceback (most recent call last):
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 374, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.4/http/client.py", line 1147, in getresponse
    response.begin()
  File "/usr/lib/python3.4/http/client.py", line 351, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.4/http/client.py", line 313, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.4/socket.py", line 371, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.4/ssl.py", line 746, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.4/ssl.py", line 618, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/adapters.py", line 370, in send
    timeout=timeout
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 597, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/util/retry.py", line 245, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/packages/six.py", line 310, in reraise
    raise value
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 544, in urlopen
    body=body, headers=headers)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 376, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 304, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
requests.packages.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='chicago.legistar.com', port=443): Read timed out. (read timeout=60)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/fgregg/public/municipal-scrapers-us/.env/bin/pupa", line 9, in <module>
    load_entry_point('pupa==0.4.1', 'console_scripts', 'pupa')()
  File "/home/fgregg/public/municipal-scrapers-us/.env/src/pupa/pupa/cli/__main__.py", line 71, in main
    subcommands[args.subcommand].handle(args, other)
  File "/home/fgregg/public/municipal-scrapers-us/.env/src/pupa/pupa/cli/commands/update.py", line 224, in handle
    report['scrape'] = self.do_scrape(juris, args, scrapers)
  File "/home/fgregg/public/municipal-scrapers-us/.env/src/pupa/pupa/cli/commands/update.py", line 123, in do_scrape
    report[scraper_name] = scraper.do_scrape(**scrape_args)
  File "/home/fgregg/public/municipal-scrapers-us/.env/src/pupa/pupa/scrape/base.py", line 102, in do_scrape
    for obj in self.scrape(**kwargs) or []:
  File "/home/fgregg/public/municipal-scrapers-us/chicago/bills.py", line 118, in scrape
    bill, votes = self.addDetails(bill, legislation_summary['url'])
  File "/home/fgregg/public/municipal-scrapers-us/chicago/bills.py", line 227, in addDetails
    votes = self.addBillHistory(bill, history_table)
  File "/home/fgregg/public/municipal-scrapers-us/chicago/bills.py", line 167, in addBillHistory
    result, votes = self.extractVotes(action_detail_url)
  File "/home/fgregg/public/municipal-scrapers-us/chicago/bills.py", line 127, in extractVotes
    action_detail_page = self.lxmlize(action_detail_url)
  File "/home/fgregg/public/municipal-scrapers-us/chicago/legistar.py", line 18, in lxmlize
    entry = self.get(url).text
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 477, in get
    return self.request('GET', url, **kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/__init__.py", line 270, in request
    **kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/cache.py", line 66, in request
    resp = super(CachingSession, self).request(method, url, **kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/__init__.py", line 92, in request
    return super(ThrottledSession, self).request(method, url, **kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/__init__.py", line 177, in request
    raise exception_raised
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/__init__.py", line 157, in request
    resp = super(RetrySession, self).request(method, url, **kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/adapters.py", line 433, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='chicago.legistar.com', port=443): Read timed out. (read timeout=60)
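
One generic way to handle these read timeouts, offered as a suggestion rather than the project's actual approach: retry the fetch a few times with a pause and only give up after repeated failures. fetch_page below is a stand-in for whatever method actually does the request (e.g. the scraper's lxmlize()).

import time
import requests

def get_with_retries(fetch_page, url, attempts=3, wait=30):
    """Retry a fetch on read timeouts, re-raising only after the last attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch_page(url)
        except requests.exceptions.ReadTimeout:
            if attempt == attempts:
                raise  # let the scrape fail loudly after the final attempt
            time.sleep(wait)  # give chicago.legistar.com a moment before retrying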

@rshorey (Contributor, PR author) commented Apr 27, 2015

@fgregg, I'm working on changing the timeframe for the legislation search (well, I've done it and am testing locally, but it's still quite slow). While trying to run it, I noticed there's a bug in the event scraper's date/time parsing. I'll turn that off in my pull request to prevent it from crashing every time, but can you look into it when you get a chance, please?

@fgregg (Contributor) commented Apr 27, 2015

I'm trying to track down that issue but can't reproduce it. Can you send me the page that's causing the error?


@rshorey (Contributor, PR author) commented Apr 27, 2015

@fgregg I'll try to get you that. In the meantime, do you know if there's an easy way to limit the date range? I'm trying to pass a parameter to "ctl00$ContentPlaceHolder1$lstYearsAdvanced", which is the field I found by poking around in their POST parameters, but it doesn't seem to limit the data in my scrape the way it does in the interface. Was thinking you might have another idea given your familiarity.
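
A very rough sketch of the kind of POST I understand you're attempting; every field name here except 'ctl00$ContentPlaceHolder1$lstYearsAdvanced' (taken from your comment) and the standard ASP.NET __VIEWSTATE/__EVENTVALIDATION fields is a guess, and the search URL is assumed:

import requests

session = requests.Session()
search_url = 'https://chicago.legistar.com/Legislation.aspx'  # assumed search page

# Legistar is a WebForms app, so the hidden state fields from a prior GET usually have to
# be echoed back for the POST to take effect; parsing them out of the page is omitted here.
payload = {
    '__VIEWSTATE': '...',        # copied from the GET response (elided)
    '__EVENTVALIDATION': '...',  # copied from the GET response (elided)
    'ctl00$ContentPlaceHolder1$lstYearsAdvanced': '2015',  # the year filter field
}
response = session.post(search_url, data=payload)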

@rshorey (Contributor, PR author) commented Apr 27, 2015

@fgregg I think the event problem is on this page - maybe the greyed-out one? Again, I'm not super familiar with this code.

https://chicago.legistar.com/DepartmentDetail.aspx?ID=12357&GUID=4B24D5A9-FED0-4015-9154-6BFFFB2A8CB4

@rshorey rshorey deleted the chicago branch May 18, 2015 15:47
feydan pushed a commit to feydan/scrapers-us-municipal that referenced this pull request Nov 14, 2019
Use CRITICAL level for Sentry logging