
Chicago: bugfixes #25

Merged (2 commits) Apr 21, 2015

Conversation

@rshorey (Contributor) commented Apr 7, 2015

The Chicago scraper failed locally for me (and it's now been running for 12+ hours, so I haven't seen it finish). @fgregg, feel free to ditch this PR if it overlaps with work you've done.

@fgregg (Contributor) commented Apr 7, 2015

Looks good to me! What's your current thinking on the omnibus relation?

@rshorey (Contributor, PR author) commented Apr 7, 2015

We need to specify a relationship between bills, so I'm just having it look for language that makes it pretty clear we're talking about an omnibus relationship (the two cues I saw were "sundry" and "miscellaneous"). If it finds that, the omnibus version "replaces" the old version. I'm having it ignore related bills when the relationship isn't obvious, but I don't think I saw any examples of that. I'm far from a municipal government expert, though, so go ahead and make changes if you think they're needed.
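
A minimal sketch of the heuristic described above (not the exact code in this PR; the variable names and the commented add_related_bill usage are assumptions based on pupa's Bill model):

OMNIBUS_HINTS = ('sundry', 'miscellaneous')

def omnibus_relation(related_text):
    """Return 'replaces' when the related-bill text clearly signals an omnibus bill."""
    text = related_text.lower()
    if any(hint in text for hint in OMNIBUS_HINTS):
        return 'replaces'
    return None  # relationship isn't obvious, so skip it rather than guess

# Hypothetical usage inside the scrape loop:
# relation = omnibus_relation(related_title)
# if relation:
#     bill.add_related_bill(related_identifier, legislative_session=session,
#                           relation_type=relation)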

@rshorey (Contributor, PR author) commented Apr 7, 2015

Oh, also: since it ran for many hours, I think it's looking back over all of history. It might be worth writing something so it doesn't do that, because it's never going to import anything into OCD if it never finishes.

@fgregg (Contributor) commented Apr 7, 2015

👍 on needing a way of not scraping the whole history every time. @paultag and I have talked a little bit about that. Right now, AFAIK, pupa does not know what already exists in the DB, which would seem to be necessary in order to have an update-only mode.

@paultag (Contributor) commented Apr 7, 2015

Pupa knows about sessions; is there a reason we're not using them?

@fgregg (Contributor) commented Apr 7, 2015

We could do sessions. That would help a little bit going forward. However, everything that is currently in Legistar is part of the same session.

@fgregg (Contributor) commented Apr 7, 2015

@rshorey, @paultag and I have been talking about dealing with omnibus bills as a post-scrape step. What are your thoughts on that?

@rshorey (Contributor, PR author) commented Apr 7, 2015

That could definitely be fine. The only reason I did anything about it is that the scraper was failing locally for me: related_bills requires a relation type, and the "pending" type that was in there before was not an acceptable one. You'll just have to be sure you can store the information needed to do the post-scrape reconciliation with an appropriate relation type.
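
For reference, a hedged sketch of where the relation type comes in with pupa's Bill model; the identifiers, session, and title below are made up, the constructor arguments follow pupa's Bill model as I understand it (not necessarily pupa 0.4.1 exactly), and the allowed relation_type values are, as far as I know, 'companion', 'prior-session', 'replaced-by', and 'replaces', which is why 'pending' was rejected:

from pupa.scrape import Bill

bill = Bill('O2014-0001', legislative_session='2011',
            title='Example omnibus ordinance', classification='ordinance')
# Each related bill must carry an explicit, allowed relation type.
bill.add_related_bill('O2013-9999', legislative_session='2011', relation_type='replaces')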

@paultag (Contributor) commented Apr 7, 2015

@fgregg can we bucket by year?

@fgregg (Contributor) commented Apr 7, 2015

@rshorey yeah, I have a PR for the pending type (pending may not be the best name; the idea is that we can't figure out what the relation type is until after we finish scraping): opencivicdata/python-opencivicdata#23

@paultag we can't really bucket by year. There is no end-of-year expiration.

Practically speaking, if nothing has happened to a bill in over a year it's unlikely that anything will happen to it, but there are exceptions. Things do get shaken out of committee sometimes.

@fgregg (Contributor) commented Apr 7, 2015

Two ideas:
1. We could leverage the Legistar API, which has a last-updated search parameter. But we would still need to know when we last scraped (which pupa does not currently seem to have a way to know).
2. On the website, the search results seem to be returned in last-updated order. We could stop scraping once a window of 100 bills on the website matches a window of 100 bills in the DB (sketched below). This seems like the most generic solution, since other cities do not have a Legistar API.
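
A rough sketch of idea 2, purely to illustrate the stopping condition; scraped_bill_ids and known_recent_ids are hypothetical stand-ins for the site's last-updated-ordered listing and the identifiers already stored in the DB:

WINDOW = 100

def bills_to_update(scraped_bill_ids, known_recent_ids):
    """Yield bill ids until a full window of consecutive ids is already known to the DB."""
    recent = []
    for bill_id in scraped_bill_ids:
        yield bill_id
        recent.append(bill_id)
        if len(recent) > WINDOW:
            recent.pop(0)
        if len(recent) == WINDOW and all(b in known_recent_ids for b in recent):
            break  # nothing further back has changed, so stop early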

@rshorey (Contributor, PR author) commented Apr 7, 2015

@fgregg, it sounds like you've given this much more thought than I have, so I'd say you should probably ditch my PR and go with yours. Although if you haven't found it yet, you might want to pull out my fix for the onclick problem in the Legistar scraper - I won't be offended if you just dump that into your branch.

@fgregg (Contributor) commented Apr 8, 2015

I have no PR. I think this is good to merge.

@rshorey (Contributor, PR author) commented Apr 8, 2015

OK, @paultag, I think you're the one with merge power here.

@fgregg you'll maybe want to revert back to the "pending" thing if/when we decide to go that direction on OCD.

@fgregg (Contributor) commented Apr 21, 2015

This all works for me. 👍

@paultag, there are a couple of deeper issues that we've talked about in this PR. They should not block, but should be addressed eventually. Some of these already have issues filed.

Until we figure out smart updates, which honestly will be a while, I would strongly urge you to schedule this as a weekly scrape, not a nightly scrape.

@fgregg (Contributor) commented Apr 21, 2015

A little more context.

It takes about 48 hours for us to scrape the current site. It should actually get a lot better real soon, because we are about to have a new legislative session in mid-May. So that will give us some time to figure out the smart scraping.

Also, @datamade will definitely help make the needed changes to pupa to support omnibus bills and smart updates. LMK what the best way to coordinate that is.

paultag added a commit that referenced this pull request Apr 21, 2015
@paultag paultag merged commit 01f477d into opencivicdata:master Apr 21, 2015
@paultag (Contributor) commented Apr 21, 2015

I'll kick off a run today!

@fgregg (Contributor) commented Apr 21, 2015

💯

@fgregg (Contributor) commented Apr 21, 2015

Talking about smart updates here: opencivicdata/pupa#169

@fgregg (Contributor) commented Apr 23, 2015

This still running?

@paultag (Contributor) commented Apr 23, 2015

@fgregg It was, but it died with a traceback:

04:36:57 INFO pupa: save bill O2014-4150 in 2011 as bill_473248ae-e99c-11e4-896f-0242ac11008e.json
Traceback (most recent call last):
  File "/usr/local/bin/pupa", line 9, in 
    load_entry_point('pupa==0.4.1', 'console_scripts', 'pupa')()
  File "/opt/sunlightfoundation.com/pupa/pupa/cli/__main__.py", line 71, in main
    subcommands[args.subcommand].handle(args, other)
  File "/opt/sunlightfoundation.com/pupa/pupa/cli/commands/update.py", line 224, in handle
    report['scrape'] = self.do_scrape(juris, args, scrapers)
  File "/opt/sunlightfoundation.com/pupa/pupa/cli/commands/update.py", line 123, in do_scrape
    report[scraper_name] = scraper.do_scrape(**scrape_args)
  File "/opt/sunlightfoundation.com/pupa/pupa/scrape/base.py", line 102, in do_scrape
    for obj in self.scrape(**kwargs) or []:
  File "/opt/sunlightfoundation.com/scrapers-us-municipal/chicago/bills.py", line 104, in scrape
    bill_session = self.session(legislation_summary['Intro\xa0Date'])
  File "/opt/sunlightfoundation.com/scrapers-us-municipal/chicago/bills.py", line 17, in session
    tzinfo=pytz.timezone(self.timezone)) :
TypeError: unorderable types: str() < datetime.datetime()

PRs welcome :)
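
A guess at the kind of fix this traceback calls for (I haven't seen chicago/bills.py beyond the lines quoted above): the Legistar "Intro Date" comes back as a string, so it needs to be parsed into a timezone-aware datetime before it can be compared against the session boundary datetimes. The date format used here is an assumption.

from datetime import datetime
import pytz

def parse_intro_date(raw, timezone='America/Chicago', fmt='%m/%d/%Y'):
    """Turn a Legistar 'Intro Date' string into an aware datetime for comparisons."""
    naive = datetime.strptime(raw.strip(), fmt)  # fmt is assumed, not verified against the site
    return pytz.timezone(timezone).localize(naive)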

@fgregg (Contributor) commented Apr 23, 2015

OK, working on this now; it's pretty weird.


@fgregg (Contributor) commented Apr 23, 2015

Would you be open to just scraping the past six months or so until we get the incremental stuff figured out?

@fgregg (Contributor) commented Apr 25, 2015

Still haven't been able to get back that far, as I'm getting timeout errors. @rshorey @paultag, how do you suggest I handle those?

Traceback (most recent call last):
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 374, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.4/http/client.py", line 1147, in getresponse
    response.begin()
  File "/usr/lib/python3.4/http/client.py", line 351, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.4/http/client.py", line 313, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.4/socket.py", line 371, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.4/ssl.py", line 746, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.4/ssl.py", line 618, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/adapters.py", line 370, in send
    timeout=timeout
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 597, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/util/retry.py", line 245, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/packages/six.py", line 310, in reraise
    raise value
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 544, in urlopen
    body=body, headers=headers)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 376, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 304, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
requests.packages.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='chicago.legistar.com', port=443): Read timed out. (read timeout=60)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/fgregg/public/municipal-scrapers-us/.env/bin/pupa", line 9, in <module>
    load_entry_point('pupa==0.4.1', 'console_scripts', 'pupa')()
  File "/home/fgregg/public/municipal-scrapers-us/.env/src/pupa/pupa/cli/__main__.py", line 71, in main
    subcommands[args.subcommand].handle(args, other)
  File "/home/fgregg/public/municipal-scrapers-us/.env/src/pupa/pupa/cli/commands/update.py", line 224, in handle
    report['scrape'] = self.do_scrape(juris, args, scrapers)
  File "/home/fgregg/public/municipal-scrapers-us/.env/src/pupa/pupa/cli/commands/update.py", line 123, in do_scrape
    report[scraper_name] = scraper.do_scrape(**scrape_args)
  File "/home/fgregg/public/municipal-scrapers-us/.env/src/pupa/pupa/scrape/base.py", line 102, in do_scrape
    for obj in self.scrape(**kwargs) or []:
  File "/home/fgregg/public/municipal-scrapers-us/chicago/bills.py", line 118, in scrape
    bill, votes = self.addDetails(bill, legislation_summary['url'])
  File "/home/fgregg/public/municipal-scrapers-us/chicago/bills.py", line 227, in addDetails
    votes = self.addBillHistory(bill, history_table)
  File "/home/fgregg/public/municipal-scrapers-us/chicago/bills.py", line 167, in addBillHistory
    result, votes = self.extractVotes(action_detail_url)
  File "/home/fgregg/public/municipal-scrapers-us/chicago/bills.py", line 127, in extractVotes
    action_detail_page = self.lxmlize(action_detail_url)
  File "/home/fgregg/public/municipal-scrapers-us/chicago/legistar.py", line 18, in lxmlize
    entry = self.get(url).text
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 477, in get
    return self.request('GET', url, **kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/__init__.py", line 270, in request
    **kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/cache.py", line 66, in request
    resp = super(CachingSession, self).request(method, url, **kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/__init__.py", line 92, in request
    return super(ThrottledSession, self).request(method, url, **kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/__init__.py", line 177, in request
    raise exception_raised
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/__init__.py", line 157, in request
    resp = super(RetrySession, self).request(method, url, **kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/adapters.py", line 433, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='chicago.legistar.com', port=443): Read timed out. (read timeout=60)
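
One generic way to handle these read timeouts, offered as a suggestion rather than the project's actual approach: retry the fetch a few times with a pause and only give up after repeated failures. fetch_page below is a stand-in for whatever method actually does the request (e.g. the scraper's lxmlize()).

import time
import requests

def get_with_retries(fetch_page, url, attempts=3, wait=30):
    """Retry a fetch on read timeouts, re-raising only after the last attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch_page(url)
        except requests.exceptions.ReadTimeout:
            if attempt == attempts:
                raise  # let the scrape fail loudly after the final attempt
            time.sleep(wait)  # give chicago.legistar.com a moment before retrying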

@rshorey (Contributor, PR author) commented Apr 27, 2015

@fgregg, I'm working on changing the timeframe for the legislation search (well, I've done it and am testing locally, but it's still quite slow). While trying to run it, I noticed there's a bug in the event scraper's date/time parsing. I'll turn that off in my pull request to prevent it from crashing every time, but can you look into it when you get a chance, please?

@fgregg (Contributor) commented Apr 27, 2015

I'm trying to track down that issue but can't reproduce it. Can you send me the page that's causing the error?


@rshorey (Contributor, PR author) commented Apr 27, 2015

@fgregg I'll try to get you that. In the meantime, do you know if there's an easy way to limit the date range? I'm trying to pass a parameter to "ctl00$ContentPlaceHolder1$lstYearsAdvanced", which is the field I found by poking around in their POST parameters, but it doesn't seem to limit the data in my scrape the way it does in the interface. Was thinking you might have another idea given your familiarity.
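
A very rough sketch of the kind of POST I understand you're attempting; every field name here except 'ctl00$ContentPlaceHolder1$lstYearsAdvanced' (taken from your comment) and the standard ASP.NET __VIEWSTATE/__EVENTVALIDATION fields is a guess, and the search URL is assumed:

import requests

session = requests.Session()
search_url = 'https://chicago.legistar.com/Legislation.aspx'  # assumed search page

# Legistar is a WebForms app, so the hidden state fields from a prior GET usually have to
# be echoed back for the POST to take effect; parsing them out of the page is omitted here.
payload = {
    '__VIEWSTATE': '...',        # copied from the GET response (elided)
    '__EVENTVALIDATION': '...',  # copied from the GET response (elided)
    'ctl00$ContentPlaceHolder1$lstYearsAdvanced': '2015',  # the year filter field
}
response = session.post(search_url, data=payload)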

@rshorey (Contributor, PR author) commented Apr 27, 2015

@fgregg I think the event problem is on this page - maybe the greyed-out one? Again, I'm not super familiar with this code.

https://chicago.legistar.com/DepartmentDetail.aspx?ID=12357&GUID=4B24D5A9-FED0-4015-9154-6BFFFB2A8CB4

@rshorey rshorey deleted the chicago branch May 18, 2015 15:47
feydan pushed a commit to feydan/scrapers-us-municipal that referenced this pull request Nov 14, 2019
Use CRITICAL level for Sentry logging