Figure out how to automatically scrape #1
Comments
@domluna Hello! I think the issue here is that York uses Apple WebObjects, a framework that saw its last release like 6 years ago. x( After investigating the URL structure and reading some ancient documentation, the apparent garbage in the URL seems to contain a wosid (WebObjects session ID) that uniquely identifies each user and their context, i.e., the current page. This token seems to be generated only when the root of the app is requested, and it expires pretty fast. I was able to get a proof of concept of fully automated parsing by improving @mlisbit's native parser script. See: #2 Cheers!
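To make the URL anatomy concrete, here is a minimal Python sketch that pulls the apparent session ID and the trailing dotted ID out of one of the URLs quoted in this issue. The layout `.../cdm.woa/<n>/wo/<session id>/<element id>` is an assumption inferred only from those URLs, and `split_wo_url` is a hypothetical helper name, not something from the parser script.

```python
from urllib.parse import urlsplit


def split_wo_url(url):
    """Split a WebObjects-style URL into (base, session_id, element_id).

    Assumed layout (inferred from the URLs in this issue, not from any spec):
        .../<app>.woa/<n>/wo/<session_id>/<element_id>
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Locate the "wo" marker; list.index raises ValueError if it is absent.
    wo_index = segments.index("wo")
    session_id = segments[wo_index + 1]
    element_id = segments[wo_index + 2]
    base = f"{parts.scheme}://{parts.netloc}/" + "/".join(segments[: wo_index + 1])
    return base, session_id, element_id


if __name__ == "__main__":
    url = (
        "https://w2prod.sis.yorku.ca/Apps/WebObjects/cdm.woa/20/wo/"
        "2Ut0tG0DUArPP653ACehWw/1.1.10.7"
    )
    base, session_id, element_id = split_wo_url(url)
    print(base)
    print(session_id)
    print(element_id)
```

If the session really does expire quickly, a scraper would probably have to request the app root first, run something like this on the URL it gets back, and reuse the fresh session ID for later requests.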
Geez, 6 years! That's way before my time. If you found a way around all of that, well, that's just wonderful. So is there a way to uniquely identify a page consistently? As an example, say it encodes the anthropology page as 10.2.3 or something like that. Does it always give back 10.2.3? If it has this property, we can make a mapping from the course type to the weird encoding and mine the pages that way. On Thu, Sep 18, 2014 at 3:56 PM, Rajitha Perera [email protected]
@domluna Those seem to be consistent; however, there is no guarantee that they will stay the same in the future. For example, the endpoint '1.1.10.7' seems to be a method accepting two POST variables, sessionPopUp and subjectPopUp, that define the semester and subject category. This information by itself should be enough to get a working parser, at least given the way the site is structured at the moment.
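As a rough illustration of the request described above, this Python sketch builds (but does not send) a POST to the '1.1.10.7' endpoint with the two form fields. The field names come from the comment above; everything else is an assumption: `build_course_request` is a hypothetical helper, the values passed for `sessionPopUp` and `subjectPopUp` are placeholders since the accepted values aren't listed in this thread, and the session ID must be a fresh one.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Base URL copied from the links in this issue; the "20" segment may vary.
WO_BASE = "https://w2prod.sis.yorku.ca/Apps/WebObjects/cdm.woa/20/wo"


def build_course_request(session_id, session_value, subject_value,
                         element_id="1.1.10.7", base=WO_BASE):
    """Build a POST request selecting a semester and subject category.

    session_value and subject_value are placeholders: the real values the
    form expects are not documented in this thread.
    """
    url = f"{base}/{session_id}/{element_id}"
    form = urlencode({
        "sessionPopUp": session_value,   # semester selector
        "subjectPopUp": subject_value,   # subject category selector
    }).encode("ascii")
    # Supplying a data payload makes urllib issue a POST instead of a GET.
    return Request(url, data=form)


if __name__ == "__main__":
    req = build_course_request("2Ut0tG0DUArPP653ACehWw", "0", "3")
    print(req.get_method(), req.full_url)
    print(req.data)
```

Passing the result to `urllib.request.urlopen` would actually send it; with an expired session ID the server will presumably reject it or bounce back to the start page.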
The YorkU website is literally a clusterfuck for scraping, but it would be really awesome if we could do it automatically. I'm not even sure this is completely possible, given the absurd HTML layout and the fact that the URLs don't make any sense.
Accounting - https://w2prod.sis.yorku.ca/Apps/WebObjects/cdm.woa/20/wo/2Ut0tG0DUArPP653ACehWw/1.1.10.7
Biology - https://w2prod.sis.yorku.ca/Apps/WebObjects/cdm.woa/20/wo/2Ut0tG0DUArPP653ACehWw/1.1.10.7
Notice they're the same URL! WTF!
Also, I think it's putting cookies in the URL, because these URLs expire after a short while.
Anyway, the HTML soup can be dealt with; it's the URL structure not making any sense that worries me. The structure we would want would be something like
https://www.yorku.ca/courses/2014-15/{Term}/{Subject}
but I guess that would make too much sense.