Figure out how to automatically scrape #1

domluna · 2014-09-08T23:42:33Z

The YorkU website is literally a clusterfuck for scraping, but it would be really awesome if we could automatically do it. I'm not even sure if this is completely possible due to the absurd html layout and the fact that the urls don't make any sense.

Accounting - https://w2prod.sis.yorku.ca/Apps/WebObjects/cdm.woa/20/wo/2Ut0tG0DUArPP653ACehWw/1.1.10.7
Biology - https://w2prod.sis.yorku.ca/Apps/WebObjects/cdm.woa/20/wo/2Ut0tG0DUArPP653ACehWw/1.1.10.7

Notice they're the same url! WTF!

Also I think it's putting cookies in the url because these urls will expire after a short while.

Anyway, the html soup can be dealt with; it's the url structure not making any sense that worries me. The structure we would want would be something like

https://www.yorku.ca/courses/2014-15/{Term}/{Subject}

but I guess that would make too much sense.

rajiteh · 2014-09-18T19:56:19Z

@domluna Hello! I think the issue here is that York uses Apple WebObjects, a framework that saw its last release like 6 years ago. x(

Upon investigating the URL structure and reading some ancient documentation, the apparent garbage in the URL seems to contain a wosid (WebObjects session ID) that uniquely identifies each user and their context, i.e. the current page. This token seems to be generated only upon requesting the root of the app, and it expires pretty fast.
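To make the URL breakdown concrete, here is a minimal Python sketch that splits one of the URLs from this thread into its parts. The field names (`app`, `instance`, `session_id`, `element_id`) and the reading of `20` as an instance number and `1.1.10.7` as a page/element identifier are assumptions based on the wosid interpretation above, not something confirmed for York's deployment.

```python
import re
from typing import NamedTuple, Optional


class WOUrl(NamedTuple):
    """Pieces of a WebObjects-style URL (field names are assumptions)."""
    app: str          # e.g. "cdm"
    instance: str     # e.g. "20" (assumed to be an instance number)
    session_id: str   # the wosid, which expires quickly
    element_id: str   # e.g. "1.1.10.7"


# Matches .../WebObjects/<app>.woa/<instance>/wo/<session_id>/<element_id>
_WO_PATTERN = re.compile(
    r"/WebObjects/(?P<app>[^/]+)\.woa/(?P<instance>[^/]+)"
    r"/wo/(?P<session_id>[^/]+)/(?P<element_id>[^/?#]+)"
)


def parse_wo_url(url: str) -> Optional[WOUrl]:
    """Return the URL's parts, or None if it does not have this shape."""
    match = _WO_PATTERN.search(url)
    if match is None:
        return None
    return WOUrl(**match.groupdict())


parts = parse_wo_url(
    "https://w2prod.sis.yorku.ca/Apps/WebObjects/cdm.woa/20/wo/"
    "2Ut0tG0DUArPP653ACehWw/1.1.10.7"
)
```

If this reading is right, only `session_id` changes between visits, which would explain why the Accounting and Biology URLs look identical within one session.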

I was able to get a proof of concept of fully automated parsing working by improving @mlisbit's native parser script. See: #2

Cheers!

domluna · 2014-09-18T20:30:31Z

Geez, 6 years! That's way before my time. If you found a way around all of that, well, that's just wonderful.

So then is there a way to uniquely identify a page consistently?

So as an example, say it encodes the anthropology page as 10.2.3 or something like that. Does it always give back 10.2.3?

If it has this property, we can make a mapping from the course type to the weird encoding and mine the pages that way.

rajiteh · 2014-09-26T02:57:30Z

@domluna Those seem to be consistent, however there is no guarantee that they will stay the same in the future.

For example, the endpoint '1.1.10.7' seems to be a method accepting two POST variables, sessionPopUp and subjectPopUp, that define the semester and subject category. This information by itself should be enough to get a working parser, at least with the way the site is structured at the moment.
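As a rough sketch of what such a request might look like in Python, the snippet below builds (but does not send) a POST to the '1.1.10.7' endpoint. The helper name `build_subject_request` and the values `"0"` and `"1"` are hypothetical placeholders; the real option values would have to be read from the form, and the session ID would have to be a fresh one obtained from the root of the app.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "https://w2prod.sis.yorku.ca/Apps/WebObjects/cdm.woa"


def build_subject_request(instance, session_id, session_value, subject_value,
                          endpoint="1.1.10.7"):
    """Build a POST request selecting a semester and a subject category.

    session_value and subject_value are whatever the form's option values
    turn out to be; the ones used below are placeholders, not real values.
    """
    url = f"{BASE}/{instance}/wo/{session_id}/{endpoint}"
    body = urlencode({
        "sessionPopUp": session_value,   # semester
        "subjectPopUp": subject_value,   # subject category
    }).encode("ascii")
    return Request(url, data=body, method="POST")


# The session ID here is the expired one quoted earlier in this thread.
req = build_subject_request("20", "2Ut0tG0DUArPP653ACehWw", "0", "1")
```

Passing `req` to `urllib.request.urlopen` would send it, though with an expired session ID the server will presumably reject it or redirect.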

domluna added the enhancement label Sep 8, 2014
