Figure out how to automatically scrape #1

Open
domluna opened this issue Sep 8, 2014 · 3 comments

domluna (Collaborator) commented Sep 8, 2014

The YorkU website is literally a clusterfuck for scraping, but it would be really awesome if we could automatically do it. I'm not even sure this is completely possible, due to the absurd HTML layout and the fact that the URLs don't make any sense.

Accounting - https://w2prod.sis.yorku.ca/Apps/WebObjects/cdm.woa/20/wo/2Ut0tG0DUArPP653ACehWw/1.1.10.7
Biology - https://w2prod.sis.yorku.ca/Apps/WebObjects/cdm.woa/20/wo/2Ut0tG0DUArPP653ACehWw/1.1.10.7

Notice they're the same URL! WTF!

Also, I think it's putting cookies in the URL, because these URLs will expire after a short while.

Anyway, the HTML soup can be dealt with; it's the URL structure not making any sense that worries me. The structure we would want would be something like

https://www.yorku.ca/courses/2014-15/{Term}/{Subject}

but I guess that would make too much sense.

rajiteh (Contributor) commented Sep 18, 2014

@domluna Hello! I think the issue here is that York uses Apple WebObjects, a framework that saw its last release like 6 years ago. x(

Upon investigating the URL structure and reading some ancient documentation, the apparent garbage in the URL seems to contain a wosid (WebObjects session ID) that uniquely identifies each user and their context, i.e. the current page. This token seems to be generated only upon requesting the root of the app and expires pretty fast.
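For illustration only (this is not the parser from #2), a minimal sketch of what grabbing a fresh session might look like. The root path is inferred from the sample URLs above, and the idea that it redirects to a wosid-bearing URL is an assumption:

```python
# Sketch: hit the app root and let the server redirect us to a wosid-bearing
# URL for this session. The ROOT path is inferred from the sample URLs above
# and may be wrong.
import requests

ROOT = "https://w2prod.sis.yorku.ca/Apps/WebObjects/cdm.woa"

def fresh_session():
    session = requests.Session()
    response = session.get(ROOT, allow_redirects=True)
    response.raise_for_status()
    # After the redirects, response.url should contain the per-session token
    # (the "apparent garbage" segment), e.g. .../wo/<wosid>/...
    return session, response.url
```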

I was able to get a proof of concept of fully automated parsing by improving @mlisbit's native parser script. See: #2

Cheers!

domluna (Collaborator, Author) commented Sep 18, 2014

Geez, 6 years! That's way before my time. If you found a way to get around all of that, well, that's just wonderful.

So then, is there a way to uniquely identify a page consistently?

As an example, say it encodes the anthropology page as 10.2.3 or something like that. Does it always give back 10.2.3?

If it has this property, we can make a mapping from the course type to the weird encoding and mine the pages that way.
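As a purely hypothetical sketch of that mapping idea (every value below is made up; the 10.2.3 is just the example number from the previous paragraph, and the real encodings would have to be discovered by crawling):

```python
# Hypothetical subject -> "weird encoding" mapping. None of these values are
# real; they would have to be discovered once by crawling and then recorded.
SUBJECT_TO_PAGE_CODE = {
    "ANTH": "10.2.3",  # anthropology, using the made-up example above
    "BIOL": "10.2.4",  # placeholder
}

def page_code(subject):
    # Return the recorded encoding for a subject, or None if we don't have one.
    return SUBJECT_TO_PAGE_CODE.get(subject)
```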


rajiteh (Contributor) commented Sep 26, 2014

@domluna Those seem to be consistent; however, there is no guarantee that they will stay the same in the future.

For example, the endpoint '1.1.10.7' seems to be a method accepting two POST variables, sessionPopUp and subjectPopUp, which define the semester and subject category. That information by itself should be enough to get a working parser, at least the way it's structured at the moment.
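To make the shape of that request concrete, a rough sketch; only the two parameter names come from the comment above, while the endpoint handling and the placeholder term/subject values are assumptions:

```python
# Rough sketch of posting the two variables mentioned above. entry_url is
# assumed to be a fresh wosid-bearing URL ending in the 1.1.10.7 endpoint;
# the term/subject values passed in are placeholders, not known-good inputs.
import requests

def fetch_subject_listing(session: requests.Session, entry_url: str,
                          term: str, subject: str) -> str:
    response = session.post(entry_url, data={
        "sessionPopUp": term,     # semester; the real encoding is unknown
        "subjectPopUp": subject,  # subject category; the real encoding is unknown
    })
    response.raise_for_status()
    return response.text
```

Here session and entry_url would come from something like the fresh_session() sketch earlier in the thread.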
