This repository has been archived by the owner on Oct 30, 2018. It is now read-only.

Algorithm used in goose ? #95

Open
IndianShifu opened this issue Mar 9, 2015 · 4 comments

Comments

@IndianShifu

Hi,

I am working on my undergrad research thesis and using the Goose extractor. Goose is a really commendable tool. However, I have a midterm presentation on my thesis, and I will have to explain the algorithm Goose uses.

Can you please tell me the algorithm, or how Goose extracts information from HTML pages?

Thanks,
Faisal

@hugows

hugows commented Mar 9, 2015

Since they report that the code was initially based on Readability, maybe this is a good place to start:

https://code.google.com/p/arc90labs-readability/source/browse/trunk/js/readability.js#711

@IndianShifu
Author

OK, thanks, I will look into it.
But do you know of any research paper or algorithm they used?

@hugows

hugows commented Mar 10, 2015

No - I think they use a heuristic approach (http://en.wikipedia.org/wiki/Heuristic). Web pages are messy, and it's unlikely that an elegant mathematical algorithm would give good results.

That said, I saw a piece of the code and it was something like: "for each paragraph, compute a score that tells you how likely it is that this paragraph is actual content".

For example, if we were trying to find the content of this GitHub page, we could suppose that finding the words "Terms, Privacy, Contact" means that we are looking at the footer, not content. So these words give a negative score.
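
To make the idea concrete, here is a minimal sketch of per-paragraph scoring in Python. It is not Goose's actual code. The word lists, the weights, and the function names `score_paragraph` and `best_paragraph` are all made up for illustration, and treating common words as a positive signal is an assumption of this sketch.

```python
import re

# Hypothetical word lists and weights, chosen only for illustration.
BOILERPLATE_WORDS = {"terms", "privacy", "contact"}
COMMON_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "it"}


def score_paragraph(text):
    """Return a rough score: higher means more likely to be real content."""
    words = re.findall(r"[a-z']+", text.lower())
    score = 0
    for word in words:
        if word in BOILERPLATE_WORDS:
            score -= 5  # footer-style words count against the paragraph
        elif word in COMMON_WORDS:
            score += 1  # assumption: running prose contains many common words
    return score


def best_paragraph(paragraphs):
    """Pick the paragraph with the highest score."""
    return max(paragraphs, key=score_paragraph)
```

With this sketch, a footer line such as "Terms Privacy Contact" gets a negative score, while a sentence of ordinary prose gets a positive one, so `best_paragraph` would prefer the prose.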

HTH

@IndianShifu
Author

OK, thanks.
