Algorithm used in goose ? #95

IndianShifu · 2015-03-09T19:57:51Z

Hi,

I am working on my undergrad research thesis and using the goose extractor. Goose is really a commendable tool. However, I have a mid-term presentation on my thesis, and I will have to explain the algorithm used by goose.

Can you please tell me the algorithm, or how goose extracts information from HTML pages?

Thanks,
Faisal

hugows · 2015-03-09T20:02:49Z

Since they report that the code was initially based on readability, maybe this is a good place to start:

https://code.google.com/p/arc90labs-readability/source/browse/trunk/js/readability.js#711

IndianShifu · 2015-03-10T12:36:22Z

OK, thanks, I will look into it.
But do you know of any research paper or algorithm they used?

hugows · 2015-03-10T12:57:00Z

No - I think they use a heuristic approach (http://en.wikipedia.org/wiki/Heuristic). Web pages are messy stuff, and it's unlikely that an elegant mathematical algorithm would give good results.

That said, I saw a piece of the code and it was something like: "for each paragraph, compute a score that tells you how likely this paragraph is actual content".

For example, if we were trying to find the content of this GitHub page, we could suppose that finding the words "Terms, Privacy, Contact" meant that we were looking at the footer, not the content. So these words give a negative score.
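A minimal Python sketch of that idea, to make the scoring concrete. This is not goose's or readability's actual code: the word list, the weights, and the function names `score_paragraph` and `best_paragraph` are all made up for illustration.

```python
import re

# Hypothetical word list and weights, chosen only for illustration;
# they are not taken from goose or readability.
BOILERPLATE_WORDS = {"terms", "privacy", "contact"}
BOILERPLATE_PENALTY = 10
WORD_REWARD = 1


def score_paragraph(text):
    """Return a rough 'is this real content?' score for one paragraph."""
    words = re.findall(r"[a-z']+", text.lower())
    # Assumption: longer paragraphs are more likely to be content.
    score = WORD_REWARD * len(words)
    # Footer-like words push the score down.
    score -= BOILERPLATE_PENALTY * sum(w in BOILERPLATE_WORDS for w in words)
    return score


def best_paragraph(paragraphs):
    """Pick the paragraph with the highest score."""
    return max(paragraphs, key=score_paragraph)


if __name__ == "__main__":
    paragraphs = [
        "Terms, Privacy, Contact",
        "Goose extracts the main article text from a web page.",
    ]
    print(best_paragraph(paragraphs))
```

A real extractor would combine many more signals than this, but the overall shape (score each candidate, keep the best) is the same.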

HTH

IndianShifu · 2015-03-10T14:03:34Z

ok thanks
