This repository has been archived by the owner on Oct 30, 2018. It is now read-only.
Hi,
I am working on my undergrad research thesis and using the Goose extractor. Goose is really a commendable tool. However, I have a midterm presentation on my thesis, and I will have to explain the algorithm Goose uses.
Can you please tell me the algorithm, or how Goose extracts information from HTML pages?
Thanks,
Faisal
No - I think they use a heuristic approach (http://en.wikipedia.org/wiki/Heuristic). Web pages are messy stuff, and it's unlikely that an elegant mathematical algorithm would give good results.
That said, I saw a piece of the code and it was something like: "for each paragraph, compute a score that tells you how likely this paragraph is actual content".
For example, if we were trying to find the content of this GitHub page, we could suppose that finding the words "Terms, Privacy, Contact" meant that we were looking at the footer, not content. So these words would give a negative score.
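To make the idea concrete, here is a minimal Python sketch of that kind of per-paragraph scoring. It is not Goose's actual code: the word lists, the weights, and the names `score_paragraph` and `best_paragraph` are all made up for illustration, and the "common function words suggest real prose" rule is just one plausible positive signal I am assuming.

```python
import re

# Hypothetical word lists and weights, chosen for illustration only;
# they are NOT taken from the Goose source code.
BOILERPLATE_WORDS = {"terms", "privacy", "contact"}
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "it"}


def score_paragraph(text):
    """Return a heuristic score; higher means more likely to be real content."""
    words = re.findall(r"[a-z']+", text.lower())
    score = 0
    for word in words:
        if word in BOILERPLATE_WORDS:
            score -= 5  # footer/navigation vocabulary counts against the paragraph
        elif word in STOPWORDS:
            score += 1  # ordinary function words suggest running prose
    return score


def best_paragraph(paragraphs):
    """Pick the paragraph with the highest heuristic score."""
    return max(paragraphs, key=score_paragraph)


if __name__ == "__main__":
    footer = "Terms Privacy Contact"
    content = ("Web pages are messy stuff, and it is unlikely that an "
               "elegant algorithm would give good results.")
    print(score_paragraph(footer), score_paragraph(content))
    print(best_paragraph([footer, content]))
```

A real extractor would combine many more signals than this, but the shape is the same: score each candidate, then keep the best-scoring ones.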