No support for articles in Chinese? #26

jr314159 · 2011-10-03T18:09:56Z

Hi, I was just trying goose out on some Chinese language news sites, and it doesn't appear to be able to pull any article text. Examples:

http://news.xhby.net/system/2011/10/03/011788372.shtml
http://news.iqilu.com/shehui/huahuashijie/20111003/565892.html

Will your algorithm work on Chinese with a minor fix, or does it require a Latin-script language?

Thanks,
Joel

danielspicar · 2011-10-04T10:42:57Z

Hi,

Basically, Goose relies heavily on an English stop-word list to extract "relevant" text, and it currently does not support languages other than English. I made changes so it can support other languages because I needed support for German. Supporting German required three changes: detecting the charset the web page is encoded in (UTF-8, ISO-xxxxx, etc.) so that special characters like umlauts are handled, detecting the content language (whether the page is in English, German, or something else), and providing a German stop-word list for articles in German.
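To make the three steps concrete, here is a minimal Python sketch. It is not Goose's code and not my actual changes: the names `decode_page`, `stop_word_count`, and `guess_language` are hypothetical, the stop-word lists are tiny placeholders, and picking the language by counting stop-word hits is just one simple heuristic I am assuming for illustration.

```python
import re
from typing import Optional

# Hypothetical, heavily truncated stop-word lists; real lists have hundreds of entries.
STOP_WORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
}


def decode_page(raw: bytes, declared_charset: Optional[str] = None) -> str:
    """Step 1: decode with the declared charset if it works, else try UTF-8."""
    for charset in (declared_charset, "utf-8"):
        if charset is None:
            continue
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset, try the next candidate
    # ISO-8859-1 maps every byte to a character, so this last resort cannot fail.
    return raw.decode("iso-8859-1")


def stop_word_count(text: str, lang: str) -> int:
    """Step 3: count how many words of the text are stop words for `lang`."""
    words = re.findall(r"\w+", text.lower())
    return sum(1 for word in words if word in STOP_WORDS[lang])


def guess_language(text: str) -> str:
    """Step 2: pick the language whose stop-word list matches most often."""
    return max(STOP_WORDS, key=lambda lang: stop_word_count(text, lang))
```

For example, `guess_language("Das ist nicht der Zug und die Stadt")` picks `"de"`, after which the German list can be used to score candidate text blocks. A real implementation would more likely use a dedicated language-detection library and the charset declared in the HTTP headers or the page's meta tags.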

I am unfortunately not (yet) allowed to provide my changes due to legal reasons.

In your case you would need to do something similar, but I don't know much about the structure of the Chinese language or whether something like stop words exists in Chinese.
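One thing that is certain is that written Chinese does not put spaces between words, so splitting on whitespace and looking words up in a list will not work as it does for English or German. A rough sketch of a workaround, assuming a short hypothetical list of common function words (`ZH_STOP_WORDS`) and counting substring occurrences instead of tokens:

```python
# Hypothetical, truncated list of common Chinese function words.
ZH_STOP_WORDS = ["的", "了", "是", "在", "和", "不", "有"]


def zh_stop_word_count(text: str) -> int:
    """Count occurrences of each function word anywhere in the text.

    This over-counts when a character is part of a longer word, so it is
    only a crude signal; a proper word segmenter would be more accurate.
    """
    return sum(text.count(word) for word in ZH_STOP_WORDS)


sentence = "这是我的书"            # "This is my book"
tokens = sentence.split()          # whitespace splitting yields a single token
hits = zh_stop_word_count(sentence)  # matches "是" and "的"
```

Here `tokens` has length 1 while `hits` is 2, which shows why the lookup has to work on characters or substrings rather than on whitespace-separated words.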

Regards,
Daniel

karussell · 2011-10-11T21:50:39Z

Check out https://github.com/karussell/snacktory, which should work for Chinese text too.
