This repository has been archived by the owner on Oct 30, 2018. It is now read-only.

No support for articles in Chinese? #26

Open
jr314159 opened this issue Oct 3, 2011 · 2 comments

Comments

@jr314159

jr314159 commented Oct 3, 2011

Hi, I was just trying Goose out on some Chinese-language news sites, and it doesn't appear to be able to pull any article text. Examples:

http://news.xhby.net/system/2011/10/03/011788372.shtml
http://news.iqilu.com/shehui/huahuashijie/20111003/565892.html

Will your algorithm work on Chinese with a minor fix, or does it need to be a Latin language?

Thanks,
Joel

@danielspicar

Hi,

Basically, Goose relies heavily on an English stop-word list to extract "relevant" text, and it currently does not support languages other than English. I made changes so that it can support other languages because I needed support for German. Supporting German required three changes: detecting the charset (UTF-8, ISO-xxxxx, etc.) the web page is encoded in (to handle special characters like umlauts), detecting the content language (whether the web page is in English, German, or something else), and providing a German stop-word list when the article is in German.
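
As a rough illustration of those three steps (charset handling, language detection, per-language stop words), here is a minimal Python sketch. It is not Goose's code and not the unreleased changes mentioned here; the function names, the tiny stop-word lists, and the fallback order of charsets are all assumptions made for the example.

```python
import re
from typing import Optional

# Tiny illustrative stop-word lists (assumption); real lists are much longer.
STOP_WORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it", "was", "for"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu", "mit", "für"},
}


def decode_page(raw: bytes, declared_charset: Optional[str] = None) -> str:
    """Decode page bytes: try the declared charset, then UTF-8, then ISO-8859-1."""
    for charset in (declared_charset, "utf-8"):
        if charset is None:
            continue
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset, try the next candidate
    # ISO-8859-1 maps every byte to a character, so this never raises.
    return raw.decode("iso-8859-1")


def stop_word_counts(text: str) -> dict:
    """Count how many words of the text appear in each language's stop-word list."""
    words = re.findall(r"\w+", text.lower())
    return {
        lang: sum(1 for word in words if word in stops)
        for lang, stops in STOP_WORDS.items()
    }


def guess_language(text: str) -> Optional[str]:
    """Pick the language whose stop words occur most often, or None if none occur."""
    counts = stop_word_counts(text)
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

For example, `guess_language(decode_page(raw_bytes, "utf-8"))` would select which stop-word list to use before scoring text blocks. A production setup would more likely use a dedicated language-detection library than raw stop-word counts.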

I am unfortunately not (yet) allowed to provide my changes due to legal reasons.

In your case you would need to do something similar, but I don't know much about the structure of the Chinese language or whether something like stop words exists in Chinese.
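
For what it's worth, Chinese stop-word lists do exist (mostly common function words), but Chinese text is written without spaces between words, so splitting on whitespace does not produce words to look up. The sketch below counts stop words by substring matching instead. The word list is a tiny hypothetical sample and the function is illustrative only, not something Goose provides.

```python
# Hypothetical, tiny sample of common Chinese function words (assumption).
ZH_STOP_WORDS = ("的", "了", "是", "在", "和", "也", "就", "不")


def zh_stop_word_count(text: str) -> int:
    """Count stop-word occurrences by substring search, since there are no spaces."""
    return sum(text.count(word) for word in ZH_STOP_WORDS)


sample = "我是学生，我在北京的学校"
# Whitespace tokenization yields a single "word" for the whole sentence.
whitespace_tokens = sample.split()
```

Substring matching can over-count when a stop-word character is part of a longer word, so a proper word segmenter would be the more careful choice.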

Regards,
Daniel

@karussell

Check out https://github.com/karussell/snacktory, which should work for Chinese text too.
