Merge pull request #8 from dohliam/new_languages
New languages
6 authored Jun 22, 2016
2 parents a2e47f9 + 06a8127 commit 4a02ece
Showing 10 changed files with 26 additions and 4 deletions.
10 changes: 9 additions & 1 deletion README.md
@@ -7,10 +7,11 @@ Stopwords for various languages in JSON format. Per [Wikipedia](http://en.wikipe
You can use all stopwords with [stopwords-all.json](stopwords-all.json) (keyed by language ISO 639-1 code), or see the below table for individual language stopword files.

## Languages
-There are a total of 43 supported languages:
+There are a total of 50 supported languages:

Language | Stopword count | Filename
--- | --- | ---
Afrikaans | 51 | [af.json](dist/af.json)
Arabic | 162 | [ar.json](dist/ar.json)
Armenian | 45 | [hy.json](dist/hy.json)
Basque | 98 | [eu.json](dist/eu.json)
@@ -31,6 +32,7 @@ French | 606 | [fr.json](dist/fr.json)
Galician | 160 | [gl.json](dist/gl.json)
German | 596 | [de.json](dist/de.json)
Greek | 75 | [el.json](dist/el.json)
Hausa | 39 | [ha.json](dist/ha.json)
Hebrew | 194 | [he.json](dist/he.json)
Hindi | 225 | [hi.json](dist/hi.json)
Hungarian | 781 | [hu.json](dist/hu.json)
@@ -48,12 +50,17 @@ Polish | 260 | [pl.json](dist/pl.json)
Portuguese | 408 | [pt.json](dist/pt.json)
Romanian | 282 | [ro.json](dist/ro.json)
Russian | 539 | [ru.json](dist/ru.json)
Sesotho | 31 | [st.json](dist/st.json)
Slovak | 110 | [sk.json](dist/sk.json)
Slovenian | 446 | [sl.json](dist/sl.json)
Somali | 30 | [so.json](dist/so.json)
Spanish | 577 | [es.json](dist/es.json)
Swahili | 74 | [sw.json](dist/sw.json)
Swedish | 401 | [sv.json](dist/sv.json)
Thai | 115 | [th.json](dist/th.json)
Turkish | 279 | [tr.json](dist/tr.json)
Yoruba | 60 | [yo.json](dist/yo.json)
isiZulu | 29 | [zu.json](dist/zu.json)


## Sources
@@ -63,6 +70,7 @@ Turkish | 279 | [tr.json](dist/tr.json)
- [cue.language](https://github.com/vcl/cue.language) - [Apache 2.0 License](https://github.com/vcl/cue.language/blob/master/license.txt)
- [Jacques Savoy](http://members.unine.ch/jacques.savoy/clef/index.html) - BSD License
- SMART Information Retrieval System: ftp://ftp.cs.cornell.edu/pub/smart/
- [ASP Stoplist Project](https://github.com/dohliam/more-stoplists) - CC-BY and Apache 2.0

## License and Copyright
Copyright (c) 2016 Peter Graham, contributors.
1 change: 1 addition & 0 deletions dist/af.json
@@ -0,0 +1 @@
["'n","aan","af","al","as","baie","by","daar","dag","dat","die","dit","een","ek","en","gaan","gesê","haar","het","hom","hulle","hy","in","is","jou","jy","kan","kom","ma","maar","met","my","na","nie","om","ons","op","saam","sal","se","sien","so","sy","te","toe","uit","van","vir","was","wat","ʼn"]
1 change: 1 addition & 0 deletions dist/ha.json
@@ -0,0 +1 @@
["a","amma","ba","ban","ce","cikin","da","don","ga","in","ina","ita","ji","ka","ko","kuma","lokacin","ma","mai","na","ne","ni","sai","shi","su","suka","sun","ta","tafi","take","tana","wani","wannan","wata","ya","yake","yana","yi","za"]
1 change: 1 addition & 0 deletions dist/so.json
@@ -0,0 +1 @@
["aad","albaabkii","atabo","ay","ayaa","ayee","ayuu","dhan","hadana","in","inuu","isku","jiray","jirtay","ka","kale","kasoo","ku","kuu","lakin","markii","oo","si","soo","uga","ugu","uu","waa","waxa","waxuu"]
1 change: 1 addition & 0 deletions dist/st.json
@@ -0,0 +1 @@
["a","ba","bane","bona","e","ea","eaba","empa","ena","ha","hae","hape","ho","hore","ka","ke","la","le","li","me","mo","moo","ne","o","oa","re","sa","se","tloha","tsa","tse"]
1 change: 1 addition & 0 deletions dist/sw.json
@@ -0,0 +1 @@
["akasema","alikuwa","alisema","baada","basi","bila","cha","chini","hadi","hapo","hata","hivyo","hiyo","huku","huo","ili","ilikuwa","juu","kama","karibu","katika","kila","kima","kisha","kubwa","kutoka","kuwa","kwa","kwamba","kwenda","kwenye","la","lakini","mara","mdogo","mimi","mkubwa","mmoja","moja","muda","mwenye","na","naye","ndani","ng","ni","nini","nonkungu","pamoja","pia","sana","sasa","sauti","tafadhali","tena","tu","vile","wa","wakati","wake","walikuwa","wao","watu","wengine","wote","ya","yake","yangu","yao","yeye","yule","za","zaidi","zake"]
1 change: 1 addition & 0 deletions dist/yo.json
@@ -0,0 +1 @@
["a","an","","","bẹ̀rẹ̀","fún","fẹ́","gbogbo","inú","","jẹ","jẹ́","kan","","","","láti","","lọ","mi","mo","máa","mọ̀","ni","náà","","nígbà","nítorí","nǹkan","o","padà","","púpọ̀","pẹ̀lú","rẹ̀","","","sínú","","ti","","","","wọn","wọ́n","yìí","àti","àwọn","é","í","òun","ó","ń","ńlá","ṣe","ṣé","ṣùgbọ́n","ẹmọ́","ọjọ́","ọ̀pọ̀lọpọ̀"]
1 change: 1 addition & 0 deletions dist/zu.json
@@ -0,0 +1 @@
["futhi","kahle","kakhulu","kanye","khona","kodwa","kungani","kusho","la","lakhe","lapho","mina","ngesikhathi","nje","phansi","phezulu","u","ukuba","ukuthi","ukuze","uma","wahamba","wakhe","wami","wase","wathi","yakhe","zakhe","zonke"]
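Each `dist/*.json` file added here is a flat JSON array of strings, so a consumer can load one with any JSON parser and use it as a lookup set. The following is a minimal sketch of that usage, not code from this repository: the helper name `remove_stopwords`, the lowercase matching, and the sample tokens are assumptions made for illustration. The Hausa list is copied from `dist/ha.json` above.

```python
import json

# Contents of dist/ha.json as added by this commit.
HA_JSON = '["a","amma","ba","ban","ce","cikin","da","don","ga","in","ina","ita","ji","ka","ko","kuma","lokacin","ma","mai","na","ne","ni","sai","shi","su","suka","sun","ta","tafi","take","tana","wani","wannan","wata","ya","yake","yana","yi","za"]'


def remove_stopwords(tokens, stopwords):
    """Return the tokens whose lowercase form is not in the stopword set.

    Hypothetical helper: lowercasing before lookup is an assumption,
    chosen because the lists in dist/ are stored in lowercase.
    """
    return [token for token in tokens if token.lower() not in stopwords]


ha_words = json.loads(HA_JSON)      # plain list of strings
ha_stopwords = frozenset(ha_words)  # set gives O(1) membership tests

# Arbitrary sample tokens, used only to exercise the filter.
tokens = ["Ina", "son", "ruwa", "da", "abinci"]
content_tokens = remove_stopwords(tokens, ha_stopwords)

print(len(ha_words))    # 39, matching the count in the README table
print(content_tokens)   # ['son', 'ruwa', 'abinci']
```

Parsing the array and counting its entries is also a quick way to check a file against the "Stopword count" column in the table.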
11 changes: 9 additions & 2 deletions docs/supported-languages.md
@@ -1,7 +1,8 @@
-There are a total of 43 supported languages:
+There are a total of 50 supported languages:

Language | Stopword count | Filename
--- | --- | ---
Afrikaans | 51 | [af.json](dist/af.json)
Arabic | 162 | [ar.json](dist/ar.json)
Armenian | 45 | [hy.json](dist/hy.json)
Basque | 98 | [eu.json](dist/eu.json)
@@ -22,6 +23,7 @@ French | 606 | [fr.json](dist/fr.json)
Galician | 160 | [gl.json](dist/gl.json)
German | 596 | [de.json](dist/de.json)
Greek | 75 | [el.json](dist/el.json)
Hausa | 39 | [ha.json](dist/ha.json)
Hebrew | 194 | [he.json](dist/he.json)
Hindi | 225 | [hi.json](dist/hi.json)
Hungarian | 781 | [hu.json](dist/hu.json)
@@ -39,9 +41,14 @@ Polish | 260 | [pl.json](dist/pl.json)
Portuguese | 408 | [pt.json](dist/pt.json)
Romanian | 282 | [ro.json](dist/ro.json)
Russian | 539 | [ru.json](dist/ru.json)
Sesotho | 31 | [st.json](dist/st.json)
Slovak | 110 | [sk.json](dist/sk.json)
Slovenian | 446 | [sl.json](dist/sl.json)
Somali | 30 | [so.json](dist/so.json)
Spanish | 577 | [es.json](dist/es.json)
Swahili | 74 | [sw.json](dist/sw.json)
Swedish | 401 | [sv.json](dist/sv.json)
Thai | 115 | [th.json](dist/th.json)
-Turkish | 279 | [tr.json](dist/tr.json)
+Turkish | 279 | [tr.json](dist/tr.json)
Yoruba | 60 | [yo.json](dist/yo.json)
isiZulu | 29 | [zu.json](dist/zu.json)
2 changes: 1 addition & 1 deletion stopwords-all.json

Large diffs are not rendered by default.
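
The diff for `stopwords-all.json` is not shown, but the README describes that file as keyed by ISO 639-1 language code. One plausible way to produce such a file is to merge every per-language list in `dist/`, using each filename stem as the key. The sketch below illustrates that idea only: the repository's actual build step does not appear in this commit, the function name `build_all` is hypothetical, and the sample lists are truncated stand-ins written to a temporary directory.

```python
import json
import tempfile
from pathlib import Path


def build_all(dist_dir):
    """Merge every <code>.json list in dist_dir into one dict keyed by code.

    Hypothetical helper: assumes each file is a JSON array of strings and
    that the filename stem is the ISO 639-1 code (for example "zu").
    """
    combined = {}
    for path in sorted(Path(dist_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            combined[path.stem] = json.load(fh)
    return combined


# Demonstration with truncated sample lists in a temporary directory.
samples = {"so": ["aad", "ay", "ayaa"], "zu": ["futhi", "kahle"]}
with tempfile.TemporaryDirectory() as tmp:
    for code, words in samples.items():
        out_path = Path(tmp) / f"{code}.json"
        # ensure_ascii=False keeps non-ASCII words (e.g. Yoruba) readable.
        out_path.write_text(json.dumps(words, ensure_ascii=False), encoding="utf-8")
    combined = build_all(tmp)

all_json = json.dumps(combined, ensure_ascii=False)
print(all_json)
```

A consumer of the merged file would then look up a language with `combined["zu"]` rather than opening a separate file per language.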
