Commit
Pakkapon Phongtawee committed Mar 27, 2016
1 parent 228ea39, commit 93ca95b
Showing 17 changed files with 24,496 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,60 @@ | ||
# CutThai | ||
|
||
If you are looking for a JavaScript library for Thai word segmentation to use in production, I **strongly** recommend [wordcut](https://github.com/veer66/wordcut). This repository exists to describe how Thai word segmentation works. | ||
|
||
This work is based on the [wordcut](https://github.com/veer66/wordcut) documentation, which you can find on [Medium](https://medium.com/@vsatayamas/wordcut-ภาคอธิบาย-d3b3a617e946#.7sfq26b7t) (in Thai). | ||
|
||
## Algorithm | ||
|
||
### 1. Find a wordlist | ||
|
||
This work is dictionary-based, so you need a Thai wordlist. | ||
You can find Thai wordlists at: | ||
- [LibThai](http://linux.thai.net/projects/libthai) | ||
- [Thai National Corpus](http://www.arts.chula.ac.th/~ling/TNC/category.php?id=58&) | ||
|
||
### 2. Build the word trie | ||
Convert the wordlist from step 1 into a trie to speed up lookups. | ||
Read more about tries: [Wikipedia - Trie](https://en.wikipedia.org/wiki/Trie) | ||
Note: This step differs from [wordcut](https://github.com/veer66/wordcut), which uses binary search. | ||
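As a rough illustration of step 2, the sketch below builds a trie from a tiny wordlist using nested plain objects and checks whether a string is a complete dictionary word. The names `buildTrie` and `hasWord` and the three-word list are assumptions made for this example, not part of the CutThai API.

```javascript
// Minimal trie sketch. buildTrie/hasWord are hypothetical helper names.
function buildTrie(words) {
  const root = {};
  for (const word of words) {
    if (word.length === 0) continue; // skip blank lines from a wordlist file
    let node = root;
    for (const ch of word) {
      if (!node[ch]) node[ch] = {};
      node = node[ch];
    }
    node.word = true; // marks the end of a complete dictionary word
  }
  return root;
}

// Returns true only when `text` is a whole word in the trie, not just a prefix.
function hasWord(trie, text) {
  let node = trie;
  for (const ch of text) {
    node = node[ch];
    if (!node) return false;
  }
  return node.word === true;
}

const trie = buildTrie(["ฉัน", "กิน", "ข้าว"]);
console.log(hasWord(trie, "กิน")); // true
console.log(hasWord(trie, "กิ"));  // false (only a prefix)
```

Because every key is a single character, the `word` flag cannot collide with a child entry.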
|
||
### 3. Create the word graph | ||
The word graph is a [graph](https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)) used to decide where to segment the text: each vertex is a candidate segmentation position and each edge is a word. Edges are created by matching the input against the trie. | ||
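The following sketch shows one way step 3 could look. It is a simplified version, not the repository's implementation: it skips the special handling of English text, and the function names and four-word dictionary are assumptions for illustration. Each vertex `i` lists in `next` the positions where a dictionary word starting at `i` ends; when nothing matches, it falls back to a one-character edge.

```javascript
// Hypothetical helper: builds a trie of nested objects (see step 2).
function buildTrie(words) {
  const root = {};
  for (const word of words) {
    let node = root;
    for (const ch of word) node = node[ch] || (node[ch] = {});
    if (word.length > 0) node.word = true;
  }
  return root;
}

// Simplified word graph: vertices are character positions, edges are words.
function createWordGraph(text, trie) {
  const graph = [];
  for (let i = 0; i < text.length; i++) {
    const next = [];
    let node = trie;
    for (let j = i; j < text.length; j++) {
      node = node[text[j]];
      if (!node) break;                 // no dictionary word continues this way
      if (node.word) next.push(j + 1);  // a word spans text[i..j]
    }
    if (next.length === 0) next.push(i + 1); // unknown character: step by one
    graph.push({ index: i, next });
  }
  graph.push({ index: text.length, next: [], finish: true });
  return graph;
}

const graph = createWordGraph("ฉันกินข้าว", buildTrie(["ฉัน", "กิน", "ข้า", "ข้าว"]));
console.log(graph[0].next); // [ 3 ]
console.log(graph[6].next); // [ 9, 10 ] because both "ข้า" and "ข้าว" match
```

Vertex 6 has two outgoing edges, which is the ambiguity the shortest-path step resolves.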
|
||
### 4. Find the shortest path | ||
Find the shortest path from the start vertex to the end vertex using SPFA (Shortest Path Faster Algorithm). | ||
Read more about SPFA: [Wikipedia - SPFA](https://en.wikipedia.org/wiki/Shortest_Path_Faster_Algorithm) | ||
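A minimal SPFA sketch for step 4 follows, assuming every edge has weight 1 (so the shortest path is the segmentation with the fewest words) and that the end vertex is reachable. The graph literal is hand-written for the sentence "ฉันกินข้าว", and `findShortestPath` here is an illustrative stand-in rather than the repository's method.

```javascript
// SPFA over a word graph whose vertices look like { next: [targets...] }.
// Returns the vertices visited after the start, ending at the last vertex.
function findShortestPath(graph) {
  const dist = new Array(graph.length).fill(Infinity);
  const prev = new Array(graph.length).fill(-1);
  const inQueue = new Array(graph.length).fill(false);
  const queue = [0];
  dist[0] = 0;
  inQueue[0] = true;
  while (queue.length > 0) {
    const u = queue.shift();
    inQueue[u] = false; // u may be queued again if its distance improves later
    for (const v of graph[u].next) {
      if (dist[u] + 1 < dist[v]) { // relax the edge u -> v
        dist[v] = dist[u] + 1;
        prev[v] = u;
        if (!inQueue[v]) {
          queue.push(v);
          inQueue[v] = true;
        }
      }
    }
  }
  const path = [];
  for (let v = graph.length - 1; v > 0; v = prev[v]) path.unshift(v);
  return path;
}

// Hand-written word graph for "ฉันกินข้าว" (10 characters, 11 vertices).
const nexts = [[3], [2], [3], [6], [5], [6], [9, 10], [8], [9], [10], []];
const wordGraph = nexts.map((next, index) => ({ index, next }));
console.log(findShortestPath(wordGraph)); // [ 3, 6, 10 ]
```

With unit weights on a graph whose edges only point forward, a plain breadth-first search would give the same result; SPFA is shown because it is the algorithm the step names.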
|
||
### 5. Segment the sentence into an array | ||
Use the shortest path from step 4 to segment the sentence and convert the result to an array. | ||
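Step 5 can be sketched as slicing the sentence at each position on the path. This simplified version is an assumption for illustration: it does not merge runs of unknown single characters or drop spaces, both of which the repository's source handles.

```javascript
// Cut `text` at every position listed in `path` (positions are end offsets).
function splitByPath(text, path) {
  const words = [];
  let start = 0;
  for (const end of path) {
    words.push(text.substring(start, end));
    start = end;
  }
  return words;
}

const words = splitByPath("ฉันกินข้าว", [3, 6, 10]);
console.log(words.join("|")); // ฉัน|กิน|ข้าว
```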
|
||
|
||
## Usage | ||
|
||
CutThai isn't recommended for production use, but you can download the latest release from [Releases](https://github.com/pureexe/cutthai/releases). | ||
|
||
Using Node.js or CommonJS: | ||
``` javascript | ||
var CutThai = require("cutthai") | ||
``` | ||
|
||
In a plain browser: | ||
``` html | ||
<script src="path/to/cutthai.min.js"></script> | ||
``` | ||
|
||
Run a segmentation: | ||
``` javascript | ||
var cutthai = new CutThai(function(err){ | ||
  if(err){ | ||
    throw err; | ||
  } | ||
  console.log(cutthai.cut("ฉันกินข้าว")); | ||
}); | ||
``` | ||
|
||
### Thanks | ||
[wordcut](https://github.com/veer66/wordcut) - for the Thai word segmentation algorithm | ||
[LibThai](http://linux.thai.net/projects/libthai) - for the Thai word dictionary | ||
|
||
**Note:** This document isn't complete yet. It still needs grammar improvements, more pictures to describe the algorithm, and build instructions. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,228 @@ | ||
### | ||
CutThai is a Thai word segmentation library written in CoffeeScript. | ||
For more information see https://github.com/pureexe/cutthai | ||
### | ||
|
||
class CutThai | ||
  data: | ||
    isInit: false | ||
    trie: {} | ||
    dictList: [ | ||
      "data/tdict-acronyms.txt", | ||
      "data/tdict-city.txt", | ||
      "data/tdict-collection.txt", | ||
      "data/tdict-common.txt", | ||
      "data/tdict-country.txt", | ||
      "data/tdict-district.txt", | ||
      "data/tdict-geo.txt", | ||
      "data/tdict-history.txt", | ||
      "data/tdict-ict.txt", | ||
      "data/tdict-lang-ethnic.txt", | ||
      "data/tdict-proper.txt", | ||
      "data/tdict-science.txt", | ||
      "data/tdict-spell.txt", | ||
      "data/tdict-std-compound.txt", | ||
      "data/tdict-std.txt" | ||
    ] | ||
|
||
  constructor: (dict,callback)-> | ||
    self = @ | ||
    if typeof dict is "function" | ||
      callback = dict | ||
    else if dict instanceof Array | ||
      # loadDict reads @data.dictList, so the custom list must be stored there | ||
      self.data.dictList = dict | ||
    else if dict | ||
      self.data.dictList = [dict] | ||
    if !callback | ||
      callback = -> | ||
    self.loadDict (err,wordlist)-> | ||
      if err | ||
        return callback(err) # stop here; there is no wordlist to build from | ||
      self.data.trie = self.buildTrie(wordlist) | ||
      self.data.isInit = true | ||
      callback() | ||
|
||
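  ## | ||
  # Load dictionary files and merge them into one wordlist | ||
  # @param {Function} callback - called once with (err, wordlist) | ||
  loadDict: (callback)-> | ||
    cnt = 0 | ||
    failed = false | ||
    dictLength = @data.dictList.length | ||
    output = [] | ||
    if typeof(window) is "undefined" and typeof(module) is "object" | ||
      readFile = @readFileNodeJS | ||
    else | ||
      readFile = @readFileBrowser | ||
    for dict in @data.dictList | ||
      readFile dict,(err,data)-> | ||
        return if failed # report only the first error | ||
        if err | ||
          failed = true | ||
          callback(err) | ||
        else | ||
          # trim to drop "\r"; skip blank lines so the trie root is never marked as a word | ||
          lines = (line.trim() for line in data.toString().split("\n") when line.trim()) | ||
          output = output.concat(lines) | ||
          cnt++ | ||
          if cnt == dictLength | ||
            callback(undefined,output) | ||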
|
||
  ## | ||
  # Build the trie | ||
  # The trie uses plain objects rather than arrays because JavaScript objects | ||
  # behave as hash maps (comparable lookup speed, but easier to handle) | ||
  # @param {Array} wordlist - array of strings from the dictionary | ||
  # @return {Object} Trie | ||
  # @see https://en.wikipedia.org/wiki/Trie | ||
  buildTrie: (wordlist)-> | ||
    trie = {} | ||
    for word in wordlist | ||
      i = 0 | ||
      ptr = trie | ||
      while i < word.length | ||
        if !ptr[word[i]] | ||
          ptr[word[i]] = {} | ||
        ptr = ptr[word[i]] | ||
        i++ | ||
      # mark the node that ends a complete word; never mark the root for an empty string | ||
      if word.length > 0 | ||
        ptr.word = true | ||
    return trie | ||
|
||
  ## | ||
  # Cut a sentence into words | ||
  # @param {string} sentense - sentence to segment | ||
  # @return {string} the sentence with its words separated by | | ||
  cut: (sentense)-> | ||
    return @cutArray(sentense).join("|") | ||
|
||
  ## | ||
  # Cut a sentence into an array of words | ||
  # @param {string} sentense - sentence to segment | ||
  # @return {array} array of segmented words | ||
  cutArray: (sentense)-> | ||
    if !@data.isInit | ||
      throw new Error("Please wait for the constructor to finish before calling this method. It needs a little time to build the trie that speeds up word segmentation.") | ||
    else | ||
      wordgraph = @createWordGraph(sentense) | ||
      path = @findShortestPath(wordgraph) | ||
      result = @splitByPath(sentense,path) | ||
      return result | ||
|
||
  ## | ||
  # Create the word graph from a sentence by matching it against the trie | ||
  # @param {string} sentense - input | ||
  # @return {array} wordgraph | ||
  createWordGraph: (sentense)-> | ||
    graph = [] | ||
    isEng = /^[A-Za-z][A-Za-z0-9]*$/ | ||
    i = 0 | ||
    while i<sentense.length | ||
      graph[i] = {index:i,next:[]} | ||
      trie = @data.trie | ||
      j = 0 | ||
      if isEng.test(sentense[i]) # English: take the whole run of ASCII letters as one word | ||
        while i+j<sentense.length and isEng.test(sentense[i+j]) | ||
          j++ | ||
        graph[i].next.push(i+j) | ||
      else # Thai: follow the trie and add an edge for every dictionary word found | ||
        while trie | ||
          if trie.word and j > 0 # j > 0 guards against a zero-length self-loop | ||
            graph[i].next.push(i+j) | ||
          else if i+j+1 == sentense.length | ||
            graph[i].next.push(i+j+1) | ||
          trie = trie[sentense[i+j]] | ||
          j++ | ||
      if graph[i].next.length == 0 # unknown character: fall back to a one-character edge | ||
        graph[i].next.push(i+1) | ||
      i++ | ||
    graph[i] = {index:i,next:[],finish:true} | ||
    return graph | ||
|
||
  ## | ||
  # Find the shortest path through the word graph using SPFA | ||
  # @param {array} graph - wordgraph from createWordGraph | ||
  # @return {array} shortest path of the wordgraph | ||
  # @see https://en.wikipedia.org/wiki/Shortest_Path_Faster_Algorithm | ||
  findShortestPath: (graph)-> | ||
    out = [] | ||
    queue = [] | ||
    inQueue = {} | ||
    graph[0].dist = 0 | ||
    queue.push(0) | ||
    inQueue[0] = true | ||
    while queue.length > 0 | ||
      u = queue.shift() | ||
      inQueue[u] = false # u may be queued again if its distance improves | ||
      for v in graph[u].next | ||
        # "not dist?" means unvisited; a plain !dist would also match a distance of 0 | ||
        if not graph[v].dist? or graph[u].dist + 1 < graph[v].dist | ||
          graph[v].prev = graph[u] | ||
          graph[v].dist = graph[u].dist+1 | ||
          if !inQueue[v] | ||
            if queue.length > 0 and graph[queue[0]].dist > graph[v].dist | ||
              queue.unshift(v) | ||
            else | ||
              queue.push(v) | ||
            inQueue[v] = true | ||
    index = graph.length-1 | ||
    while index != 0 | ||
      out.unshift(index) | ||
      index = graph[index].prev.index | ||
    return out | ||
|
||
  ## | ||
  # Split the sentence into words along the path | ||
  # @param {string} sentense - input | ||
  # @param {array} path - shortest path from findShortestPath | ||
  # @return {array} word segmentation array | ||
  splitByPath: (sentense,path)-> | ||
    path.unshift(0) | ||
    out = [] | ||
    i=1 | ||
    prebuilt = "" # collects consecutive single characters that matched no word | ||
    while i<path.length | ||
      cWord=sentense.substring(path[i-1],path[i]) | ||
      if path[i]-path[i-1] == 1 | ||
        if cWord != " " | ||
          prebuilt+=cWord | ||
        else if prebuilt != "" | ||
          out.push(prebuilt) | ||
          prebuilt = "" | ||
      else | ||
        if prebuilt != "" | ||
          if out.length == 0 | ||
            out.push(prebuilt) | ||
          else | ||
            out[out.length-1]+=prebuilt | ||
          prebuilt = "" | ||
        out.push(cWord) | ||
      i++ | ||
    if prebuilt != "" | ||
      out.push prebuilt | ||
    return out | ||
|
||
  ## | ||
  # Read a file in Node.js | ||
  # | ||
  readFileNodeJS: (file,callback)-> | ||
    fs = require "fs" | ||
    fs.readFile(file,callback) | ||
|
||
  ## | ||
  # Read a file in the browser | ||
  # | ||
  readFileBrowser: (file,callback)-> | ||
    fileRequest = new XMLHttpRequest() | ||
    fileRequest.onreadystatechange = ()-> | ||
      if fileRequest.readyState == 4 and fileRequest.status == 200 | ||
        callback(undefined,fileRequest.responseText) | ||
      else if fileRequest.readyState == 4 | ||
        callback({message:"ENOENT: no such file or directory, open '#{file}' "}) | ||
    fileRequest.open("GET", file, true) | ||
    fileRequest.send() | ||
|
||
## | ||
# Export the module so it works on most JavaScript platforms | ||
if typeof module == "object" and module and typeof module.exports == "object" | ||
  module.exports = CutThai #support Node.js / IO.js / CommonJS | ||
else | ||
  window.CutThai = CutThai #support browser | ||
  if typeof define == "function" && define.amd | ||
    define 'CutThai', [], -> #support AMDjs | ||
      CutThai |