- /data (git-ignored) --- Directory for training data, test data, and augmented data (CSV format).
- /result (git-ignored) --- Directory for classification reports (XLSX format).
- /models (git-ignored) --- Directory for generated models.
- config.json (git-ignored) --- Configuration file.
- config.template.json --- configuration file template
- requirement.txt --- Environment requirements. (Strictly required with the newest Anaconda.)
- .gitignore --- gitignore
- Config.py --- Read configuration file.
- ProcessData.py --- Read CSV file.
- ReprocessData.py --- Read augmented CSV file and re-process new augmented data.
- MorphologicalAnalysis.py --- Morphological analysis with MeCab.
- GenModel.py --- Generate the word2vec model. (May also be used to generate other models.)
- main.py --- Run the program with `python main.py`.
- README.md --- You are reading this.
CSV file requirement
"Article1", "Category1"
"Article2", "Category1"
"Article3", "Category2"
...
CSV Example
"独女通信", dokujo-tsushin
"ITライフハック", it-life-hack
"家電チャンネル", kaden-channel
"livedoor HOMME", livedoor-homme
"MOVIE ENTER", movie-enter
"Peachy", peachy
"エスマックス", smax
"Sports Watch", sports-watch
"トピックニュース", topic-news
Article | Category |
---|---|
独女通信 | dokujo-tsushin |
ITライフハック | it-life-hack |
家電チャンネル | kaden-channel |
livedoor HOMME | livedoor-homme |
MOVIE ENTER | movie-enter |
Peachy | peachy |
エスマックス | smax |
Sports Watch | sports-watch |
トピックニュース | topic-news |
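
A file in this layout can be read with Python's standard `csv` module. The following is a minimal sketch, not the project's own ProcessData.py: the function name `read_labeled_csv` is hypothetical, and it assumes exactly two columns per row (article text, then category).

```python
import csv
import io


def read_labeled_csv(stream):
    """Return (articles, categories) from a two-column CSV stream.

    Hypothetical helper for illustration; not part of this repository.
    """
    articles, categories = [], []
    # skipinitialspace=True tolerates the space after the comma in the example.
    for row in csv.reader(stream, skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row!r}")
        articles.append(row[0])
        categories.append(row[1].strip())
    return articles, categories


# Two rows taken from the CSV example above.
sample = io.StringIO(
    '"独女通信", dokujo-tsushin\n'
    '"ITライフハック", it-life-hack\n'
)
articles, categories = read_labeled_csv(sample)
print(categories)
```

For a file on disk, pass `open(path, encoding="utf-8", newline="")` instead of the in-memory sample; the UTF-8 encoding is an assumption about how the data was saved.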
- result/result_report.xlsx --- Report including the confusion matrix, accuracy, precision, recall, and F-measure, computed with sklearn.metrics.
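
To make the reported numbers concrete, here is a dependency-free sketch of how such metrics are defined. It is not the project's code: the report itself comes from sklearn.metrics, whose functions such as `confusion_matrix` and `precision_recall_fscore_support` compute the same quantities. The helper names below are hypothetical, the labels are toy data, and the F-measure is assumed to be the usual F1 score.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs: a sparse confusion matrix."""
    return Counter(zip(y_true, y_pred))


def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for one class; 0.0 where undefined."""
    pairs = confusion_counts(y_true, y_pred)
    tp = pairs[(label, label)]
    predicted = sum(n for (_, p), n in pairs.items() if p == label)
    actual = sum(n for (t, _), n in pairs.items() if t == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels, invented for illustration only.
y_true = ["smax", "smax", "peachy", "peachy"]
y_pred = ["smax", "peachy", "peachy", "peachy"]
acc = accuracy(y_true, y_pred)
precision, recall, f1 = per_class_scores(y_true, y_pred, "peachy")
print(acc, precision, recall, f1)
```

In this toy case one `smax` article is misclassified as `peachy`, so `peachy` keeps perfect recall but loses precision.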
All you need to customize is `config.json`; `config.template.json` is the template.
You can run the program to generate the default structure.
- `train_path`, `test_path`, `model_path`, `augment_path`, `label`, `result_path`
- `corpus_*` for generating models.
- `method` to switch the method between tfidf, Countvector and tfidfvector.
- `dataaugment` to toggle word augmentation with the model option.
- Change `MeCab.Tagger("mecabrc")` in MorphologicalAnalysis.py if you want to use another dictionary.
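
As a rough illustration of working with these settings, the sketch below loads a JSON configuration and checks the keys named above. It is not the project's Config.py: `load_config` is a hypothetical helper, the keys are assumed to sit at the top level of the file, and every value in the example is invented. Use `config.template.json` as the authoritative reference for the real structure.

```python
import json
import tempfile
from pathlib import Path

# Key names come from this README; treating them as top-level is an assumption.
REQUIRED_KEYS = ("train_path", "test_path", "model_path",
                 "augment_path", "label", "result_path", "method")
METHODS = ("tfidf", "Countvector", "tfidfvector")


def load_config(path):
    """Load a JSON config and check the keys described in this README.

    Hypothetical helper; the project's Config.py may work differently.
    """
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"missing config keys: {missing}")
    if config["method"] not in METHODS:
        raise ValueError(f"unknown method: {config['method']!r}")
    return config


# Invented example values; they are placeholders, not the project's defaults.
example = {
    "train_path": "data/train.csv",
    "test_path": "data/test.csv",
    "model_path": "models/example.model",
    "augment_path": "data/augmented.csv",
    "label": "Category",
    "result_path": "result/result_report.xlsx",
    "method": "tfidf",
    "dataaugment": False,
}

with tempfile.TemporaryDirectory() as tmp:
    config_file = Path(tmp) / "config.json"
    config_file.write_text(json.dumps(example), encoding="utf-8")
    config = load_config(config_file)

print(config["method"], config["dataaugment"])
```

Validating the file up front surfaces a misspelled key or an unsupported `method` value before any data is processed.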
chiVe (Apache License 2.0)