Step-by-step part-of-speech tagging fails on text files #65

Hz-EMW · 2018-09-15T17:14:35Z

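Dear Dr. Qin, I ran into a problem with part-of-speech tagging and would appreciate your help. The details are as follows.
1. Environment information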
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936
[2] LC_CTYPE=Chinese (Simplified)_China.936
[3] LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_China.936

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] jiebaR_0.9.99 jiebaRD_0.1 chinese.misc_0.1.9

loaded via a namespace (and not attached):
[1] compiler_3.5.1 magrittr_1.5 parallel_3.5.1 tools_3.5.1 NLP_0.1-11
[6] yaml_2.2.0 Rcpp_0.12.18 slam_0.1-43 xml2_1.2.0 stringi_1.1.7
[11] tm_0.7-5 Ruchardet_0.0-3 rlang_0.2.2 purrr_0.2.5
2. Full error output
Warning messages:
1: In segment(itext, analyzer, mod = "mix") :
In file mode, only the first element will be processed.
2: In readLines(input.r, n = lines, encoding = encoding) : incomplete final line found on 'E:/201803D/0910ontosim/texttest/鍩轰簬鏂囩尞璁￠噺瀛︾殑鍥介檯鐏北鐢熸€佸鐮旂┒鎬佸娍鍒嗘瀽_榄忔檽闆?segment.2018-09-15_23_44_30.txt'

Error in file_coding(code[1]) : Cannot open file
3. Minimal reproducible code and data source files, and the step at which the error occurs

# Text Processing and Analysis

ifolder <- "E:/201803D/0910ontosim/texttest"
itext <- list.files(ifolder, pattern = ".txt", all.files = FALSE,
                    recursive = TRUE, include.dirs = FALSE, full.names = TRUE)

# tagging

library(jiebaR)
analyzer <- worker(type = "mix", dict = DICTPATH, hmm = HMMPATH,
                   user = "E:/2017DN/data/custom.dict",
                   stop_word = "E:/2017DN/data/stopwords.txt",
                   write = TRUE, qmax = 20, topn = 5, encoding = "UTF-8",
                   detect = TRUE, symbol = FALSE, lines = 1e+05,
                   output = NULL, bylines = TRUE, user_weight = "max")
textseg <- segment(itext, analyzer, mod = "mix")
tokenizer <- worker("tag")
pos_tag <- tagging(textseg, tokenizer)

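4. What I have tried, and the possible root cause
I tested passing a character string to the segmentation and tagging workers, and it ran without errors.
I tested one-step part-of-speech tagging, and it also ran without errors. (However, I do not know whether a third-party dictionary can be used with it; tagging specialized documents depends heavily on domain vocabulary.)
After switching back to text files, the tagging step that follows segmentation fails. It still reports that the file cannot be read, and the second, tagged output file is not generated.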

BruceZhaoR · 2018-09-22T06:36:17Z

@Hz-EMW

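1. Make sure the file paths contain no Chinese characters (you can also use normalizePath(fs::dir_ls("E:/201803D/0910ontosim/texttest",glob = "*.txt")))
2. Make sure the files are encoded as UTF-8, or specify the encoding when reading them
3. You can add domain-specific terms to the user-defined dictionary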
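
The three points above could be combined roughly as follows. This is only a minimal sketch, not a confirmed fix for this issue: the helper names `read_utf8`, `stage_ascii_copy`, and `tag_files` are hypothetical, the source encoding passed as `from` is an assumption you need to check against your own files, and the text is read into memory and passed to `tagging()` as a string, which the original post reports as working.

```r
# Hypothetical helpers; none of these names come from jiebaR itself.

# Point 2: read a file and convert its lines to UTF-8.
# `from` is the encoding the file is assumed to be stored in.
read_utf8 <- function(path, from = "UTF-8") {
  raw_lines <- readLines(path, warn = FALSE)
  iconv(raw_lines, from = from, to = "UTF-8")
}

# Point 1: copy files to ASCII-only file names.
# Note: `dir` itself must also be free of non-ASCII characters;
# pass an explicit directory if your temp directory is not.
stage_ascii_copy <- function(paths, dir = tempfile("texts_")) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(dir, sprintf("doc_%03d.txt", seq_along(paths)))
  ok <- file.copy(paths, dest, overwrite = TRUE)
  stopifnot(all(ok))
  stats::setNames(dest, paths)  # names keep the original paths
}

# Point 3: tag each file with a "tag" worker, optionally loading a
# user dictionary that contains the domain vocabulary.
tag_files <- function(paths, user_dict = NULL, from = "UTF-8") {
  if (!requireNamespace("jiebaR", quietly = TRUE)) {
    stop("Package 'jiebaR' is required for tagging.")
  }
  tagger <- if (is.null(user_dict)) {
    jiebaR::worker(type = "tag")
  } else {
    jiebaR::worker(type = "tag", user = user_dict)
  }
  lapply(paths, function(p) {
    # Collapse to one string so the worker receives text, not a file path.
    text <- paste(read_utf8(p, from), collapse = "\n")
    jiebaR::tagging(text, tagger)
  })
}
```

With the variables from the original post, a call might look like `tag_files(stage_ascii_copy(itext), user_dict = "E:/2017DN/data/custom.dict")`. The result is a list with one tagged character vector per file, so no intermediate segmentation file has to be written or reopened.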
