

Step-by-step part-of-speech tagging of text files fails #65

Open
Hz-EMW opened this issue Sep 15, 2018 · 1 comment

Comments


Hz-EMW commented Sep 15, 2018

Dear Dr. Qin, I have run into trouble during part-of-speech tagging and would appreciate your help. The details are as follows:
First, environment information
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936
[2] LC_CTYPE=Chinese (Simplified)_China.936
[3] LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_China.936

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] jiebaR_0.9.99 jiebaRD_0.1 chinese.misc_0.1.9

loaded via a namespace (and not attached):
[1] compiler_3.5.1 magrittr_1.5 parallel_3.5.1 tools_3.5.1 NLP_0.1-11
[6] yaml_2.2.0 Rcpp_0.12.18 slam_0.1-43 xml2_1.2.0 stringi_1.1.7
[11] tm_0.7-5 Ruchardet_0.0-3 rlang_0.2.2 purrr_0.2.5
Second, the complete error messages
Warning messages:
1: In segment(itext, analyzer, mod = "mix") :
In file mode, only the first element will be processed.
2: In readLines(input.r, n = lines, encoding = encoding) : incomplete final line found on 'E:/201803D/0910ontosim/texttest/鍩轰簬鏂囩尞璁¢噺瀛︾殑鍥介檯鐏北鐢熸€佸鐮旂┒鎬佸娍鍒嗘瀽_榄忔檽闆?segment.2018-09-15_23_44_30.txt'

Error in file_coding(code[1]) : Cannot open file
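The first warning points at a likely root cause: when `segment()` receives file paths, jiebaR runs in file mode and only processes the first element of the vector. A minimal workaround sketch (folder path taken from this issue, worker options simplified) loops over the files one at a time:

```r
library(jiebaR)

# The folder path below is the one from this issue
txt_files <- list.files("E:/201803D/0910ontosim/texttest",
                        pattern = "\\.txt$", full.names = TRUE)

seg_worker <- worker(type = "mix", bylines = TRUE, symbol = FALSE)

# File mode handles a single path per call, so iterate explicitly;
# each call writes a "<name>.segment.<timestamp>.txt" file next to its
# input (the naming pattern visible in the warning above), and lapply
# keeps whatever segment() returns for each file
out_files <- lapply(txt_files, function(f) segment(f, seg_worker))
```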
Third, the minimal reproducible code and data files, and which step raises the error

```r
# Text processing and analysis
ifolder <- "E:/201803D/0910ontosim/texttest"
itext <- list.files(ifolder, pattern = ".txt", all.files = FALSE,
                    recursive = TRUE, include.dirs = FALSE, full.names = TRUE)

# Tagging
library(jiebaR)
analyzer <- worker(type = "mix", dict = DICTPATH, hmm = HMMPATH,
                   user = "E:/2017DN/data/custom.dict",
                   stop_word = "E:/2017DN/data/stopwords.txt",
                   write = TRUE, qmax = 20, topn = 5, encoding = "UTF-8",
                   detect = TRUE, symbol = FALSE, lines = 1e+05,
                   output = NULL, bylines = TRUE, user_weight = "max")
textseg <- segment(itext, analyzer, mod = "mix")
tokenizer <- worker("tag")
pos_tag <- tagging(textseg, tokenizer)
```

Fourth, what I have tried, and the possible root cause
Passing the text as a character string to the segmenting and tagging workers runs without error.
One-step part-of-speech tagging also runs without error. (But I do not know whether a third-party dictionary can be used there; tagging specialist documents depends heavily on domain vocabulary.)
Switching back to text-file input, the tagging step after segmentation fails: it still reports that the file cannot be read, and the second (tagged) segmentation file is never produced.
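On the parenthetical question: a user dictionary appears to be accepted by a "tag" worker as well, so one-step tagging need not give up domain vocabulary. A minimal sketch, assuming the custom dictionary path mentioned in this issue:

```r
library(jiebaR)

# worker(type = "tag") also takes a user dictionary, so terms from
# custom.dict should be segmented as units and tagged in a single step
tagger <- worker(type = "tag", user = "E:/2017DN/data/custom.dict")
tagging("这是一个测试句子", tagger)
```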

@BruceZhaoR

@Hz-EMW

1. Make sure the file paths contain no Chinese characters (you can also use normalizePath(fs::dir_ls("E:/201803D/0910ontosim/texttest", glob = "*.txt")))
2. Make sure the files are UTF-8 encoded, or specify the encoding explicitly when reading them
3. Domain-specific terms can be added to the user-defined dictionary
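The three suggestions above can be combined into one sketch; the ASCII-only folder below is an assumption (files copied there first), not a path from the original machine:

```r
library(jiebaR)

# 1) normalized, ASCII-only paths (files assumed copied to E:/texttest_ascii)
txt_files <- normalizePath(list.files("E:/texttest_ascii",
                                      pattern = "\\.txt$", full.names = TRUE),
                           winslash = "/")

# 2) read with an explicit encoding rather than relying on auto-detection
raw_lines <- readLines(txt_files[1], encoding = "UTF-8", warn = FALSE)

# 3) segment in string mode, with domain terms in the user dictionary
seg <- worker(type = "mix", user = "E:/2017DN/data/custom.dict",
              bylines = TRUE)
tokens <- lapply(raw_lines, function(x) segment(x, seg))
```

String mode sidesteps the output-file round trip entirely, which is where the mojibake file name in the warning above comes from.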
