Merge remote-tracking branch 'origin/develop'

mozillazg · Jul 5, 2020 · commit 2496ed0
2 parents: 09d12d1 + 6ffa37a

Showing 12 changed files with 764 additions and 84 deletions.
diff --git a/.appveyor.yml b/.appveyor.yml
diff --git a/.circleci/config.yml b/.circleci/config.yml
@@ -18,14 +18,17 @@ jobs:
       - checkout
 
       # Download and cache dependencies
-      - restore_cache:
-          keys:
-          - v1-dependencies-{{ .Environment.TOX_ENV }}-{{ checksum "requirements_dev.txt" }}
+#      - restore_cache:
+#          keys:
+#          - v1-dependencies-{{ .Environment.TOX_ENV }}-{{ checksum "requirements_dev.txt" }}
 
       - run:
           name: install dependencies
           command: |
-            pip install -U pip virtualenv --user
+            # pip install -U pip virtualenv --user
+            if ! which virtualenv; then
+              pip install 'virtualenv<=20.0.21' --user
+            fi
             export PATH="$HOME/.local/bin:$PATH"
 
             virtualenv venv
@@ -41,11 +44,11 @@ jobs:
             if [[ $(python -c "import sys; print(sys.stdin.encoding)" |grep None) ]]; then
               export PYTHONIOENCODING=utf-8
             fi
-
-      - save_cache:
-          paths:
-            - ./venv
-          key: v1-dependencies-{{ .Environment.TOX_ENV }}-{{ checksum "requirements_dev.txt" }}
+#
+#      - save_cache:
+#          paths:
+#            - ./venv
+#          key: v1-dependencies-{{ .Environment.TOX_ENV }}-{{ checksum "requirements_dev.txt" }}
 
       - run:
           name: run tests

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -0,0 +1,29 @@
+# This workflow installs tox and runs the test suite for each tox environment in the matrix
+# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
+
+name: CI
+
+on: [push, pull_request]
+
+jobs:
+  build:
+
+    runs-on: ${{ matrix.os }}
+    strategy:
+      matrix:
+        os: [windows-latest]
+        # python-version: [2.7, 3.5, 3.6, 3.7, 3.8]
+        python-version: [3.8]
+        tox-env: [py27, py35, py36, py37, py38]
+
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Python ${{ matrix.python-version }}
+      uses: actions/setup-python@v2
+      with:
+        python-version: ${{ matrix.python-version }}
+    - name: Install dependencies
+      run: |
+        python -m pip install tox
+    - name: Test with tox
+      run: tox -e ${{ matrix.tox-env }}
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -1,6 +1,12 @@
 Changelog
 ---------
 
+`0.38.1`_ (2020-07-05)
+++++++++++++++++++++++++
+
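+* **[Improved]** Improved the built-in word segmentation to handle prefix matching that prevented phrases at the end of the input from being recognized correctly. Fixed `#205`_
+* **[Improved]** Use the phrase pinyin data from `phrase-pinyin-data`_ v0.10.3.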
+
 
 `0.38.0`_ (2020-06-07)
 ++++++++++++++++++++++++
@@ -181,14 +187,14 @@ Changelog
 `0.26.0`_ (2017-10-12)
 +++++++++++++++++++++++
 
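-* **[Changed]** No longer calls the jieba segmentation module automatically; the built-in maximum matching segmentation module is called automatically instead.
+* **[Changed]** No longer calls the jieba segmentation module automatically; the built-in forward maximum matching segmentation module is called automatically instead.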
   (via `#102`_)
 
 
 `0.25.0`_ (2017-10-01)
 +++++++++++++++++++++++
 
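-* **[New]** Added a built-in maximum matching segmentation module, trained on the built-in phrase pinyin dictionary,
+* **[New]** Added a built-in forward maximum matching segmentation module, trained on the built-in phrase pinyin dictionary,
   to fix custom phrase dictionaries sometimes not taking effect (because the phrase is not a known word in segmentation modules such as jieba). (via `#81`_)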
 
 
@@ -207,7 +213,7 @@ Changelog
 
       >>> from pypinyin.contrib.mmseg import seg, retrain
       >>> retrain(seg)   # not needed if load_phrases_dict has not been used
-      >>> pinyin(seg.cut('了局啊'))  # use the built-in maximum matching segmentation
+      >>> pinyin(seg.cut('了局啊'))  # use the built-in forward maximum matching segmentation
       [['liǎo'], ['jú'], ['a']]
       >>>
 
@@ -760,6 +766,7 @@ __ https://github.com/mozillazg/python-pinyin/issues/8
 .. _#170: https://github.com/mozillazg/python-pinyin/issues/170
 .. _#174: https://github.com/mozillazg/python-pinyin/issues/174
 .. _#139: https://github.com/mozillazg/python-pinyin/issues/139
+.. _#205: https://github.com/mozillazg/python-pinyin/issues/205
 .. _#164: https://github.com/mozillazg/python-pinyin/pull/164
 .. _#176: https://github.com/mozillazg/python-pinyin/pull/176
 .. _@hanabi1224: https://github.com/hanabi1224
@@ -840,3 +847,4 @@ __ https://github.com/mozillazg/python-pinyin/issues/8
 .. _0.36.0: https://github.com/mozillazg/python-pinyin/compare/v0.35.4...v0.36.0
 .. _0.37.0: https://github.com/mozillazg/python-pinyin/compare/v0.36.0...v0.37.0
 .. _0.38.0: https://github.com/mozillazg/python-pinyin/compare/v0.37.0...v0.38.0
+.. _0.38.1: https://github.com/mozillazg/python-pinyin/compare/v0.38.0...v0.38.1
diff --git a/README.rst b/README.rst
@@ -1,7 +1,7 @@
 A tool for converting Chinese characters to pinyin (Python version)
 =====================================================================
 
-|Build| |appveyor| |Coverage| |Pypi version| |DOI|
+|Build| |GitHubAction| |Coverage| |Pypi version| |DOI|
 
 
 Converts Chinese characters to pinyin. Useful for phonetic annotation, sorting, and searching of Chinese text (`Russian translation`_).
@@ -177,8 +177,8 @@ __ https://github.com/mozillazg/rust-pinyin
 
 .. |Build| image:: https://img.shields.io/circleci/project/github/mozillazg/python-pinyin/master.svg
    :target: https://circleci.com/gh/mozillazg/python-pinyin
-.. |appveyor| image:: https://ci.appveyor.com/api/projects/status/ni8gdyextfa85yqo/branch/master?svg=true
-   :target: https://ci.appveyor.com/project/mozillazg/python-pinyin
+.. |GitHubAction| image:: https://github.com/mozillazg/python-pinyin/workflows/CI/badge.svg
+   :target: https://github.com/mozillazg/python-pinyin/actions
 .. |Coverage| image:: https://img.shields.io/codecov/c/github/mozillazg/python-pinyin/master.svg
    :target: https://codecov.io/gh/mozillazg/python-pinyin
 .. |PyPI version| image:: https://img.shields.io/pypi/v/pypinyin.svg

diff --git a/README_en.rst b/README_en.rst
@@ -1,7 +1,7 @@
 A tool for converting Chinese characters to pinyin (Python version)
-=============================
+=====================================================================
 
-|Build| |appveyor| |Coverage| |Pypi version| |DOI|
+|Build| |GitHubAction| |Coverage| |Pypi version| |DOI|
 
 
 Takes Chinese characters and converts them to pinyin, zhuyin, and Cyrillic.
@@ -174,8 +174,8 @@ __ https://github.com/mozillazg/rust-pinyin
 
 .. |Build| image:: https://img.shields.io/circleci/project/github/mozillazg/python-pinyin/master.svg
    :target: https://circleci.com/gh/mozillazg/python-pinyin
-.. |appveyor| image:: https://ci.appveyor.com/api/projects/status/ni8gdyextfa85yqo/branch/master?svg=true
-   :target: https://ci.appveyor.com/project/mozillazg/python-pinyin
+.. |GitHubAction| image:: https://github.com/mozillazg/python-pinyin/workflows/CI/badge.svg
+   :target: https://github.com/mozillazg/python-pinyin/actions
 .. |Coverage| image:: https://img.shields.io/codecov/c/github/mozillazg/python-pinyin/master.svg
    :target: https://codecov.io/gh/mozillazg/python-pinyin
 .. |PyPI version| image:: https://img.shields.io/pypi/v/pypinyin.svg

diff --git a/phrase-pinyin-data b/phrase-pinyin-data
diff --git a/pypinyin/phrases_dict.py b/pypinyin/phrases_dict.py
@@ -37835,6 +37835,7 @@
     '还淳反素': [['huán'], ['chún'], ['fǎn'], ['sù']],
     '还淳返朴': [['huán'], ['chún'], ['fǎn'], ['pǔ']],
     '还清': [['huán'], ['qīng']],
+    '还珠': [['huán'], ['zhū']],
     '还珠买椟': [['huán'], ['zhū'], ['mǎi'], ['dú']],
     '还珠合浦': [['huán'], ['zhū'], ['hé'], ['pǔ']],
     '还珠返璧': [['huán'], ['zhū'], ['fǎn'], ['bì']],

diff --git a/pypinyin/seg/mmseg.py b/pypinyin/seg/mmseg.py
@@ -4,15 +4,16 @@
 
 
 class Seg(object):
-    """Maximum forward matching segmentation
+    """Forward maximum matching segmentation
 
     :type prefix_set: PrefixSet
+    :param no_non_phrases: whether to segment strictly by known phrases, never treating a non-phrase as a phrase
+    :type no_non_phrases: bool
     """
 
-    def __init__(self, prefix_set):
+    def __init__(self, prefix_set, no_non_phrases=False):
         self._prefix_set = prefix_set
-        # whether to segment strictly by known phrases, never treating a non-phrase as a phrase
-        self._no_non_phrases = False
+        self._no_non_phrases = no_non_phrases
 
     def cut(self, text):
         """Segment text into words
@@ -38,18 +39,29 @@ def cut(self, text):
                         yield matched
                         matched = ''
                         remain = remain[index:]
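-                    else:  # nothing matched before this point
+                    else:  # nothing matched before this point, or the match is not a real phrase
                         # in strict phrase mode, split non-phrases into single characters
+                        # yield the first character and re-segment the rest,
+                        # so that prefix matching no longer hides phrases at the end of the input;
+                        # this supports a simple form of backward matching:
+                        #   known phrases: 金融寡头 行业
+                        #   input: 金融行业
+                        #   output: 金 融 行业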
                         if self._no_non_phrases:
-                            for x in word:
-                                yield x
+                            yield word[0]
+                            remain = remain[index + 2 - len(word):]
                         else:
                             yield word
-                        remain = remain[index + 1:]
+                            remain = remain[index + 1:]
                     # got a result; restart matching on the remainder
+                    matched = ''
                     break
-            else:  # the whole text is a single phrase
-                yield remain
+            else:  # the whole text is a single phrase, or contains no phrases at all
+                if self._no_non_phrases and remain not in PHRASES_DICT:
+                    for x in remain:
+                        yield x
+                else:
+                    yield remain
                 break
 
     def train(self, words):
@@ -99,8 +111,7 @@ def __contains__(self, key):
 #:     ['你好', '，', '我是', '中国人', '，', '我', '爱',
 #:      '我的', '祖国']
 #:     >>>
-seg = Seg(p_set)
-seg._no_non_phrases = True
+seg = Seg(p_set, no_non_phrases=True)
 
 
 def retrain(seg_instance):
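
Reviewer note: the fallback described in the comments of this hunk (yield the first character, then let the rest take part in segmentation again) can be illustrated with a minimal sketch. This is not the `Seg.cut` implementation. `forward_max_match` is a hypothetical helper that assumes a plain Python `set` of phrases instead of `PrefixSet` and `PHRASES_DICT`, and it only aims to reproduce the 金融行业 example from the comments.

```python
def forward_max_match(text, phrases):
    """Greedy forward maximum matching over a plain set of phrases.

    Hypothetical illustration only; not the pypinyin implementation.
    """
    # Longest phrase length bounds how far ahead we need to look.
    max_len = max((len(p) for p in phrases), default=1)
    result = []
    pos = 0
    while pos < len(text):
        longest = min(max_len, len(text) - pos)
        # Try the longest candidate first, shrinking down to two characters.
        for size in range(longest, 1, -1):
            candidate = text[pos:pos + size]
            if candidate in phrases:
                result.append(candidate)
                pos += size
                break
        else:
            # No phrase starts here: emit one character and retry
            # from the next position, so later phrases are still found.
            result.append(text[pos])
            pos += 1
    return result


print(forward_max_match('金融行业', {'金融寡头', '行业'}))
# → ['金', '融', '行业']
```

Emitting a single character instead of the whole unmatched prefix is what lets 行业 be recognized even though 金融 is only a prefix of 金融寡头.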

diff --git a/pypinyin/seg/mmseg.pyi b/pypinyin/seg/mmseg.pyi
@@ -7,8 +7,9 @@ class Seg(object):
     """Maximum forward matching segmentation
 
     :type prefix_set: PrefixSet
+    :type no_non_phrases: bool
     """
-    def __init__(self, prefix_set: PrefixSet) -> None:
+    def __init__(self, prefix_set: PrefixSet, no_non_phrases: bool = ...) -> None:
         self._no_non_phrases = ...  # type: bool
         self._prefix_set = ...  # type: PrefixSet
         ...
+1 −1		.bumpversion.cfg
+8 −1		CHANGELOG.md
+1 −1		large_pinyin.txt
+1 −1		merge.py
+1 −0		overwrite.txt
+2 −1		pinyin.txt