Commit

Merge remote-tracking branch 'origin/develop'
mozillazg committed Jul 5, 2020
2 parents 09d12d1 + 6ffa37a commit 2496ed0
Showing 12 changed files with 764 additions and 84 deletions.
45 changes: 0 additions & 45 deletions .appveyor.yml

This file was deleted.

21 changes: 12 additions & 9 deletions .circleci/config.yml
@@ -18,14 +18,17 @@ jobs:
- checkout

# Download and cache dependencies
- restore_cache:
keys:
- v1-dependencies-{{ .Environment.TOX_ENV }}-{{ checksum "requirements_dev.txt" }}
# - restore_cache:
# keys:
# - v1-dependencies-{{ .Environment.TOX_ENV }}-{{ checksum "requirements_dev.txt" }}

- run:
name: install dependencies
command: |
pip install -U pip virtualenv --user
# pip install -U pip virtualenv --user
if ! which virtualenv; then
pip install 'virtualenv<=20.0.21' --user
fi
export PATH="~/.local/bin:$PATH"
virtualenv venv
@@ -41,11 +44,11 @@ jobs:
if [[ $(python -c "import sys; print(sys.stdin.encoding)" |grep None) ]]; then
export PYTHONIOENCODING=utf-8
fi
- save_cache:
paths:
- ./venv
key: v1-dependencies-{{ .Environment.TOX_ENV }}-{{ checksum "requirements_dev.txt" }}
#
# - save_cache:
# paths:
# - ./venv
# key: v1-dependencies-{{ .Environment.TOX_ENV }}-{{ checksum "requirements_dev.txt" }}

- run:
name: run tests
Expand Down
29 changes: 29 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,29 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: CI

on: [push, pull_request]

jobs:
build:

runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [windows-latest]
# python-version: [2.7, 3.5, 3.6, 3.7, 3.8]
python-version: [3.8]
tox-env: [py27, py35, py36, py37, py38]

steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install tox
- name: Test with tox
run: tox -e ${{ matrix.tox-env}}
14 changes: 11 additions & 3 deletions CHANGELOG.rst
@@ -1,6 +1,12 @@
Changelog
---------

`0.38.1`_ (2020-07-05)
++++++++++++++++++++++++

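* **[Improved]** Improved the built-in word segmentation to handle the case where prefix matching prevented a phrase at the end of the input from being recognized. Fixed `#205`_
* **[Improved]** Use the phrase pinyin data from `phrase-pinyin-data`_ v0.10.3.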


`0.38.0`_ (2020-06-07)
++++++++++++++++++++++++
@@ -181,14 +187,14 @@ Changelog
`0.26.0`_ (2017-10-12)
+++++++++++++++++++++++

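* **[Changed]** jieba is no longer called automatically for word segmentation; the built-in maximum matching segmentation module is called automatically instead
* **[Changed]** jieba is no longer called automatically for word segmentation; the built-in forward maximum matching segmentation module is called automatically instead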
(via `#102`_)


`0.25.0`_ (2017-10-01)
+++++++++++++++++++++++

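* **[New]** Added a built-in maximum matching segmentation module, trained on the built-in phrase pinyin dictionary,
* **[New]** Added a built-in forward maximum matching segmentation module, trained on the built-in phrase pinyin dictionary,
which fixes custom phrase dictionaries sometimes having no effect (because the phrase is not a known word in segmenters such as jieba). (via `#81`_)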


@@ -207,7 +213,7 @@ Changelog
>>> from pypinyin.contrib.mmseg import seg, retrain
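>>> retrain(seg) # this call can be skipped if load_phrases_dict was not used
>>> pinyin(seg.cut('了局啊')) # use the built-in maximum matching segmentation
>>> pinyin(seg.cut('了局啊')) # use the built-in forward maximum matching segmentation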
[['liǎo'], [''], ['a']]
>>>
@@ -760,6 +766,7 @@ __ https://github.com/mozillazg/python-pinyin/issues/8
.. _#170: https://github.com/mozillazg/python-pinyin/issues/170
.. _#174: https://github.com/mozillazg/python-pinyin/issues/174
.. _#139: https://github.com/mozillazg/python-pinyin/issues/139
.. _#205: https://github.com/mozillazg/python-pinyin/issues/205
.. _#164: https://github.com/mozillazg/python-pinyin/pull/164
.. _#176: https://github.com/mozillazg/python-pinyin/pull/176
.. _@hanabi1224: https://github.com/hanabi1224
@@ -840,3 +847,4 @@ __ https://github.com/mozillazg/python-pinyin/issues/8
.. _0.36.0: https://github.com/mozillazg/python-pinyin/compare/v0.35.4...v0.36.0
.. _0.37.0: https://github.com/mozillazg/python-pinyin/compare/v0.36.0...v0.37.0
.. _0.38.0: https://github.com/mozillazg/python-pinyin/compare/v0.37.0...v0.38.0
.. _0.38.1: https://github.com/mozillazg/python-pinyin/compare/v0.38.0...v0.38.1
6 changes: 3 additions & 3 deletions README.rst
@@ -1,7 +1,7 @@
Chinese character to pinyin conversion tool (Python version)
================================================================

|Build| |appveyor| |Coverage| |Pypi version| |DOI|
|Build| |GitHubAction| |Coverage| |Pypi version| |DOI|


Converts Chinese characters to pinyin. Useful for phonetic annotation, sorting, and searching of Chinese text (`Russian translation`_).
@@ -177,8 +177,8 @@ __ https://github.com/mozillazg/rust-pinyin

.. |Build| image:: https://img.shields.io/circleci/project/github/mozillazg/python-pinyin/master.svg
:target: https://circleci.com/gh/mozillazg/python-pinyin
.. |appveyor| image:: https://ci.appveyor.com/api/projects/status/ni8gdyextfa85yqo/branch/master?svg=true
:target: https://ci.appveyor.com/project/mozillazg/python-pinyin
.. |GitHubAction| image:: https://github.com/mozillazg/python-pinyin/workflows/CI/badge.svg
:target: https://github.com/mozillazg/python-pinyin/actions
.. |Coverage| image:: https://img.shields.io/codecov/c/github/mozillazg/python-pinyin/master.svg
:target: https://codecov.io/gh/mozillazg/python-pinyin
.. |PyPI version| image:: https://img.shields.io/pypi/v/pypinyin.svg
Expand Down
8 changes: 4 additions & 4 deletions README_en.rst
@@ -1,7 +1,7 @@
A tool for converting Chinese characters to pinyin (Python version)
=============================
=====================================================================

|Build| |appveyor| |Coverage| |Pypi version| |DOI|
|Build| |GitHubAction| |Coverage| |Pypi version| |DOI|


Takes Chinese characters and converts them to pinyin, zhuyin, and Cyrillic.
@@ -174,8 +174,8 @@ __ https://github.com/mozillazg/rust-pinyin

.. |Build| image:: https://img.shields.io/circleci/project/github/mozillazg/python-pinyin/master.svg
:target: https://circleci.com/gh/mozillazg/python-pinyin
.. |appveyor| image:: https://ci.appveyor.com/api/projects/status/ni8gdyextfa85yqo/branch/master?svg=true
:target: https://ci.appveyor.com/project/mozillazg/python-pinyin
.. |GitHubAction| image:: https://github.com/mozillazg/python-pinyin/workflows/CI/badge.svg
:target: https://github.com/mozillazg/python-pinyin/actions
.. |Coverage| image:: https://img.shields.io/codecov/c/github/mozillazg/python-pinyin/master.svg
:target: https://codecov.io/gh/mozillazg/python-pinyin
.. |PyPI version| image:: https://img.shields.io/pypi/v/pypinyin.svg
Expand Down
2 changes: 1 addition & 1 deletion phrase-pinyin-data
1 change: 1 addition & 0 deletions pypinyin/phrases_dict.py
@@ -37835,6 +37835,7 @@
'还淳反素': [['huán'], ['chún'], ['fǎn'], ['sù']],
'还淳返朴': [['huán'], ['chún'], ['fǎn'], ['pǔ']],
'还清': [['huán'], ['qīng']],
'还珠': [['huán'], ['zhū']],
'还珠买椟': [['huán'], ['zhū'], ['mǎi'], ['dú']],
'还珠合浦': [['huán'], ['zhū'], ['hé'], ['pǔ']],
'还珠返璧': [['huán'], ['zhū'], ['fǎn'], ['bì']],
Expand Down
35 changes: 23 additions & 12 deletions pypinyin/seg/mmseg.py
@@ -4,15 +4,16 @@


class Seg(object):
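"""Maximum forward matching segmentation
"""Forward maximum matching segmentation
:type prefix_set: PrefixSet
:param no_non_phrases: whether to segment strictly by known phrases, never treating a non-phrase string as a phrase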
:type no_non_phrases: bool
"""

def __init__(self, prefix_set):
def __init__(self, prefix_set, no_non_phrases=False):
self._prefix_set = prefix_set
# whether to segment strictly by known phrases, never treating a non-phrase string as a phrase
self._no_non_phrases = False
self._no_non_phrases = no_non_phrases

def cut(self, text):
"""Segment text into words
@@ -38,18 +39,29 @@ def cut(self, text):
yield matched
matched = ''
remain = remain[index:]
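else: # nothing was matched before
else: # nothing was matched before, or the match is not a real phrase
# in strict phrase mode, a string that is not a phrase is split into single characters
# yield the first character first and let the rest be segmented again,
# to handle prefix matching that prevents a phrase at the end of the input from being recognized,
# which gives simple support for reverse matching segmentation:
# known phrases: 金融寡头 行业
# input: 金融行业
# output: 金 融 行业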
if self._no_non_phrases:
for x in word:
yield x
yield word[0]
remain = remain[index + 2 - len(word):]
else:
yield word
remain = remain[index + 1:]
remain = remain[index + 1:]
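# a result was produced; restart matching on the remainder
matched = ''
break
else: # the whole text is a single phrase
yield remain
else: # the whole text is a single phrase, or it contains no phrase at all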
if self._no_non_phrases and remain not in PHRASES_DICT:
for x in remain:
yield x
else:
yield remain
break

def train(self, words):
@@ -99,8 +111,7 @@ def __contains__(self, key):
#: ['你好', ',', '我是', '中国人', ',', '我', '爱',
#: '我的', '祖国']
#: >>>
seg = Seg(p_set)
seg._no_non_phrases = True
seg = Seg(p_set, no_non_phrases=True)


def retrain(seg_instance):
Expand Down
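
Reviewer note: the flattened hunk above is hard to follow without indentation, so here is a minimal, self-contained sketch of the strict-mode behaviour it describes. It is not the library's code. The names `build_prefix_set` and `cut_strict` are hypothetical, the phrase set is passed in directly instead of using `PrefixSet` and `PHRASES_DICT`, and the end-of-input case is simplified. It assumes that when the longest prefix match is not itself a known phrase, only the first character is emitted and the rest is segmented again.

```python
def build_prefix_set(phrases):
    """Collect every non-empty prefix of every known phrase."""
    prefixes = set()
    for phrase in phrases:
        for end in range(1, len(phrase) + 1):
            prefixes.add(phrase[:end])
    return prefixes


def cut_strict(text, phrases):
    """Forward maximum matching that emits only known phrases or single characters.

    Hypothetical helper for illustration; `phrases` is a set of known phrases.
    """
    prefixes = build_prefix_set(phrases)
    result = []
    remain = text
    while remain:
        # Grow the candidate one character at a time while it is still
        # a prefix of some known phrase.
        end = 0
        while end < len(remain) and remain[:end + 1] in prefixes:
            end += 1
        candidate = remain[:end]
        if candidate in phrases:
            # The longest prefix match is a real phrase: emit it whole.
            result.append(candidate)
            remain = remain[end:]
        else:
            # Nothing matched, or the match is only a prefix of a longer phrase:
            # emit one character and let the rest take part in matching again.
            result.append(remain[0])
            remain = remain[1:]
    return result


phrases = {"金融寡头", "行业"}
print(cut_strict("金融行业", phrases))      # ['金', '融', '行业']
print(cut_strict("金融寡头行业", phrases))  # ['金融寡头', '行业']
```

Emitting only the first character, rather than every character of the failed match, is what lets `行业` at the end of `金融行业` be recognized. Like the hunk above, this sketch does not back off to a shorter real phrase inside a failed longer match.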
3 changes: 2 additions & 1 deletion pypinyin/seg/mmseg.pyi
@@ -7,8 +7,9 @@ class Seg(object):
"""Maximum forward matching segmentation
:type prefix_set: PrefixSet
:type no_non_phrases: bool
"""
def __init__(self, prefix_set: PrefixSet) -> None:
def __init__(self, prefix_set: PrefixSet, no_non_phrases: bool) -> None:
self._no_non_phrases = ... # type: bool
self._prefix_set = ... # type: PrefixSet
...
Expand Down