The library implements the following components:
- symbols - symbol-level processing component;
- encode - UNICODE and single-byte encoding implementation;
- morpho - morphological analysis implementation for the English and Russian languages;
- automata - fast in-memory finite state machine implementation;
- utility - a set of utility classes and routines suitable for language processing.
The library is implemented in C++ 2003 for broader compatibility.
The shell script ubuntu_requirements.sh installs all libraries required to build the project. In fact, the only dependencies are the Boost and liblog4cplus libraries.
The examples directory contains usage examples for the library. Below is a shell script example that builds the release version of the project.
git clone [email protected]:merfill/strutext.git
cd strutext
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release .. && make -j20 && make test
The library implements UNICODE symbol classification routines. There are the following types of classes:
- Classes inherited from UNICODE.
- Upper and Lower subclasses of the Letter symbol class.
The library is located in the symbols subdirectory.
Possible classes are:
- Control: any symbol that may be used to control text output.
- Letter: letters in any language.
- Number: digits in any language.
- Separator: a symbol that may be used as a separator in text (a space, for example).
- Punctuator: a punctuation symbol such as "!", "," or ";".
- Mark: for example, the enclosing square used to mark the end of a proof.
- Symbol: for example, the '$' sign.
The classes may also have subclasses. See the type definition for more information. The SymbolInfo type provides the symbol class together with some extra information that depends on the class. For instance, for the Letter class the type contains the codes of the lower and upper variants of the letter.
typedef uint32_t SymbolCode;
enum SymbolClass {
UPPERCASE_LETTER = 0x00000001,
LOWERCASE_LETTER = 0x00000002,
TITLECASE_LETTER = 0x00000004,
CASED_LETTER = UPPERCASE_LETTER | LOWERCASE_LETTER | TITLECASE_LETTER,
MODIFIER_LETTER = 0x00000008,
OTHER_LETTER = 0x00000010,
LETTER = CASED_LETTER | MODIFIER_LETTER | OTHER_LETTER,
NONSPACING_MARK = 0x00000020,
SPACING_MARK = 0x00000040,
ENCLOSING_MARK = 0x00000080,
MARK = NONSPACING_MARK | SPACING_MARK | ENCLOSING_MARK,
DECIMAL_NUMBER = 0x00000100,
LETTER_NUMBER = 0x00000200,
OTHER_NUMBER = 0x00000400,
NUMBER = DECIMAL_NUMBER | LETTER_NUMBER | OTHER_NUMBER,
CONNECTOR_PUNCTUATION = 0x00000800,
DASH_PUNCTUATION = 0x00001000,
OPEN_PUNCTUATION = 0x00002000,
CLOSE_PUNCTUATION = 0x00004000,
INITIAL_PUNCTUATION = 0x00008000,
FINAL_PUNCTUATION = 0x00010000,
OTHER_PUNCTUATION = 0x00020000,
PUNCTUATION = CONNECTOR_PUNCTUATION | DASH_PUNCTUATION | OPEN_PUNCTUATION | CLOSE_PUNCTUATION
| INITIAL_PUNCTUATION | FINAL_PUNCTUATION | OTHER_PUNCTUATION,
MATH_SYMBOL = 0x00040000,
CURRENCY_SYMBOL = 0x00080000,
MODIFIER_SYMBOL = 0x00100000,
OTHER_SYMBOL = 0x00200000,
SYMBOL = MATH_SYMBOL | CURRENCY_SYMBOL | MODIFIER_SYMBOL | OTHER_SYMBOL,
SPACE_SEPARATOR = 0x00400000,
LINE_SEPARATOR = 0x00800000,
PARAGRAPH_SEPARATOR = 0x01000000,
SEPARATOR = SPACE_SEPARATOR | LINE_SEPARATOR | PARAGRAPH_SEPARATOR,
CONTROL = 0x02000000,
FORMAT = 0x04000000,
SURROGATE = 0x08000000,
PRIVATE_USE = 0x10000000,
UNASSIGNED = 0x20000000,
OTHER = CONTROL | FORMAT | SURROGATE | PRIVATE_USE | UNASSIGNED
};
// Get class of the symbol
inline const uint32_t& GetSymbolClass(const SymbolCode& code);
// Check whether the symbol belongs to the specified class
template<SymbolClass class_name>
inline bool Is(const SymbolCode& code);
// Definitions for most important classes
inline bool IsCasedLetter(const SymbolCode& code);
inline bool IsLetter(const SymbolCode& code);
inline bool IsMark(const SymbolCode& code);
inline bool IsNumber(const SymbolCode& code);
inline bool IsPunctuation(const SymbolCode& code);
inline bool IsSymbol(const SymbolCode& code);
inline bool IsSeparator(const SymbolCode& code);
inline bool IsOther(const SymbolCode& code);
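Because every class is a bit flag and composite classes are unions of flags, a membership test can be a single bitwise AND. The sketch below illustrates that idea only. The names SketchClass, SketchGetClass and SketchIs are hypothetical, and the lookup is a stand-in that handles ASCII letters and digits, not the library's table.

```cpp
#include <stdint.h>

// Illustrative subset of the SymbolClass bit flags listed above.
enum SketchClass {
  SK_UPPERCASE_LETTER = 0x00000001,
  SK_LOWERCASE_LETTER = 0x00000002,
  SK_TITLECASE_LETTER = 0x00000004,
  SK_CASED_LETTER     = SK_UPPERCASE_LETTER | SK_LOWERCASE_LETTER | SK_TITLECASE_LETTER,
  SK_DECIMAL_NUMBER   = 0x00000100
};

// Hypothetical stand-in for the class lookup: ASCII letters and digits only.
inline uint32_t SketchGetClass(uint32_t code) {
  if (code >= 'A' && code <= 'Z') return SK_UPPERCASE_LETTER;
  if (code >= 'a' && code <= 'z') return SK_LOWERCASE_LETTER;
  if (code >= '0' && code <= '9') return SK_DECIMAL_NUMBER;
  return 0;
}

// A symbol belongs to a (possibly composite) class if the masks share a bit.
template <SketchClass class_name>
inline bool SketchIs(uint32_t code) {
  return (SketchGetClass(code) & static_cast<uint32_t>(class_name)) != 0;
}
```

With this layout, testing against a composite class such as SK_CASED_LETTER costs the same as testing against a single flag.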
The Letter symbol class has two subclasses that contain the lower and upper variants of a letter, where applicable. The library provides routines to convert letters to their lower and upper representations and to determine whether a letter is in upper or lower case.
bool IsCasedLetter(const SymbolCode& code);
SymbolCode ToLower(const SymbolCode& code);
SymbolCode ToUpper(const SymbolCode& code);
The class system is based on the UNICODE symbol class definition. The UNICODE consortium provides symbol definitions in a file called UnicodeData.txt. This is a CSV-like file with semicolon-separated fields, where each line contains one symbol definition with its class and mappings to related symbols (for instance, the upper case variant of a lower case letter). Below are several lines from the beginning of the file:
0007;<control>;Cc;0;BN;;;;;N;BELL;;;;
0008;<control>;Cc;0;BN;;;;;N;BACKSPACE;;;;
0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;
000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;
000B;<control>;Cc;0;S;;;;;N;LINE TABULATION;;;;
000C;<control>;Cc;0;WS;;;;;N;FORM FEED (FF);;;;
Example of English symbol definitions:
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
0043;LATIN CAPITAL LETTER C;Lu;0;L;;;;;N;;;;0063;
0044;LATIN CAPITAL LETTER D;Lu;0;L;;;;;N;;;;0064;
0045;LATIN CAPITAL LETTER E;Lu;0;L;;;;;N;;;;0065;
0046;LATIN CAPITAL LETTER F;Lu;0;L;;;;;N;;;;0066;
0047;LATIN CAPITAL LETTER G;Lu;0;L;;;;;N;;;;0067;
...
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0062;LATIN SMALL LETTER B;Ll;0;L;;;;;N;;;0042;;0042
0063;LATIN SMALL LETTER C;Ll;0;L;;;;;N;;;0043;;0043
0064;LATIN SMALL LETTER D;Ll;0;L;;;;;N;;;0044;;0044
0065;LATIN SMALL LETTER E;Ll;0;L;;;;;N;;;0045;;0045
0066;LATIN SMALL LETTER F;Ll;0;L;;;;;N;;;0046;;0046
0067;LATIN SMALL LETTER G;Ll;0;L;;;;;N;;;0047;;0047
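To make the layout of these lines concrete, here is a minimal sketch that splits one line on ';' and picks out the code point, the general category and the simple upper and lower case mappings (fields 0, 2, 12 and 13 in the UnicodeData.txt format). It is written in C++ for consistency with the rest of this document and is not the library's generator. The names UnicodeRecord, SplitFields, ParseHex and ParseUnicodeDataLine are illustrative.

```cpp
#include <stdint.h>
#include <cstdlib>
#include <string>
#include <vector>

// Fields of interest from one UnicodeData.txt line (names are illustrative).
struct UnicodeRecord {
  uint32_t code;         // field 0: code point (hex)
  std::string category;  // field 2: general category, e.g. "Lu"
  uint32_t upper;        // field 12: simple uppercase mapping, 0 if empty
  uint32_t lower;        // field 13: simple lowercase mapping, 0 if empty
};

// Split a line on ';', keeping empty fields.
inline std::vector<std::string> SplitFields(const std::string& line) {
  std::vector<std::string> fields;
  std::string::size_type start = 0;
  for (;;) {
    std::string::size_type pos = line.find(';', start);
    if (pos == std::string::npos) {
      fields.push_back(line.substr(start));
      break;
    }
    fields.push_back(line.substr(start, pos - start));
    start = pos + 1;
  }
  return fields;
}

// Parse a hexadecimal field; an empty field yields 0.
inline uint32_t ParseHex(const std::string& s) {
  return s.empty() ? 0 : static_cast<uint32_t>(std::strtoul(s.c_str(), 0, 16));
}

// Returns false if the line has fewer than 15 fields.
inline bool ParseUnicodeDataLine(const std::string& line, UnicodeRecord& rec) {
  const std::vector<std::string> f = SplitFields(line);
  if (f.size() < 15) return false;
  rec.code = ParseHex(f[0]);
  rec.category = f[2];
  rec.upper = ParseHex(f[12]);
  rec.lower = ParseHex(f[13]);
  return true;
}
```

For the LATIN CAPITAL LETTER A line above, this yields code 0x41, category Lu and lower case mapping 0x61.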
The library includes a Python script that reads UnicodeData.txt at the precompilation stage and generates three arrays:
extern uint32_t SYM_CLASS_TABLE[];
extern SymbolCode SYM_UPPER_TABLE[];
extern SymbolCode SYM_LOWER_TABLE[];
for the symbol class and the upper and lower letter cases, respectively.
For each symbol code, the class can be determined in constant time, and cased letters can be quickly converted to upper or lower case. The arrays are loaded into memory when the program runs, which increases memory usage. However, each array is no larger than 4 MB, so the total of 12 MB does not seem too high a price for the fast implementation.
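The constant-time claim follows from indexing an array directly by symbol code. The sketch below shows the pattern with tiny stand-in tables that cover ASCII only. The table names, the 128-entry size and the identity mapping for symbols without a case pair are assumptions made for illustration, not the contents of the generated arrays.

```cpp
#include <stdint.h>

typedef uint32_t SymbolCode;

// Tiny stand-in tables covering ASCII only; the generated tables are assumed
// to be indexed the same way over a much larger code range.
static const uint32_t kSketchSize = 128;
static SymbolCode g_sketch_lower[kSketchSize];
static SymbolCode g_sketch_upper[kSketchSize];

// Fill the tables once: identity everywhere, case pairs for A-Z and a-z.
inline void InitSketchTables() {
  for (uint32_t c = 0; c < kSketchSize; ++c) {
    g_sketch_lower[c] = c;
    g_sketch_upper[c] = c;
  }
  for (uint32_t c = 'A'; c <= 'Z'; ++c) {
    g_sketch_lower[c] = c + 32;
    g_sketch_upper[c + 32] = c;
  }
}

// Constant-time lookups: one bounds check and one array read.
inline SymbolCode SketchToLower(SymbolCode code) {
  return code < kSketchSize ? g_sketch_lower[code] : code;
}

inline SymbolCode SketchToUpper(SymbolCode code) {
  return code < kSketchSize ? g_sketch_upper[code] : code;
}
```

The trade-off is the one described above: the lookup needs no branching on symbol ranges, at the cost of keeping the whole table in memory.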
The encode library implements routines for manipulating symbols in different encodings.
The library is located in the encode subdirectory.
The main component of the library is the template class Utf8Iterator, which is defined in the utf8_iterator.h file. The class is based on boost::iterator_facade and thus can be used in code as an ordinary C++ STL-style iterator. The class extracts UNICODE symbols from a byte sequence encoded in UTF-8. The template parameter of the class is ByteIterator, which is used to read bytes from the input sequence.
The extracted symbol is stored in a structure named Utf8Symbol.
struct Utf8Symbol {
...
uint8_t chain_[6]; // UTF-8 byte sequence read.
size_t len_; // The length of UTF-8 sequence.
uint32_t utf32_; // UTF-32 symbol code.
};
Here utf32_ contains the UNICODE code of the extracted symbol, GetChain() returns the byte sequence of the extracted symbol, and GetChainLen() returns the length of that byte sequence. Utf8Symbol also implements a cast to the uint32_t type, so the UNICODE code can be obtained directly from the iterator.
Below is an example of extracting UNICODE symbols from the byte sequence of a Russian text contained in a std::string object.
typedef strutext::encode::Utf8Iterator<std::string::const_iterator> Utf8Iterator;
std::string input = "Мама мыла раму";
for (Utf8Iterator it = Utf8Iterator(input.begin(), input.end()); it != Utf8Iterator(); ++it) {
std::cout << *it << ":";
for (uint32_t id = 0; id < it.GetChainLen(); ++id) {
std::cout << " " << it.GetChain()[id];
}
std::cout << "\n";
}
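To illustrate what decoding a single symbol involves, here is a minimal standalone sketch that turns one UTF-8 byte sequence into a code point and reports its length. It is not the Utf8Iterator implementation. DecodeUtf8Sketch is a hypothetical name, and the sketch is deliberately simplified: it does not reject overlong forms or surrogate code points.

```cpp
#include <stdint.h>
#include <cstddef>
#include <string>

// Decode one UTF-8 sequence starting at text[pos].
// On success stores the code point in `code`, the sequence length in `len`,
// and returns true. Returns false on truncated or malformed input.
inline bool DecodeUtf8Sketch(const std::string& text, std::size_t pos,
                             uint32_t& code, std::size_t& len) {
  if (pos >= text.size()) return false;
  const unsigned char lead = static_cast<unsigned char>(text[pos]);
  if (lead < 0x80)                { code = lead;        len = 1; }
  else if ((lead & 0xE0) == 0xC0) { code = lead & 0x1F; len = 2; }
  else if ((lead & 0xF0) == 0xE0) { code = lead & 0x0F; len = 3; }
  else if ((lead & 0xF8) == 0xF0) { code = lead & 0x07; len = 4; }
  else return false;  // stray continuation byte or invalid lead byte
  if (pos + len > text.size()) return false;  // sequence is cut off
  for (std::size_t i = 1; i < len; ++i) {
    const unsigned char next = static_cast<unsigned char>(text[pos + i]);
    if ((next & 0xC0) != 0x80) return false;  // not a continuation byte
    code = (code << 6) | (next & 0x3F);       // append 6 payload bits
  }
  return true;
}
```

The lead byte determines the sequence length, and each continuation byte contributes six bits. For the Cyrillic letter М (bytes 0xD0 0x9C) the result is code 0x041C with length 2.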
The file utf8_generator.h contains a routine that generates a UTF-8 sequence from a UNICODE symbol code. The routine is declared as follows:
template <typename ByteIterator>
inline ByteIterator GetUtf8Sequence(strutext::symbols::SymbolCode code, ByteIterator oi);
template <typename Utf32Iterator, typename ByteIterator>
inline ByteIterator GetUtf8Sequence(Utf32Iterator begin, Utf32Iterator end, ByteIterator oi);
The function takes a symbol code as input and writes the UTF-8 sequence to the passed output byte iterator. The second version of the routine takes a sequence of UNICODE symbols as input. Below is a code example:
std::string result;
strutext::encode::GetUtf8Sequence(0x41, std::back_inserter(result));
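For readers unfamiliar with the encoding itself, the following sketch shows how a code point can be written as UTF-8 through an output iterator, following the same calling pattern as above. It is not the body of GetUtf8Sequence. EncodeUtf8Sketch is a hypothetical name, and the sketch assumes the code is at most 0x10FFFF and performs no surrogate check.

```cpp
#include <stdint.h>
#include <iterator>
#include <string>

// Write the UTF-8 encoding of `code` to the output iterator and return the
// advanced iterator. Assumes code <= 0x10FFFF.
template <typename ByteIterator>
inline ByteIterator EncodeUtf8Sketch(uint32_t code, ByteIterator out) {
  if (code < 0x80) {                       // 1 byte: 0xxxxxxx
    *out++ = static_cast<char>(code);
  } else if (code < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
    *out++ = static_cast<char>(0xC0 | (code >> 6));
    *out++ = static_cast<char>(0x80 | (code & 0x3F));
  } else if (code < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    *out++ = static_cast<char>(0xE0 | (code >> 12));
    *out++ = static_cast<char>(0x80 | ((code >> 6) & 0x3F));
    *out++ = static_cast<char>(0x80 | (code & 0x3F));
  } else {                                 // 4 bytes: 11110xxx followed by 3 x 10xxxxxx
    *out++ = static_cast<char>(0xF0 | (code >> 18));
    *out++ = static_cast<char>(0x80 | ((code >> 12) & 0x3F));
    *out++ = static_cast<char>(0x80 | ((code >> 6) & 0x3F));
    *out++ = static_cast<char>(0x80 | (code & 0x3F));
  }
  return out;
}
```

Called with std::back_inserter on a std::string, the sketch appends one to four bytes per code point, which is why an output iterator is a convenient interface for this kind of routine.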
The library also implements a variety of single-byte to UNICODE decoder iterators for the Russian and Ukrainian languages. Below are the definitions of these iterators:
#include "char_iterator.h"
#include "char_unicode32_decoder.h"
typedef strutext::encode::CharIterator<const char*, strutext::encode::Cp1251Decoder> Cp1251Iterator;
typedef strutext::encode::CharIterator<const char*, strutext::encode::Cp1252Decoder> Cp1252Iterator;
typedef strutext::encode::CharIterator<const char*, strutext::encode::Cp1253Decoder> Cp1253Iterator;
typedef strutext::encode::CharIterator<const char*, strutext::encode::Cp866Decoder> Cp866Iterator;
typedef strutext::encode::CharIterator<const char*, strutext::encode::Iso88591Decoder> Iso88591Iterator;
typedef strutext::encode::CharIterator<const char*, strutext::encode::Koi8ruDecoder> Koi8ruIterator;
typedef strutext::encode::CharIterator<const char*, strutext::encode::Koi8uDecoder> Koi8uIterator;
typedef strutext::encode::CharIterator<const char*, strutext::encode::Koi8rDecoder> Koi8r1Iterator;
typedef strutext::encode::CharIterator<const char*, strutext::encode::MacCyrillicDecoder> MacCyrillicIterator;
typedef strutext::encode::CharIterator<const char*, strutext::encode::MacUkraineDecoder> MacUkraineIterator;
std::string word = "some Russian text in cp1251";
for (Cp1251Iterator it(word.data(), word.data() + word.size()), end; it != end; ++it) {
std::cout << *it << "\n";
}
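What a single-byte decoder has to do can be shown with CP1251, where the Russian letters occupy one contiguous byte range. The sketch below is a partial mapping written for illustration, not the library's Cp1251Decoder. Cp1251ToUnicodeSketch is a hypothetical name, and returning 0 for unhandled bytes is a convention chosen for this example.

```cpp
#include <stdint.h>

// Map one CP1251 byte to a UNICODE code point.
// Partial: covers ASCII, the letters in 0xC0-0xFF and the two forms of the
// letter IO; every other byte yields 0, meaning "not handled in this sketch".
inline uint32_t Cp1251ToUnicodeSketch(unsigned char byte) {
  if (byte < 0x80) return byte;                     // ASCII is unchanged
  if (byte >= 0xC0) return 0x0410 + (byte - 0xC0);  // maps to U+0410..U+044F
  if (byte == 0xA8) return 0x0401;                  // CYRILLIC CAPITAL LETTER IO
  if (byte == 0xB8) return 0x0451;                  // CYRILLIC SMALL LETTER IO
  return 0;
}
```

A complete decoder would typically replace the range arithmetic with a 256-entry table, so that the remaining bytes in 0x80-0xBF are covered as well.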
The library implements morphological analysis algorithms for the English and Russian languages. The main library interface is defined in the morpho/morpholib/morpho.h file. The file contains the definitions of the MorphologistBase class, which defines the interface to the library, and the derived Morphologist template class, which is parametrized by an Alphabet class. The following types must be defined to use the library:
#include "rus_alphabet.h"
#include "eng_alphabet.h"
#include "morpho_modifier.h"
#include "morpho.h"
#include "rus_model.h"
#include "eng_model.h"
typedef strutext::morpho::Morphologist<strutext::morpho::EnglishAlphabet> EnglishMorpher;
typedef strutext::morpho::Morphologist<strutext::morpho::RussianAlphabet> RussianMorpher;
Here EnglishAlphabet and RussianAlphabet are classes defined in the morpho/alphabets directory. Each alphabet symbol in the morpho library is encoded in one byte for optimization purposes. The Alphabet class implements routines for decoding from and encoding to UNICODE symbols.
The Morphologist class object contains a list of all word forms in the dictionary (Russian or English). Each word (lemma) in the dictionary has a list of forms. For instance, the lemma say has two forms: say and says. The first form, say, is called the main form of the lemma. So a lemma can be defined simply as a list of its forms, represented by its main form. Each form in the list has lexical attributes. Lexical attributes are discussed later, in the section Language Models.
The Morphologist class provides the Analize method, which takes the form text in UTF-8 encoding and returns a list of possible lemmas for the given form. One form can belong to more than one lemma. For example, the form say can belong to two lemmas: the noun say and the adverb say. The method is declared as:
void Analize(const std::string& text, LemList& lem_list) const;
Here is the definition of LemList:
struct Lemma {
...
uint32_t id_; ///< Lemma identifier.
uint32_t attr_; ///< Form attributes.
};
/// Lemma list type definition.
typedef std::list<Lemma> LemList;
Thus, Lemma contains a unique identifier and a list of lexical attributes. This list is encoded in only 4 bytes, and the Morphologist class provides special methods to generate the UTF-8 text of a form from the encoded attribute list and the lemma id: Generate and GenAllForms:
std::string Generate(uint32_t lem_id, uint32_t attrs) const;
size_t GenAllForms(uint32_t lem_id, std::set<std::string>& form_set) const;
One can also generate the main form for a given lemma identifier:
bool GenMainForm(uint32_t lem_id, std::string& main_form) const;
The Morphologist class is also serializable and must be initialized with a dictionary read from a std::istream:
void Serialize(std::ostream& os) const;
void Deserialize(std::istream& is);
The serialized dictionary with lexical information should be generated from its text representation by a special utility. How to do this, and how to work with the extracted lexical attributes for a specific language model, is discussed below.
A language model is a definition of the lexical attributes that can be assigned to forms. The base class PartOfSpeech allows models for different languages to be handled through the same abstract type.
At the moment, two language models are implemented: for the English and Russian languages. The part-of-speech classes are derived from EnglishPos and RussianPos, respectively. The concrete POS classes are, for example, Noun, Adjective and Verb. Each POS class is serializable to the uint32_t type. The PosSerializer class provides an interface to extract a POS class instance from a uint32_t object.
static EnglishPos::Ptr Deserialize(const uint32_t& ob);
static RussianPos::Ptr Deserialize(const uint32_t& ob);
The library also provides a human-readable description of POS for the English and Russian languages. This description is located in morpho/models/lang_model_description.h, where lang can be either rus or eng.
A usage example is provided below. First, we need to understand the structure of the dictionaries and how to generate the binary representation from the text one.
The dictionaries (English and Russian) are obtained from http://aot.ru. Each dictionary consists of two files: lang_tabs.txt and lang_morphs.txt, where lang is either eng or rus.
The tabs file contains lexical attribute definitions, each identified by a two-letter code. For example:
aa 1 ADJECTIVE
ab 1 ADJECTIVE comp
ac 1 ADJECTIVE sup
...
na 1 NOUN narr,sg
nb 1 NOUN narr,pl
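As an illustration of this layout, the sketch below reads one tabs line into a two-letter code, a part of speech and an optional comma-separated attribute list. It is based only on the sample lines above, not on the aot-parser source. TabEntry and ParseTabLine are hypothetical names, and the second column is read but not interpreted because its meaning is not described here.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One entry of a tabs file (field names are illustrative).
struct TabEntry {
  std::string code;                // two-letter attribute code, e.g. "na"
  std::string pos;                 // part of speech, e.g. "NOUN"
  std::vector<std::string> attrs;  // comma-separated attributes, may be empty
};

// Parse a line such as "na 1 NOUN narr,sg". Returns false on malformed input.
inline bool ParseTabLine(const std::string& line, TabEntry& entry) {
  std::istringstream in(line);
  std::string number;  // second column, skipped in this sketch
  if (!(in >> entry.code >> number >> entry.pos)) return false;
  entry.attrs.clear();
  std::string list;
  if (in >> list) {
    std::istringstream parts(list);
    std::string attr;
    while (std::getline(parts, attr, ',')) entry.attrs.push_back(attr);
  }
  return true;
}
```

For the line "na 1 NOUN narr,sg" this gives the code na, the part of speech NOUN and the attributes narr and sg.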
The morphs file consists of several parts. For historical reasons, only two sections are used:
- Line section.
- Morph section.
The Line section is a sequence of lines. Each line contains a set of suffixes along with lexical attribute definitions. Each suffix is concatenated to a prefix defined in the Morph section to produce a form with the assigned lexical attributes. For instance:
%*va%S*vb%ED*vc%ED*vd%ING*ve
can be used to generate the lemma ABANDON with the forms: ABANDON (VERB inf), ABANDONS (VERB prsa,sg,3), ABANDONED (VERB pasa), ABANDONING (VERB ing).
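The expansion of a suffix line can be sketched as follows. The code assumes, based only on the sample line above, that every item starts with '%' and that '*' separates the suffix from the two-letter attribute code. The real file format may contain further fields that this sketch ignores. ExpandSuffixLine and FormWithAttr are hypothetical names.

```cpp
#include <string>
#include <utility>
#include <vector>

// Each result pairs a generated form with its two-letter attribute code.
typedef std::pair<std::string, std::string> FormWithAttr;

// Expand a suffix line such as "%*va%S*vb%ED*vc" for the given prefix.
inline std::vector<FormWithAttr> ExpandSuffixLine(const std::string& prefix,
                                                  const std::string& line) {
  std::vector<FormWithAttr> forms;
  std::string::size_type pos = 0;
  while (pos < line.size() && line[pos] == '%') {
    // The item runs from after this '%' up to the next '%' or the line end.
    std::string::size_type next = line.find('%', pos + 1);
    if (next == std::string::npos) next = line.size();
    const std::string item = line.substr(pos + 1, next - pos - 1);
    const std::string::size_type star = item.find('*');
    if (star != std::string::npos) {
      forms.push_back(FormWithAttr(prefix + item.substr(0, star),
                                   item.substr(star + 1)));
    }
    pos = next;
  }
  return forms;
}
```

Applied to the prefix ABANDON and the sample line, the sketch produces five entries, because the suffix ED appears twice with different attribute codes (vc and vd).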
The Morph section contains lemma definitions. A lemma is defined by its prefix text and a line number in the Line section. For example:
ABALONE 6 0 0 - -
ABAMPERE 6 0 0 - -
ABANDON 6 0 0 - -
ABANDON 8 2 0 - -
ABANDONED 5 1 0 - -
ABANDONEE 6 0 0 - -
Here the first field is the prefix text and the second field is the line number in the Line section (numbered starting from zero).
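Reading the two fields described here takes only a few lines. The sketch below extracts the prefix and the zero-based line index and ignores the remaining columns, whose meaning is not covered in this document. LemmaEntry and ParseLemmaLine are hypothetical names, not part of aot-parser.

```cpp
#include <sstream>
#include <string>

// The two fields of a Morph section line used in this description.
struct LemmaEntry {
  std::string prefix;   // first field: prefix text
  unsigned line_index;  // second field: zero-based line number in Line section
};

// Parse a line such as "ABANDON 8 2 0 - -". Returns false on malformed input.
inline bool ParseLemmaLine(const std::string& line, LemmaEntry& entry) {
  std::istringstream in(line);
  return !(in >> entry.prefix >> entry.line_index).fail();
}
```

Note that the same prefix can occur more than once with different line indexes, as ABANDON does in the sample above, so the prefix alone does not identify a lemma.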
All this information is encoded in the dictionary implementation of the morpho library. The aot-parser utility is used to generate the binary representation of the English and Russian dictionaries. This representation can then be deserialized into a Morphologist class object.
aot-parser --help
Allowed options:
--help produce help
-t [ --tab ] arg tab file name
-d [ --dict ] arg dictionary file name
-b [ --bin ] arg binary dictionary file name
-m [ --model ] arg language model: eng, rus
-v [ --verbose ] produce process info to stderr
aot-parser -t morpho/aot/eng_tabs.txt -d morpho/aot/eng_morphs.txt -b eng_dict.bin -m eng
aot-parser -t morpho/aot/rus_tabs.txt -d morpho/aot/rus_morphs.txt -b rus_dict.bin -m rus
These two commands generate the binary representations of the English and Russian dictionaries.
Here is an example of main form generation for the Russian dictionary.
typedef strutext::morpho::Morphologist<strutext::morpho::RussianAlphabet> Morpher;
std::ifstream dict("rus_dict.bin");
if (not dict.is_open()) {
throw std::invalid_argument("Cannot open russian dictionary: rus_dict.bin");
}
Morpher morpher;
morpher.Deserialize(dict);
std::string text = "мыла";
strutext::morpho::MorphologistBase::LemList lemmas;
morpher.Analize(text, lemmas);
// Extract main forms.
std::set<std::string> forms;
for (const auto& lemma : lemmas) {
std::string main_form;
if (morpher.GenMainForm(lemma.id_, main_form)) {
std::cout << main_form << "\n";
}
}