Find/Create function for text tokenization #5

Open
AndreyKondakovGW opened this issue Oct 5, 2021 · 4 comments

Comments

@AndreyKondakovGW

AndreyKondakovGW commented Oct 5, 2021

Find or create a function that splits a text string into tokens (words).
The function must take a text string and return a list of string tokens. It must not return tokens
that contain digits or punctuation (regular expressions can do this).
The function must accept these optional parameters:

  1. to_lower - if set, all tokens in the returned list are lowercased
  2. min_token_size - the minimum length of a returned token
@AndreyKondakovGW
Author

Examples of this function (you can base the tests for it on them):

  1. tokenize(text="Классификация текстов (документов) (англ. Document classification) — задача компьютерной лингвистики") =>
    ["Классификация", "текстов", "документов", "англ", "Document", "classification", "задача", "компьютерной", "лингвистики"]
  2. tokenize(text="Классификация настроения текста из базы ANEW[3], :
    счастливый - 8.21;
    хороший - 7.47;
    скучный - 2.95;", to_lower = true, min_token_size=4) => ["классификация", "настроения", "текста", "базы", "anew", "счастливый", "хороший", "скучный"]
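
A minimal Ruby sketch of one way to meet these requirements. This is an assumption, not the project's actual implementation: the keyword-argument form, the defaults (`to_lower: false`, `min_token_size: 1`), and the choice to split on anything that is not a letter or digit are all illustrative. It scans for runs of Unicode letters and digits, drops any run that contains a digit, then applies the two optional parameters.

```ruby
# Hypothetical sketch of the requested function; parameter names follow the issue text,
# the defaults are assumed.
def tokenize(text, to_lower: false, min_token_size: 1)
  # Collect runs of Unicode letters/digits; punctuation and whitespace act as separators.
  tokens = text.scan(/[\p{L}\p{N}]+/)
  # Drop every token that contains at least one digit.
  tokens = tokens.reject { |token| token.match?(/\p{N}/) }
  # Optionally lowercase (String#downcase is Unicode-aware since Ruby 2.4).
  tokens = tokens.map(&:downcase) if to_lower
  # Keep only tokens that are long enough (length is counted in characters).
  tokens.select { |token| token.length >= min_token_size }
end

puts tokenize("Классификация настроения текста из базы ANEW[3]",
              to_lower: true, min_token_size: 4).inspect
```

With this approach a mixed token such as "abc123" is dropped whole rather than trimmed to "abc"; whether that is the desired behavior is worth confirming before writing the tests.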

@CyberSniff
Contributor

I'll take it.

@AndreyKondakovGW
Author

AndreyKondakovGW commented Oct 25, 2021

I also suggest adding a lemmatization option for tokens, either as an independent function or as part of the tokenize function:
tokenize(text="Классификация настроения текста из базы", to_lower = true, min_token_size=4, lemmatize = true) => ["классификация", "настроение", "текст", "база"]

@Wolwer1nE
Contributor

It looks like there is no ready-to-use lemmatization library written in Ruby, so we will focus on tokenization in this issue and move lemmatization to a separate issue.
