WordCloud and Cluster #600

kennynakamura · 2021-06-15T12:48:20Z

Hello! Me again, all good? :)
I had an idea: generate a word cloud from clustered chats, to give a preview of the subjects in each cluster.
For example, it could separate personal and professional conversations.

I have code that generates the clusters and the word cloud, but to do this I need a variable where all the texts are stored.
Can Python scripts do this? In the tests I'm doing, the scripts act on only one item at a time. Also, could I generate an image, and where could I store it?

lfcnassif · 2021-06-15T14:59:47Z

Maintainer

Hi @kennynakamura, all good. Some years ago I thought about a similar feature to extract relevant words from texts and store them in a "relevant words" column. Text summarization is another related idea. I think you can use the item.setExtraAttribute() method like you did in #508 to store keywords extracted from a specific document.

It is possible to store images using the method above, but they won't be shown automatically in the UI. To create a new viewer, Java code needs to be implemented. But I think a simple new column with relevant words is already a good start.

About clustering documents, a Python task could create document vectors in process() and store them using the method above, then retrieve them in the task's finish() method to run the clustering algorithm. Actually, I think word vectors already exist in the Lucene index and could be reused instead of recomputing them and wasting resources. To save the cluster number to which a document belongs, currently it can only be saved as a new bookmark. After #24, we will be able to create custom columns in the finish() method or even after processing ends.
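A minimal sketch of that process()/finish() split, in plain Python so it runs outside IPED. Everything here is an assumption for illustration: the `ClusterSketch` class, its `process(item_id, text)` signature, and the naive k-means over term-frequency vectors are hypothetical, vectors are kept in memory on the instance instead of being stored with setExtraAttribute(), and nothing is reused from the Lucene index.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a text into a sparse term-frequency vector (word -> count)."""
    return Counter(re.findall(r"[a-z]{3,}", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ClusterSketch:
    """Hypothetical stand-in for a task: collect in process(), cluster in finish()."""

    def __init__(self, k=2, iterations=10):
        self.k = k
        self.iterations = iterations
        self.vectors = {}  # item id -> term-frequency vector

    def process(self, item_id, text):
        # Called once per item: only build and keep the vector.
        self.vectors[item_id] = vectorize(text)

    def finish(self):
        # Called once at the end, when every vector is available.
        ids = list(self.vectors)
        # Naive seeding (an assumption): the first k items are the centroids.
        centroids = [self.vectors[i] for i in ids[:self.k]]
        labels = {}
        for _ in range(self.iterations):
            # Assign each item to its most similar centroid.
            labels = {
                i: max(range(len(centroids)),
                       key=lambda c: cosine(self.vectors[i], centroids[c]))
                for i in ids
            }
            # Recompute each centroid as the sum of its members
            # (cosine similarity ignores the scale, so no averaging is needed).
            for c in range(len(centroids)):
                total = Counter()
                for i in ids:
                    if labels[i] == c:
                        total.update(self.vectors[i])
                if total:
                    centroids[c] = total
        return labels  # item id -> cluster number
```

The returned mapping is what would then be saved per cluster, for example as one bookmark per cluster number.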

6 replies

lfcnassif Jun 17, 2021
Maintainer

Hi @kennynakamura, thanks. Seems good to me. Maybe changing the column name to "FrequentWords" could be more intuitive? For sure the stopwords list should be adjusted after running large scale tests.

kennynakamura Jun 17, 2021
Author

Sure! It seems better to me too. I was already running some tests with real WhatsApp conversations, and added some abbreviations to the config file. Can I make a pull request? The code is below.

I use the NLTK library; it is already used in IPED, right? I'm not sure if I imported the stopwords from the text file correctly, but it works.

https://github.com/kennynakamura/FrequentWordsIPED/blob/main/FrequentWordsConfig.txt

https://github.com/kennynakamura/FrequentWordsIPED/blob/main/FrequentWordsTask.py

```python
import heapq
import re

from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer

configFile = 'FrequentWordsConfig.txt'
STOPProp = 'STOP_det'

# Stopwords loaded from the config file in init(); empty until then.
STOP = set()


class FrequentWordsTask:

    enabled = False
    configDir = None

    def isEnabled(self):
        return True

    def init(self, confProps, configFolder):
        from java.io import File
        from dpf.sp.gpinf.indexer.util import UTF8Properties
        global STOP
        extraProps = UTF8Properties()
        extraProps.load(File(configFolder, configFile))
        STOP_det = extraProps.getProperty(STOPProp)
        # The property holds the stopwords separated by whitespace.
        if STOP_det is not None:
            STOP = set(str(STOP_det).split())

    def finish(self):
        return

    def process(self, item):
        categories = item.getCategorySet().toString()
        if not ("Chats" in categories or "Emails" in categories):
            return

        text = str(item.getParsedTextCache()).lower()
        # Remove numbers.
        text = re.sub(r'\d+', '', text)
        # Tokenize into words.
        tokenizer = RegexpTokenizer(r'\w+')
        words = tokenizer.tokenize(text)

        def remove_stopwords(words, stopwords):
            # Build a new list: removing from a list while iterating over it
            # skips elements.
            kept = []
            for word in words:
                if word in stopwords:
                    continue
                # Drop words with fewer than 3 letters.
                if len(word) < 3:
                    continue
                # Drop laughter such as "kkk" (any word with two or more 'k').
                if word.count('k') >= 2:
                    continue
                kept.append(word)
            return kept

        def create_bag_of_words(words):
            wordfreq = {}
            for word in words:
                wordfreq[word] = wordfreq.get(word, 0) + 1
            return wordfreq

        def stemming(words):
            # Return the stemmed words; reassigning the loop variable, as the
            # first version did, left the list unchanged.
            stemmer = PorterStemmer()
            return [stemmer.stem(word) for word in words]

        words = remove_stopwords(words, STOP)
        # Stemming: I'm not sure it makes much difference, but NLTK provides it.
        words = stemming(words)

        bag_of_words = create_bag_of_words(words)
        # Select the 10 most frequent words.
        most_freq = heapq.nlargest(10, bag_of_words, key=bag_of_words.get)
        item.setExtraAttribute('FrequentWords', most_freq)
```

kennynakamura Jun 17, 2021
Author

Sorry about part of the code showing outside the code block, I'm trying to fix it.

lfcnassif Jun 17, 2021
Maintainer

For sure, a PR will be welcome! I will have time to review it next week. Please translate variable names and comments to English, as this project is migrating to English. Also, the stopwords should be language aware: you could query the "iped-locale" Java system property and set the stopwords based on the configured language. Maybe NLTK already has language-specific stopwords.
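A rough sketch of what language-aware stopwords could look like. The helper names `stopwords_language` and `load_stopwords` and the locale-to-language table are hypothetical, and the locale values shown (such as "pt-BR") are assumed examples of what "iped-locale" may contain. `load_stopwords` also assumes the NLTK "stopwords" corpus is installed.

```python
# Language part of a locale tag -> NLTK stopwords corpus name.
# This table is an assumption; extend it to the languages actually needed.
NLTK_LANGUAGES = {
    "en": "english",
    "pt": "portuguese",
    "es": "spanish",
    "de": "german",
    "it": "italian",
    "fr": "french",
}


def stopwords_language(locale, default="english"):
    """Map a locale such as 'pt-BR' or 'pt_BR' to an NLTK language name."""
    lang = (locale or "").replace("_", "-").split("-")[0].lower()
    return NLTK_LANGUAGES.get(lang, default)


def load_stopwords(locale):
    """Load the NLTK stopword list for the given locale as a set."""
    # Imported lazily so the mapping above works without NLTK installed.
    from nltk.corpus import stopwords
    return set(stopwords.words(stopwords_language(locale)))


# Inside a task, the locale could be read from the Java system property
# mentioned above (not runnable outside a Java-backed interpreter):
#   from java.lang import System
#   STOP = load_stopwords(System.getProperty("iped-locale"))
```

Falling back to a default language keeps the task working when the property is unset or names a language missing from the table.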

kennynakamura Jun 18, 2021
Author

I opened the PR! NLTK has its own stopword lists for multiple languages. I used 'iped-locale', tested it, and it works; I just don't know if I did it the proper way. I also translated the variable names and comments. If you have any questions, I'm available. I hope it helps improve IPED!
