tianchi_eco_infor_extractor (dante0shy/tianchi_eco_infor_extractor)

text_2_list.py: converts HTML into a word list (word_list):

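import random
# Assumption: these helpers are importable from text_2_list.py.
from text_2_list import jie_ba_initial, get_data, get_word_list

jie_ba_initial('/your/path/to/FDDC_announcements_company_name_20180531.json')
text = get_data(file)  # `file` is the input to read; it is not defined in this snippet
for t in text:
    fined_seg_list = get_word_list(t)
    # Print roughly 1 in 51 segmented texts as a spot check.
    if not random.randint(0, 50):
        print("'~'".join(fined_seg_list))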
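The internals of get_word_list are not shown in this README. As a rough illustration of the HTML-to-word-list idea only, the sketch below strips tags with the standard library's html.parser and then splits the remaining text with a naive regular expression. The names _TextCollector and html_to_word_list are hypothetical, and the regex is a dependency-free stand-in: the name jie_ba_initial suggests the real script segments with jieba, which this sketch does not do (a run of Chinese characters comes back as a single token here).

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not inside a skipped element.
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_word_list(html):
    """Strip tags, then split the text into naive tokens (not jieba)."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    text = " ".join(collector.chunks)
    # Naive stand-in for a real segmenter: runs of word characters.
    return re.findall(r"\w+", text)


sample = "<html><body><p>Net profit rose 12 percent</p><script>var x = 1;</script></body></html>"
tokens = html_to_word_list(sample)
```

Swapping the final re.findall call for a proper segmenter would be the natural next step for Chinese announcements.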

word_index.py: builds an index for the word list:

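# Assumption: WordPrefixTree is importable from word_index.py.
from word_index import WordPrefixTree

index_tree = WordPrefixTree()
# `words` is the word list to index; it is not defined in this snippet.
for idx, word in enumerate(words):
    index_tree.add(word, idx)
index_tree.check('words')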
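How WordPrefixTree stores words is not described here. The following is a minimal sketch of one way a prefix tree (trie) with the same add(word, idx) and check(word) calls could work. SketchPrefixTree is a hypothetical class, and the choice to have check return the list of stored indices for an exact match is an assumption, not the repository's documented behaviour.

```python
class SketchPrefixTree:
    """Illustrative trie; not the repository's WordPrefixTree."""

    def __init__(self):
        # Each node holds its child nodes and the indices of words ending there.
        self.root = {"children": {}, "ids": []}

    def add(self, word, idx):
        """Walk or create one node per character, then record idx at the end."""
        node = self.root
        for ch in word:
            node = node["children"].setdefault(ch, {"children": {}, "ids": []})
        node["ids"].append(idx)

    def check(self, word):
        """Return the indices stored for an exact word, or [] if absent."""
        node = self.root
        for ch in word:
            node = node["children"].get(ch)
            if node is None:
                return []
        return list(node["ids"])


words = ["profit", "pro", "loss", "profit"]
tree = SketchPrefixTree()
for idx, word in enumerate(words):
    tree.add(word, idx)
```

Words that share a prefix ("pro" and "profit") share the nodes for that prefix, so a lookup costs one step per character regardless of how many words are indexed.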