
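text_2_list.py: from HTML to word_list:

import random

# Load the company-name list shipped in this repository before segmenting.
jie_ba_initial('/your/path/to/FDDC_announcements_company_name_20180531.json')
# `file` must be set to the path of the input data.
text = get_data(file)
for t in text:
    fined_seg_list = get_word_list(t)
    # Print roughly 1 in 51 segmented texts as a spot check.
    if not random.randint(0, 50):
        print("'~'".join(fined_seg_list))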
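The helpers used above (get_data, get_word_list) live in the repository and are not reproduced here. As a rough illustration of the "from HTML to word list" idea only, the following sketch strips markup with the standard library and then tokenizes naively. The names html_to_text and naive_word_list are hypothetical, and the regex tokenizer is a stand-in assumption, not the segmentation the repository actually performs.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes of an HTML document, skipping script/style content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def naive_word_list(text):
    """Stand-in tokenizer (assumption): ASCII word runs, CJK characters one by one."""
    return re.findall(r"[A-Za-z0-9_]+|[\u4e00-\u9fff]", text)


sample = "<p>Hello <b>world</b></p><script>x=1</script>"
words = naive_word_list(html_to_text(sample))
```

A real word segmenter would group Chinese characters into multi-character words; splitting them individually here only keeps the sketch dependency-free.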

word_index.py: build an index for word_list:

index_tree = WordPrefixTree()
# `words` is a word list, e.g. one produced by text_2_list.py.
for idx, word in enumerate(words):
    index_tree.add(word, idx)
# Look up a word in the index.
index_tree.check('words')
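The repository's WordPrefixTree is not shown on this page, so the following is only a minimal sketch of how a prefix tree with the same add/check shape could work. The class name SimplePrefixTree is hypothetical, and the assumption that check returns the list of indices stored for a word is mine; the real method may return something else.

```python
class SimplePrefixTree:
    """Illustrative trie; NOT the repository's WordPrefixTree."""

    def __init__(self):
        # Each node is a dict mapping a character to its child node.
        self.root = {}

    def add(self, word, idx):
        """Insert `word`, recording `idx` at the node where the word ends."""
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        # The None key marks end-of-word; it cannot clash with a character key.
        node.setdefault(None, []).append(idx)

    def check(self, word):
        """Return the indices stored for `word` (assumed behavior), or [] if absent."""
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return []
        return list(node.get(None, []))


tree = SimplePrefixTree()
for i, w in enumerate(["stock", "stockholder", "share", "stock"]):
    tree.add(w, i)
```

Storing a list of indices per word lets repeated words keep every position; a prefix that is not itself a stored word (for example "sto") yields an empty list.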

extras
tianchi_extractor
.gitignore
FDDC_announcements_company_name_20180531.json
[New] FDDC_announcements_instruction_20180605.pdf
readme.md

dante0shy/tianchi_eco_infor_extractor
