about your dataset #11
Comments
Okay, a long time has passed and I've forgotten how to debug it. I can post some of the code I used for preprocessing; I hope it's useful to you:
```python
import pickle
import jieba
import process_data_weibo  # helper module from the repo

# deal_image is the poster's own helper (not shown); it maps an
# image id to its file extension for every image in the directory.

train_id = pickle.load(open("../EANN-KDD18-master/Data/weibo/train_id.pickle", 'rb'))
val_id = pickle.load(open("../EANN-KDD18-master/Data/weibo/validate_id.pickle", 'rb'))
stop_words = process_data_weibo.stopwordslist()
pre_path = 'F:/data/EANN-KDD18-master/Data/weibo/tweets/'
file_list = [pre_path + "test_nonrumor.txt", pre_path + "test_rumor.txt",
             pre_path + "train_nonrumor.txt", pre_path + "train_rumor.txt"]
nonrumor_images = deal_image('F:/data/EANN-KDD18-master/Data/weibo/nonrumor_images/')
rumor_images = deal_image('F:/data/EANN-KDD18-master/Data/weibo/rumor_images/')

# These were initialized earlier in the poster's script (assumed):
word2ix, ix2word, wordcnt = {}, {}, 0
data, val_data = [], []
max_seq_len, max_event = 0, 0

# train
for k, f in enumerate(file_list):
    f = open(f, encoding='utf-8')
    if (k + 1) % 2 == 1:
        label = 0  # real is 0
    else:
        label = 1  # fake is 1
    lines = f.readlines()
    post_id = ""
    url = ""
    # Each post spans three lines: metadata, image URLs, then the text.
    for i, line in enumerate(lines):
        if (i + 1) % 3 == 1:
            post_id = line.split('|')[0]
        if (i + 1) % 3 == 2:
            url = line.lower()
        if (i + 1) % 3 == 0:
            line = process_data_weibo.clean_str_sst(line)
            seg_list = jieba.cut_for_search(line)  # Chinese word segmentation
            new_seg_list = []
            for word in seg_list:
                if word not in stop_words:
                    new_seg_list.append(word)
            clean_l = ' '.join(new_seg_list)
            if len(clean_l) > 10 and post_id in train_id:
                describe = []
                for x in new_seg_list:
                    if x not in word2ix:
                        word2ix[x] = wordcnt
                        ix2word[wordcnt] = x
                        wordcnt += 1
                    describe.append(word2ix[x])
                max_seq_len = max(max_seq_len, len(describe))
                event = int(train_id[post_id])
                max_event = max(max_event, event)
                for x in url.split('|'):
                    image_id = x.split('/')[-1].split(".")[0]
                    if label == 0 and image_id in nonrumor_images:
                        image_url = 'F:/data/EANN-KDD18-master/Data/weibo/nonrumor_images/' + image_id + '.' + nonrumor_images[image_id]
                        data.append([describe, image_url, label, event])
                    elif label == 1 and image_id in rumor_images:
                        image_url = 'F:/data/EANN-KDD18-master/Data/weibo/rumor_images/' + image_id + '.' + rumor_images[image_id]
                        data.append([describe, image_url, label, event])
            elif len(clean_l) > 10 and post_id in val_id:
                describe = []
                for x in new_seg_list:
                    if x not in word2ix:
                        word2ix[x] = wordcnt
                        ix2word[wordcnt] = x
                        wordcnt += 1
                    describe.append(word2ix[x])
                max_seq_len = max(max_seq_len, len(describe))
                event = int(val_id[post_id])
                max_event = max(max_event, event)
                for x in url.split('|'):
                    image_id = x.split('/')[-1].split(".")[0]
                    if label == 0 and image_id in nonrumor_images:
                        image_url = 'F:/data/EANN-KDD18-master/Data/weibo/nonrumor_images/' + image_id + '.' + nonrumor_images[image_id]
                        val_data.append([describe, image_url, label, event])
                    elif label == 1 and image_id in rumor_images:
                        image_url = 'F:/data/EANN-KDD18-master/Data/weibo/rumor_images/' + image_id + '.' + rumor_images[image_id]
                        val_data.append([describe, image_url, label, event])
```
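For what it's worth, the snippet above only makes sense if `train_id.pickle` / `validate_id.pickle` are pickled dicts mapping a post id (string) to an integer event id (it checks `post_id in train_id` and calls `int(train_id[post_id])`). A minimal sketch of writing and reading such a file — the post ids and event numbers here are made up for illustration, not from the real dataset:

```python
import pickle

# Hypothetical mapping: post id -> event id, in the shape the loop expects.
train_id = {
    "3500346704941690": 0,
    "3500348458018457": 0,
    "3501234567890123": 1,
}

with open("train_id.pickle", "wb") as fh:
    pickle.dump(train_id, fh)

# Read it back exactly the way the snippet does:
with open("train_id.pickle", "rb") as fh:
    loaded = pickle.load(fh)

event = int(loaded["3501234567890123"])  # event id of that post
```

How the real ids are grouped into events is exactly what this thread is asking about; the sketch only shows the file format the code assumes.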
Thank you very much for your reply. Could you share the code that produces the w2v.pickle file? It's very important to me; I'm a beginner, sorry to bother you.
The code on GitHub is pretty good; take a closer look and you'll understand what each part does.
```python
import pickle

# read_image, write_data, load_data, add_unknown_words and get_W are
# helpers defined in the repo's process_data_weibo.py.

def get_data(text_only):
    if text_only:
        print("Text only")
        image_list = []
    else:
        print("Text and image")
        image_list = read_image()

    train_data = write_data("train", image_list, text_only)
    validate_data = write_data("validate", image_list, text_only)
    test_data = write_data("test", image_list, text_only)

    print("loading data...")
    vocab, all_text = load_data(train_data, validate_data, test_data)
    print("number of sentences: " + str(len(all_text)))
    print("vocab size: " + str(len(vocab)))
    max_l = len(max(all_text, key=len))
    print("max sentence length: " + str(max_l))

    word_embedding_path = "../EANN-KDD18-master/Data/weibo/w2v.pickle"
    w2v = pickle.load(open(word_embedding_path, "rb"), encoding='bytes')
    print("word2vec loaded!")
    print("num words already in word2vec: " + str(len(w2v)))

    add_unknown_words(w2v, vocab)
    W, word_idx_map = get_W(w2v)
    W2 = rand_vecs = {}
    w_file = open("../EANN-KDD18-master/Data/weibo/word_embedding.pickle", "wb")
    pickle.dump([W, W2, word_idx_map, vocab, max_l], w_file)
    w_file.close()
    return train_data, validate_data, test_data
```
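As for w2v.pickle: judging from how `get_data` uses it (`len(w2v)`, then `add_unknown_words(w2v, vocab)`), it is a pickled dict mapping each word to its embedding vector. A minimal sketch of writing a file in that shape — the vocabulary, the 32-dimensional size, and the random list-valued vectors here are purely illustrative (the real file presumably holds pre-trained Chinese word vectors, likely as arrays):

```python
import pickle
import random

random.seed(0)
dim = 32  # illustrative; real pre-trained embeddings are typically larger

# Toy vocabulary; real code would collect words from the tweet files.
vocab = ["地震", "谣言", "图片"]
w2v = {w: [random.uniform(-0.25, 0.25) for _ in range(dim)] for w in vocab}

with open("w2v.pickle", "wb") as fh:
    pickle.dump(w2v, fh)

# Read it back the way get_data does:
with open("w2v.pickle", "rb") as fh:
    w2v_loaded = pickle.load(fh, encoding='bytes')

print("num words:", len(w2v_loaded))
```

The -0.25..0.25 range mirrors the common word2vec convention for randomly initializing unknown words; it is an assumption here, not taken from the repo.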
I used the Weibo files and it worked, but I don't know how to use the Twitter data. Could you tell me whether you used the Twitter files in your experiment?
Hello, can I ask how you generate validate_id.pickle/train_id.pickle/test_id.pickle? Thank you!
Hello, I have the same problem. Could you please tell me what I should do?
My email address is [email protected]
I have the same problem. Can you tell me how to solve it?
Hello, can I ask how you generate validate_id.pickle/train_id.pickle/test_id.pickle?