How to generate embeddings for new candidates? #106
Comments
Thanks to @ledw-2 and others from other issues, I was able to recreate embeddings for existing entities (in …).

Given a new entity title and its description, here's how to generate its embeddings:

import json
import torch
# Same imports as used in main_dense.py
from blink.biencoder.biencoder import load_biencoder
from blink.biencoder.data_process import get_candidate_representation

# Load biencoder model and biencoder params just like in main_dense.py
# (args is the parsed command-line namespace, as in main_dense.py)
with open(args.biencoder_config) as json_file:
    biencoder_params = json.load(json_file)
    biencoder_params["path_to_model"] = args.biencoder_model
biencoder = load_biencoder(biencoder_params)

# Read 10 entities from entity.jsonl
entities = []
count = 10
with open('./models/entity.jsonl') as f:
    for i, line in enumerate(f):
        entity = json.loads(line)
        entities.append(entity)
        if i == count - 1:
            break

# Get token_ids corresponding to candidate title and description
tokenizer = biencoder.tokenizer
# Candidates are tokenized to the candidate length, not the context length
max_seq_length = biencoder_params["max_cand_length"]
ids = []
for entity in entities:
    candidate_desc = entity['text']
    candidate_title = entity['title']
    cand_tokens = get_candidate_representation(
        candidate_desc,
        tokenizer,
        max_seq_length,
        candidate_title=candidate_title
    )
    token_ids = cand_tokens["ids"]
    ids.append(token_ids)

# Stack the per-entity id lists into one tensor and save it
# (path is the destination file for the token ids)
ids = torch.tensor(ids)
torch.save(ids, path)

The file in which these ids are saved should be passed in the …

Thanks to the FB team for this awesome project!
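For anyone wondering why `torch.tensor(ids)` works on a list of per-entity id lists: it needs every row to have the same length, which is what passing `max_seq_length` is for. Below is a toy sketch of that truncate-and-pad idea, not BLINK's actual code. The whitespace tokenizer, the on-the-fly vocabulary, the `[TITLE]` tag, the padding id and the function name `toy_candidate_ids` are all made up for illustration; the real `get_candidate_representation` uses the BERT tokenizer and may differ in details.

```python
# Toy illustration only: a whitespace split and an on-the-fly vocabulary stand in
# for the real BERT tokenizer. TITLE_TAG and PAD_ID are hypothetical placeholders.
CLS, SEP, TITLE_TAG, PAD_ID = "[CLS]", "[SEP]", "[TITLE]", 0


def toy_candidate_ids(title, desc, vocab, max_seq_length):
    """Return (tokens, ids) with ids truncated/padded to exactly max_seq_length."""
    tokens = title.lower().split() + [TITLE_TAG] + desc.lower().split()
    # Reserve two positions for the [CLS] and [SEP] markers, then truncate.
    tokens = [CLS] + tokens[: max_seq_length - 2] + [SEP]
    # Assign ids starting at 1 so that 0 stays free for padding.
    ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in tokens]
    # Pad short sequences so every row has the same length.
    ids += [PAD_ID] * (max_seq_length - len(ids))
    return tokens, ids


vocab = {}
_, short_ids = toy_candidate_ids("Aristotle", "Greek philosopher", vocab, 8)
_, long_ids = toy_candidate_ids(
    "Aristotle", "Greek philosopher and polymath of the Classical period", vocab, 8
)
print(len(short_ids), len(long_ids))  # both rows have length 8
```

The short description gets padded and the long one gets cut off, so anything beyond the length limit never reaches the encoder.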
@abhinavkulkarni Thanks for the comments! I hope you find this project useful. 😃
It seems that we have to update and regenerate the whole entity.jsonl file in order to get the .t7 file.
@amelieyu1989: No, if …
I see. You mean I could get my new encoding list with new_encode_list = torch.cat((old_encode_list, new_entities_tokens))?
@abhinavkulkarni I guess it may be as follows:
python generate_candidates.py \
    --path_to_model_config models/biencoder_wiki_large.json \
    --path_to_model models/biencoder_wiki_large.bin \
    --entity_dict_path models/entity1.jsonl \
    --encoding_save_file_dir models \
    --saved_cand_ids models/entity_token_ids_128.t7 \
    --batch_size 512 \
    --chunk_start 0 \
    --chunk_end 1000000
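If only the new entities are to be encoded and then appended with the torch.cat idea above, the bookkeeping is mostly about row positions. Here is a minimal sketch, assuming that row i of the token-id and embedding tensors corresponds to line i of the entity file (my assumption, not something confirmed in this thread). `append_entities` is a hypothetical helper, not part of BLINK, and the records in the demo are made up.

```python
import json
import os
import tempfile


def append_entities(jsonl_path, new_entities):
    """Append records to a JSONL file; return the [start, end) rows they occupy."""
    # Count existing rows (assumes one entity per line, no blank lines).
    with open(jsonl_path, "rb") as f:
        start = sum(1 for _ in f)
        # Check whether the file ends with a newline, so that the first
        # appended record does not get glued onto the last existing one.
        needs_newline = False
        if f.tell() > 0:
            f.seek(-1, os.SEEK_END)
            needs_newline = f.read(1) != b"\n"
    with open(jsonl_path, "a", encoding="utf-8") as f:
        if needs_newline:
            f.write("\n")
        for entity in new_entities:
            f.write(json.dumps(entity) + "\n")
    return start, start + len(new_entities)


# Demo on a throwaway file with made-up records.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "entity.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for title in ("A", "B", "C"):
            f.write(json.dumps({"title": title, "text": "..."}) + "\n")
    new = [{"title": "D", "text": "..."}, {"title": "E", "text": "..."}]
    start, end = append_entities(path, new)
    print(start, end)  # 3 5
```

The returned range marks the rows that still need token ids and embeddings. Whether it maps directly onto --chunk_start and --chunk_end is worth checking against generate_candidates.py before relying on it.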
Hi,

I have been going through the code, documentation and issues to figure out how to obtain embeddings for new candidates, but I have not been able to figure this out.

I would like to add new candidates to the all_entities_large.t7 file.

Firstly, the script generate_candidates.py is supposed to generate the embeddings given the token_idxs of new entities (the input parameter saved_cand_ids refers to a file that has these token_idxs); however, it is not clear how to generate these token_idxs.

So, I tried to reverse engineer generating embeddings for the following entity in the entity.jsonl file:

Firing up main_dense.py in interactive mode and submitting the above text produces the following named entities (persons only):

I then tried running the samples corresponding to Aristotle mentions through both the context and candidate encoder parts of the BiEncoder and saved the embeddings to disk; however, they are all different from the one in all_entities_large.t7.

Are we supposed to average the embeddings of all the mentions corresponding to the Aristotle entity? Or is there some other logic?

The BLINK paper says the embeddings for candidates were generated by taking the first 10 lines of their Wikipedia description; however, only 32 tokens are submitted to the encoder to obtain an embedding, so I am not sure why 10 lines were selected.

Thanks!