Issue with NER model with adjacent entities of the same type. #764

Open
ThomasBourgeois opened this issue Dec 4, 2024 · 1 comment

ThomasBourgeois commented Dec 4, 2024

This is a two-part issue:

1/ In the section "Fast tokenizers' special powers" of the chapter on tokenizers, it is mentioned that the loaded model (dbmdz/bert-large-cased-finetuned-conll03-english) was fine-tuned on a dataset following the IOB1 format, that is, for two adjacent entities of the same type, the second one starts with B- rather than I-.

It seems to me that this model does not work that way.

I've tried many inputs with several kinds of entities, and the entities always get tagged I-, never B-.

E.g.: screenshot below with locations.
image

Same with persons:
image

So the example mentioned in the course does not behave as described. For comparison, the output should look like the blue labels below:
image
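
For reference, here is a minimal sketch of how the two tagging formats differ for adjacent entities of the same type. It does not call the model; the `tag_spans` helper, the tokens, and the spans are made up for illustration.

```python
def tag_spans(num_tokens, spans, scheme="IOB2"):
    """Tag tokens from entity spans.

    spans: sorted, non-overlapping (start, end_exclusive, entity_type) tuples.
    scheme: "IOB2" (every entity starts with B-) or "IOB1" (B- is used only
    when an entity directly follows another entity of the same type).
    """
    tags = ["O"] * num_tokens
    prev_end, prev_type = None, None
    for start, end, ent_type in spans:
        if scheme == "IOB2":
            first = "B-"
        else:  # IOB1
            adjacent_same_type = prev_end == start and prev_type == ent_type
            first = "B-" if adjacent_same_type else "I-"
        tags[start] = first + ent_type
        for i in range(start + 1, end):
            tags[i] = "I-" + ent_type
        prev_end, prev_type = end, ent_type
    return tags


# Hypothetical example: two adjacent persons, then a location.
tokens = ["Alex", "Sam", "visited", "Paris"]
spans = [(0, 1, "PER"), (1, 2, "PER"), (3, 4, "LOC")]

print(tag_spans(len(tokens), spans, "IOB2"))
# ['B-PER', 'B-PER', 'O', 'B-LOC']
print(tag_spans(len(tokens), spans, "IOB1"))
# ['I-PER', 'B-PER', 'O', 'I-LOC']
```

If the model really followed IOB1, I would expect the second line of output for a sentence like this, with B-PER on the second person.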

2/ The code that groups entities at the end of that section (screenshot below) has a problem too.
Under the documented behaviour (the second entity starts with B-), a token tagged B- would immediately fail the while loop's condition and exit the loop, so the entity would lose all the following tokens tagged I-. The problem is most likely in how idx is advanced in the while loop.

image
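
One possible fix, as a rough sketch rather than the course's actual code: consume the first token of an entity unconditionally (whether it is tagged B- or I-), and only then extend over the following I- tokens. The `group_entities` function is hypothetical, works on plain label strings and character offsets instead of model outputs, and leaves out the score averaging; the labels below are hand-written in IOB1 style, not real predictions.

```python
def group_entities(labels, offsets, text):
    """Group per-token labels into entities.

    labels: one label string per token ("O", "B-PER", "I-PER", ...).
    offsets: (start_char, end_char) for each token.
    """
    entities = []
    idx = 0
    while idx < len(labels):
        label = labels[idx]
        if label == "O":
            idx += 1
            continue
        ent_type = label[2:]  # strip the "B-" or "I-" prefix
        start, end = offsets[idx]
        idx += 1  # always consume the first token, even if it is B-
        # Extend the entity over the following I- tokens of the same type.
        while idx < len(labels) and labels[idx] == f"I-{ent_type}":
            _, end = offsets[idx]
            idx += 1
        entities.append(
            {
                "entity_group": ent_type,
                "word": text[start:end],
                "start": start,
                "end": end,
            }
        )
    return entities


# Hand-written IOB1-style labels for a made-up sentence.
text = "Alex Sam Smith visited Paris"
labels = ["I-PER", "B-PER", "I-PER", "O", "I-LOC"]
offsets = [(0, 4), (5, 8), (9, 14), (15, 22), (23, 28)]

for entity in group_entities(labels, offsets, text):
    print(entity["entity_group"], repr(entity["word"]))
# PER 'Alex'
# PER 'Sam Smith'
# LOC 'Paris'
```

With this structure a B- tag starts a new entity and still keeps the I- tokens that follow it, because idx is advanced past the first token before the inner loop checks for I- labels.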

@ThomasBourgeois

@sgugger might be able to tag the right people?
