If using the same tokenizer instance, deeplima tokenizer blocks on second file #172

Open
kleag opened this issue May 9, 2024 · 1 comment
kleag commented May 9, 2024

Describe the bug
When deeplima is asked to analyze several files, the first one is analyzed correctly, but then the program stalls.

To Reproduce
Steps to reproduce the behavior:

  1. Run deeplima --tok-model ~/.local/share/lima/resources/RnnTokenizer/ud/tokenizer-eng-UD_English-EWT.pt test-eng*.txt
  2. See that the program stalls before printing the result for the second file
deeplima --tok-model ~/.local/share/lima/resources/RnnTokenizer/ud/tokenizer-eng-UD_English-EWT.pt  test-eng*.txt
one_pos_mask == 00000000 00000000 00000000 00000000 00000000 00011111 11111111 11111111
mask  [0]    == 00000000 00000000 00000011 11111111 11111111 11100000 00000000 00000000
shift [0]    == 21
one_pos_mask == 00000000 00000000 00000000 00000000 00000000 00011111 11111111 11111111
mask  [1]    == 00000000 00000000 00000011 11111111 11111111 11111111 11111111 11111111
shift [1]    == 0
one_pos_mask == 00000000 00000000 00000000 00000000 00000000 00011111 11111111 11111111
mask  [2]    == 01111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111
shift [2]    == 0
one_pos_mask == 00000000 00000000 00000000 00000000 00000000 00000000 00000000 11111111
mask  [3]    == 00000000 00000000 00000000 00000000 00000000 00000000 11111111 00000000
shift [3]    == 8
one_pos_mask == 00000000 00000000 00000000 00000000 00000000 00000000 00000000 11111111
mask  [4]    == 00000000 00000000 00000000 00000000 00000000 00000000 11111111 11111111
shift [4]    == 0
one_pos_mask == 00000000 00000000 00000000 00000000 00000000 00000000 00000000 11111111
mask  [5]    == 00000000 00000000 00000000 00000000 00000000 11111111 11111111 11111111
shift [5]    == 0
one_pos_mask == 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000001
mask  [6]    == 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000010
shift [6]    == 1
locked_buffer_set_t::init
locked_buffer_t::locked_buffer_t()0x655515b3a530
locked_buffer_t::locked_buffer_t()0x655515b3a548
locked_buffer_t::locked_buffer_t()0x655515b3a560
locked_buffer_t::locked_buffer_t()0x655515b3a578
Reading file: test-eng10.txt
SegmentationImpl::parse_from_stream
locked_buffer_set_t::init
SegmentationImpl::parse_from_stream Reading callback: 55 bytes, continue_reading=0 counter=55
SegmentationImpl::parse_from_stream locking (m_buff_set) buff 0
locked_buffer_t::lock 0x655515b3a530 0 -> 1
BUFFS:  | 1 | 0 | 0 | 0 |
SLOTS:  | 2 | 0 | 0 | 0 |
BUFFS:  | 1 | 0 | 0 | 0 |
SLOTS:  | 2 | 0 | 0 | 0 |
BUFFS:  | 1 | 0 | 0 | 0 |
SLOTS:  | 2 | 0 | 0 | 0 |
1       electric        _       _       _       _       0       _       _       _
2       servo   _       _       _       _       1       _       _       _
3       motor   _       _       _       _       2       _       _       _
4       six     _       _       _       _       3       _       _       _
5       axes    _       _       _       _       4       _       _       _
6       robot   _       _       _       _       5       _       _       _
7       application     _       _       _       _       6       _       _       _
8       domain  _       _       _       _       7       _       _       _
locked_buffer_t::unlock 0x655515b3a530 1 -> 0
BUFFS:  | 0 | 0 | 0 | 0 |
SLOTS:  | 0 | 0 | 0 | 0 |
Parsed: 8 in 0.86 seconds.
Parsing speed: 9.32 tokens / sec.
SegmentationImpl::finalize
Reading file: test-eng11.txt
SegmentationImpl::parse_from_stream
locked_buffer_set_t::init
SegmentationImpl::parse_from_stream Reading callback: 124 bytes, continue_reading=0 counter=124
SegmentationImpl::parse_from_stream locking (m_buff_set) buff 0
locked_buffer_t::lock 0x655515b3a530 0 -> 1
BUFFS:  | 1 | 0 | 0 | 0 |
SLOTS:  | 1 | 0 | 0 | 0 |
BUFFS:  | 1 | 0 | 0 | 0 |
SLOTS:  | 1 | 0 | 0 | 0 |
BUFFS:  | 1 | 0 | 0 | 0 |
SLOTS:  | 1 | 0 | 0 | 0 |
  3. In gdb, one can see that it loops in segmentation_impl.cpp here:

Expected behavior
All files should be analyzed successfully.

@kleag
Copy link
Contributor Author

kleag commented May 11, 2024

Temporary solution in b462061: create a new instance of the tokenizer for each file. This is not too costly because tokenization models are very small, but it cannot be counted as a real fix.
And in fact, the tagger now crashes on the second file.

@kleag kleag changed the title Deeplima tokenizer blocks on second file If using the same tokenizer instance, deeplima tokenizer blocks on second file May 15, 2024
kleag added a commit that referenced this issue May 15, 2024
This is necessary while issue #172 is not solved