Handling corrupt unicode on annotation creation #6

Merged
merged 1 commit into from
Jul 31, 2024

@dat-boris dat-boris commented Jul 31, 2024

Summary:

This is an odd bug! On a specific browser, users have noticed that Label
Studio fails with the following error:

psycopg2.errors.InvalidTextRepresentation: invalid input syntax for
type json
DETAIL: Unicode low surrogate must follow a high surrogate

Note: we do not rely on any Unicode in the annotation, since we do not use
the text to reference our annotations. So for now we simply remove any Unicode
to avoid issues with potentially corrupted characters.

Test Plan:

We can reproduce this by forcing the annotation endpoint to consume Unicode.

  1. Use the "Copy as cURL" option in your browser to copy the "update annotation" call from Label Studio.

  2. Inject some Unicode into the cURL command, e.g.

curl 'http://localhost:8080/api/annotations/126?taskID=5979&project=43' \
  -X 'PATCH' \
  #.... snipped
  --data-raw $'{"result":[{"value":{"value":{"start":"/div[1]/div[1]/text()[1]","startOffset":0,"end":"/div[1]/div[1]/text()[1]","endOffset":88,"globalOffsets":{"start":0,"end":88},"text":"Some unicode: \udfff	"  ...],"draft_id":0,"parent_prediction":null,"parent_annotation":null,"project":"43"}'
  3. Run it against the dev server. Previously this caused a 500; with the fix it should return 200.

@dat-boris dat-boris self-assigned this Jul 31, 2024
This is an odd bug! On a specific browser, users have noticed that Label
Studio fails with the following error:

```
psycopg2.errors.InvalidTextRepresentation: invalid input syntax for
type json
DETAIL: Unicode low surrogate must follow a high surrogate
```
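
As a rough illustration of where such an escape can come from (a sketch with hypothetical names, not code from this PR): Python's `json` module accepts an unpaired surrogate on input and, with the default `ensure_ascii=True`, writes it back out as a `\udfff` escape. That is presumably the form the database then rejects.

```python
import json


def has_lone_surrogate(text: str) -> bool:
    """Return True if `text` holds a surrogate code point that UTF-8 cannot encode.

    Hypothetical helper for illustration only.
    """
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


# json.loads is lenient: an unpaired low-surrogate escape is accepted as-is.
payload = json.loads('{"text": "Some unicode: \\udfff"}')

# With the default ensure_ascii=True the surrogate is re-emitted as an escape.
serialized = json.dumps(payload)
```

Here `has_lone_surrogate(payload["text"])` is true, and `serialized` still carries the escaped surrogate.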

We work around this by doing a round trip through JSON. Using
`ensure_ascii=False` ensures that any Unicode characters are
preserved, and therefore escaped on the round trip back.
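
A minimal sketch of what such a round trip could look like. The function name is hypothetical, and dropping unencodable characters via `errors="ignore"` is an assumption for illustration; the merged commit may handle them differently.

```python
import json


def strip_corrupt_unicode(payload):
    """Round-trip `payload` through JSON, dropping unpaired surrogates.

    Hypothetical helper: ensure_ascii=False keeps characters unescaped, so
    the UTF-8 encoder can discard the ones it cannot represent.
    """
    text = json.dumps(payload, ensure_ascii=False)
    # Lone surrogates are not valid UTF-8; errors="ignore" removes them.
    cleaned = text.encode("utf-8", errors="ignore").decode("utf-8")
    return json.loads(cleaned)


result = strip_corrupt_unicode({"text": "Some unicode: \udfff"})
```

Well-formed non-ASCII text passes through this sketch unchanged; only code points that cannot be encoded as UTF-8 are removed.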

Note: we do not rely on any Unicode in the annotation, since we do not use
the text to reference our annotations.

Test Plan:

We can reproduce this by forcing the annotation endpoint to consume Unicode.

1. Use the `Copy as cURL` option in your browser to copy the "update annotation" call from Label Studio.

2. Inject some Unicode into the cURL command, e.g.

```
curl 'http://localhost:8080/api/annotations/126?taskID=5979&project=43' \
  -X 'PATCH' \
  #.... snipped
  --data-raw $'{"result":[{"value":{"value":{"start":"/div[1]/div[1]/text()[1]","startOffset":0,"end":"/div[1]/div[1]/text()[1]","endOffset":88,"globalOffsets":{"start":0,"end":88},"text":"Some unicode: \udfff	"  ...],"draft_id":0,"parent_prediction":null,"parent_annotation":null,"project":"43"}'
```

3. Run it against the dev server. Previously this caused a 500; with the fix it should return 200.
@dat-boris dat-boris changed the title from "Handlig unicode on annotation creation" to "Handling corrupt unicode on annotation creation" Jul 31, 2024
@dat-boris dat-boris requested review from lizfaubell and a team July 31, 2024 17:29
@dat-boris dat-boris marked this pull request as ready for review July 31, 2024 17:29
@lizfaubell lizfaubell left a comment

Very nice! Good catch finding a way to reproduce it!

@dat-boris dat-boris merged commit 7bdfdae into develop Jul 31, 2024
2 checks passed