
Issue with quotations(). Some punctuation marks not included...? #1154

Open
beholdbible opened this issue Oct 20, 2024 · 1 comment

beholdbible commented Oct 20, 2024

Hi. Thank you for creating and sharing this tool; it's truly impressive, and I was amazed when I discovered it recently.

I am playing around with quotations(), and when I use .out('text') or doc.html(), some punctuation marks seem to be missing (the results differ from doc.json(), where the punctuation marks are included). Is this intended or a bug?

[Screenshot: compromise-ss]

<script src="https://unpkg.com/compromise"></script>
<script>
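  // Two sentences, each with a quoted span next to sentence punctuation
  const doc = nlp(`This is a "test". Hello "World."`);

  // Select the quoted spans
  const hold = doc.quotations();

  // Text output: some punctuation marks are missing here
  console.log(hold.out('text'));
  // JSON output: the punctuation marks are included here
  console.log(hold.json());

  // Render the document as HTML, wrapping the quotations in class "red"
  document.body.innerHTML = doc.html({
    '.red': hold
  });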
</script>
@spencermountain
Owner

hey @beholdbible - thank you for the good issue and kind words.
Seeing this, it's clear that we should shuffle the pre- and post- whitespace characters around a bit to try and avoid this weirdness. Same as #1144, for any paired punctuation symbols.

Happy to look at this. It's tricky because the tokenizer doesn't know very much, so it will have to guess about some of the classification.
Will move this to plans for the next release.
cheers
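
In the meantime, a possible workaround is to rebuild the text from the json() output, since that is where the punctuation still shows up. This is only a rough sketch: it assumes each term in the json() result carries `text`, `pre`, and `post` fields, `rebuildText` is a hypothetical helper (not part of compromise), and the sample object is hand-written to mimic that shape rather than copied from real library output.

```javascript
// Hypothetical helper: rebuild the surface text of one json() entry by
// joining each term with its leading (pre) and trailing (post) characters.
function rebuildText(entry) {
  return entry.terms
    .map((t) => (t.pre || '') + t.text + (t.post || ''))
    .join('')
    .trim(); // drop whitespace left over at either end
}

// Hand-written sample in the assumed json() shape (not real library output).
const sample = {
  text: 'test',
  terms: [{ text: 'test', pre: '"', post: '". ' }],
};

console.log(rebuildText(sample));
```

With real data, something like `doc.quotations().json().map(rebuildText)` might do the job, but whether the quote marks end up in `pre`/`post` of the selected terms depends on how the tokenizer splits them, which is exactly the part under discussion here.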
