Issue with quotations(). Some punctuation marks not included...? #1154

beholdbible · 2024-10-20T11:45:43Z

Hi. Thank you for creating and sharing this tool; it's truly impressive, and I was amazed when I discovered it recently.

I am playing around with quotations(), and when I use .out('text') and doc.html(), some punctuation marks seem to be missing. The results differ from doc.json(), where the punctuation marks are included. Is this intended, or is it a bug?

<script src="https://unpkg.com/compromise"></script>
<script>
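  const doc = nlp(`This is a "test". Hello "World."`);

  // Select the quoted spans
  const hold = doc.quotations();

  // Compare the plain-text output with the JSON output
  console.log(hold.out('text'));
  console.log(hold.json());

  // Render the document, marking the quotations with the 'red' class
  document.body.innerHTML = doc.html({
    '.red': hold
  });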
</script>


spencermountain · 2024-10-21T15:08:37Z

hey @beholdbible - thank you for the good issue and kind words.
Seeing this, it's clear that we should shuffle the pre- and post- whitespace characters around a bit, to try and avoid this weirdness. Same as #1144 - this applies to any paired punctuation symbols.

Happy to look at this. It's tricky because the tokenizer doesn't know very much, so it will have to guess about some of the classification.
Will move this to plans for the next release.
cheers
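
For readers following along, here is a rough sketch of why paired punctuation can go missing when a selection is printed. It is a toy model only and does not use compromise's real tokenizer or internals: the helpers splitTerm, tokenize, naiveText and pairedText, and the PAIRS table, are hypothetical names made up for illustration. The assumption is simply that each term stores its leading and trailing punctuation separately from the word, so printing a sub-selection has to decide which edge characters belong to it.

```javascript
// Toy model only: NOT compromise's actual tokenizer or data structures.
// Each whitespace-separated token is split into leading punctuation (pre),
// the word itself (text), and trailing punctuation (post).
function splitTerm(token) {
  const match = token.match(/^([^\p{L}\p{N}]*)(.*?)([^\p{L}\p{N}]*)$/u);
  return { pre: match[1], text: match[2], post: match[3] };
}

function tokenize(str) {
  return str.split(/\s+/).filter(Boolean).map(splitTerm);
}

// Naive rendering of a span of terms: punctuation between terms is kept,
// but the first term's `pre` and the last term's `post` are dropped.
function naiveText(terms) {
  return terms
    .map((t, i) => (i > 0 ? t.pre : '') + t.text + (i < terms.length - 1 ? t.post : ''))
    .join(' ');
}

// Hypothetical table of opening marks and their closing partners.
const PAIRS = { '"': '"', '“': '”', '(': ')' };

// Pair-aware rendering: keep the opening mark from the first term's `pre`,
// plus trailing punctuation up to and including the matching closing mark.
function pairedText(terms) {
  const first = terms[0];
  const last = terms[terms.length - 1];
  const open = first.pre.slice(-1);
  const close = PAIRS[open];
  if (!close) return naiveText(terms);
  const end = last.post.indexOf(close);
  const tail = end === -1 ? '' : last.post.slice(0, end + 1);
  return open + naiveText(terms) + tail;
}

const terms = tokenize('This is a "test". Hello "World."');
const quote1 = terms.slice(3, 4); // the term for "test".
const quote2 = terms.slice(5, 6); // the term for "World."

console.log(naiveText(quote1));  // test
console.log(pairedText(quote1)); // "test"
console.log(pairedText(quote2)); // "World."
```

In this sketch the period after the closing quote in "test". stays outside the selection, while the period inside "World." is kept, because the cut-off is the matching closing mark. Whether the library should behave this way is a design decision for the maintainer; this only illustrates the kind of guessing the comment above refers to.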

spencermountain added yesss next-release labels Oct 21, 2024
