Compound score diverges for long sequences #151

Open
VincentGurgul opened this issue Jul 5, 2024 · 1 comment


@VincentGurgul

The compound score has a serious flaw: it diverges for long sequences. Example:

>>> polarity_scores('bad good')
{'neg': 0.547, 'neu': 0.0, 'pos': 0.453, 'compound': -0.1531}
>>> polarity_scores('bad good bad good bad good bad good bad good bad good bad good bad good bad good bad')
{'neg': 0.547, 'neu': 0.0, 'pos': 0.453, 'compound': -0.8979}

It seems the 'neg' and 'pos' scores are averages, whereas the 'compound' score is some sort of sum. As a result, the compound score always takes on extreme values for long sequences, like Reddit posts or news articles.
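To make the "some sort of sum" point concrete, here is a minimal sketch that reproduces the two compound values above by hand. It assumes the compound score is the summed word valence squashed by x / sqrt(x^2 + alpha) with alpha = 15, and that the lexicon valences are -2.5 for "bad" and 1.9 for "good". Both are my reading of the library, not something shown in this thread, and the names below are illustrative only.

```python
import math

# Assumed lexicon valences for the two words in the example (not taken from this thread).
BAD, GOOD = -2.5, 1.9

def normalize(score, alpha=15.0):
    """Squash an unbounded valence sum into (-1, 1); alpha = 15 is an assumed default."""
    return score / math.sqrt(score * score + alpha)

# 'bad good': one negative and one positive word.
short_compound = normalize(BAD + GOOD)

# The long example: 10 x 'bad' and 9 x 'good'. The raw sum grows with length.
long_compound = normalize(10 * BAD + 9 * GOOD)

print(round(short_compound, 4))  # -0.1531
print(round(long_compound, 4))   # -0.8979
```

Under these assumptions the raw sum scales with the number of sentiment words, so the squashing function pushes any long text with a slight imbalance toward -1 or +1.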

This is particularly unfortunate, since a lot of beginners will blindly use the compound score without noticing this and get discouraged by the poor results. I suggest either replacing the current implementation of the compound score with compound = pos - neg or removing it completely.
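A minimal sketch of that replacement, applied to the output dict rather than to the library itself. The helper name `balanced_compound` is hypothetical, and the input is simply the dict shape shown in the example above.

```python
def balanced_compound(scores):
    """Length-independent alternative: difference of the averaged pos and neg scores."""
    return scores['pos'] - scores['neg']

# Scores copied from the 'bad good' example above.
example = {'neg': 0.547, 'neu': 0.0, 'pos': 0.453, 'compound': -0.1531}

print(round(balanced_compound(example), 3))  # -0.094
```

Because 'pos' and 'neg' are proportions, this value stays in [-1, 1] and does not grow with the length of the text.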

@VincentGurgul

It's regrettable that a package that is part of NLTK has such a serious bug at all, and even more disheartening that it isn't being addressed...
