High memory consumption when using sentiment() function #39
Comments
Can you make both parts of this reproducible? The stringi package has tools to generate random text that you can use to mimic the data you're talking about.
Thank you for the hint. Below you can find the minimal example and the profiler run-through, followed by my session info.
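The code chunk, profiler output, and session info referenced here were not preserved in this copy of the thread. As an illustration only, a minimal reproduction in the spirit of the suggestion above might look like the following; the seed and text volume are placeholders, not the reporter's actual data:

```r
library(stringi)
library(sentimentr)

set.seed(1)
# A few hundred kB of lorem-ipsum paragraphs to stand in for the real corpus
txt <- stri_rand_lipsum(1000)

# The call that triggers the large memory allocation discussed in this issue
pol <- sentiment(txt)
head(pol)
```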
sentimentr works at the sentence level, so in the example you provide the split into sentences produced ~500K sentences. This runs for me but will certainly consume a fair bit of memory. There may be ways to improve sentimentr's memory consumption, but I have not found one; if someone sees this and sees a way to make sentimentr more memory efficient, a PR is welcomed. I used data.table for speed reasons, not memory, and I'm guessing there are ways to improve my code in this respect. Until then, my suggestion is to chunk the text and loop through it with a manual memory release.

My second thought is that perhaps sentimentr isn't the right tool for this job. I state in the README that the tool is designed to balance the trade-off between accuracy and speed. I don't address memory, but if you're chugging through that much text you're going to have to balance your own trade-offs. I evaluate a number of sentiment tools in the package README. One tool I evaluate is Drew Schmidt's meanr (https://github.com/wrathematics/meanr). It is written in low-level C, is very fast, and should be memory efficient as well. His work is excellent and specifically targeted at the type of analysis you seem to be doing, so it might be the better choice.

Both of our packages have READMEs that explain the package philosophies/goals very well. I suggest starting there and asking whether you care about the added accuracy of sentimentr enough to chunk your text and loop through it. If not, it's not the tool for this task. That being said, I want to leave this issue open: if any community members want to look through the code and optimize memory usage, the improvement would be welcomed.
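A minimal sketch of the chunk-and-loop approach suggested above; the chunk size, the helper name, the `my_texts` vector, and the use of `gc()` as the manual memory release are illustrative choices, not part of sentimentr's API:

```r
library(sentimentr)
library(data.table)

# Score a large character vector of documents in chunks,
# releasing memory between chunks.
score_in_chunks <- function(txt, chunk_size = 10000) {
  groups <- split(seq_along(txt), ceiling(seq_along(txt) / chunk_size))
  out <- vector("list", length(groups))
  for (i in seq_along(groups)) {
    out[[i]] <- sentiment(txt[groups[[i]]])
    gc()  # manual memory release between chunks
  }
  # Note: element_id in each result is relative to its own chunk
  rbindlist(out, idcol = "chunk")
}

scores <- score_in_chunks(my_texts)  # `my_texts`: your character vector of documents
```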
Thank you for the precious answer. You provide good reasons, so I'll check the meanr package as you suggest. What puzzles me, though, is that data.table was conceived not only for speed but also for memory efficiency; its "by-reference" paradigm aims specifically at minimising the internal copies that are so common in the R environment. Anyway, I suspect you are right. My texts contain many sentences, sometimes even more than they should because of HTML tagging and other fancy stuff. I will try to keep this updated, and when I have time it could be worthwhile to take a look at the internals of sentimentr. Thank you again.
I suspect a true data.table whizz would see how to optimize this (@mattdowle would likely feel sick if he saw how I've used data.table). So I'm saying let's assume the issue is my misuse of data.table, not data.table itself.
data.table works like magic. No doubt about this. Stop.
Referred here from another forum by Trinker. Profiling should give you what is consuming the most memory; here is a quick guide: https://github.com/MarcoDVisser/aprof#memory-statisics (this is on condition that the memory isn't being consumed in a lower-level language). I'll be happy to help think about what is causing the "high consumption". M
Hi trinker, looking at https://github.com/trinker/sentimentr/blob/master/R/utils.R I see a bunch of potential problems (e.g. the potential use of non-vectorized ifelse statements), which may in fact not be problems at all. It all depends on how these functions are used and how they are "fed" data. Hence, we would need more detailed profiling. Would you mind running the targetedSummary function on line 262? As you appear to use data.table, I'll be interested to see which functions are consuming so much memory. M.
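For context, a sketch of what such a profiling run could look like, assuming the aprof workflow from its README (aprof() pairing a source file with Rprof output, targetedSummary() drilling into one line). Getting line numbers for an installed package's internals requires the sources to be kept, so the file names, the `txt` vector, and the exact argument names here are assumptions to adapt rather than a turnkey recipe:

```r
library(aprof)
library(sentimentr)

# Profile a sentiment() run with line- and memory-profiling switched on
Rprof("sentiment.prof", line.profiling = TRUE, memory.profiling = TRUE)
res <- sentiment(txt)   # `txt`: the problem text (placeholder)
Rprof(NULL)

# Pair the profile with a local copy of sentimentr's R/utils.R,
# then summarise what line 262 consumes
prof <- aprof("utils.R", "sentiment.prof")
targetedSummary(prof, target = 262)
```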
#46 may reduce some memory consumption.
I am running some polarity computation through the function `sentiment()`. What I am experiencing, even for small pieces of text, is a huge amount of allocated RAM. Sometimes I also get the following error:

Error in `[.data.table`(word_dat, , .(non_pol = unlist(non_pol)), by = c("id", :
  negative length vectors are not allowed
Calls: assign -> compute_tone -> sentiment -> [ -> [.data.table
Execution halted
A character vector of 669 kB (computed through `object_size()` from the pryr package) leads to a peak allocation of 3.590 GB of RAM, which is impressive. This is causing some problems, as you can imagine, when texts get longer. I know you have developed everything using the data.table package (I did the same for my own package), so this sounds strange to me.
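One way to observe the kind of peak allocation described above, using pryr::object_size() for the input and base R's gc() counters for the peak; this measurement recipe is an editorial suggestion rather than part of the original report, and `txt` is a placeholder for the 669 kB vector:

```r
library(pryr)
library(sentimentr)

object_size(txt)      # size of the input character vector

gc(reset = TRUE)      # reset the "max used" memory counters
res <- sentiment(txt)
gc()                  # the "max used" column now reflects the peak allocation of the run
```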
Do you have any hints or are you aware of this issue?
I am not including a minimal example since this analysis can easily be performed through the profiling tool in RStudio.
Thanks