SimilarityAnalysis.cooccurrencesIDSs performance bottleneck #22

Open
erebus1 opened this issue Nov 16, 2016 · 0 comments

Hi,

I'm using the UR (v0.2.3) template and am having trouble with scalability.

Training takes 18 hours (each day), and for the last 12 hours it uses only one core.
As far as I can see, URAlgorithm.scala (line 144) calls SimilarityAnalysis.cooccurrencesIDSs
with data.actions (12 partitions).

Up until reduceByKey in AtB.scala it executes in parallel,
but after that it executes in a single thread.

It is strange that when SimilarityAnalysis.scala (line 145) calls
indexedDatasets(0).create(drm, indexedDatasets(0).columnIDs, indexedDatasets(i).columnIDs)
it returns an IndexedDataset with only one partition.

As far as I can see, SimilarityAnalysis.scala (line 63) has
drmARaw.par(auto = true)
Maybe this is what decreases the number of partitions.
As far as I can see, the master branch of Mahout
has ParOpt:
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala#L142
Maybe this can fix the problem.
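
One way to confirm where the partition count collapses is to print it at each stage. The following is only a minimal sketch in plain Spark, not the UR or Mahout code path: the object name `PartitionCheck`, the helper `partitionCounts`, and the toy data are hypothetical, and `coalesce(1)` merely stands in for whatever step ends up producing a single partition. It assumes Spark is on the classpath and runs in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, for illustration only (not part of UR or Mahout).
object PartitionCheck {

  // Returns the partition counts (input, collapsed, restored) for a toy RDD.
  def partitionCounts(sc: SparkContext): (Int, Int, Int) = {
    // Stand-in for data.actions: 12 partitions, as in the report above.
    val actions = sc.parallelize(1 to 1200, numSlices = 12)

    // Stand-in for the step that leaves a single partition;
    // any stage running on this RDD gets exactly one task.
    val collapsed = actions.coalesce(1)

    // Explicitly restoring the partition count (this triggers a shuffle).
    val restored = collapsed.repartition(12)

    (actions.partitions.length, collapsed.partitions.length, restored.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-check").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      val (in, mid, out) = partitionCounts(sc)
      println(s"input=$in collapsed=$mid restored=$out")
    } finally {
      sc.stop()
    }
  }
}
```

If the same kind of check on the real job shows the count dropping to 1 right after the `par(auto = true)` call, that would support the suspicion above; if it drops elsewhere, the cause is somewhere else.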

So, am I right about the root of the problem, and how can I fix it?
