groupAll takes a HUGE amount of time #29

Open
pferrel opened this issue Nov 19, 2016 · 2 comments

Comments

pferrel commented Nov 19, 2016

Running on a large cluster with medium-sized data (100 MB), this stage takes 9.2 hours, by far the longest phase. Any ideas @laser13 @alexice? This is not very large data, and it is running on 4 r3.4xlarge AWS instances.

pferrel commented Nov 19, 2016

Here is the old implementation. Should I try putting it back in?

def groupAll( fields: Seq[RDD[(String, (Map[String, Any]))]]): RDD[(String, (Map[String, Any]))] = {

alexice commented Nov 19, 2016

Yes, it would be good to compare the total time and the stage time against the previous code. This looks weird. Maybe it is because of laziness, and some other calculations were attributed to this line?

On Nov 19, 2016, at 23:51 , Pat Ferrel [email protected] wrote:

Running on a large cluster with medium-sized data (100 MB), this stage takes 9.2 hours, by far the longest phase. Any ideas @laser13 @alexice? This is not very large data, and it is running on 4 r3.4xlarge AWS instances. We are only using popularity, no random or user-defined ranking.
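
One way to check the laziness theory is to materialize the inputs before the grouping runs and then time the grouping by itself. Spark pipelines narrow transformations into the stage that finally executes them, so work done upstream can show up under the groupAll line in the UI. The following is only a rough sketch under stated assumptions: `groupAllStandIn` is a hypothetical placeholder (the real body is not shown in this thread), `timed` is an illustrative helper, and the toy inputs and their keys (`popRank`, `defaultRank`) are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object GroupAllTiming {
  type Fields = RDD[(String, Map[String, Any])]

  // HYPOTHETICAL stand-in: the real groupAll body is not shown in this thread.
  // Swap in the actual implementation when measuring.
  def groupAllStandIn(fields: Seq[Fields]): Fields =
    fields.reduce(_ union _).reduceByKey(_ ++ _)

  // Runs `body` and returns its result plus elapsed wall-clock seconds.
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e9)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("groupAll-timing").setMaster("local[*]"))
    try {
      // Toy inputs (assumed data) standing in for the real ranking fields.
      val a: Fields = sc.parallelize(Seq("i1" -> Map[String, Any]("popRank" -> 1.0)))
      val b: Fields = sc.parallelize(Seq("i1" -> Map[String, Any]("defaultRank" -> 2.0)))

      // Step 1: cache and force the upstream lineage so its cost is measured separately.
      val inputs = Seq(a, b).map(_.persist(StorageLevel.MEMORY_AND_DISK))
      val (_, upstreamSecs) = timed(inputs.foreach(_.count()))

      // Step 2: time only the grouping, which now reads from the cached inputs.
      val (rows, groupSecs) = timed(groupAllStandIn(inputs).count())

      println(f"upstream: $upstreamSecs%.2f s, grouping: $groupSecs%.2f s, rows: $rows")
    } finally sc.stop()
  }
}
```

If most of the time lands in step 1, the 9.2 hours probably belong to earlier calculations rather than to groupAll itself. Note that persisting the inputs changes memory use, so the numbers are a diagnostic, not a like-for-like benchmark.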


