
Parallelize scraping #6

Open
Mehonoshin opened this issue Nov 16, 2019 · 1 comment
@Mehonoshin
Member

It should be up to the crawler developer whether they want to parallelize the process or not.
For example, for some social networks it is not a good idea to scrape with many parallel threads.

For now we need to come up with a proof of concept that allows parallelizing certain pieces as separate Sidekiq jobs.

For a gallery crawler it makes sense to parallelize the scraping of each separate page to make it faster:

parallelize do |context|
  # some action
end

Passing a block to this helper should spawn a separate job that receives all the necessary context, for example the page URL and maybe cookies.
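
A minimal sketch of how this could look with plain Sidekiq, assuming a GalleryCrawler class and a PageScrapeJob worker that do not exist in the project yet. Sidekiq cannot serialize a Ruby block, so in this sketch the block's body lives in a dedicated worker class and only plain-data context (URL, cookies) is passed through Redis:

require "sidekiq"
require "net/http"
require "uri"

# Hypothetical worker: one instance scrapes exactly one gallery page.
class PageScrapeJob
  include Sidekiq::Worker

  # context is a plain JSON-serializable hash: page URL, cookies, etc.
  def perform(context)
    html = Net::HTTP.get(URI(context["url"]))
    # ... parse `html` and persist the extracted items ...
  end
end

# Hypothetical crawler-side helper: enqueue one job per page
# instead of scraping the pages sequentially.
class GalleryCrawler
  def parallelize(context)
    PageScrapeJob.perform_async(context)
  end

  def crawl(page_urls, cookies)
    page_urls.each { |url| parallelize("url" => url, "cookies" => cookies) }
  end
end

Whether to parallelize at all stays in the crawler developer's hands: a social-network crawler can simply run the same scraping code inline instead of calling perform_async.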

Mehonoshin self-assigned this Nov 16, 2019
@Mehonoshin
Member Author

The question is, how are we going to collect all the scraped data?
We need to come up with some synchronization mechanism, or each worker should report its results separately.
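
One possible answer, sketched below under the assumption that every job pushes its result into a shared Redis list keyed by a batch id; the key layout and the ResultCollector class are hypothetical, not part of the project. Sidekiq Pro's Batches feature would be another option, since it gives a callback once all page jobs have finished.

require "sidekiq"
require "json"

class PageScrapeJob
  include Sidekiq::Worker

  def perform(batch_id, context)
    data = { "url" => context["url"], "items" => [] } # placeholder for the real scraping result
    # Each worker reports its own result; no in-process synchronization needed.
    Sidekiq.redis do |conn|
      conn.rpush("crawl:#{batch_id}:results", JSON.dump(data))
    end
  end
end

# Hypothetical collector: reads whatever the workers have reported so far for one crawl run.
class ResultCollector
  def self.results(batch_id)
    Sidekiq.redis do |conn|
      conn.lrange("crawl:#{batch_id}:results", 0, -1).map { |raw| JSON.parse(raw) }
    end
  end
end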
