
Parallelize scraping #6

Open
Mehonoshin opened this issue Nov 16, 2019 · 1 comment
@Mehonoshin
Member

It should be up to the crawler developer whether they want to parallelize the process or not.
For example, for some social networks it is not a good idea to scrape with many parallel threads.

For now we need to come up with a proof of concept that allows parallelizing certain pieces as separate Sidekiq jobs.

For a gallery crawler it makes sense to parallelize the scraping of each separate page to make it faster:

parallelize do |context|
  # some action
end

Passing a block to this helper should spawn a separate job that receives all the necessary context, for example the page URL and maybe cookies.
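
A minimal sketch of how this could look with plain Sidekiq, assuming a GalleryCrawler class and a PageScrapeJob worker that do not exist in the project yet. Sidekiq cannot serialize a Ruby block, so in this sketch the block's body lives in a dedicated worker class and only plain-data context (URL, cookies) is passed through Redis:

require "sidekiq"
require "net/http"
require "uri"

# Hypothetical worker: one instance scrapes exactly one gallery page.
class PageScrapeJob
  include Sidekiq::Worker

  # context is a plain JSON-serializable hash: page URL, cookies, etc.
  def perform(context)
    html = Net::HTTP.get(URI(context["url"]))
    # ... parse `html` and persist the extracted items ...
  end
end

# Hypothetical crawler-side helper: enqueue one job per page
# instead of scraping the pages sequentially.
class GalleryCrawler
  def parallelize(context)
    PageScrapeJob.perform_async(context)
  end

  def crawl(page_urls, cookies)
    page_urls.each { |url| parallelize("url" => url, "cookies" => cookies) }
  end
end

Whether to parallelize at all stays in the crawler developer's hands: a social-network crawler can simply run the same scraping code inline instead of calling perform_async.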

Mehonoshin self-assigned this Nov 16, 2019
@Mehonoshin
Member Author

The question is, how are we going to collect all the scraped data?
We need to come up with some synchronization mechanism, or each worker should report its results separately.
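
One possible answer, sketched below under the assumption that every job pushes its result into a shared Redis list keyed by a batch id; the key layout and the ResultCollector class are hypothetical, not part of the project. Sidekiq Pro's Batches feature would be another option, since it gives a callback once all page jobs have finished.

require "sidekiq"
require "json"

class PageScrapeJob
  include Sidekiq::Worker

  def perform(batch_id, context)
    data = { "url" => context["url"], "items" => [] } # placeholder for the real scraping result
    # Each worker reports its own result; no in-process synchronization needed.
    Sidekiq.redis do |conn|
      conn.rpush("crawl:#{batch_id}:results", JSON.dump(data))
    end
  end
end

# Hypothetical collector: reads whatever the workers have reported so far for one crawl run.
class ResultCollector
  def self.results(batch_id)
    Sidekiq.redis do |conn|
      conn.lrange("crawl:#{batch_id}:results", 0, -1).map { |raw| JSON.parse(raw) }
    end
  end
end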
