Added new '-w'/'--wait' option, accepting a number of seconds to sleep/wait between requests, plus a '--random-wait' option which will randomize the number of wait seconds by a factor of 0.5x-2x. These options are used by the new WaybackMachineDownloader#wait method, which is called before each subsequent request. Issue cocoflan#1
morgant committed May 20, 2024
1 parent 8e444a2 commit fcebbc3
Showing 2 changed files with 15 additions and 3 deletions.
8 changes: 6 additions & 2 deletions bin/wayback_machine_downloader
@@ -34,8 +34,12 @@ option_parser = OptionParser.new do |opts|
options[:exact_url] = t
end

opts.on("-o", "--only ONLY_FILTER", String, "Restrict downloading to urls that match this filter", "(use // notation for the filter to be treated as a regex)") do |t|
options[:only_filter] = t
opts.on("-w", "--wait SECONDS", Integer, "Wait the specified number of seconds between requests") do |t|
options[:wait_seconds] = t
end

opts.on("--random-wait", "When used with --wait, randomize number of seconds waited between requests by a factor of 0.5 to 2") do |t|
options[:wait_randomized] = true
end

opts.on("-x", "--exclude EXCLUDE_FILTER", String, "Skip downloading of urls that match this filter", "(use // notation for the filter to be treated as a regex)") do |t|
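
The option handling above can be tried in isolation. The following is a minimal sketch, not the project's actual script: `parse_wait_options` is a hypothetical helper introduced only for illustration, and the `:wait_randomized` key is chosen to match the `params[:wait_randomized]` lookup in lib/wayback_machine_downloader.rb.

```ruby
require 'optparse'

# Hypothetical helper (not part of the commit): parse an argv-style array
# into an options hash using the two flags this commit introduces.
def parse_wait_options(argv)
  options = {}
  OptionParser.new do |opts|
    # Integer makes OptionParser reject non-numeric values such as "-w abc".
    opts.on("-w", "--wait SECONDS", Integer, "Wait the specified number of seconds between requests") do |t|
      options[:wait_seconds] = t
    end

    # A flag without an argument; the block only records that it was given.
    opts.on("--random-wait", "When used with --wait, randomize the wait by a factor of 0.5 to 2") do
      options[:wait_randomized] = true
    end
  end.parse!(argv)
  options
end

p parse_wait_options(%w[-w 5 --random-wait])
```

Leaving out `--wait` leaves `:wait_seconds` unset, which the library side turns into 0 through `nil.to_i`, so no delay is applied by default.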
10 changes: 9 additions & 1 deletion lib/wayback_machine_downloader.rb
@@ -18,7 +18,7 @@ class WaybackMachineDownloader

attr_accessor :base_url, :exact_url, :directory, :all_timestamps,
:from_timestamp, :to_timestamp, :only_filter, :exclude_filter,
:all, :maximum_pages, :threads_count
:all, :maximum_pages, :threads_count, :wait_seconds, :wait_randomized

def initialize params
@base_url = params[:base_url]
@@ -32,6 +32,8 @@ def initialize params
@all = params[:all]
@maximum_pages = params[:maximum_pages] ? params[:maximum_pages].to_i : 100
@threads_count = params[:threads_count].to_i
@wait_seconds = params[:wait_seconds].to_i
@wait_randomized = params[:wait_randomized]
end

def backup_name
@@ -89,6 +91,7 @@ def get_all_snapshots_to_consider
print "."
unless @exact_url
@maximum_pages.times do |page_index|
wait
snapshot_list = get_raw_list_from_api(@base_url + '/*', page_index)
break if snapshot_list.empty?
snapshot_list_to_consider += snapshot_list
@@ -208,6 +211,7 @@ def download_files
@threads_count.times do
threads << Thread.new do
until file_queue.empty?
wait
file_remote_info = file_queue.pop(true) rescue nil
download_file(file_remote_info) if file_remote_info
end
@@ -313,4 +317,8 @@ def file_list_by_timestamp
def semaphore
@semaphore ||= Mutex.new
end

def wait
@wait_seconds.positive? && @wait_randomized ? sleep(@wait_seconds.to_f * rand(0.5..2.0)) : sleep(@wait_seconds)
end
end
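
The delay calculation described in the commit message can be sketched separately from the downloader class. This is an illustrative sketch under assumptions: `wait_delay` and its `rng:` parameter are hypothetical names that do not appear in the commit. It draws the factor from a Float range because `Kernel#rand` truncates a non-integer argument such as 1.5 to an integer, which would not yield a fractional factor.

```ruby
# Hypothetical standalone helper (name and signature are assumptions):
# returns the number of seconds to sleep before the next request.
def wait_delay(wait_seconds, randomized, rng: Random.new)
  # No delay configured: nothing to wait for.
  return 0.0 unless wait_seconds.positive?
  # Fixed delay when randomization is off.
  return wait_seconds.to_f unless randomized

  # Random#rand with a Float range returns a Float within that range,
  # so the delay falls between 0.5x and 2x the configured seconds.
  wait_seconds * rng.rand(0.5..2.0)
end

delays = Array.new(5) { wait_delay(4, true) }
puts delays.inspect
```

A caller would pass the result to `sleep`, for example `sleep(wait_delay(4, true))`. Because each download thread calls `wait` on its own, the effective request rate also depends on the thread count.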
