Java command line application to download crawls from WASAPI.
You'll need the following prerequisites installed on your local computer:
- Java (7)
- Ruby (we use Capistrano for deployment)
The minimal sequence of steps to verify that you can work with the code is:
git clone https://github.com/sul-dlss/wasapi-downloader.git
cd wasapi-downloader
./gradlew installDist
(compile and test the code and create a script to execute it)./build/install/wasapi-downloader/bin/wasapi-downloader --help
(explains usage)
An example invocation of the downloader:
./build/install/wasapi-downloader/bin/wasapi-downloader --collectionId 123 --crawlStartAfter 2014-03-14
This repository contains an example config/settings.properties
file with dummy values for the required configuration settings. In order to successfully execute the Java application, you will need to override these default settings.
wasapi-downloader uses the gradle wrapper (https://docs.gradle.org/3.3/userguide/gradle_wrapper.html) so users don't have to worry about installing gradle. However, using the gradle wrapper once (gradlew [task]
) installs gradle on your system and from then forward you can simply execute gradle [tasks]
rather than gradlew [tasks]
(though either will work).
wasapi-downloader is built using Gradle. To create a runnable installation with all needed jars and shell script (cleaning out old builds first):
./gradle clean installDist
List all available build tasks:
./gradle tasks
To run:
./build/install/wasapi-downloader/bin/wasapi-downloader --help
(explain usage)
An example invocation of the downloader:
./build/install/wasapi-downloader/bin/wasapi-downloader --collectionId 123 --crawlStartAfter 2014-03-14
See more examples below in the Production section.
Capistrano is used for deployment to Stanford VMs.
-
On your laptop, run
bundle
to install the Ruby capistrano gems and other dependencies for deployment.
-
Deploy code to remote VM:
cap <environment> deploy
<environment>
is eitherdev
,stage
orprod
, as specified inconfig/deploy/
.This will also get our (Stanford's) latest configuration settings.
The deployment command shown above creates an executable Java application. After logging onto the production server you may run wasapi-downloader by following these steps:
cd wasapi-downloader/current/
./build/install/wasapi-downloader/bin/wasapi-downloader <args>
The --help
option will display a message listing all of the arguments:
./build/install/wasapi-downloader/bin/wasapi-downloader --help
Some of the available command line arguments have a default value set in config/settings.properties
. --help
will display the current configuration as taken from the settings.properties
file. Command line arguments will override values set from config/settings.properties
.
For many users of the production instance of wasapi-downloader, the following examples will be relevant/helpful:
./build/install/wasapi-downloader/bin/wasapi-downloader
./build/install/wasapi-downloader/bin/wasapi-downloader --collectionId 8001
./build/install/wasapi-downloader/bin/wasapi-downloader --collectionId 8001 --crawlStartAfter 2014-01-01
Download all crawl files for a certain collection (ex. 8001) created before a certain date (ex: 2012) into a particular output directory (ex. /tmp/
, which override the config.settings
default value):
./build/install/wasapi-downloader/bin/wasapi-downloader --collectionId 8001 --crawlStartBefore 2012-01-01 --outputBaseDir /tmp/
Download all crawl files available for a certain collection (more likely) created before a certain date (ex: 2012) and after a certain date (ex: 2014)
./build/install/wasapi-downloader/bin/wasapi-downloader --collectionId 8001 --crawlStartBefore 2012-01-01 --crawlStartAfter 2014-01-01
./build/install/wasapi-downloader/bin/wasapi-downloader --filename ARCHIVEIT-5425-MONTHLY-JOB302671-20170526114117181-00049.warc.gz
Note: When a --filename
argument is present, all other request parameters (crawl start/end, collection ID, job ID) are ignored.