Command-line preservation archiving tool for S3
The aws-archiver tool is intended to facilitate the deposit of assets to Amazon S3 storage while ensuring end-to-end asset fixity and the creation of auditable deposit records.
To install the tool for system-wide access, the recommended method is via pip:

```bash
$ git clone https://www.github.com/umd-lib/aws-archiver
$ cd aws-archiver && pip install -e .
```
To see the list of available subcommands, run:

```bash
$ archiver --help
```

For help with a particular subcommand, run:

```bash
$ archiver <SUBCOMMAND> --help
```

where `<SUBCOMMAND>` is the name of the subcommand. For example, for the `deposit` subcommand:

```bash
$ archiver deposit --help
```
```
usage: archiver deposit [-h] -b BUCKET [-c CHUNK] [-l LOGS] [-n NAME]
                        [-p PROFILE] [-r ROOT] [-s STORAGE] [-t THREADS]
                        (-m MAPFILE | -a ASSET) [--dry-run]

Deposit a batch of resources to S3

options:
  -h, --help            show this help message and exit
  -b BUCKET, --bucket BUCKET
                        S3 bucket to deposit files into
  -c CHUNK, --chunk CHUNK
                        Chunk size for multipart uploads
  -l LOGS, --logs LOGS  Location to store log files
  -n NAME, --name NAME  Batch identifier or name
  -p PROFILE, --profile PROFILE
                        AWS authorization profile
  -r ROOT, --root ROOT  Root dir of files being archived
  -s STORAGE, --storage STORAGE
                        S3 storage class
  -t THREADS, --threads THREADS
                        Maximum number of concurrent threads
  -m MAPFILE, --mapfile MAPFILE
                        Archive assets in inventory file
  -a ASSET, --asset ASSET
                        Archive a single asset
  --dry-run             Perform a "dry run" without actually contacting AWS.
```
The "deposit" subcommand is used to deposit either a single asset (using the "-a/--asset" argument) or multiple assets in a single batch (using the "-m/--mapfile" argument).
For historical reasons, a "dep" alias is provided for the "deposit" subcommand.
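For illustration, a single-asset deposit and a mapfile deposit might look like the following (the bucket name, batch name, and paths here are hypothetical):

```bash
# Deposit a single asset into a (hypothetical) bucket
$ archiver deposit -b my-archive-bucket -n my_batch -a /data/assets/film0001.mov

# Deposit every asset listed in a mapfile; --dry-run previews the batch
# without actually contacting AWS
$ archiver deposit -b my-archive-bucket -n my_batch -m mapfile.txt --dry-run
```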
The "--mapfile" argument uses files in one of three different batch manifest formats:
- md5sum manifest files
- patsy-db manifest files
- inventory manifest files
An md5sum manifest file is a text file listing one asset per line, in the form:

```
<md5 hash> <whitespace> <absolute local path>
```

This is the same line format as the output of the Unix md5sum utility. As a convenience, a script to generate such a manifest from a directory of files is included in this repository's `bin` directory.
To create a batch manifest with the included script, run:

```bash
$ ./bin/make_mapfile.sh path/to/asset/dir mapfile.txt
```
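The resulting mapfile contains one line per asset; for example (hashes and paths are illustrative):

```
6b1f3fd6e5dd4cbcd8b3a42707a8dbc7  /data/assets/scpa-0001.tif
3f786850e387550fdab836ed7e6dc881  /data/assets/scpa-0002.tif
```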
A patsy-db manifest file is a CSV file listing one asset per line, in the form:

```
<md5 hash>,<absolute local path>,<relative path>
```

See the "patsy-db" documentation for information about creating the manifest file.
An inventory manifest file is a CSV file listing one asset per line, as generated by the "inventory" command of the "preserve" tool. See the "preserve" documentation (https://github.com/umd-lib/preserve) for more information about creating the manifest file.

Note: The "BATCH" field in the first row of the manifest file will be used as the batch "name", overriding any "name" argument given on the command line.
AWS credentials are required for making deposits. This tool uses the boto3 library to manage authorization using AWS authentication profiles, which are stored in `~/.aws/credentials`. To choose a profile to use with a batch, use the `-p PROFILE` option. If left unspecified, the tool will use the `default` profile. The chosen profile must have write permission for the bucket specified in the `-b BUCKET` option.
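For illustration, a `~/.aws/credentials` file defining the default profile plus a hypothetical `preservation` profile might look like this (key values are placeholders):

```
[default]
aws_access_key_id = <access key id>
aws_secret_access_key = <secret access key>

[preservation]
aws_access_key_id = <access key id>
aws_secret_access_key = <secret access key>
```

A deposit could then select the second profile with `-p preservation`.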
The following arguments, although listed as "optional" above, are needed for a deposit; each uses a default value if not specified:

| option | default |
|---|---|
| `-c`, `--chunk` | `4GB` |
| `-l`, `--logs` | `logs` |
| `-n`, `--name` | `test_batch` |
| `-p`, `--profile` | `default` |
| `-r`, `--root` | `.` |
| `-s`, `--storage` | `DEEP_ARCHIVE` |
| `-t`, `--threads` | `10` |
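For example, a deposit that overrides the default storage class and thread count might look like this (the bucket name is hypothetical, and `GLACIER` is assumed here to be passed through as a standard S3 storage class name):

```bash
$ archiver deposit -b my-archive-bucket -s GLACIER -t 4 -m mapfile.txt
```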
The `batch-deposit` subcommand enables depositing multiple batches specified in a YAML manifest:

```
usage: archiver batch-deposit [-h] -f BATCHES_FILE [-p PROFILE]

options:
  -h, --help            show this help message and exit
  -f BATCHES_FILE, --batches-file BATCHES_FILE
                        YAML file containing the paths to the manifests of
                        individual batches
  -p PROFILE, --profile PROFILE
                        AWS authorization profile
```

The format of the YAML file is:
```yaml
batches_dir: <Fully-qualified filepath to the directory containing the batches>
batches:
  - path: <Relative subdirectory containing the manifest file for the batch>
    bucket: <The AWS bucket to store the assets in>
    asset_root: <The asset root for the batch>
```
For example:

```yaml
batches_dir: /libr/archives/logs/libdc/load1
batches:
  - path: Archive000Football1
    bucket: libdc-archivebucket-17lowbw7m2av1
    asset_root: /libr/archives/footballfilmsexport/FootballFilmMpeg2_07272011/2010-07-12/Mpeg2QCd
  - path: Archive000Football2
    bucket: libdc-archivebucket-17lowbw7m2av1
    asset_root: /libr/archives/footballfilmsexport/FootballFilmMpeg2_07272011/2010-08-20/Maryland_mpg2_master/Maryland_mpg2_Batch1
```
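Assuming the YAML above is saved as `batches.yml` (a hypothetical filename), the batches could then be deposited with:

```bash
$ archiver batch-deposit -f batches.yml -p default
```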
You can restore files from AWS Deep Glacier using the scripts `bin/requestfilesfromdeepglacier.sh` and `bin/copyfromawstolocal.sh`:
1. Install the AWS CLI. One option for installation is `brew install awscli`.

2. Configure your region and credentials following the instructions in the AWS CLI Reference ("General Options" and "Credentials"). You can use `export AWS_PROFILE=some-profile` to declare the credentials to use; the scripts will respect the value of the environment variable.

3. Create a CSV-like input file listing the files to restore, with three columns (bucket name, file location, file server location) and no header row. The scripts will prompt for the name of the input file. Example file contents:

   ```
   libdc-archivebucket-foobarxyz,Archive092/scpa-062057-0018.tif,./restore_directory
   ```

   Important: Due to how the script reads the restore file, the file must end with a blank line or the last entry may not be processed.

   Note: This text file is read by a command-line utility and is not parsed as a CSV. Unlike in a CSV, quoted values are passed directly to the AWS CLI, so do not use quotes around columns!

4. Request the restoration from Deep Glacier to an S3 bucket using `bin/requestfilesfromdeepglacier.sh`.
   - The restoration may take up to 48 hours to complete. Plan ahead.
   - A successful request will trigger an email from `Sent emails for Prod S3 <[email protected]>` with the subject "S3 Prod Archive Bucket Event".
   - A similar email will arrive when the restore is completed.

5. Copy the files from the S3 bucket to the local file system using `bin/copyfromawstolocal.sh`.
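Putting the two script steps together, a restore session might look like the following sketch (both scripts prompt for the input file name, as noted above):

```bash
# Step 4: request restoration of everything listed in the input file
$ ./bin/requestfilesfromdeepglacier.sh

# ...wait for the "restore completed" email (up to 48 hours)...

# Step 5: copy the restored files from the S3 bucket to the local file system
$ ./bin/copyfromawstolocal.sh
```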
Manual tests to verify the application's conformance to actual AWS behavior are specified in `docs/ConformanceTests.md`.