aws-archiver

Command-line preservation archiving tool for S3

Purpose

The aws-archiver tool is intended to facilitate deposit of assets to Amazon S3 storage while ensuring end-to-end asset fixity and the creation of auditable deposit records.

Installation

To install the tool for system-wide access, the recommended method is via pip:

$ git clone https://www.github.com/umd-lib/aws-archiver
$ cd aws-archiver && pip install -e .

Usage

To see the list of available subcommands, run:

$ archiver --help

For help with a particular subcommand, run:

$ archiver <SUBCOMMAND> --help

where <SUBCOMMAND> is the name of the subcommand. For example, for the "deposit" subcommand:

$ archiver deposit --help

"deposit" subcommand

Usage: archiver deposit [-h] -b BUCKET [-c CHUNK] [-l LOGS] [-n NAME] [-p PROFILE] [-r ROOT] [-s STORAGE] [-t THREADS] (-m MAPFILE | -a ASSET) [--dry-run]

Deposit a batch of resources to S3

options:
  -h, --help            show this help message and exit
  -b BUCKET, --bucket BUCKET
                        S3 bucket to deposit files into
  -c CHUNK, --chunk CHUNK
                        Chunk size for multipart uploads
  -l LOGS, --logs LOGS  Location to store log files
  -n NAME, --name NAME  Batch identifier or name
  -p PROFILE, --profile PROFILE
                        AWS authorization profile
  -r ROOT, --root ROOT  Root dir of files being archived
  -s STORAGE, --storage STORAGE
                        S3 storage class
  -t THREADS, --threads THREADS
                        Maximum number of concurrent threads
  -m MAPFILE, --mapfile MAPFILE
                        Archive assets in inventory file
  -a ASSET, --asset ASSET
                        Archive a single asset
  --dry-run             Perform a "dry run" without actually contacting AWS.

The "deposit" subcommand is used to deposit either a single asset (using the "-a/--asset" argument) or multiple assets in a single batch (using the "-m/--mapfile" argument).

For historical reasons, a "dep" alias is provided for the "deposit" subcommand.
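
For example, to archive a batch from a manifest file, or a single asset (the bucket name, batch name, and paths below are illustrative):

$ archiver deposit -b my-archive-bucket -n batch_2024_01 -m mapfile.txt
$ archiver deposit -b my-archive-bucket -a /data/assets/image0001.tif

Appending "--dry-run" to either command performs a dry run without contacting AWS.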

Batch manifest file

The "--mapfile" argument uses files in one of three different batch manifest formats:

  • md5sum manifest files
  • patsy-db manifest files
  • inventory manifest files

md5sum manifest files

A text file listing one asset per line, in the form <md5 hash> <whitespace> <absolute local path>. This is the same line format as the output of the Unix md5sum utility. As a convenience, a script to generate such a manifest from a directory of files is included in this repository's bin directory.
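
For example, a mapfile entry might look like the following (the hash and path are illustrative):

3a7bd3e2360a3d29eea436fcfb7e44c8  /libr/archives/assets/scpa-062057-0018.tif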

To create a batch manifest with the included script, do:

$ ./bin/make_mapfile.sh path/to/asset/dir mapfile.txt

patsy-db manifest files

A CSV file listing one asset per line, in the form

<md5 hash>,<absolute local path>,<relative path>

See the "patsy-db" documentation for information about creating the manifest file.

inventory manifest files

A CSV file listing one asset per line, as generated by the "inventory" command of the "preserve" tool.

See the "preserve" documentation (https://github.com/umd-lib/preserve) for more information about creating the manifest file.

Note: The "BATCH" field in the first row of the manifest file will be used as the "name", overriding any "name" argument given on the command-line.

AWS credentials

AWS credentials are required for making deposits. This tool uses the boto3 library to manage authorization using AWS authentication profiles, which are stored in ~/.aws/credentials. To choose a profile to use with a batch, use the -p PROFILE option. If left unspecified, the tool uses the default profile. The chosen profile must have write permission for the bucket specified in the -b BUCKET option.
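
For example, a ~/.aws/credentials file defining the default profile and a hypothetical "archive-prod" profile might look like this (the key values are placeholders):

[default]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

[archive-prod]
aws_access_key_id = AKIAYYYYYYYYYYYYYYYY
aws_secret_access_key = YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY

A deposit using the "archive-prod" profile would then be run as:

$ archiver deposit -b my-archive-bucket -p archive-prod -m mapfile.txt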

Default option values

The following arguments, although listed above as "optional", are required for a deposit; if not specified, they take the default values below:

option             default
-c, --chunk        4GB
-l, --logs         logs
-n, --name         test_batch
-p, --profile      default
-r, --root         .
-s, --storage      DEEP_ARCHIVE
-t, --threads      10
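
For example, the following two commands are equivalent (the bucket and mapfile names are illustrative):

$ archiver deposit -b my-archive-bucket -m mapfile.txt
$ archiver deposit -b my-archive-bucket -m mapfile.txt -c 4GB -l logs -n test_batch -p default -r . -s DEEP_ARCHIVE -t 10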

"batch-deposit" subcommand

usage: archiver batch-deposit [-h] -f BATCHES_FILE [-p PROFILE]

options:
  -h, --help            show this help message and exit
  -f BATCHES_FILE, --batches-file BATCHES_FILE
                        YAML file containing the paths to the manifests of individual batches.
  -p PROFILE, --profile PROFILE
                        AWS authorization profile

Enables depositing multiple batches specified in a YAML manifest. The format of the YAML file is:

batches_dir: <Fully-qualified path to the directory containing the batches>
batches:
    - path: <Relative subdirectory to the manifest file for the batch>
      bucket: <The AWS bucket to store the assets in>
      asset_root: <The asset root for the batch>

For example:

batches_dir: /libr/archives/logs/libdc/load1
batches:
    - path: Archive000Football1
      bucket: libdc-archivebucket-17lowbw7m2av1
      asset_root: /libr/archives/footballfilmsexport/FootballFilmMpeg2_07272011/2010-07-12/Mpeg2QCd
    - path: Archive000Football2
      bucket: libdc-archivebucket-17lowbw7m2av1
      asset_root: /libr/archives/footballfilmsexport/FootballFilmMpeg2_07272011/2010-08-20/Maryland_mpg2_master/Maryland_mpg2_Batch1
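
With a batches file like the one above saved as, say, batches.yml, all of the listed batches can be deposited in one run (the file and profile names are illustrative):

$ archiver batch-deposit -f batches.yml -p archive-prod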

Restoring from AWS Deep Glacier

You can restore files from AWS Deep Glacier using the scripts bin/requestfilesfromdeepglacier.sh and bin/copyfromawstolocal.sh.

  1. Install the AWS CLI. One option for installation is brew install awscli.

  2. Configure your region and credentials following the instructions in the AWS CLI Reference, General Options and Credentials. You can use export AWS_PROFILE=some-profile to declare the credentials to use; the script will respect the value of the environment variable.

  3. Create a CSV-like input file listing the files to restore, with the 3 columns bucketname, filelocation, fileserverlocation and no header row. The scripts will prompt for the name of the input file. Example file contents:

libdc-archivebucket-foobarxyz,Archive092/scpa-062057-0018.tif,./restore_directory

Important: Due to how the scripts read the restore file, it must end with a trailing blank line, or the last entry may not be processed.

Note: This text file is read by a command-line utility and is not parsed as a true CSV. Unlike in a CSV, quoted values are passed verbatim to the AWS CLI, so do not put quotes around column values!
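
Putting these notes together, a correct two-entry input file looks like the following, with an empty final line and no quotes (the values are illustrative):

libdc-archivebucket-foobarxyz,Archive092/scpa-062057-0018.tif,./restore_directory
libdc-archivebucket-foobarxyz,Archive092/scpa-062057-0019.tif,./restore_directory
<empty line>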

  4. Request the restoration from Deep Glacier to an S3 bucket using bin/requestfilesfromdeepglacier.sh.

    • The restoration may take up to 48 hours to complete. Plan ahead.
    • A successful request will trigger an email from "Sent emails for Prod S3" <[email protected]> with the subject "S3 Prod Archive Bucket Event".
    • A similar email will come through when the restore is completed.
  5. Copy the files from the S3 bucket to the local file system using bin/copyfromawstolocal.sh.
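
A typical restore session therefore looks like the following sketch (the profile name is illustrative; each script asks for the input file name interactively, per step 3):

$ export AWS_PROFILE=some-profile
$ ./bin/requestfilesfromdeepglacier.sh
# ... wait for the "restore completed" notification email (up to 48 hours) ...
$ ./bin/copyfromawstolocal.sh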

Development Setup

See docs/DevelopmentSetup.md.

Conformance Tests

Manual tests to verify application conformance to actual AWS behavior are specified in docs/ConformanceTests.md.
