Tristan Wellman ([email protected])
Science Analytics and Synthesis (SAS), Core Science Systems.
This repository contains basic scripts that show how to use BagIt for disk-based storage and network transfer of digital content. BagIt is a hierarchical file packaging format for creating standardised digital containers called "bags", which are used to store and transfer digital content.
BagIt is used as a packaging format to support storage of digital content. BagIt packages can be used to facilitate data sharing with Federal, State, and Local archive centers, thus ensuring preservation of important datasets.
Bags are ideal for digital content normally kept as a collection of files. They are also well suited to exporting, for archival purposes, content normally kept in database structures that receiving parties are unlikely to support. Relying on cross-platform (Windows and Unix) filesystem naming conventions, a bag's payload may include any number of directories and sub-directories (folders and sub-folders).
A bag can specify payload content indirectly via a "fetch.txt" file that lists URLs for content that can be fetched over the network to complete the bag. Simple parallelization (e.g. running 10 instances of Wget) can exploit this feature to transfer large bags very quickly.
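The "fetch.txt" mechanism can be sketched in a few lines of Python. This is an illustration only, not code from this repository: the helper names `parse_fetch_txt` and `fetch_all` are hypothetical, and the sketch assumes each fetch.txt line has the form `URL LENGTH FILEPATH`, where LENGTH may be `-` when the size is unknown. It does not verify lengths or checksums after download.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def parse_fetch_txt(text):
    """Parse fetch.txt content into (url, length, relative_path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first two whitespace runs only; the path may contain spaces.
        url, length, path = line.split(None, 2)
        entries.append((url, None if length == "-" else int(length), path.strip()))
    return entries


def fetch_all(bag_dir, entries, download=urllib.request.urlretrieve, workers=10):
    """Download every entry into the bag, running up to `workers` transfers at once."""
    bag_dir = Path(bag_dir)

    def fetch_one(entry):
        url, _length, rel_path = entry
        target = bag_dir / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        download(url, str(target))
        return target

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_one, entries))
```

The `download` parameter is an assumption made so that the transfer tool can be swapped out (or replaced by a stub when testing); the default of 10 workers mirrors the "10 instances of Wget" example.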
- Wide adoption in digital libraries (e.g. the Library of Congress).
- Easy to implement using ubiquitous and ordinary filesystem tools.
- Content that originates as files need only be copied to the payload directory.
- Compared to XML wrapping, content need not be encoded (e.g. Base64) which saves time and storage space.
- Received content is ready-to-go in a familiar filesystem tree.
- Easy to implement fast network transfer by running ordinary transfer tools in parallel.
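To illustrate how little is needed to build a bag with ordinary filesystem operations, here is a minimal sketch using only the Python standard library. It is not the code used in the notebooks: the function name `make_minimal_bag` is hypothetical, and the sketch writes only the bag declaration ("bagit.txt"), one SHA-256 payload manifest, and the "data" payload directory, leaving out optional tag files.

```python
import hashlib
import shutil
from pathlib import Path


def make_minimal_bag(source_dir, bag_dir):
    """Copy source_dir into bag_dir/data and write bagit.txt plus a SHA-256 manifest."""
    source_dir, bag_dir = Path(source_dir), Path(bag_dir)
    payload = bag_dir / "data"
    shutil.copytree(source_dir, payload)  # payload files are copied unchanged

    # Bag declaration file.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )

    # One manifest line per payload file: "<checksum> <path relative to bag root>".
    lines = []
    for path in sorted(p for p in payload.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest} {path.relative_to(bag_dir).as_posix()}")
    (bag_dir / "manifest-sha256.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return bag_dir
```

The sketch reads each file fully into memory to hash it, which is fine for small examples; very large payload files would call for chunked hashing.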
Data managers and scientists
Exploratory Python scripts, still in progress, have been created to retrieve datasets, archive repository content, validate archives, and compress them. These scripts present functional case examples of using BagIt as an archive technology.
File: /source/BagIt-Sciencebase_example.ipnb
Operations:
(a) Constructs BagIt data archive for preserving one ScienceBase item (data files, item *.json),
(b) Selects appropriate files in ScienceBase item to be stored in archive using search criteria,
(c) Employs stream request to download relevant files into archive folder,
(d) Adds BagIt archive metadata (task name, processing uuid, provider, contact, etc.),
(e) Validates BagIt archive structure and manifest information,
(f) Compresses archive folder in *.tar format for improved transfer capabilities,
(g) Shows archive structure with archive metadata
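Steps (e) and (f) can be approximated with the Python standard library alone. The sketch below is not the validation routine used in the notebook: the function names `validate_manifest` and `tar_bag` are hypothetical, and the check covers only the files listed in the payload manifest (it does not inspect "bagit.txt", tag manifests, or unlisted files).

```python
import hashlib
import tarfile
from pathlib import Path


def validate_manifest(bag_dir, algorithm="sha256"):
    """Return the listed payload paths that are missing or whose checksum differs."""
    bag_dir = Path(bag_dir)
    problems = []
    manifest = bag_dir / f"manifest-{algorithm}.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(None, 1)
        rel_path = rel_path.strip()
        target = bag_dir / rel_path
        if not target.is_file():
            problems.append(rel_path)
        elif hashlib.new(algorithm, target.read_bytes()).hexdigest() != expected.lower():
            problems.append(rel_path)
    return problems


def tar_bag(bag_dir, tar_path):
    """Pack the bag directory into a single *.tar file."""
    bag_dir = Path(bag_dir)
    # Mode "w" writes a plain tar; "w:gz" would add gzip compression.
    with tarfile.open(tar_path, "w") as archive:
        archive.add(bag_dir, arcname=bag_dir.name)
    return Path(tar_path)
```

An empty list from `validate_manifest` means every listed file is present and matches its recorded checksum, so the bag can then be packed with `tar_bag` for transfer.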
File: /source/BagIt_ScienceBase_process_OBIS.ipnb
Operations:
(a) Constructs BagIt data archives for ScienceBase items in OBIS-USA parent collection (data files, item *.json),
(b) Selects appropriate files in ScienceBase item to be stored in archive using search criteria,
(c) Employs stream request to download relevant files into archive folder,
(d) Adds BagIt archive metadata (task name, processing uuid, provider, contact, etc.),
(e) Validates BagIt archive structure and manifest information,
(f) Compresses archive folder in *.tar format for improved transfer capabilities.
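Step (c), the streamed download, matters most when a collection contains large files. As a rough illustration, the sketch below copies a response to disk in fixed-size chunks using only the standard library. The function name `stream_to_file` is hypothetical, and the notebook may well use a different HTTP client; this shows the general technique, not the notebook's implementation.

```python
import shutil
import urllib.request
from pathlib import Path


def stream_to_file(url, target, chunk_size=1024 * 1024):
    """Copy the response body to `target` in chunks instead of loading it all into memory."""
    target = Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return target
```

Calling it once per selected file, with `target` inside the archive's payload folder, keeps memory use bounded by `chunk_size` regardless of file size.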
File: /source/BagIt_archive_metadata.ipnb
Examples that retrieve and customize archive metadata
Operations:
(a) Retrieves default archive metadata,
(b) Retrieves default search criteria to select data files from ScienceBase items, and
(c) Customizes defaults via kwargs (optional)
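The defaults-plus-kwargs pattern in (a) and (c) can be sketched as follows. Everything here is illustrative: the names `DEFAULT_METADATA`, `archive_metadata`, and `format_bag_info` are hypothetical, the default values are placeholders, and the notebook's actual defaults and function signatures may differ.

```python
DEFAULT_METADATA = {
    "Source-Organization": "Example Organization",  # placeholder value
    "Contact-Name": "Example Contact",              # placeholder value
}


def archive_metadata(**kwargs):
    """Return a copy of the defaults with any keyword overrides applied."""
    metadata = dict(DEFAULT_METADATA)
    # Keyword names cannot contain hyphens, so underscores are mapped to hyphens.
    for key, value in kwargs.items():
        metadata[key.replace("_", "-")] = value
    return metadata


def format_bag_info(metadata):
    """Render metadata as 'Label: value' lines, the layout used by bag-info.txt."""
    return "".join(f"{label}: {value}\n" for label, value in metadata.items())
```

For example, `archive_metadata(Contact_Name="Jane Doe")` keeps the default organization, replaces the contact, and leaves `DEFAULT_METADATA` itself unchanged.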
/bags
Content:
(a) BagIt example archives of OBIS-USA ScienceBase items
This USGS product is considered to be in the U.S. public domain, and is licensed under CC0 1.0.
Although this software program has been used by the U.S. Geological Survey (USGS), no warranty, expressed or implied, is made by the USGS or the U.S. Government as to the accuracy and functioning of the program and related program material nor shall the fact of distribution constitute any such warranty, and no responsibility is assumed by the USGS in connection therewith.