bagit is a Python library and command line utility for working with BagIt style packages.
bagit.py is a single-file Python module that you can drop into your project as needed, or install globally with:
pip install bagit
Python v2.6+ is required.
When you install bagit you should get a command line program called bagit.py which you can use to turn an existing directory into a bag:
bagit.py --contact-name 'John Kunze' /directory/to/bag
You can pass in key/value metadata for the bag using options like --contact-name above, which get persisted to bag-info.txt. For a complete list of bag-info.txt properties you can use as command line arguments, see --help.
Since calculating checksums can take a while when creating a bag, you may want to calculate them in parallel if you are on a multicore machine. You can do that with the --processes option:
bagit.py --processes 4 /directory/to/bag
To specify which checksum algorithm(s) to use when generating the manifest, use the --md5, --sha1, --sha256 and/or --sha512 flags (MD5 is generated by default):
bagit.py --sha1 /path/to/bag
bagit.py --sha256 /path/to/bag
bagit.py --sha512 /path/to/bag
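To make the effect of the algorithm flags more concrete, here is a rough stdlib-only sketch of what a manifest contains: one line per payload file, pairing a checksum with the file's path relative to the bag root. This is not bagit's own code; the manifest_lines helper and the throwaway demo directory are hypothetical, and the whitespace-separated line layout is an assumption based on the BagIt format.

```python
import hashlib
import os
import tempfile


def manifest_lines(bag_dir, algorithm="md5"):
    """Return sorted "<checksum>  <path>" lines for every file under bag_dir/data."""
    lines = []
    data_dir = os.path.join(bag_dir, "data")
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            full_path = os.path.join(root, name)
            digest = hashlib.new(algorithm)
            with open(full_path, "rb") as handle:
                # Read in chunks so large payload files do not fill memory.
                for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                    digest.update(chunk)
            # Paths are written relative to the bag root with forward slashes.
            rel_path = os.path.relpath(full_path, bag_dir).replace(os.sep, "/")
            lines.append("%s  %s" % (digest.hexdigest(), rel_path))
    return sorted(lines)


# Build a throwaway bag-like directory (hypothetical) to demonstrate.
demo_bag = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_bag, "data"))
with open(os.path.join(demo_bag, "data", "hello.txt"), "wb") as handle:
    handle.write(b"hello")

for algorithm in ("sha256", "sha512"):
    print("%s: %s" % (algorithm, manifest_lines(demo_bag, algorithm)))
```

Each algorithm you request yields its own manifest, so asking for several simply means the payload is hashed once per algorithm.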
If you would like to validate a bag you can use the --validate flag.
bagit.py --validate /path/to/bag
If you would like to take a quick look at a bag to see if it seems valid, by examining only its structure and checking its payload-oxum (byte count and number of files) without computing checksums, use the --fast flag:
bagit.py --validate --fast /path/to/bag
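The payload-oxum check is cheap because it only needs file sizes and a file count. As an illustration, a minimal stdlib-only sketch of that calculation might look like the following. The payload_oxum helper and the demo directory are hypothetical, and the "bytes.files" output format is an assumption based on how the BagIt format records Payload-Oxum; this is not bagit's own implementation.

```python
import os
import tempfile


def payload_oxum(bag_dir):
    """Return "<total bytes>.<file count>" for the files under bag_dir/data."""
    total_bytes = 0
    file_count = 0
    for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
            file_count += 1
    return "%d.%d" % (total_bytes, file_count)


# Build a throwaway bag-like directory (hypothetical) with two small files.
demo_bag = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_bag, "data"))
for name, content in (("a.txt", b"12345"), ("b.txt", b"123")):
    with open(os.path.join(demo_bag, "data", name), "wb") as handle:
        handle.write(content)

print(payload_oxum(demo_bag))  # 8 bytes across 2 files gives "8.2"
```

A matching value does not prove the contents are intact, since a file can change without changing size, which is why a full validation still compares checksums.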
And finally, if you'd like to parallelize validation to take advantage of multiple CPUs you can:
bagit.py --validate --processes 4 /path/to/bag
You can also use bagit programmatically in your own Python programs.
To create a bag you would do this:
import bagit
bag = bagit.make_bag('mydir', {'Contact-Name': 'John Kunze'})
make_bag returns a Bag instance. If you already have a bag on disk and would like to create a Bag instance for it, simply call the constructor directly:
bag = bagit.Bag('/path/to/bag')
You can change the metadata persisted to bag-info.txt by using the info property on a Bag.
# load the bag
bag = bagit.Bag('/path/to/bag')
# update bag info metadata
bag.info['Internal-Sender-Description'] = 'Updated on 2014-06-28.'
bag.info['Authors'] = ['John Kunze', 'Andy Boyko']
bag.save()
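The example above stores both a single string and a list under info. As a rough sketch of how such a dictionary could end up as text, the snippet below writes one "Label: value" line per value, repeating the label for each list item. It assumes that repeated labels are how list values are persisted; format_bag_info is a hypothetical helper written for illustration, not a function provided by bagit.

```python
def format_bag_info(info):
    """Render a metadata dict as "Label: value" lines, one line per list item."""
    lines = []
    for label in sorted(info):
        values = info[label]
        # Treat a single value as a one-item list so both cases share a path.
        if not isinstance(values, list):
            values = [values]
        for value in values:
            lines.append("%s: %s" % (label, value))
    return "\n".join(lines) + "\n"


info = {
    "Internal-Sender-Description": "Updated on 2014-06-28.",
    "Authors": ["John Kunze", "Andy Boyko"],
}
print(format_bag_info(info))
```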
By default save will not update manifests. This guards against a situation where a call to save to persist bag metadata accidentally regenerates manifests for an invalid bag. If you have modified the payload of a bag by adding, modifying or deleting files in the data directory, and wish to regenerate the manifests, set the manifests parameter to True when calling save.
import shutil, os
# add a file
shutil.copyfile('newfile', '/path/to/bag/data/newfile')
# remove a file
os.remove('/path/to/bag/data/file')
# persist changes
bag.save(manifests=True)
The save method takes an optional processes parameter which will determine how many processes are used to regenerate the checksums. This can be handy on multicore machines.
If you would like to see if a bag is valid, use its is_valid method:
bag = bagit.Bag('/path/to/bag')
if bag.is_valid():
    print("yay :)")
else:
    print("boo :(")
If you'd like to get a detailed list of validation errors, execute the validate method and catch the BagValidationError exception. If the bag's manifest was invalid (and it wasn't caught by the payload oxum) the exception's details property will contain a list of ManifestErrors that you can inspect. Each ManifestError will be of type ChecksumMismatch, FileMissing or UnexpectedFile.
So for example if you want to print out checksums that failed to validate you can do this:
bag = bagit.Bag("/path/to/bag")
try:
    bag.validate()
except bagit.BagValidationError as e:
    for d in e.details:
        # each detail describes one problem found in the manifest
        if isinstance(d, bagit.ChecksumMismatch):
            print("expected %s to have %s checksum of %s but found %s" %
                  (d.path, d.algorithm, d.expected, d.found))
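The three error types named above correspond to three ways a manifest and a payload can disagree. As a rough stdlib-only illustration of that classification (not bagit's own code; compare_manifest and the sample dictionaries are hypothetical), it could be sketched as:

```python
def compare_manifest(expected, found):
    """Compare {path: checksum} dicts and return a list of (kind, path) tuples."""
    problems = []
    for path in sorted(set(expected) | set(found)):
        if path not in found:
            # Listed in the manifest but absent from the payload.
            problems.append(("missing", path))
        elif path not in expected:
            # Present in the payload but not listed in the manifest.
            problems.append(("unexpected", path))
        elif expected[path] != found[path]:
            # Present in both, but the checksums differ.
            problems.append(("mismatch", path))
    return problems


# Made-up checksums, for illustration only.
expected = {"data/a.txt": "aaa", "data/b.txt": "bbb", "data/c.txt": "ccc"}
found = {"data/a.txt": "aaa", "data/b.txt": "xxx", "data/d.txt": "ddd"}

for kind, path in compare_manifest(expected, found):
    print("%s: %s" % (kind, path))
```

Here "missing", "unexpected" and "mismatch" play the roles of FileMissing, UnexpectedFile and ChecksumMismatch respectively.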
To iterate through a bag's manifest and retrieve checksums for the payload files use the bag's entries dictionary:
bag = bagit.Bag("/path/to/bag")
for path, fixity in bag.entries.items():
    print("path:%s md5:%s" % (path, fixity["md5"]))
To work on bagit itself, check out the source and run the tests:
% git clone git://github.com/LibraryOfCongress/bagit-python.git
% cd bagit-python
% python test.py
If you'd like to see how increasing parallelization of bag creation on your system affects the time to create a bag, try using the included bench utility:
% ./bench.py
Note: By contributing to this project, you agree to license your work under the same terms as those that govern this project's distribution.