StorageFormat
v7files stores the contents of files separately from their metadata. This allows for de-duplication (the contents of two identical files need to be stored only once).
v7files provides a number of virtual file systems with different features such as folder hierarchies, access control, or version tracking. These file system implementations store the metadata that makes up the files in collections under their own control (applications using v7files can use additional collections as well), but the contents are all stored in the shared collection v7files.content.
To facilitate garbage collection of contents that are no longer used anywhere, backreferences from contents to files are kept in the collection v7files.refs.
A hierarchical file system with support for versioning and access control that can be exposed via WebDAV is maintained in the v7files.files collection.
A simpler file system is kept in the v7files.buckets collection: no folder hierarchy, only manual reference tracking, no version history. See WebAppIntegration.
Every document in the v7files.content collection describes a piece of content, i.e. a sequence of bytes. It is uniquely identified by the SHA-1 digest of that sequence of bytes. This digest is used as the Mongo document _id (as twenty bytes of binary data).
The simplest way to store contents is by just putting them into a byte array in the content document:
{ _id : <sha1> , in: <bytes> }
The field is called in, for "inline".
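As a minimal sketch of the inline scheme, the following Python builds such a document as a plain dict. The function name `inline_content_doc` is hypothetical and not part of v7files; the only things taken from the text above are the `_id` (the raw 20-byte SHA-1 digest) and the `in` field.

```python
import hashlib


def inline_content_doc(data: bytes) -> dict:
    """Build an inline content document (hypothetical helper).

    The _id is the raw 20-byte SHA-1 digest, not the 40-character hex string.
    """
    digest = hashlib.sha1(data).digest()
    return {"_id": digest, "in": data}


doc = inline_content_doc(b"hello world")
print(len(doc["_id"]))  # number of bytes in the digest used as _id
```

Because the `_id` is derived from the bytes themselves, building the document twice for identical data yields the same `_id`, which is what makes de-duplication possible.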
Some data can be more efficiently stored using gzip compression.
{ _id: <sha1>, store: 'gz', zin: <gzipped bytes>, length: 1234 }
The store field denotes the storage scheme, and zin stands for "zipped inline". Note that the SHA-1 is the digest of the uncompressed original data, and the length field also indicates the uncompressed length.
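The compressed scheme can be sketched in the same way. This example assumes that "gzipped bytes" means the standard gzip container format (as produced by Python's `gzip.compress`); the text does not say whether that or a raw deflate stream is used. The helper names `gz_content_doc` and `read_gz` are hypothetical.

```python
import gzip
import hashlib


def gz_content_doc(data: bytes) -> dict:
    """Build a 'gz' content document (hypothetical helper)."""
    return {
        "_id": hashlib.sha1(data).digest(),  # digest of the UNcompressed data
        "store": "gz",
        "zin": gzip.compress(data),          # assumption: gzip container format
        "length": len(data),                 # uncompressed length
    }


def read_gz(doc: dict) -> bytes:
    """Decompress a 'gz' document and verify it against its _id and length."""
    data = gzip.decompress(doc["zin"])
    if len(data) != doc["length"] or hashlib.sha1(data).digest() != doc["_id"]:
        raise ValueError("content does not match its digest or length")
    return data
```

Keying on the digest of the uncompressed data means the same content has the same `_id` regardless of which storage scheme was chosen for it.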
A content document can also pull in the contents from other documents and concatenate them. This allows for efficient differential storage of similar contents, as well as storing contents too large to fit into a single Mongo document.
{ _id: <sha1>,
store: 'cat',
base: [
{ sha: <sha1 of some content>, length: 1234 },
{ sha: <sha1 of some content>, length: 1234 }
]
}
store: 'cat' means that this content is a concatenation of (parts of) other content. The base field is an array with the chunks to be concatenated. Each chunk can be a (presumably short) piece of new inline data (given as a byte array), or a reference to contents stored elsewhere.
If the contents are stored elsewhere, an embedded document that we call a "content pointer" is given. There are a couple of options here, but it must contain at least the sha field that will be used to look up the data, and the length field (necessary to calculate the length of the combined piece).
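To illustrate how the three storage schemes fit together, here is a rough sketch of a reader that resolves a content document to its bytes. It is not the v7files implementation: a plain dict keyed by the raw digest stands in for the v7files.content collection, the helper names are hypothetical, and content pointers are simplified to "take the first `length` bytes of the referenced content".

```python
import gzip
import hashlib


def put_inline(store: dict, data: bytes) -> bytes:
    """Store data inline in the stand-in collection and return its digest."""
    sha = hashlib.sha1(data).digest()
    store[sha] = {"_id": sha, "in": data}
    return sha


def read_content(store: dict, sha: bytes) -> bytes:
    """Resolve the content document with the given digest to its bytes."""
    doc = store[sha]
    scheme = doc.get("store")
    if scheme is None:                # no store field: inline bytes
        return doc["in"]
    if scheme == "gz":                # zipped inline (assumes gzip format)
        return gzip.decompress(doc["zin"])
    if scheme == "cat":               # concatenation of chunks
        parts = []
        for chunk in doc["base"]:
            if isinstance(chunk, bytes):
                parts.append(chunk)   # new inline data
            else:                     # content pointer: follow sha recursively
                referenced = read_content(store, chunk["sha"])
                parts.append(referenced[: chunk["length"]])
        return b"".join(parts)
    raise ValueError(f"unknown storage scheme: {scheme!r}")


store = {}
a = put_inline(store, b"Hello, ")
b = put_inline(store, b"world")
whole = b"Hello, world!"
cat_sha = hashlib.sha1(whole).digest()
store[cat_sha] = {
    "_id": cat_sha,
    "store": "cat",
    "base": [{"sha": a, "length": 7}, {"sha": b, "length": 5}, b"!"],
}
```

The concatenated document stores only one byte of new data (`b"!"`); everything else is shared with the two existing pieces of content, which is the differential-storage benefit described above.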
The v7files.content collection described above stores contents, but you also need some more meta-data to make a complete "file". There are various types of meta-data and many places where it can be kept, but all of them use the following simple schema to interact with v7files: you refer to the contents by their SHA-1 digest, which you store in a MongoDB document (top-level or nested) in a field called sha, either as 20 bytes of binary data or as a 40-character hex-encoded String. Since this is wasteful for very short files, you are encouraged to inline short files into a byte array field called in instead. You probably also want to have filename, contentType and length fields.
Note that the length can differ from the length of the referenced content: if it is smaller, the content is truncated; if it is larger, the content is repeated. There can also be an offset (off) to start the concatenation in the middle of the chunk.
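The length and offset rules can be sketched as a small function. The text does not spell out how off interacts with repetition, so this example makes an explicit assumption: the first `off` bytes are skipped, and the remaining bytes are then truncated or cyclically repeated until `length` bytes have been produced. Both function names are hypothetical.

```python
def normalize_sha(value) -> bytes:
    """Accept the sha as 20 raw bytes or as a 40-character hex string."""
    digest = bytes.fromhex(value) if isinstance(value, str) else bytes(value)
    if len(digest) != 20:
        raise ValueError("a SHA-1 digest must be exactly 20 bytes")
    return digest


def apply_pointer(content: bytes, length: int, off: int = 0) -> bytes:
    """Apply a content pointer's length and off to the referenced bytes.

    Assumption: repetition cycles over the bytes that follow the offset.
    """
    source = content[off:]
    if length == 0:
        return b""
    if not source:
        raise ValueError("cannot repeat empty content")
    repeats = -(-length // len(source))  # ceiling division
    return (source * repeats)[:length]
```

Under this assumption, a pointer whose length is smaller than the referenced content selects a prefix (or, with off, a slice), while a larger length gives a compact way to describe long runs of repeated data.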
Examples:
// You might want to embed this in your Mongo-based application
{ sha : <sha1>,
filename: 'hello.txt',
length: 1993,
contentType: 'text/plain' }
// or this, if the file is short
{ in: <bytes>,
filename: 'short.txt',
contentType: 'text/plain' }
// content pointers are also used internally for chunked documents
// (see above for details)
{ _id: <sha1>, store : 'cat',
base: [
{ sha: <sha1 of some content>, length: 1234 },
{ sha: <sha1 of some content>, length: 1234 }
]
}
// and a file in v7files' WebDAV looks like this
// (see below for details)
{
_id : <ObjectId>,
_version: 3,
parent: <ObjectId>,
acl: {},
filename: 'a.txt',
length: 123,
sha: <sha1>,
contentType: 'text/plain',
createdAt: <Timestamp>,
updatedAt: <Timestamp>
}
The downside of sharing contents between multiple files is that when you delete a file, you cannot just delete the contents. This can only be done once all files that (directly or indirectly) refer to the content have been deleted.
Every file naturally has a reference to its contents, and v7files also keeps track of the reverse links. In the v7files.refs collection there is a document for every piece of content in v7files.content (with the same _id) that has
- refs: an array with the _id of every file that currently uses it.
- refHistory: contains all current and previous entries of refs. When a file is deleted, the contents remain in GridFS, but its backlink is removed from refs. It is kept as a copy in refHistory, however.
- refBase: in addition to files referencing content, content can also be referenced by other content (when using out-of-band "alt" storage, such as concatenation). Those references (the SHA-1 _id of the referring content) are tracked here.
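The bookkeeping described in the list above can be sketched with plain dicts. This is an illustration, not the v7files code: the helper names are hypothetical, and the rule in `is_collectable` (no entries left in refs or refBase) is an inference from the requirement that content may only be deleted once nothing refers to it directly or indirectly.

```python
def new_refs_doc(sha: bytes) -> dict:
    """Create an empty backreference document for one piece of content."""
    return {"_id": sha, "refs": [], "refHistory": [], "refBase": []}


def add_ref(refs_doc: dict, file_id) -> None:
    """Record that a file uses this content."""
    if file_id not in refs_doc["refs"]:
        refs_doc["refs"].append(file_id)
    if file_id not in refs_doc["refHistory"]:
        refs_doc["refHistory"].append(file_id)


def remove_ref(refs_doc: dict, file_id) -> None:
    """Drop the backlink when a file is deleted; refHistory keeps its copy."""
    if file_id in refs_doc["refs"]:
        refs_doc["refs"].remove(file_id)


def is_collectable(refs_doc: dict) -> bool:
    """Assumed rule: content is garbage once no file or content refers to it."""
    return not refs_doc["refs"] and not refs_doc["refBase"]
```

Against a real MongoDB collection, operations like these would presumably map onto the `$addToSet` and `$pull` update operators, so that the small refs document can be modified without touching the content document.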
Conceptually, the references are part of the contents document, but keeping them in a separate collection makes it easier to update them (without having to touch the usually very large content documents).
Every file and folder is represented by a MongoDB document in the v7files.files collection. It has the following fields:
- _id: an id which uniquely identifies the file, even if it moves around in the file system or changes its name. This is a randomly assigned ObjectId, except for the "root" folders, which are identified by Strings chosen by the user and mapped to URL endpoints.
- _version: an integer, starting at 1 and incrementing with every update to the file
- parent: the _id of the parent file
- acl: a nested object containing access control lists
- filename: the name of the file. This becomes a URL component for WebDAV.
- length: the length of the file in bytes. Missing in the case of a folder (or inline storage)
- sha: a byte array with the SHA-1 hash of the file's contents. This is used to link the file to its contents, which are stored in GridFS. Missing in the case of a folder (or inline storage)
- in: extremely small files (smaller than their own SHA-1 hash) can be stored inline (as a byte array). In this case, the sha and length fields will be missing
- contentType: the content type of the file
- created_at: the creation date of the file
- updated_at: the creation date of the current revision of the file, missing for the first version
When a file is modified, the _version field in the v7files.files collection is incremented by one, and the previous revision is moved to a shadow collection that tracks version history, called v7files.files.vermongo. Deleted files are also stored there. The "main" collection only contains the current versions of all files. See Versioning for details.
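The update cycle can be sketched as follows, using a dict for the main collection and a list for the shadow collection. The function names are hypothetical, and the layout of the documents in v7files.files.vermongo is not described on this page, so the sketch simply archives an unmodified copy of the previous revision.

```python
import copy


def update_file(files: dict, shadow: list, file_id, changes: dict) -> dict:
    """Archive the current revision, then store the changed one."""
    current = files[file_id]
    shadow.append(copy.deepcopy(current))  # previous revision goes to history
    updated = {**current, **changes, "_version": current["_version"] + 1}
    files[file_id] = updated               # main collection holds only the latest
    return updated


def delete_file(files: dict, shadow: list, file_id) -> None:
    """Move a deleted file out of the main collection into the history."""
    shadow.append(files.pop(file_id))


files = {"f": {"_id": "f", "_version": 1, "filename": "a.txt"}}
shadow = []
update_file(files, shadow, "f", {"filename": "b.txt"})
```

After the call, `files` contains only version 2 of the file, while `shadow` holds version 1 with its original filename.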