Skip to content

Latest commit

 

History

History
169 lines (127 loc) · 5.3 KB

README.md

File metadata and controls

169 lines (127 loc) · 5.3 KB
DaBaDee

DaBaDee is a simple deduplication tool/storage for files. It uses SHA256* to hash the files and store them in the storage, replacing the original path with a hardlink to the storage location.

* SHA256 is the default hashing algorithm, but it can be replaced with any other hashing algorithm that implements the `hash.Generator` interface. The highwayhash algorithm is also provided as an example.

Usage

CLI

Show help

dabadee --help

Usage:
  dabadee [command]

Available Commands:
  completion     Generate the autocompletion script for the specified shell
  cp             Copy a file and deduplicate it in storage
  dedup          Deduplicate files in a directory
  find-links     Find all hard links to the specified file
  help           Help about any command
  remove-orphans Remove all orphaned files from the storage
  rm             Remove a file and its link from storage

Flags:
  -h, --help   help for dabadee

Use "dabadee [command] --help" for more information about a command.

Deduplicate a folder

dabadee dedup /path/to/folder --storage /path/to/storage --workers 2

Where the /path/to/folder is the folder to deduplicate and /path/to/storage is the location you want to store the deduplicated files. The 2 is the number of workers to use to speed up the process.

Do not delete the resulting storage folder, as it contains the original files.

Deduplicate a folder and obtain the pairings of origins and hashes in storage

dabadee dedup /path/to/folder --storage /path/to/storage --workers 2 --manifest-output /path/to/manifest.json

This will create a JSON file with the pairings of the original files and their hashes in the storage, like this:

{
    "/path/to/folder/file1": "1234...",
    "/path/to/folder/file2": "1234...",
    ...
}

the path must point to a file that does not exist, not a folder. The hash is the same as the one used in the storage, so the same algorithm will be used.

Deduplicate on copy

dabadee cp /path/to/file /path/to/dest/file --storage /path/to/storage

This will copy the file to the destination and deduplicate it in the storage if not already present.

Deduplicate a folder on copy

dabadee cp --append /path/to/folder /path/to/dest --storage /path/to/storage --workers 2

This will copy the folder to the destination and deduplicate it in the storage if not already present. The --append flag indicates to copy the folder contents inside the destination folder.

Keep metadata

dabadee dedup /path/to/folder --storage /path/to/storage --workers 2 --with-metadata
dabadee cp /path/to/file /path/to/dest/file --storage /path/to/storage --with-metadata

This will keep the original file metadata (uid, gid, permissions) when copying the file to the storage.

Global Storage vs Scoped Storage

When using the CLI, the storage can be defined globally or scoped. The scoped storage is defined with the --storage flag, this allows you to define a new storage each time you run a command. The global storage is assumed to be in the user's home directory, in the .dabadee/Storage folder if Dabadee is run as user, or in the /opt/dabadee/Storage folder if Dabadee is run as root.

# This will use the global storage:
dabadee dedup /path/to/folder --workers 2

# This will use the scoped storage:
dabadee dedup /path/to/folder --workers 2 --storage /path/to/storage

This behavior is currently supported by the cp and dedup commands only.

Library

package main

import (
    "github.com/mirkobrombin/dabadee/pkg/dabadee"
    "github.com/mirkobrombin/dabadee/pkg/hash"
    "github.com/mirkobrombin/dabadee/pkg/processor"
    "github.com/mirkobrombin/dabadee/pkg/storage"
)

func main() {
    s := storage.NewStorage(storage.StorageOptions{
        Root: "/path/to/storage",
        WithMetadata: true,
    })
    h := hash.NewSHA256Generator()
    p := processor.NewDedupProcessor("/path/to/folder", "/path/to/dest", s, h, 2)

    d := dabadee.NewDaBaDee(p)
    err := d.Run()
    if err != nil {
        panic(err)
    }
}

If the destination folder does not exist, it will be created, if it is not defined (empty string), no files will be copied, and only the original files will be deduplicated.

Please note that DaBaDee is not atomic, so if the process is interrupted, the storage may be left in an inconsistent state. It is recommended to use a transictional folder instead of working with the original files directly, and then swap the folders when the process is done.

What's with the name?

The name comes from the song "Blue (Da Ba Dee)" by Eiffel 65. There is no connection between the song and the project, it's just the song I was listening to when I started the project. So... I'm blue, da ba dee da ba daa.

What is left to do?

  • Add tests
  • Provide better logs and ask user input when needed
  • Add a way to remove files from the storage reflecting the changes in the original files
  • Provide an option to respect metadata (uid, gid, permissions)
  • Split cmd and lib in two different packages
  • Add a way deduplicate files and folders on copy
  • Add a global storage configuration if the storage is not provided