GitHub - superlinear-ai/fastdup: fastdup is a powerful free tool designed to rapidly extract valuable insights from your image and video datasets. It helps you improve the quality of your dataset's images and labels and reduce your data operations costs at scale.

Manage, Clean & Curate Visual Data - Fast and at Scale.

An unsupervised and free tool for image and video dataset analysis.

fastdup is founded by the authors of XGBoost, Apache TVM & Turi Create - Danny Bickson, Carlos Guestrin and Amir Alush.


🚀 Introducing VL Profiler! 🚀 We're excited to announce our new cloud product, VL Profiler. It's designed to help you gain deeper insights and enhance your productivity while using fastdup. With VL Profiler, you can visualize your data, track changes over time, and much more.

👉 Check out VL Profiler here 👈

📝 Note: VL Profiler is a separate commercial product developed by the same team behind fastdup. Our goal with VL Profiler is to provide additional value to our users while continuing to support and maintain fastdup as a free, open-source project. We'd love for you to give VL Profiler a try and share your feedback with us! Sign-up now, it's free.

What's included in fastdup

fastdup handles both labeled and unlabeled image and video datasets, helping you discover potential quality issues and offering additional functionality.

Why fastdup?

With a plethora of data visualization/profiling tools available, what sets fastdup apart? Here are the top benefits of fastdup:

Quality: High-quality analysis to remove duplicates/near-duplicates, anomalies, mislabels, broken images, and poor-quality images.
Scale: Handles 400M images on a single CPU machine. Enterprise version scales to billions of images.
Speed: Highly optimized C++ engine runs efficiently even on low-resource CPU machines.
Privacy: Runs locally or on your cloud infrastructure. Your data stays where it is.
Ease of use: Works on labeled or unlabeled datasets, images, or videos. Get started with just 3 lines of code.

Setting up

Prerequisites

Supported Python versions:

Supported operating systems:

Installation

Option 1 - Install fastdup via PyPI:

# upgrade pip to its latest version
pip install -U pip

# install fastdup
pip install fastdup
    
# Alternatively, use explicit python version (XX)
python3.XX -m pip install fastdup

Option 2 - Install fastdup via an Ubuntu 20.04 Docker image on DockerHub:

docker pull karpadoni/fastdup-ubuntu-20.04

Detailed installation instructions and common errors here.
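
After installing, it can be useful to confirm which fastdup version the active Python interpreter actually sees. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and not part of fastdup.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "fastdup") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("fastdup is not installed in this environment")
    else:
        print(f"fastdup {version} is installed")
```

If this reports that fastdup is missing even though `pip install fastdup` succeeded, pip may have installed into a different interpreter; the explicit `python3.XX -m pip install fastdup` form above avoids that mismatch.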

Getting Started

In short, you'll need 3 lines of code to run fastdup:

import fastdup
fd = fastdup.create(input_dir="IMAGE_FOLDER/")
fd.run()

And 5 lines of code to visualize issues:

fd.vis.duplicates_gallery()    # create a visual gallery of duplicates
fd.vis.outliers_gallery()      # create a visual gallery of anomalies
fd.vis.component_gallery()     # create a visualization of connected components
fd.vis.stats_gallery()         # create a visualization of images statistics (e.g. blur)
fd.vis.similarity_gallery()    # create a gallery of similar images
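
The five gallery calls above can also be driven from a loop, so that one failing gallery does not stop the others from being built. This is a sketch, not part of fastdup: `build_galleries` is a hypothetical helper, and it assumes only that `fd.vis` exposes the five methods listed above and that each can be called without arguments.

```python
from typing import Any, Dict, Optional, Sequence

# Gallery method names, taken from the snippet above.
GALLERIES = (
    "duplicates_gallery",
    "outliers_gallery",
    "component_gallery",
    "stats_gallery",
    "similarity_gallery",
)


def build_galleries(
    fd: Any, names: Sequence[str] = GALLERIES
) -> Dict[str, Optional[Exception]]:
    """Call each gallery method on fd.vis.

    Returns a dict mapping each name to None on success,
    or to the exception that was raised.
    """
    results: Dict[str, Optional[Exception]] = {}
    for name in names:
        try:
            getattr(fd.vis, name)()
            results[name] = None
        except Exception as exc:  # keep going so the remaining galleries are still built
            results[name] = exc
    return results
```

Assuming `fd` is the object created and run in the previous snippet, `build_galleries(fd)` returns one entry per gallery, and any non-`None` value points at the gallery that needs attention.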

View the API docs here.

Learn from Examples

Learn the basics of fastdup through interactive examples. View the notebooks on GitHub or nbviewer. Even better, run them on Google Colab or Kaggle, for free.

	⚡ Quickstart: Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here! 📌 Dataset: Oxford-IIIT Pet.



	🧹 Clean Image Folder: Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start. 📌 Dataset: Food-101.



	🖼 Analyze Image Classification Dataset: Learn how to load a labeled image classification dataset and analyze for potential issues. If you have labeled ImageNet-style folder structure, have a go! 📌 Dataset: Imagenette.



	🎁 Analyze Object Detection Dataset: Learn how to load bounding box annotations for object detection and analyze for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try. 📌 Dataset: COCO.

Load Data From Sources

The notebooks in this section show how to load data from various sources and analyze them with fastdup.

	🤗 Hugging Face Datasets: Load and analyze datasets from Hugging Face Datasets. Perfect if you already have a dataset hosted on Hugging Face hub. 🔗 Learn More.



	🏆 Kaggle: Load and analyze any computer vision datasets from Kaggle. Get ahead of your competition with data insights. 🔗 Learn More.



	🌎 Roboflow Universe: Load and analyze any computer vision datasets from Roboflow Universe. Analyze any of the 200,000 datasets on Roboflow Universe. 🔗 Learn More.



	📦 Labelbox: Load and analyze vision datasets from Labelbox - A data-centric AI platform for building intelligent applications. 🔗 Learn More.



	🔦 Torchvision Datasets: Load and analyze vision datasets from Torchvision Datasets. 🔗 Learn More.



	💦 Tensorflow Datasets: Load and analyze vision datasets from Tensorflow Datasets. 🔗 Learn More.

Enrich Data Using Foundation Models

The notebooks in this section show how to enrich your visual dataset using various foundation models supported in fastdup.

	🎞 Zero-Shot Classification: Enrich your visual data with zero-shot image classification and tagging models such as Recognize Anything Model, Tag2Text, and more. 🔗 Learn More.



	🧭 Zero-Shot Detection: Enrich your visual data with zero-shot image detection models such as Grounding DINO and more. 🔗 Learn More.



	🎯 Zero-Shot Segmentation: Enrich your visual data with zero-shot image segmentation models such as Segment Anything Model and more. 🔗 Learn More.

Extract Features From Dataset

The notebooks in this section show how to run fastdup on your own embeddings in combination with frameworks like ONNX and PyTorch.

	🧠 TIMM Embeddings: Compute dataset embeddings using TIMM (PyTorch Image Models) and run fastdup over them to surface dataset issues. Runs on CPU and GPU.



	🦖 DINOv2 Embeddings: Extract feature vectors of your images using DINOv2 model. Runs on CPU.



	➡️ Use Your Own Feature Vectors: Run fastdup on pre-computed feature vectors and surface data quality issues.
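
To make the idea of working from pre-computed feature vectors concrete, here is a small standard-library sketch that flags near-duplicate pairs by cosine similarity. It illustrates the general technique only; it is not fastdup's implementation or API. The function names, the 0.95 threshold, and the toy vectors are all assumptions chosen for illustration, and a brute-force pairwise comparison like this is quadratic, so it only suits small sets.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def near_duplicate_pairs(
    features: Dict[str, Sequence[float]], threshold: float = 0.95
) -> List[Tuple[str, str, float]]:
    """Return (name_a, name_b, similarity) for every pair at or above the threshold."""
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(features.items(), 2):
        sim = cosine_similarity(vec_a, vec_b)
        if sim >= threshold:
            pairs.append((name_a, name_b, sim))
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real image embeddings.
    features = {
        "cat_1.jpg": [0.9, 0.1, 0.0],
        "cat_1_copy.jpg": [0.9, 0.1, 0.0],
        "dog_1.jpg": [0.0, 0.2, 0.9],
    }
    for a, b, sim in near_duplicate_pairs(features):
        print(f"{a} ~ {b}: {sim:.3f}")
```

In practice the vectors would come from a model such as the TIMM or DINOv2 embeddings mentioned in this section, with one vector per image.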

Exciting New Features

Note: We're happy to announce that these new features are out of beta testing and now available to the public, completely free of charge! We invite you to try them out and share your feedback.

	😗 Face Detection in Videos: Use fastdup with a face detection model to detect faces from videos and analyze the cropped faces for potential issues such as duplicates, near-duplicates, outliers, bright/dark/blurry faces.



	🤖 Object Detection in Videos: Use fastdup with a pre-trained YOLOv5 model to detect and analyze objects for potential issues such as duplicates, near-duplicates, outliers, bright/dark/blurry objects.



	🔢 Optical Character Recognition: Enrich your dataset by detecting multilingual texts with PaddleOCR.



	📑 Image Captioning & Visual Question Answering (VQA): Enrich your dataset by captioning its images with the BLIP, BLIP-2, or ViT-GPT2 model. Alternatively, use VQA models to ask questions about the content of your images with the Vilt-b32 or ViT-Age model.



	🔍 Image Search: Search through large image datasets for duplicates/near-duplicates using a query image. Runs on CPU!

Getting Help

Get help from the fastdup team or community members via the following channels -

Slack.
GitHub issues.
Discussion forum.

Community Contributions

The following are community-contributed blog posts about fastdup -

	Deploying AWS Lambda functions with Docker Container by using Custom Base Image 🖋️ atahan bulus • 🗓 16 September 2023
	Renumics: Cleaning Image Classification Datasets With fastdup and Renumics Spotlight 🖋️ Daniel Klitzke • 🗓 4 September 2023
	Roboflow: How to Reduce Dataset Size Without Losing Accuracy 🖋️ Arty Ariuntuya • 🗓 9 August 2023
	The weighty significance of data cleanliness — or as I like to call it, “cleanliness is next to model-ness” — cannot be overstated. 🖋️ Alexander Lan • 🗓 9 March 2023
	Clean Up Your Digital Life: How I Found 1929 Fully Identical Images, Dark, Bright and Blurry Shots in Minutes, For Free. 🖋️ Dickson Neoh • 🗓 23 February 2023
	fastdup: A Powerful Tool to Manage, Clean & Curate Visual Data at Scale on Your CPU - For Free. 🖋️ Dickson Neoh • 🗓 3 January 2023
	Master Data Integrity to Clean Your Computer Vision Datasets. 🖋️ Paul lusztin • 🗓 19 December 2022

What our users say

License

fastdup is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License.

See LICENSE.

For any queries, reach us at [email protected]

Disclaimer

Usage Tracking

We have added experimental crash report collection using sentry.io. It does not collect user data and logs only the fastdup library's own actions. We do NOT collect folder names, user names, image names, or image content; we collect only aggregate performance statistics such as total number of images, average runtime per image, total free memory, total free disk space, and number of cores. Collecting fastdup crashes helps us improve stability.

The code for the data collection is found here. On macOS, we use Google Crashpad to report crashes.

It is always possible to opt out of the experimental crash report collection via either of the following two options:

Define an environment variable called SENTRY_OPT_OUT.
Call run() with turi_param='run_sentry=0'.
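
Both opt-out routes can be sketched in Python as follows. The variable name `SENTRY_OPT_OUT` and the argument `turi_param='run_sentry=0'` come from the text above. Setting the variable to "1" is an assumption (the text only says it must be defined), as is setting it before fastdup is imported. The sketch also assumes that the `fd.run()` method shown in Getting Started accepts the `turi_param` keyword. `run_without_crash_reports` is a hypothetical helper, and `IMAGE_FOLDER/` is a placeholder.

```python
import os

# Option 1: define SENTRY_OPT_OUT for this process.
# The value "1" is an assumption; the text only requires the variable to be defined.
os.environ["SENTRY_OPT_OUT"] = "1"

# Option 2: the parameter string quoted above, to be passed to run().
SENTRY_OFF_PARAM = "run_sentry=0"


def run_without_crash_reports(input_dir: str):
    """Run fastdup on input_dir with crash reporting turned off via turi_param.

    Assumes fd.run() forwards the turi_param keyword; treat this as a sketch.
    """
    import fastdup  # imported lazily, after the environment variable is set

    fd = fastdup.create(input_dir=input_dir)
    fd.run(turi_param=SENTRY_OFF_PARAM)
    return fd
```

A call would then look like `fd = run_without_crash_reports("IMAGE_FOLDER/")`. Either option alone should be enough according to the text above; the sketch applies both only to show each one.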

About Visual-Layer

About Us • Blog • Documentation

Slack Community • Discussion Forum • LinkedIn • Twitter

