AudioCraft v1.0.0 release with training code, AudioGen, MultiBandDiffusion etc.
facebookresearch · Aug 2, 2023 · bf70522
1 parent 539ec04
commit bf70522
Showing 204 changed files with 15,938 additions and 936 deletions.
diff --git a/.github/workflows/audiocraft_docs.yml b/.github/workflows/audiocraft_docs.yml
@@ -23,9 +23,9 @@ jobs:
       - name: Make docs
         run: |
           . env/bin/activate
-          make docs
-          git add -f docs
-          git commit -m docs
+          make api_docs
+          git add -f api_docs
+          git commit -m api_docs
 
       - name: Push branch
         run: |

diff --git a/.github/workflows/audiocraft_tests.yml b/.github/workflows/audiocraft_tests.yml
@@ -12,6 +12,11 @@ jobs:
     steps:
       - uses: actions/checkout@v2
       - uses: ./.github/actions/audiocraft_build
-      - run: |
+      - name: Run unit tests
+        run: |
           . env/bin/activate
           make tests
+      - name: Run integration tests
+        run: |
+          . env/bin/activate
+          make tests_integ
diff --git a/.gitignore b/.gitignore
@@ -35,7 +35,7 @@ wheels/
 .coverage
 
 # docs
-/docs
+/api_docs
 
 # dotenv
 .env
@@ -46,6 +46,13 @@ wheels/
 venv/
 ENV/
 
+# egs with manifest files
+egs/*
+!egs/example
+# local datasets
+dataset/*
+!dataset/example
+
 # personal notebooks & scripts
 */local_scripts
 */notes
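
The new `.gitignore` entries pair a wildcard with a `!` negation, so everything directly under `egs/` and `dataset/` is ignored except the `example` entries. The sketch below only illustrates that last-match-wins rule for direct children. It is a simplified model, not git's actual matcher, and `is_ignored` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase

# Patterns copied from the .gitignore hunk above, in file order.
PATTERNS = ["egs/*", "!egs/example", "dataset/*", "!dataset/example"]


def is_ignored(path: str, patterns=PATTERNS) -> bool:
    """Simplified last-match-wins check for a direct child such as 'egs/foo'.

    This does NOT reproduce git's full matching algorithm. It only models the
    rule that a later '!' pattern re-includes a path matched by an earlier one.
    """
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if fnmatchcase(path, glob):
            # The most recent matching pattern decides the outcome.
            ignored = not negated
    return ignored


for candidate in ["egs/my_manifests", "egs/example", "dataset/private", "dataset/example"]:
    print(candidate, "ignored" if is_ignored(candidate) else "tracked")
```

For authoritative answers on a real checkout, `git check-ignore -v <path>` reports which pattern applies.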

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,11 @@ All notable changes to this project will be documented in this file.
 
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
+## [1.0.0] - 2023-08-02
+
+Major revision, added training code for EnCodec, AudioGen, MusicGen, and MultiBandDiffusion.
+Added pretrained model for AudioGen and MultiBandDiffusion.
+
 ## [0.0.2] - 2023-08-01
 
 Improved demo, fixed top p (thanks @jnordberg).

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -1,11 +1,11 @@
-# Contributing to Audiocraft
+# Contributing to AudioCraft
 
 We want to make contributing to this project as easy and transparent as
 possible.
 
 ## Pull Requests
 
-Audiocraft is the implementation of a research paper.
+AudioCraft is the implementation of a research paper.
 Therefore, we do not plan on accepting many pull requests for new features.
 We certainly welcome them for bug fixes.
 

diff --git a/MANIFEST.in b/MANIFEST.in
@@ -6,3 +6,4 @@ include *.ini
 include requirements.txt
 include audiocraft/py.typed
 include assets/*.mp3
+recursive-include conf *.yaml
diff --git a/Makefile b/Makefile
@@ -1,3 +1,15 @@
+INTEG=AUDIOCRAFT_DORA_DIR="/tmp/magma_$(USER)" python3 -m dora -v run --clear device=cpu dataset.num_workers=0 optim.epochs=1 \
+	dataset.train.num_samples=10 dataset.valid.num_samples=10 \
+	dataset.evaluate.num_samples=10 dataset.generate.num_samples=2 sample_rate=16000 \
+	logging.level=DEBUG
+INTEG_COMPRESSION = $(INTEG) solver=compression/debug rvq.n_q=2 rvq.bins=48 checkpoint.save_last=true   # SIG is 616d7b3c
+INTEG_MUSICGEN = $(INTEG) solver=musicgen/debug dset=audio/example compression_model_checkpoint=//sig/5091833e \
+	transformer_lm.n_q=2 transformer_lm.card=48 transformer_lm.dim=16 checkpoint.save_last=false  # Using compression model from 616d7b3c
+INTEG_AUDIOGEN = $(INTEG) solver=audiogen/debug dset=audio/example compression_model_checkpoint=//sig/5091833e \
+	transformer_lm.n_q=2 transformer_lm.card=48 transformer_lm.dim=16 checkpoint.save_last=false  # Using compression model from 616d7b3c
+INTEG_MBD = $(INTEG) solver=diffusion/debug dset=audio/example  \
+	checkpoint.save_last=false  # Using compression model from 616d7b3c
+
 default: linter tests
 
 install:
@@ -10,12 +22,19 @@ linter:
 
 tests:
 	coverage run -m pytest tests
-	coverage report --include 'audiocraft/*'
+	coverage report
+
+tests_integ:
+	$(INTEG_COMPRESSION)
+	$(INTEG_MBD)
+	$(INTEG_MUSICGEN)
+	$(INTEG_AUDIOGEN)
+
 
-docs:
-	pdoc3 --html -o docs -f audiocraft
+api_docs:
+	pdoc3 --html -o api_docs -f audiocraft
 
 dist:
 	python setup.py sdist
 
-.PHONY: linter tests docs dist
+.PHONY: linter tests api_docs dist
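
The `INTEG*` Makefile variables above compose one shared set of overrides with per-solver overrides. As a rough sketch of how two of them expand, the Python below rebuilds the argument list for the compression and diffusion runs. The override strings are copied from the Makefile; `integ_command`, `COMMON` and `SOLVERS` are hypothetical names for illustration, and nothing here actually invokes Dora.

```python
import os
import shlex

# Overrides shared by every integration run (the INTEG variable in the Makefile).
COMMON = [
    "device=cpu", "dataset.num_workers=0", "optim.epochs=1",
    "dataset.train.num_samples=10", "dataset.valid.num_samples=10",
    "dataset.evaluate.num_samples=10", "dataset.generate.num_samples=2",
    "sample_rate=16000", "logging.level=DEBUG",
]

# Per-solver overrides (INTEG_COMPRESSION and INTEG_MBD in the Makefile).
SOLVERS = {
    "compression": ["solver=compression/debug", "rvq.n_q=2", "rvq.bins=48",
                    "checkpoint.save_last=true"],
    "mbd": ["solver=diffusion/debug", "dset=audio/example",
            "checkpoint.save_last=false"],
}


def integ_command(name, user=None):
    """Return (extra environment, argv) for one integration run."""
    # Make expands $(USER) to an empty string when the variable is unset.
    if user is None:
        user = os.environ.get("USER", "")
    env = {"AUDIOCRAFT_DORA_DIR": f"/tmp/magma_{user}"}
    argv = ["python3", "-m", "dora", "-v", "run", "--clear", *COMMON, *SOLVERS[name]]
    return env, argv


env, argv = integ_command("compression")
print(env, shlex.join(argv))
```

Keeping the shared overrides in one place mirrors the Makefile design: each solver only states what differs from the common debug configuration.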
diff --git a/README.md b/README.md
@@ -1,30 +1,14 @@
-# Audiocraft
+# AudioCraft
 ![docs badge](https://github.com/facebookresearch/audiocraft/workflows/audiocraft_docs/badge.svg)
 ![linter badge](https://github.com/facebookresearch/audiocraft/workflows/audiocraft_linter/badge.svg)
 ![tests badge](https://github.com/facebookresearch/audiocraft/workflows/audiocraft_tests/badge.svg)
 
-Audiocraft is a PyTorch library for deep learning research on audio generation. At the moment, it contains the code for MusicGen, a state-of-the-art controllable text-to-music model.
+AudioCraft is a PyTorch library for deep learning research on audio generation. AudioCraft contains inference and training code
+for two state-of-the-art AI generative models producing high-quality audio: AudioGen and MusicGen.
 
-## MusicGen
-
-Audiocraft provides the code and models for MusicGen, [a simple and controllable model for music generation][arxiv]. MusicGen is a single stage auto-regressive
-Transformer model trained over a 32kHz <a href="https://github.com/facebookresearch/encodec">EnCodec tokenizer</a> with 4 codebooks sampled at 50 Hz. Unlike existing methods like [MusicLM](https://arxiv.org/abs/2301.11325), MusicGen doesn't require a self-supervised semantic representation, and it generates
-all 4 codebooks in one pass. By introducing a small delay between the codebooks, we show we can predict
-them in parallel, thus having only 50 auto-regressive steps per second of audio.
-Check out our [sample page][musicgen_samples] or test the available demo!
-
-<a target="_blank" href="https://colab.research.google.com/drive/1-Xe9NCdIs2sCUbiSmwHXozK6AAhMm7_i?usp=sharing">
-  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
-</a>
-<a target="_blank" href="https://huggingface.co/spaces/facebook/MusicGen">
-  <img src="https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-sm.svg" alt="Open in HugginFace"/>
-</a>
-<br>
-
-We use 20K hours of licensed music to train MusicGen. Specifically, we rely on an internal dataset of 10K high-quality music tracks, and on the ShutterStock and Pond5 music data.
 
 ## Installation
-Audiocraft requires Python 3.9, PyTorch 2.0.0, and a GPU with at least 16 GB of memory (for the medium-sized model). To install Audiocraft, you can run the following:
+AudioCraft requires Python 3.9, PyTorch 2.0.0. To install AudioCraft, you can run the following:
 
 ```shell
 # Best to make sure you have torch installed first, in particular before installing xformers.
@@ -33,143 +17,66 @@ pip install 'torch>=2.0'
 # Then proceed to one of the following
 pip install -U audiocraft  # stable release
 pip install -U git+https://git@github.com/facebookresearch/audiocraft#egg=audiocraft  # bleeding edge
-pip install -e .  # or if you cloned the repo locally
-```
-
-## Usage
-We offer a number of way to interact with MusicGen:
-1. A demo is also available on the [`facebook/MusicGen`  HuggingFace Space](https://huggingface.co/spaces/facebook/MusicGen) (huge thanks to all the HF team for their support).
-2. You can run the Gradio demo in Colab: [colab notebook](https://colab.research.google.com/drive/1-Xe9NCdIs2sCUbiSmwHXozK6AAhMm7_i?usp=sharing).
-3. You can use the gradio demo locally by running `python app.py`.
-4. You can play with MusicGen by running the jupyter notebook at [`demo.ipynb`](./demo.ipynb) locally (if you have a GPU).
-5. Checkout [@camenduru Colab page](https://github.com/camenduru/MusicGen-colab) which is regularly
-  updated with contributions from @camenduru and the community.
-6. Finally, MusicGen is available in 🤗 Transformers from v4.31.0 onwards, see section [🤗 Transformers Usage](#-transformers-usage) below.
-
-## API
-
-We provide a simple API and 4 pre-trained models. The pre trained models are:
-- `small`: 300M model, text to music only - [🤗 Hub](https://huggingface.co/facebook/musicgen-small)
-- `medium`: 1.5B model, text to music only - [🤗 Hub](https://huggingface.co/facebook/musicgen-medium)
-- `melody`: 1.5B model, text to music and text+melody to music - [🤗 Hub](https://huggingface.co/facebook/musicgen-melody)
-- `large`: 3.3B model, text to music only - [🤗 Hub](https://huggingface.co/facebook/musicgen-large)
-
-We observe the best trade-off between quality and compute with the `medium` or `melody` model.
-In order to use MusicGen locally **you must have a GPU**. We recommend 16GB of memory, but smaller
-GPUs will be able to generate short sequences, or longer sequences with the `small` model.
-
-**Note**: Please make sure to have [ffmpeg](https://ffmpeg.org/download.html) installed when using newer version of `torchaudio`.
-You can install it with:
-```
-apt-get install ffmpeg
-```
-
-See after a quick example for using the API.
-
-```python
-import torchaudio
-from audiocraft.models import MusicGen
-from audiocraft.data.audio import audio_write
-
-model = MusicGen.get_pretrained('melody')
-model.set_generation_params(duration=8)  # generate 8 seconds.
-wav = model.generate_unconditional(4)    # generates 4 unconditional audio samples
-descriptions = ['happy rock', 'energetic EDM', 'sad jazz']
-wav = model.generate(descriptions)  # generates 3 samples.
-
-melody, sr = torchaudio.load('./assets/bach.mp3')
-# generates using the melody from the given audio and the provided descriptions.
-wav = model.generate_with_chroma(descriptions, melody[None].expand(3, -1, -1), sr)
-
-for idx, one_wav in enumerate(wav):
-    # Will save under {idx}.wav, with loudness normalization at -14 db LUFS.
-    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
-```
-
-## 🤗 Transformers Usage
-
-MusicGen is available in the 🤗 Transformers library from version 4.31.0 onwards, requiring minimal dependencies 
-and additional packages. Steps to get started:
-
-1. First install the 🤗 [Transformers library](https://github.com/huggingface/transformers) from main:
-
-```
-pip install git+https://github.com/huggingface/transformers.git
+pip install -e .  # or if you cloned the repo locally (mandatory if you want to train).
 ```
 
-2. Run the following Python code to generate text-conditional audio samples:
-
-```py
-from transformers import AutoProcessor, MusicgenForConditionalGeneration
-
-
-processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
-model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
-
-inputs = processor(
-    text=["80s pop track with bassy drums and synth", "90s rock song with loud guitars and heavy drums"],
-    padding=True,
-    return_tensors="pt",
-)
-
-audio_values = model.generate(**inputs, max_new_tokens=256)
+We also recommend having `ffmpeg` installed, either through your system or Anaconda:
+```bash
+sudo apt-get install ffmpeg
+# Or if you are using Anaconda or Miniconda
+conda install 'ffmpeg<5' -c  conda-forge
 ```
 
-3. Listen to the audio samples either in an ipynb notebook:
+## Models
 
-```py
-from IPython.display import Audio
+At the moment, AudioCraft contains the training code and inference code for:
+* [MusicGen](./docs/MUSICGEN.md): A state-of-the-art controllable text-to-music model.
+* [AudioGen](./docs/AUDIOGEN.md): A state-of-the-art text-to-sound model.
+*  [EnCodec](./docs/ENCODEC.md), a state-of-the-art high fidelity neural audio codec.
+* [Multi Band Diffusion](./docs/MBD.md): EnCodec compatible decoder using diffusion.
 
-sampling_rate = model.config.audio_encoder.sampling_rate
-Audio(audio_values[0].numpy(), rate=sampling_rate)
-```
+## Training code
 
-Or save them as a `.wav` file using a third-party library, e.g. `scipy`:
+AudioCraft contains PyTorch components for deep learning research in audio and training pipelines for the developed models.
+For a general introduction of AudioCraft design principles and instructions to develop your own training pipeline, refer to
+the [AudioCraft training documentation](./docs/TRAINING.md).
 
-```py
-import scipy
+For reproducing existing work and using the developed training pipelines, refer to the instructions for each specific model
+that provides pointers to configuration, example grids and model/task-specific information and FAQ.
 
-sampling_rate = model.config.audio_encoder.sampling_rate
-scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())
-```
 
-For more details on using the MusicGen model for inference using the 🤗 Transformers library, refer to the 
-[MusicGen docs](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen) or the hands-on 
-[Google Colab](https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/MusicGen.ipynb).
+## API documentation
 
-## Model Card
+We provide some [API documentation](https://facebookresearch.github.io/audiocraft/api_docs/audiocraft/index.html) for AudioCraft.
 
-See [the model card page](./MODEL_CARD.md).
 
 ## FAQ
 
-#### Will the training code be released?
-
-Yes. We will soon release the training code for MusicGen and EnCodec.
+#### Is the training code available?
 
+Yes! We provide the training code for [EnCodec](./docs/ENCODEC.md), [MusicGen](./docs/MUSICGEN.md) and [Multi Band Diffusion](./docs/MBD.md).
 
-#### I need help on Windows
+#### Where are the models stored?
 
-@FurkanGozukara made a complete tutorial for [Audiocraft/MusicGen on Windows](https://youtu.be/v-YpvPkhdO4)
+Hugging Face stored the model in a specific location, which can be overriden by setting the `AUDIOCRAFT_CACHE_DIR` environment variable.
 
-#### I need help for running the demo on Colab
 
-Check [@camenduru tutorial on Youtube](https://www.youtube.com/watch?v=EGfxuTy9Eeo).
+## License
+* The code in this repository is released under the MIT license as found in the [LICENSE file](LICENSE).
+* The models weights in this repository are released under the CC-BY-NC 4.0 license as found in the [LICENSE_weights file](LICENSE_weights).
 
 
 ## Citation
+
+For the general framework of AudioCraft, please cite the following.
 ```
 @article{copet2023simple,
-      title={Simple and Controllable Music Generation},
-      author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
-      year={2023},
-      journal={arXiv preprint arXiv:2306.05284},
+    title={Simple and Controllable Music Generation},
+    author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
+    year={2023},
+    journal={arXiv preprint arXiv:2306.05284},
 }
 ```
 
-## License
-* The code in this repository is released under the MIT license as found in the [LICENSE file](LICENSE).
-* The weights in this repository are released under the CC-BY-NC 4.0 license as found in the [LICENSE_weights file](LICENSE_weights).
-
-[arxiv]: https://arxiv.org/abs/2306.05284
-[musicgen_samples]: https://ai.honu.io/papers/musicgen/
+When referring to a specific model, please cite as mentioned in the model specific README, e.g
+[./docs/MUSICGEN.md](./docs/MUSICGEN.md), [./docs/AUDIOGEN.md](./docs/AUDIOGEN.md), etc.
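
The FAQ entry added to `README.md` above says the model storage location can be overridden with the `AUDIOCRAFT_CACHE_DIR` environment variable. A minimal sketch of setting it from Python follows, assuming the variable is read when a pretrained model is loaded, so it has to be set beforehand. `use_cache_dir` is a hypothetical helper, not part of AudioCraft.

```python
import os
from pathlib import Path


def use_cache_dir(path: str) -> Path:
    """Create `path` if needed and export it as AUDIOCRAFT_CACHE_DIR."""
    cache_dir = Path(path).expanduser().resolve()
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: AudioCraft consults this variable at model-loading time.
    os.environ["AUDIOCRAFT_CACHE_DIR"] = str(cache_dir)
    return cache_dir
```

Call it once at the top of a script, before any pretrained model is requested. Exporting the variable in the shell has the same effect.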
diff --git a/assets/a_duck_quacking_as_birds_chirp_and_a_pigeon_cooing.mp3 b/assets/a_duck_quacking_as_birds_chirp_and_a_pigeon_cooing.mp3
diff --git a/assets/sirens_and_a_humming_engine_approach_and_pass.mp3 b/assets/sirens_and_a_humming_engine_approach_and_pass.mp3
diff --git a/audiocraft/__init__.py b/audiocraft/__init__.py
@@ -3,8 +3,24 @@
 #
 # This source code is licensed under the license found in the
 # LICENSE file in the root directory of this source tree.
+"""
+AudioCraft is a general framework for training audio generative models.
+At the moment we provide the training code for:
+
+- [MusicGen](https://arxiv.org/abs/2306.05284), a state-of-the-art
+    text-to-music and melody+text autoregressive generative model.
+    For the solver, see `audiocraft.solvers.musicgen.MusicGenSolver`, and for the model,
+    `audiocraft.models.musicgen.MusicGen`.
+- [AudioGen](https://arxiv.org/abs/2209.15352), a state-of-the-art
+    text-to-general-audio generative model.
+- [EnCodec](https://arxiv.org/abs/2210.13438), efficient and high fidelity
+    neural audio codec which provides an excellent tokenizer for autoregressive language models.
+    See `audiocraft.solvers.compression.CompressionSolver`, and `audiocraft.models.encodec.EncodecModel`.
+- [MultiBandDiffusion](TODO), alternative diffusion-based decoder compatible with EnCodec that
+    improves the perceived quality and reduces the artifacts coming from adversarial decoders.
+"""
 
 # flake8: noqa
 from . import data, modules, models
 
-__version__ = '0.0.2'
+__version__ = '1.0.0'
diff --git a/audiocraft/adversarial/__init__.py b/audiocraft/adversarial/__init__.py
@@ -0,0 +1,22 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+"""Adversarial losses and discriminator architectures."""
+
+# flake8: noqa
+from .discriminators import (
+    MultiPeriodDiscriminator,
+    MultiScaleDiscriminator,
+    MultiScaleSTFTDiscriminator
+)
+from .losses import (
+    AdversarialLoss,
+    AdvLossType,
+    get_adv_criterion,
+    get_fake_criterion,
+    get_real_criterion,
+    FeatLossType,
+    FeatureMatchingLoss
+)
diff --git a/audiocraft/adversarial/discriminators/__init__.py b/audiocraft/adversarial/discriminators/__init__.py
@@ -0,0 +1,10 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+# flake8: noqa
+from .mpd import MultiPeriodDiscriminator
+from .msd import MultiScaleDiscriminator
+from .msstftd import MultiScaleSTFTDiscriminator