Inference #28
Hello, during the inference phase, do I only need to use the 886 audio files from your data/test_audiocaps_subset.json? I have been unable to obtain the results from your paper, even when using your checkpoint.

Comments
Yes, we used those 886 audio files for evaluation. Can you specify which checkpoint you used and which results you were not able to obtain?
I used https://huggingface.co/declare-lab/tango to generate the 886 audio files with Guidance Scale = 3 and Steps = 200, and got:
{
"frechet_distance": 28.07995041974766,
"frechet_audio_distance": 2.2381015516014955,
"kullback_leibler_divergence_sigmoid": 3.8415958881378174,
"kullback_leibler_divergence_softmax": 2.097446918487549,
"lsd": 2.0631229603209094,
"psnr": 15.874651663776682,
"ssim": 0.4171875863485156,
"ssim_stft": 0.09866382013407798,
"inception_score_mean": 7.612150196882789,
"inception_score_std": 0.8235111705490618,
"kernel_inception_distance_mean": 0.010067609062191894,
"kernel_inception_distance_std": 1.404596756557554e-07
}
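[Editor's note: for context, generation with the Hugging Face checkpoint looks roughly like the sketch below, based on the Tango wrapper class shown in the repository README; the guidance keyword and output path are assumptions, not confirmed by this thread.]

```python
import soundfile as sf
from tango import Tango  # wrapper class from the TANGO repository

# Load the Hugging Face checkpoint (downloaded on first use).
tango = Tango("declare-lab/tango")

# A placeholder caption; the actual test captions come from
# data/test_audiocaps_subset.json.
prompt = "An audience cheering and clapping"

# Generate with the settings discussed in this thread
# (the guidance keyword is an assumption).
audio = tango.generate(prompt, steps=200, guidance=3)

# TANGO outputs 16 kHz audio; write it out for evaluation.
sf.write("sample.wav", audio, samplerate=16000)
```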
Do I need to control the length of the generated audio to match the length of the original audio for the metrics?
No, the length doesn't have to be controlled. I added the inference_hf.py script for running evaluation from our Hugging Face checkpoints. Can you try it and check the scores you obtain from this script? I just did two runs and got the following scores:
{
"frechet_distance": 24.4243,
"frechet_audio_distance": 1.7324,
"kl_sigmoid": 3.5901,
"kl_softmax": 1.3216,
"lsd": 2.0861,
"psnr": 15.6047,
"ssim": 0.4061,
"ssim_stft": 0.1027,
"is_mean": 7.5181,
"is_std": 0.6758,
"kid_mean": 0.0066,
"kid_std": 0.0,
"Steps": 200,
"Guidance Scale": 3,
"Test Instances": 886,
"scheduler_config": {
"num_train_timesteps": 1000,
"beta_start": 0.00085,
"beta_end": 0.012,
"beta_schedule": "scaled_linear",
"trained_betas": null,
"variance_type": "fixed_small",
"clip_sample": false,
"prediction_type": "v_prediction",
"thresholding": false,
"dynamic_thresholding_ratio": 0.995,
"clip_sample_range": 1.0,
"sample_max_value": 1.0,
"_class_name": "DDIMScheduler",
"_diffusers_version": "0.8.0",
"set_alpha_to_one": false,
"skip_prk_steps": true,
"steps_offset": 1
},
"args": {
"test_file": "data/test_audiocaps_subset.json",
"text_key": "captions",
"device": "cuda:0",
"test_references": "data/audiocaps_test_references/subset",
"num_steps": 200,
"guidance": 3,
"batch_size": 8,
"num_test_instances": -1
},
"output_dir": "outputs/1688974057_steps_200_guidance_3"
}

{
"frechet_distance": 24.9405,
"frechet_audio_distance": 1.6633,
"kl_sigmoid": 3.551,
"kl_softmax": 1.3122,
"lsd": 2.0957,
"psnr": 15.5877,
"ssim": 0.405,
"ssim_stft": 0.1027,
"is_mean": 7.187,
"is_std": 0.5192,
"kid_mean": 0.0066,
"kid_std": 0.0,
"Steps": 200,
"Guidance Scale": 3,
"Test Instances": 886,
"scheduler_config": {
"num_train_timesteps": 1000,
"beta_start": 0.00085,
"beta_end": 0.012,
"beta_schedule": "scaled_linear",
"trained_betas": null,
"variance_type": "fixed_small",
"clip_sample": false,
"prediction_type": "v_prediction",
"thresholding": false,
"dynamic_thresholding_ratio": 0.995,
"clip_sample_range": 1.0,
"sample_max_value": 1.0,
"_class_name": "DDIMScheduler",
"_diffusers_version": "0.8.0",
"set_alpha_to_one": false,
"skip_prk_steps": true,
"steps_offset": 1
},
"args": {
"test_file": "data/test_audiocaps_subset.json",
"text_key": "captions",
"device": "cuda:3",
"test_references": "data/audiocaps_test_references/subset",
"num_steps": 200,
"guidance": 3,
"batch_size": 8,
"num_test_instances": -1
},
"output_dir": "outputs/1688974524_steps_200_guidance_3"
}

Our results in the paper are averages of multiple runs, as there is some randomness in the diffusion inference process.
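[Editor's note: the metric names in these dumps match the output of the AudioLDM evaluation toolkit mentioned later in this thread. A minimal evaluation sketch, assuming the audioldm_eval package exposes EvaluationHelper as in its README; directory paths follow the dumps above:]

```python
import torch
from audioldm_eval import EvaluationHelper  # assumed API of the audioldm_eval package

device = torch.device("cuda:0")

# Directories of generated and reference audio, per the dumps above.
generated_dir = "outputs/1688974057_steps_200_guidance_3"
reference_dir = "data/audiocaps_test_references/subset"

# Evaluate at 16 kHz, the sampling rate used for all reported scores.
evaluator = EvaluationHelper(16000, device)
metrics = evaluator.main(generated_dir, reference_dir)
print(metrics)  # frechet_distance, frechet_audio_distance, kl_softmax, ...
```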
Thank you for the explanation.
I found that the sampling rate of the reference audio has an impact on the final result. May I ask what the sampling rate of your reference audio was before converting to 16 kHz?
All our reference audio files are at 16 kHz. I checked the AudioLDM Eval repository, and they now mention that the sampling rate can affect the evaluation scores. Their paper and evaluation code indicate that their scores are reported at 16 kHz, so we also report results at the same sampling rate for a fair comparison.
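[Editor's note: since the sampling rate affects the scores, reference audio should be resampled to 16 kHz before evaluation; a minimal sketch with torchaudio, where the file names are placeholders:]

```python
import torchaudio

# Hypothetical input/output paths.
src, dst = "reference_44k.wav", "reference_16k.wav"

# Load a reference clip and resample it to 16 kHz so that metrics are
# computed at the same sampling rate as in the paper.
waveform, sr = torchaudio.load(src)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)
torchaudio.save(dst, waveform, sample_rate=16000)
```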