
Performance on FMA-large not so great #1

Open
shenberg opened this issue Mar 8, 2024 · 2 comments
shenberg commented Mar 8, 2024

Hi,

Thank you for this repository! I'm exploring the space of audio fingerprints, and yours is the only modern repository that just works. With very minor modifications, I got it running on a Mac, with a newer version of PyTorch, on the GPU, for big performance gains!

I attempted to generate a database for FMA-large by downloading your model weights and modifying the relevant configurations. I encountered three problems along the way:

  1. When normalizing the vectors (emb_db/np.linalg.norm(emb_db, axis=1).reshape(-1,1)), the math was performed in float16 (utils/dataclass.py) and the norm calculation overflowed for some fingerprints. Casting emb_db to np.float32 solved this (it's enough to do this only for the norm calculation).
  2. The metadata similarly overflowed (the maximum finite float16 value is 65504, and FMA-large has more than 65k songs, so song indices stored as float16 overflow and lose integer precision). Again, moving to float32 solved the issue.
  3. faiss crashed in many ways. The root cause: faiss officially supports only conda installs, while pip installs may work but are unsupported according to the authors (they say as much in a GitHub issue comment), and mixing pip-installed pytorch with faiss causes OpenMP conflicts. My solution was to conda install as much as possible (conda install -c pytorch faiss-cpu pytorch::pytorch torchvision torchaudio, then conda install scipy matplotlib, and only then pip install natsort pytorch-lightning==1.9.5 soundfile).
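For reference, the first two overflow modes are easy to reproduce in a few lines of numpy (array names here are illustrative, not the repository's actual variables):

```python
import numpy as np

# Toy embedding database stored in float16 to save memory, roughly as
# in the repository's utils/dataclass.py.
rng = np.random.default_rng(0)
emb_db = rng.normal(scale=30.0, size=(1000, 128)).astype(np.float16)

# Problem 1: the squared terms are accumulated in float16 before the
# sqrt, and the running sum can exceed float16's max (~65504) -> inf.
norms_f16 = np.linalg.norm(emb_db, axis=1)

# Casting to float32 just for the norm calculation avoids the overflow;
# the normalized embeddings can still be stored back as float16.
norms_f32 = np.linalg.norm(emb_db.astype(np.float32), axis=1)
emb_db_normed = (emb_db / norms_f32.reshape(-1, 1)).astype(np.float16)

# Problem 2: song indices stored as float16 break down too. Integers
# above 2048 are no longer all exactly representable, and values past
# ~65504 become inf.
ids = np.arange(70000).astype(np.float16)
```

Casting only at the norm calculation keeps the database itself in float16, so the memory savings for large databases are preserved.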

Anyhow, after I got all the issues sorted out, I generated some 10,000 clean 8-second queries from fma-large and queried them against the DB. My accuracy was ~91%, and on listening to a few mistakes, they were "reasonable". When I tested it against queries with degradations, accuracy dropped to 21% (some details: noise is from TUT, SNR between 0 and 5 dB, RIR convolutions from the MIT RIR survey, highpass filter randomly between 0-30Hz - I didn't invent this; it's taken from the [https://github.com/deezer/musicFPaugment] configuration 'full_light'). Note that these same settings, but at 8 kHz, got ~65% accuracy with audfprint.
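This isn't the musicFPaugment code, but the additive-noise part of the degradation can be sketched as scaling noise to hit a target SNR (the RIR convolution and highpass steps are omitted, and all names here are illustrative):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture clean + noise has the requested SNR in dB.

    A minimal sketch of additive-noise degradation; a full pipeline
    like musicFPaugment also applies RIR convolution and filtering.
    """
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Target noise power for the desired SNR: P_n = P_c / 10^(SNR/10)
    target_noise_power = clean_power / (10.0 ** (snr_db / 10.0))
    noise_scaled = noise * np.sqrt(target_noise_power / noise_power)
    return clean + noise_scaled

rng = np.random.default_rng(0)
clean = rng.normal(size=16000 * 8)   # 8 s of toy "audio" at 16 kHz
noise = rng.normal(size=16000 * 8)
mixed = mix_at_snr(clean, noise, snr_db=0.0)  # SNR drawn from [0, 5] dB in the setup above
```

At 0 dB SNR the noise carries as much power as the signal, which is why this regime is so punishing for fingerprinters.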

Can you help me figure out the difference? (This is the model from "Attention-based Audio Embeddings for Query-by-Example," right?)

anupsingh15 (Collaborator) commented Mar 15, 2024

Hi @shenberg

The current source code is for indexing smaller databases, so I used float16 wherever possible to avoid excessive memory usage. A possible source of error is how you preprocess the audio you input to the model, or how you add distortions to the audio query. Note that we resample the audio to 16 kHz and do not perform any filtering as a preprocessing step. Do you use our modules to load and add distortions to clean audio segments?
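That preprocessing contract - resample to 16 kHz, no extra filtering - can be sketched as follows, assuming scipy is available (the function name is illustrative, not the repository's actual module):

```python
import numpy as np
from scipy.signal import resample_poly

def preprocess(audio, orig_sr, target_sr=16000):
    """Resample to target_sr and do nothing else (no explicit filtering)."""
    if orig_sr == target_sr:
        return audio
    # resample_poly applies its own anti-aliasing filter internally;
    # reduce the ratio to lowest terms for an exact rational resample.
    g = np.gcd(orig_sr, target_sr)
    return resample_poly(audio, target_sr // g, orig_sr // g)

audio_44k = np.zeros(44100)            # 1 s of silence at 44.1 kHz
audio_16k = preprocess(audio_44k, 44100)
```

If the query pipeline resamples to a different rate (e.g. 8 kHz for audfprint) or filters the audio before the model sees it, the spectrogram statistics no longer match what the model was trained on, which could explain part of the accuracy gap.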

shenberg (Author) commented Mar 15, 2024

Hiya, thanks for the response!

I use different code to perform the distortions, since I'm comparing performance against audfprint and other systems. I used the code at [https://github.com/deezer/musicFPaugment], modified to generate 16 kHz distorted samples and save them to .wav.

Note that audfprint, configured to downsample the same files to 8 kHz, achieved ~70% accuracy.

@anupsingh15 anupsingh15 self-assigned this Mar 15, 2024