Recognize TV shows using an HLS playlist #305
To be honest, you do want short clips. Google uses this method. Is there
a design reason that requires 60 seconds? Think of the algorithm: any noise
or variation will make the "beat" vary. In my attempt, I used 2 to 4
seconds.
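One way to try the short-clip suggestion without changing the HLS segment length is to cut each segment into overlapping windows and query each window separately. The sketch below only computes the window offsets; the function name, the 3-second window, and the 1.5-second hop are illustrative assumptions, not values taken from DejaVu or from this thread.

```python
def short_windows(segment_seconds, window_seconds=3.0, hop_seconds=1.5):
    """Return (start, end) offsets, in seconds, of overlapping windows in a segment.

    Hypothetical helper: the defaults are illustrative, chosen inside the
    2 to 4 second range mentioned above.
    """
    if window_seconds <= 0 or hop_seconds <= 0:
        raise ValueError("window_seconds and hop_seconds must be positive")
    windows = []
    start = 0.0
    # Stop as soon as a full window no longer fits in the segment.
    while start + window_seconds <= segment_seconds:
        windows.append((start, start + window_seconds))
        start += hop_seconds
    return windows


if __name__ == "__main__":
    for start, end in short_windows(6.0):
        print(f"query window {start:.1f}s to {end:.1f}s")
```

Each window could then be extracted and passed to the recognizer on its own, and the per-window results compared or voted on.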
…On Thu, Mar 14, 2024, 1:41 PM nathandebalthasar ***@***.***> wrote:
Hello,
I've been trying to recognize TV shows as well as ads ingested using
DejaVu in real time using an HLS playlist. The shows last from a few
minutes to hours and the ads generally last for a few dozen seconds.
The main problem lies in the fact that when doing the recognition on a TS
segment that should match an audio file ingested by DejaVu, the
input_confidence attribute, depending on the length of the segment, is
really low, or not close enough to 1.
When using 60-second TS segments, the input confidence value tends towards
0. Often, the value is <= 0.1 using the default settings and can grow to <=
0.2 using these
<https://github.com/denis-stepanov/advent?tab=readme-ov-file#dejavu-tuning>
settings.
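A possible explanation for the low values on long segments, sketched under an assumption: as far as I understand it, input_confidence is the number of matched hashes divided by the total number of hashes in the queried audio. If that holds, a segment in which only part of the audio belongs to the ingested file cannot score close to 1. The helper names below are hypothetical, and the even spread of hashes over time is a simplification.

```python
def input_confidence(hashes_matched, input_total_hashes):
    """Ratio of matched hashes to hashes in the query (assumed definition)."""
    if input_total_hashes <= 0:
        return 0.0
    return round(hashes_matched / input_total_hashes, 2)


def best_case_confidence(matching_seconds, segment_seconds):
    """Upper bound on the ratio when only part of the segment can match.

    Assumes hashes are spread evenly over time and that every hash in the
    matching part is found, which real audio will not achieve.
    """
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    overlap = min(matching_seconds, segment_seconds)
    return round(overlap / segment_seconds, 2)


if __name__ == "__main__":
    # A 15-second ad inside a 60-second segment.
    print(best_case_confidence(15, 60))
    # The same ad filling a whole 6-second segment.
    print(best_case_confidence(15, 6))
```

Under these assumptions the 60-second case is capped at 0.25 before any codec or noise loss, while a 6-second segment taken entirely from the ad can in principle reach 1.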
Using 6-second segments, the value is closer to 1, around 0.5 to 0.9 most
of the time. However, the second result returned by the program often has
a value closer to 1 even though it is the wrong audio file.
The files ingested are WMV files, and the audio specs are the following:
- 3 audio tracks
- Codec WMA 9.2
- Constant bit rate mode at 96kbps
- 2 channels
- 48 kHz sample rate
What I did is transform these WMV files into ts files using ffmpeg to
match the ts segments characteristics, which are the following:
- Single audio track
- Codec AAC LC Version 4
- Muxing Mode: ADTS
- 2 channels
- 48 kHz sample rate
- Lossy compression mode
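The conversion described above could look roughly like the following sketch, which builds an ffmpeg command producing a single AAC audio track at 48 kHz stereo in an MPEG-TS container. The function names and file names are hypothetical, the choice of the first of the three audio tracks is an assumption, and the bitrate is left at the encoder default because the thread does not state it for the TS segments.

```python
import subprocess


def build_ffmpeg_command(source, destination):
    """Build an ffmpeg argument list converting a WMV file to an audio-only TS file."""
    return [
        "ffmpeg",
        "-i", source,
        "-map", "0:a:0",   # assumption: keep only the first audio track
        "-vn",             # drop video
        "-c:a", "aac",     # ffmpeg's native AAC encoder (AAC LC by default)
        "-ar", "48000",    # 48 kHz sample rate
        "-ac", "2",        # 2 channels
        "-f", "mpegts",    # MPEG transport stream container
        destination,
    ]


def convert(source, destination):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(source, destination), check=True)


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(build_ffmpeg_command("ad.wmv", "ad.ts")))
```

Matching the container and codec this way does not guarantee matching fingerprints, since the broadcast chain may apply its own processing before the HLS segments are produced.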
Also, something weird I noticed is that when taking a part of a TS file
that I transformed from a WMV file which is ingested by DejaVu, the
input_confidence will most of the time be 1 or close to 1. But when taking
the same part of audio from a ts segment of my HLS playlist, the result
will not be good, close to 0 for 60-second segments or close to 1 but not
enough using 6-second segments. How can one explain that?
How can you get more relevant results?
|
Agree with the above. Shorter clips should yield better results. Curious about your use case. Can you share more? |
There is no particular reason to use 60-second segments. I was using 6-second segments at the beginning, and at some point I found that longer segments produced fewer false positives, at the cost of a loss of precision.
Does this apply only to the files used during recognition, or also to the files that DejaVu ingests?
I'm building a solution that aims to recognize a given television program, series, or ad in real time using TS segments from an HLS playlist. |
Hi @quannabe @mkommar, we tested with shorter clips but we ended up with low confidence results as well. This is how we proceeded. We ingested TV ads that last between 10 and 20 seconds into DejaVu. If I take the exact same file and compare it with what DejaVu fingerprinted, we obtain a very good confidence level (1 or close to 1). Let's assume we have the following:
We ingest all of them in DejaVu, then if we provide Now let's do the same, we ingest:
The start and end of our If we run the recognition on this segment, this is where we end up with very low confidence. |
Interesting use case! I've had issues with query times greatly increasing as the audio library size increases. Have you run into this? |
Got it. Are you requiring passive listening or is it from a direct
recording that this identification will happen? Meaning is the use case
always going to have a direct recording from a source stream? Or will it
pick up audio from the background on a phone or Alexa device?
Mahesh
…On Wed, Mar 20, 2024, 10:32 AM William Sell ***@***.***> wrote:
Interesting use case!
I've had issues with query times greatly increasing as the audio library
size increases. Have you run into this?
|