Unexpected "UNK" captions with single video prediction #62
Hi. Many thanks for such an elaborate issue description. I noticed that it fails to install the correct packages with Anaconda. Would it be possible for you to try it with Miniconda? Could you also try not to skip the aac transcoding? Most likely the issue is that the video is encoded with a different codec. I think you need to look into this, because it worked for the example video. Start by checking whether the audio you give to VGGish is playable and you can hear the sound as you expect.
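A quick way to do that sanity check on the extracted audio is to read the WAV header with Python's standard `wave` module (a minimal sketch; the path in the comment is just a placeholder):

```python
import wave

def probe_wav(path: str) -> dict:
    '''Read basic WAV header info; raises wave.Error if the file is not a valid RIFF/WAVE file.'''
    with wave.open(path, 'rb') as w:
        return {
            'channels': w.getnchannels(),
            'sample_rate': w.getframerate(),
            'sample_width_bytes': w.getsampwidth(),
            'duration_sec': w.getnframes() / w.getframerate(),
        }

# Example: probe_wav('/home/mrt/Projects/BMT/sample/women_long_jump.wav')
```

If this raises an error or reports a zero duration, the ffmpeg step silently failed (which is easy to miss with `-loglevel panic`).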
Hi, thank you very much for your prompt response. I did create the environment using Miniconda. I created a minimal code snippet to extract the audio:
import os
import subprocess

def which_ffmpeg() -> str:
    '''Determines the path to the ffmpeg binary.

    Returns:
        str -- path to the binary
    '''
    result = subprocess.run(['which', 'ffmpeg'], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    ffmpeg_path = result.stdout.decode('utf-8').replace('\n', '')
    return ffmpeg_path

def extract_wav_from_mp4(video_path: str, tmp_path: str) -> str:
    '''Extracts a .wav file directly from an .mp4, skipping the intermediate .aac step.

    Args:
        video_path (str): Path to a video
        tmp_path (str): Folder where the temporary audio file is saved

    Returns:
        str -- path to the .wav audio
    '''
    assert which_ffmpeg() != '', 'Is ffmpeg installed? Check if the conda environment is activated.'
    assert video_path.endswith('.mp4'), 'The file does not end with .mp4. Comment this out if expected.'
    # extract the video filename from the video_path
    video_filename = os.path.split(video_path)[-1].replace('.mp4', '')
    # the temp file will be saved in `tmp_path` with the same name
    audio_wav_path = os.path.join(tmp_path, f'{video_filename}.wav')
    # constructing the shell command and calling it
    mp4_to_wav = f'{which_ffmpeg()} -hide_banner -loglevel panic -y -i {video_path} {audio_wav_path}'
    subprocess.call(mp4_to_wav.split())
    return audio_wav_path

# extract audio files from .mp4
extract_wav_from_mp4('/home/mrt/Projects/BMT/sample/women_long_jump.mp4', '/home/mrt/Projects/BMT/sample')
extract_wav_from_mp4('/home/mrt/Projects/BMT/sample/my_video.mp4', '/home/mrt/Projects/BMT/sample')

Running this code snippet does in fact create the `.wav` files.
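One fragile spot in the snippet above: `mp4_to_wav.split()` breaks if a path contains spaces. A safer sketch (paths here are just placeholders) builds the argument list directly:

```python
from typing import List

def build_mp4_to_wav_cmd(ffmpeg: str, video_path: str, wav_path: str) -> List[str]:
    '''Build the ffmpeg argument list directly; passing a list to subprocess
    avoids the word-splitting problems of calling .split() on a command string.'''
    return [ffmpeg, '-hide_banner', '-loglevel', 'panic', '-y',
            '-i', video_path, wav_path]

cmd = build_mp4_to_wav_cmd('/usr/bin/ffmpeg', '/tmp/my video.mp4', '/tmp/my video.wav')
# subprocess.call(cmd)  # each path stays a single argument, spaces and all
```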
I then tried the original two-stage version:

import os
import subprocess

def which_ffmpeg() -> str:
    '''Determines the path to the ffmpeg binary.

    Returns:
        str -- path to the binary
    '''
    result = subprocess.run(['which', 'ffmpeg'], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    ffmpeg_path = result.stdout.decode('utf-8').replace('\n', '')
    return ffmpeg_path

def extract_wav_from_mp4(video_path: str, tmp_path: str) -> str:
    '''Extracts a .wav file from the .aac which is extracted from the .mp4.

    We cannot convert .mp4 to .wav directly, so we do it in two stages: .mp4 -> .aac -> .wav

    Args:
        video_path (str): Path to a video
        tmp_path (str): Folder where the temporary audio files are saved

    Returns:
        [str, str] -- paths to the .wav and .aac audio
    '''
    assert which_ffmpeg() != '', 'Is ffmpeg installed? Check if the conda environment is activated.'
    assert video_path.endswith('.mp4'), 'The file does not end with .mp4. Comment this out if expected.'
    # extract the video filename from the video_path
    video_filename = os.path.split(video_path)[-1].replace('.mp4', '')
    # the temp files will be saved in `tmp_path` with the same name
    audio_aac_path = os.path.join(tmp_path, f'{video_filename}.aac')
    audio_wav_path = os.path.join(tmp_path, f'{video_filename}.wav')
    # constructing shell commands and calling them
    mp4_to_acc = f'{which_ffmpeg()} -hide_banner -loglevel panic -y -i {video_path} -acodec copy {audio_aac_path}'
    aac_to_wav = f'{which_ffmpeg()} -hide_banner -loglevel panic -y -i {audio_aac_path} {audio_wav_path}'
    subprocess.call(mp4_to_acc.split())
    subprocess.call(aac_to_wav.split())
    return audio_wav_path, audio_aac_path

# extract audio files from .mp4
audio_wav_path, audio_aac_path = extract_wav_from_mp4('/home/mrt/Projects/BMT/sample/women_long_jump.mp4', '/home/mrt/Projects/BMT/sample')
audio_wav_path, audio_aac_path = extract_wav_from_mp4('/home/mrt/Projects/BMT/sample/my_video.mp4', '/home/mrt/Projects/BMT/sample')

Running this code snippet on my_video.mp4 produces an error instead of the expected `.wav` file.
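Before deciding whether `-acodec copy` will yield a valid `.aac` file, it helps to check which codec the source audio actually uses. A minimal sketch using ffprobe (only the command construction is tested here; running it requires ffprobe on PATH):

```python
import subprocess
from typing import List

def build_ffprobe_codec_cmd(ffprobe: str, video_path: str) -> List[str]:
    '''ffprobe command that prints only the codec name of the first audio stream.'''
    return [ffprobe, '-v', 'error', '-select_streams', 'a:0',
            '-show_entries', 'stream=codec_name',
            '-of', 'default=noprint_wrappers=1:nokey=1', video_path]

def audio_codec(video_path: str, ffprobe: str = 'ffprobe') -> str:
    '''Return e.g. "aac" or "opus"; requires ffprobe to be installed.'''
    out = subprocess.run(build_ffprobe_codec_cmd(ffprobe, video_path),
                         stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
    return out.stdout.decode('utf-8').strip()
```

If `audio_codec(...)` returns anything other than `aac`, the `-acodec copy` stage will write non-AAC data into a file with an `.aac` extension.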
If I understand it correctly, your line of code expects a video containing an audio stream that uses an `aac` codec. I will try to convert my video accordingly.
I think the problem is with the video you are trying to use, and yes, it should work for any wav file. Maybe your video is out of the domain of the training videos.
This line of code expects the video to be mp4; it then extracts whatever the audio is encoded in and transcodes it to aac. It could be that your ffmpeg does not support transcoding to aac. Try doing the same on Google Colab or some other machine. If ffmpeg can't transcode from X to aac, the installation does not support this codec. Are you sure your mp4 file is not actually an .mkv?
Thank you for your indications! I will try it on another machine/environment and see if another ffmpeg version supports the transcoding. You are right in that the video was not a true `.mp4` file. I will work on your suggestions and let you know if they resolve the issue. Thank you very much for your time and consideration!
May I ask you to try to transcode my video into your format (vp9/opus etc.) and repeat your steps? Do you get the same result? If you are using youtube-dl, try to get a video with h264 and aac codecs and run on it.
Also, I realized that you use `-acodec copy`, which simply copies the codec (opus, instead of aac) for the audio. Can you specify aac there, as suggested in #38?
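Concretely, the fix suggested in #38 amounts to replacing `-acodec copy` with `-acodec aac`, so that ffmpeg transcodes whatever the source audio codec is (e.g. opus) rather than copying it unchanged into an `.aac` container. A sketch of the corrected command builder (paths are placeholders):

```python
from typing import List

def build_mp4_to_aac_cmd(ffmpeg: str, video_path: str, aac_path: str) -> List[str]:
    '''Transcode the audio stream to AAC (instead of `-acodec copy`,
    which would copy e.g. Opus data into a file with an .aac extension).'''
    return [ffmpeg, '-hide_banner', '-loglevel', 'panic', '-y',
            '-i', video_path, '-acodec', 'aac', aac_path]
```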
I changed line 28 as suggested in #38 and that seems to resolve the issue with extracting the appropriate `.wav` file.
Unfortunately, I won't be able to try your other suggestions until Monday. I will update you once I do. Have a great weekend!
Sure, have a great weekend. Did you try to run the prediction script where you were getting UNKs?
I was not able to get that far today, unfortunately. I was able to download the video with h264 and aac codecs. On Monday I will run the prediction script and try to get it to work however I can. Thank you very much once again for your time and consideration, Vladimir.
Hello, Vladimir. I tried converting my_video.webm to .mp4 with:

ffmpeg -i my_video.webm -c:v libx264 -c:a aac -b:a 160k -crf 20 -preset slow -vf format=yuv420p -movflags +faststart my_video.mp4

This resulted in a valid h264/aac `.mp4` file:

> ffprobe my_video.mp4
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'my_video.mp4':
Metadata:
major_brand : isom
minor_version : 512
compatible_brands: isomiso2avc1mp41
encoder : Lavf58.76.100
Duration: 00:02:28.10, start: 0.000000, bitrate: 5550 kb/s
Stream #0:0(eng): Video: h264 (High) (avc1 / 0x31637661), yuv420p(tv, bt2020nc/bt2020/arib-std-b67), 1920x1080 [SAR 1:1 DAR 16:9], 5382 kb/s, 29.97 fps, 29.97 tbr, 30k tbn, 59.94 tbc (default)
Metadata:
handler_name : VideoHandler
vendor_id : [0][0][0][0]
Side data:
Mastering Display Metadata, has_primaries:1 has_luminance:1 r(0.6800,0.3200) g(0.2650,0.6900) b(0.1500 0.0600) wp(0.3127, 0.3290) min_luminance=0.005000, max_luminance=1000.000000
Stream #0:1(eng): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 160 kb/s (default)
Metadata:
handler_name : SoundHandler
vendor_id : [0][0][0][0]

However, extracting the features and running the prediction on this converted video still produced the same "UNK" captions.
I then tried the opposite, converting women_long_jump.mp4 to vp9/opus with:

ffmpeg -i women_long_jump.mp4 -c:v libvpx-vp9 -c:a libopus -b:v 0 -crf 20 women_long_jump_transcoded.webm

After the transcoding, I simply renamed the file from `.webm` to `.mp4`:

> ffprobe women_long_jump_transcoded.mp4
Input #0, matroska,webm, from 'women_long_jump_transcoded.mp4':
Metadata:
COMPATIBLE_BRANDS: isommp42
MAJOR_BRAND : mp42
MINOR_VERSION : 0
ENCODER : Lavf58.76.100
Duration: 00:00:35.16, start: -0.007000, bitrate: 697 kb/s
Stream #0:0: Video: vp9 (Profile 0), yuv420p(tv, progressive), 480x360, SAR 1:1 DAR 4:3, 24.83 fps, 24.83 tbr, 1k tbn, 1k tbc (default)
Metadata:
HANDLER_NAME : ISO Media file produced by Google Inc. Created on: 05/06/2018.
VENDOR_ID : [0][0][0][0]
ENCODER : Lavc58.134.100 libvpx-vp9
DURATION : 00:00:35.086000000
Stream #0:1: Audio: opus, 48000 Hz, stereo, fltp (default)
Metadata:
HANDLER_NAME : ISO Media file produced by Google Inc. Created on: 05/06/2018.
VENDOR_ID : [0][0][0][0]
ENCODER : Lavc58.134.100 libopus
DURATION : 00:00:35.163000000

I extracted the features and ran the prediction script on this transcoded video, and it still produced the expected captions.
Seeing these results, I lean more toward believing that this specific video is indeed out of the domain your networks were trained on. However, the video shows people, a track-like floor, and things that I would guess are similar to what the networks might have seen during training. It does not seem like this specific video is too far removed from the women_long_jump video. Can you see any other reason why this might be?

EDIT: I noticed that the dense captions for the
Hello, Vladimir.
First of all, congratulations on such a fantastic project. I was introduced to this work through many other papers that cited it and used it as a base to build upon. I enjoyed your video presentation, and I think you are doing a very good job at keeping up with all the repo issues.
I ran the sample code `single_video_prediction.py` on the given example (`women_long_jump.mp4`) without major issues (I had to change the CUDA and PyTorch versions in the conda environment, as reported in #45). However, when I tried the code on a custom video, let's call it `my_video.mp4`, I got some errors. VGGish was unable to extract a `.wav` file from the audio because it had no `aac` codec (I checked with `ffprobe my_video.mp4` and the audio used the `opus` codec instead of `aac`). So, I changed these 2 lines in `BMT/submodules/video_features/models/vggish/utils/utils.py` for the following, which resolved the issue.

After obtaining the `i3d` and `vggish` features, I tried running BMT on the video using the following command:

Obtaining:
Seeing that it was iterating over a 0-d tensor, I tried removing the NMS and ran it again with:

Obtaining a list of sentences with the token "UNK":
I am a bit at a loss here, as I do not have much experience working with text and audio (only with image and video). Could you point me in the right direction? I am unsure of what might be the root cause. I suspect it could be one of the following:

1. `torch` 1.4.0 instead of 1.2.0, as it was the closest version that could work with my GPU. I kept `torchtext` at version 0.3.1 (same as yours). However, the code works for the example video you provide, so it seems unlikely that this is the root cause.
2. Extracting the `.wav` file directly from the `.mp4`, skipping the intermediate step of obtaining an `.aac` file. I do not see any inconvenience in doing so; in fact, it seems like a more portable option. However, I remain unsure whether you did this for a specific reason I am unaware of.

Desktop (please complete the following information):
Your `conda` environment
environmentThe text was updated successfully, but these errors were encountered: