SageMaker CreateTransformJob Errors Out #9
Comments
@r-token From the Can you run
and check the value of the You can then explicitly set the architecture when you build: docker build --platform linux/amd64 ..... Please let us know how you get on so we can update the code/docs as needed. |
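As background for the suggestion above: an `exec format error` usually means the binary inside the image was built for a different CPU architecture than the machine running it, for example an image built on an ARM laptop and run on an x86_64 instance. The sketch below is only an illustration of that check, assuming the job runs on an x86_64 instance. The tag `whisper-batch:latest` and both helper names are hypothetical and not taken from the project.

```python
import platform

# Machine names that usually indicate an ARM host (for example Apple Silicon).
ARM_MACHINES = {"arm64", "aarch64"}


def needs_explicit_platform(machine=None):
    """Return True when the host CPU is ARM, so an amd64 image must be requested explicitly."""
    machine = (machine or platform.machine()).lower()
    return machine in ARM_MACHINES


def docker_build_command(tag, context=".", target="linux/amd64"):
    """Build the argument list for `docker build` with a pinned target platform."""
    return ["docker", "build", "--platform", target, "-t", tag, context]


if __name__ == "__main__":
    # "whisper-batch:latest" is a placeholder tag, not the project's real image name.
    if needs_explicit_platform():
        print("ARM host detected; pin the platform when building:")
    print(" ".join(docker_build_command("whisper-batch:latest")))
```

The command is returned as a list so it can be passed to `subprocess.run` without shell quoting issues.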
That was it @eoinsha! I did see I can tell the SageMaker job is actually running now, but it still fails. CloudWatch now shows The traceback shows the following:
As well as |
I haven't seen that error and nothing obvious springs to mind. |
No, I haven't changed anything about the SageMaker instance. Interesting. I'll keep looking into it. I did see an error when running the |
I ended up needing to do the following:
Those steps resulted in the Whisper transcription working for a given audio file (woo!), but Step Functions doesn't seem to understand when that SageMaker transform job is done. It times out on the Do you know why the Batch Transform Job might not be communicating to Step Functions that it finished the job successfully? It just times out instead. I tried adding the The last thing that shows in the SageMaker TransformJobs CloudWatch logs is: |
Update: it looks like it's stuck inside the PID while loop:

    pids = set([nginx.pid, gunicorn.pid])
    while True:
        print('in the pid while loop')
        pid, _ = os.wait()
        if pid in pids:
            break
    print('Past the pid while loop, calling sigterm_handler')

I can see "in the pid while loop" printed repeatedly. However, the If I remove that while loop manually and force it to move on, it then gets stuck inside the |
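A note on the loop quoted above: `os.wait()` blocks until some child process exits, so each pass through the loop corresponds to one child exiting rather than a busy spin. The following is a minimal, self-contained sketch of the same pattern on a POSIX system. It uses short-lived Python child processes as stand-ins for nginx and gunicorn, and the helper name `wait_for_first_exit` is hypothetical.

```python
import os
import subprocess
import sys


def wait_for_first_exit(procs):
    """Block until one of the given child processes exits and return its PID."""
    pids = {p.pid for p in procs}
    while True:
        # os.wait() blocks here until *any* child of this process exits.
        pid, _status = os.wait()
        if pid in pids:
            return pid


if __name__ == "__main__":
    # Stand-ins for the two server processes: one exits quickly, one lingers.
    fast = subprocess.Popen([sys.executable, "-c", "pass"])
    slow = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    exited = wait_for_first_exit([fast, slow])
    print("first child to exit:", "fast" if exited == fast.pid else "slow")
    # Clean up the remaining child so the demo does not leave it running.
    slow.kill()
    slow.wait()
```

While both children are alive, the parent simply sleeps inside `os.wait()`, which is why a long-running worker can look like a hang.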
Ok, never mind! It was not stuck in that loop - I blame my lack of Python knowledge for not understanding that better. It simply takes a long time for Whisper to transcribe the audio, at least on the instance type I'm currently running (I tried both I think we can close this issue now @eoinsha. However, I am surprised it's taking 8 minutes to complete the transcription for a 30 second audio clip. Is that expected? I am planning on running it on a Regardless, the core changes necessary for me were:
|
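One way to tell a slow transform job from a stuck one is to query its status directly instead of waiting on the state machine. The sketch below polls the SageMaker `DescribeTransformJob` API until the job reaches a terminal state. It assumes a client that behaves like `boto3.client("sagemaker")`, and the function name and default intervals are illustrative choices, not part of the project.

```python
import time

# Statuses after which a transform job will not change any further.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_transform_job(client, job_name, poll_seconds=30, max_polls=120):
    """Poll DescribeTransformJob until the job reaches a terminal state.

    `client` is expected to behave like boto3.client("sagemaker").
    """
    for _ in range(max_polls):
        response = client.describe_transform_job(TransformJobName=job_name)
        status = response["TransformJobStatus"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} still running after {max_polls} polls")


# Example call (requires boto3 and AWS credentials; the job name is a placeholder):
#   import boto3
#   print(wait_for_transform_job(boto3.client("sagemaker"), "my-transform-job"))
```

With the defaults this waits up to an hour, so a job that only needs several minutes to start would report `Completed` rather than a timeout.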
After further investigation, I've found that the speed bottleneck is not the actual Whisper transcription process. The transcription itself is only taking about 3 seconds for a 30 second audio clip. So that's great. The bottleneck seems to be how long it takes the SageMaker Batch Transform Job to load in the 5 GB Docker image from ECR. That is what I think is taking about 8 minutes. Once it's loaded in, the job finishes quickly. Any advice on how to speed that up? Can SageMaker Batch Transform Jobs use a "warm" instance that always has that image loaded? Or is there another way to slim down the Docker image so it's quicker to load? |
I have a pretty similar bug: my SageMaker CreateTransformJob keeps loading, and after 30 minutes I stopped the job. My HeadObject Whisper Output step also ends in Caught error. The SageMaker CreateTransformJob logs are empty. |
@r-token You're correct that the transcription time is not the bottleneck here. There is a significant overhead in startup. I didn't observe that this was all down to ECR load. How did you measure this? We typically use it for 30-60 minute audio. The duration is usually in the 30 minute range. For longer episodes, I have had to increase the instance size to avoid it timing out. @timojDE - what audio duration are you using? |
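On the measurement question, one rough starting point is the set of timestamps that `DescribeTransformJob` returns for a finished job: `CreationTime`, `TransformStartTime`, and `TransformEndTime`. The sketch below splits them into the time before the job started on its instances and the time spent on them. These timestamps do not isolate the image pull itself, so this narrows down where the overhead sits rather than proving it is ECR. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def transform_job_timings(description):
    """Split a DescribeTransformJob response into coarse phase durations.

    `description` is the dict returned by
    boto3.client("sagemaker").describe_transform_job(...) for a finished job.
    """
    created = description["CreationTime"]
    started = description["TransformStartTime"]
    ended = description["TransformEndTime"]
    return {
        "before_start": started - created,  # queueing and instance provisioning
        "on_instance": ended - started,     # everything after the job starts on the instance
        "total": ended - created,
    }


# Example call (requires boto3 and AWS credentials; the job name is a placeholder):
#   import boto3
#   desc = boto3.client("sagemaker").describe_transform_job(TransformJobName="my-transform-job")
#   for phase, duration in transform_job_timings(desc).items():
#       print(phase, duration)
```

Comparing `on_instance` with the few seconds of actual transcription reported in the container logs gives an upper bound on the container startup cost.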
I tried it with the sample audio files, so about 10 seconds. Edit: |
Hi there! Love the AWS Bites podcast. Thank you for open-sourcing this work.
Currently, the SageMaker CreateTransformJob is failing for me. Here's what my state machine graph looks like:
You'll notice that the HeadObject Whisper Output step fails as well. I assume these two issues are related. I've detailed both for you below.
HeadObject Whisper Output error message is an S3.NoSuchKeyException. And I can confirm that there is no whisper-batch-output/file.json object within the bucket specified. How is that supposed to get written? Here's the full error:
SageMaker CreateTransformJob error message in CloudWatch simply says exec /app/serve: exec format error. There are a few ideas online as to what might cause this, but I'm out of my wheelhouse here. And for all I know, this could be caused by the HeadObject Whisper Output error listed above.
Any idea on what could be going wrong? Happy to provide any additional information that might be helpful. I did take what you have here and adjust it to work with the Serverless Framework instead of AWS SAM - so something may have simply gotten lost in translation.
Thanks again for publishing this!
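A note on the `NoSuchKey` error described above: it only says that the object was absent when HeadObject ran, which is what one would expect if the transform job failed before writing any output. A quick way to see what, if anything, was written is to list the output prefix. The sketch below assumes a client that behaves like `boto3.client("s3")`. Both helper names are hypothetical, and the bucket name in the example call is a placeholder.

```python
def list_output_keys(s3_client, bucket, prefix):
    """Return the keys currently stored under `prefix` in `bucket`.

    `s3_client` is expected to behave like boto3.client("s3").
    Only the first page (up to 1000 keys) is read, which is enough for a spot check.
    """
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    # "Contents" is absent from the response when nothing matches the prefix.
    return [obj["Key"] for obj in response.get("Contents", [])]


def output_exists(s3_client, bucket, key):
    """Return True if exactly this key is present in the bucket."""
    return key in list_output_keys(s3_client, bucket, key)


# Example call (requires boto3 and AWS credentials; the bucket name is a placeholder):
#   import boto3
#   s3 = boto3.client("s3")
#   print(list_output_keys(s3, "my-bucket", "whisper-batch-output/"))
```

Listing the prefix rather than probing a single key also reveals cases where output exists under a different name than the one the state machine expects.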