Hello,

We are using this Dockerfile to generate the virtualenv that we later provide to our EMR Serverless 7.1 application:
```dockerfile
FROM --platform=linux/amd64 public.ecr.aws/amazonlinux/amazonlinux:2023-minimal AS base

RUN dnf install -y gcc python3 python3-devel

ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install \
        venv-pack==0.2.0 \
        pytz==2022.7.1 \
        boto3==1.33.13 \
        pandas==1.3.5 \
        python-dateutil==2.8.2

RUN mkdir /output && venv-pack -o /output/pyspark_ge.tar.gz

FROM scratch AS export
COPY --from=base /output/pyspark_ge.tar.gz /
```
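For context, the packed archive is attached to the job roughly like this (application ID, role ARN, bucket, and script path below are placeholders, not our real values), following the `spark.archives` / `PYSPARK_PYTHON` pattern from the EMR Serverless documentation for custom Python virtualenvs:

```python
import boto3

# Sketch of the job submission (IDs, ARN, and bucket are placeholders).
# spark.archives unpacks the venv to ./environment on driver and executors,
# and the PYSPARK_PYTHON settings point Spark at its interpreter.
emr = boto3.client("emr-serverless")
emr.start_job_run(
    applicationId="<application-id>",
    executionRoleArn="arn:aws:iam::<account>:role/<job-role>",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://<bucket>/scripts/main.py",
            "sparkSubmitParameters": (
                "--conf spark.archives=s3://<bucket>/artifacts/pyspark_ge.tar.gz#environment "
                "--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python "
                "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python "
                "--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
            ),
        }
    },
)
```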
Within the Spark application we have a step that runs `['aws', 's3', 'mv']` via `check_call` from the `subprocess` module. In that case it seems the virtualenv is not used; instead the command runs under the global Python (3.9), which ships without dateutil.
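An illustrative way to see what is happening (a diagnostic sketch, not taken from our job): the driver process itself runs the venv interpreter, but executables launched via `subprocess` are resolved through `PATH`, which still points at the base image:

```python
import shutil
import subprocess
import sys

# Illustrative check: the PySpark driver runs the venv's interpreter,
# but child processes resolve executables through the inherited PATH.
print(sys.executable)        # the venv's python, e.g. ./environment/bin/python
print(shutil.which("aws"))   # resolved from the base image's PATH
subprocess.check_call(["python3", "--version"])  # likely the system Python 3.9
```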
Of course one could rewrite the application to make the call from the code logic with the currently running interpreter (see the sketch below), but I also expected there would be an option to tell the EMR Serverless application "in general" to use my virtualenv, not just when running my PySpark application. Is that possible, or is this behavior expected?
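For completeness, this is the kind of rewrite meant above (a hypothetical sketch; the helper name and bucket/key arguments are made up): performing the move with boto3, which is installed in the packed venv, instead of shelling out to the aws CLI:

```python
import boto3

# Hypothetical workaround: do the "s3 mv" in-process with boto3 (available
# in the packed virtualenv) instead of spawning the aws CLI.
s3 = boto3.resource("s3")

def s3_mv(src_bucket: str, src_key: str, dst_bucket: str, dst_key: str) -> None:
    # S3 has no native move; copy the object, then delete the source.
    s3.Object(dst_bucket, dst_key).copy_from(
        CopySource={"Bucket": src_bucket, "Key": src_key}
    )
    s3.Object(src_bucket, src_key).delete()
```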