Bug: Not able to run PySpark code with Delta on k8s cluster on GCP
### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

### Describe the problem
I am attempting to run PySpark code that reads from and writes to Delta tables on a GCS bucket.
I am running it with spark-submit on a Kubernetes (k8s) cluster on GCP.
Plain PySpark code that only reads and writes Parquet files works fine, but with Delta I get the following error. I have been trying this with multiple different version combinations for the last week.
File "/tmp/spark-f1f15d72-8494-4597-8d63-ccc0b33a52e4/pyspark_gcs_parquet_read.py", line 38, in <module>
extract()
File "/tmp/spark-f1f15d72-8494-4597-8d63-ccc0b33a52e4/pyspark_gcs_parquet_read.py", line 27, in extract
input_df.write.mode("overwrite").format("delta").save("/tmp/delta-table2")
File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1463, in save
File "/opt/bitnami/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
File "/opt/bitnami/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o226.save.
: java.lang.VerifyError: Bad type on operand stack
Exception Details:
Location:
org/apache/spark/sql/delta/stats/StatisticsCollection$SqlParser$$anon$1.visitMultipartIdentifierList(Lorg/apache/spark/sql/catalyst/parser/SqlBaseParser$MultipartIdentifierListContext;)Lscala/collection/Seq; @17: invokevirtual
24/10/03 06:41:32 INFO SparkContext: SparkContext is stopping with exitCode 0.
Reason:
Type 'org/apache/spark/sql/catalyst/parser/SqlBaseParser$MultipartIdentifierListContext' (current frame, stack[1]) is not assignable to 'org/antlr/v4/runtime/ParserRuleContext'
Current Frame:
bci: @17
flags: { }
locals: { 'org/apache/spark/sql/delta/stats/StatisticsCollection$SqlParser$$anon$1', 'org/apache/spark/sql/catalyst/parser/SqlBaseParser$MultipartIdentifierListContext' }
stack: { 'org/apache/spark/sql/catalyst/parser/ParserUtils$', 'org/apache/spark/sql/catalyst/parser/SqlBaseParser$MultipartIdentifierListContext', 'scala/Option', 'scala/Function0' }
Bytecode:
0000000: b200 232b b200 23b6 0027 2a2b ba00 3f00
0000010: 00b6 0043 c000 45b0
```
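To make the Parquet-versus-Delta contrast above concrete, here is a minimal sketch of the two writes. This is illustrative only: it assumes a session configured with the same Delta extension, catalog, and GCS connector settings as the script further down, and it uses made-up local paths rather than the real job's data.

```python
from pyspark.sql import SparkSession

# Illustrative contrast only; assumes the same Delta/GCS session configuration
# as the script below, and made-up local paths.
spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# A plain Parquet write like this works on the cluster.
df.write.mode("overwrite").parquet("/tmp/parquet-table")

# The equivalent Delta write is what fails with the VerifyError shown above.
df.write.mode("overwrite").format("delta").save("/tmp/delta-table")
```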
The spark-submit command and the PySpark script I am using are included below, after the environment details.
### Environment information
* Delta Lake version: 3.2.1/3.0.0
* Spark version: 3.5.3/3.5.0
* Scala version: 2.12
### Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?
- [ ] Yes. I can contribute a fix for this bug independently.
- [ ] Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
- [ ] No. I cannot contribute a bug fix at this time.
The spark-submit command:

```
spark-submit \
  --master k8s:// \
  --deploy-mode cluster \
  --name spark-ankur-4 \
  --packages com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.5,io.delta:delta-spark_2.12:3.0.0 \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=bitnami/spark:3.5.3 \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --conf spark.kubernetes.tolerations.0.key= \
  --conf spark.kubernetes.tolerations.0.operator=Equal \
  --conf spark.kubernetes.tolerations.0.value=perf \
  --conf spark.kubernetes.tolerations.0.effect=NoSchedule \
  --conf spark.hadoop.hive.metastore.uris=thrift:// \
  --conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  --conf spark.kubernetes.driver.podTemplateFile=pod_template.yaml \
  --conf spark.kubernetes.executor.podTemplateFile=pod_template.yaml \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.authenticate.executor.serviceAccountName=spark \
  --conf spark.kubernetes.namespace=default \
  --conf spark.hadoop.hadoop.security.authentication=simple \
  --conf spark.hadoop.hadoop.security.authorization=false \
  --conf spark.executorEnv.LD_PRELOAD=/opt/bitnami/common/lib/libnss_wrapper.so \
  --conf 'spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp' \
  gs://public-bucket-ankur/pyspark_gcs_parquet_read.py
```
The relevant part of `pyspark_gcs_parquet_read.py` (the source read path is not shown in the original report; the placeholder below stands in for it):

```python
from pyspark.sql import SparkSession


def extract():
    # Initialize Spark session with the Delta and GCS connector configuration
    spark = (
        SparkSession.builder
        .appName("spark-pi")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .config("spark.jars", "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-2.2.11.jar")
        .config("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
        .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
        .getOrCreate()
    )

    # Read the source Parquet data from the GCS bucket (placeholder path; the real one is not in this report)
    input_df = spark.read.parquet("gs://<bucket>/<source-parquet-path>")

    # This Delta write is line 27 in the traceback above and is what raises the VerifyError
    input_df.write.mode("overwrite").format("delta").save("/tmp/delta-table2")


extract()
```
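For completeness, the expected behaviour once the write succeeds is that the table can be read back through the Delta source; a minimal sketch, assuming a session configured the same way as in `extract()` above:

```python
from pyspark.sql import SparkSession

# Read-back check; getOrCreate() reuses the already-configured session in the same application.
spark = SparkSession.builder.getOrCreate()
spark.read.format("delta").load("/tmp/delta-table2").show()
```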