[BUG] #3749

Open
1 of 5 tasks
ranjanankur314 opened this issue Oct 3, 2024 · 0 comments
Labels
bug Something isn't working

Comments


ranjanankur314 commented Oct 3, 2024

Bug: Unable to run PySpark code with Delta on a k8s cluster on GCP

Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

Describe the problem

I am attempting to run PySpark code that reads from and writes to Delta on a GCS bucket. I am submitting the job to a k8s cluster with spark-submit.

The same job works fine when it only reads and writes plain Parquet files, but as soon as Delta is involved I get the error below. I have been trying different version combinations for the last week.

  File "/tmp/spark-f1f15d72-8494-4597-8d63-ccc0b33a52e4/pyspark_gcs_parquet_read.py", line 38, in <module>
    extract()
  File "/tmp/spark-f1f15d72-8494-4597-8d63-ccc0b33a52e4/pyspark_gcs_parquet_read.py", line 27, in extract
    input_df.write.mode("overwrite").format("delta").save("/tmp/delta-table2")
  File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1463, in save
  File "/opt/bitnami/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
  File "/opt/bitnami/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o226.save.
: java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    org/apache/spark/sql/delta/stats/StatisticsCollection$SqlParser$$anon$1.visitMultipartIdentifierList(Lorg/apache/spark/sql/catalyst/parser/SqlBaseParser$MultipartIdentifierListContext;)Lscala/collection/Seq; @17: invokevirtual
24/10/03 06:41:32 INFO SparkContext: SparkContext is stopping with exitCode 0.
  Reason:
    Type 'org/apache/spark/sql/catalyst/parser/SqlBaseParser$MultipartIdentifierListContext' (current frame, stack[1]) is not assignable to 'org/antlr/v4/runtime/ParserRuleContext'
  Current Frame:
    bci: @17
    flags: { }
    locals: { 'org/apache/spark/sql/delta/stats/StatisticsCollection$SqlParser$$anon$1', 'org/apache/spark/sql/catalyst/parser/SqlBaseParser$MultipartIdentifierListContext' }
    stack: { 'org/apache/spark/sql/catalyst/parser/ParserUtils$', 'org/apache/spark/sql/catalyst/parser/SqlBaseParser$MultipartIdentifierListContext', 'scala/Option', 'scala/Function0' }
  Bytecode:
    0000000: b200 232b b200 23b6 0027 2a2b ba00 3f00
    0000010: 00b6 0043 c000 45b0 
    ```
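If it helps triage: the VerifyError reads as if Delta's generated parser visitor (`StatisticsCollection$SqlParser`) was verified against a `SqlBaseParser`/`ParserRuleContext` coming from a different jar than the one actually loaded at runtime, i.e. a class-loading conflict rather than a code bug. A small diagnostic I can run in the same session to see which jars the conflicting classes come from (just a sketch; `Class.forName` and `getProtectionDomain` are plain JDK calls reached through py4j):

```
# Sketch: print the jar that supplies each class involved in the VerifyError.
# Assumes an active SparkSession named `spark` on the driver.
def jar_of(spark, class_name):
    cls = spark._jvm.java.lang.Class.forName(class_name)  # py4j gateway into the driver JVM
    src = cls.getProtectionDomain().getCodeSource()
    return src.getLocation().toString() if src else "<bootstrap/unknown>"

for name in (
    "org.antlr.v4.runtime.ParserRuleContext",
    "org.apache.spark.sql.catalyst.parser.SqlBaseParser$MultipartIdentifierListContext",
):
    print(name, "->", jar_of(spark, name))
```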
    
I am using the following spark-submit command:
  

```
spark-submit \
  --master k8s:// \
  --deploy-mode cluster \
  --name spark-ankur-4 \
  --packages com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.5,io.delta:delta-spark_2.12:3.0.0 \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=bitnami/spark:3.5.3 \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --conf spark.kubernetes.tolerations.0.key= \
  --conf spark.kubernetes.tolerations.0.operator=Equal \
  --conf spark.kubernetes.tolerations.0.value=perf \
  --conf spark.kubernetes.tolerations.0.effect=NoSchedule \
  --conf spark.hadoop.hive.metastore.uris=thrift:// \
  --conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  --conf spark.kubernetes.driver.podTemplateFile=pod_template.yaml \
  --conf spark.kubernetes.executor.podTemplateFile=pod_template.yaml \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.authenticate.executor.serviceAccountName=spark \
  --conf spark.kubernetes.namespace=default \
  --conf spark.hadoop.hadoop.security.authentication=simple \
  --conf spark.hadoop.hadoop.security.authorization=false \
  --conf spark.executorEnv.LD_PRELOAD=/opt/bitnami/common/lib/libnss_wrapper.so \
  --conf 'spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp' \
  gs://public-bucket-ankur/pyspark_gcs_parquet_read.py
```
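One thing I suspect but have not yet confirmed is the combination of `userClassPathFirst=true` with `--packages`: with those flags the downloaded jars, including Delta's transitive dependencies, are loaded ahead of the Spark distribution's own jars, which is a known way to end up with exactly this kind of VerifyError. The variant I plan to test next is the same command with just these two lines removed:

```
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
```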


PySpark code snippet

```
from pyspark.sql import SparkSession


def extract():
    # Initialize Spark session with the required Delta and GCS configurations
    spark = (
        SparkSession.builder
        .appName("spark-pi")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .config("spark.jars", "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-2.2.11.jar")
        .config("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
        .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
        .getOrCreate()
    )

    print(f"Spark Version {spark.sparkContext.version}")
    print(f"{spark.sparkContext.getConf().getAll()}")

    # Reading a plain Parquet file from GCS -- this part works
    gcs_file_path = "gs://public-bucket-ankur/sample.snappy.parquet"
    input_df = spark.read.parquet(gcs_file_path)
    input_df.printSchema()
    input_df.show(truncate=False)

    print("Normal Parquet File completed")

    # Writing the same DataFrame out as Delta -- this is where the VerifyError is raised
    input_df.write.mode("overwrite").format("delta").save("/tmp/delta-table2")

    df = spark.read.format("delta").load("/tmp/delta-table2")
    df.show()
```
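To isolate whether this is specific to the k8s/bitnami setup, a minimal local reproduction I would try (a sketch, assuming `pip install pyspark==3.5.3 delta-spark==3.2.1` on a local machine; `configure_spark_with_delta_pip` is the helper the Delta docs recommend for pulling in jars that match the installed pip package):

```
# Minimal local repro sketch: same Delta write path, no k8s, no GCS connector.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .appName("delta-local-repro")
    .master("local[2]")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
# configure_spark_with_delta_pip adds the delta-spark jars matching the
# installed pip package, avoiding hand-picked --packages coordinates.
spark = configure_spark_with_delta_pip(builder).getOrCreate()

spark.range(10).write.mode("overwrite").format("delta").save("/tmp/delta-table-local")
spark.read.format("delta").load("/tmp/delta-table-local").show()
```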
  
### Environment information

* Delta Lake version: 3.2.1/3.0.0
* Spark version: 3.5.3/3.5.0 
* Scala version: 2.12

### Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

- [ ] Yes. I can contribute a fix for this bug independently.
- [ ] Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
- [ ] No. I cannot contribute a bug fix at this time.