
[BUG] Cannot read/ write dataframe after loading file in Databricks 12.1 Runtime 3.3.1 Spark #724

Open
jmichaelsoliven opened this issue Mar 30, 2023 · 2 comments

jmichaelsoliven commented Mar 30, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When running the PySpark code below in Databricks 12.1 with the Spark 3.3.1 runtime:

```python
df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("dataAddress", "'" + param_excel_sheet + "'!" + param_excel_row_start)
    .option("header", False)
    .option("treatEmptyValuesAsNulls", True)
    .option("maxRowsInMemory", 20)
    .option("inferSchema", "false")
    .load(param_mountPoint + param_in_adls_raw_path + param_in_file_name)
)

df.show(truncate=False)
```

I received the following error:

```
An error occurred while calling o3150.showString.
: com.github.pjfanning.xlsx.exceptions.ParseException: Error reading XML stream
at com.github.pjfanning.xlsx.impl.StreamingRowIterator.getRow(StreamingRowIterator.java:126)
at com.github.pjfanning.xlsx.impl.StreamingRowIterator.hasNext(StreamingRowIterator.java:627)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
```

I also tried writing the DataFrame to a Delta table and received the error below:

```
An error occurred while calling o3063.save.
: com.github.pjfanning.xlsx.exceptions.ParseException: Error reading XML stream
at com.github.pjfanning.xlsx.impl.StreamingRowIterator.getRow(StreamingRowIterator.java:126)
at com.github.pjfanning.xlsx.impl.StreamingRowIterator.hasNext(StreamingRowIterator.java:627)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
```

The Excel file has 11 sheets; I'm trying to read data from only one sheet, which has 389,862 rows.

Expected Behavior

The resulting DataFrame should display and write to a Delta table correctly.

Steps To Reproduce

Set the following parameters to your desired values:

param_excel_sheet = the Excel sheet name, e.g. Sheet1
param_excel_row_start = the starting cell, e.g. A2
param_mountPoint + param_in_adls_raw_path + param_in_file_name = the folder path including the file name

Then run the code below.

```python
df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("dataAddress", "'" + param_excel_sheet + "'!" + param_excel_row_start)
    .option("header", False)
    .option("treatEmptyValuesAsNulls", True)
    .option("maxRowsInMemory", 20)
    .option("inferSchema", "false")
    .load(param_mountPoint + param_in_adls_raw_path + param_in_file_name)
)
```
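For concreteness, a minimal sketch of how the parameters above could be set. All values here are hypothetical examples, not the actual paths or sheet names from the report; it only shows how the `dataAddress` string and input path are assembled:

```python
# Hypothetical example values for the parameters used above
param_excel_sheet = "Sheet1"        # sheet to read (example value)
param_excel_row_start = "A2"        # first cell of the data range (example value)
param_mountPoint = "/mnt/datalake"  # Databricks mount point (example value)
param_in_adls_raw_path = "/raw/"    # folder inside the mount (example value)
param_in_file_name = "report.xlsx"  # Excel file name (example value)

# dataAddress as built in the snippet above, e.g. 'Sheet1'!A2
data_address = "'" + param_excel_sheet + "'!" + param_excel_row_start
print(data_address)

# Full input path passed to .load()
input_path = param_mountPoint + param_in_adls_raw_path + param_in_file_name
print(input_path)
```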

Environment

- Spark version: 3.3.1
- Spark-Excel version: 3.3.1_0.18.5
- OS: Windows
- Cluster environment: Standard_DS12_v2

Anything else?

No response

@github-actions

Please check these potential duplicates:

@jmichaelsoliven jmichaelsoliven changed the title [BUG] Cannot read/ write dataframe after loading file in Databricks Runtime 3.3.1 Spark [BUG] Cannot read/ write dataframe after loading file in Databricks 12.1 Runtime 3.3.1 Spark Mar 30, 2023