
[BUG] Cannot read/ write dataframe after loading file in Databricks 12.1 Runtime 3.3.1 Spark #724

Open
jmichaelsoliven opened this issue Mar 30, 2023 · 2 comments

jmichaelsoliven commented Mar 30, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When running the PySpark code below in Databricks 12.1 with the Spark 3.3.1 runtime:

```python
df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("dataAddress", "'" + param_excel_sheet + "'!" + param_excel_row_start)
    .option("header", False)
    .option("treatEmptyValuesAsNulls", True)
    .option("maxRowsInMemory", 20)
    .option("inferSchema", "false")
    .load(param_mountPoint + param_in_adls_raw_path + param_in_file_name)
)

df.show(truncate=False)
```

I received the following error:

```
An error occurred while calling o3150.showString.
: com.github.pjfanning.xlsx.exceptions.ParseException: Error reading XML stream
at com.github.pjfanning.xlsx.impl.StreamingRowIterator.getRow(StreamingRowIterator.java:126)
at com.github.pjfanning.xlsx.impl.StreamingRowIterator.hasNext(StreamingRowIterator.java:627)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
```

I also tried writing the DataFrame to a Delta table and received the error below:

```
An error occurred while calling o3063.save.
: com.github.pjfanning.xlsx.exceptions.ParseException: Error reading XML stream
at com.github.pjfanning.xlsx.impl.StreamingRowIterator.getRow(StreamingRowIterator.java:126)
at com.github.pjfanning.xlsx.impl.StreamingRowIterator.hasNext(StreamingRowIterator.java:627)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
```

The Excel file has 11 sheets; I'm trying to read data from only one sheet, which has 389,862 rows.

Expected Behavior

The resulting DataFrame should display and write to a Delta table correctly.

Steps To Reproduce

Set the following parameters to your desired values:

param_excel_sheet = the Excel sheet name, e.g. Sheet1
param_excel_row_start = the starting cell, e.g. A2
param_mountPoint + param_in_adls_raw_path + param_in_file_name = the folder path including the file name

Then run the code below.

```python
df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("dataAddress", "'" + param_excel_sheet + "'!" + param_excel_row_start)
    .option("header", False)
    .option("treatEmptyValuesAsNulls", True)
    .option("maxRowsInMemory", 20)
    .option("inferSchema", "false")
    .load(param_mountPoint + param_in_adls_raw_path + param_in_file_name)
)
```
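For concreteness, a minimal sketch of how the parameters above could be set. All values here are hypothetical examples, not the actual paths or sheet names from the report; it only shows how the `dataAddress` string and input path are assembled:

```python
# Hypothetical example values for the parameters used above
param_excel_sheet = "Sheet1"        # sheet to read (example value)
param_excel_row_start = "A2"        # first cell of the data range (example value)
param_mountPoint = "/mnt/datalake"  # Databricks mount point (example value)
param_in_adls_raw_path = "/raw/"    # folder inside the mount (example value)
param_in_file_name = "report.xlsx"  # Excel file name (example value)

# dataAddress as built in the snippet above, e.g. 'Sheet1'!A2
data_address = "'" + param_excel_sheet + "'!" + param_excel_row_start
print(data_address)

# Full input path passed to .load()
input_path = param_mountPoint + param_in_adls_raw_path + param_in_file_name
print(input_path)
```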

Environment

- Spark version: 3.3.1
- Spark-Excel version: 3.3.1_0.18.5
- OS: Windows
- Cluster environment: Standard_DS12_v2

Anything else?

No response

@github-actions

Please check these potential duplicates:

@jmichaelsoliven jmichaelsoliven changed the title [BUG] Cannot read/ write dataframe after loading file in Databricks Runtime 3.3.1 Spark [BUG] Cannot read/ write dataframe after loading file in Databricks 12.1 Runtime 3.3.1 Spark Mar 30, 2023