[BUG] No thrown exception if schema is provided, but there is no workbook/sheet (PDF with XLSX Extension) #844

dabruehl opened this issue Apr 3, 2024 · 0 comments


Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

If you store a PDF test.pdf under the extension .xlsx -> test.xlsx and read the file with a schema provided, you get an empty DataFrame, even though there is no workbook and no data to read.

Expected Behavior

If there is no workbook/sheet, the reader should throw an exception/error,

exactly as it does when no schema is provided:

options = {
    "header": "true",
    "encoding": "UTF-8",
    "dataAddress": "workbook:A1",
    "mode": "FAILFAST",
}
df = spark.read.format("excel").options(**options).load("PATH_TO_YOUR_FILE/FILENAME.xlsx")
df.show()

Py4JJavaError: An error occurred while calling o575.load.
: java.io.IOException: Can't open workbook - unsupported file type: PDF
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:228)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
at com.crealytics.spark.excel.v2.ExcelHelper.getWorkbook(ExcelHelper.scala:120)
at com.crealytics.spark.excel.v2.ExcelHelper.getSheetData(ExcelHelper.scala:137)
at com.crealytics.spark.excel.v2.ExcelHelper.parseSheetData(ExcelHelper.scala:160)
at com.crealytics.spark.excel.v2.ExcelTable.infer(ExcelTable.scala:77)
at com.crealytics.spark.excel.v2.ExcelTable.inferSchema(ExcelTable.scala:48)
at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:71)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:71)
at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:65)
at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:83)
at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:81)
at org.apache.spark.sql.connector.catalog.Table.columns(Table.java:68)
at org.apache.spark.sql.execution.datasources.v2.FileTable.columns(FileTable.scala:37)
at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema(FileDataSourceV2.scala:95)
at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema$(FileDataSourceV2.scala:92)
at com.crealytics.spark.excel.v2.ExcelDataSource.inferSchema(ExcelDataSource.scala:27)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:99)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:237)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:344)
at scala.Option.flatMap(Option.scala:271)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:342)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)
at py4j.ClientServerConnection.run(ClientServerConnection.java:119)
at java.lang.Thread.run(Thread.java:750)
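The stack trace hints at the mechanism: the workbook is only opened during schema inference (ExcelTable.inferSchema -> ExcelHelper.getWorkbook), and FileTable resolves its schema via Option.orElse of the user-specified schema over the inferred one. A toy Python model of that resolution (hypothetical names, not the actual Scala code) shows why a supplied schema hides the error:

```python
def infer_schema():
    # Stand-in for ExcelHelper.getWorkbook failing on a non-Excel file.
    raise IOError("Can't open workbook - unsupported file type: PDF")

def resolve_data_schema(user_schema, infer):
    # Mirrors the userSpecifiedSchema.orElse(inferred) pattern: when a
    # schema is supplied, infer() -- the only step that opens the
    # workbook -- never runs, so an invalid file goes unnoticed.
    return user_schema if user_schema is not None else infer()

# With an explicit schema the failing inference path is skipped entirely:
print(resolve_data_schema({"Code": "double", "Text": "string"}, infer_schema))
# Without one, resolve_data_schema(None, infer_schema) raises IOError.
```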

Steps To Reproduce

Store any PDF with the file extension .xlsx and then run the following code:

from pyspark.sql.types import *

schema = StructType([
    StructField("Code", DoubleType()),
    StructField("Text", StringType())
])
options = {
    "header": "true",
    "encoding": "UTF-8",
    "dataAddress": "workbook:A1",
    "mode": "FAILFAST",
}
df = spark.read.format("excel").schema(schema).options(**options).load("PATH_TO_YOUR_FILE/FILENAME.xlsx")
df.show()

+----+----+
|Code|Text|
+----+----+
+----+----+
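Until the reader validates the container itself, one possible workaround (a sketch, not part of spark-excel) is to check that the file really is a ZIP/OOXML container before reading it with an explicit schema; a PDF renamed to .xlsx is not a ZIP archive, so the check fails immediately:

```python
import os
import tempfile
import zipfile

def looks_like_xlsx(path):
    # .xlsx files are OOXML documents, i.e. ZIP containers; a PDF saved
    # under an .xlsx extension fails this cheap check right away.
    return zipfile.is_zipfile(path)

# Simulate a PDF stored as test.xlsx
with tempfile.NamedTemporaryFile(suffix=".xlsx", delete=False) as f:
    f.write(b"%PDF-1.7 fake pdf content")
    fake_path = f.name

print(looks_like_xlsx(fake_path))  # False -> refuse to read it as Excel
os.unlink(fake_path)
```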

Environment

- Spark version: 3.5.0
- Spark-Excel version: 0.20.2
- OS: Databricks Runtime 14.3 LTS
- Cluster environment

Anything else?

No response
