Is there an existing issue for this?
Current Behavior
If you store a PDF (test.pdf) under the XLSX extension (test.pdf -> test.xlsx) and read the file with a schema provided, you get an empty DataFrame, even though there is no workbook and no data to read.
Expected Behavior
If there is no workbook/sheet, the reader should throw an exception/error, exactly as it does when no schema is provided:
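For reference, a read without a schema, roughly like the call below (the path and option are illustrative, not taken from the report), forces schema inference and fails with the trace that follows:

spark.read.format("excel").option("header", "true").load("/tmp/test.xlsx")  # a PDF renamed to .xlsx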
Py4JJavaError: An error occurred while calling o575.load.
: java.io.IOException: Can't open workbook - unsupported file type: PDF
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:228)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
at com.crealytics.spark.excel.v2.ExcelHelper.getWorkbook(ExcelHelper.scala:120)
at com.crealytics.spark.excel.v2.ExcelHelper.getSheetData(ExcelHelper.scala:137)
at com.crealytics.spark.excel.v2.ExcelHelper.parseSheetData(ExcelHelper.scala:160)
at com.crealytics.spark.excel.v2.ExcelTable.infer(ExcelTable.scala:77)
at com.crealytics.spark.excel.v2.ExcelTable.inferSchema(ExcelTable.scala:48)
at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:71)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:71)
at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:65)
at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:83)
at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:81)
at org.apache.spark.sql.connector.catalog.Table.columns(Table.java:68)
at org.apache.spark.sql.execution.datasources.v2.FileTable.columns(FileTable.scala:37)
at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema(FileDataSourceV2.scala:95)
at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema$(FileDataSourceV2.scala:92)
at com.crealytics.spark.excel.v2.ExcelDataSource.inferSchema(ExcelDataSource.scala:27)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:99)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:237)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:344)
at scala.Option.flatMap(Option.scala:271)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:342)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)
at py4j.ClientServerConnection.run(ClientServerConnection.java:119)
at java.lang.Thread.run(Thread.java:750)
Steps To Reproduce
Store any PDF with the file extension .xlsx and then run the following code:
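The original code block was not preserved, so the following is a minimal PySpark sketch of what the read presumably looked like; the path, header option, and schema are illustrative assumptions rather than the reporter's exact values:

from pyspark.sql.types import StructType, StructField, StringType

# Assumes an active SparkSession `spark` with the spark-excel package on the classpath.
# Placeholder schema -- providing any explicit schema skips schema inference.
schema = StructType([StructField("col_a", StringType(), True)])

df = (
    spark.read.format("excel")        # spark-excel V2 data source
    .schema(schema)                   # schema provided, so inference (and its error) is skipped
    .option("header", "true")
    .load("/tmp/test.xlsx")           # a PDF renamed to .xlsx
)

df.show()  # shows an empty DataFrame instead of raising an error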
Environment
Anything else?
No response