
Unable to read 250MB file even with 100G driver memory and 100G executor memory #732

Open · 1 task done
kondisettyravi opened this issue Apr 18, 2023 · 5 comments

kondisettyravi commented Apr 18, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When we try to use the jar to read Excel data from S3, the spark-shell exits with an OOM error. Unfortunately, I cannot share the file here.

Below is the code being used:

val df = spark.read
  .format("excel")
  .option("header", "true")
  .option("dataAddress", s"'${sheetName}'!A1:XFD1000000")
  .option("maxByteArraySize", "2147483647")
  .load(s"s3://<bucketname>/path/file.xlsx")

As soon as I run the command, spark-shell exits with the OOM error shown below.

#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 32598"...
/usr/lib/spark/bin/spark-shell: line 47: 32598 Killed                  "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
[hadoop@ip-10-0-7-220 ~]$ 

Please suggest. Thanks.

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- Spark version: 3.1.2
- Spark-Excel version: 0.17.1
- OS:
- Cluster environment: EMR

Anything else?

No response

@nightscape (Owner)

Why not try

.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)

@kondisettyravi (Author)

I tried with this option and got

shadeio.poi.util.RecordFormatException: Tried to read data but the maximum length for this record type is 100,000,000.
If the file is not corrupt or large, please open an issue on bugzilla to request
increasing the maximum allowable size for this record type.
As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
  at shadeio.poi.util.IOUtils.throwRecordTruncationException(IOUtils.java:610)
  at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:249)
  at shadeio.poi.util.IOUtils.toByteArrayWithMaxLength(IOUtils.java:220)
  at shadeio.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:81)
  at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:98)
  at shadeio.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:132)
  at shadeio.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:312)
  at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:97)
  at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
  at shadeio.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
  at shadeio.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
  at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
  at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
  at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:110)
  at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:126)
  at com.crealytics.spark.v2.excel.ExcelTable.infer(ExcelTable.scala:69)
  at com.crealytics.spark.v2.excel.ExcelTable.inferSchema(ExcelTable.scala:42)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
  at scala.Option.orElse(Option.scala:447)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
  at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema(FileDataSourceV2.scala:93)
  at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema$(FileDataSourceV2.scala:91)
  at com.crealytics.spark.v2.excel.ExcelDataSource.inferSchema(ExcelDataSource.scala:22)
  at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:274)
  at scala.Option.map(Option.scala:230)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:248)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
  ... 47 elided
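
The POI message above points at IOUtils.setByteArrayMaxOverride() as a temporary workaround. In spark-shell that could look roughly like the sketch below, using the shaded POI package that appears in the stack trace; whether this is reachable from user code and whether it overlaps with the maxByteArraySize option already set is an assumption:

// hypothetical workaround taken from the POI message: raise the record-size cap before reading
shadeio.poi.util.IOUtils.setByteArrayMaxOverride(Int.MaxValue)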

@nightscape (Owner)

Ok, so it fails during schema inference. Are you able to specify a schema manually?
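
Specifying a schema up front (which skips inference) could look roughly like this; the column names and types below are placeholders, not taken from the reported file:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("col_a", StringType, nullable = true),
  StructField("col_b", DoubleType, nullable = true)
))

val df = spark.read
  .format("excel")
  .schema(schema)            // explicit schema, no inference
  .option("header", "true")
  .load("s3://<bucketname>/path/file.xlsx")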

@kondisettyravi (Author)

Oh, we have many different files, so specifying a schema isn't possible right now. We also tried without inferring the schema, and it failed with a StackOverflow exception.

@nightscape (Owner)

Did you try the combination of specifying a schema and using maxRowsInMemory?
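
I.e., roughly the placeholder schema from the earlier sketch together with the streaming reader:

val df = spark.read
  .format("excel")
  .schema(schema)                  // explicit schema, no inference
  .option("header", "true")
  .option("maxRowsInMemory", 20)   // streaming reader
  .load("s3://<bucketname>/path/file.xlsx")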
