
Unable to read 250MB file even with 100G driver memory and 100G executor memory #732

Open · 1 task done
kondisettyravi opened this issue Apr 18, 2023 · 5 comments

kondisettyravi commented Apr 18, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When we try to use the jar to read Excel data from S3, the spark-shell exits with an OOM error. Unfortunately, I cannot share the file here.

Below is the code being used:

val df = spark.read
  .format("excel")
  .option("header", "true")
  .option("dataAddress", s"'${sheetName}'!A1:XFD1000000")
  .option("maxByteArraySize", "2147483647")
  .load(s"s3://<bucketname>/path/file.xlsx")

As soon as I run the command, spark-shell exits with the OOM error shown below.

#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 32598"...
/usr/lib/spark/bin/spark-shell: line 47: 32598 Killed                  "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
[hadoop@ip-10-0-7-220 ~]$ 

Please suggest. Thanks.

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- Spark version: 3.1.2
- Spark-Excel version: 0.17.1
- OS:
- Cluster environment: EMR

Anything else?

No response

@nightscape (Owner)

Why not try

.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)

@kondisettyravi (Author)

I tried with this option and got

shadeio.poi.util.RecordFormatException: Tried to read data but the maximum length for this record type is 100,000,000.
If the file is not corrupt or large, please open an issue on bugzilla to request
increasing the maximum allowable size for this record type.
As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
  at shadeio.poi.util.IOUtils.throwRecordTruncationException(IOUtils.java:610)
  at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:249)
  at shadeio.poi.util.IOUtils.toByteArrayWithMaxLength(IOUtils.java:220)
  at shadeio.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:81)
  at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:98)
  at shadeio.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:132)
  at shadeio.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:312)
  at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:97)
  at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
  at shadeio.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
  at shadeio.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
  at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
  at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
  at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:110)
  at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:126)
  at com.crealytics.spark.v2.excel.ExcelTable.infer(ExcelTable.scala:69)
  at com.crealytics.spark.v2.excel.ExcelTable.inferSchema(ExcelTable.scala:42)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
  at scala.Option.orElse(Option.scala:447)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
  at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
  at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema(FileDataSourceV2.scala:93)
  at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema$(FileDataSourceV2.scala:91)
  at com.crealytics.spark.v2.excel.ExcelDataSource.inferSchema(ExcelDataSource.scala:22)
  at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:274)
  at scala.Option.map(Option.scala:230)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:248)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
  ... 47 elided
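
The POI message above points at IOUtils.setByteArrayMaxOverride() as a temporary workaround. In spark-shell that could look roughly like the sketch below, using the shaded POI package that appears in the stack trace; whether this is reachable from user code and whether it overlaps with the maxByteArraySize option already set is an assumption:

// hypothetical workaround taken from the POI message: raise the record-size cap before reading
shadeio.poi.util.IOUtils.setByteArrayMaxOverride(Int.MaxValue)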

@nightscape (Owner)

Ok, so it fails during schema inference. Are you able to specify a schema manually?
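
Specifying a schema up front (which skips inference) could look roughly like this; the column names and types below are placeholders, not taken from the reported file:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("col_a", StringType, nullable = true),
  StructField("col_b", DoubleType, nullable = true)
))

val df = spark.read
  .format("excel")
  .schema(schema)            // explicit schema, no inference
  .option("header", "true")
  .load("s3://<bucketname>/path/file.xlsx")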

@kondisettyravi (Author)

Oh, we have many different files, so specifying a schema isn't possible right now. We also tried without inferring the schema, and it failed with a StackOverflow exception.

@nightscape (Owner)

Did you try the combination of specifying a schema and using maxRowsInMemory?
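
I.e., roughly the placeholder schema from the earlier sketch together with the streaming reader:

val df = spark.read
  .format("excel")
  .schema(schema)                  // explicit schema, no inference
  .option("header", "true")
  .option("maxRowsInMemory", 20)   // streaming reader
  .load("s3://<bucketname>/path/file.xlsx")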
