
Cluster Stops Processing Files After Six Files Have Been Read #750

brian-custer opened this issue Jun 18, 2023 · 3 comments

@brian-custer

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I am using com.crealytics:spark-excel_2.12:3.3.1_0.18.7 to read and process over 70 Excel files in my data lake. The cluster stops working after reading approximately 6 workbooks: it appears to hang and does not read any more workbooks. I am running a cluster with a minimum of 2 and a maximum of 8 worker nodes. It looks as if the cluster runs out of memory or hits some other limit that prevents it from reading further workbooks.

Expected Behavior

Expected behavior is that I should be able to read all 70 workbooks in my data lake and append the data to an existing Delta table in my Unity Catalog.

Steps To Reproduce

Use the following code to loop through all the files in a data lake folder, read each spreadsheet, and append it to a Unity Catalog table:

import datetime as dt
from pyspark.sql.types import StringType

counter = 0  # assumed initialization; the original snippet omits it (first file overwrites, later files append)
for file in dbutils.fs.ls(pathToParcelData):
    print(counter)
    fileDate = dt.datetime.utcfromtimestamp(file.modificationTime / 1000).strftime('%Y-%m-%d')
    df = (spark.read.format("com.crealytics.spark.excel")
          .option("header", "true")
          .option("inferSchema", "true")
          .option("dataAddress", "'Transaction Detail'!A1")
          .load(file.path))
    # strip a trailing space from column names before the explicit renames
    for field in df.schema.fieldNames():
        df = df.withColumnRenamed(field, field.removesuffix(" "))
    df = (df.withColumnRenamed("Invoice #", "invoicenum").withColumnRenamed("Tracking #", "trackingnum")
          .withColumnRenamed("Control #", "controlnum").withColumnRenamed("Invoice Date", "invoicedate")
          .withColumnRenamed("Invoice Amount", "invoiceamount").withColumnRenamed("Ship Date", "shipdate")
          .withColumnRenamed("Delivery Date", "deliverydate").withColumnRenamed("Service Level", "servicelevel")
          .withColumnRenamed("Actual Weight", "actualweight").withColumnRenamed("Bill Weight", "billweight")
          .withColumnRenamed("Audited Amount", "totalcharge").withColumnRenamed("Zone", "zone")
          .withColumnRenamed("Manual", "glcode"))
    df = df.select("invoicenum", "invoicedate", "invoiceamount", "trackingnum", "shipdate", "deliverydate",
                   "servicelevel", "zone", "actualweight", "billweight", "glcode", "controlnum", "totalcharge")
    df = df.withColumn("zone", df["zone"].cast(StringType()))
    if counter > 0:
        df.write.mode("append").saveAsTable("sources.shipping.parcel")
    else:
        df.write.mode("overwrite").saveAsTable("sources.shipping.parcel")
    counter += 1

Environment

- Spark version: 3.4.0
- Spark-Excel version: 3.3.1
- OS: Databricks
- Cluster environment: Standard_DS3_v2 worker and executor nodes. 2-8 worker nodes

Anything else?

No response

@nightscape
Owner

Can you try with format "excel" instead? That is the new V2 implementation.
It should also support reading multiple files at once, so you could try pointing it directly at the directory.
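
Roughly, something like the sketch below (untested; the options, pathToParcelData, sheet name, and table name come from the snippet above, and the single directory-wide read assumes the V2 source accepts a folder path as suggested):

# Sketch: read every workbook in the folder in one pass with the V2 "excel" source,
# then write the combined result once instead of looping per file.
df = (spark.read.format("excel")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("dataAddress", "'Transaction Detail'!A1")
      .load(pathToParcelData))  # point at the directory instead of a single file

df.write.mode("overwrite").saveAsTable("sources.shipping.parcel")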

@brian-custer
Author

brian-custer commented Jun 19, 2023 via email

@nightscape
Owner

Can you post the new code you're using? By the way, the spark-excel version you mentioned in the issue description doesn't look right. It should probably be 0.18.???
