
Cluster Stops Processing Files After Six Files Have Been Read #750

brian-custer opened this issue Jun 18, 2023 · 3 comments

@brian-custer

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I am using com.crealytics:spark-excel_2.12:3.3.1_0.18.7 to read and process over 70 Excel files in my data lake. The cluster stops working after reading approximately 6 workbooks: it appears to hang and does not read any more workbooks. I am running a cluster with a minimum of 2 and a maximum of 8 worker nodes. It looks as if the cluster runs out of memory or hits some other limit that prevents it from reading further workbooks.

Expected Behavior

Expected behavior is that I should be able to read all 70 workbooks in my data lake and append the data to an existing Delta table in my Unity Catalog.

Steps To Reproduce

Use the following code to loop through all the files in a data lake folder, read each spreadsheet, and append it to a Unity Catalog table:

import datetime as dt
from pyspark.sql.types import StringType

counter = 0  # assumed initialization; the original snippet omits it (first file overwrites, later files append)
for file in dbutils.fs.ls(pathToParcelData):
    print(counter)
    fileDate = dt.datetime.utcfromtimestamp(file.modificationTime / 1000).strftime('%Y-%m-%d')
    df = (spark.read.format("com.crealytics.spark.excel")
          .option("header", "true")
          .option("inferSchema", "true")
          .option("dataAddress", "'Transaction Detail'!A1")
          .load(file.path))
    # strip a trailing space from column names before the explicit renames
    for field in df.schema.fieldNames():
        df = df.withColumnRenamed(field, field.removesuffix(" "))
    df = (df.withColumnRenamed("Invoice #", "invoicenum").withColumnRenamed("Tracking #", "trackingnum")
          .withColumnRenamed("Control #", "controlnum").withColumnRenamed("Invoice Date", "invoicedate")
          .withColumnRenamed("Invoice Amount", "invoiceamount").withColumnRenamed("Ship Date", "shipdate")
          .withColumnRenamed("Delivery Date", "deliverydate").withColumnRenamed("Service Level", "servicelevel")
          .withColumnRenamed("Actual Weight", "actualweight").withColumnRenamed("Bill Weight", "billweight")
          .withColumnRenamed("Audited Amount", "totalcharge").withColumnRenamed("Zone", "zone")
          .withColumnRenamed("Manual", "glcode"))
    df = df.select("invoicenum", "invoicedate", "invoiceamount", "trackingnum", "shipdate", "deliverydate",
                   "servicelevel", "zone", "actualweight", "billweight", "glcode", "controlnum", "totalcharge")
    df = df.withColumn("zone", df["zone"].cast(StringType()))
    if counter > 0:
        df.write.mode("append").saveAsTable("sources.shipping.parcel")
    else:
        df.write.mode("overwrite").saveAsTable("sources.shipping.parcel")
    counter += 1

Environment

- Spark version: 3.4.0
- Spark-Excel version: 3.3.1
- OS: Databricks
- Cluster environment: Standard_DS3_v2 worker and executor nodes. 2-8 worker nodes

Anything else?

No response

@nightscape
Owner

Can you try with format "excel" instead? That is the new V2 implementation.
It should also support reading multiple files at once, so you could try pointing it directly at the directory.
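
Roughly, something like the sketch below (untested; the options, pathToParcelData, sheet name, and table name come from the snippet above, and the single directory-wide read assumes the V2 source accepts a folder path as suggested):

# Sketch: read every workbook in the folder in one pass with the V2 "excel" source,
# then write the combined result once instead of looping per file.
df = (spark.read.format("excel")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("dataAddress", "'Transaction Detail'!A1")
      .load(pathToParcelData))  # point at the directory instead of a single file

df.write.mode("overwrite").saveAsTable("sources.shipping.parcel")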

@brian-custer
Author

brian-custer commented Jun 19, 2023 via email

@nightscape
Owner

Can you post the new code you're using? By the way, the spark-excel version you mentioned in the issue description doesn't look right. It should probably be 0.18.???
