Current Behavior
I am using com.crealytics:spark-excel_2.12:3.3.1_0.18.7 to process and read over 70 Excel files in my data lake. The cluster stops working after reading approximately six workbooks: it appears to hang and reads no further workbooks. The cluster autoscales from a minimum of 2 to a maximum of 8 worker nodes. It looks as if the cluster runs out of memory, or hits something else that prevents it from reading any more workbooks.
Expected Behavior
I should be able to read all 70 workbooks in my data lake and append the data to an existing Delta table in my Unity Catalog.
Can you try with format excel instead? This is the new v2 version.
It should also support reading multiple files at once, so you could try just pointing it directly to the directory.
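A minimal sketch of what that could look like (an editor's illustration, not code posted in this thread; it assumes the v2 "excel" source accepts a directory path and reuses the folder variable from the reproduction code below):

# Hypothetical: read every workbook in the folder in one pass with the v2 "excel" data source
df = (spark.read.format("excel")
      .option("header", "true")
      .load(pathToParcelData))  # point at the whole folder, not at a single file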
Thanks for the response. Is it capable of reading up to 70 Excel files at a time? I changed my code as you suggested, but the driver is hung. I suspect it is a memory issue, but I'm not sure.
Can you post the new code you're using? Btw, the spark-excel version you mentioned in the first issue doesn't look right. It should probably be 0.18.???
Steps To Reproduce
Use the following code to loop through all the files in a data lake folder, read each spreadsheet, and append it to a Unity Catalog table:
import datetime as dt
from pyspark.sql.types import StringType

counter = 0
# pathToParcelData is the data lake folder containing the Excel workbooks (defined elsewhere)
for file in dbutils.fs.ls(pathToParcelData):
    print(counter)
    fileDate = dt.datetime.utcfromtimestamp(file.modificationTime / 1000).strftime('%Y-%m-%d')
    df = (spark.read.format("com.crealytics.spark.excel")
          .option("header", "true")
          .option("inferSchema", "true")
          .option("dataAddress", "'Transaction Detail'!A1")
          .load(file.path))
    # Drop a single trailing space from any column name that has one
    for field in df.schema.fieldNames():
        df = df.withColumnRenamed(field, field.removesuffix(" "))
    df = (df.withColumnRenamed("Invoice #", "invoicenum").withColumnRenamed("Tracking #", "trackingnum")
            .withColumnRenamed("Control #", "controlnum").withColumnRenamed("Invoice Date", "invoicedate")
            .withColumnRenamed("Invoice Amount", "invoiceamount").withColumnRenamed("Ship Date", "shipdate")
            .withColumnRenamed("Delivery Date", "deliverydate").withColumnRenamed("Service Level", "servicelevel")
            .withColumnRenamed("Actual Weight", "actualweight").withColumnRenamed("Bill Weight", "billweight")
            .withColumnRenamed("Audited Amount", "totalcharge").withColumnRenamed("Zone", "zone")
            .withColumnRenamed("Manual", "glcode"))
    df = df.select("invoicenum", "invoicedate", "invoiceamount", "trackingnum", "shipdate",
                   "deliverydate", "servicelevel", "zone", "actualweight", "billweight",
                   "glcode", "controlnum", "totalcharge")
    df = df.withColumn("zone", df["zone"].cast(StringType()))
    # The first workbook overwrites the target table; every later workbook appends
    if counter > 0:
        df.write.mode("append").saveAsTable("sources.shipping.parcel")
    else:
        df.write.mode("overwrite").saveAsTable("sources.shipping.parcel")
    counter += 1
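For comparison, here is an editor's sketch of what the maintainer's suggestion above would look like applied to this loop: one directory read and one write instead of a read and write per file. It assumes the v2 "excel" source accepts the same options as the v1 reader; it is not code posted in this thread.

# Hypothetical single-pass version of the loop above
df = (spark.read.format("excel")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("dataAddress", "'Transaction Detail'!A1")  # assumed to carry over from the v1 reader
      .load(pathToParcelData))                            # the folder, so all workbooks are read in one pass
# ... apply the same column renames, select, and cast as in the loop above ...
df.write.mode("overwrite").saveAsTable("sources.shipping.parcel")  # single write instead of one per file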
Environment
Anything else?
No response