[BUG] Excel File with Macros Detected as "Potentially" Malicious. Unable to read Excel as a result. #832

Open
1 task done
nova-jj opened this issue Feb 22, 2024 · 1 comment

Comments

nova-jj commented Feb 22, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

In an Azure Databricks environment, we use this library to read Excel files stored in a Storage Account. The files are accessed using either the ABFSS or DBFS protocol, which suggests this is a file issue and not a protocol issue.

Attempting to read the file with newer versions of the spark-excel library results in the following error, which appears to be caused by macros in the workbook: crealytics excel workbook java.io.IOException: The file appears to be potentially malicious. "This file embeds more internal file entries than expected."

We have reverted to a previous version that does not produce this error, and we are looking for a way to bypass the macro detection. Our workbook does contain macros, but they are a required part of the workbook.

Expected Behavior

Reading the file into a DataFrame should not raise this error. Alternatively, there should be an option to override the macro detection so that a file flagged as "potentially" malicious can be force-read.

Steps To Reproduce

The following python code produces our error:

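# Path to the macro-enabled workbook on DBFS
file_path = "dbfs:/FileStore/our_excel_file.xlsm"
# Read the workbook with spark-excel, treating the first row as the header
df = spark.read.format("com.crealytics.spark.excel").option("header", "true").load(file_path)
# Convert the Spark DataFrame to a pandas DataFrame
df = df.toPandas()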

Environment

- Spark version: 3.4.1 via Databricks Runtime 13.3
- Spark-Excel version: 3.5.0_0.20.3
- OS: Windows but remote-run from Databricks clusters
- Cluster environment: Multiple cluster configurations representing dev/stg/prd using the same Databricks Runtime and Spark Versions.

Anything else?

We have reverted our install to the earlier Maven coordinates com.crealytics:spark-excel_2.12:0.13.7, which do not produce this issue.

@nightscape
Owner

spark-excel doesn't do anything in that regard.
It must be an upstream library that performs this check. Can you try to find out if this comes from POI?
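
One way to narrow this down, independent of Spark, is to look at the workbook itself. An .xlsm file is a zip archive, and the error text complains about the number of "internal file entries", so counting those entries may show whether the file trips a count-based limit. The sketch below uses only Python's standard library. The function name `summarize_workbook` and the threshold of 1000 are illustrative assumptions, not values confirmed by spark-excel or by any upstream library in this thread.

```python
import zipfile
from collections import Counter

# Assumption: 1000 is only an illustrative threshold. Check the actual limit
# enforced by whichever upstream library raises the error.
ASSUMED_MAX_ENTRIES = 1000


def summarize_workbook(path, limit=ASSUMED_MAX_ENTRIES):
    """Count the zip entries in an .xlsx/.xlsm file and group them by top-level folder."""
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
    # Group entries by their first path component, e.g. "xl" or "docProps"
    by_folder = Counter(name.split("/", 1)[0] for name in names)
    return {
        "total_entries": len(names),
        "exceeds_limit": len(names) > limit,
        "by_folder": dict(by_folder),
        # VBA macros are normally stored in this single part
        "has_vba_project": "xl/vbaProject.bin" in names,
    }


# Hypothetical usage with a local copy of the workbook:
# print(summarize_workbook("our_excel_file.xlsm"))
```

If `total_entries` is large while macros account for only a single `xl/vbaProject.bin` part, the entry count rather than the macros may be what triggers the check. The function needs a local filesystem path. On Databricks, a `dbfs:/FileStore/...` file is usually also reachable under `/dbfs/FileStore/...`, but treat that as an assumption about the cluster setup.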
