
[BUG] Filters on partition columns don't work | Spark 3.3.1 | com.crealytics:spark-excel_2.12:3.3.1_0.18.5 #727

gaya3dk2490 opened this issue Apr 3, 2023 · 5 comments


gaya3dk2490 commented Apr 3, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Filtering on partition columns of a dataframe produced by the Excel reader behaves incorrectly.

I have some Excel files, partitioned in an Azure Storage account, and I am trying to run a simple read from Databricks (Runtime 12.1, Spark 3.3.1).

Example path on the storage account: /landing/excel/version=x/day=x, where version and day become partition columns on read.

I have version=1, version=2, and day=1 as sample partitions.
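
For concreteness, the on-storage layout looks something like this (illustrative; file names are placeholders):

/landing/excel/version=1/day=1/<file>.xlsx
/landing/excel/version=2/day=1/<file>.xlsx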

The read below loads 2 rows into the dataframe df:

val df = spark.read
  .format("excel")
  .option("dataAddress", dataAddress)
  .option("header", "true")
  .option("inferSchema", true)
  .load(myExcelPath)

Schema inferred:


root
 |-- int_col: integer (nullable = true)
 |-- string_col: string (nullable = true)
 |-- version: integer (nullable = true)
 |-- day: integer (nullable = true)

Now, if you filter the resulting df on version=1, it always returns all results:

df.filter(col("version") === 1) returns 2 rows (version=1 and version=2)

I also tried the following variants:

df.filter(col("version") === lit(1)) and df.filter($"version" === 1)

Filtering on a version value that doesn't exist also returns all rows:

df.filter(col("version") === 100) returns 2 rows

Note: Filters on ordinary (non-partition) columns work fine, so something seems to be wrong with predicate push-down.
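
As a diagnostic (an added sketch, not part of the original report), one way to confirm the push-down suspicion is to inspect the physical plan and check whether the filter reaches the scan as a partition filter:

df.filter(col("version") === 1).explain(true)
// For file-based sources, the scan node lists the pushed partition filters,
// e.g. PartitionFilters: [isnotnull(version#2), (version#2 = 1)].
// An empty list means no partition pruning is taking place.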

Expected Behavior

Filtering a dataframe on a partition column should return only the rows from that partition.

Steps To Reproduce

  • Read a simple Excel file stored in a partitioned path on any storage (local or cloud)
  • Filter the dataframe on a partition column

Environment

- Spark version: 3.3.1
- Spark-Excel version: 0.18.5
- OS: Mac / Databricks
- Cluster environment: Databricks Runtime 12.1

Anything else?

No response


nightscape commented Apr 4, 2023

Not sure if this is a typo, but afaik you need to use === instead of == when comparing columns. Also the value might need to be wrapped in lit.
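
For context, a short Scala illustration of the difference (an added sketch, not from the original thread; df is the dataframe above):

import org.apache.spark.sql.functions.{col, lit}

df.filter(col("version") === 1)       // === builds a Column equality expression
df.filter(col("version") === lit(1))  // equivalent; the literal wrapped explicitly
// By contrast, col("version") == 1 is plain Scala object equality (a Boolean,
// always false here) and does not even compile as a filter condition,
// since filter expects a Column or a String expression.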

gaya3dk2490 (Author) commented:

@nightscape apologies, that was a typo :) I've edited the original question.

gaya3dk2490 (Author) commented:

Update:

I downgraded the library to com.crealytics:spark-excel_2.12:3.2.2_0.18.5 and that version has no problems with filters on partition columns!

This is definitely a bug in the latest version for Spark 3.3.1.
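
For anyone reproducing the downgrade, the working version can be pinned via sbt as below (a sketch; on Databricks the same Maven coordinate is attached as a cluster library instead):

// Resolves to com.crealytics:spark-excel_2.12:3.2.2_0.18.5 under Scala 2.12
libraryDependencies += "com.crealytics" %% "spark-excel" % "3.2.2_0.18.5"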

nightscape (Owner) commented:

Ok, interesting!
Might be a change in the API that we'd need to account for.
@gaya3dk2490 if you don't mind, you could skim the Spark changelogs to see if there's something in there regarding predicate push-down.
Maybe you can also find a corresponding change in the CSV reader (from which a lot of the code was taken).
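
One way to run that CSV comparison (an added sketch with an assumed mirrored layout, not from the thread):

// Read an equivalent partitioned layout with the built-in CSV reader and
// check whether the partition filter prunes rows there ("/landing/csv" is hypothetical).
val csvDf = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/landing/csv")
csvDf.filter(col("version") === 1).show()  // expected: only version=1 rows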

intelligencecompany commented:
As a temporary workaround, I save the dataframe as Parquet and reload it whenever I want to apply a filter:

// .NET for Apache Spark; "xxx" is a placeholder path
df.Write()
  .Mode("overwrite")
  .Parquet("xxx");

df.Unpersist();

df = spark.Read()
  .Parquet("xxx");

df = df.Filter("condition");
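
The same workaround in Scala, for consistency with the rest of the thread (a sketch; "/tmp/xxx" is a placeholder path):

df.write.mode("overwrite").parquet("/tmp/xxx")
val reloaded = spark.read.parquet("/tmp/xxx")
reloaded.filter(col("version") === 1).show()  // filters behave correctly on the Parquet copy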
