Data set mismatch: empty queries results and reproducibility issues #11

Bouncner · 2021-06-24T09:54:39Z

We run the Join Order Benchmark (JOB) as one of our "default" benchmarks in Hyrise. We recently found that several queries yield empty results and wondered whether this is a problem with Hyrise, a problem with the data set, or an intentional part of JOB.

One example is query 32a, which filters on k.keyword = '10,000-mile-club'. In our data set (which we generated from the "frozen data set"), there is no such keyword; it contains 10000-mile-club instead (see also here). Interestingly, the keyword with the comma does exist in the CWI data set.

As a query's performance can thus differ vastly between the two data sets due to the empty results, we would consider it a bug in either the CWI data set or the query's SQL. Should this be fixed by adjusting the dataset generation from frozen data or by adjusting the query's SQL by removing the , from 10,000...?
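A quick way to see which spelling a given copy of the data contains is to count exact matches for both variants. The following is only a minimal sketch: it uses an in-memory SQLite table as a stand-in for the real `keyword` table, and the helper names `make_demo_db` and `count_keyword_variants` are made up for illustration.

```python
import sqlite3


def make_demo_db(spelling):
    # Hypothetical stand-in for the real keyword table: one row with the
    # given spelling. A real check would connect to the loaded JOB database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE keyword (id INTEGER PRIMARY KEY, keyword TEXT)")
    conn.execute("INSERT INTO keyword (keyword) VALUES (?)", (spelling,))
    return conn


def count_keyword_variants(conn, variants):
    # Map each candidate spelling to the number of exactly matching rows.
    counts = {}
    for variant in variants:
        (n,) = conn.execute(
            "SELECT COUNT(*) FROM keyword WHERE keyword = ?", (variant,)
        ).fetchone()
        counts[variant] = n
    return counts


variants = ["10,000-mile-club", "10000-mile-club"]
frozen_like = make_demo_db("10000-mile-club")  # spelling without the comma
print(count_keyword_variants(frozen_like, variants))
# {'10,000-mile-club': 0, '10000-mile-club': 1}
```

A count of zero for the spelling used in 32a means the query cannot return any matches on that copy of the data, regardless of the engine.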

For other queries, we found that inner joins following several filters lead to empty result sets (e.g., query 5a). Again, we are curious whether this is intentional. As far as we can tell, this is the case for both data sets (CWI and frozen).

Pinging @Bensk1 here as well as he did the data generation test.


gregrahn · 2021-07-01T16:38:06Z

@Bouncner - Happy to incorporate any changes to make this better/easier.
Saw this tweet - https://twitter.com/hyrise_db/status/1410207307024187405?s=20

HennyNile · 2023-03-14T07:44:12Z

Hey, I also ran into this problem in Postgres!

Firstly, thanks for your great work!

I ran JOB on PostgreSQL 15 and found that queries 2c.sql, 5a.sql, 5b.sql, 10b.sql, and 32a.sql yield empty results. Such queries finish quickly even though they have complex join relations, which makes them useless for capturing the relation between runtime and query complexity.

I am also curious whether this is intentional. If not, would you consider replacing the queries that return empty results?
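Flagging such queries automatically could look roughly like the sketch below. It assumes the queries wrap their outputs in MIN() aggregates (as the JOB queries appear to do), in which case an "empty" result shows up as a single row of NULLs rather than zero rows. The table, the query names `hit` and `miss`, and both helper functions are hypothetical; SQLite is used only so the example runs on its own.

```python
import sqlite3


def is_empty_result(rows):
    # True if there are no rows at all, or if every value in every row is
    # NULL (what an ungrouped aggregate such as MIN() returns on no input).
    return all(value is None for row in rows for value in row)


def find_empty_queries(conn, queries):
    # queries maps a query name to its SQL text; return the names whose
    # results count as empty, sorted for stable output.
    return sorted(
        name
        for name, sql in queries.items()
        if is_empty_result(conn.execute(sql).fetchall())
    )


# Tiny stand-in database, not the real JOB schema or data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keyword (id INTEGER PRIMARY KEY, keyword TEXT)")
conn.execute("INSERT INTO keyword (keyword) VALUES ('10000-mile-club')")

queries = {
    "hit": "SELECT MIN(k.keyword) FROM keyword AS k "
           "WHERE k.keyword = '10000-mile-club'",
    "miss": "SELECT MIN(k.keyword) FROM keyword AS k "
            "WHERE k.keyword = '10,000-mile-club'",
}
print(find_empty_queries(conn, queries))  # ['miss']
```

Checking only the row count would miss these cases, because an ungrouped aggregate still returns one row when its input is empty.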

Bouncner · 2023-03-14T09:22:17Z

@HennyNile : are you using the static data set that is listed in the initial paper?

Also, take a look at this pull request. We contacted Viktor Leis some time ago and he confirmed that there are empty results and they are on purpose. But I think 2c should yield a non-empty result.

HennyNile · 2023-03-14T12:20:50Z

@Bouncner For convenience, I just used a dumped dataset from the Harvard website, which has even fewer rows than the initial dataset you mentioned. Maybe I should switch to the initial dataset.

Your PR notes that the CSV files of the paper data set do not follow RFC 4180 (i.e., using "" to escape a " in a string, which Hyrise assumes to be the case for CSV files). This is not a problem in Postgres.
For example, in your PR, you mentioned

Exemplary lines before adaption:

movie_info.csv: 127472,2130098,13,"FACT: Dunn uploads a file from an Apple Powerbook in \"C:\\\", which would be appropriate for a DOS/Windows system.",
person_info.csv: 2010004,1163574,25,"CD: \"All-Time Hits, Vol. 2\"\\",

Reformatted:

movie_info.csv: 127472,2130098,13,"FACT: Dunn uploads a file from an Apple Powerbook in ""C:\"", which would be appropriate for a DOS/Windows system.",
person_info.csv: 2010004,1163574,25,"CD: ""All-Time Hits, Vol. 2""\",

In PG, it is

table movie_info: 127472, 2130098, 13, FACT: Dunn uploads a file from an Apple Powerbook in "C:\", which would be appropriate for a DOS/Windows system.
table person_info: 2010004, 1163574, 25, CD: "All-Time Hits, Vol. 2"\

So, I think the empty results in Postgres are not caused by this problem.
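The difference between the two escaping styles quoted above can be reproduced with Python's standard `csv` module. This is only a sketch of one possible conversion, not the script used in the PR: the function name `backslash_to_rfc4180` is made up, and it assumes that the backslash acts as the escape character inside quoted fields.

```python
import csv
import io


def backslash_to_rfc4180(text):
    # Read CSV in which \" and \\ are backslash escapes inside quoted fields.
    reader = csv.reader(
        io.StringIO(text, newline=""),
        quotechar='"',
        escapechar="\\",
        doublequote=False,
    )
    # Write it back RFC 4180 style: a quote inside a field becomes "".
    out = io.StringIO(newline="")
    writer = csv.writer(out, quotechar='"', doublequote=True, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


sample = (
    r'2010004,1163574,25,"CD: \"All-Time Hits, Vol. 2\"\\",' + "\n"
)
print(backslash_to_rfc4180(sample), end="")
# 2010004,1163574,25,"CD: ""All-Time Hits, Vol. 2""\",
```

One caveat: the writer quotes only the fields that need it, so the distinction between a quoted empty string and an unquoted empty field (which some loaders treat as NULL) is not preserved by this sketch.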

For 2c, there are only 2 filter predicates, and I found that each of them fetches only one row. This may lead to an empty result.

There are 5 empty results in my experiments. In my opinion, they are either intentional or caused by the wrong dumped dataset. Once I have tested the initial dataset, I will share the results with you.

Bouncner · 2023-03-15T13:20:39Z

Viktor Leis wrote to us: "If I remember correctly, there are around 10 queries with empty results sets."

He strongly suggested using only the data set used in the paper. At least for us, that removed a number of empty results.

HennyNile · 2023-03-17T05:29:41Z

I see. I will use the initial dataset from the paper in later work. Many thanks!

Bouncner · 2023-03-21T15:46:07Z

Quoting gregrahn's comment of 2021-07-01:
@Bouncner - Happy to incorporate any changes to make this better/easier.
Saw this tweet - https://twitter.com/hyrise_db/status/1410207307024187405?s=20

Sorry, that took a while. :(

Bouncner mentioned this issue Jun 24, 2021

Malformed CSVs #10

Open

Bouncner changed the title ~~Empty queries results and reproducibility~~ Data set mismatch: empty queries results and reproducibility issues Jun 24, 2021

Bouncner mentioned this issue Jun 25, 2021

Change data set for Join Order Benchmark hyrise/hyrise#2376

Merged

Bouncner mentioned this issue Mar 21, 2023

Update readme to note about potential issues with "frozen data set" #16

Open

amascolo mentioned this issue Oct 8, 2024

Added new JOB queries edin-dal/sdql#12

Merged
