Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data set mismatch: empty queries results and reproducibility issues #11

Open
Bouncner opened this issue Jun 24, 2021 · 7 comments
Open

Comments

@Bouncner
Copy link

We execute the join order benchmark as one of our "default" benchmarks in Hyrise. We recently found that several queries yield empty results and wondered if this might be a problem with Hyrise, the data set, or if it is part of the JOB by intention.

One example is query 32a, which selects k.keyword ='10,000-mile-club'. For our data set (we generated the data using the "frozen data set"), there is no such keyword but rather 10000-mile-club (see also here). Interestingly, this keyword with the comma exists in the CWI data set.

As a query's performance can thus differ vastly between the two data sets due to the empty results, we would consider it a bug in either the CWI data set or the query's SQL. Should this be fixed by adjusting the dataset generation from frozen data or by adjusting the query's SQL by removing the , from 10,000...?

For other queries, we found that inner joins following several filters lead to empty result sets (e.g., query 5a). We are again curious if this is done by intention? As far as we can tell, this is the case for both data sets (CWI and frozen).

Pinging @Bensk1 here as well as he did the data generation test.

@Bouncner Bouncner changed the title Empty queries results and reproducibility Data set mismatch: empty queries results and reproducibility issues Jun 24, 2021
@gregrahn
Copy link
Owner

gregrahn commented Jul 1, 2021

@Bouncner - Happy to incorporate any changes to make this better/easier.
Saw this tweet - https://twitter.com/hyrise_db/status/1410207307024187405?s=20

@HennyNile
Copy link

HennyNile commented Mar 14, 2023

Hey, I also met this problem in Postgres!

Firstly, thanks for your great work!

I run JOB in PostgreSQL 15 and found queries 2c.sql, 5a.sql, 5b.sql, 10b.sql and 32a.sql will yield empty results. Such queries will be processed quickly even if they have complex join relations, making them useless to catch the relation between the runtime and the query's complexity.

I am also curious about if this is done by intention. If not, will you try to replace such queries with empty results?

@Bouncner
Copy link
Author

Bouncner commented Mar 14, 2023

@HennyNile : are you using the static data set that is listed in the initial paper?

Also, take a look at this pull request. We contacted Viktor Leis some time ago and he confirmed that there are empty results and they are on purpose. But I think 2c should yield a non-empty result.

@HennyNile
Copy link

@Bouncner For convenience, I just used a dumped dataset in Harvard website which even has fewer rows than the initial dataset you mentioned. Maybe I should change to use the initial dataset.

There is no problem As the CSV files of the paper data set do not follow RFC 4180 (i.e., using "" to escape an " in a string, which Hyrise assumes to be the case for CSV files) in Postgres.
For example, in your pr, you mentioned

Exemplary lines before adaption:

movie_info.csv: 127472,2130098,13,"FACT: Dunn uploads a file from an Apple Powerbook in \"C:\\\", which would be appropriate for a DOS/Windows system.",
person_info.csv: 2010004,1163574,25,"CD: \"All-Time Hits, Vol. 2\"\\",

Reformatted:

movie_info.csv: 127472,2130098,13,"FACT: Dunn uploads a file from an Apple Powerbook in ""C:\"", which would be appropriate for a DOS/Windows system.",
person_info.csv: 2010004,1163574,25,"CD: ""All-Time Hits, Vol. 2""\",

In PG, it is

table movie_info: 127472, 2130098, 13, FACT: Dunn uploads a file from an Apple Powerbook in "C:\", which would be appropriate for a DOS/Windows system.
table person_info: 2010004, 1163574, 25, CD: "All-Time Hits, Vol. 2"\

So, I think the empty results in Postgres are not caused by this problem.

For 2c, there are only 2 filter predicates and I found both predicates only fetch one row. This may lead to an empty result.

There are 5 empty results in my experiments. In my opinion, the empty results of my experiments are intentional or caused by the wrong dumped dataset. When I test the initial dataset, I will share the result with you.

@Bouncner
Copy link
Author

Viktor Leis wrote to us "If I remember correctly, there are around 10 queries with empty results sets.".

He strongly suggested to use only the data set used in the paper. At least for us, that removed a number of empty results.

@HennyNile
Copy link

I see. I will use the initial dataset used in the paper in the later work. Many thanks!

@Bouncner
Copy link
Author

@Bouncner - Happy to incorporate any changes to make this better/easier.
Saw this tweet - https://twitter.com/hyrise_db/status/1410207307024187405?s=20

Sorry, that took a while. :(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants