Multiple joins in one query #984

Merged
merged 33 commits into from
Dec 6, 2022

Collaborator

@mgirlich mgirlich commented Aug 25, 2022

Closes #865.

Deciding when a series of left_join()/inner_join() calls can be collapsed into a single query requires some care, but the resulting SQL is much nicer.

Remarks

  • For simplicity, I just used vec_as_names() to produce unique names. We might want something else? I just remembered that we considered using abbreviate(). I'll take a look at that in another PR.
  • The lazy_join_query structure is basically no longer needed in our codebase. It would make sense to remove it, but as it is exported I wanted to check whether it is in use somewhere.
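
To illustrate the naming remark: vctrs::vec_as_names() with repair = "unique" disambiguates duplicated names by appending a positional suffix, which is the style of alias (`df...2`, `df...3`) seen in the generated SQL further down. This is a minimal sketch, assuming the vctrs package is installed; the input vector is made up for illustration, and how dbplyr calls the function internally is not shown here.

```r
library(vctrs)

# Three references to the same table name, as in a chain of self-joins.
# (Illustrative input only; dbplyr's internal call may differ.)
table_names <- c("df", "df", "df")

# repair = "unique" appends `...<position>` to every duplicated name.
aliases <- vec_as_names(table_names, repair = "unique", quiet = TRUE)
print(aliases)
```

Note that in the SQL shown in this PR the first table keeps its plain name `df`, so the aliases there are not identical to the raw output of this sketch.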

Names for Subqueries

lf <- lazy_frame(x = 1, a = 1)

left_join(
  lf %>% filter(x == 1),
  lf %>% filter(a > 1),
  by = "x"
)

Produces

SELECT `LHS`.`x` AS `x`, `LHS`.`a` AS `a.x`, `RHS`.`a` AS `a.y`
FROM (
  SELECT *
  FROM `df`
  WHERE (`x` = 1.0)
) `LHS`
LEFT JOIN (
  SELECT *
  FROM `df`
  WHERE (`a` > 1.0)
) `RHS`
  ON (`LHS`.`x` = `RHS`.`x`)

Self-Join

lf <- lazy_frame(x = 1, a = 1)

left_join(
  lf,
  lf,
  by = "x"
)

Produces

SELECT `df_LHS`.`x` AS `x`, `df_LHS`.`a` AS `a.x`, `df_RHS`.`a` AS `a.y`
FROM `df` AS `df_LHS`
LEFT JOIN `df` AS `df_RHS`
  ON (`df_LHS`.`x` = `df_RHS`.`x`)

User Provided Alias

left_join(
  lf,
  lf,
  by = "x",
  x_as = "df"
) %>% 
  left_join(lf, by = "x")

The alias provided by the user always wins. So this produces

SELECT
  `df`.`x` AS `x`,
  `df`.`a` AS `a.x`,
  `df...2`.`a` AS `a.y`,
  `df...3`.`a` AS `a`
FROM `df`
LEFT JOIN `df` AS `df...2`
  ON (`df`.`x` = `df...2`.`x`)
LEFT JOIN `df` AS `df...3`
  ON (`df`.`x` = `df...3`.`x`)

Note that an alias for x can also be provided later on. So, this produces the same query:

left_join(
  lf,
  lf,
  by = "x"
) %>% 
  left_join(lf, by = "x", x_as = "df")

Conflicting Alias

If the aliases conflict, a subquery is used:

left_join(
  lf,
  lf,
  x_as = "df",
  by = "x"
) %>% 
  left_join(lf, by = "x", x_as = "df2")

Produces

SELECT `df2`.*, `a`
FROM (
  SELECT `df`.`x` AS `x`, `df`.`a` AS `a.x`, `df...2`.`a` AS `a.y`
  FROM `df`
  LEFT JOIN `df` AS `df...2`
    ON (`df`.`x` = `df...2`.`x`)
) `df2`
LEFT JOIN `df`
  ON (`df2`.`x` = `df`.`x`)
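
The queries above can be reproduced without a database, because lazy_frame() works against a simulated connection. The following is a minimal sketch, assuming dplyr and a dbplyr version that includes this PR are installed; the helper count_fixed() is a hypothetical name introduced here only to inspect the rendered SQL.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr)

lf <- lazy_frame(x = 1, a = 1)

# Two chained left joins on the same key.
q <- lf %>%
  left_join(lf, by = "x") %>%
  left_join(lf, by = "x")

# Render the SQL as a plain string and print it.
sql <- as.character(sql_render(q))
cat(sql, "\n")

# Hypothetical helper: count literal occurrences of a keyword in the SQL.
count_fixed <- function(pattern, text) {
  lengths(regmatches(text, gregexpr(pattern, text, fixed = TRUE)))
}

# With multi-join support one would expect a single SELECT; versions without
# it nest the first join in a subquery and so contain more than one.
n_joins <- count_fixed("LEFT JOIN", sql)
n_selects <- count_fixed("SELECT", sql)
```

Comparing n_selects across dbplyr versions is a quick way to see whether the joins were flattened into one query.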

* Transfer attributes while inlining

* Change semi-join vars to tibble

* Document
* Add `types` to `copy_inline()`

* Remove commented code

* Clarify `types` documentation

* Remove incorrect `types` argument

* Check `types` argument

* Test that `types` argument works

* Remove unnecessary code
* `rows_*()` casts `y` columns if it copies them

* Fix `name = NULL` case

* Check containment before copying

* Only use type inference for Postgres
Conflicts:
	R/verb-copy-to.R
	R/verb-joins.R
	R/verb-select.R
	tests/testthat/test-verb-joins.R
@mgirlich
Collaborator Author

@hadley I think there are enough nice new features to make another release. Is there anything else you want to include?
Maybe you want to have a look at the PR #918 and your issues #896, #897, #970, and #863.

@hadley
Member

hadley commented Aug 30, 2022

@mgirlich I don't have time for those issues at the moment (and I'd need to think more about #918), so I think we're good to go for a release. Are you thinking 2.2.2 or 2.3.0? (The main difference is whether or not you think we should do a blog post.) Do you want to handle most of the release (i.e. work through the bullets in use_release_issue()), or do you want me to do it?

@mgirlich
Collaborator Author

mgirlich commented Sep 1, 2022

The release contains

  • much better error messages
  • a couple of small bugfixes and minor features
  • new translations for some backends
  • much nicer SQL

While I really like these features, I don't necessarily see the need for a blog post. Also, a couple of other improvements for nicer SQL are still missing (e.g. this PR), and it feels a bit weird to blog about half-finished work.

I created a release issue #994.

@hadley
Member

hadley commented Nov 22, 2022

Can you please give me a suggested reading guide for this PR?

@mgirlich
Collaborator Author

  • sql_query_multi_join.DBIConnection() to get a feeling for the data structures involved
    • joins is a new structure that is documented in this method
    • table_vars is the structure you proposed in the last PR (a named list of table variables)
    • vars are the variables going into the SELECT clause
  • lazy_multi_join_query() gives you another quick overview of the data structures
  • add_join() is the complicated part. Dig into it with this:

lf <- lazy_frame(x = 1, a = 1, .name = "df1")
lf2 <- lazy_frame(x = 1, b = 2, .name = "df2")
lf3 <- lazy_frame(x = 1, b = 2, .name = "df3")

# exit in the `new_query` branch
debugonce(add_join)
out <- left_join(lf, lf2, by = "x")
# extend the multi join query
debugonce(add_join)
out %>% inner_join(lf3, by = "x")

  • sql_build.lazy_multi_join_query() has some special handling for joins that involve only two tables: the naming logic is slightly different, and it dispatches differently for right and full joins.
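
As a rough picture of the table_vars structure described above (a named list of table variables), the sketch below uses plain base R. It is purely illustrative: the object layout, the `by` key, and the clash check are assumptions made for this example and are not dbplyr's internal representation.

```r
# Hypothetical illustration only: not dbplyr's internal data structure.
# One entry per table, holding that table's column names.
table_vars <- list(
  df1 = c("x", "a"),
  df2 = c("x", "b"),
  df3 = c("x", "b")
)
by <- "x"

# Non-key columns contributed by each table.
non_key <- lapply(table_vars, setdiff, y = by)

# Column names that occur in more than one table and would therefore
# need disambiguating (e.g. via suffixes) in the SELECT clause.
all_non_key <- unlist(non_key, use.names = FALSE)
clashing <- unique(all_non_key[duplicated(all_non_key)])
print(clashing)
```

Keeping the variables per table in one named list makes it straightforward to decide, for every output column, which table alias it should be qualified with.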

Member

@hadley hadley left a comment

I can't say I followed every detail, but overall the code makes sense to me and I don't get the sense it's more complicated than it needs to be. Thanks for working on this!!

R/db-sql.R
y_as <- join_alias$y

as_current <- x_lq$table_names$as
if (!is_null(x_as)) {
Member

This feels a bit fiddly to me and I wonder if we're putting too much emphasis on x_as/y_as which should be relatively rarely used features. I think it would be fine if the last specified alias wins (assuming that's easier to implement)

Collaborator Author

It would still require a check if sql_on is used, e.g.

left_join(x, y, x_as = "x", y_as = "y", sql_on = "x.a = y.a") %>% 
  left_join(
    z,
    x_as = "dummy",
    y_as = "z",
    sql_on = "dummy.a = z.a"
  )

Member

That feels pretty fiddly to me too; I'm not sure we need to support so many edge cases.

Collaborator Author

I agree that it is quite fiddly. But I'm not sure there is a nice alternative. We need some kind of check to make sure that

left_join(lf, lf, sql_on = "...") %>% 
  left_join(lf, sql_on = "...")

works. Of course, we could simply start a new query but in the end the checks won't be much simpler. Also, at least for now it feels like more work to change it.

Member

Ok, let's leave it as is then.
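
For readers following this thread, the interaction between sql_on and the aliases can be seen in a small self-contained case. This is a minimal sketch, assuming dplyr and a dbplyr version with x_as/y_as support are installed; the alias names `lhs` and `rhs` are arbitrary choices for the example.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr)

lf <- lazy_frame(x = 1, a = 1)

# sql_on is raw SQL, so it has to refer to the table aliases
# chosen via x_as / y_as.
q <- left_join(
  lf, lf,
  x_as = "lhs",
  y_as = "rhs",
  sql_on = "lhs.x = rhs.x"
)

# Render the SQL as a plain string and print it.
sql <- as.character(sql_render(q))
cat(sql, "\n")
```

Because the ON clause is passed through as opaque text, alias references inside it presumably cannot be rewritten when a later join supplies a different x_as, which is the situation the check discussed above guards against.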

@mgirlich
Collaborator Author

mgirlich commented Dec 5, 2022

@hadley Is there anything else you want to have changed or can we merge this? 😄

@hadley
Member

hadley commented Dec 5, 2022

Looks good to me!

@mgirlich mgirlich merged commit b16399f into main Dec 6, 2022
@mgirlich mgirlich deleted the multi_join branch December 6, 2022 06:29
@hadley
Member

hadley commented Dec 6, 2022

Woo hooo!
