Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-deterministic use of row_number() #77

Closed
lawrenceadams opened this issue Oct 6, 2024 · 4 comments · Fixed by #81
Closed

Non-deterministic use of row_number() #77

lawrenceadams opened this issue Oct 6, 2024 · 4 comments · Fixed by #81
Labels
bug Something isn't working

Comments

@lawrenceadams
Copy link
Collaborator

row_number is sprinkled out the repository, however it is used in various different ways which are likely to give unexpected/non-deterministic behaviour between runs. We usually will have a logical means of ordering them - if not date, then an ID of some sort. They tend to be used as:

  • row_number() over ()
  • row_number() over(order by (select null)

The variation is likely to be from a mixture of copy-pasted sources (e.g. T-SQL doesn't allow ...OVER () but Postgres/DuckDB do). The main two offenders I can find so far:

ROW_NUMBER() OVER () AS location_id

row_number() over (order by (select null)) as provider_id,

Although this is probably has low impact downstream, it may be causing unexpected behaviour e.g. #47

@lawrenceadams lawrenceadams added the bug Something isn't working label Oct 6, 2024
@lawrenceadams
Copy link
Collaborator Author

Interestingly the former has been solved upstream:

https://github.com/OHDSI/ETL-Synthea/pull/200/files

@katy-sadowski
Copy link
Collaborator

I definitely agree we should choose a consistent and deterministic way of doing this, and I like ROW_NUMBER() OVER() ... are you sure SQL Server doesn't support it? My Googling says yes but I've never actually used SQL Server.

@lawrenceadams
Copy link
Collaborator Author

lawrenceadams commented Oct 7, 2024

I definitely agree we should choose a consistent and deterministic way of doing this, and I like ROW_NUMBER() OVER() ... are you sure SQL Server doesn't support it? My Googling says yes but I've never actually used SQL Server.

Sorry I made my point badly - you're absolutely right it does support it, but it does not support an empty OVER () clause (like my first example above), instead you need to provide an order by sequence or do OVER (SELECT NULL) (which I think we should avoid)

https://stackoverflow.com/questions/44105691/row-number-without-order-by

@katy-sadowski
Copy link
Collaborator

Ahh OK! That makes more sense 🙃 and agree, we should choose some ordering key for consistency even if there are dupes in a table (which will happen in OMOP).

@lawrenceadams lawrenceadams linked a pull request Oct 8, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants