Non-deterministic use of `row_number()` #77

lawrenceadams · 2024-10-06T14:24:14Z

`row_number()` is sprinkled throughout the repository, but it is used in several different ways that are likely to give unexpected/non-deterministic results between runs. We usually have a logical means of ordering the rows: if not a date, then an ID of some sort. The calls tend to look like:

row_number() over ()
row_number() over (order by (select null))

The variation is likely to be from a mixture of copy-pasted sources (e.g. T-SQL doesn't allow ...OVER () but Postgres/DuckDB do). The main two offenders I can find so far:

dbt-synthea/models/omop/location.sql

Line 2 in ae79114

ROW_NUMBER() OVER () AS location_id

dbt-synthea/models/omop/provider.sql

Line 2 in 75a1e12

row_number() over (order by (select null)) as provider_id,

Although this probably has low impact downstream, it may be causing unexpected behaviour, e.g. #47
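To illustrate the kind of fix being suggested, here is a minimal sketch using Python's built-in `sqlite3` as a stand-in engine (it needs SQLite 3.25+ for window functions). The table `location_src` and its columns are hypothetical and are not the project's actual schema. The point is only that an explicit `ORDER BY` over a unique key yields the same ids regardless of the order the rows were loaded in.

```python
import sqlite3

# Hypothetical source rows; names and columns are illustrative only.
ROWS = [
    ("Boston", "MA", "02101"),
    ("Albany", "NY", "12201"),
    ("Concord", "NH", "03301"),
]


def assign_ids(rows):
    """Load rows in the given order, then number them by an explicit sort key."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE location_src (city TEXT, state TEXT, zip TEXT)")
    con.executemany("INSERT INTO location_src VALUES (?, ?, ?)", rows)
    result = con.execute(
        """
        SELECT
            ROW_NUMBER() OVER (ORDER BY city, state, zip) AS location_id,
            city, state, zip
        FROM location_src
        ORDER BY location_id
        """
    ).fetchall()
    con.close()
    return result


# Same data, opposite load order: the assigned ids should match.
forward = assign_ids(ROWS)
backward = assign_ids(list(reversed(ROWS)))
```

With `OVER ()` or `OVER (ORDER BY (SELECT NULL))` the engine is free to number the rows in whatever order it happens to scan them, so no such guarantee holds.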

lawrenceadams · 2024-10-06T20:24:39Z

Interestingly the former has been solved upstream:

https://github.com/OHDSI/ETL-Synthea/pull/200/files

katy-sadowski · 2024-10-07T00:13:38Z

I definitely agree we should choose a consistent and deterministic way of doing this, and I like ROW_NUMBER() OVER() ... are you sure SQL Server doesn't support it? My Googling says yes but I've never actually used SQL Server.

lawrenceadams · 2024-10-07T06:40:11Z

> I definitely agree we should choose a consistent and deterministic way of doing this, and I like ROW_NUMBER() OVER() ... are you sure SQL Server doesn't support it? My Googling says yes but I've never actually used SQL Server.

Sorry, I made my point badly. You're absolutely right that SQL Server supports ROW_NUMBER(), but it does not support an empty OVER () clause (like my first example above). Instead you need to provide an ORDER BY expression or use OVER (ORDER BY (SELECT NULL)) (which I think we should avoid).

https://stackoverflow.com/questions/44105691/row-number-without-order-by

katy-sadowski · 2024-10-08T00:35:41Z

Ahh OK! That makes more sense 🙃 and agree, we should choose some ordering key for consistency even if there are dupes in a table (which will happen in OMOP).
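One caveat with duplicates: when several rows share the same ordering key, `row_number()` is still free to number the tied rows in any order. A rough way to see how exposed a candidate key is would be to count its ties first. The sketch below again uses Python's `sqlite3`; the helper `tied_keys` and the table `provider_src` are made up for illustration.

```python
import sqlite3


def tied_keys(con, table, key_columns):
    """Return each key value that occurs more than once, with its row count.

    Identifiers are interpolated directly, so pass trusted names only.
    """
    cols = ", ".join(key_columns)
    sql = (
        f"SELECT {cols}, COUNT(*) AS n FROM {table} "
        f"GROUP BY {cols} HAVING COUNT(*) > 1 ORDER BY {cols}"
    )
    return con.execute(sql).fetchall()


con = sqlite3.connect(":memory:")
# Hypothetical provider source table with one duplicated row.
con.execute("CREATE TABLE provider_src (name TEXT, specialty TEXT)")
con.executemany(
    "INSERT INTO provider_src VALUES (?, ?)",
    [
        ("Ann Lee", "Cardiology"),
        ("Ann Lee", "Cardiology"),
        ("Bo Cruz", "Oncology"),
    ],
)

dupes = tied_keys(con, "provider_src", ["name", "specialty"])
```

If the tied rows are fully identical, as here, it does not matter which one gets which number. If they differ in other columns, adding those columns to the `ORDER BY` as tie-breakers keeps the ids stable.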

lawrenceadams added the bug label Oct 6, 2024

lawrenceadams linked a pull request Oct 8, 2024 that will close this issue

refactor: use determinisitic provider id #81

Merged

lawrenceadams closed this as completed in #81 Oct 31, 2024
