Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check if instantiated cohort conforms to cohort table form #36

Open
gowthamrao opened this issue Jun 1, 2022 · 5 comments
Open

Check if instantiated cohort conforms to cohort table form #36

gowthamrao opened this issue Jun 1, 2022 · 5 comments
Labels
help wanted Extra attention is needed

Comments

@gowthamrao
Copy link
Member

In OHDSI we define a cohort as a set of persons who satisfy certain criteria for duration of time (ie. cohort_start_date and cohort_end_date). A person cannot have overlapping dates within the same cohortId.

This is not checked, but is implicitly expected. If this is violated, it can cause headache.

How about adding a simple check. We can do it with a SQL like this

SELECT c1.*,
	c2.*
FROM cohort c1
INNER JOIN cohort c2 
ON c1.cohort_definition_id = c2.cohort_definition_id
	AND c1.subject_id = c2.subject_id
	AND (
		c1.cohort_start_date <> c2.cohort_start_date
		OR c1.cohort_end_date <> c2.cohort_end_date
		)
	AND c1.cohort_start_date <= c2.cohort_end_date
	AND c1.cohort_end_date >= c2.cohort_start_date;
@anthonysena
Copy link
Collaborator

anthonysena commented Jul 14, 2022

@gowthamrao - would a function in CohortGenerator like this be acceptable:

isCohortTable(connection, connectionDetails, schema, cohortTable) which attempts to query the schema.cohortTable via the connection and if it exists, it checks that the required columns are present? Checking the data types may be a bit tricky across RDBMS platforms but we could also add some checks for that as well.

The SQL would be something like:

SELECT *
FROM @schema.@cohortTable
WHERE 1 = 0
;

This way only the table structure is returned and not the full contents of the table. You could check it on any number of tables you are interested in via R.

@gowthamrao
Copy link
Member Author

gowthamrao commented Aug 19, 2022

Thanks @anthonysena i think you bought up important validation of cohort table structure.

  1. Structure validation - It should have the fields cohortDefinitionId, subjectId, cohortStartDate, cohortEndDate.
  2. Type validation - It should have integer/bigint, int/bigInt, date, date

Those are important and they check the structure and data type of the cohort table.

But I am most worried about the generated cohort. Is it conforming to cohort era rule as described here OHDSI/CohortAlgebra#2

So i am suggesting a third validation step to the generated cohort output:
3) Cohort era check: is the instantiated cohort eraFy'd.

Now i dont know if CohortGenerator is appropriate for this function - because these are activities on a generated cohort. So may cohortAlgebra is the right place as it deals with generated cohorts, and does not generate cohorts itself.

@anthonysena
Copy link
Collaborator

i think you bought up important validation of cohort table structure.

Structure validation - It should have the fields cohortDefinitionId, subjectId, cohortStartDate, cohortEndDate.
Type validation - It should have integer/bigint, int/bigInt, date, date

@gowthamrao - I think we're aligned on the cohort table structure part. I'm less clear on this part:

  1. Cohort era check: is the instantiated cohort eraFy'd.

From my perspective, CohortGenerator is concerned with generating the cohort and putting the results into the cohort tables. Ensuring that the cohorts are validated is important but probably not a concern of the CohortGenerator package. I believe the Circe cohort definition editor in ATLAS has functionality to ensure a person does not have overlapping time in a cohort. So perhaps I'm missing when this could happen in a generated cohort?

@gowthamrao
Copy link
Member Author

perhaps I'm missing when this could happen in a generated cohort?

Well - remember, it is possible to write custom SQL and get a cohort table structure as output i.e. structure validation and type validation; but does the SQL ensure that required era'fy is present?

If the cohort has any two rows for the same person that has overlapping cohort_dates - then it is NOT a cohort.

CohortGenerator only requires a cohortDefinitionSet. As long as the cohortDefinitionSet has an SQL its fine with it. - correct? what if the output is not a cohort?

@anthonysena
Copy link
Collaborator

If the cohort has any two rows for the same person that has overlapping cohort_dates - then it is NOT a cohort.

CohortGenerator only requires a cohortDefinitionSet. As long as the cohortDefinitionSet has an SQL its fine with it. - correct? what if the output is not a cohort?

You are correct - CohortGenerator only requires the SQL so this could pose a problem. Trying to validate that it produced a proper cohort could be tricky but if someone wanted to take a swing at implementing this type of check, I'd be happy to review.

@anthonysena anthonysena added the help wanted Extra attention is needed label Dec 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants