Context
Currently all titer data uploaded to fauna use tdb/upload's index_fields to create the record index. If there are any changes in these field values, then the generated index no longer matches previous uploads and duplicate records get added to the database. This means that if we encounter changes in these values, we need to delete the old records and then upload the new records. See the example discussed in #126 (comment) and the latest example on Slack.
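To make the failure mode concrete, here's a minimal sketch of how a value-derived index behaves. The field names and the join-and-hash scheme are illustrative assumptions, not the actual tdb/upload implementation:

```python
import hashlib

# Hypothetical index fields -- the real list lives in tdb/upload's
# index_fields; these names are only for illustration.
INDEX_FIELDS = ["virus_strain", "serum_strain", "source", "passage_category"]

def record_index(record: dict) -> str:
    """Build a deterministic index by joining the index field values."""
    key = "|".join(str(record.get(field, "")) for field in INDEX_FIELDS)
    return hashlib.sha256(key.encode()).hexdigest()

old = {"virus_strain": "A/Perth/16/2009", "serum_strain": "A/Victoria/361/2011",
       "source": "crick_h3n2_2020.tsv", "passage_category": "cell"}
new = dict(old, source="crick_h3n2_2020_v2.tsv")  # one field value changed

# Different indices => the upload inserts a second record instead of
# replacing the first, leaving a duplicate in the database.
assert record_index(old) != record_index(new)
```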
If the index fields are too specific, then we have to be wary of more data changes. We see this in titer uploads that use tdb/elife_upload under the hood. Within elife_upload, a row counter is appended to the source field, so a change in the order of records can create duplicate records.
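As an illustration of this order sensitivity (the exact suffixing logic in elife_upload may differ, and the names here are hypothetical):

```python
# Sketch of an order-sensitive index: appending a row counter to the
# source means the same record gets a different index if the rows are
# reordered between uploads.
def indexed_source(source: str, row_number: int) -> str:
    return f"{source}-{row_number}"

rows_v1 = ["record_a", "record_b"]
rows_v2 = ["record_b", "record_a"]  # same data, different order

keys_v1 = {rec: indexed_source("report.xls", i) for i, rec in enumerate(rows_v1)}
keys_v2 = {rec: indexed_source("report.xls", i) for i, rec in enumerate(rows_v2)}

# record_a is "report.xls-0" in one upload and "report.xls-1" in the
# other, so the second upload adds a duplicate instead of matching.
assert keys_v1["record_a"] != keys_v2["record_a"]
```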
If the index fields are not specific enough, then records with the same data can overwrite each other. We see this in the CDC titer uploads, which do not append the row counter to the source. Records with different passage details that get categorized as the same passage category (e.g. both "S1" and "S3" -> "cell") will create the same index and thus overwrite each other in the database.
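Conversely, a sketch of the collision: two records whose distinct passage details map to the same category produce identical indices (the field names and categorization rule are made up for illustration):

```python
def passage_category(passage: str) -> str:
    """Hypothetical coarse categorization, e.g. SIAT passages -> "cell"."""
    return "cell" if passage.startswith("S") else "egg"

rec1 = {"virus_strain": "A/Perth/16/2009", "passage": "S1"}
rec2 = {"virus_strain": "A/Perth/16/2009", "passage": "S3"}

# If the index uses the category instead of the raw passage detail,
# both records collapse to the same key and the second overwrites the first.
key1 = (rec1["virus_strain"], passage_category(rec1["passage"]))
key2 = (rec2["virus_strain"], passage_category(rec2["passage"]))
assert key1 == key2
```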
Possible solutions
Solutions I can think of at the moment, but would love to hear other ideas:
1. Specify different index fields per CC's titer upload to tailor them to the different data. However, this may be even more confusing in the long run because we'd have to be wary of different field changes per CC.
2. Move away from fauna/rethinkdb for titer data. I think we'd create a standardized TSV per Excel/TSV file that we receive, and they could all be concatenated into one TSV as our central "database" (see the sketch after this list). Then we'd only have to be wary of changes per file instead of changes in specific fields. However, this means we'd have to reconstruct the central "database" every time we get new data.
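A minimal sketch of option 2, assuming one standardized TSV per received file and pandas for the concatenation (the directory and file names are made up):

```python
from pathlib import Path
import pandas as pd

# Rebuild the central "database" from scratch: one standardized TSV per
# received Excel/TSV file, concatenated into a single table. A change in
# one source file only requires regenerating that file's TSV.
def build_central_tsv(source_dir: Path, output: Path) -> None:
    frames = [pd.read_csv(tsv, sep="\t") for tsv in sorted(source_dir.glob("*.tsv"))]
    pd.concat(frames, ignore_index=True).to_csv(output, sep="\t", index=False)

build_central_tsv(Path("standardized_titers"), Path("titers_database.tsv"))
```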