-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature/merge-treedata-staging #56
base: main
Are you sure you want to change the base?
Conversation
Initial commit for the merge process. The process is split into 5 pieces: - create a temp table with the staging data - insert based on a left join - update based on data differences - delete / deactivate for any removed trees (a point of discussion) - remove data from the staging table This is all done in one stored procedure that takes an optional id_source_name. This will be called from the tree-sources repo
; | ||
|
||
/* | ||
For the delete, we will probably want to keep track of old treedata rows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we want to do in the case of deletes? We might not be able to actually delete rows because of foreign keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've talked about this a ton and we are almost there with functionality to return our data back to the source cities but we'll need to gather their emails and or build out a 311 api.
What are you planning on deleting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may have trees in our db that do not exist in the source data...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking of the case where a tree exists, then dies and gets cut down. The city the next season goes and plants a new tree in that same tree pit (because new tree pits are expensive). They record the new tree and maybe delete the old tree, assuming it was in the database.
What do we do with the old tree in our database in that scenario? This is all hypothetical. I'm not sure any city does that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may have trees in our db that do not exist in the source data...
Similar to the update, we can delete based on id_source_name
. If a tree gets created in our db with the same id_source_name
as one of our data sources, then we won't be able to easily differentiate it from the data in a file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking of the case where a tree exists, then dies and gets cut down. The city the next season goes and plants a new tree in that same tree pit (because new tree pits are expensive). They record the new tree and maybe delete the old tree, assuming it was in the database.
What do we do with the old tree in our database in that scenario? This is all hypothetical. I'm not sure any city does that.
They do replant in the same hole pretty much every time a tree dies so we need to figure it out. We have functionality on the FE to edit and delete but we removed it last year because it is a can of worms. The question is, how do we know for sure which is the correct data. What if a user replanted something and the city has old data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may have trees in our db that do not exist in the source data...
Similar to the update, we can delete based on
id_source_name
. If a tree gets created in our db with the sameid_source_name
as one of our data sources, then we won't be able to easily differentiate it from the data in a file.
Yeah that's a real issue. Because the id_source_name IS the city name. I suppose we need to differentiate between what's city data and what is user data coming in from the field via the FE. Even still, which takes precedence. We could have a "field_data" vs "source_data" column value or something and then just grab whichever has the most recent date. A lot of sources don't have a modified or last maintained fiield tho.
AND ( | ||
_tree.dbh <> treedata.dbh | ||
OR _tree.address <> treedata.address | ||
OR _tree.health <> treedata.health |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to add some more columns here, but this is just a draft PR to get me started.
AND (c_id_source_name is null or ts.id_source_name = c_id_source_name); | ||
|
||
/* | ||
Insert with a left join based on the id_tree column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upserts can work, but left joins can be faster and more importantly, cleaner. The code to do an upsert on a bunch of columns in one sql statement can get big and cumbersome. It can be nicer to split it out.
updated, | ||
planted, | ||
note, | ||
address |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to add some more columns here, like dbh_min
and dbh_max
. I just need some more sample data to play with to test this out.
We just merged the PR to run the staging tables from the tree-sources process, so this should be reasonable to test.
DELETE | ||
FROM treedata_staging | ||
WHERE 1=1 | ||
AND c_id_source_name is null or id_source_name = c_id_source_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FIXME: Put the or
statement into parenthesis.
note, | ||
address | ||
) | ||
SELECT id as id_tree, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: Be careful with the ids, there's a lot of them... Wondering if we shot ourselves in the foot with calling the treehash "id"...
id is the unique hash of the tree that we create with our tree-id repo.
id_tree is the serialized db table id
id_source_name is the source's (like san_francisco)
id_reference is the city's/sources reference id(this may or may not exist)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think given all the changes lately, I need to refresh my database. In the latest version I have, I don't have an id
column on the treedata
table. I also called id
on treedata_staging
what I think should be id_tree
, so I will have to fix that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The id might be hiding midway down the table. Its sandwiched
last_watered timestamp without time zone,
id bigint NOT NULL,
id_source_name character varying(255),
FROM _tree | ||
LEFT JOIN treedata on ( | ||
treedata.id_source_name = _tree.id_source_name | ||
AND treedata.id_tree = _tree.id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should not be the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They shouldn't, but they are. I need to fix the tree-sources
repo to fix that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I am reading it wrong, I was looking at this code
https://github.com/waterthetrees/tree-sources/blob/main/src/stages/normalize.js#L95
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean these: treedata.id_tree = _tree.id shouldnt be the same.
One is the table's serial number, the other is the created hash of the tree based off common, scientific, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lines 406, 408, 446, and 447 are all different. id_wtt needs to be removed at some point as it is no longer in use.
reposting here:
line 446: id is the unique hash of the tree that we create with our tree-id repo.
line 406: id_tree is the serialized db table id
line 447: id_source_name is the source's (like san_francisco)
line 408: id_reference is the city's/sources reference number(this may or may not exist)
HI @tzinckgraf Are you interested in working on this anymore or should I un-assign? Is this a variation of the other branch with a similar name? |
Hey @zoobot, I apologize, I won't have the bandwidth for at least another month. You can unassign. |
Initial commit for the merge process. The process is split into 5 pieces:
This is all done in one stored procedure that takes an optional id_source_name. This will be called from the tree-sources repo