feature/merge-treedata-staging #56

tzinckgraf · 2023-03-29T03:49:59Z

Initial commit for the merge process. The process is split into 5 pieces:

- create a temp table with the staging data
- insert based on a left join
- update based on data differences
- delete / deactivate for any removed trees (a point of discussion)
- remove data from the staging table

This is all done in one stored procedure that takes an optional id_source_name. It will be called from the tree-sources repo.
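
For discussion, the five steps above can be sketched roughly like this. It is a minimal, runnable illustration in Python against an in-memory SQLite database, not the Postgres procedure in this PR. The trimmed-down schema, the `active` soft-delete flag, joining on the tree hash `id`, and the function name `merge_treedata` are all assumptions made for the example.

```python
import sqlite3

# Hypothetical, trimmed-down schema; the real tables have many more columns.
SCHEMA = """
CREATE TABLE treedata (
    id_tree        INTEGER PRIMARY KEY,        -- serial table id
    id             TEXT NOT NULL,              -- tree hash
    id_source_name TEXT NOT NULL,
    dbh            INTEGER,
    health         TEXT,
    active         INTEGER NOT NULL DEFAULT 1  -- assumed soft-delete flag
);
CREATE TABLE treedata_staging (
    id TEXT NOT NULL, id_source_name TEXT NOT NULL, dbh INTEGER, health TEXT
);
"""

# Optional source filter; note the parentheses around the OR.
SRC_FILTER = "(:src IS NULL OR id_source_name = :src)"


def merge_treedata(conn, id_source_name=None):
    """Run the five merge steps for one source, or for all sources if None."""
    p = {"src": id_source_name}
    with conn:  # commit on success, roll back the data changes on error
        # 1. Copy the relevant staging rows into a temp table.
        conn.execute("DROP TABLE IF EXISTS temp._tree")
        conn.execute(
            "CREATE TEMP TABLE _tree "
            "(id TEXT, id_source_name TEXT, dbh INTEGER, health TEXT)"
        )
        conn.execute(
            "INSERT INTO _tree SELECT id, id_source_name, dbh, health "
            f"FROM treedata_staging WHERE {SRC_FILTER}",
            p,
        )
        # 2. Insert rows that have no match in treedata (left join, keep misses).
        conn.execute("""
            INSERT INTO treedata (id, id_source_name, dbh, health)
            SELECT s.id, s.id_source_name, s.dbh, s.health
            FROM _tree AS s
            LEFT JOIN treedata AS t
                   ON t.id_source_name = s.id_source_name AND t.id = s.id
            WHERE t.id IS NULL
        """)
        # 3. Update rows whose tracked columns differ (NULL-safe comparison).
        conn.execute("""
            UPDATE treedata
            SET dbh = (SELECT s.dbh FROM _tree s
                       WHERE s.id_source_name = treedata.id_source_name
                         AND s.id = treedata.id),
                health = (SELECT s.health FROM _tree s
                          WHERE s.id_source_name = treedata.id_source_name
                            AND s.id = treedata.id)
            WHERE EXISTS (
                SELECT 1 FROM _tree s
                WHERE s.id_source_name = treedata.id_source_name
                  AND s.id = treedata.id
                  AND (s.dbh IS NOT treedata.dbh OR s.health IS NOT treedata.health)
            )
        """)
        # 4. Deactivate (not delete) trees missing from the sources just merged.
        conn.execute("""
            UPDATE treedata SET active = 0
            WHERE id_source_name IN (SELECT DISTINCT id_source_name FROM _tree)
              AND NOT EXISTS (
                  SELECT 1 FROM _tree s
                  WHERE s.id_source_name = treedata.id_source_name
                    AND s.id = treedata.id
              )
        """)
        # 5. Clear the merged rows out of staging.
        conn.execute(f"DELETE FROM treedata_staging WHERE {SRC_FILTER}", p)
```

Step 4 deactivates rather than deletes only because that sidesteps the foreign-key question; which behavior we actually want is still open.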

tzinckgraf · 2023-03-29T03:50:42Z

stored_procedures/public.merge_treedata.sql

+    ;
+
+    /*
+        For the delete, we will probably want to keep track of old treedata rows.


What do we want to do in the case of deletes? We might not be able to actually delete rows because of foreign keys.

We've talked about this a ton, and we are almost there with functionality to return our data to the source cities, but we'll need to gather their emails and/or build out a 311 API.

What are you planning on deleting?

We may have trees in our db that do not exist in the source data...

I'm thinking of the case where a tree exists, then dies and gets cut down. The city the next season goes and plants a new tree in that same tree pit (because new tree pits are expensive). They record the new tree and maybe delete the old tree, assuming it was in the database.

What do we do with the old tree in our database in that scenario? This is all hypothetical. I'm not sure any city does that.

> We may have trees in our db that do not exist in the source data...

Similar to the update, we can delete based on id_source_name. If a tree gets created in our db with the same id_source_name as one of our data sources, then we won't be able to easily differentiate it from the data in a file.

> I'm thinking of the case where a tree exists, then dies and gets cut down. The city the next season goes and plants a new tree in that same tree pit (because new tree pits are expensive). They record the new tree and maybe delete the old tree, assuming it was in the database.
>
> What do we do with the old tree in our database in that scenario? This is all hypothetical. I'm not sure any city does that.

They do replant in the same hole pretty much every time a tree dies, so we need to figure it out. We have functionality on the FE to edit and delete, but we removed it last year because it is a can of worms. The question is: how do we know for sure which data is correct? What if a user replanted something and the city has old data?

> > We may have trees in our db that do not exist in the source data...
>
> Similar to the update, we can delete based on id_source_name. If a tree gets created in our db with the same id_source_name as one of our data sources, then we won't be able to easily differentiate it from the data in a file.

Yeah, that's a real issue, because the id_source_name IS the city name. I suppose we need to differentiate between city data and user data coming in from the field via the FE. Even then, which takes precedence? We could have a "field_data" vs "source_data" column value or something, and then just grab whichever has the most recent date. A lot of sources don't have a modified or last-maintained field, though.
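
To make the "most recent date wins" idea concrete, here is a small Python sketch. The `TreeRecord` type, the `origin` values, and the `modified` field are hypothetical names for illustration; nothing like this exists in the schema yet. Because many sources carry no date, the sketch returns `None` in that case rather than picking a side, since that policy has not been decided.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class TreeRecord:
    id_source_name: str
    origin: str                # "field_data" or "source_data" (proposed values)
    health: str
    modified: Optional[date]   # many sources have no such field


def pick_latest(field: TreeRecord, source: TreeRecord) -> Optional[TreeRecord]:
    """Return the more recently modified record, or None if it cannot be decided."""
    if field.modified is None or source.modified is None:
        return None  # undecided: no date on one side, needs a policy or a human
    # On a tie the field record wins; that tie-break is an arbitrary choice here.
    return field if field.modified >= source.modified else source


field = TreeRecord("oakland", "field_data", "good", date(2023, 3, 1))
city = TreeRecord("oakland", "source_data", "dead", date(2022, 6, 1))
undated = TreeRecord("oakland", "source_data", "dead", None)

print(pick_latest(field, city).origin)  # field_data
print(pick_latest(field, undated))      # None
```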

tzinckgraf · 2023-03-29T03:51:02Z

stored_procedures/public.merge_treedata.sql

+    AND     (
+            _tree.dbh <> treedata.dbh
+        OR  _tree.address <> treedata.address
+        OR  _tree.health <> treedata.health


I need to add some more columns here, but this is just a draft PR to get me started.
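
One thing to watch when extending this comparison: `<>` evaluates to NULL, not true, when either side is NULL, so a column going from NULL to a value (or back) would not count as a difference. Postgres has `IS DISTINCT FROM` for this. The sketch below shows the effect in Python with SQLite, where the NULL-safe operator is spelled `IS NOT`; the two-column tables are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE treedata (id TEXT, health TEXT);
    CREATE TABLE _tree    (id TEXT, health TEXT);
    INSERT INTO treedata VALUES ('a', 'good'), ('b', NULL),   ('c', 'fair');
    INSERT INTO _tree    VALUES ('a', 'poor'), ('b', 'good'), ('c', 'fair');
""")


def changed_ids(comparison):
    """Return the ids whose rows satisfy the given change-detection predicate."""
    sql = (
        "SELECT t.id FROM _tree s JOIN treedata t ON t.id = s.id "
        f"WHERE {comparison} ORDER BY t.id"
    )
    return [row[0] for row in conn.execute(sql)]


plain = changed_ids("s.health <> t.health")          # misses the NULL -> 'good' change
null_safe = changed_ids("s.health IS NOT t.health")  # SQLite's NULL-safe inequality

print(plain)      # ['a']
print(null_safe)  # ['a', 'b']
```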

tzinckgraf · 2023-03-29T03:51:44Z

stored_procedures/public.merge_treedata.sql

+    AND     (c_id_source_name is null or ts.id_source_name = c_id_source_name);
+
+    /*
+        Insert with a left join based on the id_tree column


Upserts can work, but left joins can be faster and, more importantly, cleaner. The code to do an upsert on a bunch of columns in one SQL statement can get big and cumbersome, so it can be nicer to split it out.

tzinckgraf · 2023-03-29T03:52:37Z

stored_procedures/public.merge_treedata.sql

+            updated,
+            planted,
+            note,
+            address


I need to add some more columns here, like dbh_min and dbh_max. I just need some more sample data to play with to test this out.

We just merged the PR to run the staging tables from the tree-sources process, so this should be reasonable to test.

tzinckgraf · 2023-03-29T03:53:06Z

stored_procedures/public.merge_treedata.sql

+    DELETE
+    FROM    treedata_staging
+    WHERE   1=1
+    AND     c_id_source_name is null or id_source_name = c_id_source_name


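FIXME: Wrap the OR condition in parentheses. AND binds tighter than OR, so as written this only evaluates as intended because 1=1 is always true; any further AND condition would apply to just one side of the OR.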
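
Here is a quick Python/SQLite sketch of how the missing parentheses would bite once the leading `1=1` is replaced by a real condition. The `processed` column is hypothetical and only there to stand in for "some other AND condition".

```python
import sqlite3

ROWS = [("oakland", 1), ("oakland", 0), ("alameda", 1), ("alameda", 0)]


def remaining(where, source):
    """Delete from a fresh staging table with the given WHERE; return what is left."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE treedata_staging (id_source_name TEXT, processed INTEGER)"
    )
    conn.executemany("INSERT INTO treedata_staging VALUES (?, ?)", ROWS)
    conn.execute(f"DELETE FROM treedata_staging WHERE {where}", {"src": source})
    return conn.execute(
        "SELECT id_source_name, processed FROM treedata_staging ORDER BY 1, 2"
    ).fetchall()


# Parsed as: (processed = 1 AND :src IS NULL) OR id_source_name = :src
unparenthesized = "processed = 1 AND :src IS NULL OR id_source_name = :src"
# Parsed as intended: processed = 1 AND (no filter OR matching source)
parenthesized = "processed = 1 AND (:src IS NULL OR id_source_name = :src)"

print(remaining(unparenthesized, "oakland"))
# [('alameda', 0), ('alameda', 1)]  -> the unprocessed oakland row was deleted too
print(remaining(parenthesized, "oakland"))
# [('alameda', 0), ('alameda', 1), ('oakland', 0)]
```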

zoobot · 2023-03-29T04:02:48Z

stored_procedures/public.merge_treedata.sql

+        note,
+        address
+    )
+    SELECT  id as id_tree,


Note: Be careful with the ids, there are a lot of them... Wondering if we shot ourselves in the foot by calling the tree hash "id"...

id is the unique hash of the tree that we create with our tree-id repo.
id_tree is the serial db table id.
id_source_name is the source's name (like san_francisco).
id_reference is the city's/source's reference id (this may or may not exist).

I think given all the changes lately, I need to refresh my database. In the latest version I have, I don't have an id column on the treedata table. I also called id on treedata_staging what I think should be id_tree, so I will have to fix that.

The id might be hiding midway down the table. It's sandwiched between these columns:

last_watered timestamp without time zone, id bigint NOT NULL, id_source_name character varying(255),

zoobot · 2023-03-29T04:05:07Z

stored_procedures/public.merge_treedata.sql

+    FROM    _tree
+    LEFT JOIN treedata on (
+            treedata.id_source_name = _tree.id_source_name
+        AND treedata.id_tree = _tree.id


These should not be the same

They shouldn't, but they are. I need to fix the tree-sources repo to fix that.

Unless I am reading it wrong, I was looking at this code
https://github.com/waterthetrees/tree-sources/blob/main/src/stages/normalize.js#L95

I mean these: treedata.id_tree = _tree.id shouldn't be the same.
One is the table's serial number; the other is the hash of the tree, created from common, scientific, etc.

Lines 406, 408, 446, and 447 are all different. id_wtt needs to be removed at some point, as it is no longer in use.

Reposting here:
line 446: id is the unique hash of the tree that we create with our tree-id repo.
line 406: id_tree is the serial db table id.
line 447: id_source_name is the source's name (like san_francisco).
line 408: id_reference is the city's/source's reference number (this may or may not exist).

zoobot · 2023-07-27T01:50:43Z

Hi @tzinckgraf, are you still interested in working on this, or should I un-assign? Is this a variation of the other branch with a similar name?

tzinckgraf · 2023-08-01T01:12:06Z

Hey @zoobot, I apologize, I won't have the bandwidth for at least another month. You can unassign.

tzinckgraf requested review from mwpark2014 and zoobot March 29, 2023 03:49
