Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature/merge-treedata-staging #56

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
138 changes: 138 additions & 0 deletions stored_procedures/public.merge_treedata.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,138 @@

create or replace procedure public.merge_treedata (
c_id_source_name character varying(255)
)
language plpgsql
as $$
begin

DROP TABLE IF EXISTS _tree;

CREATE TEMPORARY TABLE _tree AS
SELECT geom,
id,
id_reference,
id_source_name,
city,
country,
email,
download,
info,
lat,
lng,
count,

-- crosswalk fields, as per opentrees schema
scientific,
genus,
species,
variety,
common,
dbh,
health,
height,
crown,
spread,
ule,
updated,
planted,
note,
address
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to add some more columns here, like dbh_min and dbh_max. I just need some more sample data to play with to test this out.

We just merged the PR to run the staging tables from the tree-sources process, so this should be reasonable to test.

FROM treedata_staging ts
WHERE 1=1
AND (c_id_source_name is null or ts.id_source_name = c_id_source_name);

/*
Insert with a left join based on the id_tree column
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upserts can work, but left joins can be faster and more importantly, cleaner. The code to do an upsert on a bunch of columns in one sql statement can get big and cumbersome. It can be nicer to split it out.

*/
INSERT INTO treedata (
id_tree,
id_reference,
id_source_name,
geom,
city,
country,
email,
lat,
lng,
count,
scientific,
genus,
species,
variety,
common,
dbh,
health,
height,
date_planted,
note,
address
)
SELECT id as id_tree,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: Be careful with the ids, there's a lot of them... Wondering if we shot ourselves in the foot with calling the treehash "id"...

id is the unique hash of the tree that we create with our tree-id repo.
id_tree is the serialized db table id
id_source_name is the source's (like san_francisco)
id_reference is the city's/sources reference id(this may or may not exist)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think given all the changes lately, I need to refresh my database. In the latest version I have, I don't have an id column on the treedata table. I also called id on treedata_staging what I think should be id_tree, so I will have to fix that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The id might be hiding midway down the table. Its sandwiched

    last_watered timestamp without time zone,
    id bigint NOT NULL,
    id_source_name character varying(255),

id_reference,
id_source_name,
geom,
city,
country,
email,
lat,
lng,
count,
scientific,
genus,
species,
variety,
common,
dbh,
health,
height,
date_planted,
note,
address
FROM _tree
LEFT JOIN treedata on (
treedata.id_source_name = _tree.id_source_name
AND treedata.id_tree = _tree.id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should not be the same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They shouldn't, but they are. I need to fix the tree-sources repo to fix that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I am reading it wrong, I was looking at this code
https://github.com/waterthetrees/tree-sources/blob/main/src/stages/normalize.js#L95

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean these: treedata.id_tree = _tree.id shouldnt be the same.
One is the table's serial number, the other is the created hash of the tree based off common, scientific, etc.

Copy link
Member

@zoobot zoobot Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screen Shot 2023-03-28 at 9 21 00 PM

Copy link
Member

@zoobot zoobot Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lines 406, 408, 446, and 447 are all different. id_wtt needs to be removed at some point as it is no longer in use.

reposting here:
line 446: id is the unique hash of the tree that we create with our tree-id repo.
line 406: id_tree is the serialized db table id
line 447: id_source_name is the source's (like san_francisco)
line 408: id_reference is the city's/sources reference number(this may or may not exist)

)
WHERE 1=1
AND treedata.id_tree is null
;

/*
Update with a join. Add more columns.
*/
UPDATE treedata
SET dbh = _tree.dbh,
address = _tree.address,
health = _tree.health
FROM _tree
WHERE 1=1
AND _tree.id = treedata.id_tree
-- check any columns that changed
-- FIXME add more columns
AND (
_tree.dbh <> treedata.dbh
OR _tree.address <> treedata.address
OR _tree.health <> treedata.health
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to add some more columns here, but this is just a draft PR to get me started.

)
;

/*
For the delete, we will probably want to keep track of old treedata rows.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we want to do in the case of deletes? We might not be able to actually delete rows because of foreign keys.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've talked about this a ton and we are almost there with functionality to return our data back to the source cities but we'll need to gather their emails and or build out a 311 api.

What are you planning on deleting?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may have trees in our db that do not exist in the source data...

Copy link
Contributor Author

@tzinckgraf tzinckgraf Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of the case where a tree exists, then dies and gets cut down. The city the next season goes and plants a new tree in that same tree pit (because new tree pits are expensive). They record the new tree and maybe delete the old tree, assuming it was in the database.

What do we do with the old tree in our database in that scenario? This is all hypothetical. I'm not sure any city does that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may have trees in our db that do not exist in the source data...

Similar to the update, we can delete based on id_source_name. If a tree gets created in our db with the same id_source_name as one of our data sources, then we won't be able to easily differentiate it from the data in a file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of the case where a tree exists, then dies and gets cut down. The city the next season goes and plants a new tree in that same tree pit (because new tree pits are expensive). They record the new tree and maybe delete the old tree, assuming it was in the database.

What do we do with the old tree in our database in that scenario? This is all hypothetical. I'm not sure any city does that.

They do replant in the same hole pretty much every time a tree dies so we need to figure it out. We have functionality on the FE to edit and delete but we removed it last year because it is a can of worms. The question is, how do we know for sure which is the correct data. What if a user replanted something and the city has old data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may have trees in our db that do not exist in the source data...

Similar to the update, we can delete based on id_source_name. If a tree gets created in our db with the same id_source_name as one of our data sources, then we won't be able to easily differentiate it from the data in a file.

Yeah that's a real issue. Because the id_source_name IS the city name. I suppose we need to differentiate between what's city data and what is user data coming in from the field via the FE. Even still, which takes precedence. We could have a "field_data" vs "source_data" column value or something and then just grab whichever has the most recent date. A lot of sources don't have a modified or last maintained fiield tho.

Do we want some sort of audit trail for deleted trees?
Maybe an "active" column that we can use.
We probably want an "active" column because of foreign key constraints
*/

/*
Delete the rows from the staging table based on the
id_source_name
*/
DELETE
FROM treedata_staging
WHERE 1=1
AND c_id_source_name is null or id_source_name = c_id_source_name
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FIXME: Put the or statement into parenthesis.

;

end; $$
;