Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature/merge-treedata-staging #56

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

tzinckgraf
Copy link
Contributor

Initial commit for the merge process. The process is split into 5 pieces:

  • create a temp table with the staging data
  • insert based on a left join
  • update based on data differences
  • delete / deactivate for any removed trees (a point of discussion)
  • remove data from the staging table

This is all done in one stored procedure that takes an optional id_source_name. This will be called from the tree-sources repo

Initial commit for the merge process. The process is split into 5
pieces:
- create a temp table with the staging data
- insert based on a left join
- update based on data differences
- delete / deactivate for any removed trees (a point of discussion)
- remove data from the staging table

This is all done in one stored procedure that takes an optional
id_source_name. This will be called from the tree-sources repo
@tzinckgraf tzinckgraf requested review from mwpark2014 and zoobot March 29, 2023 03:49
;

/*
For the delete, we will probably want to keep track of old treedata rows.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we want to do in the case of deletes? We might not be able to actually delete rows because of foreign keys.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've talked about this a ton and we are almost there with functionality to return our data back to the source cities but we'll need to gather their emails and or build out a 311 api.

What are you planning on deleting?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may have trees in our db that do not exist in the source data...

Copy link
Contributor Author

@tzinckgraf tzinckgraf Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of the case where a tree exists, then dies and gets cut down. The city the next season goes and plants a new tree in that same tree pit (because new tree pits are expensive). They record the new tree and maybe delete the old tree, assuming it was in the database.

What do we do with the old tree in our database in that scenario? This is all hypothetical. I'm not sure any city does that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may have trees in our db that do not exist in the source data...

Similar to the update, we can delete based on id_source_name. If a tree gets created in our db with the same id_source_name as one of our data sources, then we won't be able to easily differentiate it from the data in a file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of the case where a tree exists, then dies and gets cut down. The city the next season goes and plants a new tree in that same tree pit (because new tree pits are expensive). They record the new tree and maybe delete the old tree, assuming it was in the database.

What do we do with the old tree in our database in that scenario? This is all hypothetical. I'm not sure any city does that.

They do replant in the same hole pretty much every time a tree dies so we need to figure it out. We have functionality on the FE to edit and delete but we removed it last year because it is a can of worms. The question is, how do we know for sure which is the correct data. What if a user replanted something and the city has old data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may have trees in our db that do not exist in the source data...

Similar to the update, we can delete based on id_source_name. If a tree gets created in our db with the same id_source_name as one of our data sources, then we won't be able to easily differentiate it from the data in a file.

Yeah that's a real issue. Because the id_source_name IS the city name. I suppose we need to differentiate between what's city data and what is user data coming in from the field via the FE. Even still, which takes precedence. We could have a "field_data" vs "source_data" column value or something and then just grab whichever has the most recent date. A lot of sources don't have a modified or last maintained fiield tho.

AND (
_tree.dbh <> treedata.dbh
OR _tree.address <> treedata.address
OR _tree.health <> treedata.health
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to add some more columns here, but this is just a draft PR to get me started.

AND (c_id_source_name is null or ts.id_source_name = c_id_source_name);

/*
Insert with a left join based on the id_tree column
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upserts can work, but left joins can be faster and more importantly, cleaner. The code to do an upsert on a bunch of columns in one sql statement can get big and cumbersome. It can be nicer to split it out.

updated,
planted,
note,
address
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to add some more columns here, like dbh_min and dbh_max. I just need some more sample data to play with to test this out.

We just merged the PR to run the staging tables from the tree-sources process, so this should be reasonable to test.

DELETE
FROM treedata_staging
WHERE 1=1
AND c_id_source_name is null or id_source_name = c_id_source_name
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FIXME: Put the or statement into parenthesis.

note,
address
)
SELECT id as id_tree,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: Be careful with the ids, there's a lot of them... Wondering if we shot ourselves in the foot with calling the treehash "id"...

id is the unique hash of the tree that we create with our tree-id repo.
id_tree is the serialized db table id
id_source_name is the source's (like san_francisco)
id_reference is the city's/sources reference id(this may or may not exist)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think given all the changes lately, I need to refresh my database. In the latest version I have, I don't have an id column on the treedata table. I also called id on treedata_staging what I think should be id_tree, so I will have to fix that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The id might be hiding midway down the table. Its sandwiched

    last_watered timestamp without time zone,
    id bigint NOT NULL,
    id_source_name character varying(255),

FROM _tree
LEFT JOIN treedata on (
treedata.id_source_name = _tree.id_source_name
AND treedata.id_tree = _tree.id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should not be the same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They shouldn't, but they are. I need to fix the tree-sources repo to fix that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I am reading it wrong, I was looking at this code
https://github.com/waterthetrees/tree-sources/blob/main/src/stages/normalize.js#L95

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean these: treedata.id_tree = _tree.id shouldnt be the same.
One is the table's serial number, the other is the created hash of the tree based off common, scientific, etc.

Copy link
Member

@zoobot zoobot Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screen Shot 2023-03-28 at 9 21 00 PM

Copy link
Member

@zoobot zoobot Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lines 406, 408, 446, and 447 are all different. id_wtt needs to be removed at some point as it is no longer in use.

reposting here:
line 446: id is the unique hash of the tree that we create with our tree-id repo.
line 406: id_tree is the serialized db table id
line 447: id_source_name is the source's (like san_francisco)
line 408: id_reference is the city's/sources reference number(this may or may not exist)

@zoobot
Copy link
Member

zoobot commented Jul 27, 2023

HI @tzinckgraf Are you interested in working on this anymore or should I un-assign? Is this a variation of the other branch with a similar name?

@tzinckgraf
Copy link
Contributor Author

Hey @zoobot, I apologize, I won't have the bandwidth for at least another month. You can unassign.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Help Wanted
Development

Successfully merging this pull request may close these issues.

2 participants