
'universe' target should not be an option for normalization #326

Closed · michellemho opened this issue Sep 21, 2017 · 9 comments

@michellemho (Contributor)

There are two types of "targets" in bigmetadata: universe and denominator. Denominator targets indicate how a variable can be normalized. The universe target captures the variable used to calculate a median or per capita measurement. We should remove universe targets from the normalization options. We might also remove the universe target entirely; is there any known use for it? I'm not sure.

(screenshot attached)

@javitonino

We have removed universe completely in the last dump and are testing right now to see whether we find any problems.

@michellemho (Contributor, Author)

Discovered a problem with the removal of 'universe' metadata today!

Enriching polygons with US median household income returns the following error: "Analysis B2 failed: {"Median or average aggregation for polygons requires a denominator to provide weights. Please review the provided options"}"

Previously, I thought universe targets were not used in the DO query, but I was proven wrong today. I believe its use comes up here, with predenominated polygon interpolation weighted by universe:

https://github.com/CartoDB/observatory-extension/blob/develop/src/pg/sql/41_observatory_augmentation.sql#L718

@javitonino commented Nov 6, 2017

TODO:

  • Review if the universe makes sense as a weight for medians
  • Revert previous commit (may need additional work to merge correctly)
  • Review that all median columns have universe
  • Think about universe denominators in metadata generation

@antoniocarlon (Contributor) commented Nov 6, 2017

The universe denominator is taken into account when aggregating measurements that are medians or averages. This makes sense, as the correct way to aggregate data that is already aggregated and weighted (like medians and averages) is to unweight the measurement, aggregate, and weight it again (it doesn't make sense to weight using another measurement, like the area of the affected geometry).

It is done using a formula like this:

SUM(numerator * denominator) / SUM(denominator)
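For illustration, a minimal SQL sketch of that unweight/aggregate/reweight step, assuming a hypothetical table do_geoms with one row per affected DO geometry carrying the already-weighted value (numer) and its universe (denom); the names are illustrative, not the extension's actual ones:

    -- Unweight the median/average with its universe (e.g. total households),
    -- sum across geometries, then divide by the summed universe.
    SELECT SUM(numer * denom) / NULLIF(SUM(denom), 0) AS aggregated_value
    FROM do_geoms;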

The main conclusion is that we need to revert the changes made to remove universe.

@antoniocarlon antoniocarlon self-assigned this Nov 6, 2017
@javitonino commented Nov 6, 2017

Question time.

  1. Shouldn't this also take into account the area? I mean, assuming "average age" and "total population" (see the sketch after this list):
  • For each intersected boundary: multiply (average age) * (total pop) to get the sum of ages
  • Calculate the totals (weighted by area) in the intersection area. That is, the total denominator (sum of total_pop) and the total age (sum of "sum of ages")
  • Calculate the value (sum of ages / total pop) of the intersection area
  2. I don't think we want to directly revert the changes, since those had their own set of problems. From what you say, I think it makes sense to:
  • Revert the ETL changes (add universe denominators back)
  • Modify meta generation so it does not include universe (or at least, not present it to the user)
  • Make the extension automatically use the universe when it needs to do an aggregation. I don't think it makes sense to ask the user to select this, as it's the only viable option.
  3. Does this make sense for medians? I don't think there is a way you can recover original data from them.
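A rough SQL sketch of the steps in question 1, with hypothetical table names (do_geoms(the_geom, avg_age, total_pop) for the DO boundaries and user_polygons(the_geom) for the drawn area):

    -- Unweight the average, weight each piece by the share of the DO geometry
    -- covered by the user polygon, then divide by the weighted population.
    SELECT
      SUM( (d.avg_age * d.total_pop)
           * ST_Area(ST_Intersection(d.the_geom, u.the_geom))
             / NULLIF(ST_Area(d.the_geom), 0) )
      / NULLIF(SUM( d.total_pop
           * ST_Area(ST_Intersection(d.the_geom, u.the_geom))
             / NULLIF(ST_Area(d.the_geom), 0) ), 0) AS avg_age_in_area
    FROM do_geoms d
    JOIN user_polygons u
      ON ST_Intersects(d.the_geom, u.the_geom);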

@antoniocarlon (Contributor) commented Nov 6, 2017

1. I'm currently at this very exact point, so I'm removing the last sentence from my previous comment. I'm investigating several measurements to see whether it makes sense to weight the medians/averages with the affected area of the geometry; the conclusions are below.

  1. The current approach is incorrect. Given this example:
    (diagram attached)
    We need to take into account the area of the intersection of the user geometry with the DO geometries (A, B and C) to correctly measure the weight of each part. This is valid for all the medians/averages in the ETL.
  2. We need to bring back the universes, but we need to investigate how to handle the medians/averages and universes, because the current approach is an approximation to simplify the vals part of OBS_GetData.
  3. Nope, you can't recover original data from medians, but multiplying the median by the weight can be a good enough approximation (take this with a grain of salt). That's the main reason why the original approach wasn't so bad.

@michellemho (Contributor, Author)

Answering the questions:

  1. I thought the current approach does take into account the area of the intersection of the user geometry and the DO geometry (see line 716). The formula is:

    SUM(numer * denom * (% user geom in OBS geom)) / SUM(denom * (% user geom in OBS geom))

    There appears to be a factor, _procgeoms.pct_obs, that achieves this, but I'm not sure how.

  2. @javitonino, I agree with the steps you suggest. We need to revert, remove universe from the metadata available to users, and automatically use the universe for aggregation. Can we make use of the denom_reltype to help determine this? The denom_reltype is returned by OBS_GetMeta. When writing the OBSColumns task, the targets of a column are written as a dictionary of all related columns, with the format OBSColumn: reltype. Reltypes can be universe or denominator. (A sketch follows after this list.)

  3. I agree with @antoniocarlon, you can't recover original data with medians but it's a "good enough" approximation with plenty of grains of salt!
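A hedged sketch of that reltype lookup, assuming the metadata keeps column relations in a table shaped like observatory.obs_column_to_column(source_id, target_id, reltype); both the table layout and the column id below are assumptions to be double-checked against bigmetadata:

    -- Find the universe column for a given numerator (column id is illustrative).
    SELECT target_id AS universe_column_id
    FROM observatory.obs_column_to_column
    WHERE source_id = 'us.census.acs.B19013001'  -- e.g. median household income
      AND reltype = 'universe';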

@andrewbt commented Nov 8, 2017

Wanted to add my two cents on the big-picture customer perspective and some resources I've found while researching this problem.

The big picture “objective” is that the DO analysis should be able to enrich arbitrary polygons with measures that are medians and measures that are averages. It’s a common customer use case to draw an AOI and ask “what is the median income in this area?” or "what's the average age?"

I saw this issue/other project with discussion that sounds very helpful in terms of performing "medians of medians": censusreporter/census-aggregate#1

It seems there's something called Pareto Interpolation: https://en.wikipedia.org/wiki/Pareto_interpolation

And also: http://mumford.albany.edu/census/CityProfiles/Profiles/MHHINote.htm

While reading those links and their approaches, I got another related idea that's a bit broader but would also be great to address somehow. It's been a common customer request (one we haven't been able to deliver on) to ask "how many people making more than $50,000 annually live in this arbitrary AOI?" or "how many people between the ages of 20 and 45 live in this AOI?" But we've never been able to do "population above $X" / "population between $X and $Y" or "population above age A" / "population between ages A and B" queries, because the Census data (at least for the USA) has those values binned in $5,000 and 5-year increments: we have "pop with age 25 to 29", 30 to 34, 35 to 39, etc., and similar $5k bins for income counts (and probably other measures too). Someone would have to enrich a separate column for each 5-value bin to approximate their requested range, and then use SQL arithmetic / field calculation to add up those column values and come up with a sum total for their range, or at least a 5-value-rounded estimate of it (rough example below).
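A rough illustration of that column-summing workaround (the bin column names are made up):

    -- Approximate "population aged 20 to 45" by summing the enriched
    -- 5-year bin columns; 20-44 is the closest bin-aligned range.
    SELECT
      pop_20_to_24 + pop_25_to_29 + pop_30_to_34
        + pop_35_to_39 + pop_40_to_44 AS pop_20_to_44_approx
    FROM my_enriched_table;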

If we think about brand demographic profiles like "affluent millennials", they're defined as ranges: 20-35 y/o, $50-100k, etc. There's no easy way to use the DO as it is to analyze for profiles like that, and we might be able to get there using some of the same approaches as doing median income for polygons...

@antoniocarlon (Contributor)

Answering @michellemho

  1. You are right, the OBS_GetData function takes into account the user geometry and weights the area of the DO geometries affected by the user geometry. The formula that you mention is equivalent to the method that @javitonino proposed:
SUM(numerator * denominator * affected_user_area / observatory_geom_area) / SUM(denominator * affected_user_area / observatory_geom_area)

Each affected_user_area / observatory_geom_area factor (named _procgeoms.pct_obs, as you said) is calculated here using the following algorithm (see the sketch after this list):

  • If the user geometry completely lies within a DO geometry, pct_obs is calculated as ST_Area(user_geometry) / ST_Area(DO_geometry)
  • If a DO geometry completely lies within the user geometry, pct_obs is 1
  • Otherwise, we calculate the intersection and pct_obs is calculated as ST_Area(ST_Intersection(user_geometry, DO_geometry)) / ST_Area(DO_geometry)
  2. The OBS_GetMeta function returns the available denominator for a given numerator using the OBS_META table, and sets prenormalized as the normalization when the aggregation is median or average and the denom_reltype is universe (see here), even if the user doesn't explicitly set the denom_id in the OBS_GetMeta parameters, so I think OBS_GetMeta is currently behaving as expected.
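For reference, a sketch of those pct_obs rules in SQL; geom_pairs, user_geom and do_geom are hypothetical names for a pair of user/DO geometries, not the extension's internals:

    -- The three cases described above: user geom inside DO geom,
    -- DO geom inside user geom, and partial overlap.
    SELECT
      CASE
        WHEN ST_Within(user_geom, do_geom)
          THEN ST_Area(user_geom) / NULLIF(ST_Area(do_geom), 0)
        WHEN ST_Within(do_geom, user_geom)
          THEN 1.0
        ELSE ST_Area(ST_Intersection(user_geom, do_geom))
               / NULLIF(ST_Area(do_geom), 0)
      END AS pct_obs
    FROM geom_pairs;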

@antoniocarlon antoniocarlon removed their assignment Nov 10, 2017
@antoniocarlon antoniocarlon removed their assignment Nov 14, 2017