
'universe' target should not be an option for normalization #326

Closed · michellemho opened this issue Sep 21, 2017 · 9 comments

@michellemho (Contributor)

There are two types of "targets" in bigmetadata: universe and denominator. Denominator targets indicate how a variable can be normalized. The universe target captures the variable used to calculate a median or per capita measurement. We should remove universe targets from the normalization options. We might also remove the universe target entirely; is there any known use for it? I'm not sure.

(screenshot attached)

@javitonino

We have removed universe completely in the last dump and are testing right now to see whether we find any problems.

@michellemho (Contributor, Author)

Discovered a problem with the removal of 'universe' metadata today!

Enriching polygons with US median household income returns the following error: "Analysis B2 failed: {"Median or average aggregation for polygons requires a denominator to provide weights. Please review the provided options"}"

Previously, I thought universe targets were not used in the DO query, but I was proven wrong today. I believe its use comes up here, with predenominated polygon interpolation weighted by universe:

https://github.com/CartoDB/observatory-extension/blob/develop/src/pg/sql/41_observatory_augmentation.sql#L718

@javitonino commented Nov 6, 2017

TODO:

  • Review if the universe makes sense as a weight for medians
  • Revert previous commit (may need additional work to merge correctly)
  • Review that all median columns have universe
  • Think about universe denominators in metadata generation

@antoniocarlon (Contributor) commented Nov 6, 2017

The universe denominator is taken into account when aggregating measurements that are medians or averages. This makes sense, as the correct way to aggregate data that is already aggregated and weighted (like medians and averages) is to unweight the measurement, aggregate, and weight it again (it doesn't make sense to weight using another measurement, like the area of the affected geometry).

It is done using a formula like this:

SUM(numerator * denominator) / SUM(denominator)
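For illustration, a minimal SQL sketch of that unweight/aggregate/reweight step, assuming a hypothetical table do_geoms with one row per affected DO geometry carrying the already-weighted value (numer) and its universe (denom); the names are illustrative, not the extension's actual ones:

    -- Unweight the median/average with its universe (e.g. total households),
    -- sum across geometries, then divide by the summed universe.
    SELECT SUM(numer * denom) / NULLIF(SUM(denom), 0) AS aggregated_value
    FROM do_geoms;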

The main conclusion is that we need to revert the changes made to remove universe.

@antoniocarlon antoniocarlon self-assigned this Nov 6, 2017
@javitonino commented Nov 6, 2017

Question time.

  1. Shouldn't this also take into account the area? I mean, assuming "average age" and "total population" (see the sketch after this list):
  • For each intersected boundary: multiply (average age) * (total pop) to get the sum of ages
  • Calculate the totals (weighted by area) in the intersection area. That is, the total denominator (sum of total_pop) and the total age (sum of "sum of ages")
  • Calculate the value (sum of ages / total pop) of the intersection area
  2. I don't think we want to directly revert the changes, since those had their own set of problems. From what you say, I think it makes sense to:
  • Revert the ETL changes (add universe denominators back)
  • Modify meta generation so it does not include universe (or at least, not present it to the user)
  • Make the extension automatically use the universe when it needs to do an aggregation. I don't think it makes sense to ask the user to select this, as it's the only viable option.
  3. Does this make sense for medians? I don't think there is a way you can recover original data from them.
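A rough SQL sketch of the steps in question 1, with hypothetical table names (do_geoms(the_geom, avg_age, total_pop) for the DO boundaries and user_polygons(the_geom) for the drawn area):

    -- Unweight the average, weight each piece by the share of the DO geometry
    -- covered by the user polygon, then divide by the weighted population.
    SELECT
      SUM( (d.avg_age * d.total_pop)
           * ST_Area(ST_Intersection(d.the_geom, u.the_geom))
             / NULLIF(ST_Area(d.the_geom), 0) )
      / NULLIF(SUM( d.total_pop
           * ST_Area(ST_Intersection(d.the_geom, u.the_geom))
             / NULLIF(ST_Area(d.the_geom), 0) ), 0) AS avg_age_in_area
    FROM do_geoms d
    JOIN user_polygons u
      ON ST_Intersects(d.the_geom, u.the_geom);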

@antoniocarlon (Contributor) commented Nov 6, 2017

1. I'm currently at this very exact point, so I'm removing the last sentence from my previous comment. I'm investigating several measurements to see whether it makes sense to weight the medians/averages with the affected area of the geometry; the conclusions are below.

  1. The current approach is incorrect. Given this example:
    (diagram attached)
    We need to take into account the area of the intersection of the user geometry with the DO geometries (A, B and C) to correctly measure the weight of each part. This is valid for all the medians/averages in the ETL.
  2. We need to bring back the universes, but we need to investigate how to handle the medians/averages and universes, because the current approach is an approximation to simplify the vals part of OBS_GetData.
  3. Nope, you can't recover original data from medians, but multiplying the median by the weight can be a good enough approximation (take this with a grain of salt). That's the main reason why the original approach wasn't so bad.

@michellemho (Contributor, Author)

Answering the questions:

  1. I thought the current approach does take into account the area of the intersection of the user geometry and the DO geometry (see line 716). The formula is:

    SUM(numer * denom * (% user geom in OBS geom)) / SUM(denom * (% user geom in OBS geom))

    There appears to be a factor, _procgeoms.pct_obs, that achieves this, but I'm not sure how.

  2. @javitonino, I agree with the steps you suggest. We need to revert, remove universe from the metadata available to users, and automatically use the universe for aggregation. Can we make use of the denom_reltype to help determine this? The denom_reltype is returned by OBS_GetMeta. When writing the OBSColumns task, the targets of a column are written as a dictionary of all related columns, with the format OBSColumn: reltype. Reltypes can be universe or denominator. (A sketch follows after this list.)

  3. I agree with @antoniocarlon, you can't recover original data with medians but it's a "good enough" approximation with plenty of grains of salt!
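A hedged sketch of that reltype lookup, assuming the metadata keeps column relations in a table shaped like observatory.obs_column_to_column(source_id, target_id, reltype); both the table layout and the column id below are assumptions to be double-checked against bigmetadata:

    -- Find the universe column for a given numerator (column id is illustrative).
    SELECT target_id AS universe_column_id
    FROM observatory.obs_column_to_column
    WHERE source_id = 'us.census.acs.B19013001'  -- e.g. median household income
      AND reltype = 'universe';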

@andrewbt commented Nov 8, 2017

Wanted to add my two cents on the big-picture customer perspective and some resources I've found while researching this problem.

The big picture “objective” is that the DO analysis should be able to enrich arbitrary polygons with measures that are medians and measures that are averages. It’s a common customer use case to draw an AOI and ask “what is the median income in this area?” or "what's the average age?"

I saw this issue/other project with discussion that sounds very helpful in terms of performing "medians of medians": censusreporter/census-aggregate#1

It seems there's something called Pareto Interpolation: https://en.wikipedia.org/wiki/Pareto_interpolation

And also: http://mumford.albany.edu/census/CityProfiles/Profiles/MHHINote.htm

While reading those links and their approaches, I got another related idea that's a bit broader but would also be great to address somehow. It's been a common customer request (one we haven't been able to deliver on) to ask "how many people making more than $50,000 annually live in this arbitrary AOI?" or "how many people between the ages of 20 and 45 live in this AOI?" But we've never been able to do "population above $X" / "population between $X and $Y" or "population above age A" / "population between ages A and B" queries, because the Census data (at least for the USA) has those values binned in $5,000 and 5-year increments: we have "pop with age 25 to 29", 30 to 34, 35 to 39, etc., and similar $5k bins for income counts (and probably other measures too). Someone would have to enrich a separate column for each 5-value bin to approximate their requested range, and then use SQL arithmetic / field calculation to add up those column values and come up with a sum total for their range, or at least a 5-value-rounded estimate of it (rough example below).
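A rough illustration of that column-summing workaround (the bin column names are made up):

    -- Approximate "population aged 20 to 45" by summing the enriched
    -- 5-year bin columns; 20-44 is the closest bin-aligned range.
    SELECT
      pop_20_to_24 + pop_25_to_29 + pop_30_to_34
        + pop_35_to_39 + pop_40_to_44 AS pop_20_to_44_approx
    FROM my_enriched_table;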

If we think about brand demographic profiles like "affluent millennials", they're defined as ranges: 20-35 y/o, $50-100k, etc. There's no easy way to use the DO as it is to analyze for profiles like that, and we might be able to get there using some of the same approaches as doing median income for polygons...

@antoniocarlon (Contributor)

Answering @michellemho

  1. You are right, the OBS_GetData function takes into account the user geometry and weights the area of the DO geometries affected by the user geometry. The formula that you mention is equivalent to the method that @javitonino proposed:
SUM(numerator * denominator * affected_user_area / observatory_geom_area) / SUM(denominator * affected_user_area / observatory_geom_area)

Each affected_user_area / observatory_geom_area factor (named _procgeoms.pct_obs, as you said) is calculated here using the following algorithm (see the sketch after this list):

  • If the user geometry completely lies within a DO geometry, pct_obs is calculated as ST_Area(user_geometry) / ST_Area(DO_geometry)
  • If a DO geometry completely lies within the user geometry, pct_obs is 1
  • Otherwise, we calculate the intersection and pct_obs is calculated as ST_Area(ST_Intersection(user_geometry, DO_geometry)) / ST_Area(DO_geometry)
  2. The OBS_GetMeta function returns the available denominator for a given numerator using the OBS_META table, and sets prenormalized as the normalization when the aggregation is median or average and the denom_reltype is universe (see here), even if the user doesn't explicitly set the denom_id in the OBS_GetMeta parameters, so I think OBS_GetMeta is currently behaving as expected.
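For reference, a sketch of those pct_obs rules in SQL; geom_pairs, user_geom and do_geom are hypothetical names for a pair of user/DO geometries, not the extension's internals:

    -- The three cases described above: user geom inside DO geom,
    -- DO geom inside user geom, and partial overlap.
    SELECT
      CASE
        WHEN ST_Within(user_geom, do_geom)
          THEN ST_Area(user_geom) / NULLIF(ST_Area(do_geom), 0)
        WHEN ST_Within(do_geom, user_geom)
          THEN 1.0
        ELSE ST_Area(ST_Intersection(user_geom, do_geom))
               / NULLIF(ST_Area(do_geom), 0)
      END AS pct_obs
    FROM geom_pairs;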

@antoniocarlon antoniocarlon removed their assignment Nov 10, 2017
@antoniocarlon antoniocarlon removed their assignment Nov 14, 2017