medians, etc are aggregated incorrectly #1
Comments
other (non-median) kinds of tables which are maybe not amenable to the basic aggregation method.
Other table types that don't have a "denominator column" and so may need a bit more thought:
|
…OK. Skeptical about MOE aggregation
Is this the same neck of the woods we're wandering here? |
I think it is. Looks like you guys have a better line on best practices. |
We have an algorithm working that seems pretty accurate based on some initial testing: https://github.com/datadesk/latimes-calculate/blob/pareto/calculate/pareto.py |
Seems being the key word in that sentence. |
OK, so now that I read the code, we are not working on the same problem. The STF3-P80 table from 1990 referenced in Steve's original SAS offers counts of households in each income bracket. I don't exactly understand why they did the exercise, since P80A offers the median income -- unless they were specifically interested to see how the formula compared to the published value. My question is: given a list of median income by census tract in a neighborhood, can I compute the median income for the neighborhood? |
The objective of the so-called "Pareto" formula is to calculate an estimated median in cases where you are combining geographies. |
I missed that part. Your code refers to bins of known size, etc. Show me what I overlooked! |
The idea is to sum the counts/people for each bin in the areas you want to combine, then run the algorithm. We need it right now to calculate median income for combined counties. To combine it across multiple areas you need a table with actual counts of people, and the only way to get that is to go with the brackets (that we know of). It's absolutely an estimate, but when we run it against data like the table above -- where you have the brackets and an actual median -- the estimate comes back very accurate. I've reached out to the Census for help/confirmation of the method, but I haven't heard back yet. |
OK I heard back from the Census and they pointed me to pages 16 and 17 of this document: http://www.census.gov/content/dam/Census/programs-surveys/sipp/tech-documentation/source-accuracy-statements/2008/SIPP%202008%20Panel%20Wave%2005%20-%20Core%20Source%20and%20Accuracy%20Statements.pdf What we've been calling Pareto is actually linear interpolation (we'll have to rename that), though it seems they use both depending on the application. I think we're going to stick with linear for our purposes. |
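To make the sum-then-interpolate idea above concrete, here is a minimal sketch of linear interpolation on combined brackets. It is not the code from `latimes-calculate`; the function name `combined_median`, the bracket bounds, and the counts are all hypothetical, and it assumes every bracket has a finite upper bound (an open-ended top bracket would need separate handling).

```python
def combined_median(brackets, counts_by_area):
    """Estimate a median for combined areas by linear interpolation.

    brackets: list of (lower, upper) bounds, in ascending order.
    counts_by_area: one list of counts per area, aligned with brackets.
    Hypothetical helper for illustration only.
    """
    # Step 1: sum the counts for each bracket across all areas.
    totals = [sum(column) for column in zip(*counts_by_area)]
    n = sum(totals)
    if n == 0:
        raise ValueError("no observations to compute a median from")

    # Step 2: find the bracket that contains the midpoint of the distribution.
    half = n / 2.0
    cumulative = 0
    for (lower, upper), count in zip(brackets, totals):
        if count > 0 and cumulative + count >= half:
            # Step 3: assume values are spread evenly inside the bracket
            # and move the matching fraction of the way into its range.
            fraction = (half - cumulative) / count
            return lower + fraction * (upper - lower)
        cumulative += count
    raise ValueError("median bracket not found")


# Made-up income brackets and household counts for two areas.
brackets = [(0, 10000), (10000, 20000), (20000, 30000)]
area_a = [10, 20, 10]
area_b = [5, 10, 45]
estimate = combined_median(brackets, [area_a, area_b])
```

With these made-up numbers the combined counts are 15, 30, and 55, so the midpoint (the 50th of 100 households) falls 5/55 of the way into the top bracket. The key point is that only the counts are summed across areas; the published medians of the individual areas are never used.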
So, if it's because I just ate lunch, forgive me, but here's where I break down:
(Doig quote, lines 39-45) With age buckets, there's a linear progression, so you can say that the midpoint is "0.032 into the range." With medians by geography, even with population counts, there's no "range" to be some fraction into. And there's no income. Is your idea that for each explicit median value (as in table P80A in the example) we would need to identify a bucketed table (P80) and to produce a "P80A equivalent"? For my purposes, I'd have to scout around in the ACS to see if I can find those. |
I made a gist which may identify pairs suitable for use with these interpolation methods. And some tables I'm less clear about. |
I was running into the same issue. Pareto interpolation was very helpful. Thanks a lot! |
It seems we might want to distinguish between two types of functions: aggregations and reductions. Aggregations are functions that take a sequence of values and return a value of the same semantic type. For count data, a sum is an aggregation: the sum of the subunit counts is the count of the composite.

Then there are reductions, like medians, averages, and standard deviations. An average of counts is not the count of the composite. Things can get a little confusing because sometimes a reduction of reductions is an aggregation: a weighted average of averages, with each subunit weighted by its number of elements, can be the overall average for the composite. However, this is generally not true. The unweighted average of averages is not the overall average (except when each subunit has the same number of elements). The average of medians is not the overall median of the composite and, unless conditions are unusual, does not even approximate it. The same is true of the median of medians.

For calculating the median of a composite with census data, the most correct procedure is to aggregate (i.e., sum) the binned data and then interpolate the composite median using a linear, Pareto, or other appropriate method. |
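A small numeric check of the distinction drawn above, using made-up values for two subunits (`tract_a` and `tract_b` are hypothetical, not census data). It only illustrates that a size-weighted average of averages recovers the composite average, while combining medians does not recover the composite median.

```python
from statistics import mean, median

# Hypothetical raw values for two subunits of unequal size.
tract_a = [10, 20, 30]
tract_b = [40, 50, 200, 300, 1000]
composite = tract_a + tract_b

# Averages: weighting each subunit average by its size recovers the
# composite average; the unweighted average of averages does not.
sizes = [len(tract_a), len(tract_b)]
means = [mean(tract_a), mean(tract_b)]
weighted_mean = sum(s * m for s, m in zip(sizes, means)) / sum(sizes)
unweighted_mean = mean(means)
composite_mean = mean(composite)

# Medians: neither the plain nor the size-weighted average of the
# subunit medians matches the composite median here.
medians = [median(tract_a), median(tract_b)]
average_of_medians = mean(medians)
weighted_average_of_medians = (
    sum(s * m for s, m in zip(sizes, medians)) / sum(sizes)
)
composite_median = median(composite)

print(weighted_mean, unweighted_mean, composite_mean)
print(average_of_medians, weighted_average_of_medians, composite_median)
```

With only two subunits the median of medians coincides with the average of medians, so it is not shown separately. In practice the raw values are not available from published tables, which is why the thread falls back on summing bracket counts and interpolating.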
Here is a prescribed method for averaging medians https://www.chrispy.net/pipermail/ctpp-news/2011-January/005355.html