What would be the best way to query and return all the variables available on a table? #62

palewire · 2018-11-28T19:50:39Z

Let's say I'm interested in exploring a topic I've never worked with before. Say, for example, Census data tracking languages spoken at home.

In this use case, even if I am a very skilled Python programmer (lol, remember this is a fantasy) and even if I have some experience with Census data, I am unlikely to know either the table id (in this case, B16001) or the fields on the table (e.g. that B16001_024E is the total number of people who speak Greek at home).

In a perfect world, I think this library would help the user find the table they're interested in, and then enumerate the fields within it.

I took a baby step in this direction with PR #61, which allows you get to a raw JSON list of all tables, but it's a long way from a solution. The good news is that all of the data we need to solve the problem is available in the Census API. https://api.census.gov/data/2016/acs/acs5/groups/B16001.json

If the crowd here agrees this is a worthy problem statement and use case, I'd like to start a discussion about how to solve it.

mr-fuller · 2018-11-28T20:39:42Z

I'm a fan of how some packages in R return census metadata in a data frame, so I have played around with returning a pandas data frame with the information. An example of a returned data frame might look something like:

However, I'm not sure if printing a data frame to the console works as well in a typical python development environment.

Also, if this thread gains traction, issue #55 would become redundant.

palewire · 2018-11-28T20:44:13Z

Thanks for the input. I don't know if a dataframe output is within the scope of where this package wants to end up, but I do wonder if we could achieve some of the same things by packaging the JSON metadata from the API in a vanilla Python object to help grab extra data from related URLs, offer annotations and labels with values, etc.

JoeGermuska · 2018-11-28T21:50:55Z

As noted in #61 cenpy has some of this functionality, although it depends upon dataframes -- see the demo notebook to see it in action.

Taking the variable names as they come from the API, they're pretty unwieldy. The median string length of variable labels is 65, in part because they're concatenations of all of the parent labels. If you split the concatenation, the median variable has four parts (three parents).

What the heck, here's the full describe on those two factors:

Definitely interested in the problem; wonder about the relative value of reproducing things that cenpy already does -- I've only recently been playing around with any of the python census libraries.

JoeGermuska · 2018-11-28T22:16:54Z

this is a crude way to format labels, leveraging cenpy, but which could be adapted to raw API output. (v is as defined in the previous comment)

It sort of breaks down for tables which aren't in a tree structure, like B01002 Median Age:

I guess one could infer here that the labels are median age, but I'd have preferred output more like:

Median Age
  Total
  Male
  Female

There are a few detailed tables I can't recall off hand which are kind of mixed this way -- the labels aren't a strict tree. And I don't think this would adapt very well to the data profiles or subject tables.

fgregg · 2018-11-28T22:27:38Z

Could the census be open to having all tables be tree structured? @loganpowell

palewire · 2018-11-28T22:35:05Z

IMHO, having the library so closely integrated with pandas is a bug, not a feature. I'd like to see a lower-level mapping of Python to data, which can then be rolled up into dataframes further on down the line.

JoeGermuska · 2018-11-28T22:36:38Z

The challenge is that there are no variables corresponding to those "missing" tree nodes.

In the Excel spreadsheet which has served as the best structured metadata we have for Census Reporter, variables have "line numbers", and the non-variable labels have n.5 numbers...

select * from acs2017_1yr.census_column_metadata where table_id = 'B01002';
 table_id | line_number |  column_id  | column_title  | indent | parent_column_id 
----------+-------------+-------------+---------------+--------+------------------
 B01002   |         0.5 | B01002000.5 | Median age -- |      0 | 
 B01002   |         1.0 | B01002001   | Total:        |      1 | B01002000.5
 B01002   |         2.0 | B01002002   | Male          |      1 | B01002000.5
 B01002   |         3.0 | B01002003   | Female        |      1 | B01002000.5
(4 rows)

knaaptime · 2018-11-28T22:40:03Z

the labels are also structured quite differently for different surveys (ACS vs, e.g. 2000 or 1990 decennial)

loganpowell · 2018-11-29T21:08:31Z

Hi folks! Sorry for the delay. I believe @palewire with https://api.census.gov/data/2016/acs/acs5/groups/B16001.json is on the right track. Unfortunately, @fgregg the Census API team is not taking feature requests (tree structure) as their top priority for the foreseeable future is getting the rest of the house of Census data into the API... You all have noted that the /variables.json list (especially for ACS) is huge (@40k variables).

ljwolf · 2019-03-15T16:28:54Z

Would be happy to collaborate/merge where possible. I've just merged and released a better default variable explorer using dataframes in cenpy. More than happy to recommend census as a good frontend to packages of datasets built on top of cenpy, or to consider where to merge things together...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

What would be the best way to query and return all the variables available on a table? #62

What would be the best way to query and return all the variables available on a table? #62

palewire commented Nov 28, 2018 •

edited

Loading

mr-fuller commented Nov 28, 2018

palewire commented Nov 28, 2018 •

edited

Loading

JoeGermuska commented Nov 28, 2018 •

edited

Loading

JoeGermuska commented Nov 28, 2018

fgregg commented Nov 28, 2018

palewire commented Nov 28, 2018 •

edited

Loading

JoeGermuska commented Nov 28, 2018 •

edited

Loading

knaaptime commented Nov 28, 2018

loganpowell commented Nov 29, 2018 •

edited

Loading

ljwolf commented Mar 15, 2019

What would be the best way to query and return all the variables available on a table? #62

What would be the best way to query and return all the variables available on a table? #62

Comments

palewire commented Nov 28, 2018 • edited Loading

mr-fuller commented Nov 28, 2018

palewire commented Nov 28, 2018 • edited Loading

JoeGermuska commented Nov 28, 2018 • edited Loading

JoeGermuska commented Nov 28, 2018

fgregg commented Nov 28, 2018

palewire commented Nov 28, 2018 • edited Loading

JoeGermuska commented Nov 28, 2018 • edited Loading

knaaptime commented Nov 28, 2018

loganpowell commented Nov 29, 2018 • edited Loading

ljwolf commented Mar 15, 2019

palewire commented Nov 28, 2018 •

edited

Loading

palewire commented Nov 28, 2018 •

edited

Loading

JoeGermuska commented Nov 28, 2018 •

edited

Loading

palewire commented Nov 28, 2018 •

edited

Loading

JoeGermuska commented Nov 28, 2018 •

edited

Loading

loganpowell commented Nov 29, 2018 •

edited

Loading