-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
What would be the best way to query and return all the variables available on a table? #62
Comments
I'm a fan of how some packages in R return census metadata in a data frame, so I have played around with returning a pandas data frame with the information. An example of a returned data frame might look something like: However, I'm not sure if printing a data frame to the console works as well in a typical python development environment. Also, if this thread gains traction, issue #55 would become redundant. |
Thanks for the input. I don't know if a dataframe output is within the scope of where this package wants to end up, but I do wonder if we could achieve some of the same things by packaging the JSON metadata from the API in a vanilla Python object to help grab extra data from related URLs, offer annotations and labels with values, etc. |
As noted in #61 Taking the variable names as they come from the API, they're pretty unwieldy. The median string length of variable labels is 65, in part because they're concatenations of all of the parent labels. If you split the concatenation, the median variable has four parts (three parents). What the heck, here's the full Definitely interested in the problem; wonder about the relative value of reproducing things that |
this is a crude way to format labels, leveraging It sort of breaks down for tables which aren't in a tree structure, like B01002 Median Age: I guess one could infer here that the labels are median age, but I'd have preferred output more like:
There are a few detailed tables I can't recall off hand which are kind of mixed this way -- the labels aren't a strict tree. And I don't think this would adapt very well to the data profiles or subject tables. |
Could the census be open to having all tables be tree structured? @loganpowell |
IMHO, having the library so closely integrated with pandas is a bug, not a feature. I'd like to see a lower-level mapping of Python to data, which can then be rolled up into dataframes further on down the line. |
The challenge is that there are no variables corresponding to those "missing" tree nodes. In the Excel spreadsheet which has served as the best structured metadata we have for Census Reporter, variables have "line numbers", and the non-variable labels have select * from acs2017_1yr.census_column_metadata where table_id = 'B01002';
table_id | line_number | column_id | column_title | indent | parent_column_id
----------+-------------+-------------+---------------+--------+------------------
B01002 | 0.5 | B01002000.5 | Median age -- | 0 |
B01002 | 1.0 | B01002001 | Total: | 1 | B01002000.5
B01002 | 2.0 | B01002002 | Male | 1 | B01002000.5
B01002 | 3.0 | B01002003 | Female | 1 | B01002000.5
(4 rows) |
the labels are also structured quite differently for different surveys (ACS vs, e.g. 2000 or 1990 decennial) |
Hi folks! Sorry for the delay. I believe @palewire with https://api.census.gov/data/2016/acs/acs5/groups/B16001.json is on the right track. Unfortunately, @fgregg the Census API team is not taking feature requests (tree structure) as their top priority for the foreseeable future is getting the rest of the house of Census data into the API... You all have noted that the |
Would be happy to collaborate/merge where possible. I've just merged and released a better default variable explorer using dataframes in |
Let's say I'm interested in exploring a topic I've never worked with before. Say, for example, Census data tracking languages spoken at home.
In this use case, even if I am a very skilled Python programmer (lol, remember this is a fantasy) and even if I have some experience with Census data, I am unlikely to know either the table id (in this case, B16001) or the fields on the table (e.g. that B16001_024E is the total number of people who speak Greek at home).
In a perfect world, I think this library would help the user find the table they're interested in, and then enumerate the fields within it.
I took a baby step in this direction with PR #61, which allows you get to a raw JSON list of all tables, but it's a long way from a solution. The good news is that all of the data we need to solve the problem is available in the Census API. https://api.census.gov/data/2016/acs/acs5/groups/B16001.json
If the crowd here agrees this is a worthy problem statement and use case, I'd like to start a discussion about how to solve it.
The text was updated successfully, but these errors were encountered: