Introduction
Most U.S. Census data sets are keyed by geography. Concepts like population, median income, and gender, race, or age ranges of residents are only meaningful when we tie them to geographies like states, counties, or census tracts. The U.S. Census provides data at a wide variety of geographies, which nest in a hierarchy as shown here:
The lines in this image represent containment. Regions are fully contained in the nation; divisions are contained in regions; states are contained in divisions; counties are contained in states; census tracts are contained in counties; block groups are contained in tracts; and blocks are contained in block groups. All of the geographies down the center of the diagram are referred to as on-spine.
But there are also off-spine geographies, such as Core Based Statistical Areas (CBSAs), congressional districts, and many others. They may be contained in large on-spine geographies, but they don’t properly contain smaller on-spine geographies larger than a block.
For example, a CBSA is not necessarily contained in any on-spine geography below the nation. The Kansas City CBSA and the New York City CBSA both cross state lines.
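The on-spine nesting described above is also visible in the standard Census GEOID identifiers from the state level down, where each level's identifier is a prefix of the identifiers of everything it contains. The sketch below is a minimal illustration of that idea; the helper name `containing_geoids` is hypothetical and is not part of censusdis, and the sample GEOID is only an illustrative value. Regions and divisions are not encoded in these prefixes, so they are left out.

```python
# Cumulative GEOID lengths for on-spine levels, in nesting order:
# state (2 digits), + county (3), + tract (6), + block group (1), + rest of block (3).
GEOID_PREFIX_LENGTHS = {
    "state": 2,
    "county": 5,
    "tract": 11,
    "block group": 12,
    "block": 15,
}


def containing_geoids(geoid: str) -> dict:
    """Hypothetical helper: map each on-spine level at or above `geoid` to its GEOID."""
    if not geoid.isdigit() or len(geoid) not in GEOID_PREFIX_LENGTHS.values():
        raise ValueError(f"Not a state/county/tract/block group/block GEOID: {geoid!r}")
    # Every containing geography's GEOID is a prefix of the contained one.
    return {
        level: geoid[:length]
        for level, length in GEOID_PREFIX_LENGTHS.items()
        if length <= len(geoid)
    }


# A 12-digit, block-group-shaped GEOID (illustrative value).
spine = containing_geoids("360610001001")
for level, prefix in spine.items():
    print(f"{level:12s} {prefix}")
```

No such prefix relationship is available for off-spine geographies like CBSAs, which is why queries involving them need a different approach.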
Containing Geographies with censusdis
Often we want to look at smaller geographical areas, like census tracts, but only those that are contained within an off-spine geography like a CBSA or a congressional district. Unfortunately, the U.S. Census API does not let us do this directly, and early versions of the censusdis package, which wraps the U.S. Census API for Python users, didn’t either. But now, through the censusdis.data.contained_within() API, we can easily make this kind of query.
Here is an example of how we can query all of the census tracts in the New York City area CBSA (note that in the Census API and censusdis, CBSAs are called metropolitan_statistical_area_micropolitan_statistical_areas):
import censusdis.data as ced
from censusdis.datasets import ACS5
from censusdis.msa_msa import NEW_YORK_NEWARK_JERSEY_CITY_NY_NJ_PA_METRO_AREA

df_ny_tracts = ced.contained_within(
    metropolitan_statistical_area_micropolitan_statistical_area=NEW_YORK_NEWARK_JERSEY_CITY_NY_NJ_PA_METRO_AREA
).download(
    ACS5,
    2020,
    ["NAME", "B19013_001E"],
    state="*",
    county="*",
    tract="*",
)
The first clause of the query indicates that we are looking to download data for geographies that are contained within the NYC CBSA. This doesn’t mean we want data tied to that CBSA, but that we want to restrict the data from the next clause of the query to be for geographies that are contained by the CBSA.
The second clause is where we specify the dataset, vintage, variable names, and geographies we want, just like in a normal call to ced.download(). But in this case, we are asking for all states, all counties, and all tracts in the country. That’s a lot of census tracts. Because of the ced.contained_within() clause, however, we won’t actually download all of them. Instead, censusdis will first use map files downloaded from the U.S. Census to figure out which states overlap the CBSA, get data only for those states, and then use the maps again to filter that data down to the tracts that are physically contained in the CBSA.
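To make the two-step strategy described above more concrete, here is a toy sketch in plain Python. It is not how censusdis is implemented: real boundaries are polygons loaded from Census map files, while this sketch assumes simple axis-aligned rectangles, and every name in it (`Box`, `tracts_within`, the sample states and tracts) is hypothetical. It only illustrates the idea of a coarse overlap pass over states followed by a fine containment pass over their tracts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle standing in for a real boundary polygon (assumption)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def overlaps(self, other: "Box") -> bool:
        # True when the two rectangles share some interior area.
        return (self.min_x < other.max_x and other.min_x < self.max_x
                and self.min_y < other.max_y and other.min_y < self.max_y)

    def contains(self, other: "Box") -> bool:
        # True when `other` lies entirely inside this rectangle.
        return (self.min_x <= other.min_x and other.max_x <= self.max_x
                and self.min_y <= other.min_y and other.max_y <= self.max_y)


def tracts_within(area, states, tracts_by_state):
    """Return IDs of tracts contained in `area`, looking only at overlapping states."""
    # Step 1 (coarse): keep only the states that overlap the area at all.
    candidate_states = [name for name, box in states.items() if area.overlaps(box)]
    # Step 2 (fine): among those states' tracts, keep the ones fully inside the area.
    return [
        tract_id
        for state in candidate_states
        for tract_id, box in tracts_by_state[state].items()
        if area.contains(box)
    ]


# Made-up geography: three side-by-side states and an area straddling A and B.
states = {"A": Box(0, 0, 10, 10), "B": Box(10, 0, 20, 10), "C": Box(20, 0, 30, 10)}
tracts_by_state = {
    "A": {"A1": Box(1, 1, 3, 3), "A2": Box(7, 4, 9, 6)},
    "B": {"B1": Box(10, 4, 12, 6), "B2": Box(16, 4, 18, 6)},
    "C": {"C1": Box(21, 1, 23, 3)},
}
area = Box(6, 3, 13, 7)

result = tracts_within(area, states, tracts_by_state)
print(result)
```

In this toy data, state C never has its tracts examined because it does not overlap the area, which mirrors how the coarse pass avoids downloading data for the whole country.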
If we plot the data on a map, it looks like this:
Demonstration Notebooks
There is a lot more to this API, which is best demonstrated with sample notebooks. Some of the notebooks currently available are:
- Block Groups in CBSAs.ipynb – The notebook that created the plot above.
- Geographies Contained within Geographies.ipynb – A notebook showing three different use cases including tracts within a census place, counties within a multi-state CBSA, and tracts within urban areas in a state.
- Congressional Districts.ipynb – A detailed example where we try to load and manipulate data for census tracts within congressional districts manually, then switch to ced.contained_within.
- Zip Code Tabulation Areas.ipynb – Illustrates how pre-2020 ZCTAs were nested in states in the data model but from 2020 on they are not, so we have to change our download strategy.
Here are some of the maps these notebooks produce:
Please see the notebooks themselves for more details.
Conclusions
The ability to query geographies contained within others, even if they are not both on-spine, is powerful. And now it is simple to do for a wide variety of use cases. We hope you will try it out for your own applications and raise any suggestions or feedback as an issue or discussion in the censusdis project.