Command-Line censusdis for Data Pipelines and One-Time Analysis

censusdis is a package for discovering, loading, analyzing, and computing diversity, integration, and segregation metrics for U.S. Census demographic data.

Recently, I found myself writing a number of small scripts to use censusdis to download data and construct new variables for projects like The Impact of Demographics and Income on Eviction Rates, Diversity and Integration in America: An Interactive Visualization, and many more. I would then invoke these small programs from shell scripts or Makefiles. Many of these scripts were repetitive, with just a few changes here and there to specify what variables and geographies to use and how to compute derived variables, many of which were fractional populations (e.g. the fraction of residents in an area belonging to each of several racial or ethnic groups). This often involves identifying what I call the leaves of a group of variables. This notion is embedded in the U.S. Census data model, but before censusdis it was not well exposed in Python.

It made sense to pull these repetitive tasks back into a canonical form in a single command-line interface in the censusdis project, with the specific details of each task specified via a configuration file. And that is exactly what this post is about. The newest version of censusdis includes a command-line utility to download, manipulate, and plot U.S. Census data without writing any repetitive Python code.

Why is this a big deal? Beyond saving me some keystrokes, it means that it is now easy for data engineers and others who maintain data pipelines to integrate downloading data from the census into their tools and processes with a single configuration-driven tool.

In the rest of this post I’m going to show you some of the key capabilities of this new feature. You should be able to copy, paste, and run the examples here and then come up with your own configurations to solve your own census data needs.

Getting Started

The easiest way to get started, whether you are using the command line or the traditional Python interface to censusdis, is to pip install it in a virtual environment, with the shell command

pip install censusdis
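
If you want pip itself to enforce the minimum version used in this post, you can ask for it explicitly. This is standard pip syntax, nothing specific to censusdis:

pip install 'censusdis>=0.99.3'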

Either way, make sure you have version 0.99.3 or later so that all the examples in this post work. Once you have installed it, the censusdis tool should be available in your shell. Try it out, and get the top-level help by running

censusdis --help

This should give you a help message like the following:

usage: censusdis [-h] [--log {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                 [--logfile LOGFILE]
                 {download,plot} ...

options:
  -h, --help            show this help message and exit
  --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Logging level.
  --logfile LOGFILE     Optional file path that logs should be appended to.
                        The file will be created if it does not exist.

command:
  Choose one of the following commands.

  {download,plot}
    download            Download data from the U.S. Census API.
    plot                Plot data on a map.

Let’s skip past --help, which we already used, and the logging options --log and --logfile. After that we see that there are two commands, download and plot. We will look at download in this blog post and then in a later one we will look at plot.
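
For example, once we have written a data specification file (like the data1.yaml we will create in the next section), we could run a download with debug-level logging appended to a file using something like

censusdis --log DEBUG --logfile censusdis.log download data1.yaml -o data1.csv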

Downloading Data with the download Command

Detailed help on the download command can be seen using

censusdis download --help

This produces detailed help on download, which is the first command we are going to use. The help is:

usage: censusdis download [-h] [--api-key API_KEY] -o OUTPUT dataspec

positional arguments:
  dataspec              A dataspec YAML file.

options:
  -h, --help            show this help message and exit
  --api-key API_KEY     Optional API key. Alternatively, store your key
                        in ~/.censusdis/api_key.txt. If you don't have a
                        key, you may get throttled or blocked. Get one
                        from https://api.census.gov/data/key_signup.html
  -o OUTPUT, --output OUTPUT
                        Output file to store the data in. Format 
                        will be determined from the file extension.
                        .csv or .geojson (the latter if your spec 
                        has with_geometry: true).
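
One option worth noting before we move on is --api-key. As the help text says, you can either pass a Census API key on the command line or store it in ~/.censusdis/api_key.txt. For example, with YOUR_KEY, dataspec.yaml, and output.csv standing in as placeholders for your own key and file names, a call might look like

censusdis download --api-key YOUR_KEY -o output.csv dataspec.yaml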

The key argument here is dataspec, which is the name of a YAML file that specifies what data we want to download from the U.S. Census API, and how we want to manipulate it to produce our final dataset. To write your first data specification file, open up your favorite editor and create a file called data1.yaml, and copy and paste the following into it:

!DataSpec
dataset: ACS5
vintage: 2020
geography:
  state: '*'
specs:
  !VariableList
  variables:
    - NAME        
    - B01003_001E

If you are using an editor that has a mode for YAML files, you may even get nice highlighting. If you are using a plain text editor, make sure you get the indentation right. YAML can be quite sensitive about extra or missing spaces. Note that this and all of the other YAML files in this blog post are also available on GitHub at https://github.com/vengroff/censusdis-cli-demo.

Let’s go over the contents of the file. The !DataSpec tag on line 1 indicates that the rest of the file is a data specification. Line 2 specifies the dataset we want to use, the American Community Survey (ACS), 5-year version. The U.S. Census publishes a large number of data sets. The ACS is one of the most popular. Line 3 indicates we want data for the 5-year period ending in 2020. Lines 4 and 5 indicate the geography we want to use. In our case, we want data at the state level, and we want data for all states, indicated by ‘*’.

The rest of the file specifies what data we want. This is the meat of the file, and we will see increasingly complex versions of it as we continue. For now, we are simply specifying a list of two variables we want from every state. The first, NAME, is the name of the state. The second, B01003_001E, is the total population of the state.

All of the thousands of variables and groups of variables in the ACS5 can be found here. Censusdis also has some better ways of discovering them, but that is a topic for another post. I also discuss it a bit in this tutorial video. I go through an example of this starting around the 51:00 mark.

Now we can download the data we have specified using the command

censusdis download data1.yaml -o data1.csv

This will download the data we specified and store it in a local file called data1.csv. If you open the CSV file with Excel or any other editor that handles CSV files, you will see that the beginning of it looks something like this:

The first column, STATE, contains a FIPS code that identifies each state. This column was put there as a result of the geography we requested, where we indicated we wanted all states. The second column, NAME, is the name of the state, and the third, B01003_001E, is the other variable we requested, which is the population of the state.

Although they are not all shown above, the file contains a total of 52 rows of data, one for each state, one for the District of Columbia, and one for Puerto Rico. Although the latter two are not states, the U.S. Census Bureau data sets often treat them more or less as if they were.

Now that we have this data, we can use any of our favorite tools that can work with data in CSV files to sort it, analyze it, plot it, or do whatever else we want to do with the data. We might also put a command like the one we use into a shell script, a Makefile, or any other sort of specification of a data pipeline we might want to construct.
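
For example, a minimal Makefile rule, sketched with the file names from the example above, might look like this (remember that Makefile recipe lines must be indented with a tab):

# Rebuild data1.csv whenever the data specification changes.
data1.csv: data1.yaml
	censusdis download data1.yaml -o data1.csv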

Downloading a Group of Variables

In many U.S. Census data sets, variables are organized into groups. For example, the ACS data in the example above has tens of thousands of variables organized into groups to make them easier to manage. Some groups, like B01003, have just a single estimate variable like B01003_001E, which we already downloaded. Others, like B16010, have dozens.

Often, we want to download all the variables in a group. But it would be tedious to list them all out individually in a !VariableList like the one we used above. Luckily, we can use a !Group instead. Here is a specification file that downloads the single variable NAME and all the variables in the group B16010:

!DataSpec
dataset: ACS5
vintage: 2020
geography:
  state: '*'
specs:
  - !VariableList
    variables:
      - NAME
  - !Group
    group: B16010

If you copy this and save it in a file called group.yaml, then you can run:

censusdis download group.yaml -o group.csv

Now if you open group.csv, you will see that it has 55 columns, one for STATE, one for NAME, and 53 others, one for each variable in the group. The first few rows and columns look like this:

Computing Fractional Variables

Often, instead of the absolute count of different variables in a group, we want to know the fraction of the total represented by each variable. We can do this by adding a special field identifying a variable to use as a denominator. For example, consider this specification:

!DataSpec
dataset: ACS5
vintage: 2020
geography:
  state: '*'
specs:
  - !VariableList
    variables:
      - NAME
  - !Group
    group: B25003
    denominator: B25003_001E

This file downloads data from the group B25003, which has only three variables. They represent the number of housing units (B25003_001E), the number that are owner occupied (B25003_002E), and the number that are rentals (B25003_003E).

On line 12, we added denominator: B25003_001E to the group. This indicates that we want to compute fractional values of each variable using B25003_001E as the denominator. If we store the specification above in fraction.yaml and run

censusdis download fraction.yaml -o fraction.csv

then the first few lines of the resulting CSV look like this:

The three downloaded variables are there as we expect. But there are three more, each of which has a name like frac_VARIABLE where VARIABLE is the name of one of the variables we downloaded. These fractions were computed by dividing each variable by the denominator B25003_001E. As expected, the frac_B25003_001E values are all 1.0. frac_B25003_002E is the fraction of housing units that are owner occupied and frac_B25003_003E is the fraction that are renter occupied. So, for example, in Utah, 70.5% of housing units are owner occupied, whereas in the District of Columbia, only 42.5% are.
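
In other words, for each row the tool divides by the denominator we named, so, for example,

frac_B25003_002E = B25003_002E / B25003_001E

which is the owner-occupied count divided by the total housing unit count for that state.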

Groups, Trees, and Leaves

In the previous section, we saw that the group B25003 had one variable that was the sum of the other two and could easily be used as a denominator for fractions. Within many groups, variables have this kind of additive property. But in many of these groups, the variables are arranged in deeper hierarchies than the one we just saw.

For example, in the group B16010 there is a variable B16010_001E that estimates the total population aged 25 and over. This is then subdivided into B16010_002E, which estimates the population 25 and over who have not completed high school, B16010_015E, which estimates the population 25 and over who are high school graduates or equivalent, B16010_028E, for those with some college or associate’s degree, and B16010_041E, for those with a bachelor’s degree or higher.

Each of the variables just mentioned is further subdivided based on whether they are in the labor force or not. For example, B16010_003E estimates the population 25 and over who have not completed high school but are in the labor force and B16010_009E estimates the population 25 and over who have not completed high school and are not in the labor force. The other educational levels are similarly divided based on membership in the labor force.

Finally, estimates like B16010_003E and the other education and labor force estimates are further subdivided based on the languages people speak. For example, B16010_004E estimates the population 25 and over who have not completed high school, are in the labor force, and speak only English. We won’t go through all of them, but the group as a whole branches out like a tree. Visually, the group looks like this:

+ Estimate
    + Total: (B16010_001E)
        + Less than high school graduate: (B16010_002E)
            + In labor force: (B16010_003E)
                + Speak only English (B16010_004E)
                + Speak Spanish (B16010_005E)
                + Speak other Indo-European languages (B16010_006E)
                + Speak Asian and Pacific Island languages (B16010_007E)
                + Speak other languages (B16010_008E)
            + Not in labor force: (B16010_009E)
                + Speak only English (B16010_010E)
                + Speak Spanish (B16010_011E)
                + Speak other Indo-European languages (B16010_012E)
                + Speak Asian and Pacific Island languages (B16010_013E)
                + Speak other languages (B16010_014E)
        + High school graduate (includes equivalency): (B16010_015E)
            + In labor force: (B16010_016E)
                + Speak only English (B16010_017E)
                + Speak Spanish (B16010_018E)
                + Speak other Indo-European languages (B16010_019E)
                + Speak Asian and Pacific Island languages (B16010_020E)
                + Speak other languages (B16010_021E)
            + Not in labor force: (B16010_022E)
                + Speak only English (B16010_023E)
                + Speak Spanish (B16010_024E)
                + Speak other Indo-European languages (B16010_025E)
                + Speak Asian and Pacific Island languages (B16010_026E)
                + Speak other languages (B16010_027E)
        + Some college or associate's degree: (B16010_028E)
            + In labor force: (B16010_029E)
                + Speak only English (B16010_030E)
                + Speak Spanish (B16010_031E)
                + Speak other Indo-European languages (B16010_032E)
                + Speak Asian and Pacific Island languages (B16010_033E)
                + Speak other languages (B16010_034E)
            + Not in labor force: (B16010_035E)
                + Speak only English (B16010_036E)
                + Speak Spanish (B16010_037E)
                + Speak other Indo-European languages (B16010_038E)
                + Speak Asian and Pacific Island languages (B16010_039E)
                + Speak other languages (B16010_040E)
        + Bachelor's degree or higher: (B16010_041E)
            + In labor force: (B16010_042E)
                + Speak only English (B16010_043E)
                + Speak Spanish (B16010_044E)
                + Speak other Indo-European languages (B16010_045E)
                + Speak Asian and Pacific Island languages (B16010_046E)
                + Speak other languages (B16010_047E)
            + Not in labor force: (B16010_048E)
                + Speak only English (B16010_049E)
                + Speak Spanish (B16010_050E)
                + Speak other Indo-European languages (B16010_051E)
                + Speak Asian and Pacific Island languages (B16010_052E)
                + Speak other languages (B16010_053E)

We call the innermost variables, like B16010_004E, B16010_005E, and B16010_053E, leaves because just like on a tree, they are at the ends of the branches.

Downloading the Leaves of a Group

Sometimes, we have research questions where we are only interested in the leaves of a group. In groups where the variables represent counts, the sum of the variables at all the leaves adds up to the total. If we get all the variables, including the non-leaves, it is easier to mistakenly add variables together in ways that produce double counts. For example, if we add B16010_003E and B16010_004E, the result does not make sense. It includes everyone in B16010_003E, but double counts those in B16010_004E, who are already counted in B16010_003E.

Downloading only the leaves of a group is easy. We just have to modify our YAML file by adding leaves_only: true to the !Group specification, giving us

!DataSpec
dataset: ACS5
vintage: 2020
geography:
  state: '*'
specs:
  - !VariableList
    variables:
      - NAME
  - !Group
    group: B16010
    leaves_only: true

Save this in leaves.yaml and run:

censusdis download leaves.yaml -o leaves.csv

The result is a CSV file very similar to group.csv, but without any of the columns for variables that are not leaves. For example, you will see that B16010_001E, B16010_002E, and B16010_003E are not present in leaves.csv, nor are any of the other non-leaves.

Fractional Variables over Leaves

As you might have already guessed, we can put these two features together to specify that we want just the leaves, but we want fractional values for them as well. If we put the following into fracleaves.yaml:

!DataSpec
dataset: ACS5
vintage: 2020
geography:
  state: '*'
specs:
  - !VariableList
    variables:
      - NAME
  - !Group
    group: B16010
    leaves_only: true
    denominator: true

and run

censusdis download fracleaves.yaml -o fracleaves.csv

we get a file with values for all of the leaves, but not the non-leaf variables. We also get a fractional value for each of the variables we downloaded. For example, we get a column called frac_B16010_004E. But notice that back in our YAML file we did not specify B16010_001E or any other variable as the denominator. We just said true, as in yes, we want a denominator. So instead of a single variable, the sum of all the variables in the group (or, in this case, just the leaves of the group) was used as the denominator.
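
Concretely, each fraction in fracleaves.csv is computed along the lines of

frac_B16010_004E = B16010_004E / (sum of all the leaf variables in B16010)

so the leaf fractions in each row should add up to 1.0.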

More with Fractional Variable Denominators

We can specify a denominator in either a !VariableList or a !Group, using either true, which adds up the variables, or a specific variable, which does not have to be in the variable list or the group. This gives us a lot of flexibility to compute a variety of useful fractions.

For example, suppose we wanted to know the number of people per housing unit in each state. We know from our previous examples that B25003_001E estimates the number of housing units and B01003_001E estimates the total population. So, we could do the following:

!DataSpec
dataset: ACS5
vintage: 2020
geography:
  state: '*'
specs:
  - !VariableList
    variables:
      - NAME
  - !VariableList
    variables:
      - B01003_001E
    denominator: B25003_001E
    frac_prefix: per_home_

The beginning of the resulting CSV looks like:

Because the specification sets frac_prefix to per_home_, the computed column is named per_home_B01003_001E rather than getting the frac_ prefix we saw in earlier examples. In it, we can see the ratio of population to housing units in each state.

More Geographies

So far, all the data we have downloaded has been at the state level, for all fifty states plus DC and Puerto Rico. But what if we just want a few specific states, not all of them? Instead of using '*' for the state, we can use the abbreviations for the states we want. For example, the specification

!DataSpec
dataset: ACS5
vintage: 2020
geography:
  state: [NY, NJ, CT]
specs:
  !VariableList
  variables:
    - NAME
    - B01003_001E

will download the name and population of just the three states we specified. The result is:

We can also choose other geographies besides states. Suppose we want county-level data for all counties in New Jersey. We can do that with

!DataSpec
dataset: ACS5
vintage: 2020
geography:
  state: NJ
  county: '*'
specs:
  !VariableList
  variables:
    - NAME
    - B01003_001E

Notice that we are asking for just the state of NJ, and we are using the '*' for county to say we want all counties. The results are

We get the name and population of all twenty-one counties in New Jersey.

Census data sets are available at a variety of geographic levels. One of the most commonly used levels is the census tract. Tracts are subdivisions of counties defined by the U.S. Census Bureau. They typically have a few thousand residents each. To look at all census tracts in the state of New Jersey, we could specify our geography as follows:

!DataSpec
dataset: ACS5
vintage: 2020
geography:
  state: NJ
  county: '*'
  tract: '*'
specs:
  !VariableList
  variables:
    - NAME
    - B01003_001E

There are over 2,000 tracts in New Jersey. The first few of them that come back look like this:

Notice that as we add levels to our geography, new columns are added on the left side. So now, instead of just having the FIPS code for the state of New Jersey (34) we also have the code for the county and the tract. The first few rows in the result are all from Essex County (013), but if we scroll down we will see other counties.

Symbolic names are not available below the state level. But we can still use the FIPS codes to narrow our search to just one county or a specific set of counties. For example, the FIPS code of Hudson County, NJ is 017. So we can query it as follows:

!DataSpec
dataset: ACS5
vintage: 2020
geography:
  state: NJ
  county: '017'
  tract: '*'
specs:
  !VariableList
  variables:
    - NAME
    - B01003_001E

Note that the leading 0 is an important part of the county FIPS code, so we had to put 017 in single quotes. Also note that if we wanted several counties, but not all, we could provide a list of them like we did for states earlier; there is a sketch of this below.

As expected, this looks a lot like our last result, but it only contains tracts in Hudson County, not all the tracts in all counties in the state.
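
Here is a sketch of what that multi-county geography might look like, using the Essex and Hudson County FIPS codes we have already seen. It should limit the results to tracts in just those two counties:

geography:
  state: NJ
  county: ['013', '017']
  tract: '*'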

Conclusions

We have shown off some of the basics of what the command-line censusdis tool can do. This isn’t all of it, but it’s a good start and we hope that it inspires you to come up with your own specifications to meet your specific needs. In particular, the different features like geographies and fractional variables can be combined to produce data sets tailored to your specific requirements.
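
For example, here is one sketch, assembled entirely from the pieces shown above but not discussed in detail, that should pull the leaves of B16010, with fractional values, for every county in New Jersey:

!DataSpec
dataset: ACS5
vintage: 2020
geography:
  state: NJ
  county: '*'
specs:
  - !VariableList
    variables:
      - NAME
  - !Group
    group: B16010
    leaves_only: true
    denominator: true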

If you would like to learn more about the options that are available for use in the data specification files, please refer to the documentation for the Python classes that are behind each of the tags we discussed. For each of the classes, there is a list of parameters that can be used in the YAML file with the corresponding tag. For example, consider !DataSpec. We already discussed dataset, vintage, and geography. The other two are related to plotting, which we will cover in another post. The !VariableList tag’s parameters are similarly documented.

At the very beginning of this post we noted that there is also a plot command. In our next post, we discuss it in detail and demonstrate some of what it can do.