Introduction
In the process of creating our Diversity and Integration visualization, we built a nationwide dataset derived from U.S. Census data, and we are now making it available for others to use. The dataset contains counts of the number of people who self-identified as members of various racial and ethnic groups in the 2020 census.
The dataset contains both raw data, identical to what is available from the U.S. Census API, and derived data. The U.S. Census data comes from the 2020 Decennial Census P.L. 94-171 Redistricting Data. The names of many of the variables in our data may be familiar to those accustomed to working with U.S. Census data. But don’t worry if you haven’t worked with this data before. We’ll explain it below.
U.S. Census Variables
The census variables we include come from a group of variables called P2, though we do not include every variable in that group. The ones we do include are what we call leaf variables. A leaf variable is a variable that is not a sum of others. For example, P2_005N is a leaf variable that counts the number of people who identify as not Hispanic or Latino and as white alone, not in combination with any other race. P2_004N, on the other hand, is not a leaf variable, because it represents the number of people who identify as not Hispanic or Latino and as only one race. Because that one race is not specified, it could be white, Black, Asian, or any of several other designations. Thus, the value of P2_004N for a given area is the sum of P2_005N, P2_006N (people who identify as Black or African American alone), P2_007N (people who identify as American Indian and Alaska Native alone), P2_008N (people who identify as Asian alone), and so on. In all, P2_004N is the sum of six leaf variables.
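The relationship between P2_004N and its leaves can be sketched in a few lines of Python. The counts below are made up for illustration and do not come from the data files. The labels for P2_009N and P2_010N are taken from the Census Bureau's P2 table definition rather than from the text above.

```python
# Hypothetical counts for a single tract (illustrative values only).
tract = {
    "P2_005N": 1200,  # white alone
    "P2_006N": 300,   # Black or African American alone
    "P2_007N": 15,    # American Indian and Alaska Native alone
    "P2_008N": 140,   # Asian alone
    "P2_009N": 5,     # Native Hawaiian and Other Pacific Islander alone
    "P2_010N": 40,    # Some Other Race alone
}

# The six leaf variables that make up P2_004N: P2_005N through P2_010N.
ONE_RACE_LEAVES = [f"P2_{n:03d}N" for n in range(5, 11)]


def not_hispanic_one_race(row):
    """Rebuild the non-leaf total P2_004N by summing its six leaves."""
    return sum(row[name] for name in ONE_RACE_LEAVES)


p2_004n = not_hispanic_one_race(tract)
print(p2_004n)  # 1700
```

Because a total like P2_004N can always be rebuilt this way, leaving non-leaf variables out of the files loses no information.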
Derived Leaf Variables
One limitation of the P2 group is that it counts all people who identify as Hispanic or Latino in a single leaf variable P2_002N. We don’t include this variable in our data because we want to be able to analyze race within ethnicity. We’d like to be able to know the number of Black Hispanic or Latino people in an area, as well as the number of white Hispanic or Latino people in that area and so on. The P2 group does not give us this information.
There is another group of variables, called P1, that doesn’t consider ethnicity at all, only race. So, for example, it has leaf variables like P1_003N, which counts all people, regardless of ethnicity, who identify as white alone, not in combination with any other race, and P1_004N, which counts all people, regardless of ethnicity, who identify as Black or African American alone, not in combination with any other race.
We used a combination of leaf variables from P1 and P2 to create a new set of derived variables that count the number of people who identify as Hispanic or Latino and as one or more specific races. For example, we created a leaf variable called hl_005N that counts the number of people who identify as Hispanic or Latino and as white alone, not in combination with any other race. We computed this as
hl_005N = P1_003N – P2_005N.
In English, we took the total number of people who identify as white alone, not in combination with any other race (P1_003N), and subtracted the number of people who identify as not Hispanic or Latino and white alone, not in combination with any other race (P2_005N). What this leaves is the number of people who identify as Hispanic or Latino and white alone, not in combination with any other race.
For every leaf variable P2_0XXN in P2 we created a corresponding derived variable hl_0XXN in exactly this same manner.
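The subtraction can be sketched as a small Python helper. This is an illustration, not the code used to build the dataset. The function name `derive_hl` and the counts are hypothetical. The two variable pairings come from the descriptions above: P1_003N and P2_005N are both white alone, and P1_004N and P2_006N are both Black or African American alone.

```python
def derive_hl(row, pairs):
    """Return {hl_0XXN: P1 count - P2 count} for each (P1, P2) name pair."""
    derived = {}
    for p1_name, p2_name in pairs:
        # "P2_005N" -> "hl_005N": keep the numeric suffix of the P2 name.
        hl_name = "hl_" + p2_name.split("_", 1)[1]
        derived[hl_name] = row[p1_name] - row[p2_name]
    return derived


# Pairs of (all-ethnicity P1 variable, not-Hispanic-or-Latino P2 variable).
PAIRS = [
    ("P1_003N", "P2_005N"),  # white alone
    ("P1_004N", "P2_006N"),  # Black or African American alone
]

# Hypothetical counts for one tract (illustrative values only).
row = {"P1_003N": 1500, "P2_005N": 1200, "P1_004N": 350, "P2_006N": 300}

hl = derive_hl(row, PAIRS)
print(hl)  # {'hl_005N': 300, 'hl_006N': 50}
```

In this made-up tract, 300 of the 1,500 people who identify as white alone also identify as Hispanic or Latino.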
Metadata
In addition to the census leaf variables, the corresponding hl_0XXN variables, and the diversity and integration values, the data files contain three metadata columns: STATE, COUNTY, and TRACT. The STATE and COUNTY columns contain the FIPS codes for the state and county each tract is in. The TRACT column is the census tract. Note that tract IDs are unique only within a state and county. The same ID can be reused in other states and/or counties.
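Because a tract ID alone is ambiguous, it helps to combine all three columns into one key when joining or deduplicating rows. The sketch below uses only the Python standard library and a made-up two-row sample. It assumes the standard code widths of two digits for the state, three for the county, and six for the tract, and it pads with zeros in case a tool has stripped leading zeros. The helper name `tract_key` is hypothetical.

```python
import csv
import io

# Made-up sample in the shape described above: the same TRACT value
# appears in two different counties.
SAMPLE_CSV = """STATE,COUNTY,TRACT,P2_005N
01,001,020100,1200
01,003,020100,900
"""


def tract_key(row):
    """Build a nationally unique key from STATE, COUNTY, and TRACT."""
    # Assumed widths: 2-digit state, 3-digit county, 6-digit tract.
    return (
        row["STATE"].zfill(2)
        + row["COUNTY"].zfill(3)
        + row["TRACT"].zfill(6)
    )


# csv.DictReader keeps every field as a string, so leading zeros survive.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
keys = [tract_key(r) for r in rows]
print(keys)  # ['01001020100', '01003020100']
```

Tools that guess column types may read these codes as integers and drop the leading zeros, so it is usually safer to load the three columns as text.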
Diversity and Integration
Using the original P2_0XXN leaf variables from P2 and our new derived hl_0XXN variables, we computed diversity and integration in each of the over eighty thousand census tracts in the United States. A census tract is an area that typically has one to three thousand residents. This was done using the math described in the documentation of the divintseg open-source package. You can also read more about the methodology in our post about our Diversity and Integration visualization.
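For readers who want a feel for the kind of calculation involved, here is a rough sketch. It assumes one common pair of definitions: diversity is the probability that two residents chosen at random belong to different groups, and integration is the population-weighted average of the diversity of the smaller areas that make up a tract. Whether this matches divintseg in every detail is an assumption, so treat the package documentation as the authority. The function names and counts are hypothetical.

```python
def diversity(counts):
    """Probability that two randomly chosen residents are in different groups."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def integration(sub_areas):
    """Population-weighted mean diversity of the sub-areas of a region."""
    total = sum(sum(counts) for counts in sub_areas)
    if total == 0:
        return 0.0
    return sum(sum(counts) * diversity(counts) for counts in sub_areas) / total


# Two hypothetical sub-areas of one tract, each with counts for two groups.
sub_areas = [[50, 50], [100, 0]]

# Tract-wide counts per group: [150, 50].
tract_counts = [sum(group) for group in zip(*sub_areas)]

print(diversity(tract_counts))  # 0.375
print(integration(sub_areas))   # 0.25
```

In this toy example the tract as a whole is fairly diverse, but half its residents live in a sub-area with no diversity at all, so integration comes out lower than diversity.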
Data Format
We have published our data in two formats. First, it is available as CSV files. This is for maximum compatibility with whatever software you might want to use to analyze the data. Whether you prefer Microsoft Excel, Google Sheets, R, Python/Pandas, or just about any other package designed to work with data, you should be able to load a CSV file. We have created one CSV file per state. This keeps the files to a manageable size and lets people who are interested only in their local area work with a dataset that focuses on what they care about.
The second format is GeoJSON, which is designed for use in geographic information systems. In addition to the data that the CSV files contain, these files also contain the geography of each of the tracts. This makes it easier to produce maps. In Python, we normally work with these files using the GeoPandas package, which enables us to load, analyze, and plot them. This is how we produced parts of the maps in our Diversity and Integration visualization. But many other GIS tools can read GeoJSON and/or convert it to other formats.
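For a sense of what is inside a GeoJSON file, the sketch below parses a tiny hand-written example with Python's standard json module. A GeoJSON FeatureCollection holds a list of features, each with a `properties` object and a `geometry` object. The assumption here is that the property names mirror the CSV column names. The sample values and the triangle geometry are made up.

```python
import json

# A made-up FeatureCollection with one tract-like feature.
SAMPLE_GEOJSON = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"STATE": "01", "COUNTY": "001", "TRACT": "020100",
                     "P2_005N": 1200},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]
      }
    }
  ]
}
"""

collection = json.loads(SAMPLE_GEOJSON)

summary = []
for feature in collection["features"]:
    props = feature["properties"]  # the tabular data for this feature
    geom = feature["geometry"]     # the shape to draw on a map
    outer_ring = geom["coordinates"][0]  # first ring of a Polygon
    summary.append((props["TRACT"], geom["type"], len(outer_ring)))

print(summary)  # [('020100', 'Polygon', 4)]
```

Real files may also contain MultiPolygon geometries, whose coordinates are nested one level deeper, which is one reason a library such as GeoPandas is more convenient for actual analysis.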
Data Files
The following table contains links to the state-by-state data files you can download.
State/District/Territory | CSV File | GeoJson File |
Alabama | ||
Alaska | ||
Arizona | ||
Arkansas | ||
California | ||
Colorado | ||
Connecticut | ||
District of Columbia | ||
Delaware | ||
Florida | ||
Georgia | ||
Hawaii | ||
Idaho | ||
Illinois | ||
Indiana | ||
Iowa | ||
Kansas | ||
Kentucky | ||
Louisiana | ||
Maine | ||
Maryland | ||
Massachusetts | ||
Minnesota | ||
Mississippi | ||
Michigan | ||
Missouri | ||
Montana | ||
Nebraska | ||
Nevada | ||
New Hampshire | ||
New Jersey | ||
New Mexico | ||
New York | ||
North Carolina | ||
North Dakota | ||
Ohio | ||
Oklahoma | ||
Oregon | ||
Pennsylvania | ||
Rhode Island | ||
South Carolina | ||
South Dakota | ||
Tennessee | ||
Texas | ||
Utah | ||
Vermont | ||
Virginia | ||
Washington | ||
West Virginia | ||
Wisconsin | ||
Wyoming | ||
Puerto Rico |
In addition to the state-by-state files, there are also larger files containing data from all the states. They are as follows:
CSV File | GeoJson File |
Conclusions
We hope this data is useful to others. We look forward to hearing how you use it, and we are happy to hear comments or answer questions about it. You can reach us by email at info at datapinions dot com.