Diversity and Integration Data

Introduction

While building our Diversity and Integration visualization, we created a nationwide dataset derived from U.S. Census data. We are now making this dataset available for others to use. It contains counts of the number of people who identified themselves as members of various racial and ethnic groups in the 2020 census.

The dataset contains both raw data, identical to what is available from the U.S. Census API, and derived data. The U.S. Census data comes from the 2020 Decennial Census P.L. 94-171 Redistricting Data. The names of many of the variables in our data may be familiar to those accustomed to working with U.S. Census data. But don’t worry if you haven’t worked with this data before. We’ll explain it below.

U.S. Census Variables

The census variables we include are from a group of variables called P2, though we do not include all variables from that group. The ones we do include are what we call leaf variables. A leaf variable is a variable that is not a sum of others. For example, the variable P2_005N is a leaf variable that counts the number of people who identify as not Hispanic or Latino and racially white alone, not in combination with any other race. P2_004N, on the other hand, is not a leaf variable because it represents the number of people who identify as not Hispanic or Latino and as only one race. Because the one race they identify as is not specified, it could be white, Black, Asian, or any of several other designations. Thus, the value of P2_004N for a given area is the sum of P2_005N, P2_006N (people who identify as Black or African American alone), P2_007N (people who identify as American Indian and Alaska Native alone), P2_008N (people who identify as Asian alone), and so on. In all, P2_004N is the sum of six different leaf variables.
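The relationship between a non-leaf variable and its leaves can be checked with simple arithmetic. The sketch below uses invented counts for a single hypothetical tract; only the variable names come from the P2 group.

```python
# Invented counts for one hypothetical tract (not real census data).
tract = {
    "P2_005N": 2100,  # not Hispanic or Latino, white alone
    "P2_006N": 640,   # not Hispanic or Latino, Black or African American alone
    "P2_007N": 15,    # not Hispanic or Latino, American Indian and Alaska Native alone
    "P2_008N": 310,   # not Hispanic or Latino, Asian alone
    "P2_009N": 4,     # not Hispanic or Latino, Native Hawaiian and Other Pacific Islander alone
    "P2_010N": 22,    # not Hispanic or Latino, Some Other Race alone
}

# The six single-race leaf variables that roll up into P2_004N.
SINGLE_RACE_LEAVES = [f"P2_{i:03d}N" for i in range(5, 11)]


def one_race_total(row):
    """Rebuild the non-leaf P2_004N value from its six leaf variables."""
    return sum(row[name] for name in SINGLE_RACE_LEAVES)


p2_004n = one_race_total(tract)
print(p2_004n)
```

Because P2_004N can always be rebuilt this way, storing only the leaves loses no information.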

Derived Leaf Variables

One limitation of the P2 group is that it counts all people who identify as Hispanic or Latino in a single leaf variable P2_002N. We don’t include this variable in our data because we want to be able to analyze race within ethnicity. We’d like to be able to know the number of Black Hispanic or Latino people in an area, as well as the number of white Hispanic or Latino people in that area and so on. The P2 group does not give us this information.

There is another group of variables, called P1, that doesn’t consider ethnicity at all, only race. So, for example, it has leaf variables like P1_003N, which counts all people, regardless of ethnicity, who identify as white alone, not in combination with any other race, and P1_004N, which counts all people, regardless of ethnicity, who identify as Black or African American alone, not in combination with any other race.

We used a combination of leaf variables from P1 and P2 to create a new set of derived variables that count the number of people who identify as Hispanic or Latino and as one or more specific races. For example, we created a leaf variable called hl_005N that counts the number of people who identify as Hispanic or Latino and racially white alone, not in combination with any other race. We computed this as

hl_005N = P1_003N – P2_005N.

In English, we took the total number of people who identify as white alone, not in combination with any other race (P1_003N), and subtracted the number of people who identify as not Hispanic or Latino and white alone, not in combination with any other race (P2_005N). What this leaves is the number of people who identify as Hispanic or Latino and white alone, not in combination with any other race.

For every leaf variable P2_0XXN in P2 we created a corresponding derived variable hl_0XXN in exactly this same manner.
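The subtraction can be sketched in a few lines of Python. The helper names below are hypothetical, and the counts are invented. The sketch also assumes that the P1 counterpart of P2_0XXN always has an index two lower, as it does for the P1_003N and P2_005N pair; that offset is an assumption worth verifying against the Census variable definitions before relying on it.

```python
def p1_name_for(p2_name, offset=2):
    """Map a P2 leaf name such as 'P2_005N' to its assumed P1 counterpart.

    The offset of 2 matches the P1_003N / P2_005N pair in the text; applying
    it to every other variable is an assumption, not a documented rule.
    """
    index = int(p2_name[3:6])
    return f"P1_{index - offset:03d}N"


def derive_hl(row, p2_names):
    """Return {hl_0XXN: P1 count minus P2 count} for each P2 leaf name."""
    derived = {}
    for p2_name in p2_names:
        hl_name = "hl_" + p2_name[3:]
        derived[hl_name] = row[p1_name_for(p2_name)] - row[p2_name]
    return derived


# Invented counts for one hypothetical tract.
row = {"P1_003N": 2500, "P2_005N": 2100, "P1_004N": 700, "P2_006N": 640}
hl = derive_hl(row, ["P2_005N", "P2_006N"])
print(hl)
```

In this made-up tract, 2,500 people identify as white alone and 2,100 of them are not Hispanic or Latino, so the remaining 400 are counted in hl_005N.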

Metadata

In addition to the census leaf variables, the corresponding hl_0XXN variables, and the diversity and integration values, the data files contain three metadata columns: STATE, COUNTY, and TRACT. The STATE and COUNTY columns contain the FIPS codes for the state and county each tract is in. The TRACT column is the census tract. Note that tract IDs are only unique within a state and county. They can be reused in other states and/or counties.
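Since a tract ID alone is not unique nationwide, one option is to combine the three metadata columns into a single key. The sketch below assumes the standard code widths of two digits for the state, three for the county, and six for the tract, and it assumes the TRACT column holds the six-digit code without a decimal point. The function name and sample values are hypothetical.

```python
def tract_geoid(state, county, tract):
    """Combine the three metadata columns into one nationally unique key.

    Values are zero-padded to the standard widths (2, 3, and 6 digits) in
    case leading zeros were lost when the file was loaded.
    """
    return f"{str(state).zfill(2)}{str(county).zfill(3)}{str(tract).zfill(6)}"


# Two hypothetical tracts that share a tract ID but sit in different counties.
same_tract_code = [
    ("01", "001", "020100"),
    ("01", "003", "020100"),
]
geoids = [tract_geoid(*parts) for parts in same_tract_code]
print(geoids)
```

The zero-padding matters in practice because spreadsheet tools and some data loaders read codes like 01 as the number 1.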

Diversity and Integration

Using the original P2_0XXN leaf variables from P2 and our new derived hl_0XXN variables, we computed diversity and integration in each of the over eighty thousand census tracts in the United States. A census tract is an area that typically has between 1,200 and 8,000 residents. We did this using the math described in the documentation of the divintseg open source package. You can also read more about the methodology in our post about our Diversity and Integration visualization.
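To give a feel for the kind of calculation involved, here is a minimal sketch of one common definition of diversity: the probability that two residents chosen at random belong to different groups. This is an illustration under that assumption, not necessarily the exact computation divintseg performs, so please consult the package documentation for the authoritative definitions of both diversity and integration.

```python
def diversity(counts):
    """Probability that two residents drawn at random (with replacement)
    belong to different groups: 1 - sum(p_i ** 2)."""
    total = sum(counts)
    if total == 0:
        return 0.0  # convention chosen in this sketch for empty areas
    return 1.0 - sum((c / total) ** 2 for c in counts)


uniform = diversity([100, 100, 100, 100])  # four equally sized groups
single = diversity([400, 0, 0, 0])         # everyone in one group
print(uniform, single)
```

In a real tract, the list of counts would be the values of the P2_0XXN and hl_0XXN leaf variables for that tract.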

Data Format

We have published our data in two formats. First, it is available as CSV files, for maximum compatibility with whatever software you might want to use to analyze the data. Whether you prefer Microsoft Excel, Google Sheets, R, Python/Pandas, or just about any other package designed to work with data, you should be able to load a CSV file. We have created one CSV file per state. This keeps each file to a manageable size and lets people who are only interested in their local area work with a dataset that focuses on what they care about.
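As a starting point, the sketch below reads tract rows using only the Python standard library. The column layout of the small in-memory sample is an assumption made for illustration, and `read_tracts` is a hypothetical helper name. The metadata columns are kept as strings so that leading zeros in FIPS codes survive.

```python
import csv
import io

METADATA_COLUMNS = {"STATE", "COUNTY", "TRACT"}
COUNT_PREFIXES = ("P2_", "hl_")


def read_tracts(handle):
    """Read rows from an open CSV file.

    Metadata columns stay as strings, count columns (P2_* and hl_*) become
    integers, and any other column is left as the string found in the file.
    """
    rows = []
    for record in csv.DictReader(handle):
        row = {}
        for key, value in record.items():
            if key not in METADATA_COLUMNS and key.startswith(COUNT_PREFIXES):
                row[key] = int(float(value))  # tolerate values like "2100.0"
            else:
                row[key] = value
        rows.append(row)
    return rows


# A tiny stand-in for a state file; the values are invented.
sample = io.StringIO(
    "STATE,COUNTY,TRACT,P2_005N,hl_005N\n"
    "01,001,020100,2100,400\n"
    "01,003,020100,1800,250\n"
)
rows = read_tracts(sample)
county_001 = [r for r in rows if r["COUNTY"] == "001"]
print(county_001)
```

With a downloaded file, the same function could be called as `read_tracts(open("01-Alabama-2020.csv", newline=""))`. If you use Pandas instead, consider reading the metadata columns as strings for the same leading-zero reason.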

The second format is GeoJSON, which is designed for use in geographic information systems. In addition to the data that the CSV files contain, these files also contain the geography of each of the tracts. This makes it easier to produce maps. In Python, we normally work with these files using the GeoPandas package, which enables us to load, analyze, and plot them. This is how we produced parts of the maps in our Diversity and Integration visualization. But many other GIS tools can read GeoJSON and/or convert it to other formats.
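A GeoJSON file is ordinary JSON, so it can also be inspected without any GIS software. The sketch below parses a tiny invented FeatureCollection with the standard library and computes a bounding box for one feature. It assumes each tract is a feature whose properties carry the same columns as the CSV files, and the hypothetical `bounding_box` helper handles only simple Polygon geometries; real tracts may also be MultiPolygons.

```python
import json

# An invented one-feature collection standing in for a state file.
sample_geojson = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"STATE": "01", "COUNTY": "001", "TRACT": "020100"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [[-86.5, 32.4], [-86.4, 32.4], [-86.4, 32.5], [-86.5, 32.4]]
        ]
      }
    }
  ]
}
"""


def bounding_box(feature):
    """Return (min_lon, min_lat, max_lon, max_lat) for a Polygon feature."""
    ring = feature["geometry"]["coordinates"][0]  # exterior ring
    lons = [lon for lon, lat in ring]
    lats = [lat for lon, lat in ring]
    return min(lons), min(lats), max(lons), max(lats)


collection = json.loads(sample_geojson)
feature = collection["features"][0]
box = bounding_box(feature)
print(feature["properties"], box)
```

For mapping and spatial analysis, a dedicated package such as GeoPandas remains the more practical choice.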

Data Files

The following table contains links to the state-by-state data files you can download.

State/District/Territory	CSV File	GeoJson File
Alabama	01-Alabama-2020.csv	01-Alabama-2020.geojson
Alaska	02-Alaska-2020.csv	02-Alaska-2020.geojson
Arizona	04-Arizona-2020.csv	04-Arizona-2020.geojson
Arkansas	05-Arkansas-2020.csv	05-Arkansas-2020.geojson
California	06-California-2020.csv	06-California-2020.geojson
Colorado	08-Colorado-2020.csv	08-Colorado-2020.geojson
Connecticut	09-Connecticut-2020.csv	09-Connecticut-2020.geojson
Delaware	10-Delaware-2020.csv	10-Delaware-2020.geojson
District of Columbia	11-District_of_Columbia-2020.csv	11-District_of_Columbia-2020.geojson
Florida	12-Florida-2020.csv	12-Florida-2020.geojson
Georgia	13-Georgia-2020.csv	13-Georgia-2020.geojson
Hawaii	15-Hawaii-2020.csv	15-Hawaii-2020.geojson
Idaho	16-Idaho-2020.csv	16-Idaho-2020.geojson
Illinois	17-Illinois-2020.csv	17-Illinois-2020.geojson
Indiana	18-Indiana-2020.csv	18-Indiana-2020.geojson
Iowa	19-Iowa-2020.csv	19-Iowa-2020.geojson
Kansas	20-Kansas-2020.csv	20-Kansas-2020.geojson
Kentucky	21-Kentucky-2020.csv	21-Kentucky-2020.geojson
Louisiana	22-Louisiana-2020.csv	22-Louisiana-2020.geojson
Maine	23-Maine-2020.csv	23-Maine-2020.geojson
Maryland	24-Maryland-2020.csv	24-Maryland-2020.geojson
Massachusetts	25-Massachusetts-2020.csv	25-Massachusetts-2020.geojson
Michigan	26-Michigan-2020.csv	26-Michigan-2020.geojson
Minnesota	27-Minnesota-2020.csv	27-Minnesota-2020.geojson
Mississippi	28-Mississippi-2020.csv	28-Mississippi-2020.geojson
Missouri	29-Missouri-2020.csv	29-Missouri-2020.geojson
Montana	30-Montana-2020.csv	30-Montana-2020.geojson
Nebraska	31-Nebraska-2020.csv	31-Nebraska-2020.geojson
Nevada	32-Nevada-2020.csv	32-Nevada-2020.geojson
New Hampshire	33-New_Hampshire-2020.csv	33-New_Hampshire-2020.geojson
New Jersey	34-New_Jersey-2020.csv	34-New_Jersey-2020.geojson
New Mexico	35-New_Mexico-2020.csv	35-New_Mexico-2020.geojson
New York	36-New_York-2020.csv	36-New_York-2020.geojson
North Carolina	37-North_Carolina-2020.csv	37-North_Carolina-2020.geojson
North Dakota	38-North_Dakota-2020.csv	38-North_Dakota-2020.geojson
Ohio	39-Ohio-2020.csv	39-Ohio-2020.geojson
Oklahoma	40-Oklahoma-2020.csv	40-Oklahoma-2020.geojson
Oregon	41-Oregon-2020.csv	41-Oregon-2020.geojson
Pennsylvania	42-Pennsylvania-2020.csv	42-Pennsylvania-2020.geojson
Rhode Island	44-Rhode_Island-2020.csv	44-Rhode_Island-2020.geojson
South Carolina	45-South_Carolina-2020.csv	45-South_Carolina-2020.geojson
South Dakota	46-South_Dakota-2020.csv	46-South_Dakota-2020.geojson
Tennessee	47-Tennessee-2020.csv	47-Tennessee-2020.geojson
Texas	48-Texas-2020.csv	48-Texas-2020.geojson
Utah	49-Utah-2020.csv	49-Utah-2020.geojson
Vermont	50-Vermont-2020.csv	50-Vermont-2020.geojson
Virginia	51-Virginia-2020.csv	51-Virginia-2020.geojson
Washington	53-Washington-2020.csv	53-Washington-2020.geojson
West Virginia	54-West_Virginia-2020.csv	54-West_Virginia-2020.geojson
Wisconsin	55-Wisconsin-2020.csv	55-Wisconsin-2020.geojson
Wyoming	56-Wyoming-2020.csv	56-Wyoming-2020.geojson
Puerto Rico	72-Puerto_Rico-2020.csv	72-Puerto_Rico-2020.geojson

In addition to the state-by-state files, there are also larger files containing data from all the states. They are as follows:

CSV File	GeoJson File
00-All-2020.csv	00-All-2020.geojson

Conclusions

We hope this data is useful to others. We look forward to hearing how you use it, and we are happy to hear comments or answer questions about it. You can reach us by email at info at datapinions dot com.