Impact charts are a tool for visually interpreting the relationships between variables. In this post, we will look at some impact charts generated from a combination of eviction filing data from the Eviction Lab at Princeton University and demographic data on renters from the U.S. Census Bureau.
Using this data, the question we are going to use impact charts to answer is how the presence of renters from various racial and ethnic populations in an area impacts the rate at which landlords file to evict tenants in those same areas. Race and ethnicity are not the only factors that contribute to eviction rates. We also expect the income of renters to be an important contributor. So our impact charts will consider that too.
Prior to impact charts, the most common way of answering this type of question was through regression analysis. But because impact charts are based on modern machine learning techniques, they can detect effects regression analysis cannot. The underlying models get higher r-squared scores (a measure of their accuracy) than regression models and don't require you to know in advance what shapes of effects you expect to find.
If you currently use regression analysis, we hope that by the time you finish reading this post you will agree that impact charts offer a compelling approach to teasing out the impact of variables on one another, and that you will want to try this approach with your own data.
A First Impact Chart: Eviction and Blackness
Before we go into the exact details of what data we used and how the impact charts were constructed, let's look at our first impact chart. This will help give us a sense of what impact charts are designed to do. We will start with an impact chart that looks at the impact of the percentage of renters who identify as Black on the rate of eviction filings in DeKalb County, Georgia in 2009 and 2010.
The horizontal axis is the percent of renters who identify as Black alone. We measure this at the census tract level. It ranges from 0% to 100%. Some census tracts in the county are almost entirely non-Black and others are almost entirely Black.
The vertical axis is the impact of the percent of renters who identify as Black on the eviction filing rate, as measured by the number of eviction cases filed per 100 renters per year. What exactly do we mean by impact? Before we can answer that, we have to look a bit at what lies behind an impact chart.
Before we built the impact chart, we constructed machine learning models and trained them to predict the eviction filing rate based on the percent of renters in a census tract who belonged to each of several different racial and ethnic groups as well as the median income of renters in the tract. Racial and ethnic variables included the percent of renters who identify as white, Black, Asian, Hispanic or Latino, and so on, using groups defined by the U.S. Census.
We then used a technique called SHAP to interpret the machine learning models we constructed. What SHAP does is tell us, for a given census tract with its particular demographic makeup and median renter income, how much each of those factors impacted the final prediction the models made.
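To make this concrete, here is a minimal sketch of that kind of pipeline. It is not the authors' actual code (that lives in the repositories described below); the file name, column names, and model settings are hypothetical placeholders.

```python
import pandas as pd
import shap
import xgboost as xgb

# Hypothetical tract-level data set: one row per census tract, with the
# fraction of renters in each racial/ethnic group, median renter income,
# and the eviction filing rate. File and column names are placeholders.
df = pd.read_csv("tract_level_data.csv")

feature_cols = [
    "frac_renters_white",
    "frac_renters_black",
    "frac_renters_asian",
    "frac_renters_hispanic_or_latino",
    "median_renter_income",
]
X, y = df[feature_cols], df["eviction_filing_rate"]

# Train a gradient boosted tree model to predict the filing rate.
model = xgb.XGBRegressor(n_estimators=100, max_depth=3)
model.fit(X, y)

# Use SHAP to attribute each tract's prediction to the individual features.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape: (n_tracts, n_features)

# An impact chart for one feature plots the feature's value (x-axis)
# against its SHAP value (y-axis), with one dot per tract.
impact = pd.DataFrame(shap_values, columns=feature_cols, index=df.index)
```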
In an ideal world where there was no structural racism, we would expect to find that none of the racial inputs had any impact. Low-income tracts might still have higher eviction rates than high-income tracts, but low-income mostly Black tracts would not look different from low-income mostly white tracts, low-income mostly Hispanic or Latino tracts, or low-income tracts that were racially and ethnically mixed. But in a non-ideal world, if the model fits the data well, the impacts of race and ethnicity would be non-zero.
Now let’s go back to our first impact chart and see what it tells us. Each green dot represents one of the census tracts in the county. Let’s look at the leftmost fifth of the chart, representing tracts that are less than 20% Black. They are to the left of the light gray vertical grid line labeled 20% at the bottom. Most have impacts between -5 and -7. What this means is that because they do not have a lot of Black residents, the model expects them to have lower eviction filing rates than they otherwise would.
Now let’s look at the right side of the chart, to the right of the vertical grid line labeled 80% at the bottom. These are tracts where more than 80% of renters identify as Black. If we look at the green dots in this region, they all have an impact greater than 0 on the vertical axis. Many have an impact greater than 5. This means that the model predicts that these neighborhoods with predominantly Black renters will have eviction filing rates higher than otherwise comparable neighborhoods.
What exactly do we mean by otherwise comparable neighborhoods? When studies with the kind of data we are using are published, their conclusions are often summarized by saying something like "Black neighborhoods have higher rates of eviction filings, corrected for income." Corrected for income is exactly what we mean by comparable neighborhoods. It means that the model takes income into account and still predicts higher eviction filing rates in Black neighborhoods, independent of the fact that Black neighborhoods also tend to be lower-income neighborhoods.
A Second Impact Chart: Eviction and Whiteness
The point of an impact chart is to look at the impact of one single variable, regardless of the statistical relationship it may have to other variables in the model. We can look at other variables as well; each gets its own impact chart. For example, here is the impact chart for the percentage of renters who are white.
In this impact chart, the green dots for tracts where renters are less than 10% white are all in the positive impact range. This means the model predicts that neighborhoods that are mostly non-white have higher rates of eviction. Again, this is corrected for income. Note that what this impact chart is telling us is that regardless of what groups of non-white renters live in a tract, the simple fact that few of the renters are white is alone sufficient to drive eviction filing rates up.
There is a sharp downward trend in impact in the low-white range of the chart, up to just over 10%. After that, the impact levels off and then continues to decrease, but more slowly, through about 55%. From there it remains essentially flat between -3 and -8 all the way up to completely white neighborhoods.
When we looked at the impact chart for Black renters, the trend looked like we could reasonably explain it with a straight line. But the white impact chart has a more complicated behavior. This ability to identify effects with shapes that are not just straight lines is one of the key things that distinguishes impact charts from regression analysis. Because we used a powerful machine learning technique called boosted trees instead of linear regression, we are able to identify these nonlinear impacts, even when we don’t know anything about the shapes we expect to see before we start the analysis.
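As a small illustration of this point, here is a sketch on made-up data (not the eviction data): a spike-shaped effect that a boosted tree model recovers easily but a straight line cannot.

```python
import numpy as np
import xgboost as xgb
from sklearn.linear_model import LinearRegression

# Synthetic data: the target depends on x through a spike centered at
# 50,000, which no straight line can represent.
rng = np.random.default_rng(0)
x_syn = rng.uniform(0, 100_000, size=2_000)
y_syn = 10.0 * np.exp(-(((x_syn - 50_000) / 10_000) ** 2))
y_syn += rng.normal(0.0, 0.5, size=2_000)

X_syn = x_syn.reshape(-1, 1)
tree_r2 = xgb.XGBRegressor(n_estimators=200, max_depth=3).fit(X_syn, y_syn).score(X_syn, y_syn)
line_r2 = LinearRegression().fit(X_syn, y_syn).score(X_syn, y_syn)

print(f"boosted trees r-squared: {tree_r2:.2f}")  # close to 1
print(f"linear model r-squared:  {line_r2:.2f}")  # close to 0
```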
What About the Impact of Income?
We can build an impact chart for any variable we put into our model. So we can also make one for median renter income. This is what it looks like.
We hypothesized that low-income tracts would have higher eviction rates. But the impact chart tells us something more nuanced. There is a sharp spike between $20,000 and $40,000, but it drops away quickly on either side. For tracts on the right side of the spike, this fits the hypothesis that higher income areas have lower rates of eviction. But on the left side, the opposite happens. It is possible that this is the case because low-income renters qualify for programs or housing with less stringent eviction practices than the open market. It is certainly something that deserves further inquiry.
Taken together, these three impact charts tell two compelling stories. The first is one of structural racism in eviction filings in DeKalb County, Georgia. The second is one of the existence of some kind of eviction safety net for very low-income households.
What are the Gray Dots?
The green dots we have been looking at are actually the average impact from 50 different machine learning models. Each of these models uses the same code, but is trained on a different random sample of 80% of the data. The gray dots are the impact of the variable in each of the 50 different models. So for each green dot, there are 50 gray dots behind it, in a vertical line.
The reason we do this is that machine learning models can be fickle, giving very different results when trained on data that, to a human, looks essentially the same. By training many models, we can see whether they agree on the impact of each input or are all over the place, indicating that our final estimate of the impact (the green dot) isn't that accurate.
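In code, the idea looks roughly like the sketch below, continuing the hypothetical `X` and `y` from the earlier sketch. The impactchart package described below handles this internally; the number of models and sample fraction shown here just match the description above.

```python
import numpy as np
import shap
import xgboost as xgb

n_models = 50
rng = np.random.default_rng(17)
all_shap = []

for _ in range(n_models):
    # Train each model on a different random 80% sample of the tracts.
    idx = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
    m = xgb.XGBRegressor(n_estimators=100, max_depth=3)
    m.fit(X.iloc[idx], y.iloc[idx])

    # SHAP values for every tract under this one model: the gray dots.
    all_shap.append(shap.TreeExplainer(m).shap_values(X))

# The green dot for each tract and feature is the mean over the 50 models.
mean_shap = np.mean(np.stack(all_shap), axis=0)
```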
The length of the vertical distribution of gray dots gives us a visual idea of how confident we can be in what the corresponding green dot tells us. They are like error bars. If all 50 models closely agree, the gray dots don’t extend very far above or below the green dot. But in some cases they do. You can see a few examples in each of the impact charts above. Often, the green dots for these tracts look somewhat out of place relative to the trend of the green dots immediately around them. This means that the models are having trouble agreeing on what is going on in the tract, perhaps because it is heavily influenced by some other variable that we did not include in our model.
Comparing Impact Charts to Regression Analysis
Regression analysis is the typical approach to answering questions like the ones we have considered here. But we believe that machine learning-based impact charts can produce more complete insights.
Regression analysis makes a number of technical assumptions about the data it works with. First and foremost, it assumes that we have an a priori understanding of the shape of the effect we are trying to model. In most cases, the assumption is that the effects are linear, though sometimes curves of other shapes, such as polynomial, exponential, or logarithmic functions, are considered.
Sticking to the linear case for now, this means, for example, that if going from 5% white renters to 15% white renters (a change of +10%) had an impact of -4 on the eviction filing rate, then a change from 50% to 60% (also a change of +10%) would also have an impact of -4. The machine learning method we are using makes no such linearity assumptions and can therefore do a better job of capturing the shape of impacts like those we saw above. In the case of the impact of median renter income, a linear model would have no chance of capturing the effect we saw in the impact chart.
To illustrate the contrast, here is our earlier impact chart overlaid with the estimated impact from a linear regression analysis. Note that this is not a linear fit of the green dots. It’s the impact estimated by a linear model of the entire system, which has no knowledge of the green dots at all.
While the linear model captures the general downward impact on the eviction filing rate, it misses the subtleties of the shape of the curve defined by the green dots from our machine learning model. And clearly, for the impact of median renter household income, no line would capture the details that the impact chart shows.
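For readers who want to reproduce that kind of overlay on their own data, here is a sketch, again continuing the hypothetical `X` and `y` from earlier. For an ordinary least squares model (treating features as independent), the SHAP impact of a feature reduces to its coefficient times the feature's deviation from its mean, which is why the overlay is a straight line.

```python
from sklearn.linear_model import LinearRegression

# Fit an ordinary least squares model on the same features and target.
ols = LinearRegression().fit(X, y)

# The linear model's impact for one feature: coefficient times the
# feature's deviation from its mean. "frac_renters_white" is the
# hypothetical column name from the earlier sketch.
col = "frac_renters_white"
j = list(X.columns).index(col)
linear_impact = ols.coef_[j] * (X[col] - X[col].mean())
```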
Some additional technical assumptions that regression analysis makes include homoscedasticity and normal distribution of errors. We won’t go into detail here, but with the machine learning-based impact chart approach we can avoid having to make these assumptions, which may not hold in our data.
For these and other reasons, impact charts tend to be more accurate. One way we can examine this is by looking at r-squared, a standard measure of accuracy, for the models behind the impact charts. For the charts above, the model gets a score of 0.72. By contrast, the ordinary least squares model that produced the orange line gets a score of 0.48. Higher scores indicate more accuracy.
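The 0.72 and 0.48 above come from the authors' models. As a rough sketch of how to make the same kind of comparison on your own data (not necessarily how those particular scores were computed), continuing the hypothetical `X` and `y`:

```python
import xgboost as xgb
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hold out 20% of the tracts and compare r-squared on the held-out data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=17
)

xgb_pred = xgb.XGBRegressor(n_estimators=100, max_depth=3).fit(X_train, y_train).predict(X_test)
ols_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

print(f"boosted trees r-squared: {r2_score(y_test, xgb_pred):.2f}")
print(f"linear model r-squared:  {r2_score(y_test, ols_pred):.2f}")
```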
More Impact Charts
In addition to the three charts above, we have generated over 1,600 similar impact charts for counties all over the country. For more details, see The Impact of Demographics and Income on Eviction Rates. Or, if you would just like to browse through the charts, you can find them all at evl.datapinions.com.
Code, Data, and Methodology Details
Some readers will no doubt be interested in many more technical details than are presented here. Some might also like to reproduce the results or produce impact charts for their own data sets. In order to facilitate this, we are releasing all of our code as open-source projects. It relies on other open-source projects including SHAP, XGBoost and pandas, which are also publicly available. The foundational data we used is available from third parties. Our open source projects include code to download and preprocess it as necessary.
Our code can be found in three open source repositories. The first repository, impactchart, contains the fundamental code for building impact charts. No matter what data you are using or what impacts you are trying to characterize, this is the code that builds the charts.
The second repository, evldata, contains code to assemble our data set. Evldata downloads tract-level eviction filing data from the Eviction Lab at Princeton University. Specifically, it uses what the Eviction Lab calls proprietary data within their eviction-lab-data-downloads repository.1
Evldata also uses censusdis to download the U.S. Census data we need. The groups of variables we use are B25119 for median household income for renters and B25003 and B25003A through B25003I for the population of renters overall and renters of different racial and ethnic groups.
Evldata joins the Eviction Lab and U.S. Census data at the census tract level and constructs the features we need to train our machine learning models. These mainly consist of the fractions of the population of renters belonging to different groups. This data is also suitable for use with regression analysis or other methods readers might want to try for the sake of comparison.
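As a rough sketch of what that data assembly looks like (the authoritative version is in evldata; the ACS vintage, variable selection, file name, and returned column names here are illustrative assumptions based on censusdis's documented usage):

```python
import pandas as pd
import censusdis.data as ced
from censusdis.datasets import ACS5

# Renter counts and renter median household income for every tract in
# DeKalb County, GA (state FIPS 13, county FIPS 089), ACS 5-year data.
# B25003_003E  = renter-occupied housing units
# B25003B_003E = renter-occupied units, Black alone householder
# B25119_003E  = median household income of renter households
df_census = ced.download(
    ACS5,
    2010,
    ["NAME", "B25003_003E", "B25003B_003E", "B25119_003E"],
    state="13",
    county="089",
    tract="*",
)

# One fraction-of-renters feature, built from the raw counts.
df_census["frac_renters_black"] = (
    df_census["B25003B_003E"] / df_census["B25003_003E"]
)

# Join with Eviction Lab filings on the 11-digit tract GEOID. The file
# and its columns are hypothetical placeholders for the Eviction Lab data.
df_census["GEOID"] = df_census["STATE"] + df_census["COUNTY"] + df_census["TRACT"]
df_evict = pd.read_csv("eviction_lab_tracts.csv", dtype={"GEOID": str})
df = df_census.merge(df_evict, on="GEOID")
```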
The resulting data set covers the years 2009 and 2010, which is the range of years that are covered by both data sets. The census data is complete, but the eviction data, which comes from proprietary data sets produced by private third parties, does not cover all census tracts or all dates. The Eviction Lab also offers imputed data for areas where the proprietary data is not available, but we deliberately did not use this in our work. All of this means that while we have reliable data for some census tracts, there are others where we have little or none and can’t produce meaningful impact charts.
The third repository, evlcharts, constructs the models and the impact charts we saw above and many more. It builds charts for many of the counties for which the Eviction Lab provides data. It also builds them not only for eviction filing rates, but also for eviction threatening rates and eviction judgment rates as defined by the Eviction Lab.
The models behind the impact charts presented here are built using XGBoost, a popular machine learning package that uses a technique called gradient boosted trees.2 The underlying impact analysis is done with a package called SHAP, which pioneered the idea of additive feature importance.3
Conclusions
This has been an introduction to impact charts, how to interpret them, and some of the tools available to generate them. We hope that this approach is useful to researchers in a variety of fields where regression analysis is now the standard. Comments and feedback are always welcome.
Updates
Updated Nov. 1, 2023 with impact charts generated from models built on constant 2018 dollars instead of non-inflation-adjusted dollars.
Updated Nov. 2, 2023 with links to additional impact charts for other counties and a post that describes them.
References
- Gromis, Ashley, Ian Fellows, James R. Hendrickson, Lavar Edmonds, Lillian Leung, Adam Porton, and Matthew Desmond. Estimating Eviction Prevalence across the United States. Princeton University Eviction Lab. https://data-downloads.evictionlab.org/#estimating-eviction-prevalance-across-us/. Deposited May 13, 2022.
- Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. https://arxiv.org/abs/1603.02754. 2016.
- Scott Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions. https://arxiv.org/abs/1705.07874. 2017.