Using Interpretable Machine Learning to Analyze Racial and Ethnic Disparities in Home Values

Introduction

There have been a number of well-publicized cases in recent years in which homes belonging to Black families were appraised at substantially lower values than those owned by white families. This has occurred across the country, in Maryland, Ohio, and California, among other states.

The existence, causes, and history of these disparities have been documented at length. In Race for Profit: How Banks and the Real Estate Industry Undermined Black Homeownership, Keeanga-Yamahtta Taylor uses the term “predatory inclusion” to describe the process by which public-private programs nominally aimed at including Black Americans in the housing market only served to further exploit them.

Predatory inclusion evolved in response to the Fair Housing Act, which was part of the Civil Rights Act of 1968. Prior to that, redlining and a variety of other public and private housing programs and policies enforced segregation in America, a legacy that affects home values to this day. See, for example, Richard Rothstein's The Color of Law.

Taylor describes housing in America as

a privately owned asset in a society where the value of the asset will be weighed by the race or ethnicity of whoever possesses it.

Race for Profit, p. 258.

There has been work on quantifying the weight Taylor identifies. For example, this 2018 study by Andre M. Perry, Jonathan Rothwell, and David Harshbarger at the Brookings Institution examines questions around race and home values. (See also this 2021 update.)

Our goal here is to build on this body of work, both the qualitative and the quantitative, with a new approach derived from the field of Machine Learning (ML), and in particular the evolving field of Interpretable Machine Learning.

Although there are significant algorithmic and mathematical nuances to this work, we have left those, to as great an extent as possible, to the code we wrote to support it. Interested readers are encouraged to examine the code and run it for themselves. However, herein we have made an effort to present the overall concepts behind our approach with as little algorithmic argot as possible beyond standard concepts of regression analysis.

What we show is that real qualitative and quantitative results can be obtained with this new approach. We also have a web page with extensive data and hundreds of what we call impact charts that show qualitatively and quantitatively how race and ethnicity impact housing values in communities across the United States.

Methodology

Our Approach to the Problem

Regression analysis is the de facto standard technique for analyzing and interpreting data in social science, political science, business, investment analysis, and many other fields, and has been for decades. It is widely taught to both undergraduates and graduate students and is the mathematical backbone of thousands of papers. It is also well supported in general purpose software packages like Excel, specialized packages like SAS and SPSS, and popular libraries in computer languages such as R and Python.

But we are going to argue that a newer approach, based on the evolving field of ML, and in particular the subfield of Interpretable Machine Learning, can help us more fully understand the dynamics of the racial and ethnic influences on housing prices. We also believe that this approach is broadly applicable and can be used in many other domains where regression analysis is the de facto standard today.

So while our specific conclusions in the domain of housing and race are, we believe, quite interesting and important, we think the meta-conclusion is that the techniques we use here can be applied to a wide variety of additional problems. We look forward to using this approach again, and hope that others will be inspired to try it as well.

Initial Data Exploration

We began by exploring the raw data from the U.S. Census American Community Survey 5-Year (ACS5) 2017-2021 data set. This was the latest version of the ACS5 available at the time of writing. We looked at data across the top 50 largest metropolitan areas in the U.S. These areas, as defined by the Office of Management and Budget, are known as Core-Based Statistical Areas (CBSAs).

Within each CBSA, we looked at data at the block group level. Block groups are defined by the U.S. Census Bureau. The average population of a block group in the CBSAs we looked at was 1,528 and the median population was 1,395. The mean and median number of owner occupied households in a block group were 373 and 333 respectively. Median income and median home values, as we are about to see, were all over the map and their distribution was highly variable from one CBSA to another.

We began with a quick visual inspection of scatter plots of median home value vs. median income in each block group within a CBSA. Note that these numbers are estimates based on values self-reported by participants in the survey. Here’s what San Francisco-Oakland-Berkeley, a large CBSA with many affluent residents, looks like:

There are 2,452 block groups in the metro area. We removed 292 of them from the data set before proceeding further because they had either median income of $250,001 (the upper limit in the data set) or median home value of $2,000,001 (also an upper limit in the data set). The actual values for these outliers are in most cases higher, but this is how they are reported, limited by how the survey questions were constructed.
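
For readers following along in code, the filtering step itself is simple. Here is a minimal sketch in pandas, with hypothetical column names standing in for the ACS variables (median household income and median home value); it is illustrative only, not the exact code in our repository.

```python
import pandas as pd

# Top codes in the ACS5 data: values at or above these limits are reported
# as the limit itself, so we treat such block groups as outliers and drop them.
MEDIAN_INCOME_TOP_CODE = 250_001
MEDIAN_HOME_VALUE_TOP_CODE = 2_000_001


def drop_top_coded(df: pd.DataFrame) -> pd.DataFrame:
    """Remove block groups whose income or home value hit the survey's top codes."""
    # "median_income" and "median_home_value" are hypothetical column names.
    return df[
        (df["median_income"] < MEDIAN_INCOME_TOP_CODE)
        & (df["median_home_value"] < MEDIAN_HOME_VALUE_TOP_CODE)
    ]
```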

As we can see, and as we expected, there is correlation between median home values and median household income. But it clearly isn’t the only factor. For a given median income level, median home prices can vary quite widely.

Now let’s look at Buffalo-Cheektowaga, a smaller, less affluent CBSA.

The distribution skews much lower on both axes, though the relationship between the two variables looks closer to linear. This CBSA also has no outliers of the type we saw in San Francisco-Oakland-Berkeley.

Formulating the Problem

We’ve only scratched the surface so far, but we’ve seen enough to have an idea of the problems we want to address. At the highest level, we’d like to be able to characterize the nature of the relationship between median household income and median home value in a way that we can use the former to predict the latter. Because it appears from the San Francisco-Oakland-Berkeley CBSA plot above, among others, that median household income alone is not going to be sufficient to accurately predict median home value, we are going to add other variables. We hypothesize that variables indicating the percent of the population that belongs to each of several racial and ethnic groups will also be predictive of home values. Our broader goal will then become not just to use them to predict median home value, but to explain median home value.

Defining Accuracy

Before we begin building models, let’s think a little bit about what it means for a model to be accurate. In regression analysis, the most commonly used metric is Mean Squared Error (MSE). Without going into all the mathematical details, the use of MSE as a metric of quality is essential to most regression analysis. That fact, along with some other assumptions, enables linear regression to be implemented simply and efficiently.

Among the assumptions that regression analysis makes is that of homoscedasticity, which means that the variance in the observations of the dependent variable, median housing price in our example, is the same regardless of the variable’s value or (since a linear relationship is assumed) the value of any independent variable, like median income in our case.

But it’s hard to imagine this condition holds in our case. Suppose, for example, that for homes that are actually worth $200,000, our input data captures them to within an accuracy of ±$20,000, or ±10%. What about homes worth $1,000,000? We don’t expect to be able to also measure their value to within ±$20,000, which would be ±2%. Instead, we might measure them to within the same ±10% as the lower priced homes, which would be ±$100,000. This is very much a classic case of heteroscedasticity, the opposite of homoscedasticity.

Similarly, if we built a model to predict median home prices, we would not expect its mean error to be the same in dollar terms on homes worth around $200,000 as on homes worth around $1,000,000. At best, we might hope for the same mean relative error. For example, the model might have a mean absolute error of 20% of the actual home value regardless of whether it was $200,000 or $1,000,000. In the former case the error would be ±$40,000 and in the latter it would be ±$200,000. This measure of error is called Mean Absolute Percent Error (MAPE).

Note that we use the mean of the absolute value of the error percentage rather than the mean of the percentage error. Otherwise a large positive error in half the cases (say an 80% overestimate) could be counteracted by a large negative error (an 80% underestimate) in the other half. The mean would be 0%, making this poor model look like an excellent one.

Optimizing model construction to minimize MAPE tends, especially in cases of heteroscedasticity, to produce very different models than optimizing for minimum MSE. If we optimize for MSE in cases like ours, the influence of the high-priced end of the market can overwhelm the influence of the lower priced end because absolute errors are larger there. Squaring compounds this effect. So in effect, we try really hard to be good at predicting prices at the high end of the market even if that means, in MAPE terms, we end up with a pretty bad model at the low end of the market. We’d like to have a good model at both ends of the market, so we’d prefer to optimize our model for MAPE rather than MSE.
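
A tiny numerical example, with made-up numbers, makes the difference concrete: under MSE, an error of the same relative size counts far more heavily at the high end of the market than at the low end, while under MAPE the two count equally.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

# Two homes, both over-predicted by 10% (made-up numbers).
y_true = np.array([200_000, 1_000_000])
y_pred = np.array([220_000, 1_100_000])

print((y_pred - y_true) ** 2)                          # [4.0e+08 1.0e+10]: the expensive home dominates MSE
print(mean_squared_error(y_true, y_pred))              # 5.2e+09
print(mean_absolute_percentage_error(y_true, y_pred))  # 0.10: both homes count equally
```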

Machine Learning

In contrast to regression analysis, most modern ML modeling techniques allow us to choose from a variety of metrics of quality, including MAPE. Software packages that implement these techniques, such as scikit-learn and XGBoost, which we used in this work, make it easy to choose MAPE, MSE, or any of a variety of other metrics.
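
As a rough sketch (not our exact configuration or search space), selecting an XGBoost model with scikit-learn using MAPE as the metric looks something like this. Here X would hold the median income and demographic features for one CBSA, and y the median home values.

```python
from sklearn.model_selection import KFold, RandomizedSearchCV
from xgboost import XGBRegressor

# Randomized hyperparameter search scored by (negated) MAPE under k-fold
# cross validation. The parameter grid below is illustrative only.
search = RandomizedSearchCV(
    XGBRegressor(),
    param_distributions={
        "n_estimators": [50, 100, 200],
        "max_depth": [2, 3, 4, 6],
        "learning_rate": [0.01, 0.05, 0.1],
    },
    n_iter=20,
    scoring="neg_mean_absolute_percentage_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=17),
)
# best_model = search.fit(X, y).best_estimator_
```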

Aside from this, modern ML techniques aren’t predicated on homoscedasticity and they excel at modeling non-linear relationships. A linear model might be able to do a decent job of explaining the relationship of median home value to median income, especially in cases like the Buffalo-Cheektowaga example above. But when we add race and ethnicity variables, that might no longer be the case. We can look at some scatter plots to see.

Let’s begin with a scatter plot of median home value vs. the percentage of the population in a block group that identifies as Black. This data comes from the Miami-Fort Lauderdale-Pompano Beach CBSA.

We have already removed outliers in home value and median income as discussed above.

The first thing to notice is that the distribution of median home values looks very different for block groups with small Black populations than for those with large Black populations. Above 40% Black population there is not much change in the distribution beyond a slight downward trend. But below 20%, and especially below 5%, there are some dramatic changes.

If we highlight block groups that are at least 40% Black (in orange) and then further emphasize the subset of those where median home values are at least $500,000 (in green), we see something quite dramatic:

Only 3 of the 525 block groups (0.57%) that are 40% or more Black have a median home value of at least $500,000.

If, for the sake of comparison, we look at block groups that are less than 5% Black, we see something dramatically different:

In this case, there are 354 block groups that are less than 5% Black and have median home value of at least $500,000 out of a total of 1,580 (22.41%).

We can dig deeper into the differences in these distributions and do a variety of statistical tests to compare them, possibly correcting for the influence of other variables. This is the approach Perry, Rothwell, and Harshbarger took, though they used 50% and 1% as thresholds instead of 40% and 5% and used a somewhat different set of features and so modeled a different effect.

What are the right thresholds to use in this kind of study? There’s no obvious answer other than that they should be chosen to produce interesting results. But that’s not very satisfying. And the interesting thresholds might be at different places in different CBSAs. We’d rather not have to choose thresholds like this a priori.

It’s still possible, if the distributions of other features and their correlations happen to be just right, that a linear model could fit well against median income, percentage Black population, and percentages from other racial and ethnic groups. We won’t go into details, but—and this should not come as a big surprise—this is not the case here.

Luckily, there is another approach we can take. One of the advantages of many ML approaches is that we can build a single model to generate a variety of insights about nonlinear effects without having to choose thresholds. But in order to use them effectively in our application, we need them not only to be good predictive models, but also to be interpretable, meaning we are able to explain why they make the predictions they do when given various inputs.

Interpretable Machine Learning

One of the common complaints about ML solutions is that while they can be good at prediction problems (with the caveat that errors are often concentrated in ways that disproportionately impact historically marginalized groups), they tend not to be particularly explainable, whereas a regression model will produce coefficients that are easily interpreted to indicate, for example, that for every 1% increase in variable X, the predicted value of variable y goes up by $1,200.

The good news on this front is that in recent years a number of techniques have been developed to interpret the predictions that ML models make. In particular, an approach using so-called Shapley values, developed by Scott Lundberg and colleagues and implemented in the SHAP open-source software package, explains each prediction as the sum of the relative contributions made by each of the input features. For example, SHAP can tell us something like

  • The ML model predicted that a particular block group B had a median home value of $395,000.
  • This was because:
    • The average median home value of block groups across the CBSA was $350,000.
    • B was impacted in the amount of +$37,000 because it was 58% white.
    • B was impacted by +$10,000 because it was 12% Asian.
    • B was impacted by -$11,000 because it was 22% Black.
    • B was impacted by +$9,000 because it was 6% Hispanic or Latino.
  • The mean across the CBSA and the impacts of the different demographic features add up to the final prediction for B. That is, $350,000 + $37,000 + $10,000 – $11,000 + $9,000 = $395,000.

This illustrates one of the fundamental strengths of the Shapley value approach. It explains the impact that each of the features had on any particular prediction in a way such that they add up to the value that was predicted. So now we have a powerful tool that lets us say exactly how the value of each feature contributed to any given prediction. It also lets us make broader statements like, “if a block group is more than 40% Black, home values are reduced by 8%.” As we will see below, we can make such a claim just by looking at the graph of the impact, as computed above, of the Black population feature on model predictions.
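
As a self-contained sketch of this additivity property, the following toy example (with synthetic data and a made-up relationship, not our actual features or model) fits an XGBoost model and confirms that the SHAP base value plus the per-feature impacts reconstructs each prediction.

```python
import numpy as np
import pandas as pd
import shap
from xgboost import XGBRegressor

# Synthetic stand-ins for median income and demographic fraction features.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "median_income": rng.uniform(30_000, 200_000, 500),
    "frac_black": rng.uniform(0.0, 1.0, 500),
    "frac_white": rng.uniform(0.0, 1.0, 500),
})
y = 2.5 * X["median_income"] * (1.0 - 0.3 * X["frac_black"])  # made-up relationship

model = XGBRegressor(n_estimators=50, max_depth=3).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one row of per-feature contributions per "block group"

i = 0  # any row
reconstructed = explainer.expected_value + shap_values[i].sum()
print(model.predict(X.iloc[[i]])[0], reconstructed)  # base value + impacts ≈ prediction
```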

Shapley values across an ensemble of ML models can be used to develop strong insights into the relationship between variables we can observe. In our case, these are home values, income, and demographics, but we are confident the same approach can be used in many other settings.

The Overall Process

Based on the observations above, we developed code for studying the impact of racial and ethnic demographics that works as follows:

  1. Download raw census data for all of the block groups within each of the 50 most populous CBSAs. The data was from the 2021 ACS5 data set, which is based on surveys conducted between 2017 and 2021.
  2. Construct features that are the percent of the population of each block group that identify as Hispanic or Latino, and the percent of the population of each block group that identify as not Hispanic or Latino but as members of one of the racial groups the U.S. Census defines, such as white, Black or African American, Asian, and so on.
  3. For each CBSA independently build an ML model using XGBoost that uses the ethnic and racial demographic features, along with a median income feature, to predict median home value. We use k-fold cross validation with a MAPE metric to judge the quality of the models and we use random grid search to optimize their hyperparameters. These are both widely used techniques in the ML community.
  4. Given the optimal hyperparameters from the previous step—which if you are not familiar with ML terminology simply means the best configuration of XGBoost we could find for a given CBSA—we construct a collection of 50 different models trained on different 80% samples of the data. The reason we do this is that training on different subsets can produce different models that perform differently on individual data points, even if their overall prediction accuracy is roughly the same. We want to find out if there is consensus among the explanations for the 50 models’ predictions.
  5. We use SHAP to generate an explanation for how each of the 50 models would predict the median home value in each of the block groups in the CBSA.
  6. We ensemble the 50 models—which means we make a prediction that is the mean of their individual predictions—and generate a SHAP explanation of the ensemble’s predictions from the SHAP explanations of each of the underlying model’s predictions.
  7. We plot and analyze the predictions of the previous two steps to understand why the ensemble makes the predictions it does. In particular, we look at one feature at a time and plot the relative Shapley value of that feature (the Shapley value of the feature divided by the final predicted value) against the feature value. We call the resulting graphs impact charts because they show the impact of various values of the feature on the model’s predictions.

All of this is implemented in Python code available in our GitHub repository. The repository also contains instructions for those who would like to try the code out for themselves.
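
For the curious, here is a highly simplified sketch of steps 4 through 6 above (again, not the exact code in the repository): train an ensemble of models on different 80% samples, explain every model’s predictions for every block group with SHAP, and average those explanations to explain the ensemble. Here best_params would come from a hyperparameter search like the one sketched earlier.

```python
import numpy as np
import shap
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor


def ensemble_shap(X, y, best_params, n_models=50):
    """Per-model and ensemble-averaged SHAP impacts for every block group in X."""
    shap_per_model = []
    for seed in range(n_models):
        # A different 80% sample of the block groups for each member of the ensemble.
        X_train, _, y_train, _ = train_test_split(X, y, train_size=0.8, random_state=seed)
        model = XGBRegressor(**best_params, random_state=seed).fit(X_train, y_train)
        # Explain the model's predictions for all block groups, not just its training sample.
        shap_per_model.append(shap.TreeExplainer(model).shap_values(X))
    shap_per_model = np.array(shap_per_model)  # (n_models, n_block_groups, n_features)
    return shap_per_model, shap_per_model.mean(axis=0)
```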

Expected Advantages

Before we move on to the results we obtained with our methodology, it is worth taking a moment to summarize the advantages we expected to find with this approach. They are:

  1. We don’t have to make any of the assumptions that linear regression makes, such as that the relationships are linear, that the data is homoscedastic, that errors are normally distributed, and so on.
  2. We can directly optimize the model to minimize MAPE instead of MSE.
  3. We can produce impact charts that demonstrate the impact of the values of any feature on predictions.
  4. We don’t have to choose a priori ranges of values of our demographic features before doing our analysis. We can just let the impact charts guide us.

In summary, this is a new way to think about influence. It isn’t just classical correlation. Instead, under the assumption that our ML model is accurate to begin with, it provides us with a new method for observing and reasoning about how features like demographic makeup impact a dependent variable like median home value.

Results

For each of the 50 most populated CBSAs, we built an ensemble as described above and produced impact charts for median income and each of 9 demographic features. We did this for both absolute dollar impact and relative impact, as we will see below. This produced a total of 1,000 charts, far more than we can include here. We will therefore only discuss a handful that are representative of some of the kinds of effects we observed. We do this with the caveat that every CBSA is at least somewhat unique and there are many interesting observations that can be made using our approach that we do not have the space to discuss here. Readers interested in browsing through all of the impact charts can do so at this web page we set up for that purpose.

Understanding Impact Charts

In order to guide you in how to understand and interpret impact charts, we are going to start with a handful of examples from the Hartford-East Hartford-Middletown CBSA in Connecticut, one of the smaller CBSAs we studied. It contains 809 block groups, whereas the largest CBSAs can have around ten times more.

Impact of Median Income on Median Home Value

We will begin with a chart showing the impact of the median household income of a block group on the median home values in that block group. This chart is closely related to what the SHAP package calls a force plot, but it uses all components of an ensemble to produce a notion of error bounds on the impacts it plots.

First, on the horizontal axis, we measure the median household income. This is the feature whose impact we are trying to determine. On the vertical axis, we measure the impact the feature has. In this case the impact we are measuring is on median home value.

Notice there is a $0 line about halfway up. $0 means no impact. This does not mean that the median price of a home is $0. It means that median household income did not contribute to the median price. In this graph, we can see this happens when median household income is just over $100,000. At this point, the predicted median home value will be the average of that quantity over all 809 block groups, plus the impact of any of the other features.

In this impact chart, the gray dots represent the individual impact of median household income on the prediction of each of the 50 models for each of the 809 block groups. So for each block group, there are 50 gray points, all at the same x value, which is the median income of the block group. These 50 points have different y values based on the impact each model assigned to the median income feature.

For each block group, there is also a green point, which is the average impact of all 50 gray points for the block group. Together, the gray points and the green point for any one block group show us the overall impact of median income on the prediction and some visual notion of the variance of that impact assessment across the members of the ensemble.
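
To make the construction concrete, here is a rough sketch of how such a chart could be drawn from the per-model and ensemble SHAP arrays produced by the sketch above. The real charts come from our impact chart code, so treat this as illustrative only.

```python
import matplotlib.pyplot as plt


def plot_impact_chart(X, shap_per_model, shap_ensemble, feature, feature_index):
    """Gray dots: one impact per model per block group. Green dots: ensemble mean."""
    x = X[feature]
    for m in range(shap_per_model.shape[0]):
        plt.scatter(x, shap_per_model[m, :, feature_index], s=2, color="lightgray")
    plt.scatter(x, shap_ensemble[:, feature_index], s=8, color="green")
    plt.axhline(0.0, color="black", linewidth=0.5)  # the $0 "no impact" line
    plt.xlabel(feature)
    plt.ylabel("Impact on predicted median home value ($)")
    plt.show()
```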

In this example, we can see that for most block groups with median income below $50,000, the green dot impact is right around -$50,000. That means if one of these block groups is racially typical of the area, such that none of the racial features contribute impact, then their median home value will be $50,000 less than that of the average block group in the CBSA.

The impact chart moves nearly monotonically upward as we move from left to right, which is what we would expect of the relationship between median income and median home value. Indeed, above about $75,000 in median income, the relationship is close to linear.

At the upper end of the median income scale, there are a small number of block groups where the impact is greater than $150,000. One thing to notice, however, is that the gray dots are more widely dispersed in the vertical direction around the green dots than they are at the lower end of the median income scale. That is because we are optimizing for MAPE across all the block groups whether high or low income, and so errors that are the same in relative terms are larger in absolute terms when the median income and median value are higher.

If whiteness drives median price up by $20,000 in a block group where median price is $300,000, the effect is the same in relative terms as in a block group where whiteness drives median price up by $120,000 and median price is $1.8 million. This is part of the reason we chose to optimize for MAPE, rather than Mean Absolute Error (MAE). We want to isolate the impact of the features we are studying as much as possible.

We can correct for this effect by projecting into relative space before plotting the impact chart. Now, instead of dollar impact, what we get is impact as a percentage of the final predicted median home value for the block group. The result looks like this:

Now, we can see that the gray points around any given green ensemble impact have a more consistent variance. At the low end, just below $100,000 median income, we can see that the impact bifurcates into a branch that only goes down to about -20% impact and another that goes down to -30%. Without going into all of the technical details, we sometimes see this as a side effect of the fact that our ML model contains decision trees. In cases like this, it is important to look at the full range of gray dots to see where the impacts fall.
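
The projection itself is simple. As a minimal sketch, each model’s dollar impacts for a block group are divided by that model’s predicted median home value for the block group, so the chart reads in percent of the prediction rather than in dollars. (Here predictions_per_model is assumed to hold each model’s predictions for every block group.)

```python
import numpy as np


def to_relative_impacts(shap_per_model: np.ndarray, predictions_per_model: np.ndarray) -> np.ndarray:
    """Convert dollar impacts, shape (n_models, n_block_groups, n_features), into
    fractions of each model's predicted value; the shape is unchanged."""
    return shap_per_model / predictions_per_model[:, :, np.newaxis]
```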

Impact of Demographic Variables on Median Home Value

Now let’s look at one of the demographic variables. In particular we will look at the impact of the fraction of the population that is Black on median home value and the impact of the fraction of the population that is white on median home value. First, the impact of Blackness:

In block groups that are 5% or more Black, the impact on housing prices is almost always between -3% and -7% (the green dots), with the full range of underlying models falling almost entirely between -2% and -10% (the gray dots).

On the other hand, for block groups that have very few Black residents, under 2%, the impact of the level of Blackness on the prediction of the ensemble is overwhelmingly positive, sometimes up to +7%.

What this tells us, at least for this CBSA, is that the presence of any non-trivial number of Black residents in a block group drives housing prices down dramatically and measurably.

Now let’s look at the effect of whiteness:

Most block groups that are less than half white see their home prices impacted negatively by 12% as a result. On the other hand, block groups that are over 63% white see home values impacted by +2% to +5%.

Taken together, these last two impact charts suggest a clear pattern of systematic anti-Black racism and white privilege in this particular CBSA. Is it unique, or is this a pattern that repeats itself to a greater or lesser degree in other CBSAs? Armed with our new methodology, we can find out.

Now that we understand and know how to interpret impact charts, we can begin to explore some CBSAs in more detail. We will look first at Los Angeles-Long Beach-Anaheim, then Memphis, and finally Philadelphia-Camden-Wilmington. For other CBSAs, please see this web page.

Los Angeles-Long Beach-Anaheim

The Los Angeles-Long Beach-Anaheim CBSA is the second most populous in the country after New York-Newark-Jersey City. We chose to explore it here because it contains some common patterns we see elsewhere.

First, let’s explore the impact of white residents on median housing prices. Here is our impact chart:

This is a pattern we see fairly commonly across the CBSAs we studied. It is in large measure the same thing we saw in Hartford-East Hartford-Middletown, but with a much larger set of data behind it. Largely non-white neighborhoods have their home values depressed as a result of their non-whiteness. Overwhelmingly white neighborhoods are higher priced than they would otherwise be.

In the middle of the range, from 25% whiteness to 70%, there is more than enough overlap of the gray dots, which are essentially our error bars, that we can’t confidently say that whiteness affects housing prices. But at the low end, and at the high end even more confidently, we can.

Even though we do most of our analysis in percentage terms, it is sometimes instructive to revert back to looking at absolute impact in dollars and cents. Here is what the impact of whiteness looks like if we do this:

The general shape of the chart is the same, but at the high end of whiteness there is a more dramatic upturn in the impact in dollar terms. This is presumably because whiter populations tend to be more affluent overall, due to a variety of factors that affect generational wealth. So even though measured as a percentage of median home value they benefit relative to their affluent non-white peers in the same block groups, when we look in pure dollar terms the impact is exaggerated relative to less white and less affluent areas.

It is remarkable that there are areas, like those at the far right of the chart, where the influence of people’s whiteness on their net worth, through its impact on the value of the homes they own, is over $100,000. It is even more remarkable when we remember that the model is also trained on median block group income, so this impact chart isolates the specific impact of race, on top of the impact of income.

Returning to relative impact as a measure, let’s look at the impact of Black population on median home prices.

Again, this is a fairly common pattern when it comes to Black population impact charts. There is a small upward impact when Black residents make up a very small percentage of a block group, often 5% or less. But as the numbers grow, there is a sharp downward trend and by 10% it levels off with a negative impact.

Now, let’s look at the impact of Asian residents.

It is almost a vertical mirror image of the impact of Black residents. Block groups with very low numbers of Asian residents are negatively impacted, but above a certain threshold, around 9% in this case, their presence has a relatively constant positive impact.

Finally, let’s look at Hispanic and Latino populations.

Now we see the most significant demographic driver of median home prices in the Los Angeles-Long Beach-Anaheim CBSA. Block groups that have very low numbers of Hispanic or Latino residents see median home value impacted by +20%. Highly Hispanic or Latino neighborhoods, in this case those that are 70% or more Hispanic or Latino, see an impact of -20%. That is a net impact of 40 percentage points on median home value that our impact analysis attributes to the presence or lack of presence of Hispanic or Latino residents.

Memphis

Demographically, Memphis is very different from Los Angeles-Long Beach-Anaheim. It is also much smaller. Our analysis also shows different factors impacting median home prices than we saw in Los Angeles, though there are also some familiar patterns.

Let’s begin again with the impact of whiteness.

At the most basic level, the conclusion is the same as before, that concentrated whiteness impacts median housing prices in a positive way. But the effect here is clearly quite a bit more dramatic than it was in L.A. This is especially true at the low end. Block groups that are less than 20% white see median home value impacted by -10% to -50%. It bears repeating. -50%.

On the high side, block groups that are 70% or more white see median home value impacted by +2% to +20%.

So how do other demographic groups fit into the mix? Let’s look at the Black population.

The same pattern emerges at the low end, but it is stretched out to almost 30% Black population. Between 30% and 70% there is a small (typically 1-2%) effect. Above 70% it decreases slowly down to -6% in block groups that are almost 100% Black.

So maybe what is going on here is that Memphis is different from L.A., where the presence of Black people in anything but the tiniest of numbers impacted median home values negatively. Maybe all the impact is tied up in the impact of low numbers of white people. Maybe that’s what matters.

Let’s look at the impact of Hispanic or Latino residents to see if we can find out more.

Here, the overwhelming majority of the block groups have less than 20% Hispanic or Latino population. Their presence has no significant impact below 5%; the impact then becomes positive, only to become negative again beyond 30%. However, the gray dots cross the horizontal 0% line essentially everywhere, so the effect does not appear to be significant.

It is also the case that in Memphis, there aren’t any majority Asian block groups. When we look at impact, we see that very low levels of Asian residents, below 4%, look to have a negative impact, but not obviously so, given that many of the gray dots cross above the 0% impact line. Above that, from 4% to 25%, there appears to be a positive impact, but the number of block groups in this range is relatively small.

So what is going on in Memphis? The models chose to assign extreme negative impact on median home value to low levels of white people. But at the same time, there aren’t enough Hispanic, Latino, or Asian residents for them to make up a large fraction of the non-white population. So the negative impact of low numbers of white residents is in some way an alternate version of the story that large numbers of Black residents impact median home prices negatively. And there are some additional negative effects at high levels of Blackness.

Philadelphia-Camden-Wilmington

The final CBSA we will look at in detail is Philadelphia-Camden-Wilmington. We begin with the impact of whiteness.

Below 30% whiteness, Philadelphia looks at least somewhat like Memphis, with negative impacts down to as low as -60%. Above 40% whiteness, however, it does not continue upward, but instead levels off in a band between 0% and +9%, with a slight additional upward trend at 90% whiteness and up.

This raises the question, will the impact of the Black population resemble Memphis? The answer is no. It looks more like L.A.

Below 5% Black population there is a weak positive impact on median home prices, between 0% and 6%. The impact drops sharply through 20% Blackness, where the impact is -4% to -10%. From there, it does not quite flatten out the way it did in L.A., but instead continues a slow decline down to a range of -6% to -8% impact in almost completely Black block groups.

The impact of Hispanic and Latino residents on median home prices structurally mirrors that of the impact of Black residents, though there are far fewer majority Hispanic or Latino block groups and the long downward trend is steeper for Hispanic and Latino populations than for Black populations. Impacts go as low as -15% in highly Hispanic or Latino block groups.

Finally, the impact of Asian residents.

The shape of the curve is similar to L.A., though the impact grows larger and there are very few majority Asian block groups.

Patterns and Differences

We saw a number of patterns emerge in the impact plots across the three CBSAs we examined. These tend to repeat themselves in other CBSAs. The difference is largely in degree and slope, as was the case in some of the comparisons we did above.

This suggests that whatever the underlying factors are, whether the lingering effects of redlining and predatory inclusion, prejudice against certain immigrant groups, or just ongoing systematic racism or individual racism in home buying decisions, they are common across different CBSAs in different parts of the country. We encourage readers to further explore the results at this web page.

Conclusions

In the previous section we saw several ways in which the presence or absence of certain demographic groups impacted median home prices in different CBSAs. As we mentioned, we really only scratched the surface here. But we have run the same analysis very broadly. We have produced 1,500 impact charts using this methodology and published them on this web page. Behind these charts are stories of hundreds of communities waiting to be told.

But even though we only looked at a small number of cases, a higher level conclusion started to emerge. The conclusion is that by fitting ML models and using ML interpretability techniques like SHAP, we can gain insight into the impact of racial and ethnic demographics on home values in ways that we would not have been able to, and that we haven’t seen other examples of, using regression analysis alone. In this case, we took Taylor’s qualitative conclusion and quantified it. We saw that while the details vary from CBSA to CBSA, there are recurring patterns, like the impact of concentrated whiteness, and of its near absence, on housing values.

We strongly believe that we have only taken the first few steps of a long journey here. We are confident this approach can be applied to many other data sets where regression analysis has been attempted in the past. In the U.S. Census data alone there are hundreds of other variables we could look at. We also believe that with more advanced ML approaches, beyond the techniques and optimization methods we used here, even more insights can be obtained.

Code and Data

All of the code we used to gather data, create features, fit models, optimize hyperparameters, interpret model predictions, and generate visualizations is contained in the open source software package that was once called rihmodel but is now called rihdata. To see the version of the code we used in this work, before impact charts were moved into their own library, see the tag PRE_IMPACTCHART_LIBRARY. We load the raw data from the U.S. Census API using the censusdis package. We encourage readers who are so inclined to examine our code and we welcome questions or comments.