Impact Chart Analysis 101

I have written about impact charts before (see here, here, and the OG post here), but I have neglected to give them a proper introduction. What are they, exactly? What do they do? And how do they compare to traditional tools like histograms and scatter plots?

Let’s say I’m just starting out with some data. I have some xi values. If I were a statistician or social scientist, I would probably call them independent variables; if I worked in machine learning, I would call them features. They look something like this:

I also have some corresponding y values, one for each row of xi values. I’ll call y either a dependent variable or a target, again depending on which academic tradition I come from. Let’s say my y’s look like this:
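As a concrete stand-in for the tables above, here is a small sketch that fabricates a data set of this shape. The names, sizes, and distributions are my own illustration, not the post’s actual data set; in the real data, y depends on the xi in ways we explore below.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the data described above (names, sizes, and
# distributions are invented for this sketch, not the post's actual data).
rng = np.random.default_rng(17)
n = 500

# Four features x0..x3 and a target y, one y per row of features.
X = pd.DataFrame({f"x{i}": rng.normal(size=n) for i in range(4)})
y = pd.Series(rng.normal(size=n), name="y")

print(X.head())
print(y.head())
```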

In order to get a handle on how these target y values might be related to the corresponding values of one of the features, say x3, I might plot a scatter plot of their values. To learn even more, I could fit a regression curve, either a line or something more flexible, like a quadratic. If I do that, I get this:

There isn’t a whole lot to see here. The cloud of points is pretty spread out, and the low r² scores indicate that neither fit is very good.
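This baseline comparison can be sketched as follows, on stand-in data of my own rather than the post’s actual data set. The point is just that when the true relationship is nonlinear and noisy, both a line and a quadratic can leave a lot of variance unexplained.

```python
import numpy as np

# Stand-in data: a noisy, nonlinear relationship between x3 and y
# (invented for illustration, not the post's actual data set).
rng = np.random.default_rng(17)
x3 = rng.normal(size=500)
y = np.exp(x3) + rng.normal(scale=2.0, size=500)

def r_squared(y, y_hat):
    """Coefficient of determination for a fitted curve."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Fit a line and a quadratic by least squares, then score each fit.
linear = np.polynomial.Polynomial.fit(x3, y, deg=1)
quadratic = np.polynomial.Polynomial.fit(x3, y, deg=2)

r2_linear = r_squared(y, linear(x3))
r2_quadratic = r_squared(y, quadratic(x3))
print("linear r²:   ", r2_linear)
print("quadratic r²:", r2_quadratic)
```

Because the quadratic family contains every line, its r² can never be worse than the linear fit’s, but neither captures the exponential shape well.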

What if, instead, I could get a chart like this:

Well, I can, and I did. This chart is called an impact chart. But what does it mean?

What an impact chart tells us is that, as far as the ML techniques behind the scenes (XGBoost and SHAP in this case) could determine from fitting and interpreting 50 different models, this is the impact that x3 has on y. There is no cloud of points, just a clear curve showing that there is essentially no impact when x3 < 0, and that the impact grows, slowly at first and then more rapidly, for larger values of x3. Looking at the impact chart is a qualitatively different experience from looking at either the scatter plot or the regression curves.

There is a green dot for every value of x3 we saw in the data, and its horizontal position in the chart is the value of x3. The vertical position of the dot is not the value of y, but rather the impact that value of x3 had on the value of y. Impact indicates how much the ML models think the value of y was pushed up or down because of the value of x3. The grey dots act as error bars, giving us a notion of how accurate the underlying ensemble of ML models thinks the green dots are.
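To make the notion of impact concrete, here is a deliberately crude sketch of the idea on invented data. It is not the post’s actual code: the impactchart package uses SHAP over an ensemble of XGBoost models to attribute impacts properly, whereas this sketch trains a single gradient-boosted model and simply measures how much each row’s prediction moves when we neutralize x3 by replacing it with its mean.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Invented data: y depends linearly on x0 and exponentially on x3.
rng = np.random.default_rng(17)
n = 1000
X = rng.normal(size=(n, 4))
y = X[:, 0] + np.exp(X[:, 3]) + rng.normal(scale=0.1, size=n)

model = GradientBoostingRegressor(random_state=17).fit(X, y)

# Neutralize x3: replace it with its mean so it carries no information.
X_neutral = X.copy()
X_neutral[:, 3] = X[:, 3].mean()

# A crude per-row "impact" of x3: how much the prediction changes.
# (SHAP does this attribution far more carefully.)
impact_of_x3 = model.predict(X) - model.predict(X_neutral)
```

Plotting impact_of_x3 against X[:, 3] would give a rough analogue of the green dots: near zero for small x3 and growing for larger x3.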

If there were a linear relationship between y and x3, then the green dots would lie very nearly on a straight line, as in this impact chart showing the impact of x0 on y:

Impact charts can take on all kinds of shapes. Here’s one that looks sinusoidal, as if there is some kind of periodic impact of x2 on y:


Going back to x3, the impact was not linear, and it was not sinusoidal. What was it? And more importantly, why should we believe that any of these impact charts accurately portrays the actual impact of the xi on y? What if the ML algorithms are just hallucinating?

In order to answer that question, we very deliberately constructed the data set we have been using. Unknown to the impact chart code, the data was produced by another piece of code that made each feature xi impact y in a specific, known way. After the fact, we can overlay these known impacts on the impact charts and see whether they match up. In the case of x3, we deliberately made the impact exponential. Here’s what it looks like on top of what the impact chart found:

Visually, the results are pretty undeniable. We added some noise to y in our data set, so the match is not perfect, but it shows that there is far more visual insight in an impact chart than in the scatter plot or simple regressions we showed above. As always, the grey dots are designed to act as error bars, showing us a measure of how accurate our code thinks the green dots are.
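A synthetic construction along these lines might look like the following sketch. The specific impact functions and coefficients here are my own illustration; the notebook’s exact choices may differ. The key property is that y is a sum of known per-feature impacts plus noise, so we can later overlay each known curve on the corresponding impact chart.

```python
import numpy as np

# Build a synthetic data set where each feature has a known impact on y
# (illustrative impact functions; the notebook's exact choices may differ).
rng = np.random.default_rng(17)
n = 500

x = rng.normal(size=(n, 4))
impacts = {
    "x0": 2.0 * x[:, 0],          # linear impact
    "x1": x[:, 1] ** 2,           # quadratic impact
    "x2": np.sin(3.0 * x[:, 2]),  # sinusoidal impact
    "x3": np.exp(x[:, 3]),        # exponential impact
}
noise = rng.normal(scale=0.5, size=n)
y = sum(impacts.values()) + noise

# Because we know each impact exactly, we can overlay these curves on the
# impact charts after the fact and check how well they match.
```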

We obtained similar results when we constructed synthetic linear, quadratic, and sinusoidal impacts and used our impact chart code to try to find them. This gives us some measure of confidence that when we look at data sets where we don’t know the impacts in advance, because we didn’t generate the data, impact chart analysis will still find whatever impacts happen to be there.

If you would like to look at the full notebook that generated the impact charts above and did quite a bit more analysis on the data, see the Synthetic Data.ipynb notebook in the impactchartdemo repository on GitHub. The README.md also includes directions on how to run the notebook yourself, either in a hosted cloud environment on mybinder.org or in a virtual environment on your own machine. Once you are comfortable with the synthetic data, there is also a demo using a real data set from the U.S. Census.

We have really only scratched the surface here, but I hope that this brief introduction inspires you to dive deeper, learn more about impact charts, and maybe even try them out on some of your own data in your own work. In addition to the impactchartdemo repository, there is also an impactchart repository that does all the heavy lifting of generating impact charts.