What Factors Went into the Model?

What Factors?

One of the questions I often get asked about a data science prediction model I’ve built is some version of “what are the key characteristics of this model?” Given that many models (e.g., neural networks) don’t show their characteristics, and some that do (e.g., random forests) can be hard to follow, I’m often placed in the awkward situation of saying:

“We looked at them as we built the model, but now we just accept the model.”

To someone not familiar with data science, this is, understandably, often thought of as an inadequate answer. However, let me show you how the answer I give is both complete and correct. To do that, I’ll start with the results.

The Results of the Model

Another point of confusion about data science is “what is the result of the model?” The short answer is a table. The slightly longer answer is a table containing each record that needed a prediction, the characteristics used by the model, and the prediction itself. That table can have one row (a single prediction) or many rows (a batch of predictions). Whatever process could have been applied to the record alone can now be applied to the record plus its prediction (send an e-mail to those customers, direct the predicted ‘yes’ results to a promotional web page, apply a discount, etc.). The results can also be fed into graphs and charts.
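To make that concrete, here is a minimal sketch of what that output table looks like, assuming a pandas/scikit-learn workflow; the column names, data, and model choice are all hypothetical.

```python
# A hypothetical churn model whose "result" is a table: one row per record,
# the characteristics the model used, and a prediction column appended.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Historical records with known outcomes, used to fit the model.
train = pd.DataFrame({
    "contacts_last_year": [0, 5, 1, 7, 2, 6],
    "monthly_spend":      [20, 90, 35, 110, 25, 95],
    "churned":            [0, 1, 0, 1, 0, 1],
})
features = ["contacts_last_year", "monthly_spend"]
model = LogisticRegression(max_iter=1000).fit(train[features], train["churned"])

# New records needing predictions -- a batch of three rows here,
# but the same table shape works for a single row.
new_customers = pd.DataFrame({
    "customer_id":        [101, 102, 103],
    "contacts_last_year": [4, 0, 8],
    "monthly_spend":      [85, 30, 120],
})
new_customers["predicted_churn"] = model.predict(new_customers[features])
print(new_customers)  # each record, its characteristics, and a prediction
```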

For many applications, charts should be built so that human pattern recognition can pick out high-level information about the predictions. The key in these charts is to compare the predictions against the overall population. The reason for that comes down to one of the truisms of data science predictions:

The point of most data science predictions is to try to make the prediction wrong.

You predict which customers are likely to churn in order to prevent those customers from churning. You predict which cases are likely to miss their delivery dates in order to somehow get those cases there on time. You predict what customers would also like but haven’t yet bought so that they will now buy that product. You predict so that you can take steps to “beat the house.”

So, how do you beat the house? By making smart and creative guesses based on the patterns you see in the charts, specifically the differences between the predicted and non-predicted populations. And here is the big insight on how to beat the prediction:

It doesn’t matter whether the machine learning algorithm used the characteristic in building the model; if you see a pattern and think there is a way to use it to beat the house, you should use it. To beat the model, you look at the results of the model, not at how the model was built.
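As an illustration of looking at the results rather than the model’s internals, here is a minimal sketch of the predicted-versus-overall comparison, assuming a scored table like the one above; the names and numbers are hypothetical.

```python
# Compare the predicted group against the overall population on each
# characteristic -- large gaps are the patterns a human can act on.
import pandas as pd

scored = pd.DataFrame({
    "contacts_last_year": [4, 0, 8, 1, 6, 2],
    "monthly_spend":      [85, 30, 120, 40, 95, 25],
    "predicted_churn":    [1, 0, 1, 0, 1, 0],
})
characteristics = ["contacts_last_year", "monthly_spend"]
predicted = scored[scored["predicted_churn"] == 1]

comparison = pd.DataFrame({
    "predicted_group": predicted[characteristics].mean(),
    "overall":         scored[characteristics].mean(),
})
print(comparison)  # in practice this comparison would be charted
```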

As It Is Made

As the model is being built, however, there should be frequent and regular checks using models that are easier to interpret. This is important for multiple reasons. The first reason is to confirm that the data feeding the model isn’t contaminated with the result it is supposed to predict. Let’s say, for example, you are building a model to predict the winning team of basketball games. If the historical data you are using to train the model contains a field for “point differential of the game,” you have a problem. You don’t know the value of that field until after the game is played; it is derived from the outcome you are trying to predict, not something you would know before the game. Sometimes it isn’t obvious that a field is related to the outcome until the algorithm picks up that characteristic as being highly predictive. Trust me from experience on this one: if your data is contaminated like the point differential example, the algorithm will find that characteristic, and find it quickly.
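A quick way to catch that kind of contamination is exactly the transparent-model check described above. Here is a minimal sketch, assuming scikit-learn; the game data and field names are hypothetical.

```python
# Fit a quick, transparent model and look at which fields dominate.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

games = pd.DataFrame({
    "home_win_pct":       [0.60, 0.40, 0.70, 0.30, 0.55, 0.45],
    "away_win_pct":       [0.50, 0.60, 0.40, 0.70, 0.50, 0.65],
    "point_differential": [8, -5, 12, -9, 3, -4],  # known only AFTER the game
    "home_team_won":      [1, 0, 1, 0, 1, 0],
})
features = ["home_win_pct", "away_win_pct", "point_differential"]
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(games[features], games["home_team_won"])

# If one field carries nearly all the importance, be suspicious --
# here point_differential is the outcome in disguise, and the tree finds it.
for name, importance in zip(features, tree.feature_importances_):
    print(f"{name}: {importance:.2f}")
```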

The second reason to look at the characteristics used as you build the model relates to the whole “causation vs. correlation” game on which data science is based. A field could be highly predictive because it relates to some root cause (causation) or simply by coincidence (correlation). This is where subject matter expertise is critical to building a model. As the model is being built, a subject matter expert (SME) needs to look at some of the key characteristics picked by a more transparent algorithm. The SME should be able to tell a story based on those characteristics, and it should make sense to them. Some examples would be “I see how customers who contacted customer service more than three times in a year and are high-dollar customers would be more likely to leave – high pay plus problems equals disgruntled customers leaving – makes sense” or “shipments going over the Rocky Mountains in winter are more likely to have weather delays – mountain roads with snow – check.”
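One way to give an SME something readable is to export the rules of a transparent model as plain text. Here is a minimal sketch, assuming scikit-learn’s export_text; the data and thresholds are hypothetical.

```python
# Turn a shallow decision tree into rules an SME can sanity-check,
# e.g. "more than 3 service contacts and high spend -> likely to churn."
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

customers = pd.DataFrame({
    "service_contacts": [0, 5, 1, 7, 2, 6, 4, 0],
    "monthly_spend":    [20, 90, 35, 110, 25, 95, 100, 30],
    "churned":          [0, 1, 0, 1, 0, 1, 1, 0],
})
features = ["service_contacts", "monthly_spend"]
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(customers[features], customers["churned"])

# Prints nested if/else rules with the thresholds the tree chose.
print(export_text(tree, feature_names=features))
```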

Allow for the Unexplainable

Once the more explainable algorithm has run, the data scientist should then try some of the less explainable algorithms. Here is the next key: if one algorithm has found that a certain characteristic is predictive, other algorithms are highly likely to use the same characteristic. Most algorithms use a combination of characteristics, but how those characteristics are selected and put together can make a difference. Usually it isn’t a large difference, but if an explainable model is 70% accurate and a difficult-to-explain model is 78% accurate, you usually go with the more accurate but difficult-to-explain model. The assumption is that both models share a large overlap; the harder-to-explain model simply found some additional connections in the data.
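That trade-off can be checked directly by scoring an explainable model and a harder-to-explain model on the same held-out data. A minimal sketch, using scikit-learn’s built-in breast cancer dataset so it runs as-is; your data and the exact accuracy gap will differ.

```python
# Score an explainable model against a harder-to-explain one on held-out data.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

explainable = LogisticRegression(max_iter=5000).fit(X_train, y_train)
black_box = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# If the opaque model wins by a meaningful margin, the usual call is to ship
# it -- both models largely lean on the same predictive characteristics.
print(f"explainable accuracy: {explainable.score(X_test, y_test):.2f}")
print(f"black-box accuracy:   {black_box.score(X_test, y_test):.2f}")
```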

And that gets me back to my answer: “We looked at them as we built the model, but now we just accept the model.”