New Chapter Added

After I finished my book, got it all done and dusted, I had a revelation that I knew needed to be added to the book. As such, I’m adding in Chapter X – Why Predict? I’m releasing it first for free on my website, and in a few days, it will be in the printed and electronic versions. If you have already bought the book, or just want to read the chapter, follow this link. What follows here is a quick summary of the chapter.

Why add Why Predict? I realized that while I had been doing data science, I had just assumed the need for data science to be true. After some pondering and conversations, I was able to articulate the answer to the question…

Why predict?

Why predict? Because every goal contains, at a minimum, an implied prediction.

I’ve elaborated this into the new chapter, but I’ll summarize here. Take a goal, any goal. If you unpeel it, you will see a prediction buried within.

A Simple Matter of 5 Pounds

Let’s say I have a goal, “I want to lose 5 pounds.” Let’s look at the hidden prediction. In this case, the prediction is that if I don’t change my lifestyle, I won’t be 5 pounds lighter. The prediction is simply that I’ll maintain my current slightly overweight status.

If, given my current lifestyle, the prediction was that I was going to lose 5 pounds, the goal could be “after 5 more pounds, I need to stop losing weight.” Or if my prediction was that I’m losing weight too rapidly, the goal would be “I need to stop losing weight, I can’t afford to lose 5 pounds.” Similarly, the prediction could be that I’m going to gain 5 pounds and the goal may be to just not gain any more weight.

As you see, the goal is only meaningful in relationship to a prediction.

Data Science Gives Great Predictions

The simple realization that a goal is a difference from a prediction drives so much of data science. We can now predict with more accuracy than ever before. We can predict the customers who will leave us next month. We can predict which products are likely to have problems shipping. We can predict how many coupons will be redeemed.

The question is: if that prediction comes true, is that good enough?

If the result of the prediction is a situation that is good enough, then the good news is that you don’t need to change a thing. You’ll make your sales target, your customers will be happy, and your margins will be exactly what you need them to be.

Now you can focus your attention on those predictions that predict a future that isn’t good enough.

Remember: every goal is a change from a prediction.

More in the Chapter

The extra chapter goes into more detail on the linkage between goals and predictions. Also, in the chapter I cover more detail about the natural disconnect between people and predictions, as well as the pernicious problem of disingenuous predictions, and just as importantly, what predictions are socially unacceptable. I also cover the concepts of how, as we change from looking at historic data in order to make predictions into a world where data science provides the predictions, we also need to change the information we consume.

While the chapter is only part of the larger book, feel free to just read the chapter here.

A copy of this post is on my blog.

Chapter X – Why Predict

Chapter X – the important bit I realized after I put the book together.

Why Predict?

I’ll admit that the way that I see and communicate things is a little different. In our family we have a phrase for long, complicated explanations: we call them a “Russian Candy Machine.” To explain “Russian Candy Machine” would take a very long, complicated explanation of my parenting style, my family’s shared sense of humor, and how you get from the evolution of the spinal cord to a Russian Candy Machine. In other words, it would take a “Russian Candy Machine” to explain a “Russian Candy Machine.” Words and phrases can often be explained with a simple metaphor. Complex ideas and ways to look at the world, however, often require more than a simple metaphor; sometimes they need a Russian Candy Machine. The simple question “why predict?” needs a complex answer. But every complex answer starts with a simple insight. Here is the insight:

When we talk about setting a goal, we imply that we have already made a prediction. Or put another way, you can’t set a goal without first making a prediction.

For example, to set a goal of a 5% increase in sales for the next year, you need to at least have an implied forecast that “if you change nothing,” next year’s sales will change by X percent. It is the comparison to that forecast that makes the goal reasonable or not. For example, say you are a salesperson on my team and this year’s sales increased by 5% from the prior year. I, as the team leader, set a goal of a 20% sales increase for next year. You would probably react by indicating that I am crazy, and that I should keep a more level head when forecasting increased sales. That is because you have made a prediction in your head. My guess is that you thought, “we increased by 5% last year; at best we’ll increase by 5% again.” If, however, sales increased by 20% last year, then the sales goal of 20% may be more reasonable.

A simple “what happened last year will happen again” is a naïve prediction. Really, it is called a naïve forecast. So, if sales last year were one hundred fifty thousand, then a naïve forecast is for next year’s sales to also be one hundred fifty thousand. Another simple model is to take whatever the percentage change was for the prior period and apply it to the next period – this is the example above. Another simple linear type of prediction is to take the value of the increase and apply it. So, if sales went from one hundred thousand up to one hundred fifty thousand, the sales next year will increase by fifty thousand again.

Lest there be any doubt, these three simple methods are VERY effective. In fact, they are considered the gold standard by which more sophisticated predictions are judged. Even here, with these three basic forecast methods, we have a range of potential outcomes. Given sales going from $100,000 in the prior year to $150,000 last year, we have naïve ($150,000 again), repeat value increase ($200,000), and repeat percentage ($225,000).
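To make the arithmetic concrete, here is a minimal sketch of the three baseline forecasts using the $100,000 and $150,000 figures from the example. The function names are my own labels for illustration, not standard terminology from any library.

```python
def naive_forecast(prior, last):
    """Next period repeats the last period's value."""
    return last


def repeat_value_increase(prior, last):
    """Next period repeats the last absolute increase."""
    return last + (last - prior)


def repeat_percentage_change(prior, last):
    """Next period repeats the last percentage change."""
    return last * (last / prior)


prior_year, last_year = 100_000, 150_000

forecasts = {
    "naive": naive_forecast(prior_year, last_year),
    "repeat value increase": repeat_value_increase(prior_year, last_year),
    "repeat percentage": repeat_percentage_change(prior_year, last_year),
}

for name, value in forecasts.items():
    print(f"{name}: ${value:,.0f}")
```

Three one-line methods already span $150,000 to $225,000, which is why two people can look at the same history and disagree about whether a goal is reasonable.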

A goal is set above or below the basic forecast of “what happens if we keep going as we are going.” If my forecast methodology is percentage and yours is value increase, we already have a basic gap in understanding and expectations. Whatever reasoning I used to set the goal above the forecasted value is meaningless if we don’t even agree on the basic forecast.

Every time you, as a leader, hand out a goal, it comes packaged atop a forecast. It doesn’t matter if the forecast came from a simple method like the ones above or from a complex sales forecasting model; a goal is a change from the prediction.

And reasonable people can, and should, disagree on a forecast. Years ago, columnist Gregg Easterbrook started using the tag line “All predictions are wrong or your money back.” That comment should be tagged to the bottom of every sales forecast. The variation I use is “all predictions are wrong; it is just a matter of how wrong.” Back to our sales example, there are three values we can probably rule out as the exact sales for the following year, $150,000, $200,000 and $225,000. The sales forecast never hits exactly. In fact, the only forecaster who was right every quarter was Bernie Madoff and he was cheating to get it right every time. Or for those people who like to golf, the safest place to stand on a green is right next to the flagstick – it is just about the least likely point to be hit even though most golfers are aiming right at it.

We know and accept that the forecast is wrong, but we need that baseline for discussions and goal setting. Data science is an attempt to decrease forecast error, but it can never eliminate the error.

When you and your team think of “SMART” goals (simple, measurable, achievable, relevant, time-fenced), the achievable characteristic is based on a mutually accepted forecast. If we don’t agree on the forecast, we can’t agree on whether the goal is achievable or not. The next time you feel a disconnect on a goal, step back and talk about the forecast without the goal.

Most forecasts also have a social range in which they are acceptable. For example, for a large established company, having a single quarter with a 20% profit margin might be a cause for celebration. For a high-tech startup, a 20% profit margin (or any profit) is a hope and a dream. For a stable software company, a 20% margin would be a significant drop and a reason to panic. The type of company shapes an acceptable forecast for quarterly results.

Think of a coach of a team. Given that there are only three outcomes: win, loss, draw – how many times does a coach come out and predict a loss? The coach can’t because it isn’t socially acceptable, even though it may be an accurate forecast. Sure, when sitting at a wedding, you and your friends can have a funny conversational bet as to if the marriage might not work out, but don’t say that to the bride or groom. The social range of the forecast matters.

That brings us to the last type of forecast. Politely put, a disingenuous forecast. Bluntly put, a lie. This kind of forecast is the hardest to combat because, as we have said, all forecasts are wrong, and one more wrong forecast is hard to point out as bad. Except companies and careers can be destroyed by bad forecasts. If we are going to have sales over $200,000 then a $20,000 marketing spend isn’t a problem. But if we are only going to sell $15,000 then the marketing spend is a problem. Many a small business has been undone not just by missing a sales forecast, but by spending to a disingenuous forecast that was set by hope and not by data. You can’t have goals without a forecast, and you can’t plan your expenses without having expected revenue.

Probably some of the clearest examples of disingenuous forecasts come from politicians. For example, a politician will say that if we pass a law on wages it will change income for a certain number of their constituents. Or they’ll say that the economy will grow at a certain percent. Or that their tax program will have a certain benefit. And while we know reasonable people can disagree on a forecast, politicians have a special forecasting trick, and that is to pay consultants to make their forecast. When you have already set the number you want to have, it is easy to back into that forecast. Trust me, as a person who makes predictions professionally, the easiest prediction to make is the one my customer wants to hear.

We all see the results of the bad forecasts: somehow the legislation didn’t result in the change predicted, and we have all heard the excuses from politicians. The excuses usually sound like a version of the bad guy in Scooby-Doo, “my legislation would have worked, if it weren’t for those meddling {other political party}.”

The reason I point this out is not that I don’t like lying politicians. I don’t. It is that while we can see the political version of the results of bad forecasts, we often ignore that businesses also have a version of this. My favorite version of bad forecasts is the “excuse management” process by which organizations set up detailed processes to excuse away missed forecasts. Dashboards upon dashboards have been set up with the original intention to highlight failures in order to prevent them in the future. However, these dashboards are now used to excuse away poor performance. And Scooby-Doo comes back into play, “I would have hit my on-time measure, if it weren’t for those meddling {other department}.”

The looking back at the details of the past isn’t supposed to be to find excuses, even though today it most often is. The looking back is to inform your next prediction. You want to say that your on-time miss last quarter was driven in part by bad weather in the Rockies. OK, so next winter we need to adjust our prediction down or take steps to plan around weather conditions. The social reluctance to adjust the forecast down to lower than the department goal creates the excuse management process. If you are not honest with yourself on the prediction, how are you going to be honest with the results?

Data science provides some relief from this circle of dishonest predictions leading to dishonest excuses for missing the dishonest prediction. You can “blame” the data science results for the honest, but less than goal, prediction. Because data science only makes predictions from history, it can only give a “if nothing changes” type of prediction. Data science also gives you the details on which specific activities / customers are more likely to miss. Armed with a collection of predicted bad results, you and your team now get to get creative in trying to figure out how you will beat the prediction.

This changes job roles; team roles shift from looking back looking for patterns (and making excuses) to taking a set of predictions and getting creative about how you are going to “beat the house.” This is a fundamental change to job roles going forward. Jobs will require more creativity because while data science can give you a detailed prediction with the characteristics of what makes that prediction more accurate, it can’t tell you what to do about it. Beating the house is fundamentally a creative activity.

There is a large cautionary item to point out here. As mentioned, data science is only making predictions based on what has happened. If there are problems with what has happened in the past that may be socially unacceptable, data science will predict those to keep happening. If there is a racial, gender, or age bias to past activities and that data is included as potential characteristics, any good algorithm will find that pattern. This can leave people facing questions they may not like or may not be positioned to deal with. On the other hand, if the bias is a problem, shouldn’t it be fixed? I’m not saying there is a right or easy answer, I’m just saying that sometimes key predictive characteristics aren’t ones that are pleasant to deal with.

So how, practically, do you make the shift from excuse management dashboards to the new “beat the house” world? The answer is, in a way, already staring you right in the face. It is the new face of digesting data science: prediction dashboards.

There is already a version of prediction dashboards. My guess is that there is at least one chart that you look at that has a “goal” line for future months with the current period actuals mapped against it. Something like this:

This chart contains a prediction! It is labeled as “goal”. So, let’s make the first change to this chart. Don’t primarily measure yourself against your goal, measure yourself against the predicted value! I’m not saying not to have the goal, I’m saying if you want your marriage to work, you need to first acknowledge that current divorce rate is 0.32% per person annually in the US. (If that number looks low, I picked the version of the statistic that is the smallest for surprise value – for more on this see my upcoming book “How to Lie With Accurate Statistics.”)

So, changing to comparing to the prediction we get:

As you can see, June, despite being below the goal line, is green because it was above the prediction! Also, we have the rest of the year predicted. The primary goal is to beat the prediction – the goal line is there for reference – the goal line is the agreement between management that mixes in the social expectations above the prediction. For positive metrics, the goal is above the prediction. For negative metrics, the goal would be below the prediction. Some goals can also be a high / low range around a prediction. Also, in this case having a standard annual goal for reference is just fine. Sometimes the goal may be a percentage above the prediction or a fixed number above the prediction or a repeat of the prior month’s amount above the prediction. And, yes, that is a direct call back to the naïve forecasting.

And, yes, one may also notice then that the difference between the forecast and the goal is a kind of prediction in and of itself. The reason for that is that part of the challenge in setting a goal is the analysis of how hard it is to beat the house. Are there lots of additional sales opportunities out there, so that with just a simple bit of work sales can beat the prediction by 5%? Or is the market full of competitors, so that you will have to work day and night just to move sales up by 1% over the prediction? There is a complex discussion about how data science can identify market flexibility, but for now, we’ll keep focusing on providing information to help beat the prediction.
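The coloring rule described above can be sketched in a few lines. This is only an illustration under my own assumptions: the function name, the `higher_is_better` flag, and the June numbers are made up and are not taken from the chart.

```python
def status_color(actual, predicted, higher_is_better=True):
    """Color a period by whether the actual beat the prediction, not the goal."""
    if higher_is_better:
        beat_prediction = actual >= predicted
    else:
        beat_prediction = actual <= predicted
    return "green" if beat_prediction else "red"


# Made-up numbers for one month: below the goal, but above the prediction.
june = {"actual": 105, "predicted": 100, "goal": 110}

print(status_color(june["actual"], june["predicted"]))  # green
```

The goal never enters the calculation; it stays on the chart as a reference line. For a negative metric such as late shipments, pass `higher_is_better=False` so that coming in under the prediction counts as a win.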

Anyone who has worked with dashboards has worked with click through. That is where, when you click on the chart, you are taken through to another chart that is a view of the data at a lower level of aggregation. Usually we set these levels of aggregation at human pattern recognition levels or levels of organization within the company. For example, we might see sales by sales region or sales team. In our new prediction world, we don’t care nearly as much about the sales team – we only care when we are evaluating individual performance. If we focus on how sales teams did in the past, we are pushing towards excuse management. “My team would have hit my sales number, if it weren’t for the meddling customer service department.”

Instead of looking at levels that help people adjust their human created predictions, we need to look at the key characteristics that drove the machine learning prediction. Often, we think of these key characteristics as having a linear relationship. Not so. For example, divorces in the first five years tend to happen more frequently among the young and the old, but marriages in the 30’s have a lower level of immediate breakdown. With a good combination of algorithms and data manipulation, a good data scientist should include these kinds of non-linear relationships in their model.

What to look at is a breakdown of the populations by the key characteristics that drove the prediction. What we want to do is compare the predicted population to the current population. The goal not being to second guess the prediction but to use the characteristics to fuel the creative process to beat the house.

This chart is called a heat map or tree map. It is particularly good for showing 2 different characteristics. It is not listed on the chart, but the size of the box in this example is the size of the current business. The color is the percent change of business in the predicted model. Besides some description about the size of the box and the color, I’d probably add to this chart an additional note that the overall growth is 3%, but for this example the chart is designed to be simple. As you can see, the size lets you focus on the biggest items first while taking into consideration which items are predicted to change the most.

Just looking at this chart the first reaction is probably “why is small business predicted to make such a drop!” But the right phrase should be “how can we prevent a drop in small business?” The reason for this is a subtle change – don’t try to re-guess the prediction, try to beat it. So, we need to get creative – is the small business segment itself simply shrinking? Is there a problem with how we are talking to small businesses? What programs / activities have helped on the large customers and can they be translated to the smaller companies? Basically – what creative steps can we take to make this future happen better for us?

This sample breakout should just be one of the top 5 or so most predictive key characteristics. When a model is refreshed, the key characteristics probably won’t change greatly from model build to model build, but one or two may move on or off the list as the model is refreshed.

One other note: these breakouts of the customers should be chosen by data science. The difference between small and large companies shouldn’t be set by what is organizationally convenient but by how the companies’ activities cluster together.

Which brings us back to the question of “why predict?” First, predictions have always been there, often naively overlooked. Second, once you add in better predictions with data science, you begin to shift your focus from looking backwards to looking forwards. Third, by looking at how the prediction is different from the present, you are given clues as to what you want to change.

Back to how jobs change. We are moving away from having jobs looking at past data to find patterns of deviation from desired in order to make process changes to improve results. We are moving towards data science making a detailed prediction with key predictive characteristics. The job now becomes how to design a creative process in order to beat the predicted value. This is a big irony at the heart of the data science revolution: Application of data science increases the need for employees to be more creative. (And not more creative with their excuses!)

An added benefit is that by having employees focused on improving the future instead of excusing the past it creates a more optimistic environment. Who doesn’t want to work in a more optimistic environment? Everyone is looking forward to “beat the house” instead of backwards and being beaten up.

Why predict? Because you have been. Because you need to.

The End of the Cowboy Era in Data Science

Data Cowboy

If you are just hopping into utilizing data science, I’m sorry to report but you are now officially another wave late. The cowboy era is over. Sometimes being a wave late is a good thing, sometimes it is not. In this case, it is a little good and a little bad. Or like most things data science, “it depends.”

The Cowboy Era

First, let me describe the era of the cowboy data scientist. This era featured bold thinkers who often came from a different background but who had drive and curiosity to solve new problems in new ways. This person wasn’t necessarily an easy fit within a corporate structure. For example, I worked with a senior data scientist who, despite being paid by the hour, never recorded his time. He fed his natural disinclination to the paper processing to such an extent that he left thousands of dollars unbilled.

Now, I’m not saying that the cowboy data scientists are or were wrong. I would, in fact, suggest that part of the success of many data science projects over the past several years has been due to larger companies embracing their inner cowboy and, if I can mis-quote Ms. Frizzle, they “took chances, made mistakes, and got messy!” Companies need to continue to have some cowboys going forward; it is just that the days of having them do just data science are past.

What Changed?

What has changed is the introduction of data science platforms, like Dataiku. One of the benefits of these products is that you can easily go from development to production with just a few clicks of the button. Instead of running processes on desktops, teams can now collaborate on solutions. With both this ease of production-alization and the increased collaboration comes the key to the end of the cowboy: standards.

The standards are a family affair. If you have been in the IT game for a while, it will seem like a family reunion with standards like naming conventions with its cousin appropriate documentation, change management standards with its cousin signoff, and the drunken uncle that is environmental standards.

Naming Conventions

In data science, what needs a naming convention? What doesn’t? The big ones: project names, table names, and code. All the steps need to follow some naming standard. Assuming you are using intermediate or temporary tables those should follow a standard. Within a project, the SQL, the R/Python, and even the command line code segments need to follow a naming standard.

Years ago, I was working on a project and there was a tricky bit of logic that needed to be worked out. The smartest person on the team, Dan, solved the problem with a clever bit of code. The solution became known as the “Dan Logic”. This was bad for Dan, because as long as he was within the company, he would always be the owner of the “Dan Logic”. Even after he had moved 4 steps away from the project team, if there was a question or problem, Dan was always contacted. Dan didn’t want to be contacted, but who else can understand Dan Logic better than Dan? Moral of the story – never let your name be associated with a set of code or logic.

Project code names are fun and can be clever; who doesn’t want to be involved in “Project Disco Dynamite”? Well, me: I can’t stand disco, but you get the point. Fun and wacky names have their place in projects, but they should be kept out of naming standards. When later looking back, can you tell the contents of tables tbl_BoomGoesTheDynamite and tbl_BoomShakaLaka? Keep the naming convention simple and descriptive of the contents.

For appropriate documentation, I’m not just talking about commenting your code. The business owner, the team members, and the project goal all should be documented. Who worked on the project isn’t nearly as important as for whom and for what reason the project was created. That, and if you have good version control, you intrinsically know who worked on what. For data science, this documentation needs to include the data science specific questions like, why a certain threshold was chosen or why a certain algorithm was chosen over another. These notes should be kept with the project so that when, not if, there are questions on the project, the notes will be available.

Change Management

In the cowboy years, there was no difference between development and production. When you got the results you wanted, you implemented it or just as frequently, your results were only for a presentation of insight. But now, you can have a bit of data science that is intrinsic to daily business. This calls for treating the data science like the enterprise worthy code that it is. This means keeping the data science team out of the production environment and having a third party execute the elevation of code into the production environment.

This also means appropriate business owner signoff and even a formal signoff process. Since there currently aren’t many tools for migrating data science projects, there should then be clearly documented steps. There also should be scheduled times and days of the week that the data science projects move into production. The times depend on your natural business cycle, but the habit and discipline should be set in place. Migrations to production will need to have a clear window of sign off so that if the migration time is 3:00 on a Thursday, the sign off can’t be at 2:59.
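A sign-off window is easy to state in code, which also forces you to settle the “less than” versus “less than or equal to” question up front. The sketch below is illustrative only: the two-hour lead time, the afternoon migration time, and the calendar date are my assumptions, not part of any standard.

```python
from datetime import datetime, timedelta


def signoff_is_valid(signoff_time, migration_time, lead_time=timedelta(hours=2)):
    """A sign-off counts only if it lands at or before the window close.

    The window closes `lead_time` before the migration. The two-hour
    default is an assumed value; pick whatever fits your business cycle.
    """
    window_close = migration_time - lead_time
    return signoff_time <= window_close


# Hypothetical migration slot: a Thursday at 3:00 p.m.
migration = datetime(2024, 1, 4, 15, 0)

print(signoff_is_valid(datetime(2024, 1, 4, 14, 59), migration))  # False
print(signoff_is_valid(datetime(2024, 1, 4, 13, 0), migration))   # True
```

Here the window close is inclusive, so a sign-off at exactly 1:00 p.m. passes. Whichever way you decide it, write it down so nobody spends 15 minutes arguing about it.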

Environmental Standards

One of the biggest challenges to standards for data science is that, in general, the model must be made with production data. A few years back I got into a “passionate” discussion with a very competent and strong-minded leader with a history of leading successful development projects. At the time, I was the reporting guy. We were discussing why I was so nonchalant about testing my reports in the development environment. If I recall correctly, it was testing a report that was specific to the prior day activity. Sure, they were doing the test scenarios, but the amount of data feeding into the reporting system was but a mere fraction of what would be the daily data deluge from the system once it went live.

I was adamant that the simple test scenarios were so incredibly inadequate for testing purposes that spending my time elsewhere was of better benefit, and my friendly antagonist was of the opinion that if you check thoroughly with what you have, you can be ready for production. I pity the poor person who was sitting between us. She described it as being stuck between parents who were arguing.

Let’s just say the discussion ended with “we agree to disagree.”

The reason for this story is that while reporting required a leap of faith for the environmental requirements, Data Science requires an even larger leap because it doesn’t just need production sized data to enable good testing, it requires production data to be built in the first place. This changes the traditional “development” box into a “laboratory” box.

Laboratory Server

The laboratory server contains production data (anonymized where appropriate). It is where the data science tests are performed. The reason it isn’t a production box is that the data science team needs to be able to create and modify the environment while at the same time it needs to be continuously fed production data. The source for the production data could be pulls from a transaction system or from a production reporting server. There needs to be a process to review intermediate tables quarterly so that any abandoned tables are cleaned up. By its very nature, the laboratory server will be sized like a production reporting server. Data science projects may produce results for consumption by leadership from this box, so keeping it clean and organized is a must.

Liminal Space

Let’s talk about what I like to call the “liminal” server. I like this definition of liminal space: “A liminal space is the time between the ‘what was’ and the ‘next.’ It is a place of transition, waiting, and not knowing. Liminal space is where all transformation takes place, if we learn to wait and let it form us.” OK, a bit too new age-y for me in general, but you get the point. A liminal server is a server between QA and Production. Sometimes it is called a “staging” server, but with many data science projects the transition isn’t so much a quick cutover as an easing in. The liminal server is used to run production modeling to compare the results of a new model to an existing production model. The liminal server needs to be able to react like a production server, but it doesn’t have the same storage requirements. Models should have a clearly defined length of time in the liminal space.
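As a rough sketch of what “comparing the results of a new model to an existing production model” can look like, the snippet below runs both models over the same records and summarizes how far apart they are. Everything here is hypothetical: the function name, the tolerance, the record fields, and the two stand-in models are placeholders for illustration.

```python
def compare_models(records, production_model, candidate_model, tolerance=0.05):
    """Score the same records with both models and summarize the gap."""
    diffs = [abs(production_model(r) - candidate_model(r)) for r in records]
    disagreements = sum(d > tolerance for d in diffs)
    return {
        "mean_abs_diff": sum(diffs) / len(diffs),
        "share_disagreeing": disagreements / len(diffs),
    }


# Stand-in models: each takes a record and returns a score between 0 and 1.
def production_model(record):
    return 0.5


def candidate_model(record):
    return 0.5 if record["amount"] <= 100 else 0.75


records = [{"amount": 50}, {"amount": 150}]
print(compare_models(records, production_model, candidate_model))
```

In practice the records would be real production transactions and the summary would be tracked over the model’s time in the liminal space, so the team can see where the new model departs from the old one before anyone has to trust it.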

The Full Flow

Let’s look at the full flow:

  • Laboratory – Production data to build models / test hypotheses
  • Development – Where data science first interacts with APIs
  • Quality – Where interfaces to the data science are confirmed
  • Liminal – Where data science models are compared against production transactions
  • Production – Where the real money is

It Matters for the Cowboys

I know that someone is going to read this who knew me back in the day and break out laughing. I’ll own up to my own cowboy past, I was that guy who if the sign off window ended at 1:30 p.m., at 1:29 I was trying to get everything signed off. I might even have pushed it to exactly 1:30 and then spent 15 minutes claiming that the end of the window was a “less than or equal to” and not just a “less than”.

It is important to understand that the above guardrails are there to produce better data science; the rules aren’t there for rules’ sake. Just as the cowboy laments “Give me land lots of land under starry skies above. Don’t fence me in,” so too will the data cowboy lament the processes above. But here is the kicker – you need the cowboy. The rules are to help them, and you need them to help build the rules.

You don’t want the cowboy to flee to greener pastures, you want them as an integral part of the process. If you bog your data cowboys down, you will lose them. And your data science team isn’t as good as the process or the tools, your data science team is only as good as your people.

One of my colleagues says that all data scientists are natural born disruptors. In a typical reaction by a data scientist, I’ll add that I disagree with this assessment. I’ll concede that many of the best data scientists I’ve met are cowboys. But just as the technology has matured, so too must the processes and by extension so too must the people. It can’t be forced onto people by saying “it is for your own good.” Just make sure as you implement the processes, you keep your people enthused about the changes.

End of the Era

With the end of the cowboy era comes the standardization of names, environments, and processes. This gets us better, stronger, more production-worthy data science, but it doesn’t mean you need to put your current cowboys out to pasture.

Next time I’ll talk about how you engage your cowboys to build the process. This comes with the understanding that you can’t cowboy the process.

Machine Learning and IOT

A few days back a friend approached me on a project that he was consulting on. It involved looking at a continuous distillation device and incorporating machine learning models to automatically control the device to achieve the highest yield possible. This is an interesting problem because it is different from what the classic “supervised” or “unsupervised” models can easily solve. Below are the notes I sent to him outlining a direction for a solution. This solution works broadly in many situations that use IOT (“internet of things”) devices. I’ve found that explaining “how” I set up problems can be helpful for those who are just beginning their journey to leverage machine learning. I hope this helps others besides my friend…

As someone who has done a few data science projects with IOT devices (my last one was a sensor on a display to detect when an item was pulled out of an endcap display), I think of three ways to use the IOT data: programmatically, with a formulaic model, or with a machine learning model.

Option 1: Programmatically

Programmatic control is for when you know the threshold and you want to maintain it. For example, if you want the liquid temperature set to 173 degrees, you can have a temperature sensor and then turn the heating coil on or off depending on whether the temperature is above or below the goal. You know the goal measurement and you have direct control of the device(s) that set that measurement. You essentially set up a series of “IF” statements that react to known conditions with known responses.
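
Here is a minimal sketch of that kind of “IF” logic in Python. The function name, the "on" / "off" / "hold" return values and the small deadband (a tolerance that keeps the coil from flickering right at the target) are my own assumptions for illustration; a real controller would talk to whatever interface your hardware exposes.

```python
def heater_command(temperature: float, target: float = 173.0, deadband: float = 0.5) -> str:
    """Decide what to do with the heating coil from one temperature reading.

    The deadband is an assumed tolerance around the target, not part of the
    original description; set it to 0 for a pure above/below rule.
    """
    if temperature < target - deadband:
        return "on"    # too cold: turn the coil on
    if temperature > target + deadband:
        return "off"   # too hot: turn the coil off
    return "hold"      # close enough: leave the coil as it is


# Example readings from a hypothetical temperature sensor
for reading in (170.0, 173.2, 175.0):
    print(reading, heater_command(reading))
```

The point is that nothing here is learned from data: the goal and the response are both known ahead of time.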

Option 2: Formulaic

You want a model when you are dealing with characteristics that you can’t directly control. For example, say you are looking at the temperature at the top of the column head. There isn’t a device that directly controls it; the temperature is a result of what is happening below it. You may have a few items that influence it: chiller temp, chiller pump speed, and pot temperature. With 3 items, you would probably want to do about 9 runs where you keep 2 of the 3 variables constant and vary the third. Since these items probably have a nearly linear relationship, you should be able to figure out something like “for each 1 degree of pot temperature increase, the column head changes 0.75 of a degree.” These are simple formulaic models, usually only built for the range that matters (e.g. that change of pot temp to column head temp is only true between 160 and 180 degrees – you don’t care if the pot is at 100 degrees).
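
As a rough sketch of how one of those formulas could be worked out, the Python below fits a straight line to a handful of runs where only the pot temperature was changed. The readings are made-up numbers chosen to reproduce the 0.75 example, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical runs: chiller temp and pump speed held constant, pot temp varied
pot_temps = [160.0, 165.0, 170.0, 175.0, 180.0]
head_temps = [150.0, 153.75, 157.5, 161.25, 165.0]

slope, intercept = fit_line(pot_temps, head_temps)


def predict_head_temp(pot_temp: float) -> float:
    """Apply the formula, but only inside the range the runs covered."""
    if not 160.0 <= pot_temp <= 180.0:
        raise ValueError("pot temperature is outside the tested 160-180 range")
    return slope * pot_temp + intercept


print(f"column head changes {slope:.2f} degrees per degree of pot temperature")
```

Refusing to predict outside the tested range is a deliberate choice: the formula is only known to hold where you actually ran the device.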

Option 3: Machine Learning / Data Science

Where you would want to apply machine learning is when we are measuring the outcome (i.e. pct. alcohol) and we want to know which controllable things (temp, pump speed, etc.) work best in combination with the uncontrollable ones (outside temp, air pressure, etc.). The key here is that we are not trying to infer a few simple rules, as we would with a simple formulaic model.

We need to first gather “what controllable settings in history, combined with the uncontrollable at the time, had what output.” We then build a model that predicts what the pct. alcohol is based on a given combination of both controllable and uncontrollable inputs. This model will be able to predict never before seen results based on the historic result and characteristics. This is the first step – a model that if given a set of our environmental measurements can predict the pct. alcohol.

Reinforcement Learning / Optimization

The challenge here is that we are trying to control multiple controllable variables to account for the multiple uncontrollable variables. Most “supervised” models are only capable of predicting one variable. What is needed is a second model, one that will query the first model with options. This is a variation on “reinforcement learning” where the first model is temporarily frozen while the second model performs an optimization on the answers given by the first model. Optimization algorithms are often the forgotten sibling in the “supervised” vs. “unsupervised” model discussion.

Basically, you have a second model running an optimization algorithm that feeds variations on the controllable variables to the first model (while holding the uncontrollable variables constant). The first model predicts the resulting pct. alcohol. The second model loops through combinations until an optimal result is returned. The best combination of controllable values is then sent to the appropriate controller.
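
To make the two-model idea concrete, here is a small Python sketch. Two loud assumptions: the “first model” is a stand-in function with a made-up formula (in practice it would be whatever trained model you have), and the “second model” is the simplest optimizer I can think of, an exhaustive grid search over candidate settings. Any other optimization routine could take its place.

```python
from itertools import product


def predict_pct_alcohol(pot_temp: float, pump_speed: float, outside_temp: float) -> float:
    """Stand-in for the frozen first model (hypothetical formula, not real data)."""
    best_pot = 173.0 + 0.05 * (outside_temp - 70.0)
    return 90.0 - 0.5 * (pot_temp - best_pot) ** 2 - 0.02 * (pump_speed - 40.0) ** 2


def best_settings(outside_temp: float, pot_temps, pump_speeds):
    """Query the first model with every candidate combination of controllable
    settings, holding the uncontrollable outside temperature constant, and
    return the combination with the highest predicted pct. alcohol."""
    candidates = product(pot_temps, pump_speeds)
    return max(candidates, key=lambda c: predict_pct_alcohol(c[0], c[1], outside_temp))


pot_grid = [170.0 + 0.5 * i for i in range(13)]     # 170.0 through 176.0
pump_grid = [float(s) for s in range(20, 61, 5)]    # 20 through 60

# The winning pair would be handed to the controllers as their new targets
settings = best_settings(70.0, pot_grid, pump_grid)
print(settings)
```

A grid search is fine for two or three controllable variables; with more, the number of combinations grows quickly and a smarter optimizer earns its keep.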

You also want to keep capturing data even as the system moves towards the optimal settings chosen by the second model, and you want to periodically retrain the first model to incorporate all of the additional data.

The other nice thing is that the three solutions build on each other. You build controllers for solution 1 (programmatic) so that you can set a few guidelines with solution 2 (formulaic), which can then capture enough data to feed solution 3 (2 models – prediction interrogated by optimization). The optimization model then feeds its results back to solution 1 as a “target” for the programmatic control of the devices.

I hope this explanation helps others facing a similar challenge. If you have questions on anything here, don’t hesitate to reach out to me.

ETL Basics

We often think of the foundational skills of data science being Data Manipulation and Management (SQL), Graphing/Charting, Modeling, and Development (R/Python). But as the toolset changes, so do the skills needed. With the emergence of data science platforms like Dataiku, the development skillset has become less important. Similarly, these tools have nearly removed the need for a separate data engineering stage of the project. As a result, the process of building data models that follow ETL (Extract-Transform-Load) principles becomes a more important skill set for the data scientist.

I’ll have a more detailed post coming shortly on specific guidelines of ETL, but for now, let’s focus on what the overall picture of an ETL should look like.

Master Data and Transactional Data

First, we need to talk about two kinds of tables: Master Data and Transactional. Usually, master data is the data describing the objects the database cares about. Frequently this data is not necessarily version controlled but is often “as-is”. An example of this would be a “customer” and within the customer master data could be the “sales plan” the customer is on. If you go into the transactional system and look up the customer, you might see a drop-down list of the possible options. In our imaginary system, we’ll say that the customer can be on the “free tier”, “silver” or “gold” sales plans.

The second type of table, the transactional, is generally the transactions between the master data elements. For example, if you have a customer, you might have a table that is the list of all of their orders. Usually in a normalized database, the customer listed on the order table is just the customer number or some other unique identifier. There is also probably a subsequent table that joins the orders with the products sold. “Products” would be another master data table.

Basic Recommendations for an ETL

ETLs frequently de-normalize the database from multiple tables into one large table. This process is similar to the steps used to build reports and data warehouses. If you have access to someone who has built (good) reporting systems, look to tap their expertise. Otherwise here are my “basics”:

  • Don’t do everything in Python/R – you are doing data manipulation, so use a database tool
  • Run the extract in the memory of the source database
  • Limit the transaction extraction by a variable set to be only those greater than the start date/time of the last successful run
  • Transfer to destination database prior to joins
  • Execute joins/transformations in destination database
  • Build any aggregations needed to speed performance
  • Build appropriate checks into the process. Every ETL will eventually fail. By using the last successful run variable, the ETL should be set up so that it will automatically catch up. A full re-build may be necessary, and that is just a matter of setting the date/time of the start of the last good run to the date/time of the first transaction.
  • Build the join tables assuming you will re-use them. Aggregations and transformations may vary as time goes by, so do those in a separate step.
  • Unless you are working with very large data sets, use as many intermediate tables on your destination database as you need. You’ll get better performance results and the extra storage space is no big deal.
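
The “last successful run” idea from the list above can be sketched in a few lines of Python using the built-in sqlite3 module. Everything named here (the orders, stage_orders and etl_state tables, the run_extract function, the sample rows) is hypothetical, and two in-memory databases stand in for the real source and destination systems.

```python
import sqlite3

# Source system: a hypothetical transactional table
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, created_at TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10, "2024-01-01 09:00:00"),
    (2, 11, "2024-01-02 09:00:00"),
    (3, 10, "2024-01-03 09:00:00"),
])

# Destination: a staging table plus the "last successful run" variable
dest = sqlite3.connect(":memory:")
dest.execute("CREATE TABLE stage_orders (order_id INTEGER, customer_id INTEGER, created_at TEXT)")
dest.execute("CREATE TABLE etl_state (last_good_run TEXT)")
dest.execute("INSERT INTO etl_state VALUES ('2024-01-01 12:00:00')")


def run_extract(run_started_at: str) -> int:
    """Copy only the rows newer than the last successful run, then advance the marker."""
    (last_good_run,) = dest.execute("SELECT last_good_run FROM etl_state").fetchone()
    # ISO-style timestamps stored as text compare correctly as strings
    rows = source.execute(
        "SELECT order_id, customer_id, created_at FROM orders WHERE created_at > ?",
        (last_good_run,),
    ).fetchall()
    with dest:  # the load and the marker update commit together or not at all
        dest.executemany("INSERT INTO stage_orders VALUES (?, ?, ?)", rows)
        dest.execute("UPDATE etl_state SET last_good_run = ?", (run_started_at,))
    return len(rows)


loaded = run_extract("2024-01-03 12:00:00")
print(f"loaded {loaded} new rows")
```

Because the marker only moves when the load commits, a failed run simply leaves it where it was and the next run catches up. A full re-build is the same code with the marker set back to before the first transaction.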

The Challenge of Turning a Master Data table into a Transactional Table

One last bit of advice, for the problem where your master data does not contain its change history and you need that history for analysis. Besides taking a baseball bat to the developers and demanding they re-design the transactional system to benefit you, you’ll need to build a compare function that compares a saved snapshot to the current master data table. Then capture the deltas between the old table and the current table and append those to a running table of transactions. Use the date that you execute the load as the date inside the transaction table. There is one weakness to this approach, and it cannot be conveniently worked around: if you skip over any days of processing, then those days of data are “lost”. I’m sorry, but they just are lost. There is no way to re-create history because all you have is a picture of it before and after. Without a set of data in between you are simply at a loss.
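
A bare-bones version of that compare function might look like the Python below. I’m using plain dictionaries keyed by customer number to stand in for the snapshot and the current master data table, and the sales plans from the earlier example; the function and field names are made up for illustration.

```python
def capture_changes(previous: dict, current: dict, load_date: str) -> list:
    """Compare yesterday's snapshot with today's master data and return one
    transaction row per new, changed, or removed customer."""
    changes = []
    for customer_id, plan in current.items():
        old_plan = previous.get(customer_id)  # None means the customer is new
        if old_plan != plan:
            changes.append({"customer_id": customer_id, "old_plan": old_plan,
                            "new_plan": plan, "load_date": load_date})
    for customer_id, old_plan in previous.items():
        if customer_id not in current:        # customer disappeared from master data
            changes.append({"customer_id": customer_id, "old_plan": old_plan,
                            "new_plan": None, "load_date": load_date})
    return changes


yesterday = {101: "free tier", 102: "silver"}
today = {101: "gold", 102: "silver", 103: "free tier"}

change_log = []  # the running table of transactions
change_log.extend(capture_changes(yesterday, today, "2024-03-02"))
for row in change_log:
    print(row)
```

After each run, today’s table becomes the saved snapshot for the next comparison. Notice that the load date is the only date you have, which is exactly why skipped days cannot be recovered.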

As I mentioned, the process for building ETLs has been around for quite some time. If you come into data science today from a background of building them, you have a leg up. Otherwise, you’ll want to incorporate good ETL practices into your process of building models.

What Factors went into the Model?

What Factors?

One of the questions I often get asked about a data science prediction model I’ve built is a version of “what are the key characteristics of this model?” Given that many models (e.g. neural networks) don’t show the characteristics and some that do (e.g. random forests) can be hard to follow, I’m often placed in the awkward situation of saying:

“We looked at them as we built the model, but now we just accept the model.”

Understandably, someone not familiar with data science often finds this an inadequate answer. However, let me show you how the answer I give is both complete and correct. And to do that I’ll start with the results.

The Results of the Model

Another point of confusion about data science is “what is the result of the model?” The short answer is a table. The slightly longer answer is a table which has each record that needed to be predicted, the characteristics used by the model, and a prediction. That table can have one row (a single prediction) or multiple rows (batch of predictions). Whatever process could have been done to the record can now be done to the record and the prediction (send an e-mail to those customers, direct the predicted ‘yes’ results to a promotional web page, apply a discount, etc.) Also, the results can be directed to graphs / charts.

For many applications, charts should be built to allow for the “human pattern recognition” to see high level information about the predictions. The key factor in these charts should be to look at the predictions in comparison to the overall population. The reason is one of the truisms of data science predictions:

The point of most data science predictions is to try to make the prediction wrong.

You predict what customers are likely to churn in order to prevent those customers from churning. You predict what cases are likely to not be delivered on time in order to somehow get those cases there on time. You predict what customers would also like but haven’t yet bought so that they will now buy that product. You predict so that you can take steps to “beat the house.”

So, how do you beat the house? By making smart and creative guesses based on patterns you see in the charts based on the differences between the predicted and non-predicted population. And here is the big insight on how to beat the prediction:

It doesn’t matter if the machine learning used the characteristic in building the model, if you see the pattern and think there is a way to use it to beat the house, you should use it. To beat the model, you look at the results of the model, not how the model was built.

As It is Made

As the model is being built, however, there should be frequent and regular checks with models that are easier to interpret. This is important for multiple reasons. The first reason is that it is important to confirm that the data feeding the model isn’t contaminated with the results you are trying to predict. Let’s say, for example, you are building a model to predict the winning team of basketball games. If the historical data you are using to train the model contains a field for “point differential of game”, you have a problem. The problem with the field is that you don’t know its value until after the game is played. The field is related to the outcome you are looking to predict and is not something you would know before the game. Sometimes it isn’t obvious that a field is related to the outcome until the algorithm picks up the characteristic as being highly predictive. Trust me from experience on this one: if you have data that is contaminated like the point differential example, the algorithm will find that characteristic and find it quickly.
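
One cheap way to get an early warning, before any model is trained, is to check how strongly each field lines up with the outcome on its own. The Python sketch below does that with a plain correlation on made-up game data; the field names, the numbers and the 0.9 cutoff are all assumptions, and a flagged field is only a prompt to go ask “would I know this before the game?”, not proof of contamination.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def suspicious_fields(columns: dict, outcome: list, threshold: float = 0.9) -> list:
    """Names of fields whose correlation with the outcome looks too good to be true."""
    return [name for name, values in columns.items()
            if abs(pearson(values, outcome)) >= threshold]


# Hypothetical history: 1 means the home team won
home_won = [1, 0, 1, 1, 0, 0, 1, 0]
columns = {
    "point_differential": [12, -9, 7, 6, -15, -8, 5, -6],  # only known after the game
    "home_rest_days": [2, 3, 1, 3, 2, 1, 2, 3],            # known before the game
}

flagged = suspicious_fields(columns, home_won)
print(flagged)
```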

The second reason to look at the characteristics used as you are building the model relates to the whole “causation vs. correlation” game on which data science is based. A field could be highly predictive because it relates to some root cause (causal) or it could simply be a coincidence (correlation). This is where subject matter expertise (SME) is critical to building a model. As the model is being built, an SME needs to look at some of the key characteristics picked by a more transparent algorithm. The SME should be able to tell a story based on those characteristics and it should make sense to them. Some examples would be “I see how customers who contacted customer service more than 3 times in a year and are high dollar customers would be more likely to leave – high pay plus problems equals disgruntled customers leaving – makes sense” or “shipments going over the Rocky Mountains in winter are more likely to have weather delays – mountain roads with snow – check.”

Allow for the Unexplainable

Once the more explainable algorithm has run, the data scientist should then use some of the less explainable algorithms. Here is the next key: if one algorithm has found a certain characteristic is predictive, other algorithms are highly likely to use the same characteristic. Most algorithms use a combination of characteristics but how they are put together and selected can make a difference. Usually it isn’t a large difference, but if an explainable model is 70% accurate and a difficult to explain model is 78% accurate, usually you go with the more accurate, but difficult to explain model. The assumption being that both models share a large similarity, just that the harder to explain model found some additional connections in the data.

And that gets me back to my answer “we looked at them as we built the model, but now we just accept the model.”

The Basic Steps of a Data Science Project

One of the barriers that many people and companies have to doing data science is simply not knowing the basic steps of a data science project. So, let’s talk about the basic steps to do a data science project by talking through an example. For the example I’ll use the model that I use when I teach data science modeling. The basic steps are: Baseline, ratios, outliers, group, predict, and validate.

The Goal/Challenge

The model for this example is an “Equity Picker” model. The goal is to see if by looking at information published in companies’ annual reports, we can predict if a stock will increase its equity value. Notes, details of the calculations and why equity value are for a later discussion. Our goal is to pick a group of stocks that grow in a desired range. We aren’t trying to predict every stock’s exact growth, only to filter out those that we might want to buy.

So, we have a table (think of a big excel spreadsheet) that contains several years of history of stocks. The first column is a simple 1 or 0 to indicate if the stock grew in equity in the range we find acceptable (between 5% and 50%). The rest of the columns are characteristics. This is the information from the annual reports. The characteristics range from ACCOCI (Accumulated Other Comprehensive Income) to TBVPS (Tangible Assets Book Value per Share). There are approximately 100 or so different columns.


Before we begin, we need to get a baseline. What is the percentage of stocks that match our desired equity profile naturally? Simple answer: 26% of the stocks. This sets our baseline for predictions. If I randomly just grab a stock, it has a 26% chance of being in the desired group. I want to do much better than that.


Our first step in predicting is to add in some ratios. I won’t go into details as to why, but let’s just say that frequently ratios are better than raw values for building predictions. From working with the data, and trying out several ratios, the best ratio for this example is per equity. After adding a column for each value divided by equity, we now have slightly less than 200 or so different columns. We don’t put existing ratios (like the TBVPS) into another ratio, so not all columns get a “per equity” column as well.


Our second step is to remove outliers. Almost all data contains some outliers, and outliers are hard to predict. Since I’m trying to predict just “winners”, hard to predict stocks can be ignored. Like a good Omaha citizen, I love me some Warren B. However, dog, your stock is wack. Berkshire Hathaway stock price is about 62 times greater than any other stock. The rest of the ratios for BRK-A are similarly goofy. BRK-A is a great stock, but it makes for bad prediction data. Not only does it not predict well, but because some of its values are so much larger than other stocks, it skews the curve. So, big outliers like BRK-A are removed.


By looking at the data, we can tell that stocks that post a negative income have only a 13% chance of posting an increase in equity the following year. Now, this sub-group may be highly predictable (it isn’t), but it is probably a bad place to look for gold. Again, I’m looking to predict winners and not looking to predict every stock. So simply putting all of the negative net income stocks into a separate table and ignoring them for now works fine. After a few more cuts of data (high asset stocks and stocks with a large amount of non-US data both get pushed to the side) we have a smaller group but with a higher chance of success. In this group, a random pick has a 34% chance of being a winner. Already, we are improving our odds.
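
The arithmetic behind “improving our odds” is just a hit rate before and after setting a sub-group aside. Here is a toy Python illustration; the eight rows are invented (they are not the real data, so the percentages will not match the ones above), and the column names are placeholders.

```python
def hit_rate(rows) -> float:
    """Share of rows where the stock grew in the desired range (grew == 1)."""
    return sum(r["grew"] for r in rows) / len(rows)


# Hypothetical rows: one per stock-year
stocks = [
    {"ticker": "AAA", "net_income": 5.0, "grew": 1},
    {"ticker": "BBB", "net_income": 2.5, "grew": 0},
    {"ticker": "CCC", "net_income": 8.1, "grew": 1},
    {"ticker": "DDD", "net_income": 0.4, "grew": 0},
    {"ticker": "EEE", "net_income": -1.2, "grew": 0},
    {"ticker": "FFF", "net_income": -3.0, "grew": 1},
    {"ticker": "GGG", "net_income": -0.7, "grew": 0},
    {"ticker": "HHH", "net_income": -4.4, "grew": 0},
]

baseline = hit_rate(stocks)

# Push the negative net income stocks into their own table and set it aside
negative_income = [r for r in stocks if r["net_income"] < 0]
remaining = [r for r in stocks if r["net_income"] >= 0]

print(f"baseline {baseline:.0%}, negative income {hit_rate(negative_income):.0%}, "
      f"remaining {hit_rate(remaining):.0%}")
```

No prediction has happened yet; simply removing a poor hunting ground raises the chance of a random pick being a winner.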

The thing about companies is that people naturally group them into segments like “technology”, “services”, and “consumer goods”. I’m not saying these aren’t great groupings, I’m just saying that within the “services” grouping, Disney might not share the same characteristics that are predictive as Union Pacific. Put another way, groups were put there for human ease, not for improving our data science predictions (although the human ease groups can sometimes help predictions). Our fourth step then becomes, let machine learning pick the groupings. After some back and forth working and tweaking models, an algorithm is chosen that breaks the remaining stocks up into 4 distinct groups.


The fifth step is then to see whether any of the groups, whether chosen manually or by machine learning, are highly predictable. It turns out two of the groups are. One is precise to 68% and the other to 67%. Building the model has turned a losing bet (27%) into a winning bet (67%)!


The last and final step before putting my money where my blog post is, is to do a large amount of double checking. I know I’m not perfect and it is important to make sure that each step of the preparation process is accurate. Also, it can be helpful to slice the data in different sections to make sure the model is truly predictive and there isn’t, for example, a one-year surge that overwhelmed the rest of the data. Also, also, while many algorithms create “blind” predictions, it is important to run at least one model that shows which characteristics feature more prominently in the model. The algorithm can find non-causal coincidences and I don’t like to invest in those.


So, our steps again are: get the data and create a baseline; enhance with thoughtful ratios; remove outliers; if needed, parse out sub-groups to do predictions on; predict on sub-groups; confirm predictions.

It is “that easy” to get data science results. Ok, it isn’t really that simple, but now that you know the process, you can begin to take steps to leverage this in your own business.

For the record, the class I teach this in is the Omaha Data Science Academy. Feel free to enroll. Also, when I teach the class and build the model, I use Dataiku. In my opinion, it is the fastest and best way to do data science.

For a more comprehensive look at data science, please see my book Leading a Data Driven Organization.

If you have questions or comments on the article or about the model, please reach me on LinkedIn.

Leading a Data Driven Organization

Cabri Group is proud to announce:

Leading a Data Driven Organization: A Practical Guide to Transforming Yourself and Your Organization to Win the Data Science Revolution (Kindle Edition)

Leading a Data Driven Organization – A Practical Guide to Transforming Yourself and Your Organization to Win the Data Science Revolution — is a book designed for all levels of leaders within an organization. Filled with stories and real-world examples, the book walks through the concepts of data science with an eye for what leaders need to know. From “What Data Science Does” to “Organizational Challenges,” the book makes sure you “Know if the Answers are Right.” With sections titled, “How Data Science is Done” and “Time is Not Your Friend” you get straight talk about business challenges and opportunities. Whether you are faced with your first data science project or you are just thinking about how to use the latest data science tools for your company’s benefit, this book is for you.

The lessons of Leading a Data Driven Organization apply to anyone from a c-suite member to a first-time manager. One of its core premises is that, as far as data science goes, you either “get it” or you “get replaced.” Read this book if you want to be in the former group.

The author, Gordon Summers, is a data science consultant who has helped companies from large Fortune 500 corporations to small not-for-profit companies. He also teaches data science and has helped many leaders understand and implement data science within their organizations. In this book, the author’s casual style helps demystify an emerging technology with no difficult formulas to memorize.

You’ve Probably Been Doing Random Wrong

The random function rand() appears in many, many programming languages and tools, everything from Microsoft Excel to R to SQL. This post isn’t about the quality of random number generators; it is about the distribution of random numbers. (Although if you are curious, here is a very interesting video talking about random number generators in C++.)

Normally (*) the random numbers are distributed evenly throughout the range. However, for most situations, I would say that the data should follow Benford’s law (wiki link for those curious). If you are replicating “natural” processes through a system or if, like me, you are randomly selecting from a ranked list, you want to follow a distribution like Benford’s law.

I’m a big fan of Benford’s law as a quick and dirty way to check numbers as they come into a project. As a practical first check, the spread of the frequency of the first digit can tell you right away if the data looks good. Many data sets in the wild don’t exactly follow the law; Wikipedia gives good examples of where it won’t apply.

Some programming languages do allow you to change the distribution pattern with a function. There is, however, a quick and easy way to change any basic rand() function into a better distribution:

(10^rand()) /10

Back to my example where I am pulling randomly from a ranked list. By using the quick and dirty adjustment to the distribution, I’m able to pull the lower ranked entries at a frequency which is more in line with what their usage will be. This better simulates what will really happen while still choosing at random.
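
If you want to see the effect for yourself, here is a quick check in Python. It assumes a rand() that is uniform between 0 and 1 (Python’s random() is), draws a large sample through the formula, and compares the leading digits with what Benford’s law expects, log10(1 + 1/d) for digit d. The function name, the seed and the sample size are arbitrary choices of mine.

```python
import math
import random
from collections import Counter


def skewed_rand(rng: random.Random) -> float:
    """The (10^rand()) / 10 adjustment: a value between 0.1 and 1, weighted toward 0.1."""
    return (10 ** rng.random()) / 10


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 100_000
draws = [skewed_rand(rng) for _ in range(n)]

# Each draw lies between 0.1 and 1, so its leading digit is the integer part of draw * 10
first_digits = Counter(int(d * 10) for d in draws)

for digit in range(1, 10):
    observed = first_digits[digit] / n
    expected = math.log10(1 + 1 / digit)
    print(f"{digit}: observed {observed:.3f}  Benford {expected:.3f}")
```

The two columns should track each other closely, with 1 as the leading digit about 30% of the time and 9 under 5%.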


(*) – yeah that is punny

Second Weekend Results Recap

The scoreboard for our Machine Learning lower seed winning project:

10 wins out of 20 (50%)

$605 profit on $2,000 worth of simulated bets (30% ROI)

Our Machine Learning lower seed winning project was looking to predict, as accurately as we could, a lower seeded team winning in the NCAA tournament. Our goals were to get 47% of the picks right and to hope for a mere 10% ROI. We beat both of those goals. We practically doubled the 26% baseline of the historic lower seed winning rate and more than doubled the lower seed winning percent in this tournament (21%). We never expected to predict 100% of the upsets; however, 10 out of the 13 lower seed wins were predicted by us.

Our other goal was to show with a simple demonstration how Machine Learning can drive results for almost any business problem. With that in mind, let’s recap what we did and how we did it.

Our Machine Learning Algorithm, which we call Evolutionary Analysis, looked at a comparison of 207 different measures of college basketball teams and their results in prior tournaments. It selected ranges of those 207 measures that best matched up with historic wins by lower seeded teams. We then confirmed that the range was predictive by testing the selected ranges against a “clean” historic data set. This comparison is how we got our goal percent and ROI.

Then we did what any good business person does, we acted. We published our forecasts before each round was played and our results above speak for themselves.

Machine Learning is the power to find patterns in data where previous analysis has found none. Our methodology assured that what we were seeing was predictive. No doubt luck is involved (Wisconsin vs. Florida or many of the other times it came down to the final minute). The overall success, however, speaks for itself.

That is the formula for success on any Machine Learning project: a data set with a large number of characteristics, a measure of success, the expertise to execute an effective project and the courage to succeed. If this sounds like something that your business could use, please contact Gordon Summers of Cabri Group or Nate Watson of CAN today.

And for those who are curious, the algorithm indicates that Oregon vs. North Carolina matches the criteria for an upset.

Here are our collected picks (dollar sign indicates correct pick):

East Tennessee St. over Florida
$ Xavier over Maryland
Vermont over Purdue
Florida Gulf Coast over Florida St.
Nevada over Iowa St.
$ Rhode Island over Creighton
$ Wichita St. over Dayton
$ USC over SMU
$ Wisconsin over Villanova
$ Xavier over Florida St.
Rhode Island over Oregon
Middle Tennessee over Butler
Wichita St. over Kentucky
Wisconsin over Florida
$ South Carolina over Baylor
$ Xavier over Arizona
Purdue over Kansas
Butler over North Carolina
$ South Carolina over Florida
$ Oregon over Kansas