Machine Learning and IoT

A few days back a friend approached me about a project he was consulting on. It involved looking at a continuous distillation device and incorporating machine learning models to automatically control the device to achieve the highest yield possible. This is an interesting problem because it is different from what the classic “supervised” or “unsupervised” models can easily solve. Below are the notes I sent him outlining a direction for a solution. This approach works broadly in many situations that use IoT (“internet of things”) devices. I’ve found that explaining “how” I set up problems can be helpful for those who are just beginning their journey to leverage machine learning. I hope this helps others besides my friend…

As someone who has done a few data science projects with IoT devices (my last one was a sensor on a display to detect when an item was pulled out of an endcap display), I think of three ways to use IoT data: programmatically, with a formulaic model, or with a machine learning model.

Option 1: Programmatically

Programmatic control is for when you know the threshold and you want to maintain it. For example, if you want the liquid temperature set to 173 degrees, you can have a temperature sensor and then turn the heating coil on or off depending on whether the temperature is above or below the goal. You know the goal measurement and you have direct control of the device(s) that set that measurement. You set up, essentially, a series of “IF” statements that react to known conditions with known responses.
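As a sketch, that “IF” logic might look like the following. The `control_step` helper and the small dead band are my own additions, not part of the original setup; they just keep the coil from chattering on and off around the target.

```python
# A minimal sketch of Option 1: threshold control via simple IF logic.
# TARGET_TEMP comes from the 173-degree example; HYSTERESIS is an assumption.

TARGET_TEMP = 173.0
HYSTERESIS = 0.5  # small dead band so the coil doesn't chatter on/off


def control_step(current_temp, coil_is_on):
    """Return the desired heating-coil state for one control cycle."""
    if current_temp < TARGET_TEMP - HYSTERESIS:
        return True   # too cold: turn the coil on
    if current_temp > TARGET_TEMP + HYSTERESIS:
        return False  # too hot: turn the coil off
    return coil_is_on  # inside the dead band: leave it alone


# Example: decide the coil state for a few readings, starting with coil off
states = [control_step(t, False) for t in (170.0, 173.0, 176.0)]
```

In practice this function would be wired between whatever read-sensor and set-device calls your IoT platform exposes.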

Option 2: Formulaic

You want a model when you are dealing with characteristics that you can’t directly control. For example, say you are looking at the temperature at the top of the column head. There isn’t a direct device that controls it; the temperature is a result of what is happening below it. You may have a few items that influence it: chiller temperature, chiller pump speed, and pot temperature. With 3 items, you would probably want to do about 9 runs where you keep 2 of the 3 variables constant and vary the third. Since these items probably have a nearly linear relationship, you should be able to figure out something like “for each 1 degree of pot temperature increase, the column head changes 0.75 of a degree.” These are simple formulaic models, usually valid only in the range that matters (i.e. that relationship of pot temp to column head temp is only true between 160 and 180 degrees – you don’t care if the pot is at 100 degrees).
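The nine-run exercise above boils down to fitting a line. Here is a sketch with invented run data (chiller temperature and pump speed held constant), deliberately constructed around the 0.75-degrees-per-degree example:

```python
# A sketch of Option 2: fit a linear relationship from a handful of
# controlled runs. The run data below is made up for illustration.

# (pot_temp, column_head_temp), chiller temp and pump speed held constant
runs = [(160, 131.0), (165, 134.75), (170, 138.5), (175, 142.25), (180, 146.0)]

n = len(runs)
mean_x = sum(x for x, _ in runs) / n
mean_y = sum(y for _, y in runs) / n

# Ordinary least-squares slope: head-temp change per degree of pot temp
slope = sum((x - mean_x) * (y - mean_y) for x, y in runs) / \
        sum((x - mean_x) ** 2 for x, _ in runs)
intercept = mean_y - slope * mean_x


def predict_head_temp(pot_temp):
    """Only valid in the range we actually tested (160-180 degrees)."""
    return slope * pot_temp + intercept
```

With these numbers the fitted slope is 0.75, matching the “0.75 of a degree” rule of thumb in the text.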

Option 3: Machine Learning / Data Science

Where you would want to apply machine learning is when we are measuring the outcome (i.e. pct. alcohol) and we want to know which controllable settings (temp, pump speed, etc.) work best in combination with the uncontrollable conditions (outside temp, air pressure, etc.). The key difference is that we are no longer trying to infer a few simple rules, as in the programmatic and formulaic options.

We first need to gather “what controllable settings in history, combined with the uncontrollable conditions at the time, had what output.” We then build a model that predicts the pct. alcohol for a given combination of both controllable and uncontrollable inputs. This model will be able to predict never-before-seen results based on the historic results and characteristics. This is the first step – a model that, given a set of our environmental measurements, can predict the pct. alcohol.
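A sketch of that first model, using invented run data and a deliberately simple nearest-neighbor lookup in place of a real training library (a real project would use a proper regression algorithm):

```python
# A toy sketch of the first model: predict pct. alcohol from historic runs.
# All numbers are invented; k-nearest-neighbor keeps the idea visible.

# Each row: (pot_temp, pump_speed, outside_temp, pct_alcohol)
history = [
    (170, 40, 20, 0.38),
    (172, 45, 22, 0.40),
    (174, 50, 18, 0.43),
    (176, 55, 25, 0.41),
    (178, 60, 21, 0.45),
]


def predict_pct_alcohol(pot_temp, pump_speed, outside_temp, k=2):
    """Average the outcomes of the k most similar historic runs."""
    def dist(row):
        return ((row[0] - pot_temp) ** 2 +
                (row[1] - pump_speed) ** 2 +
                (row[2] - outside_temp) ** 2) ** 0.5
    nearest = sorted(history, key=dist)[:k]
    return sum(row[3] for row in nearest) / k
```

The point is the shape of the thing: historic inputs plus historic outputs in, a prediction for a never-before-seen combination out.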

Reinforcement Learning / Optimization

The challenge here is that we are trying to control multiple controllable variables to account for the multiple uncontrollable variables. Most “supervised” models are only capable of predicting a single target variable. What is needed is a second model, one that will query the first model with options. This is a variation on “reinforcement learning” where the first model is temporarily frozen while the second model performs an optimization on the answers given by the first model. Optimization algorithms are often the forgotten sibling in the “supervised” vs. “unsupervised” model discussion.

Basically, you have a second model running an optimization algorithm that feeds variations of the controllable variables to the first model (while holding the uncontrollable variables constant). The first model predicts the resulting pct. alcohol. The second model loops through combinations until an optimal result is returned. The best combination of controllable values is then sent to the appropriate controller.
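A sketch of that optimization loop. Here a stand-in function plays the role of the frozen first model (its sweet spot at pot temp 175 and pump speed 50 is invented); a simple grid search plays the role of the optimizer:

```python
# A sketch of the second model: an optimizer that repeatedly queries the
# frozen prediction model, holding the uncontrollable inputs constant.

def frozen_model(pot_temp, pump_speed, outside_temp):
    """Stand-in prediction of pct. alcohol; peaks at pot=175, pump=50."""
    return (0.45
            - 0.0004 * (pot_temp - 175) ** 2
            - 0.0001 * (pump_speed - 50) ** 2)


def optimize_controllables(outside_temp):
    """Grid-search the controllable settings for the best predicted output."""
    best = None
    for pot_temp in range(160, 181):          # controllable
        for pump_speed in range(30, 71, 5):   # controllable
            pred = frozen_model(pot_temp, pump_speed, outside_temp)
            if best is None or pred > best[0]:
                best = (pred, pot_temp, pump_speed)
    return best  # (predicted pct. alcohol, pot temp, pump speed)


best_pred, best_pot, best_pump = optimize_controllables(outside_temp=22)
```

Real optimizers would be smarter than a full grid sweep, but the loop structure – propose settings, query the frozen model, keep the best – is the same.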

You also want to capture ongoing data even as the system moves toward the optimal settings chosen by the second model, and you want to periodically retrain the first model to incorporate the additional data.

The other nice thing is that you can build the controllers for solution 1 (programmatic), set a few guidelines with solution 2 (formulaic), and thereby capture enough data to feed solution 3 (two models – prediction interrogated by optimization). The optimization model then feeds its results back to solution 1 as the “target” for the programmatic control of the devices.

I hope this explanation helps others facing a similar challenge. If you have questions on anything here, don’t hesitate to reach out to me.

ETL Basics

We often think of the foundational skills of data science as Data Manipulation and Management (SQL), Graphing/Charting, Modeling, and Development (R/Python). But as the toolset changes, so do the skills needed. With the emergence of data science platforms like Dataiku, the development skillset has become less important. Similarly, these tools have nearly removed the need for a separate data engineering stage of the project. As a result, the process of building data models that follow ETL (Extract-Transform-Load) principles becomes a more important skill set for the data scientist.

I’ll have a more detailed post coming shortly on specific guidelines of ETL, but for now, let’s focus on what the overall picture of an ETL should look like.

Master Data and Transactional Data

First, we need to talk about two kinds of tables: master data and transactional. Usually, master data describes the objects the database cares about. Frequently this data is not version controlled; it is simply “as-is”. An example of this would be a “customer”, and within the customer master data could be the “sales plan” the customer is on. If you go into the transactional system and look up the customer, you might see a drop-down list of the possible options. In our imaginary system, we’ll say that the customer can be on the “free tier”, “silver”, or “gold” sales plan.

The second type of table, the transactional, is generally the transactions between the master data elements. For example, if you have a customer, you might have a table that is the list of all of their orders. Usually in a normalized database, the customer listed on the order table is just the customer number or some other unique identifier. There is also probably a subsequent table that joins the orders with the products sold. “Products” would be another master data table.

Basic Recommendations for an ETL

ETLs frequently de-normalize the database from multiple tables into one large table. This process is similar to the steps used to build reports and data warehouses. If you have access to someone who has built (good) reporting systems, look to tap their expertise. Otherwise, here are my “basics”:

  • Don’t do everything in Python/R – when you are doing data manipulation, use a database tool
  • Run the extract in the memory of the source database
  • Limit the transaction extraction with a variable, pulling only records newer than the start date/time of the last successful run
  • Transfer to the destination database prior to joins
  • Execute joins/transformations in the destination database
  • Build any aggregations needed to speed performance
  • Build appropriate checks into the process. Every ETL will eventually fail. By setting up the last-successful-run variable, the ETL should be set up so that it will automatically catch up. A full re-build then just means setting the date/time of the last good run to the date/time of the first transaction.
  • Build the join tables assuming you will re-use them. Aggregations and transformations may vary as time goes by, so do those in a separate step.
  • Unless you are working with very large data sets, use as many intermediate tables on your destination database as you need. You’ll get better performance and the extra storage space is no big deal.
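To make the incremental-extract bullet concrete, here is a sketch using Python’s sqlite3 as a stand-in source database. The table, columns, and timestamps are invented; the point is letting the database filter on the last-successful-run variable rather than pulling everything:

```python
# A sketch of an incremental extract: pull only transactions newer than
# the start of the last successful run.
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for the real source database
con.execute("CREATE TABLE orders (id INTEGER, amount REAL, created_at TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2024-01-01 09:00:00"),
    (2, 20.0, "2024-01-02 09:00:00"),
    (3, 30.0, "2024-01-03 09:00:00"),
])

# Stored by the ETL after its last successful run; rewind it to rebuild.
last_successful_run = "2024-01-01 12:00:00"

# Let the source database do the filtering, not Python.
rows = con.execute(
    "SELECT id, amount FROM orders WHERE created_at > ?",
    (last_successful_run,),
).fetchall()
```

A full rebuild is then just resetting `last_successful_run` to before the first transaction, exactly as the bullet describes.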

The Challenge of Turning a Master Data table into a Transactional Table

One last bit of advice, for when your master data does not contain its change history and you need that history for analysis. Besides taking a baseball bat up to the developers and demanding they re-design the transactional system to benefit you, you’ll need to build a compare function that compares a snapshot to the current master data table. Then capture the deltas between the old table and the current table and append those to a running table of transactions. Use the date that you execute the load as the date inside the transaction table. There is one weakness to this approach which cannot be conveniently worked around: if you skip over any days of processing, then those days of data are “lost”. I’m sorry, but they just are. There is no way to re-create history because all you have is a picture of it before and after. Without data from in between, you are simply at a loss.
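A sketch of that snapshot-compare step, with plain dicts standing in for the master data tables and the customer/sales-plan fields borrowed from the earlier example:

```python
# A sketch of turning master data into transactions: diff yesterday's
# snapshot against today's table and append the deltas, dated by load date.
from datetime import date

snapshot = {"C001": "free tier", "C002": "silver", "C003": "gold"}  # old
current  = {"C001": "silver",    "C002": "silver", "C003": "gold"}  # today


def capture_deltas(snapshot, current, load_date):
    """One transaction row per changed value, dated with the load date."""
    deltas = []
    for key, new_value in current.items():
        old_value = snapshot.get(key)
        if old_value != new_value:
            deltas.append((key, old_value, new_value, load_date))
    return deltas


transactions = capture_deltas(snapshot, current, date(2024, 1, 15))
```

Note the weakness described above is visible here: if C001 changed twice between snapshots, only the net change survives.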

As I mentioned, the process for building ETLs has been around for quite some time. If you come to data science from a background of building them, you have a leg up. Otherwise, you’ll want to incorporate good ETL practices into your process of building models.

What Factors went into the Model?

What Factors?

One of the questions I often get asked about a data science prediction model I’ve built is some version of “what are the key characteristics of this model?” Given that many models (e.g. neural networks) don’t show their characteristics, and some that do (e.g. random forests) can be hard to follow, I’m often placed in the awkward situation of saying:

“We looked at them as we built the model, but now we just accept the model.”

This is, understandably, often thought of as an inadequate answer by someone not familiar with data science. However, let me show you how the answer I give is both complete and correct. And to do that, I’ll start with the results.

The Results of the Model

Another point of confusion about data science is “what is the result of the model?” The short answer is a table. The slightly longer answer is a table which has each record that needed to be predicted, the characteristics used by the model, and a prediction. That table can have one row (a single prediction) or multiple rows (a batch of predictions). Whatever process could have been done to the record can now be done to the record and its prediction (send an e-mail to those customers, direct the predicted ‘yes’ results to a promotional web page, apply a discount, etc.). Also, the results can be directed to graphs/charts.

For many applications, charts should be built to allow “human pattern recognition” to see high-level information about the predictions. The key in these charts is to look at the predictions in comparison to the overall population. The reason is one of the truisms of data science predictions:

The point of most data science predictions is to try to make the prediction wrong.

You predict what customers are likely to churn in order to prevent those customers from churning. You predict what cases are likely to not be delivered on time in order to somehow get those cases there on time. You predict what customers would also like but haven’t yet bought so that they will now buy that product. You predict so that you can take steps to “beat the house.”

So, how do you beat the house? By making smart and creative guesses based on patterns you see in the charts based on the differences between the predicted and non-predicted population. And here is the big insight on how to beat the prediction:

It doesn’t matter whether the machine learning used the characteristic in building the model; if you see the pattern and think there is a way to use it to beat the house, you should use it. To beat the model, you look at the results of the model, not how the model was built.

As It is Made

As the model is being built, however, there should be frequent and regular checks with models that are easier to interpret. This is important for multiple reasons. The first reason is that it is important to confirm that the data feeding the model isn’t contaminated with the result it needs to predict. Let’s say, for example, you are building a model to predict the winning team of basketball games. If the historical data you are using to train the model contains a field for “point differential of game”, you have a problem. You don’t know the value of that field until after the game is played; it is a product of the outcome you are looking to predict, not something you would know before the game. Sometimes it isn’t obvious that a field is related to the outcome until the algorithm picks up the characteristic as being highly predictive. Trust me from experience on this one: if you have data that is contaminated like the point differential example, the algorithm will find that characteristic and find it quickly.
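A sketch of one such check, flagging any field that correlates suspiciously well with the outcome. The basketball rows are fabricated; only the point-differential idea comes from the text, and the 0.9 threshold is an arbitrary choice:

```python
# A sketch of a leakage sanity check: any field that correlates almost
# perfectly with the outcome deserves suspicion before training.

games = [
    # (won, point_diff, home_game, avg_points) -- invented data
    (1, 10, 1, 78), (1, 8, 0, 72), (0, -9, 1, 75),
    (0, -11, 0, 74), (1, 12, 1, 85), (0, -10, 0, 70),
]


def correlation(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


outcome = [g[0] for g in games]
suspicious = [name for name, col in
              [("point_diff", 1), ("home_game", 2), ("avg_points", 3)]
              if abs(correlation([g[col] for g in games], outcome)) > 0.9]
```

On this toy data, only "point_diff" trips the threshold, which is exactly the field you can’t know before tip-off.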

The second reason to look at the characteristics used as you are building the model relates to the whole “causal vs. correlation” game on which data science is based. A field could be highly predictive because it relates to some root cause (causal) or it could simply be a coincidence (correlation). This is where subject matter expertise (SME) is critical to building a model. As the model is being built, an SME needs to look at some of the key characteristics picked by a more transparent algorithm. The SME should be able to tell a story based on those characteristics, and it should make sense to them. Some examples would be “I see how customers who contacted customer service more than 3 times in a year and are high dollar customers would be more likely to leave – high pay plus problems equals disgruntled customers leaving – makes sense” or “shipments going over the Rocky Mountains in winter are more likely to have weather delays – mountain roads with snow – check.”

Allow for the Unexplainable

Once the more explainable algorithm has run, the data scientist should then try some of the less explainable algorithms. Here is the next key: if one algorithm has found a certain characteristic to be predictive, other algorithms are highly likely to use the same characteristic. Most algorithms use a combination of characteristics, but how those are selected and put together can make a difference. Usually it isn’t a large difference, but if an explainable model is 70% accurate and a difficult-to-explain model is 78% accurate, you usually go with the more accurate but harder-to-explain model. The assumption is that both models share a large similarity; the harder-to-explain model has simply found some additional connections in the data.

And that gets me back to my answer “we looked at them as we built the model, but now we just accept the model.”

The Basic Steps of a Data Science Project

One of the barriers that many people and companies have to doing data science is simply not knowing the basic steps of a data science project. So, let’s talk about the basic steps to do a data science project by talking through an example. For the example I’ll use the model that I use when I teach data science modeling. The basic steps are: Baseline, ratios, outliers, group, predict, and validate.

The Goal/Challenge

The model for this example is an “Equity Picker” model. The goal is to see if, by looking at information published in companies’ annual reports, we can predict whether a stock will increase its equity value. Notes on the details of the calculations, and on why equity value, are for a later discussion. Our goal is to pick a group of stocks that grow in a desired range. We aren’t trying to predict every stock’s exact growth, only to filter down to those that we might want to buy.

So, we have a table (think of a big excel spreadsheet) that contains several years of history of stocks. The first column is a simple 1 or 0 to indicate if the stock grew in equity in the range we find acceptable (between 5% and 50%). The rest of the columns are characteristics. This is the information from the annual reports. The characteristics range from ACCOCI (Accumulated Other Comprehensive Income) to TBVPS (Tangible Assets Book Value per Share). There are approximately 100 or so different columns.


Before we begin, we need to get a baseline. What is the percentage of stocks that match our desired equity profile naturally? Simple answer: 26% of the stocks. That sets our baseline for predictions. If I randomly grab a stock, it has a 26% chance of being in the desired group. I want to do much better than that.


Our first step in predicting is to add in some ratios. I won’t go into details as to why, but let’s just say that frequently ratios are better than raw values for building predictions. From working with the data, and trying out several ratios, the best ratio for this example is per equity. After adding a column for each value divided by equity, we now have slightly less than 200 or so different columns. We don’t put existing ratios (like the TBVPS) into another ratio, so not all columns get a “per equity” column as well.
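As a sketch of the per-equity step (the column names and values below are invented, and existing ratios like TBVPS are skipped, as described above):

```python
# A sketch of the ratio step: add a "per equity" column for each raw value.

rows = [
    {"equity": 500.0, "revenue": 1200.0, "net_income": 90.0, "tbvps": 14.2},
    {"equity": 250.0, "revenue": 400.0,  "net_income": 20.0, "tbvps": 6.1},
]
already_ratios = {"tbvps"}  # don't put an existing ratio into another ratio

for row in rows:
    for col in list(row):  # snapshot the keys before adding new ones
        if col != "equity" and col not in already_ratios:
            row[col + "_per_equity"] = row[col] / row["equity"]
```

With ~100 raw columns, this is how you end up with the “slightly less than 200” columns mentioned above.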


Our second step is to remove outliers. Almost all data contains some outliers, and outliers are hard to predict. Since I’m trying to predict just “winners”, hard to predict stocks can be ignored. Like a good Omaha citizen, I love me some Warren B. However, dog, your stock is wack. Berkshire Hathaway stock price is about 62 times greater than any other stock. The rest of the ratios for BRK-A are similarly goofy. BRK-A is a great stock, but it makes for bad prediction data. Not only does it not predict well, but because some of its values are so much larger than other stocks, it skews the curve. So, big outliers like BRK-A are removed.
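A sketch of that outlier cut. The tickers and prices are invented stand-ins (apart from the BRK-A story), and the “50 times larger” cutoff is an arbitrary choice for illustration:

```python
# A sketch of the outlier step: drop any stock whose value dwarfs the
# rest of the population.

prices = {"BRK-A": 540000.0, "AAA": 120.0, "BBB": 85.0,
          "CCC": 310.0, "DDD": 47.0}


def drop_extreme(values, factor=50):
    """Remove entries more than `factor` times larger than the next largest."""
    keep = {}
    for name, v in values.items():
        others_max = max(x for n, x in values.items() if n != name)
        if v <= factor * others_max:
            keep[name] = v
    return keep


cleaned = drop_extreme(prices)
```

Only the BRK-A-style extreme gets removed; ordinary spread between stocks survives.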


By looking at the data, we can tell that stocks that post a negative income have only a 13% chance of posting an increase in equity the following year. Now, this sub-group may be highly predictable (it isn’t), but it is probably a bad place to look for gold. Again, I’m looking to predict winners, not to predict every stock. So simply putting all of the negative-net-income stocks into a separate table and ignoring them for now works fine. After a few more cuts of data (high-asset stocks and stocks with largely non-US data both get pushed to the side) we have a smaller group with a higher chance of success. This group has a 34% random chance of picking a winner. Already, we are improving our odds.

The thing about companies is that people naturally group them into segments like “technology”, “services”, and “consumer goods”. I’m not saying these aren’t great groupings, I’m just saying that within the “services” grouping, Disney might not share the same characteristics that are predictive as Union Pacific. Put another way, groups were put there for human ease, not for improving our data science predictions (although the human ease groups can sometimes help predictions). Our fourth step then becomes, let machine learning pick the groupings. After some back and forth working and tweaking models, an algorithm is chosen that breaks the remaining stocks up into 4 distinct groups.
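A sketch of letting the algorithm pick the groupings: a tiny k-means on two made-up ratio columns. Real work would use a library, many more features, and in this example’s case 4 groups rather than 2; this just shows the mechanics of machine-chosen segments:

```python
# A sketch of machine-chosen groupings: a minimal k-means clustering.

stocks = [(0.1, 0.2), (0.15, 0.25), (0.12, 0.22),   # one natural cluster
          (0.8, 0.9), (0.85, 0.95), (0.82, 0.88)]   # another


def kmeans(points, centroids, iters=10):
    """Assign points to nearest centroid, move centroids, repeat."""
    groups = []
    for _ in range(iters):
        # assign each point to its nearest centroid
        groups = [[] for _ in centroids]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            groups[d.index(min(d))].append(p)
        # move each centroid to the mean of its group
        centroids = [tuple(sum(vals) / len(g) for vals in zip(*g))
                     for g in groups if g]
    return groups


groups = kmeans(stocks, centroids=[(0.0, 0.0), (1.0, 1.0)])
```

The algorithm never heard of “technology” vs. “services”; it groups purely on the numbers, which is exactly the point.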


The fifth step then becomes to see if any of the groups manually chosen or groups chosen by machine learning are highly predictable. It turns out two of the groups are. One is precise to 68% and the other to 67%. Building the model has turned a losing bet (27%) into a winning bet (67%)!


The last and final step before putting my money where my blog post is, is to do a large amount of double checking. I know I’m not perfect and it is important to make sure that each step of the preparation process is accurate. Also, it can be helpful to slice the data in different sections to make sure the model is truly predictive and there isn’t, for example, a one-year surge that overwhelmed the rest of the data. Also, also, while many algorithms create “blind” predictions, it is important to run at least one model that shows which characteristics feature more prominently in the model. The algorithm can find non-causal coincidences and I don’t like to invest in those.


So, our steps again are: Getting the data and creating a baseline; enhance with thoughtful ratios; remove outliers; if needed, parse out sub-groups to do predictions on; predict on sub-groups; confirm predictions.

It is “that easy” to get data science results. Ok, it isn’t really that simple, but now that you know the process, you can begin to take steps to leverage this in your own business.

For the record, the class I teach this in is the Omaha Data Science Academy. Feel free to enroll. Also, when I teach the class and build the model, I use Dataiku. In my opinion, it is the fastest and best way to do data science.

For a more comprehensive look at data science, please see my book Leading a Data Driven Organization.

If you have questions or comments on the article or about the model, please reach me on LinkedIn.

Leading a Data Driven Organization

Cabri Group is proud to announce:

Leading a Data Driven Organization: A Practical Guide to Transforming Yourself and Your Organization to Win the Data Science Revolution (Kindle Edition)

Leading a Data Driven Organization – A Practical Guide to Transforming Yourself and Your Organization to Win the Data Science Revolution — is a book designed for all levels of leaders within an organization. Filled with stories and real-world examples, the book walks through the concepts of data science with an eye for what leaders need to know. From “What Data Science Does” to “Organizational Challenges,” the book makes sure you “Know if the Answers are Right.” With sections titled, “How Data Science is Done” and “Time is Not Your Friend” you get straight talk about business challenges and opportunities. Whether you are faced with your first data science project or you are just thinking about how to use the latest data science tools for your company’s benefit, this book is for you.

The lessons of Leading a Data Driven Organization apply to anyone from a c-suite member to a first-time manager. One of its core premises is that, as far as data science goes, you either “get it” or you “get replaced.” Read this book if you want to be in the former group.

The author, Gordon Summers, is a data science consultant who has helped companies from large Fortune 500 corporations to small not-for-profit companies. He also teaches data science and has helped many leaders understand and implement data science within their organizations. In this book, the author’s casual style helps demystify an emerging technology with no difficult formulas to memorize.

You’ve Probably Been Doing Random Wrong

The random function rand() appears in many programming languages, everything from Microsoft Excel to R to SQL. This isn’t to talk about the quality of random number generators; this is to talk about the distribution of random numbers. (Although if you are curious, here is a very interesting video talking about random number generators in C++.)

Normally (*) the random numbers are distributed evenly throughout the range. However, for most situations, I would say that the data should follow Benford’s law (wiki link for those curious). If you are replicating “natural” processes through a system, or if, like me, you are randomly selecting from a ranked list, you want to follow a distribution like Benford’s law.

I’m a big fan of Benford’s law as a quick and dirty way to check numbers as they come into a project. A practical first check of the frequency spread of the first digit can tell you right away if the data looks good. Many data sets in the wild don’t exactly follow the law; Wikipedia gives good examples of where it won’t apply.
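A sketch of that first-digit check (the sample values are invented; Benford’s expected frequency for digit d is log10(1 + 1/d)):

```python
# A quick-and-dirty Benford check: compare the first-digit frequencies of
# incoming numbers against Benford's expected distribution.
import math


def leading_digit(v):
    """First significant digit of a nonzero number."""
    v = abs(v)
    while v >= 10:
        v /= 10
    while v < 1:
        v *= 10
    return int(v)


def first_digit_freq(values):
    """Observed frequency of each leading digit 1-9."""
    counts = {d: 0 for d in range(1, 10)}
    for v in values:
        counts[leading_digit(v)] += 1
    return {d: c / len(values) for d, c in counts.items()}


# Benford's law: P(d) = log10(1 + 1/d); P(1) is about 30.1%
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
```

Comparing `first_digit_freq(incoming_values)` against `benford` gives the quick sanity check described above.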

Some programming languages do allow you to change the distribution pattern with a function. There is, however, a quick and easy way to change any basic rand() function into a better distribution:

(10^rand()) / 10

Back to my example where I am pulling randomly from a ranked list. By using this quick and dirty adjustment to the distribution, I’m able to pull the lower-ranked entries at a frequency more in line with what their usage will be. This better simulates what will really happen while still choosing at random.


(*) – yeah that is punny

Second Weekend Results Recap

The scoreboard for our Machine Learning lower seed winning project:

10 wins out of 20 (50%)

$605 profit on $2,000 worth of simulated bets (30% ROI)

Our Machine Learning lower seed winning project was looking to predict, as accurately as we could, a lower-seeded team winning in the NCAA tournament. Our goals were to get 47% of the picks right and a mere 10% ROI. We beat both of those goals. We practically doubled the 26% baseline of the historic lower-seed winning rate and more than doubled the lower-seed winning percentage in this tournament (21%). We never expected to predict 100% of the upsets; still, 10 of the 13 lower-seed wins were predicted by us.

Our other goal was to show with a simple demonstration how Machine Learning can drive results for almost any business problem. With that in mind, let’s recap what we did and how we did it.

Our Machine Learning Algorithm, which we call Evolutionary Analysis, looked at a comparison of 207 different measures of college basketball teams and their results in prior tournaments. It selected ranges of those 207 measures that best matched up with historic wins by lower seeded teams. We then confirmed that the range was predictive by testing the selected ranges against a “clean” historic data set. This comparison is how we got our goal percent and ROI.

Then we did what any good business person does, we acted. We published our forecasts before each round was played and our results above speak for themselves.

Machine Learning is the power to find patterns in data where previous analysis has found none. Our methodology assured that what we were seeing was predictive. No doubt luck is involved (Wisconsin vs. Florida or many of the other times it came down to the final minute). The overall success, however, speaks for itself.

That is the formula for success on any Machine Learning project: a data set with a large number of characteristics, a measure of success, the expertise to execute an effective project, and the courage to succeed. If this sounds like something that your business could use, please contact Gordon Summers of Cabri Group or Nate Watson of CAN today.

And for those who are curious. The algorithm indicates that Oregon vs. North Carolina matches the criteria for an upset.

Here are our collected picks (dollar sign indicates correct pick):

East Tennessee St. over Florida
$ Xavier over Maryland
Vermont over Purdue
Florida Gulf Coast over Florida St.
Nevada over Iowa St.
$ Rhode Island over Creighton
$ Wichita St. over Dayton
$ USC over SMU
$ Wisconsin over Villanova
$ Xavier over Florida St.
Rhode Island over Oregon
Middle Tennessee over Butler
Wichita St. over Kentucky
Wisconsin over Florida
$ South Carolina over Baylor
$ Xavier over Arizona
Purdue over Kansas
Butler over North Carolina
$ South Carolina over Florida
$ Oregon over Kansas

First Weekend Success

After the first weekend of basketball, our Machine Learning prediction of lower-seed wins has results.

We had two measures of success: we wanted 47% of our picks to be lower-seed wins, and we wanted to track the dollar value of virtually betting $100 on each game. By both measures, we had success: we correctly picked 6 upsets out of the 13 games we chose (46%) and we had a virtual profit of $59 on $1,300, a 5% ROI.

Overall there were 10 instances where the lower seed won in the first two rounds. This year is on track for fewer lower seeds winning (22%) than the historic rate (26%). So even with “tough headwinds” we still met our expectations.*

Someone asked me over the weekend about one of the upsets: “how come you didn’t have Middle Tennessee?” The answer is simple: it didn’t fit the criteria we had. Games that match our criteria represent the largest historic concentration of lower-seed wins. Lower seeds outside the criteria can still win; our criteria are simply the most predictive of a lower seed winning.

Besides some really, really close calls, we had several games where the lower seed had a good chance of winning and simply lost. Our play was to choose games that matched the criteria and spread the risk over several probable winners. This wasn’t about picking the only upsets or all of the upsets; this was about picking a set of games that had the highest probability of the lower seed winning. And by our measures of success, we achieved our goal.

The Machine Learning algorithm did as we expected: it identified a set of characteristics from historic data that was predictive of future results. The implications for any business are clear: if you have historic data and you leverage this type of expertise, you can predict the future.

For the next round, we have 5 games that match our criteria:
Wisconsin over Florida
South Carolina over Baylor
Xavier over Arizona
Purdue over Kansas
Butler over North Carolina

If any games match our predictive criteria in the next round, we’ll post them Saturday before tip off.

If you want to see how this can relate to your business, contact Gordon Summers of Cabri Group or Nate Watson of CAN.
* Don’t even start with the 1% difference between actual of 46% and the target of 47%

Round 2 Predictions

Our target overall is to pick 47% upsets. So far we are at 50%.

Our Round 2 picks are:

Wisconsin over Villanova
Xavier over Florida St.
Rhode Island over Oregon
Middle Tennessee over Butler
Wichita St. over Kentucky

We’ll do a review on Monday 3/20 of the first and second round.