Second Weekend Results Recap

The scoreboard for our Machine Learning lower seed winning project:

10 wins out of 20 (50%)

$605 profit on $2,000 worth of simulated bets (30% ROI)

Our Machine Learning lower seed winning project set out to predict, as accurately as we could, lower seeded teams winning in the NCAA tournament. Our goals were to get 47% of our picks right and to earn a modest 10% ROI. We beat both goals. We nearly doubled the 26% historic rate of lower seeds winning, and more than doubled this tournament's lower seed winning rate (21%). We never expected to predict 100% of the upsets; even so, we predicted 10 of the 13 lower seed wins.

Our other goal was to show with a simple demonstration how Machine Learning can drive results for almost any business problem. With that in mind, let’s recap what we did and how we did it.

Our Machine Learning algorithm, which we call Evolutionary Analysis, compared 207 different measures of college basketball teams against their results in prior tournaments. It selected the ranges of those 207 measures that best matched up with historic wins by lower seeded teams. We then confirmed that the selection was predictive by testing the selected ranges against a "clean" historic data set; that test is where our goal percentage and ROI came from.

Then we did what any good business person does, we acted. We published our forecasts before each round was played and our results above speak for themselves.

Machine Learning is the power to find patterns in data where previous analysis has found none. Our methodology assured that what we were seeing was predictive. No doubt luck is involved (Wisconsin vs. Florida, or the many other games that came down to the final minute). The overall success, however, speaks for itself.

That is the formula for success on any Machine Learning project: A data set with a large number of characteristics, a measure of success, the expertise to execute an effective project and the courage to succeed. If this sounds like something that your business could use, please contact Gordon Summers of Cabri Group (Gordon.Summers@CabriGroup.com) or Nate Watson of CAN (Nate@CanWorkSmart.com) today.

And for those who are curious: the algorithm indicates that Oregon vs. North Carolina matches the criteria for an upset.

Here are our collected picks (dollar sign indicates correct pick):

East Tennessee St. over Florida
$ Xavier over Maryland
Vermont over Purdue
Florida Gulf Coast over Florida St.
Nevada over Iowa St.
$ Rhode Island over Creighton
$ Wichita St. over Dayton
$ USC over SMU
$ Wisconsin over Villanova
$ Xavier over Florida St.
Rhode Island over Oregon
Middle Tennessee over Butler
Wichita St. over Kentucky
Wisconsin over Florida
$ South Carolina over Baylor
$ Xavier over Arizona
Purdue over Kansas
Butler over North Carolina
$ South Carolina over Florida
$ Oregon over Kansas

First Weekend Success

After the first weekend of basketball, our Machine Learning prediction of lower seed wins has results.

We had two measures of success: we wanted a lower seed to win in 47% of the picks we made, and we wanted to track the dollar value of virtually betting $100 on each game. By both measures, we succeeded: we correctly picked 6 upsets out of the 13 games we chose (46%), and we had a virtual profit of $59 on $1,300 wagered, about a 5% ROI.
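The virtual-bet bookkeeping above can be sketched as follows. The moneyline odds and results below are hypothetical stand-ins (the post doesn't list the actual lines we scored against); the $100-per-game mechanics are what matters.

```python
# Sketch of the virtual-bet scoring: $100 on each lower seed at its
# moneyline odds. The odds and outcomes here are invented examples.

STAKE = 100

def payout(moneyline, won):
    """Profit/loss for a $100 bet at American moneyline odds."""
    if not won:
        return -STAKE
    # Underdogs carry positive moneylines: +150 pays $150 profit on $100.
    return STAKE * moneyline / 100 if moneyline > 0 else STAKE * 100 / -moneyline

# (hypothetical odds, actual result) for a handful of picks
picks = [(+150, True), (+220, False), (+180, True), (+130, False)]

profit = sum(payout(ml, won) for ml, won in picks)
staked = STAKE * len(picks)
print(f"profit ${profit:.0f} on ${staked} staked, ROI {profit / staked:.1%}")
```

Because underdog moneylines pay better than even money, a filter can be profitable even while winning fewer than half its bets.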

Overall there were 10 instances where the lower seed won in the first two rounds. This year is on track for fewer lower seeds winning (22%) than the historic rate (26%). So even with “tough headwinds” we still met our expectations.*

Someone asked me over the weekend about one of the upsets: "How come you didn't have Middle Tennessee?" The answer is simple: it didn't fit our criteria. Games that match our criteria form the largest historic collection of the lower seed winning. Lower seeds with different characteristics can still win; our criteria are simply the most predictive of a lower seed winning.

Besides some really, really close calls, we had several games where the lower seed had a good chance of winning and simply lost. Our play was to choose games that matched the criteria and spread the risk over several probable winners. This wasn't about picking the only upsets, or all of the upsets; it was about picking the set of games with the highest probability of the lower seed winning. And by our measures of success, we achieved our goal.

The Machine Learning algorithm did as we expected: it identified a set of characteristics from historic data that was predictive of future results. The implication for any business is clear: if you have historic data and leverage this type of expertise, you can predict the future.

For the next round, we have 5 games that match our criteria:
Wisconsin over Florida
South Carolina over Baylor
Xavier over Arizona
Purdue over Kansas
Butler over North Carolina

If any games match our predictive criteria in the next round, we’ll post them Saturday before tip off.

If you want to see how this can relate to your business contact Gordon Summers of Cabri Group (Gordon.Summers@CabriGroup.com) or Nate Watson of CAN (nate@canworksmart.com).
* Don’t even start with the 1% difference between the actual 46% and the target of 47%.

Round 2 Predictions

Our target overall is to pick 47% upsets. So far we are at 50%.

Our Round 2 picks are:

Wisconsin over Villanova
Xavier over Florida St.
Rhode Island over Oregon
Middle Tennessee over Butler
Wichita St. over Kentucky

We’ll do a review on Monday 3/20 of the first and second round.

 

First Round Addition

Since the USC / Providence play-in game was played after we made our earlier predictions, it wasn’t included in our first round games (except with an asterisk).

The model predicts, however, that USC v. SMU fits the profile of an upset, so we are adding that to our list of predictions.

First Round Predictions

Author: Gordon Summers

The Cabri Group / CAN Machine Learning Lower Seed Win Prediction tool has made its first round forecast. Without further ado:

East Tennessee St. (13)  over Florida (4)
Xavier (11) over Maryland (6)
Vermont (13) over Purdue (4)
Florida Gulf Coast (14) over Florida St. (3)
Nevada (12) over Iowa St. (5)
Rhode Island (11) over Creighton (6)
Wichita St. (10) over Dayton (7)

* If the last play-in games add another predicted upset, we’ll update this list prior to tip-off.

One of the obvious observations on the predictions is: “Wait, no 8/9 upsets????” Remember, these games show the characteristics most similar to the largest historic collection of upsets. This doesn’t mean there will be no 8-over-9 upsets, nor that all of the predictions above will hit (remember, we are going for 47%), nor that the favorites will win every game not listed. The games on the list are there because they share the most characteristics with historic games in which the lower seed won.

Also, one of the key team members on this project, Matt, is a big Creighton fan (and grad). He was not happy to see Creighton on the list, so I’ll speak to that one specifically. In the technical notes, I indicated that one of the many criteria being used was Defensive Efficiency (DE). Our Machine Learning algorithm (Evolutionary Analysis) doesn’t like it when there is a large DE gap between the lower seed and the higher seed. Creighton actually has a lower Defensive Efficiency than Rhode Island. Sorry, Matt. Again, it doesn’t mean Creighton won’t win; it only means that the Rhode Island v. Creighton game shares more criteria with the largest collection of historic upsets than the other games in the tournament.

As we indicated, we will use the odds as well as a count of upsets to determine how well we do as the tournament goes on. We’ll have a new set of predictions on Saturday for the next round of the tournament and a recap coming on Monday.

Machine Learning and the NCAA Men’s Basketball Tournament – Methodology

“The past may not be the best predictor of the future, but it is really the only tool we have”

 

Before we delve into the “how” of the methodology, it is important to understand the “what”: a set of characteristics that would indicate that a lower seed will win. We use machine learning to look through a large collection of characteristics and find a result set that maximizes the number of lower seed wins while simultaneously minimizing lower seed losses. We then apply the result set as a filter to new games: those that make it through the filter are predicted as more likely to have the lower seed win. What we end up with is the set of criteria most predictive of a lower seed winning.
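The filter idea can be sketched in a few lines. The feature names and ranges below are invented for illustration; the real result set used roughly ten of the 200+ team-difference measures.

```python
# Minimal sketch of applying a learned "result set" of feature ranges as a
# filter to new games. Feature names and bounds are hypothetical examples.

# Each criterion: feature name -> (low, high) bounds on the
# lower-seed-minus-higher-seed difference for that measure.
result_set = {
    "def_efficiency_diff": (-8.0, float("inf")),
    "turnover_rate_diff": (-3.5, 2.0),
}

def matches(game_features, criteria):
    """True if every selected feature falls inside its learned range."""
    return all(lo <= game_features[name] <= hi
               for name, (lo, hi) in criteria.items())

games = [
    {"def_efficiency_diff": -2.1, "turnover_rate_diff": 0.4},   # inside all ranges
    {"def_efficiency_diff": -12.3, "turnover_rate_diff": 1.0},  # DE gap too large
]
predicted_upsets = [g for g in games if matches(g, result_set)]
print(len(predicted_upsets))  # number of games flagged as likely lower-seed wins
```

Games that fall outside any one range are simply not picked; the filter says nothing about them either way.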

 

This result set is fundamentally different from an approach that tries to determine the results of all new games, where the goal is a universal model that applies to every matchup. There is a level of complexity and ambiguity with a universal model that is another discussion entirely. By focusing on one outcome (lower seed wins), we get a result that is more predictive than attempting to predict all games.

 

This type of predictive result set has great applications in business. What combination of characteristics best predicts a repeat customer? What combination best predicts a more profitable customer? What combination best predicts an on-time delivery? This is different from forecasting demand by combining a demand signal with additional data. Think of it as the difference between a stock picker who picks the stocks most likely to rise and a model that forecasts how far up or down a specific stock will go. The former is key for choosing stocks, the latter for rating stocks you already own.

 

One of the reasons we chose “lower seed wins” is that almost every game played in the NCAA tournament provides a data point. The exceptions are games between identical seeds: the First Four games involve identical seeds, and the Final Four can as well. That still gives us roughly 60 games a year, and the more data we have, the better our predictions get.

 

The second needed ingredient is characteristics, and plenty of them. For our lower seed win model we had more than 200 different characteristics for the years 2012-2015. We used the difference between the two teams’ characteristics as the input; we could have used the absolute characteristics of both teams as well. As the analysis runs, any characteristic that isn’t needed is ignored. What the ML creates is a combination of characteristics. We call our tool “Evolutionary Analysis”: it works by adjusting the combinations in an ever-improving manner to get a result. There is a little more in the logic that allows for other aspects of optimization, but the core of Evolutionary Analysis is finding a result set.
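A toy version of that ever-improving adjustment might look like the following. This is not the actual Evolutionary Analysis tool: the synthetic data, the two features, the planted signal, and the fitness weighting (wins counted double against losses) are all illustrative assumptions.

```python
import random

# Toy evolutionary search: repeatedly mutate the range bounds on a few
# features and keep any mutation that scores at least as well. Everything
# here (data, features, fitness) is a simplified stand-in.

random.seed(0)

def make_game():
    de = random.uniform(-15, 5)     # lower-seed-minus-higher-seed DE gap
    pace = random.uniform(-6, 6)
    # Planted signal for the demo: upsets are likelier when the DE gap is small.
    won = random.random() < (0.45 if de > -5 else 0.10)
    return {"de_diff": de, "pace_diff": pace}, won

history = [make_game() for _ in range(400)]

def fitness(bounds):
    """Score a candidate result set: reward captured wins, punish losses."""
    picked = [won for feats, won in history
              if all(lo <= feats[f] <= hi for f, (lo, hi) in bounds.items())]
    wins = sum(picked)
    return 2 * wins - (len(picked) - wins)

bounds = {"de_diff": [-15.0, 5.0], "pace_diff": [-6.0, 6.0]}  # start wide open
best = fitness(bounds)
for _ in range(2000):
    trial = {k: list(v) for k, v in bounds.items()}
    f = random.choice(list(trial))
    trial[f][random.randrange(2)] += random.uniform(-1.0, 1.0)  # mutate one bound
    score = fitness(trial)
    if score >= best:               # keep neutral-or-better mutations
        bounds, best = trial, score

print(best, bounds)
```

With the planted signal, the search tends to tighten the DE bound toward the region where upsets cluster, which mirrors how unneeded characteristics (here, pace) contribute little and effectively get ignored.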

 

The result set was then used as a filter on 2016 data to confirm that it is predictive. It was possible that the result set from 2012-2015 wouldn’t actually predict 2016 results. In fact, applying our result set as a filter to the 2016 data yielded 47% lower seed wins, versus the overall historic average of 26%; a 47% result could happen randomly only about 3.4% of the time. Our result set is therefore highly likely to be a predictive filter.
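A holdout check like this can be sanity-tested with a plain binomial tail probability: how often would picking at the historic 26% base rate look this good by luck alone? The game count below (7 wins in 15 picks, about 47%) is an assumed illustration, since the post reports only the percentages.

```python
from math import comb

# Probability of seeing at least k lower-seed wins in n picks if each pick
# independently won at the historic 26% base rate. n=15, k=7 are assumed
# illustration values, not the actual 2016 holdout counts.

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_value = binom_tail(7, 15, 0.26)
print(f"{p_value:.3f}")  # chance of doing this well by luck alone
```

The smaller this tail probability, the stronger the evidence that the filter is genuinely predictive rather than a lucky artifact of the training years.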

 

The last step in the process is to look at the filter criteria that were chosen and check whether they are believable. For example, one of the chosen criteria was Defensive Efficiency Rank. Evolutionary Analysis chose a lower limit of … well, it set a lower limit, let’s just say that. This makes sense: if a lower seed has a defense ranked far inferior to the higher seed’s, it is unlikely to prevail. As a counterexample, blocks per game was not a criterion that was chosen. In fact, most of the 200+ criteria were not used, but the handful of around ten that were set the filter that chooses a population of games more likely to contain a lower seed winning.

 

And that is one of the powerful aspects of this type of analysis: you don’t get one key driver, or even two metrics that happen to correlate. You get a whole set of filters that points to a collection of results that deviates from the “normal”.

 

Please join us as we test our result set this year. We’ll see if we get around 47%. Should be interesting!

 

If you have questions on this type of analysis or machine learning in general, please don’t hesitate to contact Gordon Summers of Cabri Group (Gordon.Summers@CabriGroup.com) or Nate Watson at CAN (nate@canworksmart.com).

 

Disclaimer: Any handicapping sports odds information contained herein is for entertainment purposes only. Neither CAN nor Cabri Group condone using this information to contravene any law or statute; it’s up to you to determine whether gambling is legal in your jurisdiction. This information is not associated with nor is it endorsed by any professional or collegiate league, association or team. Machine Learning can be done by anyone, but is done best with professional guidance.

Machine Learning and the NCAA Men’s Basketball Tournament

Machine Learning (ML) is a powerful technology and many companies rightly guess that they need to begin to leverage ML. Because there are so few successful ML people and projects to learn from, there is a gap between desire and direction. Cabri Group and CAN have teamed up to help. By demonstrating results, we believe more people can give direction to their ML projects. Therefore, we proudly present our ML NCAA lower seed predictions.

Those interested in a fuller description of our analysis methodology can read our accompanying article.

We will be publishing a selection of games in the 2017 NCAA Men’s Basketball Tournament. Our prediction tool estimates games where the lower seed has a better than average chance of winning against the higher seed. We will predict about 16 games from various rounds in the tournament. The historical baseline for lower seeds winning is 26%. Our target will be to get 47% right. Our target is based on the results we would have achieved using our prediction tool for the 2016 tournament. The simulated gambling ROI was 10%.

This analysis isn’t meant to support gambling, but we will keep score with virtual dollars as if we were betting. We will be “betting” on the lower seed to win. We aren’t taking the odds into consideration in our decisions, only using them to help score our results.

We will publish our first games on Wednesday the 15th, after the First Four games are played. We won’t have any selections for those games, since they are played by teams with identical seeds. Prior to each round, we will publish all games that our tool thinks have the best chance of a lower seed winning. We’ll also publish weekly recaps with comments on how our predictions are doing.

The technique finds a group of winners (or losers) in NCAA data and can be used on any metric. We hope this demonstration opens people’s minds to the possibilities of leveraging Machine Learning for their businesses. If you would like more on this type of analysis, please contact Gordon Summers of Cabri Group (Gordon.Summers@CabriGroup.com) or Nate Watson at CAN (nate@canworksmart.com).