The Basic Steps of a Data Science Project

One barrier that keeps many people and companies from doing data science is simply not knowing the basic steps of a data science project. So, let’s walk through those steps using an example. For the example, I’ll use the model that I use when I teach data science modeling. The basic steps are: baseline, ratios, outliers, group, predict, and validate.

The Goal/Challenge

The model for this example is an “Equity Picker” model. The goal is to see whether, by looking at information published in companies’ annual reports, we can predict if a stock will increase its equity value. Note: the details of the calculations, and why I chose equity value, are for a later discussion. Our goal is to pick a group of stocks that grow in a desired range. We aren’t trying to predict every stock’s exact growth, only to filter out those that we might want to buy.

So, we have a table (think of a big Excel spreadsheet) that contains several years of history of stocks. The first column is a simple 1 or 0 to indicate if the stock grew in equity in the range we find acceptable (between 5% and 50%). The rest of the columns are characteristics. This is the information from the annual reports. The characteristics range from ACCOCI (Accumulated Other Comprehensive Income) to TBVPS (Tangible Assets Book Value per Share). There are roughly 100 different columns.


Before we begin, we need to get a baseline. What percentage of stocks match our desired equity profile naturally? Simple answer: 26% of the stocks. This sets our baseline for predictions. If I just grab a stock at random, it has a 26% chance of being in the desired group. I want to do much better than that.
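To make the baseline concrete, here is a minimal sketch in plain Python. The rows, tickers, and the `grew_in_range` column name are made up for illustration; they are not the real table, and the class itself uses Dataiku rather than hand-written code.

```python
# Hypothetical sample rows: 1 means equity grew between 5% and 50%, 0 means it did not.
rows = [
    {"ticker": "AAA", "grew_in_range": 1},
    {"ticker": "BBB", "grew_in_range": 0},
    {"ticker": "CCC", "grew_in_range": 0},
    {"ticker": "DDD", "grew_in_range": 0},
]


def baseline_rate(rows, label="grew_in_range"):
    """Share of rows whose label is 1, i.e. the odds of a random pick winning."""
    if not rows:
        raise ValueError("no rows to score")
    return sum(r[label] for r in rows) / len(rows)


baseline = baseline_rate(rows)
print(f"baseline hit rate: {baseline:.0%}")
```

Every later step gets judged against this one number.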


Our first step in predicting is to add in some ratios. I won’t go into details as to why, but let’s just say that ratios are frequently better than raw values for building predictions. From working with the data and trying out several ratios, the best ratio for this example is per equity. After adding a column for each value divided by equity, we now have slightly fewer than 200 columns. We don’t put existing ratios (like TBVPS) into another ratio, so not every column gets a “per equity” column.
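A rough sketch of the “per equity” idea, again in plain Python. The column names (`equity`, `revenue`, `debt`, `tbvps`) and the `skip` list are assumptions for illustration; the real table has its own names and its own list of columns that are already ratios.

```python
def add_per_equity(row, skip=("ticker", "grew_in_range", "equity", "tbvps")):
    """Return a copy of row with a '<name>_per_equity' column for each raw value.

    Columns listed in skip (identifiers, the label, equity itself, and values
    that are already ratios) do not get a per-equity column.
    """
    out = dict(row)
    equity = row.get("equity")
    for name, value in row.items():
        if name in skip:
            continue
        # Leave the ratio empty when equity is missing or zero.
        out[f"{name}_per_equity"] = value / equity if equity else None
    return out


# Hypothetical row with made-up numbers.
row = {"ticker": "AAA", "grew_in_range": 1, "equity": 200.0,
       "revenue": 500.0, "debt": 100.0, "tbvps": 12.5}
enhanced = add_per_equity(row)
print(enhanced["revenue_per_equity"], enhanced["debt_per_equity"])
```

Note how the column count nearly doubles, but not quite, because the skipped columns get no partner.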


Our second step is to remove outliers. Almost all data contains some outliers, and outliers are hard to predict. Since I’m trying to predict just “winners”, hard-to-predict stocks can be ignored. Like a good Omaha citizen, I love me some Warren B. However, dog, your stock is wack. Berkshire Hathaway’s stock price is about 62 times greater than any other stock’s. The rest of the ratios for BRK-A are similarly goofy. BRK-A is a great stock, but it makes for bad prediction data. Not only does it not predict well, but because some of its values are so much larger than those of other stocks, it skews the curve. So, big outliers like BRK-A are removed.
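One simple way to catch outliers like this is to compare each value to the column median. This is only a sketch of one possible rule, not necessarily the rule used in the model; the `max_multiple` cutoff, the tickers, and the prices are all made up.

```python
from statistics import median


def drop_outliers(rows, column, max_multiple=20.0):
    """Split rows into (kept, removed) using a multiple-of-median rule.

    A row is removed when the absolute value in `column` is more than
    max_multiple times the absolute median of that column.
    """
    typical = median(r[column] for r in rows)
    limit = max_multiple * abs(typical)
    kept = [r for r in rows if abs(r[column]) <= limit]
    removed = [r for r in rows if abs(r[column]) > limit]
    return kept, removed


# Hypothetical prices; "BIG" stands in for a stock priced far above the rest.
prices = [
    {"ticker": "AAA", "price": 40.0},
    {"ticker": "BBB", "price": 55.0},
    {"ticker": "CCC", "price": 120.0},
    {"ticker": "DDD", "price": 310.0},
    {"ticker": "BIG", "price": 400000.0},
]
kept, removed = drop_outliers(prices, "price")
print([r["ticker"] for r in removed])
```

The median is used instead of the average because one giant value drags the average up and can hide the very outlier you are trying to find.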


Our third step is to set aside sub-groups that are unlikely to contain winners. By looking at the data, we can tell that stocks that post a negative income have only a 13% chance of posting an increase in equity the following year. Now, this sub-group may be highly predictable (it isn’t), but it is probably a bad place to look for gold. Again, I’m looking to predict winners, not to predict every stock. So simply putting all of the negative net income stocks into a separate table and ignoring them for now works fine. After a few more cuts of the data (high-asset stocks and stocks with a large amount of non-US data both get pushed to the side), we have a smaller group but with a higher chance of success. A random pick from this group has a 34% chance of being a winner. Already, we are improving our odds.
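Setting a sub-group aside and re-checking the odds can be sketched like this. The `net_income` and `grew_in_range` names and all of the numbers are hypothetical.

```python
def split_by(rows, predicate):
    """Split rows into (keep, aside) depending on whether predicate(row) is true."""
    keep, aside = [], []
    for r in rows:
        (keep if predicate(r) else aside).append(r)
    return keep, aside


def hit_rate(rows, label="grew_in_range"):
    """Share of rows labeled 1; returns 0.0 for an empty list."""
    return sum(r[label] for r in rows) / len(rows) if rows else 0.0


# Hypothetical rows with made-up net income values.
stocks = [
    {"ticker": "AAA", "net_income": 10.0, "grew_in_range": 1},
    {"ticker": "BBB", "net_income": 5.0, "grew_in_range": 0},
    {"ticker": "CCC", "net_income": -3.0, "grew_in_range": 0},
    {"ticker": "DDD", "net_income": 8.0, "grew_in_range": 1},
    {"ticker": "EEE", "net_income": -7.0, "grew_in_range": 0},
    {"ticker": "FFF", "net_income": 2.0, "grew_in_range": 0},
]

# Negative net income stocks go to a separate list rather than being deleted.
candidates, set_aside = split_by(stocks, lambda r: r["net_income"] >= 0)
print(f"all: {hit_rate(stocks):.0%}, candidates: {hit_rate(candidates):.0%}")
```

Keeping the set-aside rows in their own list means they can still be modeled separately later.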

The thing about companies is that people naturally group them into segments like “technology”, “services”, and “consumer goods”. I’m not saying these aren’t great groupings; I’m just saying that within the “services” grouping, Disney might not share the same predictive characteristics as Union Pacific. Put another way, the groups were put there for human ease, not for improving our data science predictions (although the human-ease groups can sometimes help predictions). Our fourth step, then, is to let machine learning pick the groupings. After some back and forth working and tweaking models, an algorithm is chosen that breaks the remaining stocks up into 4 distinct groups.
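The article doesn’t name the grouping algorithm, so purely as an illustration, here is a tiny k-means clustering sketch in plain Python. K-means is an assumption on my part, the points are made up, and the demo uses 2 groups instead of 4 to keep it short.

```python
import math


def kmeans(points, k, iterations=20):
    """Very small k-means: returns one cluster index per point.

    Starts from the first k points as centers so the result is repeatable.
    """
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Hypothetical stocks described by two made-up ratio values each.
points = [(0.1, 0.2), (5.0, 5.1), (0.2, 0.1), (5.2, 4.9), (0.0, 0.3), (4.8, 5.0)]
groups = kmeans(points, k=2)
print(groups)
```

The algorithm only sees the numbers, so the groups it finds have nothing to do with labels like “services” unless those labels happen to line up with the data.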


The fifth step, then, is to see if any of the groups, whether chosen manually or by machine learning, are highly predictable. It turns out two of the groups are. One is precise to 68% and the other to 67%. Building the model has turned a losing bet (26%) into a winning bet (67%)!
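“Precise to 68%” here reads as precision: of the stocks the model picks as winners, the share that really were winners. A minimal sketch of that calculation per group follows; the group names and records are made up, and the article’s 68% and 67% come from the real model, not from this code.

```python
from collections import defaultdict


def precision_by_group(records):
    """records: iterable of (group, predicted, actual) with 0/1 flags.

    Returns, per group, the share of predicted winners that were real winners.
    Groups with no predicted winners are left out.
    """
    picked = defaultdict(int)
    correct = defaultdict(int)
    for group, predicted, actual in records:
        if predicted == 1:
            picked[group] += 1
            correct[group] += actual
    return {g: correct[g] / picked[g] for g in picked}


# Hypothetical predictions for two groups.
records = [
    ("group_a", 1, 1), ("group_a", 1, 1), ("group_a", 1, 0), ("group_a", 0, 0),
    ("group_b", 1, 1), ("group_b", 1, 0), ("group_b", 0, 1),
]
precision = precision_by_group(records)
for group, value in sorted(precision.items()):
    print(f"{group}: {value:.0%}")
```

Precision fits the goal because missed winners don’t cost me money; only the stocks I actually buy do.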


The final step, before putting my money where my blog post is, is to do a large amount of double checking. I know I’m not perfect, and it is important to make sure that each step of the preparation process is accurate. Also, it can be helpful to slice the data into different sections to make sure the model is truly predictive and there isn’t, for example, a one-year surge that overwhelmed the rest of the data. Also, also, while many algorithms create “blind” predictions, it is important to run at least one model that shows which characteristics feature more prominently in the model. The algorithm can find non-causal coincidences, and I don’t like to invest in those.
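The one-year-surge check can be sketched as a per-year comparison against the baseline. The years and outcomes below are made up, and flagging any year that fails to beat the baseline is just one possible rule.

```python
from collections import defaultdict


def weak_years(picks, baseline):
    """picks: iterable of (year, actual) for stocks the model selected.

    Returns the sorted years in which the picks did not beat the baseline.
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for year, actual in picks:
        total[year] += 1
        hits[year] += actual
    return sorted(y for y in total if hits[y] / total[y] <= baseline)


# Hypothetical picks: (year, 1 if the stock really grew in range else 0).
picks = [
    (2015, 1), (2015, 1), (2015, 0),
    (2016, 0), (2016, 0), (2016, 0), (2016, 1),
    (2017, 1), (2017, 1),
]
print(weak_years(picks, baseline=0.26))
```

If most of the good results come from a single year, the overall number looks great while the model is really just riding that one year.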


So, our steps again are: get the data and create a baseline; enhance with thoughtful ratios; remove outliers; if needed, parse out sub-groups to do predictions on; predict on the sub-groups; confirm the predictions.

It is “that easy” to get data science results. Ok, it isn’t really that simple, but now that you know the process, you can begin to take steps to leverage this in your own business.

For the record, the class I teach this in is the Omaha Data Science Academy. Feel free to enroll. Also, when I teach the class and build the model, I use Dataiku. In my opinion, it is the fastest and best way to do data science.

For a more comprehensive look at data science, please see my book Leading a Data Driven Organization.

If you have questions or comments on the article or about the model, please reach me on LinkedIn.