If you are just now getting into data science, I’m sorry to report that you are officially a wave late. The cowboy era is over. Sometimes being a wave late is a good thing, sometimes it is not. In this case, it is a little good and a little bad. Or, like most things in data science, “it depends.”
The Cowboy Era
First, let me describe the era of the cowboy data scientist. This era featured bold thinkers who often came from a different background but who had the drive and curiosity to solve new problems in new ways. These people weren’t necessarily an easy fit within a corporate structure. For example, I worked with a senior data scientist who, despite being paid by the hour, never recorded his time. He indulged his natural disinclination toward paperwork to such an extent that he left thousands of dollars unbilled.
Now, I’m not saying that the cowboy data scientists are or were wrong. I would, in fact, suggest that part of the success of many data science projects over the past several years has been due to larger companies embracing their inner cowboy and, if I can mis-quote Ms. Frizzle, they “took chances, made mistakes, and got messy!” Companies need to continue to have some cowboys going forward. It is just that the days of having them do nothing but data science are past.
What has changed is the introduction of data science platforms, like Dataiku. One of the benefits of these products is that you can easily go from development to production with just a few clicks of a button. Instead of running processes on individual desktops, teams can now collaborate on solutions. With both this ease of productionalization and the increased collaboration comes the key to the end of the cowboy: standards.
The standards are a family affair. If you have been in the IT game for a while, it will seem like a family reunion: naming conventions with their cousin, appropriate documentation; change management standards with their cousin, signoff; and the drunken uncle that is environmental standards.
In data science, what needs a naming convention? What doesn’t? The big ones: project names, table names, and code. All the steps need to follow some naming standard. If you are using intermediate or temporary tables, those should follow a standard too. Within a project, the SQL, the R/Python, and even the command line code segments need to follow a naming standard.
Years ago, I was working on a project and there was a tricky bit of logic that needed to be worked out. The smartest person on the team, Dan, solved the problem with a clever bit of code. The solution became known as the “Dan Logic”. This was bad for Dan, because as long as he was within the company, he would always be the owner of the “Dan Logic”. Even after he had moved 4 steps away from the project team, if there was a question or problem, Dan was always contacted. Dan didn’t want to be contacted, but who else can understand Dan Logic better than Dan? Moral of the story – never let your name be associated with a set of code or logic.
Project code names are fun and can be clever. Who doesn’t want to be involved in “Project Disco Dynamite”? Well, me. I can’t stand disco, but you get the point. Fun and wacky names have their place in projects, but they should be kept out of naming standards. When you look back later, can you tell what is in tables tbl_BoomGoesTheDynamite and tbl_BoomShakaLaka? Keep the naming convention simple and descriptive of the contents.
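To make the naming point concrete, here is a minimal sketch of an automated name check. The convention itself (a layer prefix followed by at least two lowercase words joined by underscores) and the names `ALLOWED_LAYERS` and `follows_convention` are assumptions made up for illustration, not a standard anyone has to adopt.

```python
import re

# Hypothetical convention: <layer>_<subject>_<detail>[_<more>], all lowercase.
# The layer prefixes below are illustrative assumptions.
ALLOWED_LAYERS = ("raw", "tmp", "int", "final")
TABLE_NAME_PATTERN = re.compile(
    r"(?:%s)(?:_[a-z][a-z0-9]*){2,}" % "|".join(ALLOWED_LAYERS)
)


def follows_convention(table_name: str) -> bool:
    """Return True when the whole name matches the hypothetical convention."""
    return TABLE_NAME_PATTERN.fullmatch(table_name) is not None


names = ["tbl_BoomGoesTheDynamite", "tmp_customer_churn_scores"]
violations = [name for name in names if not follows_convention(name)]
print(violations)
```

A check like this could run whenever a table is created or a project is reviewed, so that the wacky names get caught before anyone has to guess what is inside them.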
For appropriate documentation, I’m not just talking about commenting your code. The business owner, the team members, and the project goal all should be documented. Who worked on the project isn’t nearly as important as for whom and for what reason the project was created. Besides, if you have good version control, you already know who worked on what. For data science, this documentation needs to include the data-science-specific questions, like why a certain threshold was chosen or why a certain algorithm was chosen over another. These notes should be kept with the project so that when, not if, there are questions on the project, the notes will be available.
In the cowboy years, there was no difference between development and production. When you got the results you wanted, you implemented them or, just as frequently, your results were only for a presentation of insight. But now, you can have a bit of data science that is intrinsic to daily business. This calls for treating the data science like the enterprise-worthy code that it is. This means keeping the data science team out of the production environment and having a third party promote the code into production.
This also means appropriate business owner signoff and even a formal signoff process. Since there currently aren’t many tools for migrating data science projects, there should be clearly documented steps. There should also be scheduled times and days of the week when data science projects move into production. The times depend on your natural business cycle, but the habit and discipline should be set in place. Migrations to production need a clear signoff window so that if the migration time is 3:00 on a Thursday, the signoff can’t be at 2:59.
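The signoff window can be written down as a rule rather than left to negotiation. The sketch below assumes a 24-hour lead time before the migration; that number, and the names `SIGNOFF_LEAD_TIME` and `signoff_is_valid`, are placeholders for whatever your business cycle calls for.

```python
from datetime import datetime, timedelta

# Assumption: signoff must be recorded at least this long before the migration.
SIGNOFF_LEAD_TIME = timedelta(hours=24)


def signoff_is_valid(
    signed_at: datetime,
    migration_at: datetime,
    lead_time: timedelta = SIGNOFF_LEAD_TIME,
) -> bool:
    """True when the signoff landed on or before the cutoff (inclusive)."""
    cutoff = migration_at - lead_time
    return signed_at <= cutoff


migration = datetime(2024, 6, 6, 15, 0)  # a Thursday at 3:00 p.m.
print(signoff_is_valid(datetime(2024, 6, 6, 14, 59), migration))  # 2:59 that day
print(signoff_is_valid(datetime(2024, 6, 5, 10, 0), migration))   # the day before
```

Writing the comparison out as `<=` settles in advance whether the end of the window is inclusive, which is exactly the kind of detail people will argue about at the deadline.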
One of the biggest challenges to standards for data science is that, in general, the model must be made with production data. A few years back I got into a “passionate” discussion with a very competent and strong-minded leader with a history of leading successful development projects. At the time, I was the reporting guy. We were discussing why I was so nonchalant about testing my reports in the development environment. If I recall correctly, it was testing a report that was specific to the prior day activity. Sure, they were doing the test scenarios, but the amount of data feeding into the reporting system was but a mere fraction of what would be the daily data deluge from the system once it went live.
I was adamant that the simple test scenarios were so inadequate for testing purposes that my time was better spent elsewhere. My friendly antagonist was of the opinion that if you check thoroughly with what you have, you can be ready for production. I pity the poor person who was sitting between us. She described it as being stuck between parents who were arguing.
Let’s just say the discussion ended with “we agree to disagree.”
The reason for this story is that while reporting required a leap of faith on the environmental requirements, data science requires an even larger leap: it doesn’t just need production-sized data to enable good testing, it requires production data to be built in the first place. This changes the traditional “development” box into a “laboratory” box.
The laboratory server contains production data (anonymized where appropriate). It is where the data science tests are performed. The reason it isn’t a production box is that the data science team needs to be able to create and modify the environment while, at the same time, it needs to be continuously fed production data. The source for the production data could be pulls from a transaction system or from a production reporting server. There needs to be a quarterly process to review intermediate tables so that any abandoned tables are cleaned up. By its very nature, the laboratory server will be sized like a production reporting server. Data science projects may produce results for consumption by leadership from this box, so keeping it clean and organized is a must.
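The quarterly review of intermediate tables can start from something as simple as a list of tables and the date each was last used. The sketch below assumes you can get that date from your database or platform; the 90-day idle limit, the table names, and the function `find_abandoned` are all hypothetical.

```python
from datetime import date, timedelta


def find_abandoned(
    last_used: dict[str, date], today: date, max_idle_days: int = 90
) -> list[str]:
    """Return table names not used within the last max_idle_days, sorted."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(name for name, used in last_used.items() if used < cutoff)


# Hypothetical inventory of intermediate tables on the laboratory server.
inventory = {
    "tmp_churn_features_v1": date(2024, 1, 15),
    "int_sales_daily_rollup": date(2024, 6, 28),
}
print(find_abandoned(inventory, today=date(2024, 7, 1)))
```

The output is a candidate list for a person to review, not a list to drop automatically. Some tables sit idle for good reasons.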
Let’s talk about what I like to call the “liminal” server. I like this definition of liminal space: “A liminal space is the time between the ‘what was’ and the ‘next.’ It is a place of transition, waiting, and not knowing. Liminal space is where all transformation takes place, if we learn to wait and let it form us.” OK, a bit too new age-y for me in general, but you get the point. A liminal server is a server between QA and Production. Sometimes it is called a “staging” server, but with many data science projects the transition isn’t so much a quick cutover as an easing in. The liminal server is used to run a new model on production inputs and compare its results to those of the existing production model. The liminal server needs to be able to react like a production server, but it doesn’t have the same storage requirements. Each model should have a clearly defined amount of time in the liminal space.
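Here is a minimal sketch of the kind of side-by-side comparison a liminal server exists for. It assumes numeric predictions, known actual outcomes for the same transactions, and mean absolute error as the yardstick; the metric, the sample numbers, and the function names are illustrative assumptions, and your own models may call for a different measure entirely.

```python
def mean_absolute_error(actual: list[float], predicted: list[float]) -> float:
    """Average absolute difference between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def candidate_beats_production(
    actual: list[float],
    production_pred: list[float],
    candidate_pred: list[float],
    min_improvement: float = 0.0,
) -> bool:
    """True when the candidate's error is lower by more than min_improvement."""
    production_error = mean_absolute_error(actual, production_pred)
    candidate_error = mean_absolute_error(actual, candidate_pred)
    return (production_error - candidate_error) > min_improvement


# Made-up numbers: both models scored the same four transactions.
actual = [10.0, 12.0, 9.0, 11.0]
production_pred = [11.0, 10.0, 9.0, 13.0]
candidate_pred = [10.0, 11.0, 10.0, 11.0]
print(candidate_beats_production(actual, production_pred, candidate_pred))
```

The `min_improvement` parameter is there so that a new model has to earn its promotion by a margin you choose, rather than winning on noise.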
The Full Flow
Let’s look at the full flow:
- Laboratory – Production data to build models / test hypotheses
- Development – Where data science first interacts with APIs
- Quality – Where interfaces to the data science are confirmed
- Liminal – Where data science models are compared against production transactions
- Production – Where the real money is
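The flow above can also be encoded so that tooling refuses to skip a step. This is only a sketch: the lowercase environment names and the helpers `next_environment` and `is_valid_promotion` are invented for illustration.

```python
from typing import Optional

# The five environments from the list above, in promotion order.
PROMOTION_PATH = ["laboratory", "development", "quality", "liminal", "production"]


def next_environment(current: str) -> Optional[str]:
    """Return the environment after `current`, or None at the end of the path."""
    index = PROMOTION_PATH.index(current)  # raises ValueError for unknown names
    if index + 1 < len(PROMOTION_PATH):
        return PROMOTION_PATH[index + 1]
    return None


def is_valid_promotion(source: str, target: str) -> bool:
    """True only when `target` is the very next step after `source`."""
    return next_environment(source) == target


print(is_valid_promotion("liminal", "production"))
print(is_valid_promotion("laboratory", "production"))
```

The second call is the cowboy move, straight from the lab to production, and the check is written to reject it.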
It Matters for the Cowboys
I know that someone is going to read this who knew me back in the day and break out laughing. I’ll own up to my own cowboy past. I was that guy who, if the signoff window ended at 1:30 p.m., was trying to get everything signed off at 1:29. I might even have pushed it to exactly 1:30 and then spent 15 minutes claiming that the end of the window was a “less than or equal to” and not just a “less than”.
It is important to remember that the guardrails above are there to produce better data science. The rules aren’t there for rules’ sake. Just as the cowboy laments “Give me land, lots of land under starry skies above. Don’t fence me in,” so too will the data cowboy lament the processes above. But here is the kicker: you need the cowboy. The rules are there to help them, and you need them to help build the rules.
You don’t want the cowboy to flee to greener pastures. You want them as an integral part of the process. If you bog your data cowboys down, you will lose them. And your data science team isn’t only as good as the process or the tools. Your data science team is only as good as your people.
One of my colleagues says that all data scientists are natural-born disruptors. In a typical reaction by a data scientist, I’ll add that I disagree with this assessment. I’ll concede that many of the best data scientists I’ve met are cowboys. But just as the technology has matured, so too must the processes and, by extension, so too must the people. It can’t be forced onto people by saying “it is for your own good.” Just make sure that as you implement the processes, you keep your people enthused about the changes.
End of the Era
With the end of the cowboy era comes the standardization of names, environments, and processes. This gets us better, stronger, more production-worthy data science, but it doesn’t mean you need to put your current cowboys out to pasture.
Next time I’ll talk about how you engage your cowboys to build the process. This comes with the understanding that you can’t cowboy the process.