Within the arts, there has always been a tension between ornamentation and simplicity. The good artist is one who can execute a technique exceptionally, but the great artist is one who knows when to hold back.
I love a good grilled London Broil. But whenever I make it myself, it tastes like pencil erasers. It turns out I was adding way too much oregano to the marinade, and it was overpowering everything. If a little is good, how can more not be good-er?
It’s like the painter Ad Reinhardt said, “The more stuff in it, the busier the work of art, the worse it is. More is less. Less is more.” Data science is a young occupation that could stand to learn from these older pursuits. Whether it’s writing, cooking, or painting, editing is a core component of becoming a master of the discipline. Knowing when to hold back. The same is true in analytics. Oftentimes, a data scientist can build a better model, a more complex model, a more accurate model. But that doesn’t mean they should.
Do I really need to build this model? Can I do something simpler?
If you think your job is building models, then all you’ll ever do is try to build models. A data scientist’s job should be to assist the business using data, regardless of whether that’s through predictive modeling, simulation, optimization modeling, data mining, visualization, or summary reporting.
In a business, predictive modeling and accuracy are means, not ends. What’s better: a simple model that’s used, updated, and kept running? Or a complex model that works when you babysit it, but the moment you move on to another problem no one knows what the hell it’s doing?
Robert Holte weighed simplicity against accuracy in his rather interesting and thoroughly titled paper “Very Simple Classification Rules Perform Well on Most Commonly Used Datasets.” In that paper, the *air quotes* AI model he used was a single decision stump – one rule. In code, it’d basically be an IF statement. He simply looked through a training dataset, found the single best feature for splitting the data with respect to the response variable, and used that feature to form his rule. The scandalous part of Holte’s approach: his results were pretty good!
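To make that concrete, here’s a minimal sketch of a one-rule stump in Python. The features, data, and threshold search are invented for illustration, and Holte’s actual 1R algorithm handles discretization more carefully, but the spirit is the same: try every single-feature rule and keep the best one.

```python
# A sketch of a one-rule "decision stump": scan every feature and
# threshold, keep the single rule that best separates the classes.
# The features and training rows below are invented for illustration.

def best_stump(rows, features, label):
    """rows: list of dicts. Returns (feature, threshold, training accuracy)."""
    best = (None, None, 0.0)
    for f in features:
        for t in sorted({r[f] for r in rows}):
            # Candidate rule: predict True when the feature value >= threshold
            correct = sum((r[f] >= t) == r[label] for r in rows)
            if correct / len(rows) > best[2]:
                best = (f, t, correct / len(rows))
    return best

training = [
    {"logins_first_week": 9, "created_api_key": 1, "paid": True},
    {"logins_first_week": 2, "created_api_key": 0, "paid": False},
    {"logins_first_week": 7, "created_api_key": 1, "paid": True},
    {"logins_first_week": 1, "created_api_key": 1, "paid": False},
]

feature, threshold, accuracy = best_stump(
    training, ["logins_first_week", "created_api_key"], "paid")
print(f"IF {feature} >= {threshold} THEN paid  (training accuracy {accuracy:.0%})")
```

That’s the whole “model”: one feature, one cutoff, one IF statement.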
And so, Holte raises this point, “simple-rule learning systems are often a viable alternative to systems that learn more complex rules. If a complex rule is induced, its additional complexity must be justified…”
Complexity must be justified. Stop and think about that a moment.
If I’m working on a high-frequency trading model, additional complexity in an AI model might mean millions of dollars in revenue. And in that world, I’ve likely been hired to gold-plate a model whose only end users are myself and a few other PhDs.
But I don’t work in high-frequency trading. I work at a much more fun company. That means that while some of my models are mission-critical, i.e., revenue is on the line and every bit of accuracy helps, other models are less important.
Recently I needed to build a model that classified users into three groups: paid, likely-to-pay-in-the-future, and will-never-pay-us-a-dime. I could have built a highly complex model. There were many weak predictors at my disposal (“have you generated an API key in the first 30 days?”) that would have worked great in an ensemble model. But I stopped myself.
This was a model that no one would have the time to babysit. And our business wasn’t on the line. It just needed to be something that worked better than checking “free/paid” within an account.
Holte argues in his paper that the datasets used in predictive modeling often have relatively few features that really stand out as predictors. When I investigated my data, that was the case. One feature was highly predictive of likely-to-pay-in-the-future, and a second feature was almost as powerful. Beyond those two, the remaining features boosted accuracy only marginally.
So rather than build a full-fledged AI model, I handed an engineer an IF statement: “Check this and this in a user’s account. If those two things are true, they’re likely to pay us in the future.”
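In code, the handoff amounted to something like this. The field names below are placeholders, not the real checks, which are specific to our product.

```python
# A sketch of the rule handed to engineering: two boolean checks on an
# account. "strong_predictor_one/two" are stand-ins, not the real fields.
def likely_to_pay(account: dict) -> bool:
    return account["strong_predictor_one"] and account["strong_predictor_two"]

print(likely_to_pay({"strong_predictor_one": True, "strong_predictor_two": False}))  # False
```

No training pipeline, no model artifact to version and monitor, nothing to babysit.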
Now, I still let the business know the false positive rate and true positive rate (FPR and TPR) of this simple model. You’ve gotta let people know what they’re sacrificing by going with a simple approach. Because whether or not you increase complexity for additional accuracy is not a data science decision. It’s a business decision.
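If you need a refresher, those two numbers fall straight out of the confusion matrix on a labeled holdout set. A minimal sketch, with invented labels:

```python
def fpr_tpr(y_true, y_pred):
    # Confusion-matrix counts for a binary rule.
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum(p and not t for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    tn = sum(not t and not p for t, p in zip(y_true, y_pred))
    return fp / (fp + tn), tp / (tp + fn)

# Invented holdout labels vs. the rule's predictions
fpr, tpr = fpr_tpr([True, False, True, False], [True, False, False, False])
print(f"FPR {fpr:.0%}, TPR {tpr:.0%}")  # FPR 0%, TPR 50%
```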
So keep in mind that as a data scientist you’re a technical person, but there’s also a bit of artistry in your job. You are an editor subject to the needs of the business.