Say you have a classification task, such as figuring out which people are most likely to buy your product. Machine Learning courses often assume you have a large amount of data and relatively little domain expertise. But many practical situations involve small data and a significant amount of domain expertise, either yours or your coworkers’. What’s a simple way to build a good model in a situation like this? Here’s a framework I’ve developed which generally lets me build a strong predictor in a few hours, occasionally beating algorithms with weeks of tuning.
Traits of a good model
- High predictive ability (low bias)
- Generalizes well to new data (low variance)
- Easy for humans to understand
- Easy to incorporate domain knowledge
Logistic Regression is a good starting point, because it naturally generalizes well, is easy to understand, and is included out of the box in most statistical packages.
A default out-of-the-box LR will use all the variables available to you, which can significantly increase your chance of overfitting. L1 regularization gives you an easy way to reduce your number of features, which makes the model more human-readable and less prone to overfitting.
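As a concrete sketch, here's what that looks like in scikit-learn (the library choice and the toy data are my own assumptions, not part of the framework): an L1 penalty drives the coefficients of uninformative features to exactly zero, leaving you a shorter list of variables to read.

```python
# Sketch: L1-regularized logistic regression on toy data.
# Only features 0 and 1 carry signal; L1 should zero out most of the rest.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 10 candidate features, small data
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

# penalty="l1" drives uninformative coefficients to exactly zero;
# smaller C means stronger regularization and fewer surviving features.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X, y)

kept = np.flatnonzero(model.coef_[0])   # indices of features that survived
print("features kept:", kept)
```

Tuning `C` is the main knob here: tighten it until the surviving feature list is short enough for a human to review.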
Decision Tree Splits
The major limitation of LR is that it’s linear. A very common case where this fails is when a numeric variable’s predictive value is effectively binary.
For example, say you’re deciding whether to loan someone money. If they recently went bankrupt, you probably don’t want to lend to them. But someone who went bankrupt 10 times is certainly not 10 times worse than someone who went bankrupt once.
One way to incorporate this data into your model is to run 1-level Decision Tree splits on your features, and then enter those splits as binary variables. For example, “number of previous purchases” might turn into a binary variable where 0 represents fewer than 3 purchases and 1 represents 3 or more.
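A hypothetical sketch of that step, again using scikit-learn: fit a depth-1 tree (a “stump”) on one numeric feature, read off the split threshold it chose, and binarize the feature at that threshold before feeding it into the regression. The feature and target here are invented for illustration.

```python
# Sketch: derive a binary split from a 1-level decision tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
purchases = rng.integers(0, 12, size=300)    # "number of previous purchases"
buys = (purchases >= 3).astype(int)          # toy target with a threshold effect
flip = rng.random(300) < 0.1                 # add some label noise
buys = np.where(flip, 1 - buys, buys)

stump = DecisionTreeClassifier(max_depth=1)
stump.fit(purchases.reshape(-1, 1), buys)

threshold = stump.tree_.threshold[0]         # the learned split point
binary_feature = (purchases > threshold).astype(int)  # goes into the LR
print("split at", threshold)
```

Repeat this per numeric feature, and the regression now sees each one as a yes/no question instead of a raw count.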
At this point, do the variables that appear in your regression make sense? A binary split for “number of previous bankruptcies” seems like a great variable, a split for “business located at least 9,837 feet above sea level” is…probably not. You can now easily apply domain knowledge to weed out variables that seem like random overfitting and re-train your algorithm.
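That review step can be as simple as printing the surviving coefficients next to human-readable split names. The names and data below are invented for the example; the point is that a nonsense split should carry little weight, and you can drop it and re-train.

```python
# Sketch: review named coefficients before re-training.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
names = ["bankruptcy >= 1", "altitude >= 9837ft", "purchases >= 3"]
X = rng.integers(0, 2, size=(300, 3)).astype(float)  # binary split features
y = ((X[:, 0] == 0) & (X[:, 2] == 1)).astype(int)    # altitude is irrelevant

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
for name, coef in zip(names, model.coef_[0]):
    print(f"{coef:+6.2f}  {name}")
# The sensible splits keep their weight; the altitude split should
# shrink toward zero -- a candidate to drop before re-training.
```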