08 May Productionizing machine learning: Lesson #1
Occam is always right – Model stacking doesn’t justify a +0.2% ACC
One design principle that has been applied across many fields is KISS, which stands for ‘Keep It Simple, Stupid’. It states that you should prioritize simple solutions over complex ones. The rule is as easy to understand as it is to forget, which leads to over-engineering: building more than what is actually necessary, based on speculation. That extra effort is not only wasted, it is likely to get in the way. Every unneeded feature hurts performance, extensibility and maintainability, and eventually profitability, so it should be avoided at all costs.
No Free Lunch
There is a trade-off between improvements and complexity. Any increase in the model’s accuracy will probably come from a new data source, a new feature transformation or a new architecture. In all of these cases complexity has increased, so you need to analyze whether the improvement in predictive power compensates for the added complexity. In most cases, especially if the dataset is small or unbalanced, a performance increase of less than 1% should be scrutinized: it might be just noise, or it may simply not be worth the added complexity.
Scale the model size to the data size
The most direct application of the KISS principle to machine learning is the use of regularization to avoid over-fitting. The most common regularization techniques are:
- L1 regularization (Lasso). A penalty term proportional to the absolute value of the weights is added to the cost function. This technique enforces weight sparsity.
- L2 regularization (Ridge). A penalty term proportional to the squared value of the weights is added to the cost function. This technique enforces small weight values.
- Early stopping. Stop training after a fixed number of iterations, or when the weights or the validation score stop changing significantly.
- In random forests, it is possible to randomly ignore attributes at each node, select a minimum number of instances per leaf, or impose a maximum tree depth.
- In neural networks, dropout prevents complex co-adaptations.
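The contrast between L1 and L2 penalties can be seen in a few lines. The sketch below is illustrative, using scikit-learn on synthetic data where only three of twenty features matter; the dataset and penalty strengths are arbitrary choices, not values from the text.

```python
# Compare L1 (Lasso) and L2 (Ridge) regularization on synthetic data
# where only the first 3 of 20 features are informative.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1: drives irrelevant weights to exactly zero
ridge = Ridge(alpha=10.0).fit(X, y)  # L2: shrinks all weights, rarely to exact zero

print("Lasso non-zero weights:", np.sum(lasso.coef_ != 0))
print("Ridge non-zero weights:", np.sum(ridge.coef_ != 0))
```

On data like this, the Lasso keeps roughly the three informative weights while the Ridge keeps all twenty small but non-zero, which is exactly the sparsity-versus-shrinkage distinction in the list above.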
Besides regularization, it is also important to scale the model size to the size of your dataset. As a rule of thumb, the train set should contain at least 10 times more data points than there are free parameters in your whole architecture, counting the model weights plus the hyperparameters. If the data is noisy, the classes are unbalanced, or there is high correlation between samples, this multiplier should be higher. As a simple example, in linear regression the number of weights is determined by the number of regressors (#features), so the number of data points (#samples) should be well above the number of features.
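The 10x rule of thumb is trivial to encode as a sanity check. The helper below is a hypothetical illustration (the function name and default multiplier are my choices, and the 10x figure is a heuristic, not a law):

```python
def enough_data(n_samples: int, n_params: int, multiplier: int = 10) -> bool:
    """Heuristic check: is the train set at least `multiplier` times the
    number of free parameters? Raise the multiplier for noisy, unbalanced,
    or highly correlated data."""
    return n_samples >= multiplier * n_params

# Linear regression with 15 features plus an intercept => 16 weights.
print(enough_data(n_samples=500, n_params=16))  # True: 500 >= 160
print(enough_data(n_samples=100, n_params=16))  # False: 100 < 160
```

When the check fails, the options are the ones this article advocates: gather more data, drop features, or pick a smaller model.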
Delete everything that is not needed
Unfortunately, most machine learning engineers and data scientists are not experienced software engineers. They tend to keep experimental code paths in the master branch, just in case they are needed in the future. This anti-pattern should be avoided: everything that is not currently used should be deleted, including data sources, features, feature transformations, utils, extra models, etc. Clean and simple code is easier to understand, debug and extend.
If you include many data sources in your model, you will need to keep them available for as long as your model is in production. Data sources may change their ranges or distributions, or even be deprecated. Thus, use only data that is actually needed and will likely remain available in the future, especially at prediction time.
Do not overfit the architecture
Most engineers use a train-test split to develop new models, which is good. However, it is common practice to reuse the test set to select the architecture, which is bad. In a typical scenario, a developer uses 80% of the data to train a model and the remaining 20% to test it. As the results are probably not good at first, the developer tries another model, then another feature transformation, and so on until the model performs well on the test set. Even if the split is redone randomly at each iteration, this overfits the architecture to the available data and harms the model’s generalization. The test set should only be used once, at the very end of development. A better approach is to initially split the data into 90% for development and 10% for testing. Put that 10% in your back pocket and forget about it. Then use the remaining 90% in a K-fold cross-validation framework to try different architectures. When you are satisfied with the architecture, recover that 10% and test the model. As a final step, report both the cross-validation and test scores, and retrain the model on the whole dataset before deploying.
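The protocol above can be sketched in a few lines with scikit-learn. This is a minimal illustration, not the author’s code; the dataset and the two candidate architectures are arbitrary stand-ins.

```python
# Hold out a test set once, select the architecture by K-fold CV on the
# remaining data, and touch the test set only at the very end.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 1) Put 10% aside and forget about it during development.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42, stratify=y)

# 2) Compare candidate architectures with 5-fold CV on the dev set only.
candidates = {
    "logreg": LogisticRegression(max_iter=5000),
    "forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
cv_scores = {name: cross_val_score(model, X_dev, y_dev, cv=5).mean()
             for name, model in candidates.items()}
best_name = max(cv_scores, key=cv_scores.get)

# 3) Only now recover the held-out 10% for a single final evaluation.
final_model = candidates[best_name].fit(X_dev, y_dev)
test_score = final_model.score(X_test, y_test)
print(best_name, cv_scores[best_name], test_score)
```

In production you would then retrain `final_model` on all of `X, y` before deploying, as the paragraph above recommends.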
Do not reinvent the wheel
The basics of most machine learning algorithms can be implemented by a bachelor’s student. However, a mature implementation can provide huge gains in computational and predictive performance. Thus, rather than building our own algorithms, we should use tested, open-source libraries built by experts. At the same time, it is also an anti-pattern to fill our implementation with glue code that merely connects those libraries, transforming data from one format to another.
Some ways to avoid creating waste are:
- Keep iteration cycles short. This helps the developer stay in the mindset of building only the essentials and skipping unneeded features.
- Ask ‘why’ of every task, module and function, and give answers that relate to the known needs of the customer.
Related posts:
- Machine learning professionalism – Let’s kill the hype and start creating value for customers
- Challenge #1: A model is not a product – It doesn’t matter how deep it is
- Challenge #2: Infrastructure (coming soon)
- Lesson #2: Get the pipeline right – Think twice about the name of that column! (coming soon)