Are sklearn defaults wrong?
There was an uproar on Twitter recently about the default behavior of sklearn:
"I have very strong feelings about scikit-learn's default keyword arguments. https://t.co/LUvKH0rQVO" (Senior OLS Engineer, @ryxcommar, August 31, 2019)
If you read the post, you can see that the biggest problem with the choice is that, unless your data is normalized, you will train a model that probably underperforms: you are unnecessarily penalizing it, making it learn less than it could from the data.
The second problem with the default behavior of LogisticRegression is the choice of a regularization constant that is, in effect, a magic number (equal to …). This hides the fact that the regularization constant should be tuned by hyperparameter search, not set in advance without knowing what the data and the problem look like.
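To see the magic number for yourself, here is a minimal sketch that reads the default straight from the estimator. It assumes a reasonably recent scikit-learn, where the regularization constant is the `C` parameter (the inverse of the regularization strength, so a larger `C` means a weaker penalty). The value `1e6` below is an arbitrary illustration, not a recommendation.

```python
from sklearn.linear_model import LogisticRegression

# Instantiate the estimator without any arguments, as most tutorials do.
default_model = LogisticRegression()

# C is the inverse of the regularization strength; this is the value
# you silently accept when you do not pass it yourself.
default_C = default_model.get_params()["C"]
print("default C:", default_C)

# Making the choice explicit: a large C weakens the penalty.
# (1e6 is an arbitrary value picked for illustration.)
weakly_regularized = LogisticRegression(C=1e6, max_iter=1000)
print("explicit C:", weakly_regularized.C)
```

Nothing is fitted here; the point is only that the constant exists and has a value whether or not you asked for it.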
You could just normalize the data and do a grid search then, couldn't you? We certainly could. The widespread problem in machine learning, however, is that people often blindly follow online tutorials that skip these details because they are harder to explain. Understanding how grid search works is not difficult, but it is not trivial either. Understanding why regularization is necessary requires a good mental model of the feature space. Again, these are hardly intricate concepts. The post notes that the first Google hit for "logistic regression sklearn example" does not discuss these fundamental details.
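Normalizing and grid searching can be sketched in a few lines. The example below is one possible setup, not the only one: the synthetic dataset, the step names `scale` and `clf`, the grid of `C` values, and the 5-fold cross-validation are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real problem.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Scaling lives inside the pipeline, so it is re-fitted on the training
# part of every cross-validation fold and never sees the held-out part.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Search C on a logarithmic grid instead of trusting the default.
param_grid = {"clf__C": np.logspace(-3, 3, 7)}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print("best C:", search.best_params_["clf__C"])
print("cross-validated accuracy:", search.best_score_)
```

Putting the scaler in the pipeline rather than scaling the whole dataset up front is a deliberate choice: it keeps information from the validation folds out of the preprocessing step.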
As an aside, this makes for a very simple yet powerful question when interviewing data scientists: why should you normalize the data when using a regularization term? It is a trivial answer for any experienced data scientist, and a hard one if you are not an experienced practitioner.[1]
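One way to build intuition for the answer is to change the units of a single feature and watch what the penalty does. The sketch below uses made-up data and an arbitrary, fairly strong penalty (`C=0.01`); the helper `fit_coefficients` is hypothetical. An unpenalized model would simply shrink the coefficient by the same factor of 100 and keep the same predictions. A penalized model does not, because the penalty acts on the raw coefficient, which is now tiny and therefore almost free.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two equally informative features on the same scale.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Same data, but feature 0 is expressed in units 100 times smaller,
# so its numeric values are 100 times larger.
X_rescaled = X.copy()
X_rescaled[:, 0] *= 100.0


def fit_coefficients(features):
    """Fit a penalized logistic regression and return its coefficients."""
    model = LogisticRegression(C=0.01, max_iter=10_000)
    model.fit(features, y)
    return model.coef_[0]


coef_original = fit_coefficients(X)
coef_rescaled = fit_coefficients(X_rescaled)

# Convert the rescaled coefficient back to the original units so that
# the two fits can be compared directly.
effective_original = coef_original[0]
effective_rescaled = coef_rescaled[0] * 100.0

print("feature 0 effect, original units:", effective_original)
print("feature 0 effect, after rescaling:", effective_rescaled)
```

The two printed effects differ even though the underlying data is identical, which is the point: with a regularization term, the choice of units becomes a modeling decision unless you normalize first.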
This whole discussion is what makes it hard to justify our data science courses when most people think that you can find all the answers online. While this is true, understanding which answers are correct, and which are not, often takes an expert.
Want more controversial opinions every day in your Twitter client? I'm gglanzani there!
[1] Not knowing the answer doesn't mean the candidate fails; it just gives you a better idea of their skill.