Are sklearn defaults wrong?

There was an uproar on Twitter recently about the default behavior of sklearn's LogisticRegression, which applies regularization unless you explicitly turn it off.

If you read the post, you can see that the biggest problem with this choice is that, unless your data is normalized, you will train a model that probably underperforms: you are unnecessarily penalizing it, so it learns less from the data than it could.
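To make the underperformance concrete, here is a minimal sketch on synthetic data. The data set, the 1000x scale difference, and the variable names are my own assumptions for illustration; they do not come from the post. Both features are equally informative, but one is recorded on a much smaller scale, so the default penalty makes the coefficient it would need too expensive.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): two equally informative
# features, but the second one is recorded on a scale 1000 times smaller.
rng = np.random.default_rng(0)
signal = rng.normal(size=(1000, 2))
y = (signal[:, 0] + signal[:, 1] > 0).astype(int)
X = signal * np.array([1.0, 1e-3])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default settings on raw data: the penalty discourages the large coefficient
# that the small-scale feature would need, so that feature is mostly ignored.
raw = LogisticRegression().fit(X_train, y_train)
raw_acc = raw.score(X_test, y_test)

# Same default settings, but every feature is standardized first.
scaled = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
scaled_acc = scaled.score(X_test, y_test)

print(f"raw features:          {raw_acc:.3f}")
print(f"standardized features: {scaled_acc:.3f}")
```

On data like this, the standardized pipeline should score clearly higher, because the model is allowed to use both features. Real data sets will show smaller or larger gaps depending on how different the feature scales are.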

The second problem with the default behavior of LogisticRegression is the choice of a regularization constant that is, in effect, a magic number (equal to 1.0). This hides the fact that the regularization constant should be tuned by hyperparameter search, not fixed in advance without knowing what the data and the problem look like.

Couldn't you just normalize the data and do a grid search, then? You certainly could. The widespread problem in machine learning, however, is that people often blindly follow online tutorials that were written without attention to these details, because the details are hard(er). Understanding how grid search works is not difficult, but it is not trivial either. Understanding why regularization is necessary requires a good mental model of the feature space. Again, these are hardly intricate concepts. The post notes that the first Google hit for "logistic regression sklearn example" does not talk about these fundamental details.
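For reference, normalizing and grid searching together can look roughly like the sketch below. The data set (sklearn's bundled breast cancer data), the grid of values, and the 5-fold setup are arbitrary choices of mine, not recommendations from the post.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline, so it is re-fit on each training fold
# and never sees the corresponding validation fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# C is the inverse of the regularization strength: smaller C, stronger penalty.
# The grid below is an illustrative assumption, not a tuned choice.
param_grid = {"logisticregression__C": np.logspace(-3, 3, 7)}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best C:", search.best_params_["logisticregression__C"])
print("cross-validated accuracy:", round(search.best_score_, 3))
```

The step name `logisticregression` is what `make_pipeline` generates from the class name, which is why the grid key is `logisticregression__C`. Whether the best value ends up near the default of 1.0 depends entirely on the data, which is the point of searching instead of assuming.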

As an aside, this makes for a very simple yet powerful question when interviewing data scientists: why should you normalize the data when using a regularization term? It is a trivial answer for any experienced data scientist, and a hard one if you are not an experienced practitioner [1].
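One way to sketch the answer, in my own notation rather than anything from the post: take a generic L2-penalized objective with per-sample loss ℓ and penalty strength λ (sklearn's C plays the inverse role), and rescale a single feature k by a factor s.

```latex
\begin{align*}
J(w) &= \sum_{i} \ell\!\left(y_i,\; w^\top x_i\right) + \lambda \sum_{j} w_j^2 \\
x'_{ik} &= x_{ik} / s \quad\Longrightarrow\quad w'_k = s\, w_k \ \text{ keeps } w'^\top x'_i = w^\top x_i \\
\lambda\, (w'_k)^2 &= \lambda\, s^2\, w_k^2
\end{align*}
```

The data term is unchanged by the rescaling, but the penalty paid for feature k grows by a factor of s². The same λ therefore punishes features differently depending on their units, unless all features are first brought to a comparable scale.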

This whole discussion illustrates why it is hard to justify our data science courses when most people think that you can find all the answers online. While this is true, understanding which answers are correct, and which are not, often takes an expert.

Want more controversial opinions every day in your Twitter client? I'm gglanzani there!


  1. Not knowing the answer doesn't mean the candidate fails; it just gives you a better idea of their skill.
