A foundation for scikit-learn at Inria

fermigier | 78 points

Scikit-learn is a very nicely written library, and I could use plenty of superlatives to describe its wondrous API.

One thing I can't recommend enough is to extend their transformer base class in such a way that you implement its fit and transform methods. A simple example can be viewed here: https://gitlab.com/timelord/sklearn_transformers

This allows you to put your transformers into scikit-learn Pipelines and GridSearchCV (and more). Scikit-learn leverages multiple cores through joblib, and Dask extends this implementation to effortlessly scale scikit-learn pipelines onto a cluster of servers: https://distributed.readthedocs.io/en/latest/joblib.html

By writing your own data transformations in the transformer format you can, by extension, leverage this great ecosystem.
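To make this concrete, here is a minimal sketch of such a transformer (the `ClipOutliers` class and its parameters are hypothetical, invented for illustration; the `BaseEstimator`/`TransformerMixin` base classes and the fit/transform contract are scikit-learn's):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression


class ClipOutliers(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clips each feature to percentile
    bounds learned from the training data."""

    def __init__(self, lower=1.0, upper=99.0):
        # Store constructor params unchanged so get_params/clone work.
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Learn per-column clip bounds from the training data only.
        self.lo_ = np.percentile(X, self.lower, axis=0)
        self.hi_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        return np.clip(X, self.lo_, self.hi_)


# The custom transformer now drops straight into a Pipeline:
pipe = Pipeline([
    ("clip", ClipOutliers()),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
```

Because it follows the estimator conventions, GridSearchCV can tune it like any built-in step, e.g. with a grid such as `{"clip__upper": [95.0, 99.0]}`.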

I think it's a great time to be a data scientist / engineer now.

hetspookjee | 6 years ago

Unfortunately scikit-learn is a mess without an alternative.

There is so much wrong with the API design of sklearn (how can one think "predict_proba" is a good function name?). I can understand this, since most of it was probably written by PhD students without the time and expertise to come up with a proper API; many of them without a CS background. Compare this to e.g. the API of google/guava.

For example https://www.reddit.com/r/statistics/comments/8de54s/is_r_bet...

   Case in point, sklearn doesn't have a bootstrap cross-validator despite the bootstrap being one of the most
   important statistical tools of the last two decades. In fact, they used to, but it was removed. 
   Weird right?
   ...
   > We don't remove the sklearn.cross_validation.Bootstrap class because few people are using it, 
   > but because too many people are using something that is non-standard (I made it up) and very very 
   > likely not what they expect if they just read its name. 
   > At best it is causing confusion when our users read the docstring and/or its source code. 
   > At worst it causes silent modeling errors in our users' code base.
   ...
   Oh man, I thought of another great example. I bet you had no idea that 
   sklearn.linear_model.LogisticRegression is L2 penalized by default. 
   "But if that's the case, why didn't they make this explicit by calling it RidgeClassifier instead?" 
   Maybe because sklearn has a Ridge object already, but it exclusively performs regression? 
   Who knows (also... why L2 instead of L1? Yeesh). Anyway, if you want to just do unpenalized 
   logistic regression, you have to set the C argument to an arbitrarily high value, 
   which can cause problems. Is this discussed in the documentation? 
   Nope, not at all. Just on stackoverflow and github. 
   Is this opaque and unnecessarily convoluted for such a basic and crucial technique? Yup.
Or the following: https://www.reddit.com/r/haskell/comments/7brsuu/machine_lea...
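The quoted complaint about the default penalty can be checked directly. A minimal sketch (behaviour as of recent scikit-learn releases; note that `penalty=None` is only accepted in newer versions, which is why the large-`C` workaround circulated):

```python
from sklearn.linear_model import LogisticRegression

# The default really is L2-penalized (penalty='l2', C=1.0),
# not plain maximum-likelihood logistic regression:
clf = LogisticRegression()
print(clf.get_params()["penalty"])  # 'l2'

# For unpenalized logistic regression, recent releases accept
# penalty=None; older ones required an arbitrarily large C instead:
unpenalized = LogisticRegression(penalty=None)
approx_unpenalized = LogisticRegression(C=1e10)
```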

zeec123 | 6 years ago

I thought INRIA used OCaml everywhere and would have chosen Owl [1] (an OCaml library for numerical scientific computing and machine learning) as the project for this kind of foundation.

[1] https://github.com/owlbarn/owl

xvilka | 6 years ago

I'd love to have this same kind of library in Node.js.

11235813213455 | 6 years ago