\documentclass[a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{authblk}
\usepackage{dblfloatfix}
\usepackage{hyperref}
\usepackage{graphicx}
\usepackage{listings}
\usepackage[numbers,square]{natbib}
\usepackage{xcolor}
\usepackage{xspace}
\usepackage{url}
\usepackage{magazine-style}
\usepackage{textcomp}
\makeatletter
\newcommand*\verbfont{%
\normalfont\ttfamily\hyphenchar\font\m@ne\@noligs}
\makeatother
% Listing settings for Python
\lstset{language=python,
extendedchars=true,
columns=fullflexible,
upquote=true,
xleftmargin = 0pt,
%frame=single,
%rulecolor=\color{lightgrey},
aboveskip = 0.5ex,
belowskip = 0.6ex,
basicstyle=\verbfont,
escapebegin={\color{green}},
keywordstyle=\bfseries\color{blue!50!black},
%identifierstyle=\sffamily\small,
commentstyle=\color{green!40!black} \verbfont,
stringstyle=\color{red!60!black} \verbfont,
showstringspaces=false,
tabsize=2,
breaklines=true,
%classoffset=1,
morekeywords={{,},=,:}, %keywordstyle=\color{SkoleYellow},
%classoffset=0,
backgroundcolor=\color{green!10},
fillcolor=\color{green!10},
}
\title{Scikit-learn: machine learning without learning the machinery}
%\author{G. Varoquaux \and L. Buitinck \and G. Louppe \and O. Grisel \and F.
%Pedregosa \and A. Mueller}
\author[1]{G. Varoquaux\thanks{[email protected]}}
\author[2]{L. Buitinck}
\author[3]{G. Louppe}
\author[1]{O. Grisel}
\author[1]{F. Pedregosa}
\author[4]{A. Mueller}
\affil[1]{Parietal, INRIA, Bat 145 CEA Saclay, 91191 Gif-sur-Yvette,
France}
\affil[2]{ISLA, University of Amsterdam, Science Park 904, Amsterdam, The
Netherlands}
\affil[3]{Montefiore Institute, University of Liège, Liège, Belgium}
\affil[4]{Amazon Development Center Germany, Berlin, Germany}
\begin{document}
\lstset{language=Python}
\maketitle
\begin{abstract}
Machine learning is a pervasive development at the intersection of
statistics and computer science. While it can benefit many
data-related applications, the technicality of the research literature
and the corresponding algorithms slows down its adoption. Scikit-learn is
an open-source software project that aims at making machine learning
accessible to all, whether it be in academia or in the industry. It
benefits from
the general-purpose Python language, which enjoys huge growth in
the scientific world, as well as a thriving ecosystem of contributors.
Here we give a quick introduction to scikit-learn as well as to
machine-learning basics.
\end{abstract}
\section{A software project across communities}
\paragraph{Project vision.}
%
Scikit-learn was born from the observation that most standard
machine learning algorithms were out of reach of the users who could
most benefit from them: researchers --biologists, climate
scientists, experimental physicists-- or developers of web
services or domain-specific applications.
%
In the scientific world, implementations of these algorithms
mostly consisted of scattered pieces of code %to download
on researchers' web pages. Unifying efforts could be found in
statistics-specific environments and libraries such as the R
language \cite{Rmanual}, the Weka Java toolkit \cite{hall2009weka}, or the
Shogun C++ library \cite{sonnenburg2010}.
Of these, R is a custom \emph{language},
while Weka is a Java library with a custom GUI, and Shogun is a fairly
technical C++ code base.
In scikit-learn, we chose Python to reach many users, technical or not,
tying machine learning into a general-purpose language suited to simple
interactive programming \cite{perez2007ipython}, sophisticated
scripting, and full-blown applications.
Scikit-learn aims at bridging the gap between machine-learning research and
applications by providing a library, and not an environment, in a
general-purpose programming language, relying on domain-agnostic data structures
\cite{pedregosa2011}. Emphasis is put on quality and ease of use, which
implies focus on installation issues, documentation, and API design
\cite{buitinck2013ecml}. For the project to be usable in real-world
settings, computational performance is also a priority.
\paragraph{The Python data ecosystem.}
%
Machine learning is only a small part of a data-analysis pipeline, and
scikit-learn nicely dovetails into the rich Python ecosystem: scientific
and numeric tools \cite{oliphant2007python,varoquaux2013scipy}, but also
text processing tools, web servers, etc.
%
For its numerical needs, scikit-learn leverages NumPy arrays
\cite{vanderwalt2011} --data structures for efficient numerical
computation--, SciPy --a collection of classic numerical algorithms--,
matplotlib \cite{hunter2007matplotlib} --for scientific plotting-- and
Cython, to generate compiled code in a Python-like syntax
\cite{behnel2011cython}.
%
A plethora of Python packages can help users with input or preprocessing
of data, notably Pandas for columnar data \cite{mckinney2012}, scikit-image
for images, and NLTK for text. Finally, the IPython environment
\cite{perez2007ipython} is priceless for interactive work.
\paragraph{Some history.}
%
Scikit-learn started circa 2007 as code from David Cournapeau and
Matthieu Brucher's PhD work, but development dwindled until 2010, when the
Parietal team at INRIA adopted the project, hiring a full-time
engineer. Basic APIs were defined and an efficient binding of LibSVM
\cite{chang2011libsvm} gave the project a compelling advantage. In
December 2011, the first international sprint was organized with generous
funding from Google. Today, the project has grown vastly beyond
INRIA into a worldwide open source effort with more than 200 contributors.
\section{A brief introduction to machine learning}
\begin{figure}[t]
%\hspace*{-.015\linewidth}%
\includegraphics[width=1.05\linewidth]{wage_data}%
\caption{Wage as a function of years of work experience and
education; data from \cite{berndt1991}\label{fig:data}.}
\end{figure}
\begin{figure}[b]
%\hspace*{-.015\linewidth}%
\includegraphics[width=1.05\linewidth]{wage_data_random_forest}%
\caption{Wage prediction from years of work experience and education,
using random forests.\label{fig:random_forest}}
\end{figure}
% Introduce and describe the data
Machine learning is about extracting rules from data, most often with the
goal of making decisions on new data \cite{elemstatlearn}. As a simple example, we show in
Fig.~\ref{fig:data} data about wages in the US, from \cite{berndt1991}:
11 \emph{features}, including wage, number of years of education and of work
experience, describe 534 individuals,
called \emph{samples} in machine-learning jargon.
Code to download the data and reproduce the figures can be found at
\url{https://github.com/scikit-learn/SigMobile-paper}.
%
This data showcases many of the typical difficulties in machine learning.
Samples are very irregularly distributed, as most individuals
finished their studies after high school. As a result, there are gaping
holes in the 2D plot of education versus work experience, in which any
statistical analysis is extrapolation. The data is very noisy: for a
given pair of education and work experience values, wages vary widely.
This variability
is probably explained by missing factors, such as sector of
activity, but adding them to the analysis creates a more complex picture,
in 3D or more rather than 2D, with even more gaps.
In Fig.~\ref{fig:random_forest}, we use a popular machine-learning
algorithm, random forests, to predict the wage from the years of
education, or the years of work experience, either as separate features,
or combined. The patchy square appearance of the prediction is due to the
inner workings of the algorithm, and it illustrates the corresponding
extrapolation mechanism. The predictions fit the data well; maybe too
well: in the prediction solely from the years of education (left of
the figure), the bump at 5 years is hard to believe. This bump is
probably a classic case of \emph{overfitting}: the algorithm is learning its
prediction from noise in the data. As a result, the prediction error on
new data will be significantly different from that measured on the
training data.
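For illustration, a minimal sketch of fitting such a model; the array
names {\tt X} and {\tt y} are hypothetical, standing for two feature
columns (education and experience) and the corresponding wages:
\begin{lstlisting}
from sklearn.ensemble import RandomForestRegressor

# X: (n_samples, 2) array of education and work
# experience; y: wages (hypothetical names).
forest = RandomForestRegressor(n_estimators=100)
forest.fit(X, y)
wage_prediction = forest.predict(X)
\end{lstlisting}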
% then with svms, and discuss model complexity and use linear
% models to link to convention statistics. Conclude with the difference
% with 1D and 2D, and introduce the curse of dimensionality.
%overfitting, model complexity
In Fig.~\ref{fig:linear_svm}, we use another popular machine-learning
algorithm, linear Support Vector Machines (SVMs). Unlike random forests,
it learns decisions with linear functions. It is said to have
a lower \emph{model complexity}, because the number of parameters
to learn from the data is much smaller. The risk here is to
\emph{underfit}: to not fully exploit the richness of the data, which may
not follow a linear law. The art of machine learning consists in choosing
the right family of algorithms/models to describe the data well and
find the sweet spot between underfitting and overfitting. As the number of features
describing the data grows, the number of parameters to learn grows, and
thus the risk of overfitting increases. This core difficulty is known as the
\emph{curse of dimensionality}.
\begin{figure}[b]
\hspace*{-.015\linewidth}%
\includegraphics[width=1.05\linewidth]{wage_data_linear_svm}%
\caption{Wage prediction from years of work experience and education,
using a linear SVM.\label{fig:linear_svm}}
\end{figure}
\section{Learning with scikit-learn}
\begin{figure*}[b]
% To have the float on the bottom, we need this declared on the page
% before where it is going to be printed
\lstinputlisting[lastline=12]{nlp_example.py}
\caption{Simple text-processing code to download movie reviews
and learn positive versus negative ratings.\label{fig:code}}
\end{figure*}
\paragraph{Setting up models and feeding them data.}
%
Learning algorithms in scikit-learn are embodied in \emph{estimators}:
objects instantiated with parameters that control learning.
Data is passed to the {\tt fit} method,
which accepts an $(n \times p)$ data matrix $\mathbf{X}$,
represented as a NumPy array or SciPy sparse matrix, where $n$ is the
number of samples, and $p$ the number of features.
When there is a quantity or class label to predict,
a second argument $\mathbf{y}$ (usually a 1D array)
contains the desired outcomes for the rows in $\mathbf{X}$.
For a regression task such as wage prediction, these are real-valued,
while in classification they are integers or strings that serve as labels
for the class to which each sample belongs.
Because data seldom arrives in this format,
scikit-learn comes with flexible feature extraction code
to make data suitable for consumption by estimators.
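As a minimal sketch (the toy arrays below are made up for illustration):
\begin{lstlisting}
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0., 1.], [1., 0.],
              [1., 1.], [0., 0.]])  # 4 samples, 2 features
y = np.array([1, 0, 1, 0])          # one label per sample

clf = LogisticRegression(C=1.0)  # parameters set at instantiation
clf.fit(X, y)                    # all learning happens in fit
\end{lstlisting}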
\paragraph{Supervised models: learning to predict.}
%
The goal of supervised models is the prediction of some value of interest.
The corresponding estimators
have a {\tt predict} method, which takes a data matrix $\mathbf{X}$ and
returns a predicted $\mathbf{y}$.
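Continuing the sketch above:
\begin{lstlisting}
X_new = np.array([[0.9, 0.1]])  # one new sample
clf.predict(X_new)              # returns an array of predicted labels
\end{lstlisting}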
\paragraph{Model evaluation and parameter selection.}
%
The performance of a model measures its ability to correctly
predict new data. Because of overfitting, the \emph{train} data used for
learning should in general not also be used to evaluate model performance,
as doing so often results in overly optimistic estimates of the model's
true prediction error.
Instead, the correct way to obtain an unbiased estimate is to leave out \emph{test} data,
untouched during training and used for model evaluation only.
Alternatively, a \emph{cross-validation} scheme may also be
used, where the data is repeatedly split into \emph{train} and
\emph{test} subsets. Scikit-learn provides integrated support for
cross-validation, as specific iterators defining the train and test
subsets. Several functions and objects accept these iterators as arguments to
perform cross-validation internally. For instance the {\tt
cross\_val\_score} function measures the prediction score of an
estimator. Cross-validation can also be used to tune the meta-parameters
of an estimator, such as the sparsity of a sparse model. For this,
specific estimators, \emph{meta-estimators}, such as {\tt
GridSearchCV}, take another estimator at construction time and use
cross-validation internally to set its parameters. When used together with
{\tt cross\_val\_score}, they perform \emph{nested} cross-validation,
\emph{i.e.}~the meta-parameters are set independently of the test data.
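A minimal sketch of this pattern, with module paths as in the
scikit-learn version contemporary to this article, and assuming data
arrays {\tt X} and {\tt y} as above:
\begin{lstlisting}
from sklearn import cross_validation
from sklearn.grid_search import GridSearchCV
from sklearn.svm import LinearSVC

# GridSearchCV tunes C by internal cross-validation...
search = GridSearchCV(LinearSVC(),
                      param_grid={'C': [0.1, 1., 10.]})
# ...while the outer loop scores the tuned model on
# left-out data: nested cross-validation.
scores = cross_validation.cross_val_score(search, X, y)
\end{lstlisting}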
\paragraph{Unsupervised models: learning to transform.}
%
Unsupervised learning covers all learning applications in which there is
no clearly identified variable to predict. For instance, in a clustering
application, the task is to group together observations that are similar.
Dimension reduction tries to find simplified representations that capture
the properties of the data well. Novelty detection finds, in a new dataset,
observations that differ from the data in the train set. As the uses of
unsupervised learning are very diverse, it is not possible to have an API
as uniform as for supervised learning. While some unsupervised
estimators can be used to predict a characteristic of new data, such as
in novelty detection, many are useful to transform data, as in dimension
reduction. Scikit-learn estimators can have a {\tt transform} method, used
for this purpose. A {\tt Pipeline} estimator can then chain estimators to
form a new estimator that applies transformations before calling the {\tt
fit} or {\tt predict} method of the last object.
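A minimal sketch of a pipeline chaining a dimension reduction with a
classifier (the step names and data arrays are again illustrative):
\begin{lstlisting}
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Reduce to 10 dimensions, then classify; the pipeline
# behaves like a single estimator.
model = Pipeline([('reduce', PCA(n_components=10)),
                  ('classify', LogisticRegression())])
model.fit(X, y)       # fit PCA, transform X, fit classifier
model.predict(X_new)  # transform X_new, then predict
\end{lstlisting}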
\section{In practice: putting scikit-learn to good use}
\paragraph{A simple text-mining example.}
%
Not every dataset comes as a set of numbers. For instance in natural
language processing (NLP), observations are strings,
which need to be transformed into the data matrix expected by
estimators. Often, features used to build decisions indicate the
presence of certain words, or their frequencies.
In spam filtering, it can be very relevant to extract from a document that
it contains the term ``casino''.
In other NLP problems, such as finding proper names,
events such as ``next token is a Roman numeral'' may be relevant, as in
``J.P. Kennedy III.''
Reusing the {\tt transform} API, scikit-learn
provides \emph{feature extraction} objects that turn such
unstructured data into a data matrix. In these situations, the
$\mathbf{X}$ matrix is a matrix of counting statistics.
Since each word is typically present in only a few documents,
the data matrix will be mostly zeros.
SciPy offers a \emph{sparse matrix} object
that stores the zeros in such matrices implicitly,
and many scikit-learn estimators can use this object to improve scalability.
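A minimal sketch, using scikit-learn's {\tt CountVectorizer} on two toy
documents:
\begin{lstlisting}
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the casino scene", "the last scene"]
vec = CountVectorizer()
X = vec.fit_transform(docs)  # SciPy sparse matrix of word counts
# X has one row per document, one column per distinct word
\end{lstlisting}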
In Fig.~\ref{fig:code} we give an example of fully functioning code to
do sentiment polarity classification for movie reviews: given a review,
it outputs whether that review is positive about the movie it concerns.
The code demonstrates the use of pipelines, and also shows how machine
learning ties into other Python modules: downloading, unpacking and
learning from a dataset are all done in one programming language.
This script first fetches a hand-labeled movie review dataset
\cite{pang2004} and unpacks it to disk using standard Python libraries.
It then loads these using scikit-learn data loading functions, makes a
pipeline of a \textsf{tf--idf} feature extractor
\cite{rennie2003tackling} (that turns strings into a sparse matrix based
on word frequencies) and logistic regression, and finally trains the
model. Parameters for the feature extractor and the classifier
were set manually.
\begin{figure*}[b]
% To have the float on the bottom, we need this declared on the page
% before where it is going to be printed
\lstinputlisting{online_learning_example.py}
\caption{On-line learning: the main loop iterates over batches
of 1000 documents, vectorizing them on the fly and updating the
classifier. Running the example shows the error on the test set
progressively decreasing as data is accumulated.\label{fig:online}}
\end{figure*}
We can test our classifier on two new movie reviews.\footnote{
The code can be tried interactively by simply copy-pasting it
into an interactive Python interpreter or IPython \cite{perez2007ipython}.
}
\begin{lstlisting}
>>> clf.predict(["Worst movie I ever saw",
... "Godzilla, eat your heart out!"])
array([0, 1])
\end{lstlisting}
Here, we asked the estimator to predict whether the reviews were
positive.
The first review is predicted as being negative (0),
while the second is positive (1). We have
anecdotal evidence that the classifier works. To get a measure of the
classifier's accuracy, we retrain it in a cross-validation scheme.
This scheme yields the fraction of correct answers on three different folds:
\begin{lstlisting}
>>> from sklearn import cross_validation
>>> cross_validation.cross_val_score(clf, data.data, data.target)
array([0.9026946, 0.8723724, 0.8633634])
\end{lstlisting}
We see that the expected accuracy is about 88\%.
Experimenting with different classification models is easy:
e.g., replacing \texttt{LogisticRegression} with \texttt{svm.LinearSVC}
gives a linear support vector machine.
\paragraph{What about big data?}
%
\emph{Big data} is a major trend in data analytics: applying machine
learning to large datasets can extract information that has so far gone unexploited.
While scikit-learn is most efficient in the ``medium data'' range,
using shared memory multiprocessors rather than clusters,
it has facilities for dealing with datasets that don't fit
in any single machine.
Thinking in terms of the $(n \times p)$ data matrix,
we can distinguish between the case of large numbers of samples $n$,
and large numbers of features $p$.
For a large number of features, scikit-learn's options include
random projections, which reduce data dimensionality
while preserving most of the geometric structure.
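A minimal sketch, assuming a wide (possibly sparse) feature matrix {\tt X}:
\begin{lstlisting}
from sklearn.random_projection import SparseRandomProjection

# Randomly project X down to a few hundred dimensions.
proj = SparseRandomProjection(n_components=500)
X_small = proj.fit_transform(X)
\end{lstlisting}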
For the ``large $n$'' case, some estimators expose a \texttt{partial\_fit}
method that does \emph{online learning}.
Instead of loading all of the data in a single huge matrix, the user
can feed the estimator chunks of the data,
and the algorithm finds a good approximation to its objective
after a few passes over the dataset.
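A minimal sketch, where {\tt iter\_chunks} stands for a hypothetical
chunk-by-chunk data loader:
\begin{lstlisting}
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
for X_chunk, y_chunk in iter_chunks():  # hypothetical loader
    # partial_fit needs the full set of labels up front
    clf.partial_fit(X_chunk, y_chunk, classes=[0, 1])
\end{lstlisting}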
% The citation is for "Heaps' law":
% vocabulary size as a function of collection size.
% Gael, feel free to remove it if you think it's too detailed.
With large collections of text, one faces both challenges:
the dimensionality $p$
tends to increase with sample size $n$, as the number of different words used
is greater in large collections \cite[88--89]{manning2009}.
Beyond the mere size of the data matrix
created by a standard vectorization approach, outlined earlier, another challenge
is that the vectorization must be done in two passes over the documents,
as the size of the output vectors is not known up front.
Here, \emph{feature hashing} can be used
to tackle both of these challenges and produce a streaming vectorizer,
the {\tt HashingVectorizer} in scikit-learn.
Words are run through a (stateless) hash function, rather than a lookup table,
to map words to indices.
Collisions may occur, but the hashing algorithm \cite{weinberger2009}
is specially designed to mitigate their effect.
The dimensionality of the output vectors can be set by the user:
the larger it is, the less likely collisions are to occur.
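A minimal sketch:
\begin{lstlisting}
from sklearn.feature_extraction.text import HashingVectorizer

# Stateless: no vocabulary to fit, so documents can be
# vectorized in a single streaming pass.
vec = HashingVectorizer(n_features=2 ** 20)
X = vec.transform(["the casino scene"])
\end{lstlisting}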
%
A full example doing online learning on the 20 Newsgroups corpus\footnote{\url{http://qwone.com/~jason/20Newsgroups/}}
is given in Fig.~\ref{fig:online} (a downloader for this corpus comes with scikit-learn).
\section{Nurturing an open source project}
The greatest asset of the scikit-learn project is its breadth of
contributors, coming from different backgrounds, working in different
institutions, with a variety of applications: more than 200 contributors in
total, with 15 active core contributors in the past year. Indeed, not only
is there strength in numbers, but the convergence of points of view
makes for better code and better design. From a developer's perspective, the
project is managed with the goal of enabling newcomers to contribute,
while keeping quality high.
Our development workflow relies on intensive code review,
facilitated by the online version control environment GitHub
(\url{http://github.com}). The code review process is a great asset for
raising the quality of the code, from all points of view: algorithmic,
numerical robustness, homogeneity of APIs, ease of use, documentation.
Another central tool for maintaining quality in the project is the use of
unit testing (some 88\% of the lines of code are covered).
All would-be code contributions are automatically tested using Travis
(\url{http://travis-ci.org}).
The open-source, community-driven model of scikit-learn has paid off
handsomely. However, we find that it does not fit the standard
professional incentive structure, whether it be in academia or in the
industry. Indeed, getting due credit for grass-roots, free software that
``just works'' is difficult. None of the grant proposals submitted to fund
development by core contributors have so far been accepted, and there is
no straightforward revenue model. Even more challenging is properly
rewarding the long tail of small contributors.
%Finally, active
%contributors tend to be victim of the project's success, and be promoted
%to leading positions which leave them little time.
Scikit-learn has become a reference machine learning toolkit. As such, it
should offer a wide variety of learning algorithms. However, each new
feature, each new line of code, comes with a maintenance cost.
Uncontrolled growth would lead the project to pick up weight until it
slowed to a crawl. The difficulty lies in defining the project's scope and
prioritizing methods. We strive to identify and integrate all major
methods most useful for prediction tasks. We favor time-tested
algorithms and take great care in the choice of default parameters. We put
a lot of effort into making the documentation didactic and
pragmatic. Our
secret ambition is that it can provide an executable alternative to
textbooks, with all the figures generated by code examples.
Machine learning is a very active field of research, with promising
trends such as deep learning, harnessing GPUs or distributed computing
for extra computational power, or extending to new platforms, such as
Android or iOS-based embedded devices. However, as scikit-learn's purpose
is to provide tried-and-true, mostly turn-key, predictive tools, we have
to walk a fine line between supporting powerful new techniques and
spending our limited development cycles on proven technologies. For
instance, deep neural networks are an emerging tool with a huge
potential, but are still in flux and hard to use without significant expertise.
Similarly, GPGPU technology is promising, but not ready for non-expert use.
The HPC community and industry have largely achieved source-level compatibility with OpenCL
\cite{stone2010opencl},
but performance portability and cross-platform driver support remain
challenging.
%
Support for mobile platforms such as Android is determined by those platforms'
support for Python and the standard C/C++ build chain. Indeed, our basic
execution environment is the scientific Python stack, which is both very
powerful and high-level. The vision of the scikit-learn project is to
bring advances in machine learning to users who may lack the full
expertise. As a consequence, we focus on seasoned approaches rather than
implementing ongoing research. We believe that code is an intrinsic part
of the scholarly output of scientific research in computational
science \cite{buckheit1995wavelab}, and scikit-learn strives to be akin
to a reference textbook in this research software landscape.
%We have found that implementing, even a standard algorithm, really well,
%can require a lot of domain knowledge. Thus it is natural that specific
%libraries span up to solve it. Some adopt scikit-learn API and standard,
%and we hope that scikit-learn has a structural effect on the environment.
\bibliography{paper}
\bibliographystyle{plain}
\end{document}