Machine Learning {#chap:ml}
================
**Rayid Ghani and Malte Schierholz**
This chapter introduces you to the use of machine learning in tackling
social science and public policy problems. We cover the end-to-end machine learning process and focus on clustering and classification methods. After reading this chapter, you should have an overview of the components of a machine learning pipeline and methods, and know how to use those in solving social science problems. We have written this chapter to give an intuitive explanation for the methods and to provide a framework and practical tips on how to use them in practice.
Introduction
------------
You have probably heard of "machine learning" but are not sure exactly
what it is, how it differs from traditional statistics, and what you, as social scientists, can
do with it. In this chapter, we will demystify machine learning, draw
connections to what you already know from statistics and data analysis,
and go deeper into some of the novel concepts and methods that have
been developed in this field. Although the field originates from
computer science, it has been
influenced quite heavily by math and statistics in the past 15--20 years. As you will
see, many of the concepts you will learn are not entirely new, but are
simply called something else. For example, you already are familiar with
logistic regression (a classification method that falls under the
supervised learning framework in machine learning) and cluster analysis
(a form of unsupervised learning). You will also learn about new methods
that are more exclusively used in machine learning, such as random
forests, support vector machines, and neural networks. We will keep formalisms to a
minimum and focus on getting the intuition across, as well as providing
practical tips. Our hope is this chapter will make you comfortable and
familiar with machine learning vocabulary, concepts, and processes, and
allow you to further explore and use these methods and tools in your own
research and practice.
What is machine learning?
-------------------------
When humans improve their skills with experience, they are said to
learn. Is it also possible to program computers to do the same? Arthur
Samuel, who coined the term *machine learning* in 1959
[@samuel1959some], was a pioneer in this area, programming a computer to
play checkers. The computer played against itself and human opponents,
improving its performance with every game. Eventually, after sufficient
*training* (and experience), the computer became a better player than
the human programmer. Today, machine learning has grown significantly
beyond learning to play checkers. Machine learning systems have learned
to drive (and park) autonomous cars, are embedded inside robots, can
recommend books, products, and movies we are (sometimes) interested in,
identify drugs, proteins, and genes that should be investigated further
to cure diseases, detect cancer and other pathologies in x-rays and other types
of medical imaging, help us understand how the human brain learns language, help identify
which voters are persuadable in elections, detect which students are
likely to need extra support to graduate high school on time, and help
solve many more problems. Over the past 20 years, machine learning has
become an interdisciplinary field spanning computer science, artificial
intelligence, databases, and statistics. At its core, machine learning
seeks to design computer systems that improve over time with more
experience. In one of the earlier books on machine learning, Tom
Mitchell gives a more operational definition, stating that: "A computer
program is said to learn from experience $E$ with respect to some class
of tasks $T$ and performance measure $P$, if its performance at tasks in
$T$, as measured by $P$, improves with experience $E$"
[@mitchell1997machine]. We like this definition because it is task-focused and allows us to think of machine learning as a tool used inside a larger system to improve outcomes that we care about.
---
**Box: Commercial machine learning examples** <a id="box:ml1"></a>
- **Speech recognition**: Speech recognition software uses machine learning algorithms that
are built on large amounts of initial training data. Machine
learning allows these systems to be tuned and adapt to individual
variations in speaking as well as across different domains.
- **Autonomous cars**: The ongoing development of self-driving cars applies techniques from
machine learning. An onboard computer continuously analyzes the
incoming video and sensor streams in order to monitor the
surroundings. Incoming data are matched with annotated images to
recognize objects like pedestrians, traffic lights, and potholes. In
order to assess the different objects, huge training data sets are
required where similar objects already have been identified. This
allows the autonomous car to decide on which actions to take next.
- **Fraud detection**: Many public and private organizations face the problem of fraud and
abuse. Machine learning systems are widely used to take historical
cases of fraud and flag fraudulent transactions as they take place.
These systems have the benefit of being adaptive, and improving with
more data over time.
- **Personalized ads**: Many online stores have personalized recommendations promoting
possible products of interest. Based on individual shopping history
and what other similar users bought in the past, the website
predicts products a user may like and tailors recommendations.
Netflix and Amazon are two examples of companies whose
recommendation software predicts how a customer would rate a certain
movie or product and then suggests items with the highest predicted
ratings. Of course there are some caveats here, since they then
adjust the recommendations to maximize profits.
- **Face recognition**: Surveillance systems, social networking platforms, and imaging
software all use face detection and face recognition to first detect
faces in images (or video) and then tag them with individuals for
various tasks. These systems are trained by giving examples of faces
to a machine learning system which then learns to detect new faces,
and tag known individuals. The bias and fairness chapter will highlight some concerns with these types of systems.
---
<!--
% Box 7.2 and section 7.7. overlap
-->
---
**Box: Social Science machine learning examples** <a id="box:ml2"></a>
Potash et al [-@Potash2015] worked with the Chicago Department of Public Health and used random forests (a machine learning classification method) to predict which children are at risk of lead poisoning. This early warning system was then used to prioritize lead hazard inspections to detect and remediate lead hazards before they had an adverse effect on the child.
Carton et al [-@Carton2016] used a collection of machine learning methods to identify police officers at risk of adverse behavior, such as unjustified use of force or unjustified shootings or sustained complaints, to prioritize preventive interventions such as training and counseling.
Athey and Wager [-@athey2019] use a modification of random forests to estimate heterogeneous treatment effects using a data set from The National Study of Learning Mindsets to evaluate the impact of interventions to improve student achievement.
Voigt et al [-@Voigt2017] use machine learning methods to analyze footage from body-worn cameras and understand the respectfulness of police officer language toward white and black community members during routine traffic stops.
---
Machine learning grew from the need for systems that were adaptive,
scalable, and cost-effective to build and maintain. A lot of tasks now
being done using machine learning used to be done by rule-based systems,
where experts would spend considerable time and effort developing and
maintaining the rules. The problem with those systems was that they were
rigid, not adaptive, hard to scale, and expensive to maintain. Machine
learning systems started becoming popular because they could improve the
system along all of these dimensions^[See Chapter
[Record Linkage](#chap:link).]. Box [Applications](#box:ml1)
mentions several examples where machine learning is being used in
commercial applications today. Social scientists are uniquely placed
today to take advantage of the same advances in machine learning by
having better methods to solve several key problems they are tackling.
Box [Social Science](#box:ml2) describes a few social science and policy problems that are being tackled using machine learning today.^[If you have examples from your own research using the methods we describe in this chapter, please submit a link to the paper (and/or code) here: https://textbook.coleridgeinitiative.org/submitexamples]
This chapter is not an exhaustive introduction to machine learning.
There are many books that have done an excellent job of that
[@Flach; @HastieTibshirani; @mitchell1997machine]. Instead, we present a
short and accessible introduction to machine learning for social
scientists, give an overview of the overall machine learning process,
provide an intuitive introduction to machine learning methods, give some
practical tips that will be helpful in using these methods, and leave a
lot of the statistical theory to machine learning textbooks. As you read
more about machine learning in the research literature or the media, you
will encounter names of other fields that are related (and practically
the same for most social science audiences), such as statistical
learning, data mining, and pattern recognition.
Types of analysis
----------------------------
A lot of the data analysis tasks that social scientists do can be broken down into four types:
- **Description**: The goal is to describe patterns or groupings in historical data. You’re already familiar with descriptive statistics and exploratory data analysis methods, and we will cover more advanced versions of those later in this chapter under Unsupervised Learning.
- **Detection**: The goal here is not to necessarily understand historical behavior but to detect new or emerging anomalies, events, or patterns as they happen. A typical example is early outbreak detection for infectious diseases in order to inform public health officials.
- **Prediction**: The goal here is to use the same historical data as the description and detection methods, but use it to predict events and behaviors in the future.
- **Behavior Change (or Causal Inference)**: The goal here is to understand the causal relationships in the data in order to influence the outcomes we care about.
In this chapter, we will mostly focus on Description and Prediction methods, but there is a lot of ongoing work on developing and using machine learning methods for detection as well as for behavior change and causal inference.
The Machine Learning process
----------------------------
When solving problems using machine learning methods, it is important to
think of the larger data-driven problem-solving process of which these
methods are a small part^[See Chapter [Record Linkage](#chap:link).].
A typical machine learning problem requires
researchers and practitioners to take the following steps:
1. **Understand the problem and goal**: This sounds obvious but is often nontrivial. Problems typically
start as vague descriptions of a goal---improving health outcomes,
increasing graduation rates, understanding the effect of a variable
$X$ on an outcome $Y$, etc. It is really important to work with
people who understand the domain being studied to discuss and
define the problem more concretely. What is the analytical
formulation of the metric that you are trying to improve? The Data Science Project Scoping Guide^[http://www.datasciencepublicpolicy.org/home/resources/data-science-project-scoping-guide/] is a good place to start when doing problem scoping for social science or policy problems.
2. **Formulate it as a machine learning problem**: Is it a classification problem or a regression problem? Is the goal to build a model that generates a ranked list prioritized by risk of an outcome, or is it to detect anomalies as new data come in? Knowing what kinds of tasks machine learning can solve will allow you to map the problem you are working on to one or more machine learning settings and give you access to a suite of methods appropriate for that task.
3. **Data exploration and preparation**: Next, you need to carefully explore the data you have.
What additional data do you need or have access to? What variable will
you use to match records for integrating different data sources?
What variables exist in the data set? Are they continuous or
categorical? What about missing values? Can you use the variables in
their original form or do you need to alter them in some way?
4. **Feature engineering**: In machine learning language, what you might know as independent
variables or predictors or factors or covariates are called
"features." Creating good features is probably the most important
step in the machine learning process. This involves doing
transformations, creating interaction terms, or aggregating over
data points or over time and space.
5. **Modeling**: Having formulated the problem and created your features, you now
have a suite of methods to choose from. It would be great if there
were a single method that always worked best for a specific type of
problem, but that would make things too easy. Each method makes
different assumptions about the structure and distribution of the data,
and with large amounts of high-dimensional data^[Dimensionality of the
data often refers to how many variables we have in the data.], it is
difficult to know a priori which assumptions will best match the data
we have. Typically, in machine learning, you take a collection of
methods and try them out to empirically validate which one works the
best for your problem. This process not only helps you select the best
method for your problem but also helps you understand the structure of
your data. We will give an overview of leading methods that are being
used today in this chapter.
6. **Model Interpretation**: Once we have built the machine learning models, we also want to understand what they are, which predictors they found important (and how important), what types of entities they flagged as high risk (and why), and where they made errors. All of these fall under the umbrella of model interpretation, interpretability, and explainability, which is an active area of research in machine learning.
7. **Model Selection**: As you build a large number of possible models,
you need a way to select the model that is the “best”. This part of
the chapter will cover methodology to first test the models on
historical data as well as discuss a variety of evaluation metrics.
While this chapter will focus mostly on traditionally used metrics,
Chapter [Bias and Fairness](#chap:bias) will expand on this using bias
and fairness related metrics. It is important to note that sometimes
the machine learning literature will call this step the “validation”
step using historical data, but we want to distinguish it here from
validation, which is the next step.
8. **Model Validation**: The next step, after model selection using historical data, is validation. This involves validating the model on new data, as well as designing and running field trials or experiments.
9. **Deployment and Monitoring**: Once you have selected the best model and validated it using
historical data as well as a field trial, you are ready to put the
model into practice. You still have to keep in mind that new data
will be coming in, the world will be changing, and the model might also
(need to) change over time. We will not cover too much of those aspects in this
chapter, but they are important to keep in mind when putting the machine learning
work into practice.
<!--Although each step in this process is critical, this chapter will not focus on step 3 (data exploration) because we assume you already have some experience doing that, and on step 7 (deployment) because, although critical, it is outside the scope of the book. MALTE: There are more steps than 3 and 7 that we mostly ignore. -->
Although each step in this process is critical, a thorough description of each is out of scope. This chapter will focus on models, terms, and techniques that form the core of machine learning.
Problem formulation: Mapping a problem to machine learning methods
------------------------------------------------------------------
When working on a new problem, one of the first things we need to do is
to map it to a class of machine learning methods. In general, the
problems we will tackle, including the examples above, can be grouped
into two major categories:
1. **Supervised learning**: These are problems where there exists a target variable (continuous
or discrete) that we want to predict or classify data into.
Classification, prediction, and regression all fall into this
category. More formally, supervised learning methods predict a value
$Y$ given input(s) $X$ by learning (or estimating or fitting or
training) a function $F$, where $F(X) = Y$. Here, $X$ is the set of
variables (known as *features* in machine learning, or in other
fields as *predictors*) provided as input and $Y$ is the
target/dependent variable or a *label* (as it is known in machine
learning).
The goal of supervised learning methods is to search for that
function $F$ that best estimates or predicts $Y$. When the output $Y$ is
categorical, this is known as *classification*. When $Y$ is a
continuous value, this is called *regression*. Sound familiar?
One key distinction in machine learning is that the goal is not just
to find the best function $F$ that can estimate or predict $Y$ for observed
outcomes (known $Y$s) but to find one that best generalizes to new,
unseen data, often in the future. This distinction makes methods more focused on
generalization and less on just fitting the data we have as best as
we can. It is important to note that you do that implicitly when
performing regression by not adding more and more higher-order terms
to get better fit statistics. By getting better fit statistics, we
*overfit* to the data and the performance on new (unseen) data often
goes down. Methods like the lasso [@tibshirani1996regression]
penalize the model for having too many terms by performing what is
known as *regularization*^[In statistical terms, regularization is an attempt to avoid overfitting the model.].
2. **Unsupervised learning**: These are problems where there does not exist a target variable that
we want to predict but we want to understand "natural" groupings or
patterns in the data. Clustering is the most common example of this
type of analysis where you are given $X$ and want to group similar
$X$s together. You may have heard of “segmentation” that’s used in the marketing world to group similar customers together using clustering techniques. Principal components analysis (PCA) and related
methods also fall into the unsupervised learning category.
In between the two extremes of supervised and unsupervised learning,
there is a spectrum of methods that have different levels of supervision
involved (Figure
\@ref(fig:spectrum)). Supervision in this case is the presence
of target variables (known in machine learning as *labels*). In
unsupervised learning, none of the data points have labels. In
supervised learning, all data points have labels. In between, either the
percentage of examples with labels can vary or the types of labels can
vary. We do not cover the weakly supervised and semi-supervised methods
much in this chapter, but this is an active area of research in machine
learning. Zhu [-@zhu2005semi] provides more details.
```{r spectrum, out.width = '70%', fig.align = 'center', echo = FALSE, fig.cap = 'Spectrum of machine learning methods from unsupervised to supervised learning'}
knitr::include_graphics("ChapterML/figures/spectrum.png")
```
### Features
Before we get to models and methods, we need to turn our raw data into
“features”. In social science, they are not called features but instead
are known as variables or predictors (or covariates if you’re doing
regression).^[Good features are what makes machine learning systems
effective.] Feature generation (or engineering, as it is often
called) is where a large chunk of the time is spent in the machine learning
process. This is also the phase where previous research and learnings from
the domain being tackled can be incorporated into the machine learning
process. As social science researchers or practitioners, you have spent
a lot of time constructing features, using transformations, dummy
variables, and interaction terms. All of that is still required and
critical in the machine learning framework. One difference you will need
to get comfortable with is that instead of carefully selecting a few
predictors, machine learning systems tend to encourage the creation of
*lots* of features and then empirically use holdout data to perform
regularization and model selection. It is common to have models that are
trained on thousands of features. Of course, it is important to keep in
mind that increasing the number of features requires you to have enough
data so that you’re not overfitting. Commonly used approaches to create
features include:
- **Transformations**, such as log, square, and square root.
- **Dummy (binary) variables**: This is often done by taking categorical variables (such as city)
and creating a binary variable for each value (one variable for each
city in the data). These are also called indicator variables.
- **Discretization**: Several methods require features to be discrete instead of
continuous. Several approaches exist to convert continuous variables
into discrete ones, the most common of which is equal-width binning.
- **Aggregation**: Aggregate features often constitute the majority of features for a
given problem. These aggregations use different aggregation
functions (count, min, max, average, standard deviation, etc.),
often over varying windows of time and space. For example, given
urban data, we would want to calculate the number (and min, max,
mean, variance) of crimes within an $m$-mile radius of an address in
the past $t$ months for varying values of $m$ and $t$, and then to
use all of them as features in a classification problem. Spatiotemporal aggregation features are going to be extremely important as you build machine learning models.
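To make these ideas concrete, here is a minimal sketch in R of the four approaches above, applied to a small synthetic data set of service visits (all variable names are made up for illustration):
```{r ml-feature-sketch, eval = FALSE}
# A small synthetic data frame standing in for administrative data:
# one row per service visit, with person id, city, visit date, and cost.
set.seed(1)
visits <- data.frame(
  person_id  = sample(1:50, 200, replace = TRUE),
  city       = sample(c("Chicago", "Detroit", "Memphis"), 200, replace = TRUE),
  visit_date = as.Date("2018-01-01") + sample(0:730, 200, replace = TRUE),
  cost       = rlnorm(200, meanlog = 5)
)

# Transformation: log of a skewed variable
visits$log_cost <- log1p(visits$cost)

# Dummy (indicator) variables: one binary column per city
city_dummies <- model.matrix(~ city - 1, data = visits)

# Discretization: equal-width binning of a continuous variable
visits$cost_bin <- cut(visits$cost, breaks = 5)

# Aggregation: number of visits per person in the most recent 12 months
recent   <- subset(visits, visit_date >= as.Date("2019-01-01"))
n_visits <- aggregate(cost ~ person_id, data = recent, FUN = length)
names(n_visits)[2] <- "n_visits_last_year"
```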
In general, it is a good idea to have the complexity in features and use
a simple model, rather than using more complex models with simple
features. Keeping the model simple makes it faster to train and easier
to understand and explain.
<!--
% [ideally give a reference with list of features]
-->
Methods
-------
We will start by describing unsupervised learning methods and then go on
to supervised learning methods. We focus here on the intuition behind
the methods and the algorithm, as well as some practical tips, rather than on
the statistical theory that underlies the methods. We encourage readers
to refer to machine learning books listed in Section [Resources](#ml:res).
Box [Vocabulary](#box:ml3) gives brief definitions of several terms we
will use in this section.
---
**Box: Machine learning vocabulary** <a id="box:ml3"></a>
- **Learning**: In machine learning, you will notice the term *learning* that will
be used in the context of "learning" a model. This is what you
probably know as *fitting* or *estimating* a function, or *training*
or *building* a model. These terms are all synonyms and are used
interchangeably in the machine learning literature.
- **Examples**: These are data points, rows, observations, or instances.
- **Features**: These are independent variables, attributes, predictor variables,
and explanatory variables.
- **Labels**: These include the response variable, dependent variable, target
variable, or outcomes.
- **Underfitting**: This happens when a model is too simple and does not capture the
structure of the data well enough.
- **Overfitting**: This happens when a model is possibly too complex and models the
noise in the data, which can result in poor generalization
performance. Using in-sample measures for model selection can
lead to overfitting.
- **Regularization**: This is a general method to avoid overfitting by applying additional
constraints to the model that is learned. For example, in building logistic regression models, a common approach is to
make sure the model weights (coefficients) are, on average, small in magnitude. Two
common regularizations are $L_1$ regularization (used by the lasso),
which has a penalty term that encourages the sum of the absolute
values of the parameters to be small; and $L_2$ regularization,
which encourages the sum of the squares of the parameters to be
small.
---
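As an illustration of $L_1$ versus $L_2$ regularization, the sketch below fits penalized logistic regression models with the `glmnet` package, using the built-in `iris` data as a stand-in for a binary classification problem:
```{r ml-regularization-sketch, eval = FALSE}
library(glmnet)  # penalized (regularized) regression models

X <- as.matrix(iris[, 1:4])
y <- as.numeric(iris$Species == "versicolor")  # stand-in binary label

# alpha = 1 gives L1 (lasso) regularization; alpha = 0 gives L2 (ridge).
lasso_fit <- cv.glmnet(X, y, family = "binomial", alpha = 1)
ridge_fit <- cv.glmnet(X, y, family = "binomial", alpha = 0)

# Coefficients at the cross-validated penalty; the lasso tends to set
# some of them exactly to zero, the ridge only shrinks them.
coef(lasso_fit, s = "lambda.min")
coef(ridge_fit, s = "lambda.min")
```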
### Unsupervised learning methods
As mentioned earlier, unsupervised learning methods are used when we do
not have a target variable to estimate or predict but want to understand
clusters, groups, or patterns in the data. These methods are often used for
data exploration, as in the following examples:
1. When faced with a large corpus of text data---for example, email
records, congressional bills, speeches, or open-ended free-text
survey responses---unsupervised learning methods are often used to
understand and get a handle on the patterns in our data.
2. Given a data set about students and their behavior over time
(academic performance, grades, test scores, attendance, etc.), one
might want to understand typical behaviors as well as trajectories
of these behaviors over time. Unsupervised learning methods
(clustering) can be applied to these data to get student "segments"
with similar behavior.
3. Given a data set about publications or patents in different fields,
we can use unsupervised learning methods (association rules) to
figure out which disciplines have the most collaboration and which
fields have researchers who tend to publish across different fields.
4. Given a set of people who are at high risk of recidivism, clustering can be used to understand different groups of people within the high risk set, to determine intervention programs that may need to be created.
**Clustering**
Clustering is the most common unsupervised learning technique and is
used to group data points together that are similar to each other. The
goal of clustering methods is to produce clusters with high intra-cluster
(within) similarity and low inter-cluster (between) similarity.
Clustering algorithms typically require a distance (or similarity)
metric^[Distance metrics are mathematical formulas to
calculate the distance between two objects.
For example, *Manhattan distance* is the distance a
car would drive from one
place to another place in
a grid-based street system,
whereas *Euclidean distance*
(in two-dimensional space)
is the “straight-line” distance between two points.] to generate clusters. They take a data set and a distance metric
(and sometimes additional parameters), and they generate clusters based
on that distance metric. The most common distance metric used is
Euclidean distance, but other commonly used metrics are Manhattan,
Minkowski, Chebyshev, cosine, Hamming, Pearson, and Mahalanobis. Often,
domain-specific similarity metrics can be designed for use in specific
problems. For example, when performing the record linkage tasks
discussed in Chapter [Record Linkage](#chap:link), you can design a
similarity metric that compares
two first names and assigns them a high similarity (low distance) if
they both map to the same canonical name, so that, for example, Sammy
and Sam map to Samuel.
Most clustering algorithms also require the user to specify the number
of clusters (or some other parameter that indirectly determines the
number of clusters) in advance as a parameter. This is often difficult
to do a priori and typically makes clustering an iterative and
interactive task. Another aspect of clustering that makes it interactive
is often the difficulty in automatically evaluating the quality of the
clusters. While various analytical clustering metrics have been
developed, the best clustering is task-dependent and thus must be
evaluated by the user. There may be different clusterings that can be
generated with the same data. You can imagine clustering similar news
stories based on the topic content, based on the writing style or based
on sentiment. The right set of clusters depends on the user and the task
they have. Clustering is therefore typically used for exploring the
data, generating clusters, exploring the clusters, and then rerunning
the clustering method with different parameters or modifying the
clusters (by splitting or merging the previous set of clusters).
Interpreting a cluster can be nontrivial: you can look at the centroid
of a cluster, look at frequency distributions of different features (and
compare them to the prior distribution of each feature), or you can
build a decision tree (a supervised learning method we will cover later
in this chapter) where the target variable is the cluster ID, which can
describe the cluster using the features in your data. A good example of
a tool that allows interactive clustering from text data is Ontogen
[@Ontogen].
**$k$-means clustering**
The most commonly used clustering algorithm is called $k$-means, where
$k$ defines the number of clusters. The algorithm works as follows:
1. Select $k$ (the number of clusters you want to generate).
2. Initialize by selecting $k$ points as centroids of the $k$ clusters.
This is typically done by selecting $k$ points uniformly at random.
3. Assign each point a cluster according to the nearest centroid.
4. Recalculate cluster centroids based on the assignment in (3) as the
mean of all data points belonging to that cluster.
5. Repeat (3) and (4) until convergence.
The algorithm stops when the assignments do not change from one
iteration to the next (Figure
\@ref(fig:kmeans)). The final set of clusters, however, depends
on the starting points. If the centroids are initialized differently,
different clusters may be obtained. One common practical
trick is to run $k$-means several times, each with different (random)
starting points. The $k$-means algorithm is fast, simple, and easy to
use, and is often a good first clustering algorithm to try and see if it
fits your needs. When the data are of the form where the mean of the
data points cannot be computed, a related method called $K$-medoids can
be used [@park2009simple].
```{r kmeans, out.width = '70%', fig.align = 'center', echo = FALSE, fig.cap = 'Example of $k$-means clustering with $k = 3$. The upper left panel shows the distribution of the data and the three starting points $m_1$, $m_2$, $m_3$ placed at random. On the upper right we see what happens in the first iteration. The cluster means move to more central positions in their respective clusters. The lower left panel shows the second iteration. After six iterations the cluster means have converged to their final destinations and the result is shown in the lower right panel'}
knitr::include_graphics("ChapterML/figures/kmeans.png")
```
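A minimal $k$-means sketch in R, using the base `kmeans()` function and the built-in `iris` measurements as stand-in data, looks like this:
```{r ml-kmeans-sketch, eval = FALSE}
# Rescale the features first so no single variable dominates the
# Euclidean distance; nstart reruns the algorithm from several random
# starting points and keeps the best solution.
X <- scale(iris[, 1:4])

set.seed(1)
fit <- kmeans(X, centers = 3, nstart = 25)

table(fit$cluster)  # cluster sizes
fit$centers         # centroids (in the scaled feature space)
```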
**Expectation-maximization (EM) clustering**
You may be familiar with the EM algorithm in the context of imputing
missing data. EM is a general approach to maximum likelihood in the
presence of incomplete data. However, it is also used as a clustering
method where the missing data are the clusters a data point belongs to.
Unlike $k$-means, where each data point gets assigned to only one
cluster, EM does a soft assignment where each data point gets a
probabilistic assignment to various clusters. The EM algorithm iterates
until the estimates converge to some (locally) optimal solution.
The EM algorithm is fairly good at dealing with outliers as well as
high-dimensional data, compared to $k$-means. It also has a few
limitations. First, it does not work well with a large number of
clusters or when a cluster contains few examples. Also, when the value
of $k$ is larger than the number of actual clusters in the data, EM may
not give reasonable results.
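One accessible way to try EM-based clustering in R is through the `mclust` package, which fits Gaussian mixture models by EM; the sketch below again uses the `iris` measurements as stand-in data:
```{r ml-em-sketch, eval = FALSE}
library(mclust)  # Gaussian mixture models fit by EM

fit <- Mclust(iris[, 1:4], G = 3)  # fit a 3-component mixture by EM

table(fit$classification)  # hard assignment: most likely cluster per point
head(round(fit$z, 2))      # soft assignment: probability of each cluster
```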
**Mean shift clustering**
Mean shift clustering works by finding dense regions in the data by
defining a window around each data point and computing the mean of the
data points in the window. Then it shifts the center of the window to
the mean and repeats the algorithm till it converges. After each
iteration, we can consider that the window shifts to a denser region of
the data set. The algorithm proceeds as follows:
1. Fix a window around each data point (based on the bandwidth
parameter that defines the size of the window).
2. Compute the mean of data within the window.
3. Shift the window to the mean and repeat till convergence.
Mean shift needs a bandwidth parameter $h$ to be tuned, which influences
the convergence rate and the number of clusters. A large $h$ might
result in merging distinct clusters. A small $h$ might result in too
many clusters. Mean shift might not work well in higher dimensions since
the number of local maxima is pretty high and it might converge to a
local optimum quickly.
One of the most important differences between mean shift and $k$-means
is that $k$-means makes two broad assumptions: the number of clusters is
already known and the clusters are shaped spherically (or elliptically).
Mean shift does not assume anything about the number of clusters (but
the value of $h$ indirectly determines that). Also, it can handle
arbitrarily shaped clusters.
The $k$-means algorithm is also sensitive to initializations, whereas
mean shift is fairly robust to initializations. Typically, mean shift is
run for each point, or sometimes points are selected uniformly randomly.
Similarly, $k$-means is sensitive to outliers, while mean shift is less
sensitive. On the other hand, the benefits of mean shift come at a
cost---speed. The $k$-means procedure is fast, whereas classic mean
shift is computationally slow but can be easily parallelized.
**Hierarchical clustering**
The clustering methods that we have seen so far, often termed
*partitioning* methods, produce a flat set of clusters with no
hierarchy. Sometimes, we want to generate a hierarchy of clusters, and
methods that can do that are of two types:
1. **Agglomerative (bottom-up)**: Start with each point as its own cluster and iteratively merge the
closest clusters. The iterations stop either when the clusters are
too far apart to be merged (based on a predefined distance
criterion) or when there is a sufficient number of clusters (based
on a predefined threshold).
2. **Divisive (top-down)**: Start with one cluster and create splits recursively.
Typically, agglomerative clustering is used more often than divisive
clustering. One reason is that it is significantly faster, although both
of them are typically slower than direct partitioning methods such as
$k$-means and EM. Another disadvantage of these methods is that they are
*greedy*, that is, a data point that is incorrectly assigned to the
"wrong" cluster in an earlier split or merge cannot be reassigned again
later on.
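A short agglomerative clustering sketch in R, using base `hclust()` and `cutree()` on stand-in data:
```{r ml-hierarchical-sketch, eval = FALSE}
# Bottom-up hierarchical clustering on the iris measurements.
d  <- dist(scale(iris[, 1:4]))        # pairwise Euclidean distances
hc <- hclust(d, method = "ward.D2")   # iteratively merge closest clusters

plot(hc, labels = FALSE)              # dendrogram of the full hierarchy
clusters <- cutree(hc, k = 3)         # cut the tree into 3 flat clusters
table(clusters)
```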
**Spectral clustering**
Figure \@ref(fig:spectral) shows the clusters that $k$-means would
generate on the data set in the figure. It is obvious that the clusters
produced are not the clusters you would want, and that is one drawback
of methods such as $k$-means. Two points that are far away from each
other will be put in different clusters even if there are other data
points that create a "path" between them. Spectral clustering fixes that
problem by clustering data that are connected but not necessarily (what
is called) compact or clustered within convex boundaries. Spectral
clustering methods work by representing data as a graph (or network),
where data points are nodes in the graph and the edges (connections
between nodes) represent the similarity between the two data points.
```{r spectral, out.width = '70%', fig.align = 'center', echo = FALSE, fig.cap = 'The same data set can produce drastically different clusters: (a) k-means; (b) spectral clustering'}
knitr::include_graphics("ChapterML/figures/spectral.png")
```
The algorithm works as follows:
1. Compute a similarity matrix from the data. This involves determining
a pairwise distance function (using one of the distance functions we
described earlier).
2. With this matrix, we can now perform graph partitioning, where
connected graph components are interpreted as clusters. The graph
must be partitioned such that edges connecting different clusters
have low weights and edges within the same cluster have high weights.
3. We can now partition these data represented by the similarity matrix
in a variety of ways. One common way is to use the normalized cuts
method. Another way is to compute a graph Laplacian from the
similarity matrix.
4. Compute the eigenvectors and eigenvalues of the Laplacian.
5. The first $k$ eigenvectors are used as proxy data for the original data
set, and they are fed into $k$-means clustering to produce cluster
assignments for each original data point.
Spectral clustering is in general much better than $k$-means in
clustering performance but much slower to run in practice. For
large-scale problems, $k$-means is a preferred clustering algorithm to
run because of efficiency and speed.
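One implementation to experiment with is `specc()` in the `kernlab` package; the sketch below uses the package's `spirals` data, a classic example where $k$-means struggles:
```{r ml-spectral-sketch, eval = FALSE}
library(kernlab)  # provides specc() for spectral clustering

# specc() builds a kernel (similarity) matrix, forms the graph
# Laplacian, and runs k-means on its leading eigenvectors.
data(spirals)
sc <- specc(spirals, centers = 2)

plot(spirals, col = sc)  # color the points by their cluster assignment
```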
**Principal components analysis**
Principal components analysis is another unsupervised method used for
finding patterns and structure in data. In contrast to clustering
methods, the output is not a set of clusters but a set of *principal
components* that are linear combinations of the original variables. PCA
is typically used when you have a large number of variables and you want
a reduced number that you can analyze. This approach is often called
*dimensionality reduction*. It generates linearly uncorrelated
dimensions that can be used to understand the underlying structure of
the data. In mathematical terms, given a set of data on $n$ dimensions,
PCA aims to find a linear subspace of dimension $d$ lower than $n$ such
that the data points lie mainly on this linear subspace.
PCA is related to several other methods you may already know about.
Multidimensional scaling, factor analysis, and independent component
analysis differ from PCA in the assumptions they make, but they are
often used for similar purposes of dimensionality reduction and
discovering the underlying structure in a data set.
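A minimal PCA sketch in R with the base `prcomp()` function, using the `iris` measurements as stand-in data:
```{r ml-pca-sketch, eval = FALSE}
# Center and scale the variables so the components are not dominated
# by whichever variable happens to have the largest variance.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

summary(pca)        # proportion of variance explained per component
head(pca$x[, 1:2])  # data projected onto the first two components
```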
**Association rules**
Association rules are a different type of analysis method and originate
from the data mining and database community, primarily focused on
finding frequent co-occurring associations among a collection of items.
This method is sometimes referred to as "market basket analysis," since
that was the original application area of association rules. The goal is
to find associations of items that occur together more often than you
would randomly expect. The classic example (probably a myth) is "men who
go to the store to buy diapers will also tend to buy beer at the same
time." This type of analysis would be performed by applying association
rules to a set of supermarket purchase data. For social scientists, this method can be used on data that contains social services that individuals have received in the past to determine what types of services “co-occur” in people and proactively offer those services to people in need.
Association rules take the form $X_1, X_2, X_3 \Rightarrow Y$ with
support $S$ and confidence $C$, meaning that when a transaction
contains the items $\{X_1, X_2, X_3\}$, it also contains item $Y$
$C$% of the time, and at least $S$% of transactions contain the
antecedent. This is useful in cases where we want to find
patterns that are both *frequent* and *statistically significant*, by
specifying thresholds for support $S$ and confidence $C$.
Support and confidence are useful metrics to generate rules but are
often not enough. Another important metric used to generate rules (or
reduce the number of spurious patterns generated) is *lift*. Lift is
simply estimated by the ratio of the joint probability of two items, $x$
and $y$, to the product of their individual probabilities:
$P(x,y)/[P(x)P(y)]$. If the two items are statistically independent,
then $P(x,y)=P(x)P(y)$, corresponding to a lift of $1$. Note that
anti-correlation yields lift values less than 1, which is also an
interesting pattern, corresponding to mutually exclusive items that
rarely occur together.
Association rule algorithms work as follows: Given a set of transactions
(rows) and items for that transaction:
1. Find all combinations of items in a set of transactions that occur
with a specified minimum frequency. These combinations are called
*frequent itemsets*.
2. Generate association rules that express co-occurrence of items
within frequent itemsets.
For our purposes, association rule methods are an efficient way to take
a *basket* of features (e.g., areas of publication of a researcher,
different organizations an individual has worked at in their career, all
the cities or neighborhoods someone may have lived in) and find
co-occurrence patterns. This may sound trivial, but as data sets and
number of features get larger, it becomes computationally expensive and
association rule mining algorithms provide a fast and efficient way of
doing it.
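The `arules` package implements this workflow in R; the sketch below mines rules from a small made-up set of service "baskets":
```{r ml-arules-sketch, eval = FALSE}
library(arules)  # frequent itemsets and association rules

# Each element of this (made-up) list is one person's basket of services.
baskets <- list(
  c("food_assistance", "job_training"),
  c("food_assistance", "housing_support", "job_training"),
  c("housing_support", "job_training"),
  c("food_assistance", "job_training"),
  c("food_assistance", "housing_support")
)
trans <- as(baskets, "transactions")

# Mine rules that meet minimum support and confidence thresholds.
rules <- apriori(trans, parameter = list(supp = 0.4, conf = 0.7, minlen = 2))

inspect(sort(rules, by = "lift"))  # highest-lift rules first
```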
### Supervised learning {#sec:MLchapter:super}
We now turn to the problem of supervised learning, which typically
involves methods for classification, prediction, and regression. We will
mostly focus on classification methods in this chapter since many of the
regression methods in machine learning are fairly similar to methods
with which you are already familiar. Remember that classification means
predicting a discrete (or categorical) variable. Most of the
classification methods that we will cover can also be used for
regression (predicting continuous outcomes).
In general, supervised learning methods take as input pairs of data
points $(X,Y)$ where $X$ are the predictor variables (features) and $Y$
is the target variable (label). The supervised learning method then uses
these pairs as *training data* and *learns* a model $F$, where
$F(X)\sim Y$. This model $F$ is then used to predict $Y$s for new data
points $X$. As mentioned earlier, the goal is not to build a model that
best fits known data but a model that is useful for future predictions
and minimizes future generalization error. This is the key goal that
differentiates many of the methods that you know from the methods that
we will describe next. In order to minimize future error, we want to
build models that are not just *overfitting* on past data.
Another goal, often prioritized in the social sciences, that machine
learning methods do not optimize for is getting a structural form of the
model. Machine learning models for classification can take different
structural forms (ranging from linear models, to sets of rules, to more
complex non-linear forms), and it may not always be possible to write them down in
a compact form as an equation. This does not, however, make them
incomprehensible or uninterpretable. Another focus of machine learning
models for supervised learning is prediction, and not necessarily causal inference^[The topic of causal inference is addressed in more
detail in Chapter [Data Quality and Inference Errors](#chap:errors).].
Some of these models can be used to help with causal inference, but they
are typically optimized for prediction tasks. We believe that there are
many social science and policy problems where better prediction methods
can be extremely beneficial [@Kleinberg2015]. <!-- reference already exists in section 7.7 -->
In this chapter, we mostly deal with binary classification problems:
that is, problems in which the data points are to be classified into one
of two categories. Several of the methods that we will cover can also be
used for multiclass classification (classifying a data point into one of
$n$ categories) or for multi-label classification (classifying a data
point into $m$ of $n$ categories where $m\ge1$). There are also
approaches to take multiclass problems and turn them into a set of
binary problems that we will mention briefly in the next section.
Before we describe supervised learning methods, we want to recap a few
principles as well as terms that we have used and will be using in the
rest of the chapter.
**Training a model**
Once we have finished data exploration, filled in missing values,
created predictor variables (features), and decided what our target
variable (label) is, we now have pairs of $X,Y$ to start training (or
building) the model.
**Using the model to score new data**
We are building this model so we can predict $Y$ for a new set of
$X$s---using the model means getting new data, generating the same
features to get the vector $X$, and then applying the model to produce
$Y$.
One common technique for supervised learning is logistic regression, a
method you will already be familiar with. We will give an overview of
some of the other methods used in machine learning. It is important to
remember that as you use increasingly powerful classification methods,
you need more data to *train* the models.
**$k$-nearest neighbor**
The method $k$-nearest neighbor ($k$-NN) is one of the simpler
classification methods in machine learning. It belongs to a family of
models sometimes known as *memory-based models* or *instance-based
models*. An example is classified by finding its $k$ nearest neighbors
and taking majority vote (or some other aggregation function). We need
two key things: a value for $k$ and a distance metric with which to find
the $k$ nearest neighbors. Typically, different values of $k$ are used
to empirically find the best one. Small values of $k$ lead to
predictions having high variance but can capture the local structure of
the data. Larger values of $k$ build more global models that are lower
in variance but may not capture local structure in the data as well.
Figure \@ref(fig:knn) provides an example for $k = 1, 3, 5$ nearest
neighbors. The number of neighbors ($k$) is a parameter, and the
prediction depends heavily on how it is determined. In this example,
point B is classified differently if $k = 3$.
```{r knn, out.width = '70%', fig.align = 'center', echo = FALSE, fig.cap = 'Example of $k$-nearest neighbor with $k = 1, 3, 5$ neighbors. We want to predict the points A and B. The 1-nearest neighbor for both points is red ("Patent not granted"), the 3-nearest neighbor predicts point A (B) to be red (green) with probability 2/3, and the 5-nearest neighbor predicts again both points to be red with probabilities 4/5 and 3/5, respectively.'}
knitr::include_graphics("ChapterML/figures/knn.png")
```
Training for $k$-NN just means storing the data, making this method
useful in applications where data are coming in extremely quickly and a
model needs to be updated frequently. All the work, however, gets pushed
to scoring time, since all the distance calculations happen when a new
data point needs to be classified. There are several optimized methods
designed to make $k$-NN more efficient that are worth looking into if
that is a situation that is applicable to your problem.
In addition to selecting $k$ and an appropriate distance metric, we also
have to be careful about the scaling of the features. When distances
between two data points are large for one feature and small for a
different feature, the method will rely almost exclusively on the first
feature to find the closest points. The smaller distances on the second
feature are nearly irrelevant to calculate the overall distance. A
similar problem occurs when continuous and categorical predictors are
used together. To resolve the scaling issues, various options for
rescaling exist. For example, a common approach is to center all
features at mean $0$ and scale them to variance $1$.
There are several variations of $k$-NN. One of these is weighted nearest
neighbors, where different features are weighted differently or
different examples are weighted based on the distance from the example
being classified. The method $k$-NN also has issues when the data are
sparse and have high dimensionality, which means that every point is far
away from virtually every other point, and hence pairwise distances tend
to be uninformative. This can also happen when a lot of features are
irrelevant and drown out the relevant features' signal in the distance
calculations.
Notice that the nearest-neighbor method can easily be applied to
regression problems with a real-valued target variable. In fact, the
method is completely oblivious to the type of target variable and can
potentially be used to predict text documents, images, and videos, based
on the aggregation function after the nearest neighbors are found.
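A minimal $k$-NN sketch in R, using `knn()` from the `class` package on the `iris` data as a stand-in, with features scaled by the training set statistics:
```{r ml-knn-sketch, eval = FALSE}
library(class)  # provides knn()

set.seed(1)
train_idx <- sample(nrow(iris), 100)

# Scale the test set with the *training* means and standard deviations
# so no single feature dominates the distance calculation.
X_train <- scale(iris[train_idx, 1:4])
X_test  <- scale(iris[-train_idx, 1:4],
                 center = attr(X_train, "scaled:center"),
                 scale  = attr(X_train, "scaled:scale"))

pred <- knn(train = X_train, test = X_test,
            cl = iris$Species[train_idx], k = 5, prob = TRUE)

table(predicted = pred, actual = iris$Species[-train_idx])
```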
**Support vector machines**
Support vector machines are one of the most popular and best-performing
classification methods in machine learning today. The mathematics behind
SVMs has a lot of prerequisites that are beyond the scope of this book,
but we will give you an intuition of how SVMs work, what they are good
for, and how to use them.
We are all familiar with linear models (e.g., logistic regression) that separate two classes by
fitting a line in two dimensions (or a hyperplane in higher dimensions)
in the middle (see Figure \@ref(fig:svm)). An important decision that linear models have
to make is which linear separator we should prefer when there are
several we can build.
```{r svm, out.width = '100%', fig.align = 'center', echo = FALSE, fig.cap = 'Support vector machines'}
knitr::include_graphics("ChapterML/figures/svm.png")
```
You can see in Figure \@ref(fig:svm) that multiple lines offer a solution to the
problem. Is any of them better than the others? We can intuitively
define a criterion to estimate the worth of the lines: A line is bad if
it passes too close to the points because it will be noise sensitive and
it will not generalize correctly. Therefore, our goal should be to find
the line passing as far as possible from all points.
The SVM algorithm is based on finding the hyperplane that maximizes the
*margin* of the training data. The training examples that are closest to
the hyperplane are called *support vectors* since they are *supporting*
the margin (as the margin is only a function of the support vectors).
An important concept to learn when working with SVMs is *kernels*. SVMs
are a specific instance of a class of methods called *kernel methods*.
So far, we have only talked about SVMs as linear models. Linear works
well in high-dimensional data but sometimes you need nonlinear models,
often in cases of low-dimensional data or in image or video data.
Unfortunately, traditional ways of generating nonlinear models get
computationally expensive since you have to explicitly generate all the
features such as squares, cubes, and all the interactions. Kernels are a
way to keep the efficiency of the linear machinery but still build
models that can capture nonlinearity in the data without creating all
the nonlinear features.
You can essentially think of kernels as similarity functions and use
them to create a linear separation of the data by (implicitly) mapping
the data to a higher-dimensional space. Essentially, we take an
$n$-dimensional input vector $X$, map it into a high-dimensional (infinite-dimensional)
feature space, and construct an optimal separating
hyperplane in this space. We refer you to relevant papers for more
detail on SVMs and nonlinear kernels [@ShaweTaylor2004; @Scholkopf2001].
SVMs are also related to logistic regression, but use a different
loss/penalty function [@HastieTibshirani].
When using SVMs, there are several parameters you have to optimize,
ranging from the *regularization* parameter $C$, which determines the
tradeoff between minimizing the training error and minimizing model
complexity, to more kernel-specific parameters. It is often a good idea
to do a grid search to find the optimal parameters. Another tip when
using SVMs is to normalize the features; one common approach to doing
that is to normalize each data point to be a vector of unit length.
Linear SVMs are effective in high-dimensional spaces, especially when
the space is sparse such as text classification where the number of data
points (perhaps tens of thousands) is often much less than the number of
features (a hundred thousand to a million or more). SVMs are also fairly
robust when the number of irrelevant features is large (unlike the
$k$-NN approaches that we mentioned earlier) as well as when the class
distribution is skewed, that is, when the class of interest is
significantly less than 50% of the data.
One disadvantage of SVMs is that they do not directly provide
probability estimates. They assign a score based on the distance from
the margin. The farther a point is from the margin, the higher the
magnitude of the score. This score is good for ranking examples, but
getting accurate probability estimates takes more work and requires more
labeled data to be used to perform probability calibrations.
In addition to classification, there are also variations of SVMs that
can be used for regression [@SmolaRegression04] and ranking
[@Chapelle2010].
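A short SVM sketch in R, using the `e1071` package (an interface to LIBSVM) and the `iris` data as a stand-in, including a small grid search over the regularization parameter:
```{r ml-svm-sketch, eval = FALSE}
library(e1071)  # svm() and tune()

# cost is the regularization parameter C discussed above.
fit <- svm(Species ~ ., data = iris, kernel = "linear",
           cost = 1, probability = TRUE)

# Grid search over cost using cross-validation.
tuned <- tune(svm, Species ~ ., data = iris,
              ranges = list(cost = c(0.01, 0.1, 1, 10, 100)))
summary(tuned)

# Scores and (calibrated) class probabilities for new data.
pred <- predict(fit, newdata = iris, probability = TRUE)
head(attr(pred, "probabilities"))
```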
**Decision trees**
Decision trees are yet another set of methods that are helpful for
prediction. Typical decision trees learn a set of rules from training
data represented as a tree. An exemplary decision tree is shown in
Figure \@ref(fig:tree). Each level of a tree *splits* the tree to
create a branch using a feature and a value (or range of values). In the
example tree, the first split is made on the feature *number of visits
in the past year* and the value $4$. The second level of the tree now
has two splits: one using *average length of visit* with value $2$ days
and the other using the value $10$ days.
```{r tree_r, out.width = '70%', fig.align = 'center', echo = FALSE}
knitr::include_graphics("ChapterML/figures/tree.png")
```
```{r tree, out.width = '70%', fig.align = 'center', echo = FALSE, fig.cap = 'An exemplary decision tree. The top figure is the standard representation for trees. The bottom figure offers an alternative view of the same tree. The feature space is partitioned into numerous rectangles, which is another way to view a tree, representing its nonlinear character more explicitly'}
knitr::include_graphics("ChapterML/figures/tree-rectangle.png")
```
Various algorithms exist to build decision trees. C4.5, CHAID, and CART
(Classification and Regression Trees) are the most popular. Each needs to
determine the next best feature to split on. The goal is to find feature
splits that can best reduce class impurity in the data, that is, a split
that will ideally put all (or as many as possible) positive class
examples on one side and all (or as many as possible) negative examples
on the other side. One common measure of impurity that comes from
information theory is *entropy*, and it is calculated as
$$H(X) = -\sum_x p(x) \log p(x).$$
Entropy is maximal (1 for two classes, using a base-2 logarithm) when both classes have equal numbers of examples
in a node. It is minimum (0) when all examples are from the same class.
At each node in the tree, we can evaluate all the possible features and
select the one that most reduces the entropy given the tree so far. This
expected change in entropy is known as *information gain* and is one of
the most common criteria used to create decision trees. Other measures
that are used instead of information gain are Gini and chi-squared.
If we keep constructing the tree in this manner, selecting the next best
feature to split on, the tree ends up fairly deep and tends to overfit
the data. To prevent overfitting, we can either have a stopping
criterion or *prune* the tree after it is fully grown. Common stopping
criteria include minimum number of data points to have before doing
another feature split, maximum depth, and maximum purity. Typical
pruning approaches use holdout data (or cross-validation, which will be
discussed later in this chapter) to cut off parts of the tree.
Once the tree is built, a new data point is classified by running it
through the tree and, once it reaches a terminal node, using some
aggregation function to give a prediction (classification or
regression). Typical approaches include performing maximum likelihood
(if the leaf node contains 10 examples, 8 positive and 2 negative, any
data point that gets into that node will get an 80% probability of being
positive). Trees used for regression often build the tree as described
above but then fit a linear regression model at each leaf node.
Decision trees have several advantages. The interpretation of a tree is
straightforward as long as the tree is not too large. Trees can be
turned into a set of rules that experts in a particular domain can
possibly dig deeper into, validate, and modify. Trees also do not
require too much feature engineering. There is no need to create
interaction terms since trees can implicitly do that by splitting on two
features, one after another.
Unfortunately, along with these benefits come a set of disadvantages.
Decision trees, in general, do not perform well, compared to SVMs,
random forests, or logistic regression. They are also unstable: small
changes in data can result in very different trees. The lack of
stability comes from the fact that small changes in the training data
may lead to different splitting points. As a consequence, the whole tree
may take a different structure. The suboptimal predictive performance
can be seen from the fact that trees partition the predictor space into
a few rectangular regions, each one predicting only a single value (see
the bottom part of Figure \@ref(fig:tree)).
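A minimal decision tree sketch in R, using the `rpart` package (a CART implementation) on the `iris` data as a stand-in, with simple stopping criteria:
```{r ml-tree-sketch, eval = FALSE}
library(rpart)  # CART-style classification and regression trees

# minsplit and maxdepth act as stopping criteria; cp controls how much
# a split must improve the fit to be kept (used for pruning).
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(minsplit = 20, maxdepth = 5, cp = 0.01))

print(fit)                                            # the learned splits as rules
printcp(fit)                                          # complexity table used for pruning
pred <- predict(fit, newdata = iris, type = "prob")   # leaf-node class shares
head(pred)
```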
**Ensemble methods**
Combinations of models are generally known as model ensembles. They are
among the most powerful techniques in machine learning, often
outperforming other methods, although at the cost of increased
algorithmic and model complexity.