Databases {#chap:db}
=========
**Ian Foster and Pascal Heus**
Once data have been collected and linked, it is
necessary to store and organize them. Many social scientists are used to
working with one analytical file, often in SAS, Stata, SPSS, or R. But most organizations store (or should store) their data in databases, which makes it critical for social scientists to learn how to create, manage, and use databases for data storage and analysis. This chapter describes the concept of databases and introduces different types of databases and analysis languages (in particular, relational databases and SQL, respectively) that allow data to be stored and organized for rapid and efficient exploration and analysis.
Introduction {#sec:db:intro}
------------
We turn now to the question of how to store, organize, and manage the
data used in data-intensive social science. As the data with which you
work grow in volume and diversity, effective data management becomes
increasingly important to prevent scale and complexity from overwhelming your research processes. In particular,
when you deal with data that are frequently updated, with changes made
by different people, you will want to use database management
systems (DBMSs) instead of maintaining data in text files or within
siloed statistical packages such as SAS, SPSS, Stata, and R. Indeed, we
go so far as to say: if you take away *just one thing* from this book (or at least from this chapter), it should be this: *Use a database!*
As we explain in this chapter, DBMSs greatly
simplify data management. They require a little bit
of effort to set up, but are worth it. They permit large amounts of
data to be organized in multiple ways that allow for efficient and
rapid exploration via query languages; durable and reliable storage that maintains data consistency; scaling to large data sizes; and intuitive
analysis, both within the DBMS itself and via connectors to other data
analysis packages and tools. DBMSs have become a critical component of
most real-world systems, from handling transactions in financial systems
to delivering data to power websites, dashboards, and software that we
use every day. If you are using a production-level enterprise system,
chances are there is a database in the back-end. They are multi-purpose
and well suited for organizing social science data and for supporting
data exploration and analysis.
DBMSs make many easy things trivial, and many hard things easy. They
are easy to use but can appear daunting at first. A basic understanding
of databases and of when and how to use DBMSs is an important element of
the social data scientist's knowledge base. We therefore provide in this
chapter an introduction to databases and how to use them. We describe
different types of databases and their various features, and how they
can be used in different contexts. We describe basic tasks: how to get started, set up a database schema, ingest data, query and analyze data within a database, and get results out. We also
discuss how to link from databases to other tools, such as Python, R,
and (if you have to) Stata.
DBMS: When and why {#sec:db:when}
------------------
Consider the following three data sets:
1. 10,000 records describing research grants, each specifying the
principal investigator, institution, research area, proposal title,
award date, and funding amount in a comma-separated-value (CSV)
format.
2. 10 million records in a variety of formats from funding agencies,
web APIs, and institutional sources describing people, grants,
funding agencies, and patents.
3. 10 billion Twitter messages and associated metadata---around 10
terabytes ($10^{13}$ bytes) in total, and increasing at a terabyte a
month.
Which tools should you use to manage and analyze these data sets? The
answer depends on the specifics of the data, the analyses that you want
to perform, and the life cycle within which data and analyses are
embedded. Table \@ref(tab:table4-1) summarizes relevant factors, which we now
discuss.
Table: (\#tab:table4-1) When to use different data management and analysis technologies
|**When to use different data management and analysis technologies**|
|-----------------------------------------------------------------|
| **Text files, spreadsheets, and scripting languages** |
| • Your data are small |
| • Your analysis is simple |
| • You do not expect to repeat analyses over time |
| **Statistical packages** |
| • Your data are modest in size |
| • Your analysis maps well to your chosen statistical package |
| **Relational database** |
| • Your data are structured |
| • Your data are large |
| • You will be analyzing changed versions of your data over time |
| • You want to share your data and analyses with others |
| **NoSQL database** |
| • Your data are unstructured |
| • Your data are extremely large |
| • Your analysis will happen mostly outside the database in a programming language |
<br>
In the case of data set 1 (10,000 records describing research grants),
it may be feasible to leave the data in their original file and to use spreadsheets, pivot tables, or programs written in
**scripting languages**^[A scripting language is a programming language
used to automate tasks that could otherwise be performed one by one by the
user.] such as Python or R to
analyze the data in those files. For example, someone familiar with such
languages can quickly create a script to extract from data set 1 all
grants awarded to one investigator, compute average grant size, and
count grants made each year in different areas.
However, this approach also has disadvantages. Scripts do not provide
inherent control over the file structure. This means that if you obtain
new data in a different format, your scripts need to be updated. You
cannot just run them over the newly acquired file. Scripts can also
easily become unreasonably slow as data volumes grow. A Python or R
script will not take long to search a list of 1,000 grants to find those
that pertain to a particular institution. But what if you have
information about 1 million grants, and for each grant you want to
search a list of 100,000 investigators, and for each investigator, you
want to search a list of 10 million papers to see whether that
investigator is listed as an author of each paper? You now have
$1{,}000{,}000 \times 100{,}000 \times 10{,}000{,}000 = 10^{18}$
comparisons to perform. Your simple script may now run for hours or even
days. You can speed up the search process by constructing indices, so
that, for example, when given a grant, you can find the associated
investigators in constant time rather than in time proportional to the
number of investigators. However, the construction of such indices is
itself a time-consuming and error-prone process.
For these reasons, the use of scripting languages alone for data
analysis is rarely to be recommended. This is not to say that all
analysis computations can be performed in database systems. A
programming language will also often be needed. But many data access and
manipulation computations are best handled in a database.
Researchers in the social sciences frequently use statistical packages
such as R, SAS, SPSS, and Stata for data analysis. Because these systems
integrate some
data management, statistical analysis, and graphics capabilities in a
single package, a researcher can often carry out a data analysis project
of modest size within the same environment. However, each of these
systems has limitations that hinder its use for modern social science
research, especially as data grow in size and complexity.
Take Stata, for example. Stata loads the entire data set into the
computer's working memory, and thus you would have no problems loading
data set 1. However, depending on your computer's memory, it could have
problems dealing with data set 2 and most likely would not be able to
handle data set 3. In addition, you would need to perform this data
loading step each time you start working on the project, and your
analyses would be limited to what Stata can do. SAS can deal with larger
data sets, but is renowned for being hard to learn and use. Of course
there are workarounds in statistical packages. For example, in Stata you
can deal with larger file sizes by choosing to only load the variables
or cases that you need for the analysis [@kohler2012datenanalyse].
Likewise, you can deal with more complex data by creating a system of
files that each can be linked as needed for a particular analysis
through a common identifier variable.^[For example, the Panel
Study of Income Dynamics (http://psidonline.isr.umich.edu)
has a series of files that are related and can be
combined through common identifier variables [@PSIDguide].]
The ad hoc approaches to problems of scale mentioned in the preceding
paragraph are provided as core functions in most DBMSs, and thus rather
than implement such inevitably limited workarounds, you will usually
be well advised to set up a database. A database is especially valuable
if you find
yourself in a situation where the data set is constantly updated by
different users, if groups of users have different rights to use your
data or should only have access to subsets of the data, or if the
analysis takes place on a server that sends results to a client
(browser). Some statistics packages also have difficulty working with more
than one data source at a time---something that DBMSs are designed to do
well.
Alas, databases are not perfectly suited for every need. For example, in social science research the reproducibility of our analysis is critical, and hence so is versioning of the data used for that analysis. Most databases do not provide versioning out of the box. Typically, they do keep a log of all operations performed (inserting, updating, or deleting data, for example), which can facilitate versioning and rollbacks, but we often need to configure the database to allow versioning and support reproducibility.
These considerations bring us to the topic of this chapter, namely
**database management systems**. A DBMS^[A DBMS is a system that interacts
with users, other applications, and the database itself to capture
and analyze data.] handles all of the issues listed above, and more. As we
will see below when we look at concrete examples, a DBMS allows us to
define a logical design that fits the structure of our data. The DBMS
then creates a *data model* (more on this below) that allows these data to be
stored, queried, and updated efficiently and reliably on disk, thus
providing independence from underlying physical storage. It supports
efficient access to data through *query languages* and (somewhat) automatic
optimization of those queries to permit fast analysis. Importantly, it
also supports concurrent access by multiple users, which is not an option
for file-based data storage. It supports *transactions*, meaning that
any update to a database is performed in its entirety or not at all,
even in the face of computer failures or multiple concurrent updates.
It also reduces both the time spent by analysts, by making it easy to express complex analytical queries concisely, and the time spent on data administration, by providing simple and uniform administration interfaces.
A *database* is a structured collection of data about entities and their
relationships. It models real-world objects---both entities (e.g.,
grants, investigators, universities) and relationships (e.g., "Steven
Weinberg" works at "University of Texas at Austin")---and captures
structure in ways that allow these entities and relationships to be
queried for analysis. A *database management system* is a software suite
designed to safely store and efficiently manage databases, and to assist
with the maintenance and discovery of the relationships that the database
represents. In general, a DBMS encompasses three key components: its
*data model* (which defines how data are represented: see Box
[Data model](#box:db1)), its *query language* (which defines how the user
interacts with the data), and support for *transactions and crash
recovery* (to ensure reliable execution despite system failures).^[Some key DBMS features are often lacking in standard statistical packages: a standard query language (with commands that allow analyses or data manipulation on a subgroup of cases defined during the analysis, for example “group by ...,” “order by ...”), keys (for speed improvement), and an explicit model of a relational data structure.]
---
**Box: Data model** <a id="box:db1"></a>
A *data model*
specifies the data elements associated with the domain we are analyzing, the
properties of those data elements, and how those data elements relate to
one another. In developing a data model, we commonly first identify the
entities that are to be modeled and then define their properties and
relationships. For example, when working on the science of science
policy (see Figure \@ref(fig:fig2)), the entities include people, products,
institutions, and funding, each of which has various properties (e.g.,
for a person, their name, address, employer); relationships include "is
employed by" and "is funded by." This conceptual data model can then be
translated into relational tables or some other database representation,
as we describe next.
---
Hundreds of different open source, commercial, and cloud-hosted DBMSs are available, and new ones appear every day. However, you only need to
understand a relatively small number of concepts and major database
types to make sense of this diversity.
Table \@ref(tab:table4-3) defines the major classes of DBMSs that we will
consider in this chapter. We consider only a few of these in any detail.
Relational DBMSs are the most widely used systems, and will
be the optimal solution for many social science data analysis purposes.
We describe relational DBMSs in detail below, but in brief, they allow
for the efficient storage, organization, and analysis of large
quantities of *tabular* data^[Sometimes, as discussed in Chapter
[Record Linkage](#chap:link), the links are one to one and sometimes
one to many.]: data organized as tables, in which rows
represent entities (e.g., research grants) and columns represent
attributes of those entities (e.g., principal investigator, institution,
funding level). The associated Structured Query Language (SQL) can then
be used to perform a wide range of tasks, which are executed with
high efficiency due to sophisticated indexing and query planning
techniques.
While relational DBMSs have dominated the database world for decades,
other database technologies have become popular for various classes of
applications in recent years. As we will see, the design of these alternative *NoSQL
DBMSs* is typically motivated by a desire to scale the quantities
of data and/or number of users that can be supported and/or to deal with
unstructured data that are not easily represented in tabular form. For
example, a key--value store can organize large numbers of records, each
of which associates an arbitrary key with an arbitrary value. These
stores, and in particular variants called *document stores* that permit
text search on the stored values, are widely used to organize and
process the billions of records that can be obtained from web crawlers.
We review below some of these alternatives and the factors that may
motivate their use.
Table: (\#tab:table4-3) Types of databases: relational (first row) and various types of NoSQL (other rows)
| **Type** | **Examples** | **Advantages** | **Disadvantages** | **Uses** |
|---------------------|-------------------------------------------------|--------------------------------------------------------------------|--------------------------------------------------------------|------------------------------------------------------------------|
| Relational database | MySQL, PostgreSQL, Oracle, SQL Server, Teradata | Consistency (ACID) | Fixed schema; typically harder to scale | Transactional systems: order processing, retail, hospitals, etc. |
| Key–value store | Dynamo, Redis | Dynamic schema; easy scaling; high throughput | Not immediately consistent; no higher-level queries | Web applications |
| Column store | Cassandra, HBase | Same as key–value; distributed; better compression at column level | Not immediately consistent; using all columns is inefficient | Large-scale analysis |
| Document store | CouchDB, MongoDB | Index entire document (JSON) | Not immediately consistent; no higher-level queries | Web applications |
| Graph database | Neo4j, InfiniteGraph | Graph queries are fast | May not be efficient to do non-graph analysis | Recommendation systems, networks, routing |
<br>
Relational and NoSQL databases (and indeed other solutions, such as
statistical packages) can also be used together. Consider, for example,
Figure \@ref(fig:figdb-dbs), which depicts data flows commonly encountered
in large research projects. Diverse data are being collected from
different sources: JSON documents from web APIs, web pages from web
scraping, tabular data from various administrative databases, Twitter
data, and newspaper articles. There may be hundreds or even thousands of
data sets in total, some of which may be extremely large. We initially
have no idea of what **schema**^[A schema defines the
structure of a database in a formal language defined
by the DBMS. See Section [Schema design and definition](#sec:db:schema).]
to use for the different data sets, and indeed it
may not be feasible to define a unified set of schemas, as the data may be diverse and new data sets may be acquired continuously. Furthermore, the way we organize the data may vary according to our intended purpose. Are we interested in geographic, temporal, or thematic relationships among different entities? Each type of analysis may require a different way of organizing the data.
For these reasons, a common storage solution may be to first load all data
into a large NoSQL database. This approach makes all data available via
a common (albeit limited) query interface. Researchers can then extract
from this database the specific elements that are of interest for their
work, loading those elements into a relational DBMS, another specialized
DBMS (e.g., a graph database), or a package for more detailed analysis.
As part of the process of loading data from the NoSQL database into a
relational database, the researcher will necessarily define schemas,
relationships between entities, and so forth. Analysis results can be
stored in a relational database or back into the NoSQL store.
```{r figdb-dbs, out.width = '70%', fig.align = 'center', echo = FALSE, fig.cap = 'A research project may use a NoSQL database to accumulate large amounts of data from many different sources, and then extract selected subsets to a relational or other database for more structured processing'}
knitr::include_graphics("ChapterDB/figures/data-fig2.png")
```
Relational DBMSs
----------------
We now provide a more detailed description of relational DBMSs.
Relational DBMSs implement the relational data model, in which data are
represented as sets of records organized in tables. This model is
particularly well suited for the structured data with which we
frequently deal in the social sciences; we discuss in
Section [NoSQL databases](#sec:db:nosql) alternative data models,
such as those used in NoSQL databases.
We use the data shown in
Figure \@ref(fig:figdb-1) to introduce key concepts. These two CSV format
files describe grants made by the US National Science Foundation (NSF).
One file contains information about grants, the other information about
investigators. How should you proceed to manipulate and analyze these
data?
```{r figdb-1, out.width = '70%', fig.align = 'center', echo = FALSE, fig.cap = 'CSV files representing grants and investigators. Each line in the first table specifies a grant number, investigator name, total funding amount, and NSF program name; each line in the second gives an investigator name, institution name, and investigator email address'}
knitr::include_graphics("ChapterDB/figures/figdb-1.png")
```
The main concept underlying the relational data model is a *table* (also
referred to as a *relation*): a set of rows (also referred to as tuples,
records, or observations), each with the same columns (also referred to
as fields, attributes or variables). A database consists of multiple
tables. For example, we show in Figure \@ref(fig:figdb-2) how the data
contained in the two CSV files of Figure \@ref(fig:figdb-1) may be represented
as two tables. The `Grants` table contains one tuple for each row in grants.csv, with columns `Number`, `Person`, `Funding`, and `Program`. The `Investigators` table contains one tuple for each row in investigators.csv, with columns `ID`, `Name`, `Institution`, and `Email`. The CSV files and tables contain essentially the same information, albeit with important differences (the addition of an `ID` column in the `Investigators` table, and the substitution of that identifier for the investigator name in the `Person` column of the `Grants` table) that we will explain below.
The use of the relational data model provides for physical independence:
a given table can be stored in many different ways. SQL queries are written in terms of the logical representation of tables (i.e., their
schema definition). Consequently, even if the physical organization of
the data changes (e.g., a different layout is used to store the data on
disk, or a new index is created to speed up access for some queries),
the queries need not change. Another advantage of the relational data
model is that, since a table is a *set*, in a mathematical sense, simple
and intuitive set operations (e.g., union, intersection) can be used to
manipulate the data, as we discuss below. We can easily, for example,
determine the intersection of two relations (e.g., grants that are
awarded to a specific institution), as we describe in the following. The
database further ensures that the data comply with the model (e.g., data
types, key uniqueness, entity relationships), essentially providing core
quality assurance.
```{r figdb-2, out.width = '70%', fig.align = 'center', echo = FALSE, fig.cap = 'Relational tables `Grants` and `Investigators` corresponding to the grants.csv and investigators.csv data in Figure 4.2, respectively. The only differences are the representation in a tabular form, the introduction of a unique numerical investigator identifier (`ID`) in the `Investigators` table, and the substitution of that identifier for the investigator name in the `Grants` table'}
knitr::include_graphics("ChapterDB/figures/figdb-2.png")
```
### Structured Query Language (SQL)
We use query languages to manipulate data in a database (e.g., to add,
update, or delete data elements) and to retrieve (raw and aggregated)
data from a database (e.g., data elements that have certain properties).
Most relational DBMSs support SQL, a simple, powerful query language with a
strong formal foundation based on logic, a foundation that allows
relational DBMSs to perform a wide variety of sophisticated
optimizations. SQL is used for three main purposes:
- **Data definition**: e.g., creation of new tables,
- **Data manipulation**: queries and updates,
- **Control**: creation of assertions to protect data integrity.
We introduce each of these features in the following, although not in
that order, and certainly not completely. Our goal here is to give
enough information to provide the reader with insights into how
relational databases work and what they do well; an in-depth SQL
tutorial is beyond the scope of this book, but we highly recommend checking the references at the end of this chapter.
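As a small preview, the following minimal sketch shows one statement of each kind, using a hypothetical `Programs` table (note that `check` constraints are only enforced by some DBMSs and versions, e.g., MySQL 8.0.16+):

```{sql, eval = FALSE}
-- Data definition: create a (hypothetical) table
create table Programs (
    Name varchar(100) primary key
);

-- Data manipulation: add a row and query it
insert into Programs (Name) values ('GALACTIC ASTRONOMY PROGRAM');
select * from Programs;

-- Control: a constraint that protects data integrity
alter table Programs add constraint chk_name check (Name <> '');
```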
### Manipulating and querying data {#sec:db:sql}
SQL and other query languages used in DBMSs support the concise,
declarative specification of complex queries. Because we are eager to
show you something immediately useful, we cover these features first,
before talking about how to define data models.
---
**Example: Identifying grants of at most \$200,000**
Here is an SQL query to identify all grants with total funding of at
most \$200,000:
```{sql, eval = FALSE}
select * from Grants
where Funding <= 200000;
```
Notice SQL's declarative nature: this query can be read almost as the
English language statement, "select all rows from the `Grants` table for which
the `Funding` column has a value less than or equal to 200,000." This query is
evaluated as follows:
1. The input table specified by the `from` clause, `Grants`, is selected.
2. The condition in the `where` clause, `Funding <= 200000`, is checked against all rows in the input table to identify those rows that match.
3. The `select` clause specifies which columns to keep from the matching rows,
that is, which columns constitute the schema of the output table. (The
"\*" indicates that all columns should be kept.)
The answer, given the data in
Figure \@ref(fig:figdb-2), is the following single-row table. (The fact that
an SQL query returns a table is important when it comes to creating more
complex queries: the result of a query can be stored into the database
as a new table, or passed to another query as input.)
| **Number** | **Person** | **Funding** | **Program** |
|---------|--------|---------|-------------------------------|
| 1500194 | 3 | 200,000 | Accelerating Innovation Rsrch |
---
DBMSs automatically optimize declarative queries such as the example
that we just presented, translating them into a set of low-level data
manipulations (an imperative *query plan*) that can be evaluated
efficiently. This feature allows users to write queries without having
to worry too much about performance issues---the database does the
worrying for you. For example, a DBMS need not consider every row in the `Grants` table in order to identify those with funding of at most \$200,000, a strategy that would be slow if the `Grants` table were large: it can instead use
an index to retrieve the relevant records much more quickly. We discuss
indices in more detail in Section [Database optimizations](#sec:db:index).
The querying component of SQL supports a wide variety of manipulations
on tables, whether referred to explicitly by a table name (as in the
example just shown) or constructed by another query. We just saw how to
use the `select` operator to both pick certain rows (what is termed *selection*)
and certain columns (what is called *projection*) from a table.
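As a quick illustration, the following sketch first projects two columns from the `Investigators` table and then combines selection and projection (the institution value is purely illustrative):

```{sql, eval = FALSE}
-- projection: keep only two columns
select Name, Email from Investigators;

-- selection and projection combined
select Name, Email
from Investigators
where Institution = "University of Texas at Austin";
```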
---
**Example: Finding grants awarded to an investigator**
We want to find all grants awarded to the investigator with name "Irving
Weinberg." The information required to answer this question is
distributed over two tables, `Grants` and `Investigators`, and so we *join* the two tables to combine tuples from both:
```{sql, eval = FALSE}
select Number, Name, Funding, Program
from Grants, Investigators
where Grants.Person = Investigators.ID
and Name = "Irving Weinberg";
```
This query combines tuples from the `Grants` and `Investigators` tables for which the `Person` and `ID` fields match. It is evaluated in a similar fashion to the query presented above, except for the `from` clause: when multiple tables are listed, as here, the conditions in the `where` clause are checked for all different combinations of tuples from the tables defined in the `from` clause (i.e., the cartesian product of these tables)---in this case, a total of $3\times 4 = 12$ combinations. We thus determine that Irving Weinberg has two grants. The query further selects the `Number`, `Name`, `Funding`, and `Program` fields from the result, giving the following:
| **Number** | **Name** | **Funding** | **Program** |
|---------|--------|---------|-------------------------------|
| 1500194 | Irving Weinberg | 200,000 | Accelerating Innovation Rsrch |
| 1211853 | Irving Weinberg | 261,437 | GALACTIC ASTRONOMY PROGRAM |
This ability to join two tables in a query is one example of how SQL
permits concise specifications of complex computations. This joining of
tables via a cartesian product operation is formally called a *cross
join*. Other types of join are also supported. We describe one such, the
*inner join*, in Section [Spatial databases](#sec:db:spatial).
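For reference, the same query can also be written with explicit join syntax; the following sketch is equivalent to the query above in most SQL dialects:

```{sql, eval = FALSE}
select Number, Name, Funding, Program
from Grants inner join Investigators
    on Grants.Person = Investigators.ID
where Name = "Irving Weinberg";
```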
---
SQL aggregate functions allow for the computation of aggregate
statistics over tables. For example, we can use the following query to
determine the total number of grants and their total and average funding
levels:
```{sql, eval = FALSE}
select count(*) as 'Number', sum(Funding) as 'Total',
avg(Funding) as 'Average'
from Grants;
```
This yields the following:
| **Number** | **Total** | **Average** |
|---------|--------|---------|
| 4 | 1444631 | 361158 |
The `group by` operator can be used in conjunction with the aggregate functions to
group the result set by one or more columns. For example, we can use the
following query to create a table with three columns: investigator name,
the number of grants associated with the investigator, and the aggregate
funding:
```{sql, eval = FALSE}
select Name, count(*) as 'Number',
avg(Funding) as 'Average funding'
from Grants, Investigators
where Grants.Person = Investigators.ID
group by Name;
```
We obtain the following:
| **Name** | **Number** | **Average Funding** |
|---------|--------|---------|
| Steven Weinberg | 1 | 666000 |
| Howard Weinberg | 1 | 323194 |
| Irving Weinberg | 2 | 230719 |
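If we wanted to keep only investigators with more than one grant, a `having` clause can filter these grouped results; a minimal sketch based on the query above:

```{sql, eval = FALSE}
select Name, count(*) as 'Number',
       avg(Funding) as 'Average funding'
from Grants, Investigators
where Grants.Person = Investigators.ID
group by Name
having count(*) > 1;
```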
### Schema design and definition {#sec:db:schema}
```{r NCSES, out.width = '100%', fig.align = 'center', echo = FALSE, fig.cap = 'A database schema can show the ways in which many tables are linked. Here, there are individual-links (shown in green) as well as institution-level links (shown in red) and location-level links (shown in blue).'}
knitr::include_graphics("ChapterDB/figures/NCSES-Database-Diagram.png")
```
When using a pre-existing database, you will be given the database design that includes tables, rows, and columns.
But, when you are starting with your own data and need to create a database, the first step is to come up with the design of the database.
We have seen that a relational database comprises a set of tables. The
task of specifying the structure of the data to be stored in a database
is called *logical design*. This task may be performed by a database
administrator, in the case of a database to be shared by many people, or
directly by users, if they are creating databases themselves. More
specifically, the logical design process involves defining a *schema*. A
schema comprises a set of tables (including, for each table, its columns
and their types), their relationships, and integrity constraints.
The first step in the logical design process is to identify the entities
that need to be modeled. In our example, we identified two important
classes of entity: "grants" and "investigators." We thus define a table
for each; each row in these two tables will correspond to a unique grant
or investigator, respectively. (In a more complete and realistic design,
we would likely also identify other entities, such as institutions and
research products.) During this step, we will often find ourselves
breaking information up into multiple tables, so as to avoid duplicating
information.
For example, imagine that we were provided grant information in the form
of one CSV file rather than two, with each line providing a grant
number, investigator, funding, program, institution, and email. In this
file, the name, institution, and email address for Irving Weinberg would
then appear twice, as he has two grants, which can lead to errors when
updating values and make it difficult to represent certain information.
(For example, if we want to add an investigator who does not yet have a
grant, we will need to create a tuple (row) with empty slots for all
columns (variables) associated with grants.) Thus we would want to break
up the single big table into the two tables that we defined here. This
breaking up of information across different tables to avoid repetition
of information is referred to as **normalization** ^[Normalization involves organizing columns and tables of a relational database to minimize data redundancy.]. ^[Normalization can be done in statistical packages as well. For example, as noted above, PSID splits its data into different files linked through ID variables. The difference here is that the DBMS makes creating, navigating, and querying the resulting data particularly easy.]
The second step in the design process is to define the columns that are
to be associated with each entity. For each table, we define a set of
columns. For example, given the data in
Figure \@ref(fig:figdb-1), the grant table should include columns for
award identifier, title, investigator, and award amount; for an
investigator, the columns will be name, university, and email address. In general, we will
want to ensure that each row in our table has a key: a set of columns
that uniquely identifies that row. In our example tables, grants are
uniquely identified by `Number` and investigators by `ID`.
The third step in the design process is to capture relationships between
entities. In our example, we are concerned with just one relationship,
namely that between grants and investigators: each grant has an
investigator. We represent this relationship between tables by
introducing a `Person` column in the `Grants` table, as shown in
Figure \@ref(fig:figdb-2). Note that we do not simply duplicate the
investigator names in the two tables, as was the case in the two CSV
files shown in Figure \@ref(fig:figdb-1): these names might not be unique, and the
duplication of data across tables can lead to later inconsistencies if a
name is updated in one table but not the other.
The final step in the design process is to represent integrity
constraints (or rules) that must hold for the data. In our example, we
may want to specify that each grant must be awarded to an investigator;
that each value of the grant identifier column must be unique (i.e.,
there cannot be two grants with the same number); and that total funding can
never be negative. Such restrictions can be achieved by specifying
appropriate constraints at the time of schema creation, as we show in
Listing [Grantdata](#list:db1), which contains the code used to create the
two tables that make up our schema.
Listing [Grantdata](#list:db1) contains four SQL statements. The first two
statements, lines 1 and 2, simply set up our new database. The `create table`
statement in lines 4–10 creates our first table. It specifies the table
name (`Investigators`) and, for each of the four columns, the column name and
its type.^[These storage types will be familiar to many of you from statistical
software packages.] Relational DBMSs offer a rich set of types to choose from
when designing a table: for example, `int` or `integer` (synonyms); `real` or
`float`(synonyms); `char(n)`, a fixed-length string of `n` characters; and
`varchar(n)`, a variable-length string of up to `n` characters. Types are
important for several reasons. First, they allow
for more efficient encoding of data. For example, the `Funding` field in the
grants.csv file of Figure \@ref(fig:figdb-1) could be represented as a string
in the `Grants` table, `char(15)`, say, to allow for large grants. By
representing it as a floating point number instead (line 15 in
Listing [Grantdata](#list:db1)), we reduce the space requirement per grant
to just four bytes. Second, types allow for integrity checks on data as
they are added to the database: for example, that same type declaration
for `Funding` ensures that only valid numbers will be entered into the database.
Third, types allow for type-specific operations on data, such as
arithmetic operations on numbers (e.g., min, max, sum).
Other SQL features allow for the specification of additional constraints
on the values that can be placed in the corresponding column. For
example, the `not null` constraints for `Name` and `Institution`
(lines 6, 7) indicate that each investigator must have a name
and an institution, respectively. (The lack of such a constraint on the
`Email` column shows that an investigator need not have an email address.)
```{sql, eval = FALSE}
create database grantdata;
use grantdata;

create table Investigators (
    ID int auto_increment,
    Name varchar(100) not null,
    Institution varchar(256) not null,
    Email varchar(100),
    primary key(ID)
);

create table Grants (
    Number int not null,
    Person int not null,
    Funding float unsigned not null,
    Program varchar(100),
    primary key(Number)
);
```
<div style="text-align: center">Listing: Code to create the grantdata database and its Investigators and Grants tables</div> <a id="list:db1"></a>
### Loading data
So far we have created a database and two empty tables. The next step is to add
data to the tables. We can of course do that manually, row by row, but in most
cases we will import data from another source, such as a CSV file.
Listing [Load data](#list:db2) shows the two statements that load the data of
Figure \@ref(fig:figdb-1) into our two tables. (Here and elsewhere in this
chapter, we use the MySQL DBMS. The SQL syntax used by different DBMSs
differs in various, mostly minor ways.) Each statement specifies the
name of the file from which data is to be read and the table into which
it is to be loaded. The `fields terminated by ","` statement tells SQL that
values are separated by columns, and `ignore 1 lines` tells SQL to skip the
header. The list of column names is used to specify how values from the file
are to be assigned to columns in the table.
For the `Investigators` table, the three values in each row of the investigators.csv file are assigned to the `Name`, `Institution`, and `Email` columns of the corresponding database row. Importantly, the `auto_increment` declaration on the `ID` column (line 5 in Listing [Grantdata](#list:db1)) causes values for this column to be assigned automatically by the DBMS, as rows are created, starting at `1`. This feature allows us to assign a unique integer identifier to each investigator as its data are loaded.
```{sql, eval = FALSE}
load data local infile "investigators.csv"
    into table Investigators
    fields terminated by ","
    ignore 1 lines
    (Name, Institution, Email);

load data local infile "grants.csv" into table Grants
    fields terminated by ","
    ignore 1 lines
    (Number, @var, Funding, Program)
    set Person = (select ID from Investigators
                  where Investigators.Name=@var);
```
<div style="text-align: center">Listing: Code to load data into the Investigators and Grants tables</div> <a id="list:db2"></a>
For the `Grants` table, the `load data` call
(lines 7–12) is somewhat more complex. Rather than loading
the investigator name (the second column of each line in our data file,
represented here by the variable `@var`) directly into the database, we use an
SQL query (the `select` statement in
lines 11–12) to retrieve from the `Investigators` table the `ID` corresponding to that name. By thus replacing the investigator name with the unique
investigator identifier, we avoid replicating the name across the two
tables.
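As noted above, rows can also be added manually, one at a time, with `insert` statements; a minimal sketch (the email address is a hypothetical placeholder):

```{sql, eval = FALSE}
insert into Investigators (Name, Institution, Email)
values ("Steven Weinberg", "University of Texas at Austin", "weinberg@example.edu");
```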
### Transactions and crash recovery
A DBMS protects the data that it stores from computer crashes: if your
computer stops running suddenly (e.g., your operating system crashes or
you unplug the power), the contents of your database are not corrupted.
It does so by supporting *transactions*. A transaction is an atomic
sequence of database actions. In general, every SQL statement is
executed as a transaction. You can also specify sets of statements to be
combined into a single transaction, but we do not cover that capability
here. The DBMS ensures that each transaction is executed completely even
in the case of failure or error: if the transaction succeeds, the
results of all operations are recorded permanently ("persisted") in the
database, and if it fails, all operations are "rolled back" and no
changes are committed. For example, suppose we ran the following SQL
statement to convert the funding amounts in the table from dollars to
euros, by scaling each number by 0.9. The `update` statement specifies the table
to be updated and the operation to be performed, which in this case is
to update the `Funding` column of each row. The DBMS will ensure that either no
rows are altered or all are altered.
```{sql, eval = FALSE}
update Grants set Grants.Funding = Grants.Funding*0.9;
```
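Although we do not cover multi-statement transactions in detail, grouping several statements into one atomic unit looks roughly like the following sketch (MySQL syntax; the second update is purely illustrative):

```{sql, eval = FALSE}
start transaction;
update Grants set Funding = Funding * 0.9;
update Grants set Program = concat(Program, ' [EUR]');
commit;
-- if something goes wrong before the commit, we can instead issue:
-- rollback;
```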
Transactions are also key to supporting multi-user access. The
*concurrency control* mechanisms in a DBMS allow multiple users to
operate on a database concurrently, as if they were the only users of
the system: transactions from multiple users can be interleaved to
ensure fast response times, while the DBMS ensures that the database
remains consistent. While entire books could be (and have been) written
on concurrency in databases, the key point is that read operations can
proceed concurrently, while update operations are typically serialized.
### Database optimizations {#sec:db:index}
A relational DBMS applies query planning and optimization methods with
the goal of evaluating queries as efficiently as possible. For example,
if a query asks for rows that fit two conditions, one cheap to evaluate
and one expensive, a relational DBMS may filter first on the basis of
the first condition, and then apply the second condition only to the
rows identified by that first filter. These sorts of optimization are
what distinguish SQL from other programming languages, as they allow the
user to write queries declaratively and rely on the DBMS to come up with
an efficient execution strategy.
Nevertheless, the user can help the DBMS to improve performance. The
single most powerful performance improvement tool is the index, an
internal data structure that the DBMS maintains to speed up queries.
While various types of indices can be created, with different
characteristics, the basic idea is simple. Consider the `ID` column in our `Investigators` table. Assume that there are $N$ rows in the table. In the absence of an
index, a query that refers to a column value (e.g., `where ID=3`)
would require a linear scan of the table, taking on average $N/2$
comparisons and in the worst case $N$ comparisons. A binary tree index
allows the desired value to be found with just $\log_2 N$ comparisons.
---
**Example: Using indices to improve database performance**
Consider the following query:
```{sql, eval = FALSE}
select ID, Name, sum(Funding) as TotalFunding
from Grants, Investigators
where Investigators.ID=Grants.Person
group by ID;
```
This query joins our two tables to link investigators with the grants
that they hold, groups grants by investigator (using `group by`), and finally sums
the funding associated with the grants held by each investigator. The
result is the following:
| **ID** | **Name** | **TotalFunding** |
|---------|--------|---------|
| 1 | Steven Weinberg | 666000 |
| 2 | Howard Weinberg | 323194 |
| 3| Irving Weinberg | 230719 |
In the absence of indices, the DBMS must compare each row in `Investigators` with each
row in `Grants`, checking for each pair whether ` Investigators.ID = Grants.Person` holds. As the two tables in our sample database have only three and four rows, respectively, the
total number of comparisons is only $3\times 4=12$. But if we had, say,
1 million investigators and 1 million grants, then the DBMS would have
to perform 1 trillion comparisons, which would take a long time. (More
importantly, it would have to perform a large number of
disk I/O operations, if the tables did not fit in memory.) An index on
the `ID` column of the `Investigators` table reduces the number of operations dramatically,
as the DBMS can then take each of the 1 million rows in the `Grants` table and,
for each row, identify the matching row(s) in `Investigators` via an index lookup rather
than a linear scan.
In our example table, the `ID` column has been specified to be a `primary key`, and thus
an index is created for it automatically. If it were not, we could
easily create the desired index as follows:
```{sql, eval = FALSE}
alter table Investigators add index(ID);
```
It can be difficult for the user to determine when an index is required.
A good rule of thumb is to create an index for any column that is
queried often, that is, appears on the right-hand side of a `where` statement that is to be evaluated frequently.
However, the presence of indices makes updates more expensive, as every
change to a column value requires that the index be rebuilt to reflect
the change. Thus, if your data are highly dynamic, you should carefully
select which indices to create. (For bulk load operations, a common
practice is to drop indices prior to the data import, and re-create them
once the load is completed.) Also, indices take disk space, so you need
to consider the tradeoff between query efficiency and resources.
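For example, a bulk load might be bracketed by dropping and re-creating a secondary index; in the following sketch the index name `idx_person` is hypothetical:

```{sql, eval = FALSE}
-- before the bulk load: drop the secondary index
drop index idx_person on Grants;

-- ... perform the bulk load (e.g., the load data statement shown earlier) ...

-- after the bulk load: re-create the index
create index idx_person on Grants(Person);
```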
The `explain` command can be useful for determining when indices are required. For
example, we show in the following some of the output produced when we
apply `explain` to our query. (For this example, we have expanded the two tables
to 1,000 rows each, as our original tables are too small for MySQL to
consider the use of indices.) The output provides useful information
such as the key(s) that could be used, if indices exist (`Person` in the `Grants` table,
and the primary key, `ID`, for the `Investigators` table); the key(s) that are actually used (the primary key, `ID`, in the `Investigators` table); the column(s) that are compared to the index (`Investigators.ID` is compared with `Grants.Person`); and the number of rows that must be considered (each of the 1,000 rows in `Grants` is compared with one row in `Investigators`, for a total of 1,000 comparisons).
```{sql, eval = FALSE}
mysql> explain select ID, Name, sum(Funding) as TotalFunding
from Grants, Investigators
where Investigators.ID=Grants.Person group by ID;
+---------------+---------------+---------+---------------+------+
| table | possible_keys | key | ref | rows |
+---------------+---------------+---------+---------------+------+
| Grants | Person | NULL | NULL | 1000 |
| Investigators | PRIMARY | PRIMARY | Grants.Person | 1 |
+---------------+---------------+---------+---------------+------+
```
Contrast this output with the output obtained for equivalent tables in
which `ID` is not a primary key. In this case, no keys are used and thus
$1{,}000\times 1{,}000=1{,}000{,}000$ comparisons, and the associated
disk reads, must be performed.
```{sql, eval = FALSE}
+---------------+---------------+------+------+------+
| table | possible_keys | key | ref | rows |
+---------------+---------------+------+------+------+
| Grants | Person | NULL | NULL | 1000 |
| Investigators | ID | NULL | NULL | 1000 |
+---------------+---------------+------+------+------+
```
---
A second way in which the user can contribute to performance improvement
is by using appropriate table definitions and data types. Most DBMSs
store data on disk. Data must be read from disk into memory before it
can be manipulated. Memory accesses are fast, but loading data into
memory is expensive: accesses to main memory can be a million times
faster than accesses to disk. Therefore, to ensure queries are
efficient, it is important to minimize the number of disk accesses. A
relational DBMS automatically optimizes queries: based on how the data
are stored, it transforms a SQL query into a query plan that can be
executed efficiently, and chooses an execution strategy that minimizes
disk accesses. But users can contribute to making queries efficient. As
discussed above, the choice of types made when defining schemas can make
a big difference. As a rule of thumb, only use as much space as needed
for your data: the smaller your records, the more records can be
transferred to main memory using a single disk access. The design of
relational tables is also important. If you put all columns in a single
table (i.e., you do not normalize), more data may come into memory than is
required.
### Caveats and challenges
It is important to keep the following caveats and challenges in mind
when using SQL technology with social science data.
**Data cleaning**
Data created outside an SQL database, such as data in files, are not
always subject to strict constraints: data types may not be correct or
consistent (e.g., numeric data stored as text) and consistency or
integrity may not be enforced (e.g., absence of primary keys, missing
foreign keys). Indeed, as the reader probably knows well from
experience, data are rarely perfect. As a result, the data may fail to
comply with strict SQL schema requirements and fail to load, in which
case either data must be cleaned before or during loading, or the SQL
schema must be relaxed.
**Missing values**
Care must be taken when loading data in which some values may be missing
or blank. SQL engines represent and refer to a missing or blank value as
the built-in constant `null`. Counterintuitively, when loading data from text
files (e.g., CSV), many SQL engines require that missing values be
represented explicitly by the term `null`; if a data value is simply omitted,
it may fail to load or be incorrectly represented, for example as zero
or the empty string (`""`) instead of `null`. Thus, for example, the second row in
the investigators.csv file of
Figure \@ref(fig:figdb-1):
`Howard Weinberg,University of North Carolina Chapel Hill,`
may need to be rewritten as:
`Howard Weinberg,University of North Carolina Chapel Hill,null`
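Alternatively, some engines let you convert blank fields to `null` at load time; a sketch for MySQL, using a user variable:

```{sql, eval = FALSE}
load data local infile "investigators.csv"
    into table Investigators
    fields terminated by ","
    ignore 1 lines
    (Name, Institution, @Email)
    set Email = nullif(@Email, '');
```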
**Metadata for categorical variables**
SQL engines are metadata poor: they do not allow extra information to be
stored about a variable (field) beyond its base name and type (`int`, `char`,
etc., as introduced in Section [Schema design and definition](#sec:db:schema)). They cannot, for example, record directly the fact that the column `class` can only take one of three values, `animal`, `vegetable`, or `mineral`, or what these values mean. Common practice is thus to store information about possible values in another table (commonly referred to as a *dimension table*) that can be used as a lookup and constraint, as in the following:
<div style="text-align: center"> Table **class_values**</div>
| **Value** | **Description** |
|---------|--------|
| `animal` | Is alive |
| `vegetable` | Grows |
| `mineral` | Isn’t alive and doesn’t grow |
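Such a dimension table can be created and then referenced with a foreign key constraint so that the DBMS enforces the allowed values; a sketch (the `Specimens` table is hypothetical):

```{sql, eval = FALSE}
create table class_values (
    Value varchar(20) primary key,
    Description varchar(100)
);

create table Specimens (
    SpecimenID int primary key,
    class varchar(20),
    foreign key (class) references class_values(Value)
);
```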
A related concept is that a column or list of columns may be declared `primary key` or
`unique`.
Such a statement specifies that no two tuples of the table may agree in the specified column---or, if a list of columns is provided, in all of those columns.
There can be only one `primary key` for a table, but several
`unique` columns.
No column of a `primary key` can ever be `null` in any tuple.
But columns declared `unique` may have `null`s, and there may be several tuples with `null`.
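In schema-definition terms, the distinction looks like this (a sketch; the `Contacts` table is hypothetical):

```{sql, eval = FALSE}
create table Contacts (
    ID int auto_increment primary key,  -- exactly one primary key; never null
    Email varchar(100) unique,          -- no two non-null values may repeat; nulls allowed
    Name varchar(100) not null
);
```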
Linking DBMSs and other tools
-----------------------------
Query languages such as SQL are not general-purpose programming
languages; they support easy, efficient access to large data sets and are extremely efficient for specific types of analysis, but they may not be the right choice for every analysis. When complex
computations are required, one can embed query language statements into
a programming language or statistical package. For example, we might
want to calculate the interquartile range of funding for all grants.
While this calculation can be accomplished in SQL, the resulting SQL
code will be complicated (depending on which flavor of SQL your database supports).
Languages like Python make such statistical
calculations straightforward, so it is natural to write a Python (or R,
SAS, Stata, etc.) program that connects to the DBMS that contains our
data, fetches the required data from the DBMS, and then calculates the
interquartile range of those data. The program can then, if desired,
store the result of this calculation back into the database.
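One simple way to store such a result back is to write it into a small results table; a sketch (the table, metric name, and value are all hypothetical placeholders):

```{sql, eval = FALSE}
create table AnalysisResults (
    Metric varchar(50) primary key,
    Value float
);

insert into AnalysisResults (Metric, Value)
values ('funding_interquartile_range', 0.0)   -- 0.0 stands in for the computed value
on duplicate key update Value = 0.0;
```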
Many relational DBMSs also have built-in analytical functions, and many now support embedded programming languages, providing significant in-database statistical and analytical capabilities and reducing the need for external processing.
```{python, eval = FALSE}
from mysql.connector import MySQLConnection, Error
from python_mysql_dbconfig import read_db_config

def retrieve_and_analyze_data():
    try:
        # Open connection to the MySQL database
        dbconfig = read_db_config()
        conn = MySQLConnection(**dbconfig)
        cursor = conn.cursor()
        # Transmit the SQL query to the database
        cursor.execute('select Funding from Grants;')
        # Fetch all rows of the query response
        rows = [row for row in cursor.fetchall()]
        calculate_inter_quartile_range(rows)
    except Error as e:
        print(e)
    finally:
        cursor.close()
        conn.close()

if __name__ == '__main__':
    retrieve_and_analyze_data()
```
<div style="text-align: center">Listing: Embedding SQL in Python</div> <a id="list:db3"></a>
---
**Example: Embedding database queries in Python**
The Python script in Listing [Embedding](#list:db3) shows how this
embedding of database queries in Python is done. This script establishes a connection to the database (lines 7–9), transmits the desired SQL query to the database (line 11), retrieves the query results into a Python array (line 13), and calls a Python procedure (not given) to perform the desired computation (line 14). A similar program could be used to load
the results of a Python (or R, SAS, Stata, etc.) computation into a
database.
---
---
**Example: Loading other structured data**
We saw in Listing [Load data](#list:db2) how to load data from CSV files
into SQL tables. Data in other formats, such as the commonly used JSON,
can also be loaded into a relational DBMS. Consider, for example, the
following JSON format data, a simplified version of data shown in
Chapter [Working with Web Data and APIs](#chap:web).
```json
[
    {
        "institute": "Janelia Campus",
        "name": "Laurence Abbott",
        "role": "Senior Fellow",
        "state": "VA",
        "town": "Ashburn"
    },
    {
        "institute": "Jackson Lab",
        "name": "Susan Ackerman",
        "role": "Investigator",
        "state": "ME",
        "town": "Bar Harbor"
    }
]
```
While some relational DBMSs (such as PostgreSQL) provide built-in support for JSON objects,
we assume here that we want to convert these data into normal SQL
tables. Using one of the many utilities for converting JSON into CSV, we
can construct the following CSV file, which we can load into an SQL
table using the method shown earlier.
``` {style="inline"}
institute,name,role,state,town
Janelia Campus,Laurence Abbott,Senior Fellow,VA,Ashburn