# Visualizing Data in the Tidyverse {#dataviz}
```{r, include=FALSE}
knitr::opts_chunk$set(fig.path = "images/dataviz-")
```
## About This Course
Data visualization is a critical part of any data science project. Once data have been imported and wrangled into place, visualizing your data can help you get a handle on what's going on in the dataset. Similarly, once you've completed your analysis and are ready to present your findings, data visualizations are a highly effective way to communicate your results to others. In this course we will cover what data visualization is and define some of the basic types of data visualizations.
In this course you will learn about the ggplot2 R package, a powerful set of tools for making stunning data graphics that has become the industry standard. You will learn about different types of plots, how to construct effective plots, and what makes for a successful or unsuccessful visualization.
In this specialization we assume familiarity with the R programming language. If you are not yet familiar with R, we suggest you first complete [R Programming](https://www.coursera.org/learn/r-programming) before returning to complete this course.
## Data Visualization Background
At its core, the term 'data visualization' refers to any visual display of data that helps us understand the underlying data better. This can be a plot or figure of some sort or a table that summarizes the data. Generally, there are a few characteristics of all good plots.
### General Features of Plots
Good plots have a number of features. While not exhaustive, good plots have:
1. Clearly-labeled axes.
2. Text that is large enough to read.
3. Axes that are not misleading.
4. Data that are displayed appropriately considering the type of data you have.
More specifically, however, there are two general approaches to data visualization: **exploratory plots** and **explanatory plots**.
#### Exploratory Plots
These are **data displays to help you better understand and discover hidden patterns in the data** you're working with. These won't be the prettiest plots, but they will be incredibly helpful. Exploratory visualizations have a number of general characteristics:
* They are made quickly.
* You'll make a large number of them.
* The axes and legends are cleaned up.
Below we have a graph where the axes are labeled and a general pattern can be determined. This is a great example of an exploratory plot. It lets you, the analyst, know what's going on in your data, but it isn't yet ready for a big presentation.
![Exploratory Plot](images/gslides/172.png)
As you're trying to understand the data you have on hand, you'll likely make a lot of plots and tables just to explore and understand the data. Because there are a lot of them and they're for your use (rather than for communicating with others), you don't have to spend all your time making them perfect. But you do have to spend enough time to make sure that you're drawing the right conclusions from them. Thus, you don't have to spend a long time considering what colors are perfect on these, but you do want to make sure your axes are not cut off.
Other Exploratory Plotting Examples:
[Map of Reddit](http://opensource.datacratic.com/mtlpy50/)
[Air Quality Data](https://blog.datazar.com/exploratory-data-analysis-using-r-part-i-17e4e8e03961)
#### Explanatory Plots
These are data displays that aim to **communicate insights to others**. You spend a lot of time on these plots to make sure they're easily interpretable by an audience. General characteristics of explanatory plots:
* They take a while to make.
* There are only a few of these for each project.
* You've spent a lot of time making sure the colors, labels, and sizes are all perfect for your needs.
Here we see an improvement upon the exploratory plot we looked at previously. Here, the axis labels are more descriptive. All of the text is larger. The legend has been moved onto the plot. The points on the plot are larger. And, there is a title. All of these changes help to improve the plot, making it an explanatory plot that would be presentation-ready.
![Explanatory Plots](images/gslides/173.png)
Explanatory plots are made after you've done an analysis and once you really understand the data you have. The goal of these plots is to communicate your findings clearly to others. To do so, you want to make sure these plots are made carefully: the axis labels should all be clear, the labels should all be large enough to read, the colors should all be carefully chosen, etc. As this takes time and because you do not want to overwhelm your audience, you only want to have a few of these for each project. We often refer to these as "publication ready" plots. These are the plots that would make it into an article at the New York Times or in your presentation to your bosses.
Other Explanatory Plotting Examples:
* [How the Recession Shaped the Economy (NYT)](https://www.nytimes.com/interactive/2014/06/05/upshot/how-the-recession-reshaped-the-economy-in-255-charts.html)
* [2018 Flu Season (FiveThirtyEight)](https://fivethirtyeight.com/features/america-should-have-stayed-home-this-flu-season/)
## Plot Types
Above we saw data displayed as both an exploratory plot and an explanatory plot. That plot was an example of a scatterplot. However, there are many types of plots that are helpful. We'll discuss a few basic ones below and will include links to a few galleries where you can get a sense of the many different types of plots out there.
To do this, we'll use the `Davis` dataset from the `carData` package, which includes height and weight information for 200 people.
To use these data, first make sure the `carData` package is installed, and then load it.
```{r}
# install.packages("carData")  # run once if the package is not yet installed
library(carData)
Davis <- carData::Davis
```
![Dataset](images/gslides/174.png)
### Histogram
Histograms are helpful when you want to **better understand what values you have in your dataset for a single set of numbers**. For example, if you had a dataset with information about many people, you may want to know how tall the people in your dataset are. To quickly visualize this, you could use a histogram. Histograms let you know what range of values you have in your dataset. For example, below you can see that in this dataset, the height values range from around 50 to around 200 cm. The shape of the histogram also gives you information about the individuals in your dataset, because the number of people at each height is counted. So, the tallest bars show that there are about 40 people in the dataset whose height is between 165 and 170 cm. Finally, you can quickly tell at a glance that most people in this dataset are at least 150 cm tall, but that there is at least one individual whose reported height is much lower.
![Histogram](images/gslides/175.png)
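As a preview of the plotting code covered later in this course, a histogram like the one above could be sketched with `ggplot2` as follows. The 5 cm bin width and the object name `height_hist` are assumptions made for illustration, and are not necessarily the settings used to create the figure.

```{r}
library(ggplot2)
library(carData)

# Histogram of height (in cm) for the 200 people in the Davis dataset.
# binwidth = 5 is an assumed value; other widths will change the shape.
height_hist <- ggplot(data = Davis) +
  geom_histogram(mapping = aes(x = height), binwidth = 5)
height_hist
```

Trying a few different values of `binwidth` is a quick way to check that a pattern is not just a product of how the bins were drawn.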
### Densityplot
Densityplots are smoothed versions of histograms, visualizing the distribution of a continuous variable. These plots effectively visualize the shape of the distribution and, unlike histograms, are not sensitive to the number of bins chosen for visualization.
![Densityplot](images/gslides/176.png)
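The same variable can be drawn as a density plot by swapping the geom. This is a minimal sketch that relies on the default amount of smoothing; the `fill` and `alpha` values are arbitrary choices for illustration.

```{r}
library(ggplot2)
library(carData)

# Smoothed distribution of height. No bin width is needed, although the
# amount of smoothing can be tuned with the `adjust` argument.
height_density <- ggplot(data = Davis) +
  geom_density(mapping = aes(x = height), fill = "grey70", alpha = 0.5)
height_density
```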
### Scatterplot
Scatterplots are helpful when you have **numerical values for two different pieces of information** and you want to understand the relationship between those pieces of information. Here, each dot represents a different person in the dataset. The dot's position on the graph represents that individual's height and weight. Overall, in this dataset, we can see that, in general, the more someone weighs, the taller they are. Scatterplots, therefore, help us better understand at a glance the relationship between two sets of numbers.
![Scatter Plot](images/gslides/177.png)
### Barplot
When you only have **a single categorical variable that you want broken down and quantified by category**, a barplot will be ideal. For example, if you wanted to look at how many females and how many males you have in your dataset, you could use a barplot. The difference in height between the bars clearly demonstrates that there are more females in this dataset than males.
![Barplot](images/gslides/178.png)
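The counts behind a barplot can be checked with a simple table before plotting. The sketch below assumes the `sex` column of the `Davis` dataset is the category of interest; `sex_bar` is just an illustrative object name.

```{r}
library(ggplot2)
library(carData)

# Count the rows in each category directly ...
table(Davis$sex)

# ... or let geom_bar() do the counting and draw one bar per category.
sex_bar <- ggplot(data = Davis) +
  geom_bar(mapping = aes(x = sex))
sex_bar
```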
#### Grouped Barplot
Grouped barplots, like simple barplots, demonstrate the counts for a group; however, they break this down by an additional categorical variable. For example, here we see the number of individuals within each % category along the x-axis. But, these data are further broken down by gender (an additional categorical variable). Comparisons between bars that are side-by-side are made most easily by our visual system. So, it's important to ensure that the bars you want viewers to be able to compare most easily are next to one another in this plot type.
![Grouped Barplot](images/gslides/179.png)
#### Stacked Barplot
Another common variation on the barplot is the stacked barplot. Stacked barplots take the information from a grouped barplot but stack the bars on top of one another. This is most helpful when the bars add up to 100%, such as in a survey response where you're measuring percent of respondents within each category. Otherwise, it can be hard to compare between the groups within each bar.
```{r, eval = FALSE, echo = FALSE}
library(ggplot2)
library(dplyr)
# dplyr does not attach the %<>% operator, so reassign Davis explicitly
Davis <- Davis %>%
  mutate(percent_weight = round(percent_rank(weight) * 100),
         percent_height = round(percent_rank(height) * 100),
         percent_height_breaks = cut(
           percent_height,
           breaks = seq(0, 100, by = 20), right = TRUE, include.lowest = TRUE,
           labels = c("0-20%", "21-40%", "41-60%", "61-80%", "81-100%")))
# stacked bars scaled to 100% within each height percentile category
Davis %>%
  ggplot() +
  geom_bar(aes(x = percent_height_breaks, fill = sex), position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  coord_flip() +
  labs(y = "Percent of Each Gender for Each Height Category",
       x = "Category of % Height Percentile") +
  theme_linedraw()
```
![Stacked Barplot](images/gslides/180_update.png)
### Boxplot
Boxplots also summarize **numerical values across a category**; however, instead of just comparing the heights of the bar, they give us an idea of the range of values that each category can take. For example, if we wanted to compare the heights of men to the heights of women, we could do that with a boxplot.
![Boxplot](images/gslides/181.png)
To interpret a boxplot, there are a few places where we'll want to focus our attention. For each category, the horizontal line through the middle of the box corresponds to the median value for that group. So, here, we can say that the median, or most typical, height for females is about 165 cm. For males, this value is higher, just under 180 cm. Outside of the colored boxes, there are dashed lines. The ends of these lines correspond to the typical range of values. Here, we can see that females tend to have heights between 150 and 180 cm. Lastly, when individuals have values outside the typical range, a boxplot will show these individuals as circles. These circles are referred to as outliers.
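A comparable boxplot could be sketched in `ggplot2` as shown here. Note that `ggplot2` draws the whiskers as solid lines and the outliers as filled points, so the result will look slightly different from the figure described above. The object names are illustrative.

```{r}
library(ggplot2)
library(carData)

# One box per category of sex, summarizing the distribution of height.
height_box <- ggplot(data = Davis) +
  geom_boxplot(mapping = aes(x = sex, y = height))
height_box

# The line inside each box is the group median, which can be checked directly.
height_medians <- tapply(Davis$height, Davis$sex, median)
height_medians
```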
### Line Plots
The final type of basic plot we'll discuss here is the line plot. Line plots are most effective at showing a **quantitative trend over time**.
```{r, eval = FALSE, echo = FALSE}
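library(ggplot2)
library(dplyr)
library(carData)
# line plot of the US population (in millions) by census year
USPop %>%
  ggplot(aes(x = year, y = population)) +
  geom_line(color = "blue", linewidth = 2) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  labs(title = "United States Population Over Time", y = "Population in Millions", x = "Year") +
  theme_linedraw() +
  theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14), plot.title = element_text(hjust = .5, size = 16))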
```
![Line Plot](images/gslides/182_update.png)
#### Resources to look at these and other types of plots:
* [R Graph Gallery](https://www.r-graph-gallery.com/)
* [Ferdio Data Visualization Catalog](http://datavizproject.com/)
## Making Good Plots
The goal of data visualization in data analysis is to improve understanding of the data. As mentioned in the last lesson, this could mean improving our own understanding of the data *or* using visualization to improve someone else's understanding of the data.
We discussed some general characteristics and basic types of plots in the last lesson, but here we will step through a number of general tips for making good plots.
When generating exploratory or explanatory plots, you'll want to ensure that the information is displayed accurately and in a way that best reflects the reality within the dataset. Here, we provide a number of tips to keep in mind when generating plots.
### Choose the Right Type of Plot
If your goal is to allow the viewer to compare values across groups, pie charts should largely be avoided. This is because it's easier for the human eye to differentiate between bar heights than it is between similarly-sized slices of a pie. Thinking about the best way to visualize your data before making the plot is an important step in the process of data visualization.
![Choose an appropriate plot for the data you're visualizing.](images/gslides/183.png)
### Be Mindful When Choosing Colors
Choosing colors that work for the story you're trying to convey with your visualization is important. Avoiding colors that are hard to see on a screen or when projected, such as pastels, is a good idea. Additionally, red-green color blindness is common and leads to difficulty in distinguishing reds from greens. Simply avoiding making comparisons between these two colors is a good first step when visualizing data.
![Choosing appropriate colors for visualizations is important](images/gslides/184.png)
Beyond red-green color blindness, color theory is an entire field with its own experts. To learn more, you can browse the available [color palettes in R](https://github.com/EmilHvitfeldt/r-color-palettes) or read Lisa Charlotte Rost [talking about color choices in data visualization](https://lisacharlotterost.github.io/2016/04/22/Colors-for-DataVis/).
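One way to sidestep red-green comparisons is to use a palette designed to stay distinguishable under common forms of color blindness. The following is a minimal sketch, assuming the viridis scale that ships with `ggplot2` and the `iris` dataset built into R; it is one option among many, not a recommendation from the sources linked above.

```{r}
library(ggplot2)

# scale_color_viridis_d() applies the discrete viridis palette,
# which avoids relying on a red versus green contrast.
iris_viridis <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Petal.Width, y = Petal.Length, color = Species)) +
  scale_color_viridis_d()
iris_viridis
```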
### Label the Axes
Whether you're making an exploratory or explanatory visualization, labeled axes are a must. They help tell the story of the figure. Making sure the axes are clearly labeled is also important. Rather than labeling the graph below with "h" and "g," we chose the labels "height" and "gender," making it clear to the viewer exactly what is being plotted.
![Having descriptive labels on your axes is critical](images/gslides/185.png)
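In `ggplot2`, axis labels can be set with `labs()`. The sketch below uses the `Davis` data rather than the data behind the figure, and the label text is an assumption chosen for illustration.

```{r}
library(ggplot2)
library(carData)

# Without labs(), the axes would show the raw column names "sex" and "height".
labeled_plot <- ggplot(data = Davis) +
  geom_boxplot(mapping = aes(x = sex, y = height)) +
  labs(x = "Gender", y = "Height (cm)")
labeled_plot
```

Including the unit of measurement in the label, as in "Height (cm)", saves viewers from having to guess it.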
### Make Sure the Text is Readable
Often text on plots is too small for viewers to read. By being mindful of the size of the text on your axes, in your legend, and used for your labels, your visualizations will be greatly improved.
![On the right, we see that the text is easily readable](images/gslides/186.png)
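Text size in `ggplot2` is controlled through `theme()`. The point sizes below are assumed values that tend to work for slides; the right sizes depend on where the plot will be shown.

```{r}
library(ggplot2)
library(carData)

# theme() controls non-data elements such as text size (in points).
readable_plot <- ggplot(data = Davis) +
  geom_point(mapping = aes(x = weight, y = height, color = sex)) +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14),
        legend.text = element_text(size = 12),
        legend.title = element_text(size = 14))
readable_plot
```

A shortcut is to pass `base_size` to a complete theme, for example `theme_gray(base_size = 14)`, which scales all of the text at once.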
### Make Sure the Numbers Add Up
When you're making a plot that should sum to 100, make sure that it in fact does. Taking a look at visualizations after you make them to ensure that they make sense is an important part of the data visualization process.
![At left, the pieces of the pie only add up to 95%. On the right, this error has been fixed and the pieces add up to 100%](images/gslides/187.png)
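A check like this is easy to automate before plotting. The sketch below computes the percentage of each category in the `Davis` data (an assumed example, not the data behind the pie charts above) and confirms that the percentages account for every observation.

```{r}
library(carData)

# Percentage of rows in each category of sex.
sex_percent <- prop.table(table(Davis$sex)) * 100
round(sex_percent, 1)

# The percentages should account for every observation.
stopifnot(isTRUE(all.equal(sum(sex_percent), 100)))
```

Keep in mind that rounded labels can add up to slightly more or less than 100 even when the underlying values are correct, so it is best to check the unrounded numbers.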
### Make Sure the Numbers and Plots Make Sense Together
Another common error is having labels that don't reflect the underlying graphic. For example, here, we can see on the left that the turquoise piece is more than half the graph, and thus the label 45% must be incorrect. At right, we see that the labels match what we see in the figure.
![Checking to make sure the numbers and plot make sense together is important](images/gslides/188.png)
### Make Comparisons Easy on Viewers
There are many ways in which you can make comparisons easier on the viewer. For example, avoiding unnecessary whitespace between the bars on your graph can help viewers make comparisons between the bars on the barplot.
![At left, there is extra white space between the bars of the plot that should be removed. On the right, we see an improved plot](images/gslides/189.png)
### Use y-axes That Start at Zero
Often, in an attempt to make differences between groups look larger than they are, the y-axis is started at a value other than zero. This is misleading. The y-axis for numerical information should start at zero.
![At left, the differences between the bars appear larger than on the right; however, this is just because the y-axis starts at 200. The proper way to draw this graph is to start the y-axis at 0.](images/gslides/190.png)
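Bars drawn with `geom_bar()` start at zero by default, but for other geoms `ggplot2` fits the axis to the range of the data. One way to include zero, sketched here with the `Davis` data as an assumed example, is `expand_limits()`.

```{r}
library(ggplot2)
library(carData)

# By default the y-axis of a scatterplot is fitted to the range of the data.
# expand_limits(y = 0) extends the axis so that it includes zero.
zero_based <- ggplot(data = Davis) +
  geom_point(mapping = aes(x = weight, y = height)) +
  expand_limits(y = 0)
zero_based
```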
### Keep It Simple
The goal of data visualization is to improve understanding of data. Sometimes complicated visualizations cannot be avoided; however, when possible, keep it simple.
Here, the graphic on the left does not immediately convey a main point. It's hard to interpret what each point means or what the story of this graphic is supposed to be. In contrast, the graphics on the right are simpler and each shows a more obvious story. Make sure that your main point comes through:
![Main point unclear](images/gslides/191_update.png)
```{r, eval = FALSE, echo = FALSE}
library(tidyr)
library(dplyr)
library(ggplot2)
iris %>% pivot_longer(cols = -Species) %>%
separate(col = "name", into = c("Type","Measurement")) %>%
mutate(row = rep(1:(600/2), each=2))%>%
tidyr::pivot_wider(names_from = Measurement, values_from = value) %>% ggplot() +
geom_point(aes(x = Width, y = Length, col = Species, shape = Type), size = 3, alpha = .7) +labs(title = "Iris Characteristics") + theme_linedraw()+ theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14), plot.title = element_text(hjust = .5, size = 16), legend.text = element_text(size = 13), legend.title = element_text(size = 13))
iris %>% ggplot(aes(x = Sepal.Width, y = Sepal.Length, col = Species)) +geom_point(size = 2)+ labs(x = "Width", y = "Length", title = "Iris Sepal Characteristics") + theme_linedraw()+ theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14), plot.title = element_text(hjust = .5, size = 16), legend.text = element_text(size = 13), legend.title = element_text(size = 13)) + ylim(c(0,10))
iris %>% ggplot(aes(x = Petal.Width, y = Petal.Length, col = Species)) +geom_point(size =2)+ labs( x = "Width", y = "Length", title = "Iris Petal Characteristics") + theme_linedraw()+ theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14), plot.title = element_text(hjust = .5, size = 16), legend.text = element_text(size = 13), legend.title = element_text(size = 13)) + ylim(c(0,10))
```
Similarly, the intention of your graphic should never be to mislead or confuse. Be sure that your data visualizations improve viewers' understanding. Using unusual axis limits or point sizes, or using vague labels, can make plots misleading. The plot below creates an amusing exclamation mark shape, but it is no longer clear which points correspond to which species. Furthermore, this plot makes it look like petal width is not very distinguishable across the different species (particularly for versicolor and virginica), which is the opposite of what the previous petal plot conveyed.
```{r, eval = FALSE, echo = FALSE}
library(tidyr)
library(ggplot2)
library(dplyr)
iris %>% ggplot(aes(x = Petal.Width, y = Petal.Length, shape = Species)) +geom_point(size =10, alpha = 0.5)+ labs( x = "Width", y = "Length", title = "Iris Petal Characteristics") + theme_linedraw()+ theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14), plot.title = element_text(hjust = .5, size = 16), legend.text = element_text(size = 13), legend.title = element_text(size = 13)) + ylim(c(0,10))+ xlim(c(0,20))
```
![Confusion is conveyed here](images/gslides/192_update.png)
## Plot Generation Process
Beyond these general guidelines, there are a number of questions you should ask yourself before making a plot. These have been nicely laid out in a [blog post](https://blog.datawrapper.de/better-charts/) from the wonderful team at [Chartable](https://blog.datawrapper.de/), [Datawrapper's](https://www.datawrapper.de/) blog, and we will summarize them here. The post argues that there are three main questions you should ask any time you create a visual display of your data. We will discuss these three questions below.
### What's your point?
Whenever you have data you're trying to plot, think about what you're actually trying to show. Once you've taken a look at your data, a good title for the plot can be helpful. Your title should **tell viewers what they'll see when they look at the plot**.
### How can you emphasize your point in your chart?
We talked about it in the last lesson, but an incredibly important decision is choosing an appropriate chart for the type of data you have. In the next section of this lesson, we'll discuss what type of data are appropriate for each type of plot in R; however, for now, we'll just focus on an iPhone data example. With this example, we'll discuss that you can emphasize your point by:
* Adding data
* Highlighting data with color
* Annotating your plot
#### Adding data
In any plot that makes a specific claim, it is usually important to show additional data as a reference for comparison. For example, if you were making a plot that suggests that the iPhone has been Apple's most successful product, it would be helpful for the plot to compare iPhone sales with other Apple products, say, the iPad or the iPod. By adding data about other Apple products over time, we can visualize just how successful the iPhone has been compared to other products.
#### Highlighting data with color
Colors help direct viewers' eyes to the most important parts of the figure. Colors tell your readers where to focus their attention. Grays help to tell viewers where to focus less of their attention, while other colors help to highlight the point you're trying to make.
#### Annotate your plot
By highlighting parts of your plot with arrows or text, you can further draw viewers' attention to certain parts of the plot. These are often details that are unnecessary in exploratory plots, where the goal is just to better understand the data, but are very helpful in explanatory plots, when you're trying to draw conclusions from the plot.
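Highlighting with color and annotating can be combined in a few lines of `ggplot2`. The sketch below uses the built-in `iris` dataset instead of the iPhone data, which is not available here. The colors, the label text, and the label position (chosen by eye) are all assumptions for illustration.

```{r}
library(ggplot2)

# Gray out two species and reserve a strong color for the one being discussed.
highlight_colors <- c(setosa = "darkorange", versicolor = "grey70", virginica = "grey70")

highlighted <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Petal.Width, y = Petal.Length, color = Species)) +
  scale_color_manual(values = highlight_colors) +
  # annotate() adds a single text label at fixed coordinates
  annotate("text", x = 0.6, y = 2.5, label = "setosa petals are\nsmall and narrow")
highlighted
```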
### What Does Your Final Chart Show?
A plot title should first tell viewers what they would see in the plot. The second step is to show them with the plot. The third step is to make it extra clear to viewers what they should be seeing with descriptions, annotations, and legends. You explain to viewers what they should be seeing in the plot and the source of your data. Again, these are important pieces of creating a complete explanatory plot, but are not all necessary when making exploratory plots.
#### Write precise descriptions
Whether it's a figure legend at the bottom of your plot, a subtitle explaining what data are plotted, or clear axis labels, text describing clearly what's going on in your plot is important. Be sure that viewers are able to easily determine what each line or point on a plot represents.
#### Add a source
When finalizing an explanatory plot, be sure to source your data. It's always best for readers to know where you obtained your data and what data are being used to create your plot. Transparency is important.
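In `ggplot2`, the title, subtitle, axis labels, and data source can all be supplied through `labs()`. This sketch assumes the `USPop` dataset from the `carData` package; the wording of each label is illustrative.

```{r}
library(ggplot2)
library(carData)

# title and subtitle describe the plot; caption is a natural place for the source.
described <- ggplot(data = USPop) +
  geom_line(mapping = aes(x = year, y = population)) +
  labs(title = "United States Population Over Time",
       subtitle = "Each point on the line is one census year",
       x = "Year",
       y = "Population in Millions",
       caption = "Source: USPop dataset from the carData package")
described
```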
## `ggplot2`: Basics
R was initially developed for statisticians, who often are interested in generating plots or figures to visualize their data. As such, a few basic plotting features were built in when R was first developed. These are all still available; however, over time, a new approach to graphing in R was developed. This new approach implemented what is known as the [grammar of graphics](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448), which allows you to develop elegant graphs flexibly in R. Making plots with this set of rules requires the R package `ggplot2`. This package is a core package in the tidyverse. So as long as the tidyverse has been loaded, you're ready to get started.
```{r message = FALSE}
# load the tidyverse
library(tidyverse)
```
### `ggplot2` Background
The grammar of graphics implemented in `ggplot2` is based on the idea that you can build *any* plot as long as you have a few pieces of information. To start building plots in `ggplot2`, we'll need some data and we'll need to know the type of plot we want to make. The type of plot you want to make in `ggplot2` is referred to as a geom. This will get us started, but the idea behind ggplot2 is that every new concept we introduce will be layered on top of the information you've already learned. In this way, ggplot2 is *layered* - layers of information add on top of each other as you build your graph. In code written to generate a `ggplot2` figure, you will see each line is separated by a plus sign (`+`). Think of each line as a different layer of the graph. We're simply adding one layer on top of the previous layers to generate the graph. You'll see exactly what we mean by this throughout each section in this lesson.
To get started, we'll start with the two basics (data and a geom) and build additional layers from there.
As we get started plotting in `ggplot2`, plots will take the following general form:
```{r eval = FALSE}
ggplot(data = DATASET) +
geom_PLOT_TYPE(mapping = aes(VARIABLE(S)))
```
When using `ggplot2` to generate figures, you will always begin by calling the `ggplot()` function. You'll then specify your dataset within the `ggplot()` function. Then, before making your plot you will also have to specify what **geom** type you're interested in plotting. We'll focus on a few basic geoms in the next section and give examples of each plot type (geom), but for now we'll just work with a single geom: `geom_point`.
`geom_point` is most helpful for creating scatterplots. As a reminder from an earlier lesson, scatterplots are useful when you're looking at the relationship between two numeric variables. Within `geom` you will specify the arguments needed to tell `ggplot2` how you want your plot to look.
You will map your variables using the aesthetic argument **`aes`**. We'll walk through examples below to make all of this clear. However, get comfortable with the overall look of the code now.
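To make the idea of layering concrete, the sketch below builds a plot in two steps. It assumes the `mpg` dataset that ships with `ggplot2` (a different dataset from the one used in the rest of this lesson), and the object names are illustrative.

```{r}
library(ggplot2)

# Data only: this draws an empty panel because no geom has been added yet.
base_plot <- ggplot(data = mpg)
base_plot

# Adding a geom with its aesthetic mapping produces the actual scatterplot.
with_points <- base_plot +
  geom_point(mapping = aes(x = displ, y = hwy))
with_points
```

Storing the plot in a variable is optional, but it shows that each `+` returns a new plot object with one more layer than before.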
### Example Dataset: `diamonds`
To build your first plot in `ggplot2` we'll make use of the fact that there are some datasets already available in R. One frequently-used dataset is known as `diamonds`. This dataset contains prices and other attributes of 53,940 diamonds, with each row containing information about a different diamond. If you look at the first few rows of data, you can get an idea of what data are included in this dataset.
```{r}
diamonds <- as_tibble(diamonds)
diamonds
```
![First 12 rows of diamonds dataset](images/gslides/193.png)
Here you see a lot of numbers and can get an idea of what data are available in this dataset. For example, in looking at the column names across the top, you can see that we have information about how many carats each diamond is (`carat`), some information on the quality of the diamond cut (`cut`), the color of the diamond from J (worst) to D (best) (`color`), along with a number of other pieces of information about each diamond.
We will use this dataset to better understand how to generate plots in R, using `ggplot2`.
### Scatterplots: `geom_point()`
Scatterplots show the relationship between two numeric variables. In `ggplot2` we specify these variables by defining `x` and `y` *within* the `aes()` call. The `x` argument defines which variable will be along the bottom of the plot. The `y` argument defines which variable will be along the left side of the plot. If we wanted to understand the relationship between the number of carats in a diamond and that diamond's price, we may do the following:
```{r}
# generate scatterplot with geom_point()
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price))
```
![diamonds scatterplot](images/gslides/194.png)
In this plot, we see that, in general, the larger the diamond is (or the more carats it has), the more expensive the diamond is (price), which is probably what we would have expected. However, now, we have a plot that definitively supports this conclusion!
### Aesthetics
What if we wanted to alter the size, color or shape of the points? Probably unsurprisingly, these can all be changed within the aesthetics argument. After all, something's aesthetic refers to how something looks. Thus, if you want to change the look of your graph, you'll want to play around with the plot's aesthetics.
In fact, in the plots above you'll notice that we specified what should be on the x and y axis within the `aes()` call. These are aesthetic mappings too! We were telling ggplot2 what to put on each axis, which will clearly affect how the plot looks, so it makes sense that these calls have to occur within `aes()`. Additionally now, we'll focus on arguments within `aes()` that change how the points on the plot look.
#### Point color
In the scatterplot we just generated, we saw that there was a relationship between carat and price, such that the more carats a diamond has, generally, the higher the price. But, it's not a perfectly linear trend. What we mean by that is that not all diamonds that were 2 carats were exactly the same price. And, not all 3 carat diamonds were exactly the same price. What if we were interested in finding out a little bit more about why this is the case?
Well, we could look at the clarity of the diamonds to see whether or not that affects their price. To add clarity to our plot, we could change the color of our points to differ based on clarity:
```{r}
# adjusting color within aes
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price, color = clarity))
```
![changing point colors helps us better understand the data](images/gslides/195_update.png)
Here, we see that not only are the points now colored by clarity, ggplot2 has also automatically added a legend for us with the various classes and their corresponding point color.
The Help pages of the diamonds dataset (accessed using `?diamonds`) state that clarity is "a measurement of how clear the diamond is." The documentation also tells us that I1 is the worst clarity and IF is the best (Full scale: I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF). This makes sense with what we see in the plot. Small (<1 carat) diamonds that have the best clarity level (IF) are some of the most expensive diamonds for their size, while relatively large diamonds (diamonds between 2 and 3 carats) of the lowest clarity (I1) tend to cost less.
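As a quick check of the ordering described in the help page, the levels of the `clarity` factor can be printed directly. This is a small sketch that assumes `ggplot2` is installed (the `diamonds` dataset ships with it); the name `clarity_levels` is ours, chosen only for illustration.

```{r}
library(ggplot2)  # the diamonds dataset ships with ggplot2

# clarity is stored as a factor; levels() lists its categories
# from the lowest level to the highest
clarity_levels <- levels(diamonds$clarity)
clarity_levels

# check whether R treats the levels as ordered
is.ordered(diamonds$clarity)
```

The order of the levels is also the order in which `ggplot2` lists them in the legend.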
By coloring our points by a different variable in the dataset, we now understand our dataset better. This is one of the goals of data visualization! And, specifically, what we're doing here in `ggplot2` is known as **mapping a variable to an aesthetic**. We took another variable in the dataset, mapped it to a color, and then put those colors on the points in the plot. Well, we only told `ggplot2` what variable to map. It took care of the rest!
Of course, we can also *manually* specify the colors of the points on our graph; however, manually specifying the colors of points happens *outside* of the `aes()` call. This is because `ggplot2` does not have to go through the process of mapping the variable to an aesthetic (color in this case). In the code here, `ggplot2` doesn't have to go through the trouble of figuring out which level of the variable is going to be which color on the plot (the mapping to the aesthetic part of the process). Instead, it just colors every point red. Thus, **manually specifying the color of your points happens _outside_ of `aes()`**:
```{r}
# manually control color point outside aes
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price), color = "red")
```
![manually specifying point color occurs outside of `aes()`](images/gslides/196.png)
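A common slip is to put a fixed color *inside* `aes()`. The sketch below contrasts the two placements; the object names `p_inside` and `p_outside` are ours, chosen only for illustration. Inside `aes()`, `ggplot2` treats `"red"` as a one-level variable to be mapped, so the points get a default palette color and a legend entry labeled "red", rather than actually being colored red.

```{r}
library(ggplot2)

# "red" inside aes(): treated as data to map, not as a color to use
p_inside <- ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price, color = "red"))

# "red" outside aes(): a fixed setting applied to every point
p_outside <- ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price), color = "red")

p_inside
p_outside
```

If a plot shows an unexpected legend with a single entry, checking whether a fixed value ended up inside `aes()` is a good first step.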
#### Point size
As above, we can change the point size by mapping another variable to the `size` argument within `aes`:
```{r}
# adjust point size within aes
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price, size = clarity))
```
![mapping to size changes point size on plot](images/gslides/197.png)
As above, `ggplot2` handles the mapping process. All you have to do is specify what variable you want mapped (`clarity`) and how you want ggplot2 to handle the mapping (change the point `size`). With this code, you do get a warning when you run it in R that using a "discrete variable is not advised." This is because mapping to size is usually done for numeric variables, rather than categorical variables like clarity.
This makes sense here too. The relationship between clarity, carat, and price was easier to visualize when clarity was mapped to `color` than here where it is mapped to `size`.
Like the above example with color, the size of *every* point can be changed by calling `size` outside of `aes`:
```{r}
# global control of point size
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price), size = 4.5)
```
![manually specifying point size of all points occurs outside of `aes()`](images/gslides/198.png)
Here, we have manually increased the size of *all* the points on the plot.
#### Point Shape
You can also change the shape of the points (`shape`). We've used solid, filled circles thus far (the default in `geom_point`), but we could specify a different shape for each clarity.
```{r}
# map clarity to point shape within aes
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price, shape = clarity))
```
![mapping clarity to shape](images/gslides/199.png)
Here, while the mapping occurs correctly within ggplot2, we do get a warning message that more than six different shapes become difficult to discriminate. ggplot2's default shape palette therefore provides only six shapes, so the points for the remaining clarity levels are left off the plot unless you specify shapes manually. This suggests that while you *can* do something, it's not always *best* to do that thing. Here, with eight levels of clarity, it's best to stick to mapping this variable to `color` as we did initially.
To manually specify a shape for all the points on your plot, you would specify it outside of `aes` using one of the numbered shape options available (codes 0 through 25):
![options for points in ggplot2's `shape`](images/gslides/200.png)
For example, to plot all of the points on the plot as filled diamonds (it is a dataset about diamonds after all...), you would specify `shape = 18`:
```{r}
# global control of point shape outside aes
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price), shape = 18)
```
![specifying filled diamonds as shape for all points manually](images/gslides/201.png)
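If you do need a distinct shape for each of the eight clarity levels, one option is to supply the shape codes yourself with `scale_shape_manual()`. The sketch below assumes that approach; the vector name `clarity_shapes` and the particular codes in it are our arbitrary choices.

```{r}
library(ggplot2)

# one shape code per clarity level (eight levels, so eight codes);
# this particular choice of codes is arbitrary
clarity_shapes <- c(0, 1, 2, 5, 6, 15, 16, 17)

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price, shape = clarity)) +
  # tell ggplot2 which shape to use for each level, in level order
  scale_shape_manual(values = clarity_shapes)
```

Even with all eight levels drawn, that many shapes remain hard to tell apart, so `color` is usually still the better choice for this variable.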
### Facets
In addition to mapping variables to different aesthetics, you can also opt to use facets to help make sense of your data visually. Rather than plotting all the data on a single plot and visually altering the point size or color of a third variable in a scatterplot, you could break each level of that third variable out into a separate subplot. To do this, you would use faceting. Faceting is particularly helpful for looking at categorical variables.
To use faceting, you would add an additional layer (`+`) to your code and use the `facet_wrap()` function. Within `facet_wrap()`, you specify the variable by which you want your subplots to be made:
```{r}
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price)) +
# facet by clarity
facet_wrap(~clarity, nrow = 2)
```
Here, read the tilde as the word "by". Specifically here, we want a scatterplot of the relationship between carat and price and we want it faceted (broken down) **by (~)** clarity.
![facet_wrap breaks plots down into subplots](images/gslides/202.png)
Now, we have eight different plots, one for each level of clarity, where we can see the relationship between diamond carats and price.
You'll note here we've opted to specify that we want 2 rows of subplots (`nrow = 2`). You can play around with the number of rows you want in your output to customize how your output plot appears.
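As a sketch of two variations on the faceting code: `ncol` sets the number of columns rather than the number of rows, and `facet_grid()` (another `ggplot2` function) lays subplots out by two variables at once, with one variable defining the rows and the other the columns. The object names `p_cols` and `p_grid` are ours.

```{r}
library(ggplot2)

# eight clarity levels arranged in 2 columns (and therefore 4 rows)
p_cols <- ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price)) +
  facet_wrap(~clarity, ncol = 2)

# one row of subplots per cut, one column per clarity
p_grid <- ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price)) +
  facet_grid(cut ~ clarity)

p_cols
p_grid
```

With five cuts and eight clarity levels, the grid contains forty small panels, so this layout works best on a large plotting area.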
### Geoms
Thus far in this lesson we've only looked at scatterplots, which means we've only called `geom_point`. However, there are *many* additional geoms that we could call to generate different plots. Simply, a *geom* is just a shape we use to represent the data. In a scatterplot, the geom is the point itself: each observation is plotted individually. Other plots, such as the boxplots, barplots, and histograms we described in previous lessons, help to summarize or represent the data in a meaningful way, without plotting each individual point. The shapes used in these different types of plots to represent what's going on in the data are that plot's geoms.
To see exactly what we mean by geoms being "shapes that represent the data", let's keep using the `diamonds` dataset, but instead of looking at the relationship between two numeric variables in a scatterplot, let's take a step back and take a look at a single numeric variable using a histogram.
#### Histograms: `geom_histogram`
To review, histograms allow you to quickly visualize the range of values your variable takes and the shape of your data. (Are all the numbers clustered around the center? Or, are they all at the extremes of the range? Somewhere in between? The answers to these questions describe the "shape" of the values of your variable.)
For example, if we wanted to see what the distribution of carats was for these data, we could do the following.
```{r}
# change geom_ to generate histogram
ggplot(data = diamonds) +
geom_histogram(mapping = aes(carat))
```
![histogram of carat shows range and shape](images/gslides/203.png)
The code follows what we've seen so far in this lesson; however, we've now called `geom_histogram` to specify that we want to plot a histogram rather than a scatterplot.
Here, the rectangular boxes on the plot are geoms (shapes) that represent the number of diamonds that fall into each bin on the plot. Rather than plotting each individual point, histograms use rectangular boxes to summarize the data. This summarization helps us quickly understand what's going on in our dataset.
Specifically here, we can quickly see that most of the diamonds in the dataset are less than 1 carat. This is not necessarily something we could be sure of from the scatterplots generated previously in this lesson (since some points could have been plotted directly on top of one another). Thus, it's often helpful to visualize your data in a number of ways when you first get a dataset to ensure that you understand the variables and relationships between variables in your dataset!
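When you run the histogram code, `ggplot2` prints a message noting that it used 30 bins by default and suggesting that you pick a better value. As a minimal sketch, the bin width can be set explicitly with the `binwidth` argument of `geom_histogram()`, and the share of diamonds under 1 carat can be checked numerically; the name `share_under_1` is ours.

```{r}
library(ggplot2)

# each bar now covers a range of 0.1 carat
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.1)

# proportion of diamonds lighter than 1 carat, as a numeric check
share_under_1 <- mean(diamonds$carat < 1)
share_under_1
```

Trying a few different bin widths is worthwhile, since the apparent shape of a distribution can change with the binning.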
#### Barplots: `geom_bar`
Barplots show the relationship between a set of numbers and a **categorical** variable. In the diamonds dataset, we may be interested in knowing how many diamonds there are of each cut of diamonds. There are five categories for cut of diamond. If we make a barplot for this variable, we can see the number of diamonds in each category.
```{r}
# geom_bar for bar plots
ggplot(data = diamonds) +
geom_bar(mapping = aes(cut))
```
Again, the changes to the code are minimal. We are now interested in plotting the categorical variable `cut` and state that we want a bar plot, by including `geom_bar()`.
![diamonds barplot](images/gslides/204.png)
Here, we again use rectangular shapes to represent the data, but we're not showing the distribution of a single variable (as we were with `geom_histogram`). Rather, we're using rectangles to show the count (number) of diamonds within each category within cut. Thus, we need a different geom: `geom_bar`!
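The heights of these bars are simply row counts, so they can be reproduced without a plot. The sketch below uses base R's `table()`; the name `cut_counts` is ours.

```{r}
library(ggplot2)

# count the rows in each cut category; these are the bar heights
cut_counts <- table(diamonds$cut)
cut_counts
```

Comparing a table like this against the plot is a simple way to confirm you are reading the bars correctly.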
#### Boxplots: `geom_boxplot`
Boxplots provide a summary of a numerical variable across categories. For example, if you were interested to see how the price of a diamond (a numerical variable) changed across different diamond color categories (categorical variable), you may want to use a boxplot. To do so, you would specify that using `geom_boxplot`:
```{r}
# geom_boxplot for boxplots
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = color, y = price))
```
In the code, we see that again, we only have to change what variables we want to be included in the plot and the type of plot (or geom). We want to use `geom_boxplot()` to get a basic boxplot.
![diamonds boxplot](images/gslides/205.png)
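In the figure itself we see that the median price (the black horizontal bar in the middle of each box represents the median for each category) generally increases as the diamond color moves from the best category (D) to the worst (J). This may seem backwards, but price also depends on other variables, such as carat, that this plot does not account for.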
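To read the medians more precisely than the figure allows, they can be computed directly. This sketch uses base R's `tapply()`; the name `median_price_by_color` is ours.

```{r}
library(ggplot2)

# median price within each color category, in the same left-to-right
# order as the boxes on the plot (D first, J last)
median_price_by_color <- tapply(diamonds$price, diamonds$color, median)
median_price_by_color
```

Each value corresponds to the horizontal bar inside the matching box.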
Now, if you wanted to change the color of this boxplot, it would just take a small addition to the code for the plot you just generated.
```{r}
# fill outside aes sets the fill color of every box
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = color, y = price),
fill = "red")
```
![diamonds boxplot with red fill](images/gslides/206.png)
Here, by specifying the color "red" in the `fill` argument, you're able to change the plot's appearance. In the next lesson, we'll go deeper into the many ways in which a plot can be customized within `ggplot2`!
#### Other plots
While we've reviewed basic code to make a few common types of plots, there are a number of other plot types that can be made in `ggplot2`. These are listed in the [online reference material for ggplot2](http://ggplot2.tidyverse.org/reference/) or can be accessed through RStudio directly. To do so, you would type `?geom_` into the Console in RStudio. A list of geoms will appear. You can hover your cursor over any one of these to get a short description.
![?geom in Console](images/gslides/207.png)
Or, you can select a geom from this list and press Enter. After selecting a geom, such as `geom_abline`, and pressing Enter, the help page for that geom will pop up in the 'Help' tab at bottom right. Here, you can find more detailed information about the selected geom.
![geom_abline help page](images/gslides/208.png)
### EDA Plots
As mentioned previously, an important step after you've read your data into R and wrangled it into a tidy format is to carry out **Exploratory Data Analysis (EDA)**. EDA is the process of understanding the data in your dataset fully. To understand your dataset fully, you need a full understanding of the variables stored in your dataset, what information you have and what information you don't have (missingness!). To gain this understanding, we've discussed using packages like `skimr` to get a quick idea of what information is stored in your dataset. However, generating plots is another critical step in this process. We encourage you to use `ggplot2` to understand the distribution of each single variable as well as the relationship between each variable in your dataset.
In this process, using `ggplot2` defaults is totally fine. These plots do not have to be the most effective visualizations for communication, so you don't want to spend a ton of time making them visually perfect. Only spend as much time on these as you need to understand your data!
## `ggplot2`: Customization
So far, we have walked through the steps of generating a number of different graphs (using different `geoms`) in `ggplot2`. We discussed the basics of mapping variables to your graph to customize its appearance or aesthetic (using size, shape, and color within `aes()`). Here, we'll build on what we've previously learned to really get down to how to customize your plots so that they're as clear as possible for communicating your results to others.
The skills learned in this lesson will help take you from generating exploratory plots that help *you* better understand your data to explanatory plots -- plots that help you communicate your results *to others*. We'll cover how to customize the colors, labels, legends, and text used on your graph.
Since we're already familiar with it, we'll continue to use the `diamonds` dataset that we've been using to learn about `ggplot2`.
### Colors
To get started, we'll learn how to control color across plots in `ggplot2`. Previously, we discussed using color within `aes()` on a scatterplot to automatically color points by the clarity of the diamond when looking at the relationship between price and carat.
```{r}
ggplot(diamonds) +
geom_point(mapping = aes(x = carat, y = price, color = clarity))
```
However, what if we wanted to carry this concept over to a bar plot and look at how many diamonds we have of each clarity group?
```{r}
# generate bar plot
ggplot(diamonds) +
geom_bar(aes(x = clarity))
```
![diamonds broken down by clarity](images/gslides/209.png)
As a general note, we've stopped including `data =` and `mapping =` here within our code. We included it so far to be explicit; however, in code you see in the world, the names of the arguments will typically be excluded and we want you to be familiar with code that appears as you see above.
OK, well that's a start since we see the breakdown, but all the bars are the same color. What if we adjusted color within `aes()`?
```{r}
# color changes outline of bar
ggplot(diamonds) +
geom_bar(aes(x = clarity, color = clarity))
```
![color does add color but around the bars](images/gslides/210.png)
As expected, color added a legend for each level of clarity; however, it colored the lines around the bars on the plot, rather than the bars themselves. In order to color the bars themselves, you want to specify the more helpful argument `fill`:
```{r}
# use fill to color bars
ggplot(diamonds) +
geom_bar(aes(x = clarity, fill = clarity))
```
![`fill` automatically colors the bars](images/gslides/211.png)
Great! We now have a plot with bars of different colors, which was our first goal! However, adding colors here, while maybe making the plot prettier doesn't actually give us any more information. We can see the same pattern of which clarity is most frequent among the diamonds in our dataset like we could see in the first plot we made.
Color is particularly helpful here, however, if we wanted to map a *different* variable onto each bar. For example, what if we wanted to see the breakdown of diamond "cut" within each "clarity" bar?
```{r}
# fill by separate variable (cut) = stacked bar chart
ggplot(diamonds) +
geom_bar(aes(x = clarity, fill = cut))
```
![mapping a different variable to fill provides new information](images/gslides/212.png)
Now we're getting some new information! We can see that each level in clarity appears to have diamonds of all levels of cut. Color here has really helped us understand more about the data.
But what if we were going to present these data? While there is no comparison between red and green (which is good!), there is a fair amount of yellow in this plot. Some projectors don't handle projecting yellow well, and it will show up too light on the screen. To avoid this, let's manually change the colors in this bar chart! To do so we'll add an additional layer to the plot using `scale_fill_manual`.
```{r}
ggplot(diamonds) +
geom_bar(aes(x = clarity, fill = cut)) +
# manually control colors used
scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4"))
```
![manually setting colors using `scale_fill_manual`](images/gslides/213.png)
Here, we've specified five different colors within the `values` argument of `scale_fill_manual()`, one for each cut of diamond. The names of these colors can be specified using the names explained on the third page of the cheatsheet [here](https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf). (Note: There are other ways to specify colors within R. Explore the details in that cheatsheet to better understand the various ways!)
Additionally, it's important to note that here we've used `scale_fill_manual()` to adjust the color of what was mapped using `fill = cut`. If we had colored our chart using `color` within `aes()`, we would instead use a different function called `scale_color_manual()`. This makes good sense! You use `scale_fill_manual()` with `fill` and `scale_color_manual()` with `color`. Keep that in mind as you adjust colors in the future!
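As a minimal sketch of the `color` counterpart, the scatterplot below maps `cut` to `color` and sets the colors with `scale_color_manual()`. The palette `cut_colors` is a hypothetical named vector of our own: when the names match the levels of the variable, each level is paired with its color by name rather than by position.

```{r}
library(ggplot2)

# hypothetical named palette: names must match the levels of cut
cut_colors <- c("Fair" = "red", "Good" = "orange", "Very Good" = "darkgreen",
                "Premium" = "dodgerblue", "Ideal" = "purple4")

# color (not fill) is mapped, so the matching scale is scale_color_manual()
ggplot(diamonds) +
  geom_point(aes(x = carat, y = price, color = cut)) +
  scale_color_manual(values = cut_colors)
```

Naming the colors makes the code less fragile, since the pairing no longer depends on the order in which the colors are listed.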
Now that we have some sense of which clarity is most common in our diamonds dataset and now that we are able to successfully specify the colors we want manually in order to make this plot useful for presentation, what if we wanted to compare the proportion of each cut across the different clarities? Currently, that's difficult because there is a different number within each clarity. In order to compare the proportion of each cut we have to use **position adjustment**.
What we've just generated is a **stacked bar chart**. It's a pretty good name for this type of chart as the bars for cut are all stacked on top of one another. If you don't want a stacked bar chart you could use one of the other `position` options: `identity`, `fill`, or `dodge`.
Returning to our question about proportion of each cut within each clarity group, we'll want to use `position = "fill"` within `geom_bar()`. Building off of what we've already done:
```{r}
ggplot(diamonds) +
# fill scales to 100%
geom_bar(aes(x = clarity, fill = cut), position = "fill") +
scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4"))
```
![`position = "fill"` allows for comparison of proportion across groups](images/gslides/214.png)
Here, we've specified how we want to adjust the position of the bars in the plot. Each bar is now of equal height and we can compare each colored bar across the different clarities. As expected, among the best clarity group (IF), we see more diamonds of the best cut ("Ideal")!
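The proportions that `position = "fill"` draws can also be computed as numbers. This sketch uses base R's `table()` and `prop.table()`; the names `counts` and `cut_share` are ours.

```{r}
library(ggplot2)

# counts of each cut within each clarity level
# (rows = clarity, columns = cut)
counts <- table(diamonds$clarity, diamonds$cut)

# divide each row by its total so every clarity level sums to 1
cut_share <- prop.table(counts, margin = 1)
round(cut_share, 2)
```

Each row of the result corresponds to one bar in the plot, and each entry to the height of one colored segment.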
Briefly, we'll take a quick detour to look at `position = "dodge"`. This position adjustment places each object *next to one another*. This will not allow for comparison of proportions across groups as easily as the last plot did, but it will allow values within each clarity group to be visualized.
```{r eval = FALSE}
ggplot(diamonds) +
# dodge rather than stack produces grouped bar plot
geom_bar(aes(x = clarity, fill = cut), position = "dodge") +
scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4"))
```
![`position = "dodge"` helps compare values within each group](images/gslides/215.png)
Unlike in the first plot where we specified `fill = cut`, we can actually see the relationship between each cut within the lowest clarity group (I1). Before, when the values were stacked on top of one another, we were not able to visually see that there were more "Fair" and "Premium" cut diamonds in this group than the other cuts. Now, with `position = "dodge"`, this information is visually apparent.
Note: `position = "identity"` is not very useful for bars, as it *places each object exactly where it falls within the graph*. For bar charts, this will lead to *overlapping bars*, which is not visually helpful. However, for scatterplots (and other 2-Dimensional charts), this is the default and is exactly what you want.
### Labels
Text on plots is incredibly helpful. A good title tells viewers what they should be getting out of the plot. Axis labels are incredibly important to inform viewers of what's being plotted. Annotations on plots help guide viewers to important points in the plot. We'll discuss how to control all of these now!
#### Titles
Now that we have an understanding of how to manually adjust color, let's improve the clarity of our plots by including helpful labels by adding an additional `labs()` layer. We'll return to the plot where we were comparing proportions of diamond cut across diamond clarity groups.
You can include a `title`, `subtitle`, and/or `caption` within the `labs()` function. Each argument, as per usual, will be separated by a comma.
```{r}
ggplot(diamonds) +
geom_bar(aes(x = clarity, fill = cut), position = "fill") +
scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
# add titles
labs(title = "Clearer diamonds tend to be of higher quality cut",
subtitle = "The majority of IF diamonds are an \"Ideal\" cut")
```
![`labs()` adds helpful titles, subtitles, and captions](images/gslides/216.png)
#### Axis labels
You may have noticed that our y-axis label says "count", but it's not actually a count anymore. In reality, it's a proportion. Having appropriately labeled axes is *so important*. Otherwise, viewers won't know what's being plotted. So, we should really fix that now using the `ylab()` function. Note: we won't be changing the x-axis label, but if you were interested in doing so, you would use `xlab("label")`.
```{r}
ggplot(diamonds) +
geom_bar(aes(x = clarity, fill = cut), position = "fill") +
scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
labs(title = "Clearer diamonds tend to be of higher quality cut",
subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
# add y axis label explicitly
ylab("proportion")
```
Note that the x- and y-axis labels can *also* be changed within `labs()`, using the arguments `x =` and `y =`, respectively.
![Accurate axis labels are incredibly important](images/gslides/217.png)
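As a short sketch of the `labs()` route, the same labels can be set in a single call instead of using `xlab()` and `ylab()`. The object name `p_labeled` is ours, and the color scale is left at its default to keep the example brief.

```{r}
library(ggplot2)

p_labeled <- ggplot(diamonds) +
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  # title and both axis labels set together inside labs()
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       x = "clarity",
       y = "proportion")

p_labeled
```

Keeping all of the text in one `labs()` call can make it easier to find and edit later.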
### Themes
To change the overall aesthetic of your graph, there are 8 themes built into `ggplot2` that can be added as an additional layer in your graph:
![themes](images/gslides/218.png)
For example, if we wanted to remove the gridlines and grey background from the chart, we would use `theme_classic()`. Building on what we've already generated:
```{r}
ggplot(diamonds) +
geom_bar(aes(x = clarity, fill = cut), position = "fill") +
scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
labs(title = "Clearer diamonds tend to be of higher quality cut",
subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
ylab("proportion") +
# change plot theme
theme_classic()
```
![`theme_classic` changes aesthetic of our plot](images/gslides/219.png)
We now have a pretty good looking plot! However, a few additional changes would make this plot *even better* for communication.
Note: Additional themes are available from the [`ggthemes` package](https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/). Users can also generate their own themes.
### Custom Theme
In addition to using available themes, we can also adjust parts of the theme of our graph using an additional `theme()` layer. There are **a lot** of options within theme. To see them all, look at the help documentation within RStudio Cloud using: `?theme`. We'll simply go over the syntax for using a few of them here to get you comfortable with adjusting your theme. Later on, you can play around with all the options on your own to become an expert!
#### Altering text size
For example, if we want to increase text size to make our plots easier to view when presenting, we could do that within theme. Notice here that we're increasing the text size of the `title`, `axis.text`, `axis.title`, and `legend.text` all within `theme()`! The syntax here is important. Within each of the elements of the theme you want to alter, you have to specify what it is you want to change. Here, for all four, we want to alter the text, so we specify `element_text()`. Within that, we specify that it's `size` that we want to adjust.
```{r}
ggplot(diamonds) +
geom_bar(aes(x = clarity, fill = cut), position = "fill") +
scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
labs(title = "Clearer diamonds tend to be of higher quality cut",
subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
ylab("proportion") +
theme_classic() +
# control theme
theme(title = element_text(size = 16),
axis.text = element_text(size = 14),
axis.title = element_text(size = 16),
legend.text = element_text(size = 14))
```
![`theme()` allows us to adjust font size](images/gslides/220.png)
#### Additional text alterations
Changing the size of text on your plot is not the only thing you can control within `theme()`. You can make text **bold** and change its color within `theme()`. Note here that multiple changes can be made to a single element. We can change size and make the text **bold**. All we do is separate each argument with a comma, per usual.
```{r}
ggplot(diamonds) +
geom_bar(aes(x = clarity, fill = cut), position = "fill") +
scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
labs(title = "Clearer diamonds tend to be of higher quality cut",
subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
ylab("proportion") +
theme_classic() +
theme(title = element_text(size = 16),
axis.text = element_text(size = 14),
axis.title = element_text(size = 16, face = "bold"),
legend.text = element_text(size = 14),
# additional control
plot.subtitle = element_text(color = "gray30"))
```
![`theme()` allows us to tweak many parts of our plot](images/gslides/221.png)
Any alterations to plot spacing/background, title, axis, and legend will all be made within `theme()`.
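Because the same `theme()` settings tend to be repeated from plot to plot, one option is to bundle them into a small function and add that as a single layer. This is only a sketch: `theme_presentation()` is a hypothetical helper of our own, not a function provided by `ggplot2`.

```{r}
library(ggplot2)

# hypothetical helper: bundle the theme settings once so they can be reused
theme_presentation <- function() {
  theme_classic() +
    theme(title = element_text(size = 16),
          axis.text = element_text(size = 14),
          axis.title = element_text(size = 16, face = "bold"),
          legend.text = element_text(size = 14),
          plot.subtitle = element_text(color = "gray30"))
}

# apply it like any built-in theme
p_themed <- ggplot(diamonds) +
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  ylab("proportion") +
  theme_presentation()

p_themed
```

Any later change to the text sizes then only has to be made in one place.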
### Legends
At this point, all the text on the plot is pretty visible! However, there's one thing that's still not quite clear to viewers. In daily life, people refer to the "cut" of a diamond by terms like "round cut" or "princess cut" to describe the *shape* of the diamond. That's not what we're talking about here when we're discussing "cut". In these data, "cut" refers to the quality of the diamond, not the shape. Let's be sure that's clear as well! We can change the name of the legend by using an additional layer and the `guides()` and `guide_legend()` functions of the `ggplot2` package!
```{r}
ggplot(diamonds) +
geom_bar(aes(x = clarity, fill = cut), position = "fill") +
scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
labs(title = "Clearer diamonds tend to be of higher quality cut",
subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
ylab("proportion") +
theme_classic() +
theme(title = element_text(size = 16),
axis.text = element_text(size = 14),
axis.title = element_text(size = 16, face = "bold"),
legend.text = element_text(size = 14),
plot.subtitle = element_text(color = "gray30")) +
# control legend
guides(fill = guide_legend("cut quality"))
```
![`guides()` allows us to change the legend title](images/gslides/222.png)
This `guides()` function, as well as the `guide_*` functions (such as `guide_legend()`), allow us to modify legends even further.
This is especially useful if you have many colors in your legend and you want to control how the legend is displayed in terms of the number of columns and rows using `ncol` and `nrow` respectively.
```{r}
ggplot(diamonds) +
geom_bar(aes(x = clarity, fill = cut), position = "fill") +
scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
labs(title = "Clearer diamonds tend to be of higher quality cut",
subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
ylab("proportion") +
theme_classic() +
theme(title = element_text(size = 16),
axis.text = element_text(size = 14),
axis.title = element_text(size = 16, face = "bold"),
legend.text = element_text(size = 14),
plot.subtitle = element_text(color = "gray30")) +
# control legend
guides(fill = guide_legend("Cut Quality",
ncol = 2))
```
Or, we can modify the font of the legend title using the `title.theme` argument of `guide_legend()`.
```{r}
ggplot(diamonds) +
geom_bar(aes(x = clarity, fill = cut), position = "fill") +
scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
labs(title = "Clearer diamonds tend to be of higher quality cut",
subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
ylab("proportion") +
theme_classic() +
theme(title = element_text(size = 16),
axis.text = element_text(size = 14),
axis.title = element_text(size = 16, face = "bold"),
legend.text = element_text(size = 14),
plot.subtitle = element_text(color = "gray30")) +
# control legend
guides(fill = guide_legend("Cut Quality",
title.theme = element_text(face = "bold")))
```
Alternatively, we can do this modification, as well as other legend modifications, like adding a rectangle around the legend, using the `theme()` function.
```{r}
ggplot(diamonds) +
geom_bar(aes(x = clarity, fill = cut), position = "fill") +
scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
labs(title = "Clearer diamonds tend to be of higher quality cut",
subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
ylab("proportion") +
# changing the legend title:
guides(fill = guide_legend("Cut Quality")) +
theme_classic() +
theme(title = element_text(size = 16),
axis.text = element_text(size = 14),
axis.title = element_text(size = 16, face = "bold"),
legend.text = element_text(size = 14),
plot.subtitle = element_text(color = "gray30"),
# changing the legend style:
legend.title = element_text(face = "bold"),
legend.background = element_rect(color = "black"))
```
At this point, we have an informative title, clear colors, a well-labeled legend, and text that is large enough throughout the graph. This is certainly a graph that could be used in a presentation. We've taken it from a graph that is useful to just ourselves (exploratory) and made it into a plot that can communicate our findings well to others (explanatory)!
We have touched on a number of alterations you can make by adding additional layers to a ggplot. In the rest of this lesson we'll touch on a few more changes you can make within `ggplot2`.
### Scales
There may be times when you want a different number of values to be displayed on an axis. The scale of your plot for **continuous variables** (i.e. numeric variables) can be controlled using `scale_x_continuous` or `scale_y_continuous`. Here, we want to increase the number of labels displayed on the y-axis, so we'll use `scale_y_continuous`:
```{r}
ggplot(diamonds) +
geom_bar(aes(x = clarity)) +
# control scale for continuous variable
scale_y_continuous(breaks = seq(0, 17000, by = 1000))
```
![Continuous scales can be altered](images/gslides/223.png)
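To see what the `breaks` argument receives, it can help to run `seq()` on its own. The short sketch below does that and compares the result with the tallest bar in the plot. The object names `y_breaks` and `clarity_counts` are only illustrative.

```{r}
library(ggplot2)

# the vector passed to `breaks`: 0, 1000, 2000, ..., 17000
y_breaks <- seq(0, 17000, by = 1000)
length(y_breaks)

# the bar heights are the number of diamonds in each clarity category
clarity_counts <- table(diamonds$clarity)
max(clarity_counts)

# breaks above the tallest bar fall outside the range of the data,
# so they are not drawn on the axis
y_breaks[y_breaks > max(clarity_counts)]
```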
There is a very handy argument called `trans` for the `scale_y_continuous()` and `scale_x_continuous()` functions that changes the scale of the axes. For example, a logarithmic scale can be very useful when your values span several orders of magnitude. (In `ggplot2` 3.5.0 and later this argument is named `transform`; `trans` still works but is deprecated.)
According to the documentation for the `trans` argument:
> Built-in transformations include "asn", "atanh", "boxcox", "date", "exp", "hms", "identity", "log", "log10", "log1p", "log2", "logit", "modulus", "probability", "probit", "pseudo_log", "reciprocal", "reverse", "sqrt" and "time".
```{r}
ggplot(diamonds) +
geom_bar(aes(x = clarity)) +
# control scale for continuous variable
scale_y_continuous(trans = "log10") +
labs(y = "Count (log10 scale)",
x = "Clarity")
```
Notice that the values are not changed, just the way they are plotted. Now the y-axis increases by a factor of 10 for each break.
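As a small illustration of why each break is a factor of 10 apart, we can take the log10 of a few made-up counts (the vector `counts` below is hypothetical, not taken from the data). Values that differ by a factor of 10 end up evenly spaced on the log10 scale:

```{r}
# values that are each 10 times larger than the previous one
counts <- c(10, 100, 1000, 10000)

# on the log10 scale they are evenly spaced, one unit apart
log10(counts)
diff(log10(counts))
```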
We will create a plot of the price of the diamonds to demonstrate the utility of creating a plot with a log10 scaled y-axis.
```{r}
ggplot(diamonds) +
geom_boxplot(aes(y = price, x = clarity))
ggplot(diamonds) +
geom_boxplot(aes(y = price, x = clarity)) +
scale_y_continuous(trans = "log10") +
labs(y = "Price (log10 scale)",
x = "Diamond Clarity")
```
In the first plot, it is difficult to tell what values the boxplots correspond to, and it is difficult to compare them (particularly for the last three clarity categories). Both are much easier in the second plot.
We can also use another argument of the `scale_y_continuous()` function to add specific labels to our plot. For example, it would be nice to add dollar signs to the y-axis. We can do so using the `labels` argument. A variety of `label_*` functions within the `scales` package can be used to modify axis labels. See [here](https://scales.r-lib.org/reference/index.html) to take a look at the many options.
```{r}
ggplot(diamonds) +
geom_boxplot(aes(y = price, x = clarity)) +
scale_y_continuous(trans = "log10",
labels = scales::label_dollar()) +
labs(y = "Price (log10 scale)",
x = "Diamond Clarity")
```
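To get a feel for what the `labels` argument is doing here, we can call the labelling function directly. `scales::label_dollar()` returns a function, and `ggplot2` applies that function to the axis breaks. The prices below are just example values, and `dollar_labeller` is an illustrative name.

```{r}
# label_dollar() returns a function that formats numbers as dollar amounts
dollar_labeller <- scales::label_dollar()

# apply it to a few example prices
dollar_labeller(c(326, 2401, 18823))
```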
In the above plot, we might also want to order the boxplots by median price. We can do so using the `fct_reorder()` function of the `forcats` package, which reorders the `clarity` levels based on the median of the `price` values.
```{r}
ggplot(diamonds) +
geom_boxplot(aes(y = price, x = forcats::fct_reorder(clarity, price, .fun = median))) +
scale_y_continuous(trans = "log10",
labels = scales::label_dollar()) +
labs(y = "Price (log10 scale)",
x = "Diamond Clarity")
```
Now we can more easily see that the `SI2` diamonds have the highest median price.
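If we want to check the ordering that `fct_reorder()` produced, one option is to compute the medians ourselves. This is only a quick sketch using base R's `tapply()`; the name `median_price` is illustrative.

```{r}
library(ggplot2)

# median price for each clarity category, sorted from lowest to highest
median_price <- sort(tapply(diamonds$price, diamonds$clarity, median))
median_price

# fct_reorder() arranges the factor levels in the same (ascending) order
levels(forcats::fct_reorder(diamonds$clarity, diamonds$price, .fun = median))
```

The last category listed is the one that appears furthest to the right in the plot.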
We can also modify the scale for **discrete variables** (aka factors or categorical variables, which have a limited number of levels) using `scale_x_discrete` or `scale_y_discrete`. In this case we will pick just a few of the clarity categories to plot, and we will specify their order.
```{r, eval = FALSE}
ggplot(diamonds) +
geom_bar(aes(x = clarity)) +
# control scale for discrete variable
scale_x_discrete(limits = c("SI2", "SI1", "I1")) +
scale_y_continuous(breaks = seq(0, 17000, by = 1000))
```
![Discrete scales can be altered](images/gslides/224.png)
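Categories that are not listed in `scale_x_discrete()` are dropped from the plot, and `ggplot2` warns that rows were removed. If we would rather make that step explicit, a roughly equivalent sketch is to subset the data and set the factor order ourselves before plotting. The names `keep` and `diamonds_subset` are only illustrative.

```{r}
library(ggplot2)

# clarity categories to keep, in the order they should appear
keep <- c("SI2", "SI1", "I1")

# keep only those rows and reorder the factor levels to match
diamonds_subset <- diamonds[diamonds$clarity %in% keep, ]
diamonds_subset$clarity <- factor(diamonds_subset$clarity, levels = keep)

ggplot(diamonds_subset) +
  geom_bar(aes(x = clarity))
```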
### Coordinate Adjustment
There are times when you'll want to flip your axes. This can be accomplished using `coord_flip()`. Adding this as an additional layer to the plot we just generated switches the x- and y-axes, giving horizontal bars rather than the default vertical bars:
```{r}
ggplot(diamonds) +
geom_bar(aes(x = clarity)) +
scale_y_continuous(breaks = seq(0, 17000, by = 1000)) +
scale_x_discrete(limits = c("SI2", "SI1", "I1")) +
# flip coordinates
coord_flip() +
labs(title = "Clearer diamonds tend to be of higher quality cut",
subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
ylab("proportion") +
theme_classic() +
theme(title = element_text(size = 18),
axis.text = element_text(size = 14),
axis.title = element_text(size = 16, face = "bold"),
legend.text = element_text(size = 14),
plot.subtitle = element_text(color = "gray30")) +
guides(fill = guide_legend("cut quality"))
```
![Axes can be flipped using `coord_flip`](images/gslides/225.png)
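As a side note, in `ggplot2` 3.3.0 and later a horizontal bar chart can also be drawn by mapping the variable to `y` instead of `x`, without `coord_flip()`. The sketch below assumes a recent version of `ggplot2`; `horizontal_bars` is an illustrative name.

```{r}
library(ggplot2)

# mapping clarity to y draws horizontal bars without coord_flip()
horizontal_bars <- ggplot(diamonds) +
  geom_bar(aes(y = clarity))

horizontal_bars
```

With this approach the counts are on the x-axis, so any scale adjustments for the counts would use `scale_x_continuous()` rather than `scale_y_continuous()`.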
It's important to remember that all the additional alterations we already discussed can still be applied to this graph, because `ggplot2` builds plots in layers!
```{r}
p <- ggplot(diamonds) +
geom_bar(mapping = aes(x = clarity)) +
scale_y_continuous(breaks = seq(0, 17000, by = 1000)) +
scale_x_discrete(limits = c("SI2", "SI1", "I1")) +
coord_flip() +
labs(title = "Number of diamonds by diamond clarity",