
# Visualizing Data in the Tidyverse {#dataviz}

```{r, include=FALSE}
knitr::opts_chunk$set(fig.path = "images/dataviz-")
```

## About This Course

Data visualization is a critical part of any data science project. Once data have been imported and wrangled into place, visualizing your data can help you get a handle on what's going on in the dataset. Similarly, once you've completed your analysis and are ready to present your findings, data visualizations are a highly effective way to communicate your results to others. In this course we will cover what data visualization is and define some of the basic types of data visualizations.

In this course you will learn about the ggplot2 R package, a powerful and widely used set of tools for making stunning data graphics. You will learn about different types of plots, how to construct effective plots, and what makes for a successful or unsuccessful visualization.

In this specialization we assume familiarity with the R programming language. If you are not yet familiar with R, we suggest you first complete [R Programming](https://www.coursera.org/learn/r-programming) before returning to complete this course.


## Data Visualization Background

At its core, the term 'data visualization' refers to any visual display of data that helps us understand the underlying data better. This can be a plot or figure of some sort or a table that summarizes the data. Generally, there are a few characteristics of all good plots.

### General Features of Plots

Good plots have a number of features. While not exhaustive, good plots have:

1. Clearly-labeled axes.
2. Text that is large enough to read.
3. Axes that are not misleading.
4. Data that are displayed appropriately considering the type of data you have.  

More specifically, however, there are two general approaches to data visualization: **exploratory plots** and **explanatory plots**.

#### Exploratory Plots

These are **data displays to help you better understand and discover hidden patterns in the data** you're working with. These won't be the prettiest plots, but they will be incredibly helpful. Exploratory visualizations have a number of general characteristics:

* They are made quickly.
* You'll make a large number of them.
* The axes and legends are cleaned up.

Below we have a graph where the axes are labeled and a general pattern can be determined. This is a great example of an exploratory plot. It lets you, the analyst, know what's going on in your data, but it isn't yet ready for a big presentation.

![Exploratory Plot](images/gslides/172.png)

As you're trying to understand the data you have on hand, you'll likely make a lot of plots and tables just to explore and understand the data. Because there are a lot of them and they're for your use (rather than for communicating with others), you don't have to spend all your time making them perfect. But you do have to spend enough time to make sure that you're drawing the right conclusions from them. Thus, you don't have to spend a long time considering what colors are perfect on these, but you do want to make sure your axes are not cut off.

Other Exploratory Plotting Examples:

* [Map of Reddit](http://opensource.datacratic.com/mtlpy50/)
* [Air Quality Data](https://blog.datazar.com/exploratory-data-analysis-using-r-part-i-17e4e8e03961)

#### Explanatory Plots 

These are data displays that aim to **communicate insights to others**. You spend a lot of time on these plots to make sure they're easily interpretable by an audience. General characteristics of explanatory plots:

* They take a while to make.
* There are only a few of these for each project.
* You've spent a lot of time making sure the colors, labels, and sizes are all perfect for your needs.

Here we see an improvement upon the exploratory plot we looked at previously. The axis labels are more descriptive, all of the text is larger, the legend has been moved onto the plot, the points are larger, and there is a title. All of these changes help to improve the plot, making it an explanatory plot that would be presentation-ready.

![Explanatory Plots](images/gslides/173.png)

Explanatory plots are made after you've done an analysis and once you really understand the data you have. The goal of these plots is to communicate your findings clearly to others. To do so, you want to make sure these plots are made carefully - the axis labels should all be clear, the labels should all be large enough to read, the colors should all be carefully chosen, etc. As this takes time and because you do not want to overwhelm your audience, you only want to have a few of these for each project. We often refer to these as "publication ready" plots. These are the plots that would make it into an article at the New York Times or in your presentation to your bosses.

Other Explanatory Plotting Examples:

* [How the Recession Shaped the Economy (NYT)](https://www.nytimes.com/interactive/2014/06/05/upshot/how-the-recession-reshaped-the-economy-in-255-charts.html)
* [2018 Flu Season (FiveThirtyEight)](https://fivethirtyeight.com/features/america-should-have-stayed-home-this-flu-season/)

## Plot Types

Above we saw data displayed as both an exploratory plot and an explanatory plot. That plot was an example of a scatterplot. However, there are many types of plots that are helpful. We'll discuss a few basic ones below and will include links to a few galleries where you can get a sense of the many different types of plots out there.

To do this, we'll use the "Davis" dataset of the `carData` package, which includes height and weight information for 200 people.

To use this data, first make sure the `carData` package is installed, and then load it.

```{r}
# install.packages("carData")
library(carData)
Davis <- carData::Davis
```

![Dataset](images/gslides/174.png)

### Histogram

Histograms are helpful when you want to **better understand what values you have in your dataset for a single set of numbers**. For example, if you had a dataset with information about many people, you may want to know how tall the people in your dataset are. To quickly visualize this, you could use a histogram. Histograms let you know what range of values you have in your dataset. For example, below you can see that in this dataset, the height values range from around 50 to around 200 cm. The shape of the histogram also gives you information about the individuals in your dataset, because the number of people within each height range is counted. So, the tallest bars show that there are about 40 people in the dataset whose height is between 165 and 170 cm. Finally, you can quickly tell at a glance that most people in this dataset are at least 150 cm tall, but that there is at least one individual whose reported height is much lower.

![Histogram](images/gslides/175.png) 
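
A histogram like the one above can be sketched with `ggplot2`, which is introduced in detail later in this course. This is a minimal sketch rather than the exact code behind the figure; the bin width of 5 cm is an assumption chosen for illustration.

```{r, eval = FALSE}
library(ggplot2)
library(carData)

# histogram of measured heights (cm); binwidth = 5 is an illustrative choice
height_hist <- ggplot(data = Davis) +
  geom_histogram(mapping = aes(x = height), binwidth = 5)
height_hist
```

Changing `binwidth` changes how coarse the bars are, so it is worth trying a few values before settling on one.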

### Densityplot

Densityplots are smoothed versions of histograms, visualizing the distribution of a continuous variable. These plots effectively visualize the shape of the distribution and, unlike histograms, are not sensitive to the number of bins chosen for visualization.

![Densityplot](images/gslides/176.png) 
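
A density plot of the same heights can be sketched as follows. This is an illustrative sketch, not the code used for the figure. While there are no bins to choose, a density plot does have its own smoothing setting (the bandwidth), which `geom_density()` exposes through the `adjust` argument; the value of 1 used here simply keeps the default.

```{r, eval = FALSE}
library(ggplot2)
library(carData)

# smoothed distribution of measured heights (cm)
# adjust = 1 keeps the default bandwidth; larger values give a smoother curve
height_density <- ggplot(data = Davis) +
  geom_density(mapping = aes(x = height), adjust = 1)
height_density
```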

### Scatterplot

Scatterplots are helpful when you have **numerical values for two different pieces of information** and you want to understand the relationship between those pieces of information. Here, each dot represents a different person in the dataset. The dot's position on the graph represents that individual's height and weight. Overall, in this dataset, we can see that, in general, the more someone weighs, the taller they are. Scatterplots, therefore, help us better understand at a glance the relationship between two sets of numbers.

![Scatter Plot](images/gslides/177.png)

### Barplot

When you only have **a single categorical variable that you want broken down and quantified by category**, a barplot will be ideal. For example, if you wanted to look at how many females and how many males you have in your dataset, you could use a barplot. The comparison in heights between bars clearly demonstrates that there are more females in this dataset than males.

![Barplot](images/gslides/178.png)
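
As a minimal sketch (again, not necessarily the code behind the figure), `geom_bar()` counts the rows in each category for you, so only the categorical variable needs to be supplied:

```{r, eval = FALSE}
library(ggplot2)
library(carData)

# one bar per category of sex; bar height is the number of rows in that category
sex_bars <- ggplot(data = Davis) +
  geom_bar(mapping = aes(x = sex))
sex_bars
```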

#### Grouped Barplot

Grouped barplots, like simple barplots, demonstrate the counts for a group; however, they break this down by an additional categorical variable. For example, here we see the number of individuals within each % category along the x-axis. But, these data are further broken down by gender (an additional categorical variable). Comparisons between bars that are side-by-side are made most easily by our visual system. So, it's important to ensure that the bars you want viewers to be able to compare most easily are next to one another in this plot type. 

![Grouped Barplot](images/gslides/179.png)
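
One way to sketch a grouped barplot is to set `position = "dodge"` in `geom_bar()`. The height groups below are hypothetical cut points invented for this illustration; they are not the categories used in the figure above.

```{r, eval = FALSE}
library(ggplot2)
library(dplyr)
library(carData)

# bin heights (cm) into groups; these cut points are hypothetical
davis_binned <- Davis %>%
  mutate(height_group = cut(
    height,
    breaks = c(0, 160, 170, 180, Inf),
    labels = c("160 or less", "161-170", "171-180", "over 180")
  ))

# position = "dodge" places the bars for each sex side by side
grouped_bars <- ggplot(data = davis_binned) +
  geom_bar(mapping = aes(x = height_group, fill = sex), position = "dodge")
grouped_bars
```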

#### Stacked Barplot

Another common variation on barplots is the stacked barplot. Stacked barplots take the information from a grouped barplot but stack the groups on top of one another. This is most helpful when the bars add up to 100%, such as in a survey response where you're measuring percent of respondents within each category. Otherwise, it can be hard to compare between the groups within each bar.

```{r, eval = FALSE, echo = FALSE}
library(ggplot2)
library(dplyr)

# convert weight and height to percentile ranks, then bin the height percentiles
Davis <- Davis %>%
  mutate(
    percent_weight = round(percent_rank(weight) * 100),
    percent_height = round(percent_rank(height) * 100),
    percent_height_breaks = cut(
      percent_height,
      breaks = seq(0, 100, by = 20), right = TRUE, include.lowest = TRUE,
      labels = c("0-20%", "21-40%", "41-60%", "61-80%", "81-100%")
    )
  )

# stacked barplot: proportion of each sex within each height category
Davis %>%
  ggplot() +
  geom_bar(aes(x = percent_height_breaks, fill = sex), position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  coord_flip() +
  labs(y = "Percent of Each Gender for Each Height Category",
       x = "Category of % Height Percentile") +
  theme_linedraw()
```

![Stacked Barplot](images/gslides/180_update.png)

### Boxplot

Boxplots also summarize **numerical values across a category**; however, instead of just comparing the heights of the bars, they give us an idea of the range of values that each category can take. For example, if we wanted to compare the heights of men to the heights of women, we could do that with a boxplot.

![Boxplot](images/gslides/181.png)

To interpret a boxplot, there are a few places where we'll want to focus our attention. For each category, the horizontal line through the middle of the box corresponds to the median value for that group. So, here, we can say that the median, or most typical, height for females is about 165 cm. For males, this value is higher, just under 180 cm. Outside of the colored boxes, there are dashed lines. The ends of these lines correspond to the typical range of values. Here, we can see that females tend to have heights between 150 and 180 cm. Lastly, when individuals have values outside the typical range, a boxplot will show these individuals as circles. These circles are referred to as outliers.
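
A comparable boxplot can be sketched in `ggplot2` with `geom_boxplot()`. This is an illustrative sketch, not the code behind the figure above: by default `ggplot2` draws solid whiskers and solid points rather than dashed lines and open circles, but the elements are read in the same way.

```{r, eval = FALSE}
library(ggplot2)
library(carData)

# one box per sex, summarizing the distribution of measured heights (cm)
height_box <- ggplot(data = Davis) +
  geom_boxplot(mapping = aes(x = sex, y = height))
height_box
```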

### Line Plots

The final type of basic plot we'll discuss here is the line plot. Line plots are most effective at showing a **quantitative trend over time**.

```{r, eval = FALSE, echo = FALSE}

USPop %>% ggplot(aes(x = year, y = population)) + geom_line(col = "blue", size = 2) + labs(title = "United States Population Over Time", y = "Population in Millions", x = "Year") + theme_linedraw()+ theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14), plot.title = element_text(hjust = .5, size = 16))

```

![Line Plot](images/gslides/182_update.png)


#### Resources to look at these and other types of plots:

* [R Graph Gallery](https://www.r-graph-gallery.com/)
* [Ferdio Data Visualization Catalog](http://datavizproject.com/)


## Making Good Plots

The goal of data visualization in data analysis is to improve understanding of the data. As mentioned in the last lesson, this could mean improving our own understanding of the data *or* using visualization to improve someone else's understanding of the data. 

We discussed some general characteristics and basic types of plots in the last lesson, but here we will step through a number of general tips for making good plots.

When generating exploratory or explanatory plots, you'll want to ensure that the information is displayed accurately and in a way that best reflects the reality within the dataset. Here, we provide a number of tips to keep in mind when generating plots.

### Choose the Right Type of Plot

If your goal is to allow the viewer to compare values across groups, pie charts should largely be avoided. This is because it's easier for the human eye to differentiate between bar heights than it is between similarly-sized slices of a pie. Thinking about the best way to visualize your data before making the plot is an important step in the process of data visualization.

![Choose an appropriate plot for the data you're visualizing.](images/gslides/183.png)

### Be Mindful When Choosing Colors

Choosing colors that work for the story you're trying to convey with your visualization is important. Avoiding colors that are hard to see on a screen or when projected, such as pastels, is a good idea. Additionally, red-green color blindness is common and leads to difficulty in distinguishing reds from greens. Simply avoiding making comparisons between these two colors is a good first step when visualizing data. 


![Choosing appropriate colors for visualizations is important](images/gslides/184.png)

Beyond red-green color blindness, there is an entire group of experts out there in color theory. To learn more, you can browse the available [color palettes in R](https://github.com/EmilHvitfeldt/r-color-palettes) or read Lisa Charlotte Rost [talking about color choices in data visualization](https://lisacharlotterost.github.io/2016/04/22/Colors-for-DataVis/).
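
As one possible starting point, `ggplot2` ships with the viridis color scales, which were designed to remain distinguishable under common forms of color blindness. The sketch below applies the discrete version to a simple barplot of the `Davis` data from the `carData` package; the choice of this particular palette is only a suggestion.

```{r, eval = FALSE}
library(ggplot2)
library(carData)

# fill each bar by sex, using the discrete viridis palette instead of the default colors
sex_bars_viridis <- ggplot(data = Davis) +
  geom_bar(mapping = aes(x = sex, fill = sex)) +
  scale_fill_viridis_d()
sex_bars_viridis
```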

### Label the Axes

Whether you're making an exploratory or explanatory visualization, labeled axes are a must. They help tell the story of the figure. Making sure the axes are clearly labeled is also important. Rather than labeling the graph below with "h" and "g," we chose the labels "height" and "gender," making it clear to the viewer exactly what is being plotted.

![Having descriptive labels on your axes is critical](images/gslides/185.png)
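
In `ggplot2`, axis labels can be set with `labs()`. The sketch below uses the `Davis` data from the `carData` package; the label text is our own wording, with units taken from the dataset documentation.

```{r, eval = FALSE}
library(ggplot2)
library(carData)

# without labs(), the axes would show the raw column names "sex" and "height"
labeled_box <- ggplot(data = Davis) +
  geom_boxplot(mapping = aes(x = sex, y = height)) +
  labs(x = "Gender", y = "Height (cm)")
labeled_box
```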

### Make Sure the Text is Readable

Often text on plots is too small for viewers to read. By being mindful of the size of the text on your axes, in your legend, and used for your labels, your visualizations will be greatly improved.

![On the right, we see that the text is easily readable](images/gslides/186.png)
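
Text size in `ggplot2` is controlled through `theme()`. The sizes below are illustrative values only; what counts as readable depends on where the plot will be shown.

```{r, eval = FALSE}
library(ggplot2)
library(carData)

# enlarge the tick labels and axis titles; the sizes (in points) are illustrative
readable_box <- ggplot(data = Davis) +
  geom_boxplot(mapping = aes(x = sex, y = height)) +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14))
readable_box
```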

### Make Sure the Numbers Add Up

When you're making a plot that should sum to 100%, make sure that it in fact does. Taking a look at visualizations after you make them to ensure that they make sense is an important part of the data visualization process.

![At left, the pieces of the pie only add up to 95%. On the right, this error has been fixed and the pieces add up to 100%](images/gslides/187.png)
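
A quick check like this can also be done in code before plotting. The sketch below computes the percentage of each sex in the `Davis` data from the `carData` package and confirms that the percentages sum to 100; the object name `sex_percent` is our own.

```{r, eval = FALSE}
library(dplyr)
library(carData)

# percentage of rows in each category
sex_percent <- Davis %>%
  count(sex) %>%
  mutate(percent = 100 * n / sum(n))

# all.equal() allows for tiny floating-point differences
isTRUE(all.equal(sum(sex_percent$percent), 100))
```

If you round the percentages for display, it is worth repeating the check on the rounded values, since rounding can make the labels sum to slightly more or less than 100.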

### Make Sure the Numbers and Plots Make Sense Together

Another common error is having labels that don't reflect the underlying graphic. For example, here, we can see on the left that the turquoise piece is more than half the graph, and thus the label 45% must be incorrect. At right, we see that the labels match what we see in the figure.


![Checking to make sure the numbers and plot make sense together is important](images/gslides/188.png)

### Make Comparisons Easy on Viewers

There are many ways in which you can make comparisons easier on the viewer. For example, avoiding unnecessary whitespace between the bars on your graph can help viewers make comparisons between the bars on the barplot.

![At left, there is extra white space between the bars of the plot that should be removed. On the right, we see an improved plot](images/gslides/189.png)

### Use y-axes That Start at Zero

Often, in an attempt to make differences between groups look larger than they are, the y-axis will be started at a value other than zero. This is misleading. The y-axis for numerical information should start at zero.

![At left, the differences between the bars appear larger than on the right; however, this is just because the y-axis starts at 200. The proper way to start this graph is to start the y-axis at 0.](images/gslides/190.png)
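
The effect can be reproduced with a small sketch. The data frame below is hypothetical, invented purely for illustration. Bars drawn with `geom_col()` start at zero by default, so the truncated version has to be requested explicitly, here with `coord_cartesian()`.

```{r, eval = FALSE}
library(ggplot2)

# hypothetical values, invented purely for illustration
group_totals <- data.frame(group = c("A", "B"), total = c(210, 230))

# truncated axis: zooming in on 200-240 exaggerates the difference
truncated <- ggplot(data = group_totals) +
  geom_col(mapping = aes(x = group, y = total)) +
  coord_cartesian(ylim = c(200, 240))

# default axis: bars start at zero, so their heights are comparable
zero_based <- ggplot(data = group_totals) +
  geom_col(mapping = aes(x = group, y = total))

truncated
zero_based
```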

### Keep It Simple

The goal of data visualization is to improve understanding of data. Sometimes complicated visualizations cannot be avoided; however, when possible, keep it simple. 

Here, the graphic on the left does not immediately convey a main point. It's hard to interpret what each point means or what the story of this graphic is supposed to be. In contrast, the graphics on the right are simpler and each shows a more obvious story. Make sure that your main point comes through:

![Main point unclear](images/gslides/191_update.png)

```{r, eval = FALSE, echo = FALSE}
library(tidyr)

  iris %>% pivot_longer(cols = -Species) %>%
separate(col = "name", into = c("Type","Measurement")) %>% 
      mutate(row = rep(1:(600/2), each=2))%>%
      tidyr::pivot_wider(names_from = Measurement, values_from = value) %>% ggplot() + 
    geom_point(aes(x = Width, y = Length, col = Species, shape = Type), size = 3, alpha = .7) +labs(title = "Iris Characteristics") + theme_linedraw()+ theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14), plot.title = element_text(hjust = .5, size = 16), legend.text = element_text(size = 13), legend.title = element_text(size = 13))


iris %>% ggplot(aes(x = Sepal.Width, y = Sepal.Length,  col = Species)) +geom_point(size = 2)+ labs(x = "Width", y = "Length", title = "Iris Sepal Characteristics") + theme_linedraw()+ theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14), plot.title = element_text(hjust = .5, size = 16), legend.text = element_text(size = 13), legend.title = element_text(size = 13)) + ylim(c(0,10))

iris %>% ggplot(aes(x = Petal.Width, y = Petal.Length,  col = Species)) +geom_point(size =2)+ labs( x = "Width", y = "Length", title = "Iris Petal Characteristics") + theme_linedraw()+ theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14), plot.title = element_text(hjust = .5, size = 16), legend.text = element_text(size = 13), legend.title = element_text(size = 13)) + ylim(c(0,10))

```

Similarly, the intention of your graphic should never be to mislead or confuse. Be sure that your data visualizations improve viewers' understanding. Using unusual axis limits or point sizes, or using vague labels, can make plots misleading. This plot creates an effective exclamation mark shape, which is fun, but it is no longer clear which points correspond to which species. Furthermore, this plot makes it look like petal width is not very distinguishable across the different species (particularly for versicolor and virginica), which is the opposite of what the previous petal plot conveyed.

```{r, eval = FALSE, echo = FALSE}
library(tidyr)
library(ggplot2)
library(dplyr)

  iris %>% ggplot(aes(x = Petal.Width, y = Petal.Length,  shape = Species)) +geom_point(size =10, alpha = 0.5)+ labs( x = "Width", y = "Length", title = "Iris Petal Characteristics") + theme_linedraw()+ theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14), plot.title = element_text(hjust = .5, size = 16), legend.text = element_text(size = 13), legend.title = element_text(size = 13)) + ylim(c(0,10))+ xlim(c(0,20))
```

![Confusion is conveyed here](images/gslides/192_update.png)


## Plot Generation Process

Having discussed some general guidelines, there are a number of questions you should ask yourself before making a plot. These have been nicely laid out in a [blog post](https://blog.datawrapper.de/better-charts/) from the wonderful team at [Chartable](https://blog.datawrapper.de/), [Datawrapper's](https://www.datawrapper.de/) blog, and we will summarize them here. The post argues that there are three main questions you should ask any time you create a visual display of your data. We will discuss these three questions below.

### What's your point?

Whenever you have data you're trying to plot, think about what you're actually trying to show. Once you've taken a look at your data, a good title for the plot can be helpful. Your title should **tell viewers what they'll see when they look at the plot**. 

### How can you emphasize your point in your chart?

We talked about it in the last lesson, but an incredibly important decision is choosing an appropriate chart for the type of data you have. In the next section of this lesson, we'll discuss what type of data are appropriate for each type of plot in R; however, for now, we'll just focus on an iPhone data example. With this example, we'll discuss that you can emphasize your point by:

* Adding data
* Highlighting data with color
* Annotating your plot

#### Adding data

In any plot that makes a specific claim, it is usually important to show additional data as a reference for comparison. For example, if you were making a plot that suggests that the iPhone has been Apple's most successful product, it would be helpful for the plot to compare iPhone sales with other Apple products, say, the iPad or the iPod. By adding data about other Apple products over time, we can visualize just how successful the iPhone has been compared to other products.

#### Highlighting data with color

Colors help direct viewers' eyes to the most important parts of the figure. Colors tell your readers where to focus their attention. Grays help to tell viewers where to focus less of their attention, while other colors help to highlight the point you're trying to make.

#### Annotating your plot

By highlighting parts of your plot with arrows or text on your plot, you can further draw viewers' attention to certain parts of the plot. These are often details that are unnecessary in exploratory plots, where the goal is just to better understand the data, but are very helpful in explanatory plots, when you're trying to draw conclusions from the plot.


### What Does Your Final Chart Show?

A plot title should first tell viewers what they would see in the plot. The second step is to show them with the plot. The third step is to make it extra clear to viewers what they should be seeing with descriptions, annotations, and legends. You explain to viewers what they should be seeing in the plot and the source of your data. Again, these are important pieces of creating a complete explanatory plot, but are not all necessary when making exploratory plots.

#### Write precise descriptions 

Whether it's a figure legend at the bottom of your plot, a subtitle explaining what data are plotted, or clear axes labels, text describing clearly what's going on in your plot is important. Be sure that viewers are able to easily determine what each line or point on a plot represents. 

#### Add a source

When finalizing an explanatory plot, be sure to source your data. It's always best for readers to know where you obtained your data and what data are being used to create your plot. Transparency is important.
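
The ideas in this section (a title that states the point, color used for emphasis, an annotation, clear labels, and a source) can be combined in one sketch. It uses the `Davis` data from the `carData` package rather than the iPhone example, since that data is not included here. The 100 cm threshold, the annotation position, and all of the label text are assumptions made for illustration.

```{r, eval = FALSE}
library(ggplot2)
library(dplyr)
library(carData)

# flag unusually low reported heights; the 100 cm threshold is an assumption
davis_flagged <- Davis %>%
  mutate(unusual = height < 100)

explanatory_plot <- ggplot(data = davis_flagged,
                           mapping = aes(x = weight, y = height)) +
  # gray de-emphasizes most points; red draws the eye to the flagged ones
  geom_point(mapping = aes(color = unusual), size = 2) +
  scale_color_manual(values = c("FALSE" = "gray60", "TRUE" = "red"),
                     guide = "none") +
  # annotation coordinates were picked by hand; adjust them for your own data
  annotate("text", x = 160, y = 57, hjust = 1,
           label = "Unusually low reported height") +
  labs(title = "One reported height is far below the rest",
       x = "Weight (kg)", y = "Height (cm)",
       caption = "Source: Davis dataset, carData package")
explanatory_plot
```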


## `ggplot2`: Basics

R was initially developed for statisticians, who often are interested in generating plots or figures to visualize their data. As such, a few basic plotting features were built in when R was first developed. These are all still available; however, over time, a new approach to graphing in R was developed. This new approach implemented what is known as the [grammar of graphics](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448), which allows you to develop elegant graphs flexibly in R. Making plots with this set of rules requires the R package `ggplot2`. This package is a core package in the tidyverse. So as long as the tidyverse has been loaded, you're ready to get started.

```{r message = FALSE}
# load the tidyverse
library(tidyverse)
```

### `ggplot2` Background

The grammar of graphics implemented in `ggplot2` is based on the idea that you can build *any* plot as long as you have a few pieces of information. To start building plots in `ggplot2`, we'll need some data and we'll need to know the type of plot we want to make. The type of plot you want to make in `ggplot2` is referred to as a geom. This will get us started, but the idea behind ggplot2 is that every new concept we introduce will be layered on top of the information you've already learned. In this way, ggplot2 is *layered* - layers of information add on top of each other as you build your graph. In code written to generate a `ggplot2` figure, you will see each line is separated by a plus sign (`+`). Think of each line as a different layer of the graph. We're simply adding one layer on top of the previous layers to generate the graph. You'll see exactly what we mean by this throughout each section in this lesson.

To get started, we'll start with the two basics (data and a geom) and build additional layers from there. 

As we get started plotting in `ggplot2`, plots will take the following general form:

```{r eval = FALSE}
ggplot(data = DATASET) + 
  geom_PLOT_TYPE(mapping = aes(VARIABLE(S)))
```

When using `ggplot2` to generate figures, you will always begin by calling the `ggplot()` function. You'll then specify your dataset within the `ggplot()` function. Then, before making your plot you will also have to specify what **geom** type you're interested in plotting. We'll focus on a few basic geoms in the next section and give examples of each plot type (geom), but for now we'll just work with a single geom: `geom_point`. 

`geom_point` is most helpful for creating scatterplots. As a reminder from an earlier lesson, scatterplots are useful when you're looking at the relationship between two numeric variables. Within `geom` you will specify the arguments needed to tell `ggplot2` how you want your plot to look. 

You will map your variables using the aesthetic argument **`aes`**. We'll walk through examples below to make all of this clear. However, get comfortable with the overall look of the code now.

### Example Dataset: `diamonds`

To build your first plot in `ggplot2` we'll make use of the fact that there are some datasets already available in R. One frequently-used dataset is known as `diamonds`. This dataset contains prices and other attributes of 53,940 diamonds, with each row containing information about a different diamond. If you look at the first few rows of data, you can get an idea of what data are included in this dataset. 

```{r}
diamonds <- as_tibble(diamonds)
diamonds
```

![First 12 rows of diamonds dataset](images/gslides/193.png)

Here you see a lot of numbers and can get an idea of what data are available in this dataset. For example, in looking at the column names across the top, you can see that we have information about how many carats each diamond is (`carat`), some information on the quality of the diamond cut (`cut`), the color of the diamond from J (worst) to D (best) (`color`), along with a number of other pieces of information about each diamond.

We will use this dataset to better understand how to generate plots in R, using `ggplot2`.

### Scatterplots: `geom_point()`

Scatterplots show the relationship between two numeric variables. In `ggplot2` we specify these variables by defining `x` and `y` *within* the `aes()` call. The `x` argument defines which variable will be along the bottom of the plot. The `y` refers to which variable will be along the left side of the plot. If we wanted to understand the relationship between the number of carats in a diamond and that diamond's price, we may do the following:

```{r}
# generate scatterplot with geom_point()
ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price))
```

![diamonds scatterplot](images/gslides/194.png)

In this plot, we see that, in general, the larger the diamond is (or the more carats it has), the more expensive the diamond is (price), which is probably what we would have expected. However, now, we have a plot that supports this conclusion!
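
To make the idea of layers concrete, the same scatterplot can be built up one piece at a time by storing each step in an object. This is only a sketch of the layering idea; the object names and the axis label text are our own.

```{r, eval = FALSE}
library(ggplot2)

# step 1: the data alone draws an empty canvas
base_plot <- ggplot(data = diamonds)

# step 2: adding a geom layer draws the points
with_points <- base_plot +
  geom_point(mapping = aes(x = carat, y = price))

# step 3: adding labels on top of the previous layers
with_labels <- with_points +
  labs(x = "Carat", y = "Price (US dollars)")

with_labels
```

Printing `base_plot` or `with_points` on their own shows what each step contributes.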

### Aesthetics

What if we wanted to alter the size, color or shape of the points? Probably unsurprisingly, these can all be changed within the aesthetics argument. After all, something's aesthetic refers to how something looks. Thus, if you want to change the look of your graph, you'll want to play around with the plot's aesthetics. 

In fact, in the plots above you'll notice that we specified what should be on the x and y axis within the `aes()` call. These are aesthetic mappings too! We were telling ggplot2 what to put on each axis, which will clearly affect how the plot looks, so it makes sense that these calls have to occur within `aes()`. Additionally now, we'll focus on arguments within `aes()` that change how the points on the plot look. 

#### Point color

In the scatterplot we just generated, we saw that there was a relationship between carat and price, such that the more carats a diamond has, generally, the higher the price. But, it's not a perfectly linear trend. What we mean by that is that not all diamonds that were 2 carats were exactly the same price. And, not all 3 carat diamonds were exactly the same price. What if we were interested in finding out a little bit more about why this is the case?

Well, we could look at the clarity of the diamonds to see whether or not that affects the price of the diamonds. To add clarity to our plot, we could change the color of our points to differ based on clarity:

```{r}
# adjusting color within aes
ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price, color = clarity))
```

![changing point colors helps us better understand the data](images/gslides/195_update.png)

Here, we see that not only are the points now colored by clarity, ggplot2 has also automatically added a legend for us with the various classes and their corresponding point color. 

The Help pages of the diamonds dataset (accessed using `?diamonds`) state that clarity is "a measurement of how clear the diamond is." The documentation also tells us that I1 is the worst clarity and IF is the best (Full scale: I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF). This makes sense with what we see in the plot. Small (<1 carat) diamonds that have the best clarity level (IF) are some of the most expensive diamonds, while relatively large diamonds (between 2 and 3 carats) of the lowest clarity (I1) tend to cost less.

By coloring our points by a different variable in the dataset, we now understand our dataset better. This is one of the goals of data visualization! And, specifically, what we're doing here in `ggplot2` is known as **mapping a variable to an aesthetic**. We took another variable in the dataset, mapped it to a color, and then put those colors on the points in the plot. Well, we only told `ggplot2` what variable to map. It took care of the rest!

Of course, we can also *manually* specify the colors of the points on our graph; however, manually specifying the colors of points happens *outside* of the `aes()` call. This is because `ggplot2` does not have to go through the process of mapping the variable to an aesthetic (color in this case). In the code here, `ggplot2` doesn't have to go through the trouble of figuring out which level of the variable is going to be which color on the plot (the mapping to the aesthetic part of the process). Instead, it just colors every point red. Thus, **manually specifying the color of your points happens _outside_ of `aes()`**:

```{r}
# manually control color point outside aes
ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price), color = "red") 
```

![manually specifying point color occurs outside of `aes()`](images/gslides/196.png)
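
A common slip is to put the fixed color *inside* `aes()`. The sketch below contrasts the two placements; the object names are our own. Inside `aes()`, `ggplot2` treats `"red"` as a variable with a single value and maps it to a color from its default palette, adding a legend, so the points are not drawn in the color you asked for.

```{r, eval = FALSE}
library(ggplot2)

# inside aes(): "red" is treated as a one-level variable, so the points get
# a default palette color and a legend, not the color red
color_inside <- ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price, color = "red"))

# outside aes(): every point is drawn in red
color_outside <- ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price), color = "red")

color_inside
color_outside
```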

#### Point size

As above, we can change the point size by mapping another variable to the `size` argument within `aes`:

```{r}
# adjust point size within aes
ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price, size = clarity))
```

![mapping to size changes point size on plot](images/gslides/197.png)

As above, `ggplot2` handles the mapping process. All you have to do is specify what variable you want mapped (`clarity`) and how you want ggplot2 to handle the mapping (change the point `size`). With this code, you do get a warning when you run it in R that using a "discrete variable is not advised." This is because mapping to size is usually done for numeric variables, rather than categorical variables like clarity. 

This makes sense here too. The relationship between clarity, carat, and price was easier to visualize when clarity was mapped to `color` than here where it is mapped to `size`.
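Since size is better suited to numeric variables, here is a minimal sketch that maps one. The choice of `depth` and the subset `diamonds_small` (every 100th row, so individual points stay visible) are illustrative assumptions of ours, not part of the lesson's figures.

```{r}
library(ggplot2)

# hypothetical subset: every 100th row of diamonds
diamonds_small <- diamonds[seq(1, nrow(diamonds), by = 100), ]

# map a numeric variable (depth) to point size within aes
p_size <- ggplot(data = diamonds_small) + 
  geom_point(mapping = aes(x = carat, y = price, size = depth))
p_size
```

Because `depth` is numeric, this version does not trigger the warning about discrete variables.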

Like the above example with color, the size of *every* point can be changed by calling `size` outside of `aes`:

```{r}
# global control of point size
ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price), size = 4.5)
```


![manually specifying point size of all points occurs outside of `aes()`](images/gslides/198.png)

Here, we have manually increased the size of *all* the points on the plot. 

#### Point Shape

You can also change the shape of the points (`shape`). We've used solid, filled circles thus far (the default in `geom_point`), but we could specify a different shape for each clarity.

```{r}
# map clarity to point shape within aes 
ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price, shape = clarity))
```

![mapping clarity to shape](images/gslides/199.png)

Here, while the mapping occurs correctly within ggplot2, we do get a warning message that discriminating more than six different shapes is difficult for the human eye. Because of this, ggplot2's default shape palette only supplies six shapes, so the levels of clarity beyond the sixth are not drawn. This suggests that while you *can* do something, it's not always *best* to do it. Here, with eight levels of clarity, it's best to stick to mapping this variable to `color` as we did initially.

To manually specify a shape for all the points on your plot, you would specify it outside of `aes` using one of the numbered shape options available (0 through 25):

![options for points in ggplot2's `shape`](images/gslides/200.png)

For example, to plot all of the points on the plot as filled diamonds (it is a dataset about diamonds after all...), you would specify shape '18':

```{r}
# global control of point shape outside aes
ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price), shape = 18)
```

![specifying filled diamonds as shape for all points manually](images/gslides/201.png)

### Facets

In addition to mapping variables to different aesthetics, you can also opt to use facets to help make sense of your data visually. Rather than plotting all the data on a single plot and visually altering the point size or color of a third variable in a scatterplot, you could break each level of that third variable out into a separate subplot. To do this, you would use faceting. Faceting is particularly helpful for looking at categorical variables. 

To use faceting, you would add an additional layer (+) to your code and use the `facet_wrap()` function. Within `facet_wrap()`, you specify the variable by which you want your subplots to be made:

```{r}
ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price)) + 
  # facet by clarity
  facet_wrap(~clarity, nrow = 2)
```

Here, read the tilde as the word "by". Specifically here, we want a scatterplot of the relationship between carat and price and we want it faceted (broken down) **by (~)** clarity.

![facet_wrap breaks plots down into subplots](images/gslides/202.png)

Now, we have eight different plots, one for each level of clarity, where we can see the relationship between diamond carats and price.

You'll note here we've opted to specify that we want 2 rows of subplots (`nrow = 2`). You can play around with the number of rows you want in your output to customize how your output plot appears. 
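As a small sketch of that kind of experimentation, `facet_wrap()` also accepts `ncol` to fix the number of columns instead of rows. The object name `p_facet` is ours.

```{r}
library(ggplot2)

# same faceted scatterplot, arranged in 2 columns (4 rows) instead of 2 rows
p_facet <- ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price)) + 
  facet_wrap(~clarity, ncol = 2)
p_facet
```

With eight clarity levels, two columns produce a taller, narrower grid, which can suit a portrait page better than the two-row layout.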

### Geoms

Thus far in this lesson we've only looked at scatterplots, which means we've only called `geom_point`. However, there are *many* additional geoms that we could call to generate different plots. Simply, a *geom* is just a shape we use to represent the data. In a scatterplot, the geom is a point, and each observation is plotted individually. Other plots, such as the boxplots, barplots, and histograms we described in previous lessons, summarize or represent the data in a meaningful way without plotting each individual point. The shapes used in these different types of plots to represent what's going on in the data are those plots' geoms. 

To see exactly what we mean by geoms being "shapes that represent the data", let's keep using the `diamonds` dataset, but instead of looking at the relationship between two numeric variables in a scatterplot, let's take a step back and take a look at a single numeric variable using a histogram. 

#### Histograms: `geom_histogram`

To review, histograms allow you to quickly visualize the range of values your variable takes and the shape of your data. (Are all the numbers clustered around center? Or, are they all at the extremes of the range? Somewhere in between? The answers to these questions describe the "shape" of the values of your variable.)

For example, if we wanted to see what the distribution of carats was for these data, we could do the following.

```{r}
# change geom_ to generate histogram
ggplot(data = diamonds) + 
  geom_histogram(mapping = aes(x = carat))
```

![histogram of carat shows range and shape](images/gslides/203.png)

The code follows what we've seen so far in this lesson; however, we've now called `geom_histogram` to specify that we want to plot a histogram rather than a scatterplot. 

Here, the rectangular boxes on the plot are geoms (shapes) that represent the number of diamonds that fall into each bin on the plot. Rather than plotting each individual point, histograms use rectangular boxes to summarize the data. This summarization helps us quickly understand what's going on in our dataset.

Specifically here, we can quickly see that most of the diamonds in the dataset are less than 1 carat. This is not necessarily something we could be sure of from the scatterplots generated previously in this lesson (since some points could have been plotted directly on top of one another). Thus, it's often helpful to visualize your data in a number of ways when you first get a dataset to ensure that you understand the variables and relationships between variables in your dataset!
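When you run the histogram code, ggplot2 prints a message noting that it used a default of 30 bins and suggesting that you choose a bin width yourself. A minimal sketch of doing so follows. The value `0.1` (one tenth of a carat) is only an illustrative choice, and `p_hist` is our own object name.

```{r}
library(ggplot2)

# histogram of carat with an explicit bin width of 0.1 carats
p_hist <- ggplot(data = diamonds) + 
  geom_histogram(mapping = aes(x = carat), binwidth = 0.1)
p_hist
```

Trying a few different bin widths is worthwhile, since the apparent shape of a distribution can change with the binning.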

#### Barplots: `geom_bar`

Barplots show the relationship between a set of numbers and a **categorical** variable. In the diamonds dataset, we may be interested in knowing how many diamonds there are of each cut of diamonds. There are five categories for cut of diamond. If we make a barplot for this variable, we can see the number of diamonds in each category.

```{r}
# geom_bar for bar plots
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(cut))
```

Again, the changes to the code are minimal. We are now interested in plotting the categorical variable `cut` and state that we want a bar plot, by including `geom_bar()`. 

![diamonds barplot](images/gslides/204.png)

Here, we again use rectangular shapes to represent the data, but we're not showing the distribution of a single variable (as we were with `geom_histogram`). Rather, we're using rectangles to show the count (number) of diamonds within each category within cut. Thus, we need a different geom: `geom_bar`!
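The bar heights are simply counts per category, so they can be checked without plotting. This sketch uses base R's `table()`; the object name `cut_counts` is ours.

```{r}
library(ggplot2)

# number of diamonds in each cut category; these are the bar heights
cut_counts <- table(diamonds$cut)
cut_counts
```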

#### Boxplots: `geom_boxplot`

Boxplots provide a summary of a numerical variable across categories. For example, if you were interested to see how the price of a diamond (a numerical variable) changed across different diamond color categories (categorical variable), you may want to use a boxplot. To do so, you would specify that using `geom_boxplot`:

```{r}
# geom_boxplot for boxplots
ggplot(data = diamonds) + 
  geom_boxplot(mapping = aes(x = color, y = price))
```

In the code, we see that again, we only have to change what variables we want to be included in the plot and the type of plot (or geom). We want to use `geom_boxplot()` to get a basic boxplot.

![diamonds boxplot](images/gslides/205.png)

In the figure itself we see that the median price (the black horizontal bar in the middle of each box represents the median for each category) generally increases as the diamond color moves from the best category (D) to the worst (J).
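To read the medians off numerically rather than by eye, you can compute them per color category. This sketch uses base R's `tapply()`; the object name `median_by_color` is ours. According to `?diamonds`, color runs from D (best) to J (worst).

```{r}
library(ggplot2)

# median price within each color category (D is best, J is worst)
median_by_color <- tapply(diamonds$price, diamonds$color, median)
median_by_color
```

Comparing these values to the horizontal bars in the boxplot is a quick check that you are reading the figure correctly.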

Now, if you wanted to change the color of this boxplot, it would just take a small addition to the code for the plot you just generated.

```{r}
# fill outside aes sets the same fill color for every box
ggplot(data = diamonds) + 
  geom_boxplot(mapping = aes(x = color, y = price), 
               fill = "red")
```

![diamonds boxplot with red fill](images/gslides/206.png)

Here, by specifying the color "red" in the `fill` argument, you're able to change the plot's appearance. In the next lesson, we'll go deeper into the many ways in which a plot can be customized within `ggplot2`!

#### Other plots

While we've reviewed basic code to make a few common types of plots, there are a number of other plot types that can be made in `ggplot2`. These are listed in the [online reference material for ggplot2](http://ggplot2.tidyverse.org/reference/) or can be accessed through RStudio directly. To do so, you would type `?geom_` into the Console in RStudio. A list of geoms will appear. You can hover your cursor over any one of these to get a short description. 

![?geom in Console](images/gslides/207.png)

Or, you can select a geom from this list and press Enter. After selecting a geom, such as `geom_abline`, and pressing Enter, the help page for that geom will pop up in the 'Help' tab at bottom right. Here, you can find more detailed information about the selected geom.

![geom_abline help page](images/gslides/208.png)

### EDA Plots

As mentioned previously, an important step after you've read your data into R and wrangled it into a tidy format is to carry out **Exploratory Data Analysis (EDA)**. EDA is the process of understanding the data in your dataset fully. To do so, you need a full understanding of the variables stored in your dataset, what information you have, and what information you don't have (missingness!). To gain this understanding, we've discussed using packages like `skimr` to get a quick idea of what information is stored in your dataset. However, generating plots is another critical step in this process. We encourage you to use `ggplot2` to understand the distribution of each single variable as well as the relationships between the variables in your dataset.

In this process, using `ggplot2` defaults is totally fine. These plots do not have to be the most effective visualizations for communication, so you don't want to spend a ton of time making them visually perfect. Only spend as much time on these as you need to understand your data!

## `ggplot2`: Customization

So far, we have walked through the steps of generating a number of different graphs (using different `geoms`) in `ggplot2`. We discussed the basics of mapping variables to your graph to customize its appearance or aesthetic (using size, shape, and color within `aes()`). Here, we'll build on what we've previously learned to really get down to how to customize your plots so that they're as clear as possible for communicating your results to others. 

The skills learned in this lesson will help take you from generating exploratory plots that help *you* better understand your data to explanatory plots -- plots that help you communicate your results *to others*. We'll cover how to customize the colors, labels, legends, and text used on your graph.

Since we're already familiar with it, we'll continue to use the `diamonds` dataset that we've been using to learn about `ggplot2`.

### Colors

To get started, we'll learn how to control color across plots in `ggplot2`. Previously, we discussed using color within `aes()` on a scatterplot to automatically color points by the clarity of the diamond when looking at the relationship between price and carat.

```{r}
ggplot(diamonds) + 
  geom_point(mapping = aes(x = carat, y = price, color = clarity))
```

However, what if we wanted to carry this concept over to a bar plot and look at how many diamonds we have of each clarity group?

```{r}
# generate bar plot
ggplot(diamonds) + 
  geom_bar(aes(x = clarity))
```

![diamonds broken down by clarity](images/gslides/209.png)

As a general note, we've stopped including `data =` and `mapping =` here within our code. We included it so far to be explicit; however, in code you see in the world, the names of the arguments will typically be excluded and we want you to be familiar with code that appears as you see above. 

OK, well that's a start since we see the breakdown, but all the bars are the same color. What if we adjusted color within `aes()`?

```{r}
# color changes outline of bar
ggplot(diamonds) + 
  geom_bar(aes(x = clarity, color = clarity))
```

![color does add color but around the bars](images/gslides/210.png)

As expected, color added a legend for each level of clarity; however, it colored the lines around the bars on the plot, rather than the bars themselves. In order to color the bars themselves, you want to specify the more helpful argument `fill`:

```{r}
# use fill to color bars
ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = clarity))
```

![`fill` automatically colors the bars](images/gslides/211.png)

Great! We now have a plot with bars of different colors, which was our first goal! However, adding colors here, while maybe making the plot prettier, doesn't actually give us any more information. We can see the same pattern of which clarity is most frequent among the diamonds in our dataset that we could see in the first plot we made. 

Color is particularly helpful here, however, if we wanted to map a *different* variable onto each bar. For example, what if we wanted to see the breakdown of diamond "cut" within each "clarity" bar? 

```{r}
# fill by separate variable (cut) = stacked bar chart
ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut))
```

![mapping a different variable to fill provides new information](images/gslides/212.png)

Now we're getting some new information! We can see that each level in clarity appears to have diamonds of all levels of cut. Color here has really helped us understand more about the data.

But what if we were going to present these data? While there is no comparison between red and green (which is good!), there is a fair amount of yellow in this plot. Some projectors don't handle projecting yellow well, and it will show up too light on the screen. To avoid this, let's manually change the colors in this bar chart! To do so we'll add an additional layer to the plot using `scale_fill_manual`.

```{r}
ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut)) +
  # manually control colors used
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4"))
```

![manually setting colors using `scale_fill_manual`](images/gslides/213.png)

Here, we've specified five different colors within the `values` argument of `scale_fill_manual()`, one for each cut of diamond. The names of these colors can be specified using the names explained on the third page of the cheatsheet [here](https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf). (Note: There are other ways to specify colors within R. Explore the details in that cheatsheet to better understand the various ways!)

Additionally, it's important to note that here we've used `scale_fill_manual()` to adjust the color of what was mapped using `fill = cut`. If we had colored our chart using `color` within `aes()`, we would use a different function called `scale_color_manual()`. This makes good sense! You use `scale_fill_manual()` with `fill` and `scale_color_manual()` with `color`. Keep that in mind as you adjust colors in the future!

Now that we have some sense of which clarity is most common in our diamonds dataset and now that we are able to successfully specify the colors we want manually in order to make this plot useful for presentation, what if we wanted to compare the proportion of each cut across the different clarities?  Currently, that's difficult because there is a different number within each clarity. In order to compare the proportion of each cut we have to use **position adjustment**. 

What we've just generated is a **stacked bar chart**. It's a pretty good name for this type of chart as the bars for cut are all stacked on top of one another. If you don't want a stacked bar chart you could use one of the other `position` options: `identity`, `fill`, or `dodge`.

Returning to our question about proportion of each cut within each clarity group, we'll want to use `position = "fill"` within `geom_bar()`. Building off of what we've already done:

```{r}
ggplot(diamonds) + 
  # fill scales to 100%
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4"))
```

![`position = "fill"` allows for comparison of proportion across groups](images/gslides/214.png)

Here, we've specified how we want to adjust the position of the bars in the plot. Each bar is now of equal height and we can compare each colored bar across the different clarities. As expected, the best clarity group (IF) contains a larger share of diamonds of the best cut ("Ideal")!
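The proportions drawn by `position = "fill"` can also be computed directly. This sketch uses base R's `table()` and `prop.table()`; the object name `cut_by_clarity` is ours.

```{r}
library(ggplot2)

# cross-tabulate clarity (rows) by cut (columns), then convert each row
# to proportions so that every clarity group sums to 1
cut_by_clarity <- prop.table(table(diamonds$clarity, diamonds$cut), margin = 1)
round(cut_by_clarity, 2)
```

Each row of this table corresponds to one bar in the plot, and each column to one colored segment within that bar.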

Briefly, we'll take a quick detour to look at `position = "dodge"`. This position adjustment places each object *next to one another*. This will not allow for comparison across groups as easily as the last plot did, but it will allow values within each clarity group to be visualized.

```{r eval = FALSE}
ggplot(diamonds) + 
  # dodge rather than stack produces grouped bar plot
  geom_bar(aes(x = clarity, fill = cut), position = "dodge") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4"))
```

![`position = "dodge"` helps compare values within each group](images/gslides/215.png)

Unlike in the first plot where we specified `fill = cut`, we can actually see the relationship between each cut within the lowest clarity group (I1). Before, when the values were stacked on top of one another, we were not able to visually see that there were more "Fair" and "Premium" cut diamonds in this group than the other cuts. Now, with `position = "dodge"`, this information is visually apparent.

Note: `position = "identity"` is not very useful for bars, as it *places each object exactly where it falls within the graph*. For bar charts, this will lead to *overlapping bars*, which is not visually helpful. However, for scatterplots (and other two-dimensional charts), this is the default and is exactly what you want.  

### Labels

Text on plots is incredibly helpful. A good title tells viewers what they should be getting out of the plot. Axis labels are incredibly important to inform viewers of what's being plotted. Annotations on plots help guide viewers to important points in the plot. We'll discuss how to control all of these now!

#### Titles 

Now that we have an understanding of how to manually adjust color, let's improve the clarity of our plots by including helpful labels by adding an additional `labs()` layer. We'll return to the plot where we were comparing proportions of diamond cut across diamond clarity groups.

You can include a `title`, `subtitle`, and/or `caption` within the `labs()` function. The arguments, as per usual, are separated by commas. 

```{r}
ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  # add titles
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       subtitle = "The majority of IF diamonds are an \"Ideal\" cut")
```

![`labs()` adds helpful titles, subtitles, and captions](images/gslides/216.png)
 
#### Axis labels
 
You may have noticed that our y-axis label says "count", but it's not actually a count anymore. In reality, it's a proportion. Having appropriately labeled axes is *so important*. Otherwise, viewers won't know what's being plotted. So, we should really fix that now using the `ylab()` function. Note: we won't be changing the x-axis label, but if you were interested in doing so, you would use `xlab("label")`.

```{r}
ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
  # add y axis label explicitly
  ylab("proportion")
```

Note that the x- and y- axis labels can *also* be changed within `labs()`, using the argument (`x = ` and `y =` respectively).

![Accurate axis labels are incredibly important](images/gslides/217.png)
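As a minimal sketch of setting everything in one `labs()` call, the example below supplies the axis labels and a caption alongside the title. The caption text and the object name `p_labeled` are our own illustrative choices.

```{r}
library(ggplot2)

p_labeled <- ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  # set axis labels and a caption in the same labs() call as the title
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       x = "clarity",
       y = "proportion",
       caption = "Data: diamonds dataset from ggplot2")
p_labeled
```

Whether you use `ylab()` or `labs(y = ...)` is a matter of preference; keeping all labels in one place can make longer plot code easier to scan.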
 
### Themes

To change the overall aesthetic of your graph, there are 8 themes built into `ggplot2` that can be added as an additional layer in your graph:

![themes](images/gslides/218.png)

For example, if we wanted to remove the gridlines and grey background from the chart, we would use `theme_classic()`. Building on what we've already generated:

```{r}
ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
  ylab("proportion") +
  # change plot theme
  theme_classic()
```

![`theme_classic` changes aesthetic of our plot](images/gslides/219.png)

We now have a pretty good looking plot! However, a few additional changes would make this plot *even better* for communication.

Note: Additional themes are available from the [`ggthemes` package](https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/). Users can also generate their own themes.

### Custom Theme

In addition to using available themes, we can also adjust parts of the theme of our graph using an additional `theme()` layer. There are **a lot** of options within theme. To see them all, look at the help documentation within RStudio Cloud using: `?theme`. We'll simply go over the syntax for using a few of them here to get you comfortable with adjusting your theme. Later on, you can play around with all the options on your own to become an expert!

#### Altering text size

For example, if we want to increase text size to make our plots easier to view when presenting, we could do that within theme. Notice here that we're increasing the text size of the `title`, `axis.text`, `axis.title`, and `legend.text` all within `theme()`! The syntax here is important. Within each of the elements of the theme you want to alter, you have to specify what it is you want to change. Here, for all four, we want to alter the text, so we specify `element_text()`. Within that, we specify that it's `size` that we want to adjust. 

```{r}
ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
  ylab("proportion") +
  theme_classic() +
  # control theme
  theme(title = element_text(size = 16), 
        axis.text = element_text(size = 14),
        axis.title = element_text(size = 16),
        legend.text = element_text(size = 14))
```

![`theme()` allows us to adjust font size](images/gslides/220.png)

#### Additional text alterations

Changing the size of text on your plot is not the only thing you can control within `theme()`. You can make text **bold** and change its color within `theme()`. Note here that multiple changes can be made to a single element. We can change size and make the text **bold**. All we do is separate each argument with a comma, per usual.

```{r}
ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
  ylab("proportion") +
  theme_classic() +
  theme(title = element_text(size = 16), 
        axis.text = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        legend.text = element_text(size = 14),
        # additional control
        plot.subtitle = element_text(color = "gray30"))
```


![`theme()` allows us to tweak many parts of our plot](images/gslides/221.png)

Any alterations to plot spacing/background, title, axis, and legend will all be made within `theme()`.
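Because the same `theme()` settings are repeated across several plots, one option is to store them in an object and add that object to each plot. This is a sketch under our own naming: `presentation_theme` and `p_themed` are hypothetical names, not part of `ggplot2`.

```{r}
library(ggplot2)

# hypothetical name: a theme object bundling the settings used above
presentation_theme <- theme_classic() +
  theme(title = element_text(size = 16), 
        axis.text = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        legend.text = element_text(size = 14),
        plot.subtitle = element_text(color = "gray30"))

# add the stored theme like any other layer
p_themed <- ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  ylab("proportion") +
  presentation_theme
p_themed
```

This keeps a set of figures visually consistent and means a later change to the text size only has to be made once.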

### Legends

At this point, all the text on the plot is pretty visible! However, there's one thing that's still not quite clear to viewers. In daily life, people refer to the "cut" of a diamond by terms like "round cut" or "princess cut" to describe the *shape* of the diamond. That's not what we're talking about here when we're discussing "cut". In these data, "cut" refers to the quality of the diamond, not the shape. Let's be sure that's clear as well! We can change the name of the legend by using an additional layer and the `guides()` and `guide_legend()` functions of the `ggplot2` package!

```{r}
ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
  ylab("proportion") +
  theme_classic() +
  theme(title = element_text(size = 16), 
        axis.text = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        legend.text = element_text(size = 14),
        plot.subtitle = element_text(color = "gray30")) +
  # control legend
  guides(fill = guide_legend("cut quality")) 
```

![`guides()` allows us to change the legend title](images/gslides/222.png)

The `guides()` function, together with the `guide_*` functions (such as `guide_legend()`), allows us to modify legends even further. 

This is especially useful if you have many colors in your legend and you want to control how the legend is displayed in terms of the number of columns and rows using `ncol` and `nrow` respectively.

```{r}
ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
  ylab("proportion") +
  theme_classic() +
  theme(title = element_text(size = 16), 
        axis.text = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        legend.text = element_text(size = 14),
        plot.subtitle = element_text(color = "gray30")) +
  # control legend
  guides(fill = guide_legend("Cut Quality", 
                              ncol = 2))
```


Or, we can modify the font of the legend title using the `title.theme` argument of `guide_legend()`.

```{r}
ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
  ylab("proportion") +
  theme_classic() +
  theme(title = element_text(size = 16), 
        axis.text = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        legend.text = element_text(size = 14),
        plot.subtitle = element_text(color = "gray30")) +
  # control legend
  guides(fill = guide_legend("Cut Quality", 
                             title.theme = element_text(face = "bold")))
```

Alternatively, we can do this modification, as well as other legend modifications, like adding a rectangle around the legend, using the `theme()` function.

```{r}
ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
  ylab("proportion") +
  # changing the legend title:
  guides(fill = guide_legend("Cut Quality")) +
  theme_classic() +
  theme(title = element_text(size = 16), 
        axis.text = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        legend.text = element_text(size = 14),
        plot.subtitle = element_text(color = "gray30"),
        # changing the legend style:
        legend.title = element_text(face = "bold"),
        legend.background = element_rect(color = "black"))
```
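If all you need is a new legend title, another option is to name it through `labs()`, using the aesthetic the legend describes (`fill` here). The sketch below leaves out the theme settings to keep the focus on the legend; `p_legend` is our own object name.

```{r}
library(ggplot2)

p_legend <- ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  ylab("proportion") +
  # name the legend after the aesthetic it describes (fill)
  labs(fill = "Cut Quality")
p_legend
```

`guides()` remains the better fit when you also want to change the legend's layout, such as the number of columns.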


At this point, we have an informative title, clear colors, a well-labeled legend, and text that is large enough throughout the graph. This is certainly a graph that could be used in a presentation. We've taken it from a graph that is useful to just ourselves (exploratory) and made it into a plot that can communicate our findings well to others (explanatory)! 

We have touched on a number of alterations you can make by adding additional layers to a ggplot. In the rest of this lesson we'll touch on a few more changes you can make within `ggplot2`.

### Scales

There may be times when you want a different number of values to be displayed on an axis. The scale of your plot for **continuous variables** (i.e. numeric variables) can be controlled using `scale_x_continuous` or `scale_y_continuous`. Here, we want to increase the number of labels displayed on the y-axis, so we'll use `scale_y_continuous`:

```{r}
ggplot(diamonds) + 
  geom_bar(aes(x = clarity)) +
  # control scale for continuous variable
  scale_y_continuous(breaks = seq(0, 17000, by = 1000))
```

![Continuous scales can be altered](images/gslides/223.png)

There is a very handy argument called `trans` for the `scale_y_continuous` and `scale_x_continuous` functions that changes the scale of the axes. For example, it can be very useful to show the logarithmic version of the scale if you have very high values with large differences.

According to the documentation for the `trans` argument:

> Built-in transformations include "asn", "atanh", "boxcox", "date", "exp", "hms", "identity", "log", "log10", "log1p", "log2", "logit", "modulus", "probability", "probit", "pseudo_log", "reciprocal", "reverse", "sqrt" and "time".

```{r}
ggplot(diamonds) + 
  geom_bar(aes(x = clarity)) +
  # control scale for continuous variable
  scale_y_continuous(trans = "log10") +
    labs(y = "Count (log10 scale)",
         x = "Clarity")
```

Notice that the values are not changed, just the way they are plotted. Now the y-axis increases by a factor of 10 for each break.
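For the common log10 case, `ggplot2` also provides `scale_y_log10()` (and `scale_x_log10()`) as shorthand. The sketch below is intended to draw the same plot as the chunk above; `p_log` is our own object name.

```{r}
library(ggplot2)

# shorthand for scale_y_continuous(trans = "log10")
p_log <- ggplot(diamonds) + 
  geom_bar(aes(x = clarity)) +
  scale_y_log10() +
  labs(y = "Count (log10 scale)",
       x = "Clarity")
p_log
```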


We will create a plot of the price of the diamonds to demonstrate the utility of creating a plot with a log10 scaled y-axis.

```{r}
ggplot(diamonds) + 
  geom_boxplot(aes(y = price, x = clarity))

ggplot(diamonds) + 
  geom_boxplot(aes(y = price, x = clarity)) + 
  scale_y_continuous(trans = "log10") + 
  labs(y = "Price (log10 scale)",
       x = "Diamond Clarity")
```

In the first plot, it is difficult to tell what values the boxplots correspond to, and it is difficult to compare the boxplots (particularly for the last three clarity categories). Both are greatly improved in the second plot. 


We can also use another argument of the `scale_y_continuous()` function to add specific labels to our plot. For example, it would be nice to add dollar signs to the y-axis. We can do so using the `labels` argument. A variety of `label_*` functions within the `scales` package can be used to modify axis labels. See [here](https://scales.r-lib.org/reference/index.html) to take a look at the many options.

```{r}
ggplot(diamonds) + 
  geom_boxplot(aes(y = price, x = clarity)) + 
  scale_y_continuous(trans = "log10", 
                    labels = scales::label_dollar()) +
  labs(y = "Price (log10 scale)",
       x = "Diamond Clarity")
```
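
To get a feel for what a `label_*` function does before handing it to a scale, it can be called on its own. In this small sketch, `dollar_labeller` and `example_labels` are names chosen for illustration, and the input prices are arbitrary.

```r
library(scales)

# label_dollar() returns a function; calling that function formats numbers
dollar_labeller <- label_dollar()

# format a few arbitrary prices the way the axis labels would be formatted
example_labels <- dollar_labeller(c(326, 2401, 18823))
example_labels
```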

In the above plot, we might also want to order the boxplots by the median price. We can do so using the `fct_reorder` function of the `forcats` package, which changes the order of the `clarity` levels to be based on the median of the `price` values.

```{r}

ggplot(diamonds) + 
  geom_boxplot(aes(y = price, x = forcats::fct_reorder(clarity, price, .fun = median))) + 
  scale_y_continuous(trans = "log10", 
                    labels = scales::label_dollar()) +
  labs(y = "Price (log10 scale)",
       x = "Diamond Clarity")

```

Now we can more easily determine that the `SI2` diamonds have the highest median price.
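
The ordering in the plot can be cross-checked against the numbers. The following sketch computes the median price for each clarity level with `dplyr` and sorts from highest to lowest; `median_by_clarity` is an illustrative name.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# median price per clarity level, highest first
median_by_clarity <- diamonds %>%
  group_by(clarity) %>%
  summarize(median_price = median(price), .groups = "drop") %>%
  arrange(desc(median_price))

median_by_clarity
```

The first row of this table corresponds to the right-most boxplot in the reordered figure.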

To modify the axes for **discrete variables** (aka factors or categorical variables, where there is a limited number of levels), use `scale_x_discrete` or `scale_y_discrete`. In this case we will pick just a few of the clarity categories to plot and specify their order.

```{r, eval = FALSE}
ggplot(diamonds) + 
  geom_bar(aes(x = clarity)) +
  # control scale for discrete variable
  scale_x_discrete(limits = c("SI2", "SI1", "I1")) +
  scale_y_continuous(breaks = seq(0, 17000, by = 1000)) 
```

![Discrete scales can be altered](images/gslides/224.png)
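
Setting the limits of a discrete scale does two things at once: it keeps only the named categories and it places them in the given order. Rows belonging to other categories are dropped from the plot by ggplot2. A roughly equivalent approach, sketched here under the assumption that you prefer to prepare the data first, is to filter and re-level the factor with `dplyr`. The names `keep` and `clarity_counts` are illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# categories to keep, in the order they should appear
keep <- c("SI2", "SI1", "I1")

# subset the data and re-level the factor so the order matches `keep`
clarity_counts <- diamonds %>%
  filter(clarity %in% keep) %>%
  mutate(clarity = factor(clarity, levels = keep)) %>%
  count(clarity)

clarity_counts
```

These counts are the bar heights you should expect to see in the plot above.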


### Coordinate Adjustment

There are times when you'll want to flip your axes. This can be accomplished using `coord_flip()`. Adding this layer to the plot we just generated switches our x- and y-axes, allowing for horizontal bar charts rather than the default vertical bar charts:

```{r}
ggplot(diamonds) + 
  geom_bar(aes(x = clarity)) +
  scale_y_continuous(breaks = seq(0, 17000, by = 1000)) +
  scale_x_discrete(limits = c("SI2", "SI1", "I1")) +
  # flip coordinates
  coord_flip() +
  labs(title = "Number of diamonds by diamond clarity",
       subtitle = "Subset of all diamonds, looking at three levels of clarity") +
  theme_classic() +
  theme(title = element_text(size = 18), 
        axis.text = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        legend.text = element_text(size = 14),
        plot.subtitle = element_text(color = "gray30"))
```


![Axes can be flipped using `coord_flip`](images/gslides/225.png)

It's important to remember that all the additional alterations we already discussed can still be applied to this graph, because ggplot2 uses layering!

```{r}
p <- ggplot(diamonds) + 
  geom_bar(mapping = aes(x = clarity)) +
  scale_y_continuous(breaks = seq(0, 17000, by = 1000)) +
  scale_x_discrete(limits = c("SI2", "SI1", "I1")) +
  coord_flip() +
  labs(title = "Number of diamonds by diamond clarity",
       subtitle = "Subset of all diamonds, looking at three levels of clarity") +
  theme_classic() +
  theme(title = element_text(size = 18), 
        axis.text = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        legend.text = element_text(size = 14),
        plot.subtitle = element_text(color = "gray30"))
```

![Additional layers will help customize this plot](images/gslides/226.png)

### Annotation

Finally, there will be times when you'll want to add text to a plot or to annotate points on your plot. We'll discuss briefly how to accomplish that here!

To add text to your plot, we can use the function `annotate`. This requires us to specify that we want to annotate with "text" (rather than, say, a shape such as a rectangle, "rect", which you can also do!). Additionally, we have to specify what we'd like that text to say (using the `label` argument), where on the plot we'd like that text to show up (using `x` and `y` for coordinates), how we'd like the text aligned (using `hjust` for horizontal alignment, where the options are "left", "center", or "right", and `vjust` for vertical alignment, where the options are "top", "center", or "bottom"), and how big we'd like that text to be (using `size`):

```{r}
ggplot(diamonds) + 
  geom_bar(aes(x = clarity)) +
  scale_y_continuous(breaks = seq(0, 17000, by = 1000)) +
  scale_x_discrete(limits = c("SI2", "SI1", "I1")) +
  coord_flip() +
  labs(title = "Number of diamonds by diamond clarity",
       subtitle = "Subset of all diamonds, looking at three levels of clarity") +
  theme_classic() +
  theme(title = element_text(size = 18), 
        axis.text = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        legend.text = element_text(size = 14),
        plot.subtitle = element_text(color = "gray30")) +
  # add annotation
  annotate("text", label = "SI1 diamonds are among \n the most frequent clarity diamond", 
           y = 12800, x = 2.9, 
           vjust = "top", hjust = "right", 
           size = 6)
```

![`annotate` helps add text to our plot](images/gslides/227.png)
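
As mentioned, `annotate` can also draw shapes. The following sketch is not part of the original example: it uses a scatterplot of `carat` against `price`, and the rectangle corners and label position are arbitrary values chosen for illustration. It shades a region with "rect" and labels it with "text".

```r
library(ggplot2)

p_annotated <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  # shade a rectangular region; corners are given in data coordinates
  annotate("rect", xmin = 3, xmax = 5.1, ymin = 0, ymax = 19000,
           fill = "dodgerblue", alpha = 0.2) +
  # label the shaded region
  annotate("text", label = "diamonds over 3 carats",
           x = 4.05, y = 1000, size = 4)

p_annotated
```

Because `annotate` takes fixed values rather than a data frame, each call adds exactly one rectangle or one piece of text.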

Note: we could have accomplished this by adding an additional `geom`: `geom_text`. However, this requires creating a new dataframe, as explained [here](http://r4ds.had.co.nz/graphics-for-communication.html#annotations). This can also be used to **label the points on your plot**. Keep this reference in mind in case you have to do that in the future.

### Vertical and Horizontal Lines

Sometimes it is very useful to add a line to our plot to indicate an important threshold. We can do so by using the `geom_hline()` function for a horizontal line and `geom_vline()` for a vertical line. 

In each case, the functions require that a y-axis intercept or x-axis intercept be specified respectively.

For example, it might be useful to add a horizontal line to indicate 50% of the total counts for each of the `clarity` categories. We will also use the `scale_y_continuous()` function to change the y-axis to show percentages.

```{r}
ggplot(diamonds) + 
  # fill scales to 100%
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Percent of diamonds") +
  geom_hline(yintercept = 0.5, color = "red", size = 1)
```

Now, it is easier to tell that slightly over half of the `VVS2` diamonds have an `Ideal` cut. This would be much more difficult to see without the horizontal line. 


To add a vertical line we would instead use the `geom_vline()` function and we would specify an x-axis intercept. Since this plot has a discrete x-axis, numeric values specify a categorical value based on the order, thus a value of 4 would create a line down the center of the `VS2` bar. However, if we used 5.5 we could add a line offset from the center of a bar, as you can see in the following example:

```{r}
ggplot(diamonds) + 
  # fill scales to 100%
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Percent of diamonds") +
  geom_hline(yintercept = 0.5, color = "red", size = 1) +
  geom_vline(xintercept = 5.5, color = "black", size = .5)
```

This would be helpful if we wanted to especially point out differences between the last three clarity categories of diamonds compared to the other categories. 
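
The 50% reference line can be checked against the underlying proportions. This sketch computes, for each clarity category, the share of diamonds with an `Ideal` cut; `ideal_share` is an illustrative name.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# proportion of Ideal-cut diamonds within each clarity category
ideal_share <- diamonds %>%
  group_by(clarity) %>%
  summarize(prop_ideal = mean(cut == "Ideal"), .groups = "drop")

ideal_share
```

Categories with `prop_ideal` above 0.5 are the ones whose `Ideal` segment crosses the horizontal line in the plot.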

## Tables

While we have focused on figures here so far, tables can be incredibly informative at a glance too. If you are looking to display summary numbers, **a table can also visually display information**. 

For example, a table can give a quick breakdown of how many males and females there are in a dataset and what percentage of the total each gender makes up.

A few things to keep in mind when making tables are that it's best to:

* Limit the number of digits in the table
* Include a caption
* When possible, keep it simple.

![Table](images/gslides/228.png)

### Tables in R

Now that we have a good understanding of what to consider when making tables, we can practice making good tables in R. To do this, we'll continue to use the diamonds dataset (which is part of the `ggplot2` package). As a reminder, this dataset contains prices and other information about ~54,000 different diamonds. If we want to provide viewers with a summary of these data, we may want to provide information about diamonds broken down by the quality of the diamond's cut. To get our data in the form we want we will use the `dplyr` package, which we discussed in a lesson earlier.

### Getting the Data in Order

To start figuring out how the quality of the cut of the diamond affects the price of that diamond, we first have to get the data in order. To do that we'll use the `dplyr` package. This allows us to group the data by the quality of the cut (`cut`) before summarizing the data to determine the number of diamonds in each category (`N`), the minimum price of the diamonds in this category (`min`), the average price (`avg`), and the highest price in the category (`max`).

To get these data in order, you could use the following code:

```{r}
# get summary data for table in order
df <- diamonds %>% 
  group_by(cut) %>%
  dplyr::summarize(
    N = n(), 
    min = min(price), 
    avg = mean(price), 
    max = max(price),
    .groups = "drop"
  )
```

### An Exploratory Table

```{r}
# look at data
df
```

By getting the data summarized into a single object in R (`df`), we're on our way to making an informative table. However, this is clearly just an exploratory table. The output in R from this code follows some of the good table rules above, but not all of them. At a glance, it will help you to understand the data, but it's not the finished table you would want to send to your boss. 

![Exploratory diamonds table](images/gslides/229.png)

From this output, you, the creator of the table, would be able to see that there are a number of **good qualities**:

* there is a **reasonable number of rows and columns** - There are 5 rows and 5 columns. A viewer can quickly look at this table and determine what's going on.
* the first column `cut` is **organized logically** - The lowest quality diamond category is first and then they are ordered vertically until the highest quality cut (`Ideal`)
* comparisons are made **top to bottom** - To compare between the groups, your eye only has to travel up and down, rather than from left to right. 

There are also things that **need to be improved** on this table:

* **column headers** could be even more clear
* there's **no caption/title**
* it could be more **aesthetically pleasing**

### Improving the Table Output

By default, R outputs tables in the Console using a monospaced font. However, this limits our ability to format the appearance of the table. To fix the remaining few problems with the table's format, we'll use the `kable()` function from the R package `knitr` and the additional formatting capabilities of the R package `kableExtra`.

The first step to a prettier table just involves using the `kable()` function from the `knitr` package, which improves the readability of this table. The `knitr` package is *not* a core tidyverse package, so you'll have to be sure it's installed and loaded.

```{r message = FALSE}
# install.packages("knitr")
library(knitr)

kable(df)
```

![kable basic output](images/gslides/230.png) 

However, there are still a few issues we want to improve upon:

* column names could be more informative
* too many digits in the `avg` column
* caption/title is missing
* source of data not included

To begin addressing these issues, we can use the `add_header_above()` function from `kableExtra` to specify that the min, avg, and max columns refer to `price` in US dollars (USD). Additionally, `kable()` takes a `digits` argument that specifies how many decimal places to round to. This takes care of the display of too many digits in the `avg` column. Finally, we can style the table so that every other row is shaded using `kable_styling()` from `kableExtra`, which helps our eye keep each row's information separate from the other rows. These few changes really improve the readability of the table.

If you copy this code into your R console, the formatted table will show up in the Viewer tab at the bottom right-hand side of your RStudio console and the HTML code used to generate that table will appear in your console. 

```{r message = FALSE}
# install.packages("kableExtra")
library(kableExtra)

# use kable_styling to control table appearance
kable(df, digits = 0, format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "bordered")) %>% 
  add_header_above(c(" " = 2,  "price (USD)" = 3)) 
```  

![Viewer tab with formatted table](images/gslides/231.png)
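
The column names themselves can also be made more informative using the `col.names` argument of `kable()`. The sketch below rebuilds the summary so that it runs on its own; the replacement names and the object names `price_by_cut` and `renamed_table` are our own choices for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset
library(knitr)

# same summary as above, rebuilt here so the example is self-contained
price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarize(N = n(), min = min(price), avg = mean(price),
            max = max(price), .groups = "drop")

# supply one display name per column, in column order
renamed_table <- kable(price_by_cut, digits = 0,
                       col.names = c("Cut", "Number of diamonds",
                                     "Minimum", "Average", "Maximum"))
renamed_table
```

The result can be piped into `kable_styling()` and `add_header_above()` in the same way as before (with `format = "html"`), since only the displayed names change.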

### Annotating Your Table

We mentioned earlier that captions and sourcing your data are incredibly important. The `kable()` function has a `caption` argument. Below, an informative caption has been included. Additionally, `kableExtra` has a `footnote()` function. This allows you to include the source of your data at the bottom of the table. With these final additions, you have a table that clearly displays the data and could be confidently shared with your boss.

```{r}
kable(df, digits = 0, format = "html", caption = "Table 1: Diamond Price by Quality of Cut. The highest quality cut (Ideal) is the most common, while the highest average price is found among Premium cut diamonds") %>%
  kable_styling(bootstrap_options = c("striped", "bordered")) %>% 
  # add headers and footnote
  add_header_above(c(" " = 2,  "price (USD)" = 3)) %>% 
  footnote(general = "Diamonds dataset from ggplot2", general_title = "Source:", footnote_as_chunk = TRUE)
```  

![Viewer tab with annotated and formatted table](images/gslides/232.png)

## `ggplot2`: Extensions

Beyond the *many* capabilities of `ggplot2`, there are a few additional packages that build *on top of* `ggplot2`'s capabilities. We'll introduce a few packages here so that you can:

* directly annotate points on plots (`ggrepel` and `directlabels`)
* combine multiple plots (`cowplot` and `patchwork`)
* generate animated plots (`gganimate`)

These are referred to as `ggplot2` extensions. 

### `ggrepel`

[`ggrepel`](https://github.com/slowkow/ggrepel) "provides geoms for `ggplot2` to repel overlapping text labels."

To demonstrate the functionality within the `ggrepel` package and demonstrate cases where it would be needed, let's use a dataset available from R - the `mtcars` dataset: 

> The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

This dataset includes information about 32 different cars. Let's first convert this from a data.frame to a tibble. Note that we will keep the rownames and make it a new variable called `model`.

```{r}
# convert to a tibble, keeping the row names as a new column called model
mtcars <- mtcars %>%
  as_tibble(rownames = "model")

# see first 6 rows of mtcars
head(mtcars)
```

What if we were to plot a scatterplot of horsepower (`hp`) against weight (`wt`) for each car and wanted to label each point in that plot with the car model?

If we were to do that with `ggplot2`, we'd see the following:

```{r}
# label points without ggrepel
ggplot(mtcars, aes(wt, hp, label = model)) +
  geom_text() +
  geom_point(color = 'dodgerblue') +
  theme_classic()
```

The overall trend is clear here - the more a car weighs, the more horsepower it tends to have. However, many of the labels are overlapping and impossible to read - this is where `ggrepel` plays a role:

```{r, message = FALSE}
# install and load package
# install.packages("ggrepel")
library(ggrepel)

# label points with ggrepel
ggplot(mtcars, aes(wt, hp, label = model)) +
  geom_text_repel() +
  geom_point(color = 'dodgerblue') +
  theme_classic()
```

The only change to the code here is that we replaced `geom_text()` with `geom_text_repel()`. This, like `geom_text()`, adds text directly to the plot. However, it also helpfully repels overlapping labels away from one another and away from the data points on the plot.


#### Custom Formatting

Within `geom_text_repel()`, there are a number of additional formatting options available. We'll cover a number of the most important here, but the [`ggrepel` vignettes](https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html) explore these further.


**Highlighting Specific Points**

Often, you do not want to highlight *all* the points on a plot, but want to draw your viewer's attention to a few *specific* points. To do this and, say, highlight only the Mercedes cars, you could use the following approach:

```{r}
# create a new column "merc" with true or false for Mercedes
# value is true for rows with "Merc" in model column
mtcars <- mtcars %>%
  mutate(merc = str_detect(string = model, pattern = "Merc"))

# Let's just label these items and manually color the points
ggplot(mtcars, aes(wt, hp, label = model)) +
  geom_point(aes(color = merc)) +
  scale_color_manual(values = c("grey50", "dodgerblue")) +
  geom_text_repel(data = filter(mtcars, merc == TRUE),
        nudge_y = 1, 
        hjust = 1,
        direction = "y") +
  theme_classic() +
  theme(legend.position = "none")

```

Here, notice that we first create a new column in our dataframe called `merc`. This column is `TRUE` if "Merc" appears in the car's model name and `FALSE` otherwise.

We then specify that only these Mercedes cars should be labeled *and* that only their points should be colored blue. This allows us to focus in on those few points we're interested in. And, we can see that among the cars in this dataset, Mercedes tend to be of average weight but have varying horsepower, depending on the model of the car.
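
To see what `str_detect()` returns before using it inside `mutate()`, it can be applied directly to a character vector. This sketch uses the row names of the original `mtcars` data frame from the `datasets` package; `models` and `is_merc` are illustrative names.

```r
library(stringr)

# car models, taken from the row names of the built-in data frame
models <- rownames(datasets::mtcars)

# one TRUE/FALSE value per model: does the name contain "Merc"?
is_merc <- str_detect(string = models, pattern = "Merc")

# the models that will be labeled, and how many there are
models[is_merc]
sum(is_merc)
```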

**Text Alignment**

Other times, you want to ensure that your labels are aligned on the top or bottom edge, relative to one another. This can be controlled using the `hjust` and `vjust` arguments. The values for particular alignment are:  

* `hjust = 0`   |  to left-align  
* `hjust = 0.5` |  to center  
* `hjust = 1`   |  to right-align  

Additionally, you can adjust the starting position of text vertically with `nudge_y` (or horizontally with `nudge_x`). To use this, you'll specify the distance from the point to the label you want to use.

You can also allow the labels to move horizontally with `direction = "x"` (or vertically with `direction = "y"`). This will put the labels at a right angle from the point. The default here is "both", which allows labels to move in both directions.

For example, what if we wanted to plot the relationship between quarter mile time (`qsec`) and miles per gallon (`mpg`) to identify cars that get the best gas mileage?

To do this, we're specifying within `geom_text_repel()` the following arguments:  

* `data` | only label those cars with gas mileage > 30 mpg  
* `nudge_y` | position all labels so that they're vertically aligned  
* `hjust` | center-align the labels  
* `direction` | allow labels to move horizontally  

For further customization, we're also changing the segment color from the default black to a light gray ("gray60").


```{r}
# customize within geom_text_repel
# first create a new column for mpg > 30 within mtcars and pipe this into ggplot
mtcars %>%
    mutate(mpg_highlight = case_when(mpg > 30 ~ "high", mpg < 30 ~ "low")) %>%
    ggplot(aes(qsec, mpg, label = model)) +
      geom_point(aes(color = mpg_highlight)) +
      scale_color_manual(values = c( "dodgerblue", "black")) +
      theme_minimal() +
      theme(legend.position = "none") +
      geom_text_repel(data = mtcars %>% filter(mpg > 30), 
        nudge_y = 3, 
        hjust = 0.5, 
        direction = "x",
        segment.color = "gray60") +
      scale_x_continuous(expand = c(0.05, 0.05)) +
      scale_y_continuous(limits = c(9, 38))
```
  
Notice that we've also had to provide the plot with more room by customizing the x- and y- axes, using the final two lines of code you see above.

With this plot, it is clear that there are four cars with mpg > 30. And, among these, we now know that the Toyota Corolla gets the best gas mileage, while the Lotus Europa has the fastest quarter mile time (the lowest `qsec`), all thanks to direct labeling of the points!
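
Because `qsec` is a time in seconds, a lower value means a faster car. To check the labeled cars against the underlying numbers, a small sketch like the following lists the cars with `mpg > 30`, sorted from fastest to slowest quarter mile time. It starts from the original `mtcars` data frame in the `datasets` package so that it runs on its own; `high_mpg` is an illustrative name.

```r
library(dplyr)
library(tibble)

# cars with gas mileage above 30 mpg, fastest quarter mile time first
high_mpg <- datasets::mtcars %>%
  as_tibble(rownames = "model") %>%
  filter(mpg > 30) %>%
  select(model, mpg, qsec) %>%
  arrange(qsec)

high_mpg
```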

### `directlabels`

The `directlabels` package also helps you add labels directly to plots. Its functions add labels that generally don't overlap using less code than `ggrepel`; however, there are fewer specification options.

There are several method options for adding direct labels to scatter plots, such as: `first.points` (which will place the label on the left in a scatterplot), and `last.points` (which will place the label on the right in a scatterplot).

```{r}
# install.packages("directlabels")
library(directlabels)
mtcars %>%
  mutate(mpg_highlight = case_when(mpg > 30 ~ "high", mpg < 30 ~ "low")) %>%
  ggplot(aes(qsec, mpg, label = model)) +
  geom_point(aes(color = mpg_highlight)) +
  scale_color_manual(values = c("dodgerblue", "black")) +
  scale_x_continuous(expand = c(0.05, 0.05)) +
  scale_y_continuous(limits = c(NA, 36)) +
  geom_dl(data = filter(mtcars, mpg > 30), aes(label = model), 
          method = list(c("first.points"),
                        cex = 1)) +
  theme_minimal() +
  theme(legend.position = "none")

mtcars %>%
  mutate(mpg_highlight = case_when(mpg > 30 ~ "high", mpg < 30 ~ "low")) %>%
  ggplot(aes(qsec, mpg, label = model)) +
  geom_point(aes(color = mpg_highlight)) +
  scale_color_manual(values = c("dodgerblue", "black")) +
  scale_x_continuous(expand = c(0.05, 0.05)) +
  scale_y_continuous(limits = c(NA, 36)) +
  geom_dl(data = filter(mtcars, mpg > 30), aes(label = model), 
          method = list(c("last.points"),
                        cex = 1)) +
  theme_minimal() +
  theme(legend.position = "none")
```

This package is especially useful for labeling lines in a lineplot. There are several methods, one of which is the `angled.boxes` method. This often negates the need for a legend.

```{r}
ggplot(mtcars, aes(qsec, mpg, color = cyl, group = cyl)) + 
  geom_line() +
  geom_dl(aes(label = cyl), 
          method = list(c("angled.boxes"),
                        cex = 1)) +
  theme_minimal()+
  theme(legend.position = "none") +
  labs(title = "Differences in cars with 4, 6, or 8 cylinders")
```

See [here](https://cran.r-project.org/web/packages/directlabels/directlabels.pdf) for more information about this package.

### `cowplot`

Beyond customization within `ggplot2` and labeling points, there are times you'll want to arrange multiple plots together in a single grid, applying a standard theme across all plots. This often occurs when you're ready to present or publish your visualizations. At that point, you want to be sure your visualizations convey your point to the viewer in as simple a plot as possible. And, if you're presenting multiple plots, you typically want them to have a consistent theme from one plot to the next, so that the viewer can focus on the data you're plotting rather than on changes in theme. The `cowplot` package assists in this process!


#### Theme

The standard theme within `cowplot`, which works for many types of plots, is `theme_cowplot()`. This theme is very similar to `theme_classic()`, which removes the background color, removes grid lines, and plots only the x- and y-axis lines (rather than a box around all the data). We'll use this theme for the examples from this package. However, note that there are a number of additional themes available from the `cowplot` package. We will pass the number 12 to this function to indicate that we want to use a font size of 12.

```{r message = FALSE}
# install and load package
# install.packages("cowplot")
library(cowplot)
```

We'll continue to use the `mtcars` dataset for these examples. Here, using the `forcats` package (which is part of the core tidyverse), we'll add two new columns: `transmission`, where we recode the `am` column to be "automatic" if `am == 0` and "manual" if `am == 1`, and `engine`, where we recode the `vs` column to be "v-shaped" if `vs == 0` and "straight" if `vs == 1`.

```{r}
mtcars <- mtcars %>%
  mutate(transmission = fct_recode(as.factor(am), "automatic" = "0", "manual" = "1"),
         engine = fct_recode(as.factor(vs), "v-shaped" = "0", "straight" = "1"))
```

We'll use these data to generate a scatterplot plotting the relationship between 1/4 mile time (`qsec`) and gas mileage (`mpg`). We'll color the points by this new column `transmission` and apply `theme_cowplot`. Finally, we'll assign this to `p1`, as we'll ultimately generate a few different plots.

```{r}
p1 <- ggplot(mtcars, aes(qsec, mpg, color = transmission)) + 
  geom_point() +
  theme_cowplot(12) + 
  theme(legend.position = c(0.7, 0.2))
p1
```

Let's make a similar plot, but color by engine type. We'll want to manually change the colors here so that we aren't using the same colors for transmission and engine. We'll store this in `p2`.

```{r}
p2 <- ggplot(mtcars, aes(qsec, mpg, color = engine)) + 
  geom_point() +
  scale_color_manual(values = c("red", "blue")) +
  theme_cowplot(12) +
  theme(legend.position = c(0.7, 0.2))
p2
```

Great - we've now got two plots with the same theme and similar appearance. What if we wanted to combine these into a single grid for presentation purposes?

#### Multiple Plots

Combining plots is made simple within the `cowplot` package using the `plot_grid()` function:

```{r}
# plot side by side
plot_grid(p1, p2, ncol = 2)
```

Here, we specify the two plots we'd like to plot on a single grid and we also optionally include how many columns we'd like using the `ncol` parameter. 

To plot these one on top of the other, you could specify for `plot_grid()` to use a single column.

```{r}
# plot on top of one another
plot_grid(p1, p2, ncol = 1)
```

Note that by default, the plots will share the space equally, but it's also possible to make one larger than the other within the grid using `rel_widths` and `rel_heights`.

For example, if you had a faceted plot summarizing the number of cars by transmission and engine (which we'll call `p3`):

```{r}
# generate faceted plot
p3 <- ggplot(mtcars, aes(engine)) + 
  geom_bar() + 
  facet_wrap(~transmission) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  theme_minimal_hgrid(12) +
  panel_border(color = "black") +
  theme(strip.background = element_rect(fill = "gray80"))
p3
```

Note that for this plot we've chosen a different theme, allowing for horizontal grid lines. This can be helpful when visualizing bar plots.

If we were to plot these next to one another using the defaults, the faceted plot would be squished:

```{r}
# plot next to one another
plot_grid(p1, p3)
```

We can use `rel_widths` to specify the width of the plot on the left relative to the plot on the right:

```{r}
# control relative spacing within grid
plot_grid(p1, p3, rel_widths = c(1, 1.3))
```

Notice how the plot on the left is now a bit narrower and the plot on the right is a bit wider.
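
Only the ratio of the values given to `rel_widths` matters. As a quick illustration in base R (the name `width_share` is ours), the share of the total width each plot receives can be computed like this:

```r
# relative widths as passed to plot_grid()
rel_widths <- c(1, 1.3)

# fraction of the total grid width given to each plot
width_share <- rel_widths / sum(rel_widths)
round(width_share, 3)
```

So the left plot takes a little under half of the grid and the right plot a little over half.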

**Adding Labels**

Within these grids, you'll often want to label these plots so that you can refer to them in your report or presentation. This can be done using the `labels` parameter within `plot_grid()`:

```{r}
# add A and B labels to plots
plot_grid(p1, p3, labels = "AUTO", rel_widths = c(1, 1.3))
```

**Adding Joint Titles**

Finally, when generating grids with multiple plots, you'll often want a single title to explain what's going on across the plots. Here, the process *looks* slightly confusing, but this is only because we're putting all of these `cowplot` pieces together.

Generally, there are three steps:  

1. Create grid with plots  
2. Create title object  
3. Piece title and grid of plots together  

```{r}
# use plot_grid to generate plot with 3 plots together
first_col <- plot_grid(p1, p2, nrow = 2, labels = c('A', 'B'))
three_plots <- plot_grid(first_col, p3, ncol = 2, labels = c('', 'C'), rel_widths = c(1, 1.3))

# specify title
title <- ggdraw() + 
  # specify title and alignment
  draw_label("Transmission and Engine Type Affect Mileage and 1/4 mile time",
              fontface = 'bold', x = 0, hjust = 0) +
  # add margin on the left of the drawing canvas,
  # so title is aligned with left edge of first plot
  theme(plot.margin = margin(0, 0, 0, 7))

# put title and plots together
plot_grid(title, three_plots, ncol = 1, rel_heights = c(0.1, 1))
```

And, just like that we've got three plots, labeled, spaced out nicely in a grid, with a shared title, all thanks to the functionality within the `cowplot` package.

### `patchwork`

The `patchwork` package is similar to the `cowplot` package in that they both are helpful for combining plots together. They each allow for different plot modifications, so it is useful to know about both packages.

With the `patchwork` package, plots can be combined using a few operators, such as `+`, `/`, and `|`.

To combine two plots together we can simply add them together with the `+` sign or place them next to one another using the `|`:

```{r}
# install.packages("patchwork")
library(patchwork)
p1 + p2
p1 | p2
```

If we want a plot above another plot we can use the "/" symbol:


```{r}
p1 / p2
```

Grouping or nesting plots together using parentheses can result in two or more plots taking up a single grid space.

Thus, to combine multiple plots in a more complicated layout, one can combine two plots on one row and have a third plot on another row like this:

```{r}
(p3 + p2) / p1
```

Without the parentheses we would have the following:

```{r}
p3 + p2 / p1
```
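
The difference comes from R's usual operator precedence: `/` binds more tightly than `+`, so `p3 + p2 / p1` is read as `p3 + (p2 / p1)`. This can be confirmed without drawing anything by inspecting the unevaluated expression in base R; `expr` is just an illustrative name.

```r
# capture the expression without evaluating it
expr <- quote(p3 + p2 / p1)

# the outermost operation is `+` ...
as.character(expr[[1]])

# ... and its right-hand side is the nested `p2 / p1`
as.character(expr[[3]][[1]])
deparse(expr[[3]])
```

In other words, without parentheses `p2` is stacked on top of `p1` first, and that stack is then placed next to `p3`.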

You can also add empty plot spacers using the `plot_spacer()` function like so:

```{r}
(plot_spacer() + p2 + plot_spacer()) / p1
```

You can modify the widths of the plots using the `widths` argument of the `plot_layout()` function. In the following example we will make the width of the plot on the left 2 times that of the plot on the right. Any numeric values will do; it is the ratio of the numbers that makes the difference.

Thus, both `p1 + p2 + plot_layout(widths = c(2, 1))` and `p1 + p2 + plot_layout(widths = c(60, 30))` will result in the same relative size difference between `p1` and `p2`.

```{r}
p1 + p2 + plot_layout(widths = c(2, 1))
p1 + p2 + plot_layout(widths = c(60, 30))
```

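The relative heights of the plots can also be modified using the `heights` argument of the same function. Because relative heights only make a visible difference when plots sit in separate rows, we stack the plots with `/` here.

```{r}
p1 / p2 + plot_layout(heights = c(2, 1))
p1 / p2 + plot_layout(heights = c(60, 30))
```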

This package also allows for modification of legends. For example, legends can be gathered together to one side of the combined plots using the `guides = 'collect'` argument of the `plot_layout()` function.

```{r}
p1 + p2 + plot_layout(guides = "collect")
```

You can also specify the number of columns or rows using this same function with the `ncol` or `nrow` as you would with `facet_wrap()` of the `ggplot2` package, where plots are added to **complete a row** before they will be added to a new row. For example, the following will result in an empty 2nd row below the plots.

```{r}
p1 + p2 + plot_layout(nrow = 2, ncol = 2, guides = "collect")
```

However, the `byrow = FALSE` argument can disrupt this behavior and result in an empty 2nd column:

```{r}
p1 + p2 + plot_layout(nrow = 2, ncol = 2, byrow = FALSE, guides = "collect")
```

In this case the **columns will be preferentially completed** before placing a plot in a new column.
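
The fill order can be pictured with base R's `matrix()`, which has its own `byrow` argument. This is only an analogy, not how `patchwork` is implemented: the labels `"p1"` and `"p2"` are stand-ins for the plots, and `NA` marks an empty grid cell.

```{r}
# two plots placed into a 2 x 2 grid; NA marks an empty cell
cells <- c("p1", "p2", NA, NA)

# rows are completed first: the second row stays empty
fill_by_row <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE)
fill_by_row

# columns are completed first: the second column stays empty
fill_by_col <- matrix(cells, nrow = 2, ncol = 2, byrow = FALSE)
fill_by_col
```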


We can also use the package to change the theme specifications of specific plots. By default the plot just prior to the `theme()` function will be the only plot changed.

```{r}
p1 + p2 + theme(legend.position = "bottom") + p2
```

Using the `*` operator, a theme can be added to all plots that are nested together. 

```{r}
(p1 + p2) * theme_bw() + p2
```


The `&` operator adds a theme to all plots in the combined layout:

```{r}
(p1 + p2) + p2 & theme(axis.title.x = element_text(face = "bold"))
```


Annotations for all the plots combined can also be added using the `plot_annotation()` function, which can also take `theme()` function specifications with the `theme` argument.

```{r}

(p1 + p2) + p3 + theme(axis.text.x = element_text(angle = 90)) +
  plot_annotation(title = "Plot title for all 3 plots", 
                  theme = theme(plot.title = element_text(hjust = 0.5)))

```


See [here](https://patchwork.data-imaginist.com/) for more information about the `patchwork` package.

### `gganimate`

The final `ggplot2` extension we'll discuss here is [`gganimate`](https://gganimate.com/). This extends the grammar of graphics implemented in the `ggplot2` package to work with animated plots. To get started, you'll need to install and load the package:

```{r}
library(gganimate)
```

The `gganimate` package adds functionality by providing a number of new grammar classes. The following classes are available, each with its corresponding functionality:

* `transition_*()`: specifies how data should transition
* `enter_*()`/`exit_*()`: specifies how data should appear and disappear
* `ease_aes()`: specifies how different aesthetics should be eased during transitions
* `view_*()`: specifies how positional scales should change
* `shadow_*()`: specifies how points from a previous frame should be displayed in a current frame

We'll walk through these grammar classes using the `mtcars` dataset.

#### Example: `mtcars`

As noted, `gganimate` builds on top of the functionality within `ggplot2`, so the code should look pretty familiar by now. 

First, using the `mtcars` dataset we generate a static boxplot looking at the relationship between the number of cylinders a car has (`cyl`) and its gas mileage (`mpg`).

```{r}
# generate static boxplot
ggplot(mtcars) + 
  geom_boxplot(aes(factor(cyl), mpg))
```

But what if we wanted to understand the relationship between those two variables and the number of gears a car has (`gear`)?

One option would be to use faceting:

```{r}
# facet by gear
ggplot(mtcars) + 
  geom_boxplot(aes(factor(cyl), mpg)) +
  facet_wrap(~gear)
```

Alternatively, we could animate the plot using `gganimate`, so that a single plot rotates between each of these three plots. 

##### Transitions

We'll transition in this example between frames by gear:

```{r message = FALSE}
# animate it!
mtcars %>%
  mutate(gear = factor(gear),
         cyl = factor(cyl)) %>%
  ggplot() + 
  geom_boxplot(aes(cyl, mpg)) + 
  transition_manual(gear)
```

Note that here, `transition_manual()` is a new grammar class that we add right on top of our `ggplot2` plot code! Within this grammar class, we specify the variable upon which we'd like the transition in the animation to occur.

This means that the number of frames in the animation will be the number of levels in the variable by which you're transitioning. Here, there will be 3 frames, as there are three different levels of `gear` in the `mtcars` dataset.
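
A quick way to anticipate the number of frames is to count the distinct values of the transition variable before animating. A minimal sketch in base R:

```{r}
# distinct values of gear, and how many cars fall into each
gear_counts <- table(mtcars$gear)
gear_counts

# one frame per distinct value
n_frames <- length(gear_counts)
n_frames
```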

`transition_manual()` is not the only option for transition. In fact, there are a number of different `transition_*()` options, depending upon your animation needs. One of the most commonly used is `transition_states()`. The animation will plot the same information as the example above; however, it allows for finer control of the animation:

```{r message = FALSE}
# animate it!
anim <- ggplot(mtcars) + 
  geom_boxplot(aes(factor(cyl), mpg)) + 
  transition_states(gear,
                    transition_length = 2,
                    state_length = 1)

anim
```

Note here that we've stored the animated plot, including the `transition_states()` layer, in the object `anim`. We'll build on this object below.

##### Labeling

Currently, it's hard to know which frame corresponds to which `gear`. To make this easier on the viewer, let's label our animation:

```{r message = FALSE}
# animate it!
anim <- anim +  
  labs(title = 'Now showing gear: {closest_state}')

anim
```

Note that we're referring to the appropriate gear within the animation frame by specifying `{closest_state}`.

##### Easing

Easing determines how the change from one value to another should occur. The default is linear, but there are other options that will control the appearance of progression in your animation. For example, 'cubic-in-out' allows for the start and end of the animation to have a smoother look.

```{r message = FALSE}
anim <- anim + 
  # Slow start and end for a smoother look
  ease_aes('cubic-in-out')

anim
```

There are a number of easing functions, all of which are listed within the documentation, which can be viewed, as always, using the `?` help function: `?ease_aes`.

##### Enter & Exit

Building on top of easing, we can also control how the data appears (enters) and disappears (exits), using `enter_*` and `exit_*`:

```{r message = FALSE}
anim <- anim + 
  # fade to enter
  enter_fade() +
  # shrink on exit
  exit_shrink()

anim
```

The changes are subtle but you'll notice that on transition the data fades in to appear and shrinks upon exit.

##### Position Scales

To demonstrate how positional scales can be adjusted, let's take a look at a scatterplot. Here, we're plotting the relationship between quarter-mile time (`qsec`) and miles per gallon (`mpg`), and we'll be transitioning between gear.

The static plot would be as follows:
```{r}
ggplot(mtcars) + 
  geom_point(aes(qsec, mpg))
```

However, upon adding animation, we see how the data changes by gear.

```{r message = FALSE}
# animate it!
scat <- ggplot(mtcars) + 
  geom_point(aes(qsec, mpg)) + 
  transition_states(gear,
                    transition_length = 2,
                    state_length = 1) +
  labs(title = 'Now showing gear: {closest_state}')

scat
```

However, the x- and y-axes remain constant throughout the animation.

If you wanted to allow these positional scales to adjust between each gear level, you could use `view_step`:

```{r message = FALSE}
# allow positional scaling for each level in gear
scat +
  view_step(pause_length = 2, 
            step_length = 1, 
            nsteps = 3, 
            include = FALSE)
```


#### Example: gapminder

One of the most famous examples of an animated plot uses the gapminder dataset. This dataset includes economic data for countries all over the world. The animated plot here demonstrates the relationship between GDP (per capita) and life expectancy over time, by continent. Note that the year whose data is being displayed is shown at the top left portion of the plot.

```{r message = FALSE}
# install.packages("gapminder")
library(gapminder)

gap <- ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, colour = country)) +
  geom_point(alpha = 0.7, show.legend = FALSE) +
  scale_colour_manual(values = country_colors) +
  scale_size(range = c(2, 12)) +
  scale_x_log10() +
  facet_wrap(~continent, ncol = 5) +
  theme_half_open(12) +
  panel_border() + 
  # animate it!
  labs(title = 'Year: {frame_time}', x = 'GDP per capita', y = 'life expectancy') +
  transition_time(year)

gap
```

Note that in this example, we're now using `transition_time()` rather than `transition_states()`. This is a variant of `transition_states()` that is particularly useful when the states represent points in time, such as the years we're animating through in the plot above. The transition length is set to correspond to the time difference between the points in the data.

##### Shadow

However, what if we didn't want the data to completely disappear from one frame to the next and instead wanted to see the pattern emerge over time? We didn't demonstrate this using the `mtcars` dataset because each observation in that dataset is a different car. However, here, with the gapminder dataset, where we're looking at a trend over time, it makes more sense to include a trail.

To do this, we would use `shadow_*`. Here, we'll use `shadow_trail()` to see a trail of the data from one frame to the next:

```{r message = FALSE}
gap +
 shadow_trail(distance = 0.05,
              alpha = 0.3)
```

Here, `distance` specifies the temporal distance between the frames to show and `alpha` specifies the transparency of the trail, so that the current frame's data is always discernible.

## Case Studies

At this point, we've done a lot of work with our case studies. We've introduced the case studies, read them into R, and have wrangled the data into a usable format. Now, we get to peek at the data using visualizations to better understand each dataset's observations and variables! 

Let's start by loading our wrangled tidy data that we previously saved:

```{r}
load(here::here("data", "tidy_data", "case_study_1_tidy.rda"))

load(here::here("data", "tidy_data", "case_study_2_tidy.rda"))
```

### Case Study #1: Health Expenditures

We've now got the data in order so that we can start to explore the relationships between the variables contained in the health care dataset (`hc`) to answer our questions of interest:

1. Is there a relationship between health care coverage and health care spending in the United States?
2. How does the spending distribution change across geographic regions in the United States?
3. Does the relationship between health care coverage and health care spending in the United States change from 2013 to 2014?

```{r}
# see health care data
hc
```

As a reminder, we have state level data, broken down by year and type of insurance. For each, we have the total number of individuals who have health care coverage (`tot_coverage`), the amount spent on coverage (`tot_spending`), the proportion of individuals covered (`prop_coverage`), and the amount spent per capita (`spending_capita`). Additionally, we have the state name (`Location`), the two letter state abbreviation  (`abb`), and the region of the United States where the state is located (`region`). Let's get visualizing!

#### Exploratory Data Analysis (EDA)

To first get a sense of what information we do and do not have, the `visdat` package can be very helpful. This package uses `ggplot2` to visualize missingness in a dataframe. For example, `vis_dat()` takes the dataframe as an input and visualizes the observations on the left and the variables across the top. Solid colors indicate that a value is present. Each type of variable is represented by a different color. Grey lines indicate missing values.

```{r}
# install.packages("visdat")
library(visdat)

vis_dat(hc)
```

We now have a sense that a few states, for some years, are missing coverage data, which also affects the ability to calculate the proportion covered.

To see these values highlighted more specifically, we can use the related `vis_miss()` function.

```{r}
vis_miss(hc)
```

Here we see that missing values only occur 0.6% of the time, with 3.1% of the observations missing entries for `tot_coverage` and `prop_coverage`. So, all in all, there is not a lot of missing data, but we still want to be sure we understand where missingness occurs before answering our questions.
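
The two kinds of percentages that `vis_miss()` reports can also be computed by hand. Since `hc` is specific to this case study, the sketch below uses R's built-in `airquality` dataset as a stand-in; the same calls should apply to `hc`:

```{r}
# stand-in data: airquality ships with R and contains missing values
dat <- airquality

# overall share of cells that are missing
overall_missing <- mean(is.na(dat))
overall_missing

# share of missing values within each variable, as a percentage
missing_by_var <- colMeans(is.na(dat))
round(100 * missing_by_var, 1)
```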

Let's use `dplyr` to see *which* observations have missing data:

```{r}
hc %>%
  filter(is.na(tot_coverage))
```

Ah, so we see that the "Other" type of coverage is missing in both 2013 and 2014 for a subset of states. We'll be focusing on the non-"Other" types of health care coverage, so this shouldn't pose a problem, but it is good to know!

Taking this one step further, let's skim the entire dataset to get a sense of the information stored in each variable:

```{r}
library(skimr)

# get summary of the data
skim(hc)
```

At a glance, by looking at the `hist` column of the output for our numeric/integer variables, we see that there is a long right tail in `tot_coverage` and `tot_pop`. 

We see that proportion coverage varies from 0.0054 (`p0`) to 0.61 (`p100`). So, we'll want to know which states are at each end of the spectrum here. 

We also see that for `tot_spending`, the `mean` is 49004 and the median (`p50`) is slightly lower and that this variable also has a long right tail (`hist`).
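
A mean that sits above the median is a quick numerical hint of a long right tail, because a few large values pull the mean upward. A minimal sketch, using horsepower (`hp`) from the built-in `mtcars` dataset as a stand-in for `tot_spending`:

```{r}
# stand-in variable with a right tail
hp <- mtcars$hp

mean_hp   <- mean(hp)
median_hp <- median(hp)

# compare the two summaries; a few large values pull the mean upward
c(mean = mean_hp, median = median_hp)
mean_hp > median_hp
```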

We also know that these data come from two different years, so we can group by `year` and again summarize the data: 

```{r}
# group by year
hc %>% 
  group_by(year) %>%
  skim()
```

At a glance, there doesn't appear to be a huge difference in the variables from one year to the next, but we'll explore this again in a bit.

With at least some understanding of the data in our dataset, let's start generating exploratory plots that will help us answer each of our questions of interest.

#### Q1: Relationship between coverage and spending?

To answer the question:

> Is there a relationship between health care coverage and health care spending in the United States?

We'll have to visualize coverage and spending data across the United States.

And, to understand the relationship between two continuous variables - coverage and spending - a scatterplot will be the most effective visualization.

We'll first look at a scatterplot:

```{r}
hc %>%
  filter(type == "Employer", 
         year == "2013") %>% 
  ggplot(aes(x = spending_capita, 
             y = prop_coverage)) +
  geom_point() + 
  labs(x = "spending per capita",
       y = "coverage proportion")
```

We see that there appears to be some relationship, with those states that spend more per capita also having higher proportions of their population having health care coverage.

We can continue to improve this plot to better understand the underlying data. For example, we can add a best-fit line using `geom_smooth()` to visualize the magnitude of the linear relationship:

```{r}
# generate scatterplot with fit line
hc %>%
  filter(type == "Employer", 
         year == "2013") %>% 
  ggplot(aes(x = spending_capita, 
             y = prop_coverage)) + 
  geom_point() + 
  labs(x = "spending per capita",
       y = "coverage proportion") +
  geom_smooth(method = "lm", col = "red")
```
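
The line drawn by `geom_smooth(method = "lm")` is an ordinary least-squares fit, so its slope can be inspected directly with `lm()`. Because `hc` is only available within this case study, the sketch below uses `mtcars` as a stand-in, with `wt` and `mpg` playing the roles of `spending_capita` and `prop_coverage`:

```{r}
# fit the same kind of straight line that geom_smooth(method = "lm") draws
fit <- lm(mpg ~ wt, data = mtcars)

# intercept and slope of the fitted line
coef(fit)

# correlation as a unitless summary of the linear relationship
cor(mtcars$wt, mtcars$mpg)
```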

Beyond that, we likely want to know which point represents which state, so we can add state labels:

```{r}
# add state abbreviation labels
hc %>%
  filter(type == "Employer", 
         year == "2013") %>% 
  ggplot(aes(x = spending_capita, 
             y = prop_coverage)) + 
  geom_point() + 
  labs(x = "spending per capita",
       y = "coverage proportion") +
  geom_smooth(method = "lm", col = "red") + 
  geom_text(aes(label=abb), 
            nudge_x = 150)
```

From there, it'd likely be helpful to have information from each region:

```{r}
# color by region
hc %>%
  filter(type == "Employer", 
         year == "2013") %>% 
  ggplot(aes(x = spending_capita, 
             y = prop_coverage,
             color = region)) + 
  geom_point() + 
  labs(x = "spending per capita",
       y = "coverage proportion") +
  geom_smooth(method = "lm", col = "red") + 
  geom_text(aes(label=abb), 
            nudge_x = 150, 
            show.legend = FALSE)
```

So far, we've only been focusing on data from 2013. What about looking at data from both 2013 and 2014? We can do that using `facet_wrap()`:

```{r}
# facet by year
hc %>%
  filter(type == "Employer") %>% 
  ggplot(aes(x = spending_capita, 
             y = prop_coverage,
             color = region)) + 
  geom_point() + 
  facet_wrap(~year) +
  labs(x = "spending per capita",
       y = "coverage proportion") +
  geom_smooth(method = "lm", col = "red") + 
  geom_text(aes(label=abb), 
            nudge_x = 150, 
            show.legend = FALSE)
```

We see that the overall trend holds, but there has been some movement. For example, we see at a glance that DC has a higher proportion of its population covered in 2014 relative to 2013, while MA saw a drop in coverage. UT appears to be an outlier in both years having low spending but a high proportion of individuals covered.

Beyond "Employer"-held health care coverage, let's look at the other types of coverage data we have. Here, we'll facet by type, rather than year, again focusing on just data from 2013.

```{r warning = FALSE}
# visualize 2013 data by type
hc %>%
  filter(year == "2013") %>% 
  ggplot(aes(x = spending_capita, 
             y = prop_coverage,
             color = region)) + 
  geom_point() + 
  facet_wrap(~type) +
  labs(x = "spending per capita",
       y = "coverage proportion") +
  geom_smooth(method = "lm", col = "red") + 
  geom_text(aes(label=abb), 
            nudge_x = 150, 
            show.legend = FALSE)
```

From these data, we see that Employer health care coverage is the most popular way in which individuals receive their health insurance across all states. We also see a flat or positive relationship for all other types of insurance, except for "Uninsured". We see that the more money spent per capita the fewer individuals the state has without insurance, as one might expect.

We can quickly peek at the data from 2014 to be sure the same general patterns hold:

```{r warning = FALSE}
# visualize 2014 data by type
hc %>%
  filter(year == "2014") %>% 
  ggplot(aes(x = spending_capita, 
             y = prop_coverage,
             color = region)) + 
  geom_point() + 
  facet_wrap(~type) +
  labs(x = "spending per capita",
       y = "coverage proportion") +
  geom_smooth(method = "lm", col = "red") + 
  geom_text(aes(label=abb), 
            nudge_x = 150, 
            show.legend = FALSE)
```

The same general patterns hold in 2014 as we saw in 2013; however, the patterns are not exactly the same.

With these plots we have a pretty good understanding of the relationship between spending and coverage across this country when it comes to health care.

Let's save some of these plots for later:

```{r}
pdf(here::here("figures", "exploratory", "2013and2014_spending_and_coverage.pdf"))

hc %>%
  filter(type == "Employer") %>% 
  ggplot(aes(x = spending_capita, 
             y = prop_coverage,
             color = region)) + 
  geom_point() + 
  facet_wrap(~year) +
  labs(x = "spending per capita",
       y = "coverage proportion") +
  geom_smooth(method = "lm", col = "red") + 
  geom_text(aes(label=abb), 
            nudge_x = 150, 
            show.legend = FALSE)
dev.off()

pdf(here::here("figures", "exploratory", "2013_coverage_type.pdf"))
hc %>%
  filter(year == "2013") %>% 
  ggplot(aes(x = spending_capita, 
             y = prop_coverage,
             color = region)) + 
  geom_point() + 
  facet_wrap(~type) +
  labs(x = "spending per capita",
       y = "coverage proportion") +
  geom_smooth(method = "lm", col = "red") + 
  geom_text(aes(label=abb), 
            nudge_x = 150, 
            show.legend = FALSE)
dev.off()

pdf(here::here("figures", "exploratory", "2014_coverage_type.pdf"))
hc %>%
  filter(year == "2014") %>% 
  ggplot(aes(x = spending_capita, 
             y = prop_coverage,
             color = region)) + 
  geom_point() + 
  facet_wrap(~type) +
  labs(x = "spending per capita",
       y = "coverage proportion") +
  geom_smooth(method = "lm", col = "red") + 
  geom_text(aes(label=abb), 
            nudge_x = 150, 
            show.legend = FALSE)
dev.off()
```

#### Q2: Spending Across Geographic Regions?

To answer the question:

> Which US states spend the most and which spend the least on health care? How does the spending distribution change across geographic regions in the United States?

We'll want to visualize health care spending across regions of the US.

We saw in the previous plots that there are some regional effects when it comes to spending, as the states from the different regions tended to cluster in the previous plot. But, let's look at this explicitly now.

We're looking to visualize a continuous variable (`spending_capita`) by a categorical variable (`region`), so a boxplot is always a good place to start:

```{r}
# generate boxplot
hc %>% 
  ggplot(aes(x = region, 
             y = spending_capita)) + 
  geom_boxplot() +
  labs(y = "spending per capita")
```

Here, we get a sense of the overall trend, seeing that states in the Northeast tend to spend the most on health care, while states in the West spend the least.

But, sometimes, it can be helpful to see the distribution of points along with the boxplot, so we can add those onto the plot. We'll use `geom_jitter()` rather than `geom_point()` here so that we can see all the points without overlap. Note that we also use `alpha = 0.2` to increase the transparency of the points and `outlier.shape = NA` to hide the outliers on the boxplot. This way, each observation is only plotted a single time:

```{r}
# add data points to boxplot
hc %>% 
  filter(type == "Employer") %>%
  ggplot(aes(x = region, 
             y = spending_capita)) + 
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(alpha = 0.2) +
  labs(y = "spending per capita")
```

This gives us a sense of the variation in spending for states in each region. Of note, there are more outliers in the South and West, with a few states spending more on health care than even states in the Northeast, where spending tends to be higher.

With this we have a good sense of the regional effects of spending across the United States.

#### Q3: Coverage and Spending Change Over Time?

To answer the question:

> Does the relationship between health care coverage and health care spending in the United States change from 2013 to 2014?

we'll need to visualize a whole bunch of variables: coverage, spending, `year` and type of insurance. We can return to a scatterplot again, but now we'll put all these pieces we looked at separately in Q1 together at once to answer this temporal question visually. 

We won't filter the dataset but will instead use `facet_grid()`, which "forms a matrix of panels defined by row and column faceting variables."

`facet_wrap()`, which we used previously, *typically* utilizes screen space better than `facet_grid()`; however, in this case, where we want 2013 and 2014 to each be in a separate row, `facet_grid()` is the better visual choice:

```{r warning = FALSE}
# facet by year (rows) and type (columns)
hc %>% 
  ggplot(aes(x = spending_capita, 
             y = prop_coverage,
             color = region)) + 
  geom_point() + 
  labs(x = "spending per capita",
       y = "coverage proportion") +
  geom_smooth(method = "lm", col = "red") + 
  facet_grid(year~type)
```

With this output, the top row shows the data from 2013 and the bottom shows the data from 2014. We can then visually compare the top plot to the bottom plot for each type of insurance. 

Visually, we can start to get a sense that a few things changed from 2013 to 2014. For example, as we saw previously, individual states changed from one year to the next, but overall patterns seem to hold pretty steady between these two years.

We were able to start answering all our questions of interest with a number of quick plots using `ggplot2`. 

Now, of course there are *many* ways in which we could customize each of these plots to be "publication-ready," but for now, we'll stop with this case study, having gained a lot of new insight into these data, and initial answers to our questions, simply from generating a few exploratory plots.

### Case Study #2: Firearms

For our second case study, we're interested in the following question: 

>At the state-level, what is the relationship between firearm legislation strength and annual rate of fatal police shootings?

In the previous course, we wrangled the data into a single, helpful dataframe: `firearms`.

```{r}
# see firearms data
firearms
```

This dataset contains state level information about firearm ownership (broken down by ethnicity and gender), the population of each state (`total_pop`), the number of violent crimes (`violent_crime`), the “total state points” from the Brady Scorecard (`brady_scores`), the number of gunshots (`gunshot_tally`), the number of gunshots from armed, non-white, male individuals (`gunshot_filtered`), the annualized rate of fatal police shootings per 1,000,000 residents (`gunshot_rate`), the `unemployment_rate` and `unemployment_rank`, population density (`density`), and firearm ownership as a percent of firearm suicides to all suicides (`ownership`).

#### Exploratory Data Analysis (EDA)

Similar to how we approached the health care case study, let's get an overall understanding of the data in our dataset, starting with understanding where data are missing in our dataset:

```{r}
# visualize missingness
vis_dat(firearms)
```

We see that we have data for all 50 states (and Washington, D.C.) for most variables in our dataset; however, we're missing information for one state when it comes to gunshot information and another state when it comes to Brady scores and ownership by suicide rates. 

Additionally, we see that most of our variables are numeric (either integers or numeric) while the state name (`NAME`) is a character, confirming what we expect for these data.

We will also again use `skim()` to get a better sense of the values in our dataset:

```{r}
# summarize data
skim(firearms)
```

Ultimately, we're interested in firearm legislation and fatal police shootings, which we'll get to,  but let's explore the relationship between other variables in our dataset first to gain a more complete understanding.

For example, determining if there's a relationship between the number of violent crimes and the total population in a state is good to know. Outliers in this relationship would be important to note as they would be states with either lower or higher crime by population. Let's use a scatterplot to better understand this relationship:

```{r}
# violent crimes
ggplot(firearms, 
       aes(x = total_pop,
           y = violent_crime)) + 
  geom_point() +
  labs(x = "Total Population",
       y = "Violent Crime") + 
  theme_classic() + 
  ggrepel::geom_text_repel(aes(label = NAME))
```

Here we see that the visualization of this relationship is largely dependent upon the state's population, with large states like Texas and California sticking out. However, we do see that Washington, D.C. had many more violent crimes than its small population would suggest. 

It would also be helpful to understand the relationship between unemployment and violent crime:

```{r}
# unemployment and violent crimes
ggplot(firearms, 
       aes(x = unemployment_rate,
           y = violent_crime)) + 
  geom_point() +
  labs(x = "Unemployment Rate",
       y = "Violent Crime") + 
  theme_classic()
```

Here, there appears to be some relationship with states that have higher rates of unemployment having slightly more violent crimes, but violent crimes is not adjusted by population, so this is likely not *that* helpful.
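
One way to make states comparable would be to express violent crime as a rate per 100,000 residents. The sketch below uses a small made-up data frame that borrows the column names `NAME`, `total_pop`, and `violent_crime`. Its state names and counts are hypothetical, not values from `firearms`:

```{r}
# hypothetical toy data, for illustration only
toy <- data.frame(
  NAME          = c("State A", "State B"),
  total_pop     = c(1000000, 5000000),
  violent_crime = c(4000, 10000)
)

# crimes per 100,000 residents
toy$violent_crime_rate <- toy$violent_crime / toy$total_pop * 100000
toy
```

In this toy example, the larger state has more violent crimes in total but a lower rate, which is exactly the distinction the raw counts hide.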

What about the relationship between fatal police shootings and gun `ownership` as a percent of firearm suicides to all suicides?

```{r}
# fatal police shootings and gun ownership
ggplot(firearms, 
       aes(x = gunshot_rate,
           y = ownership)) + 
  geom_point() +
  labs(x = "Annualized Rate of Fatal \n Police Shootings per 1,000,000",
       y = "Gun Ownership \n (firearm suicides:all suicides)") + 
  theme_classic()
```

This suggests that states with more fatal police shootings *tend* to have more firearm suicides relative to non-firearm suicides; however, this relationship is nonlinear. 

With these plots, we're starting to get an understanding of the data and see that there are patterns and don't appear to be wild outliers in any one variable (although, we should keep an eye on Washington, D.C. as it appeared as an outlier in a few plots). With that we're confident we can move on to start looking into our question of interest.

#### Q: Relationship between Fatal Police Shootings and Legislation?

Ultimately, we're interested in firearm legislation and fatal police shootings, so let's focus in on Brady scores (`brady_scores`), which measure legislation strength, and `gunshot_rate`, a measure of the rate of fatal police shootings.

We see that the average of `brady_scores` (`mean`) is `r round(mean(firearms$brady_scores, na.rm = TRUE), 2)`, with state values ranging from `r round(min(firearms$brady_scores, na.rm = TRUE), 2)` to `r round(max(firearms$brady_scores, na.rm = TRUE), 2)`.

When it comes to the rate of police shootings (`gunshot_rate`), we see the average (`mean`) for a state is `r round(mean(firearms$gunshot_rate, na.rm = TRUE), 2)`, with state values ranging from `r round(min(firearms$gunshot_rate, na.rm = TRUE), 2)` to `r round(max(firearms$gunshot_rate, na.rm = TRUE), 2)`.

To start to understand the relationship between the two, we'll want to visualize this relationship using a scatterplot:

```{r}
# visualize legislation and shootings
ggplot(firearms, aes(x = brady_scores, 
                     y = gunshot_rate)) + 
  geom_point() +
  labs(x = "Firearm Legislative Strength Score",
       y = "Annualized Rate of Fatal \n Police Shootings per 1,000,000") + 
  theme_classic()
```

In this plot, we see that there is a relationship, but it is non-linear. Overall, the higher the legislative strength score (`brady_scores`), the lower the rate of police shootings; however, this decrease is nonlinear, as all states with a positive Brady Score have a similar police shooting rate. 

For now, we won't explore a model, but we will label the points to better contextualize the information displayed. To see which states are at either end of the distribution, using `geom_text()` is a good option:

```{r}
# label points with state name
ggplot(firearms, aes(x = brady_scores, 
                     y = gunshot_rate)) + 
  geom_point() +
  labs(x = "Firearm Legislative Strength Score",
       y = "Annualized Rate of Fatal \n Police Shootings per 1,000,000") + 
  theme_classic() +
  geom_text(aes(label = NAME), 
            nudge_x = 7)
```

This makes it clear that Wyoming, New Mexico, Oklahoma, Arizona, and Nevada have some of the highest rates of fatal police shootings, while Connecticut, New York, Pennsylvania, and North Dakota are among the lowest. 

However, a number of labels overlap there, so it can be helpful to use `ggrepel` if you want to be able to match each point to its label:

```{r}
library(ggrepel)

# repel points with state name 
ggplot(firearms, aes(x = brady_scores, 
                     y = gunshot_rate)) + 
  geom_point() +
  labs(x = "Firearm Legislative Strength Score",
       y = "Annualized Rate of Fatal \n Police Shootings per 1,000,000") + 
  theme_classic() +
  geom_text_repel(aes(label = NAME))
```

If we wanted to save this particular plot as a PDF in a directory called `exploratory` within a directory called `figures`, we could do so like this:

```{r}
pdf(here::here("figures", "exploratory", "Fatal_police_shootings_and_firearm_legislative_strength.pdf"))

ggplot(firearms, aes(x = brady_scores, 
                     y = gunshot_rate)) + 
  geom_point() +
  labs(x = "Firearm Legislative Strength Score",
       y = "Annualized Rate of Fatal \n Police Shootings per 1,000,000") + 
  theme_classic() +
  geom_text_repel(aes(label = NAME))

dev.off()
```
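
An alternative to opening and closing a graphics device by hand is `ggsave()` from `ggplot2`, which infers the file type from the file extension. A minimal sketch, assuming `mtcars` as a stand-in for `firearms` and a temporary file in place of the `figures/exploratory` directory:

```{r}
library(ggplot2)

# stand-in plot; in the case study this would be the firearms scatterplot
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_classic()

# temporary path used only for this illustration
out_file <- tempfile(fileext = ".pdf")

# write the plot to disk; width and height are in inches by default
ggsave(filename = out_file, plot = p, width = 7, height = 5)

file.exists(out_file)
```

Passing the plot object explicitly through the `plot` argument avoids relying on whichever plot happened to be displayed last.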