Introduction to Graphics in R

Why not Use Base R for Data Visualization?

Base graphics uses a pen-on-paper model: you can only draw on top of the plot; you cannot modify or delete existing content. There is no (user-accessible) representation of the graphics apart from their appearance on the screen. Base graphics includes tools for drawing both primitives and entire plots. Base graphics functions are generally fast but have limited scope.

How does GGplot improve upon this?

ggplot2 is an R package for producing statistical, or data, graphics, but it is unlike most other graphics packages because it has a deep underlying grammar. This grammar, based on the Grammar of Graphics (Wilkinson 2005), is made up of a set of independent components that can be composed in many different ways.

Since it is designed to work iteratively, it allows you to start with a layer showing the raw data then add layers of annotations and statistical summaries. It allows you to produce graphics using a structured pattern.

Installing GGplot

install.packages("tidyverse") # installs ggplot2 along with the rest of the tidyverse
install.packages("ggplot2")   # or install ggplot2 on its own

Load Tidyverse or GGplot

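library(tidyverse) # attaches ggplot2, dplyr and the %>% pipe used below
library(ggplot2)   # or attach ggplot2 alone; the %>% examples then also need dplyr or magrittr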

Base R vs GGplot

Base R

with(mtcars, plot(mpg, disp))

GGplot

mtcars %>% ggplot(aes(mpg, disp)) +
  geom_point()

What is the Grammar of Graphics?

The grammar of graphics attempts to answer a simple question: what is a statistical graphic?

The grammar tells us that a statistical graphic is built from the following components:

  • Data - The dataset you plan on utilizing.
  • Layers - Geometric objects (geoms) that represent the data, such as points, lines or bars.
  • Scales - These map values in the data to aesthetic attributes such as colour, size or shape.
  • Coordinate System - Describes how data coordinates are mapped to the plane of the graphic.
  • Faceting - Also known as latticing or trellising. Allows you to break the dataset into subsets and display each subset in its own panel.
  • Theme - Allows you to control the finer points of display.

It is the combination of these independent components that makes up a graphic. Learning the grammar will not only help you create the graphics you already know about, but will also help you think about new graphics that would be even better.
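
As a rough sketch of how these components appear in code, the plot below names each one in a comment. The choice of variables, palette and theme is purely illustrative; nothing in the grammar prescribes them.

```r
library(ggplot2)

# One plot with every grammar component spelled out.
p <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +  # data and aesthetic mappings
  geom_point() +                           # layer: points as the geometric object
  scale_colour_brewer(palette = "Set1") +  # scale: how drv values become colours
  coord_cartesian() +                      # coordinate system
  facet_wrap(~year) +                      # faceting: one panel per model year
  theme_minimal()                          # theme: non-data appearance
```

Printing p draws the plot. Removing any line after geom_point() still gives a valid plot, because ggplot2 falls back to a default for that component.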

Basic introduction to GGplot

The goal of this tutorial is to teach you how to produce useful graphics with ggplot2 as quickly as possible.

Dataset Utilized

mpg Dataset in R
manufacturer  model  displ  year  cyl  trans       drv  cty  hwy  fl  class
audi          a4     1.8    1999  4    auto(l5)    f    18   29   p   compact
audi          a4     1.8    1999  4    manual(m5)  f    21   29   p   compact
audi          a4     2.0    2008  4    manual(m6)  f    20   31   p   compact
audi          a4     2.0    2008  4    auto(av)    f    21   30   p   compact
audi          a4     2.8    1999  6    auto(l5)    f    16   26   p   compact
audi          a4     2.8    1999  6    manual(m5)  f    18   26   p   compact

There are 6 character variables in the dataset.

There are 5 numeric variables in the dataset.

GGPlot Key Components

  1. Data
  2. Aesthetic mappings
  3. Geometric representations

Basic GGplot format

mpg %>% ggplot(aes(x = displ, y = hwy)) + # Data and aesthetics
  geom_point() # Geometric Representations for points

Alternative format

mpg %>% ggplot() +
  geom_point(aes(x = displ, y = hwy))

Aesthetic attributes for points

mpg %>% ggplot(aes(x = displ, y = hwy)) + 
  geom_point(size = 3, shape = 21, col = 'red', alpha = 0.6,
             fill = 'orange', stroke = 1.9)

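  • size sets the point size.
  • shape sets the symbol used for each data point (shape 21 is a circle with a separate outline and interior).
  • col (colour) sets the outline colour for shape 21; for most other shapes it colours the whole point.
  • fill sets the interior colour; it only affects shapes 21 to 25.
  • alpha sets the transparency of the points (0 is fully transparent, 1 is opaque).
  • stroke sets the width of the outer ring of each data point.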

Map Variables to Aesthetics

ggplot2 allows you to map variables to aesthetics. For example, we can map colour to class (class takes the values compact, midsize, suv, 2seater, minivan, pickup and subcompact).

mpg %>% ggplot(aes(x = displ, y = hwy, color = class))+
  geom_point()

Remember that you can specify the aesthetic mapping in geom_point() instead of in the global aesthetics for the plot. You can also map other aesthetics in the same plot; here I chose not to add anything else.
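
Where the mapping is placed starts to matter once a plot has more than one layer. The sketch below is an illustration under my own choice of layers (a linear smoother added to the points); the object names p_global and p_local are hypothetical.

```r
library(ggplot2)

# Colour mapped globally: every layer inherits it, so geom_smooth()
# fits one line per class.
p_global <- ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Colour mapped only in geom_point(): the smoother sees no grouping
# and fits a single line through all the points.
p_local <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth(method = "lm", se = FALSE)
```

Both versions colour the points identically; only the layers that inherit the mapping differ.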

Multiple Aesthetic Attributes

mpg %>% ggplot(aes(x = displ, y = hwy, color = class, shape = class, size = class))+
  geom_point()

What if I specify a color inside the aesthetics?

mpg %>% ggplot(aes(x = displ, y = hwy)) +
  geom_point(col = 'blue')

mpg %>% ggplot(aes(x = displ, y = hwy, col = 'blue')) +
  geom_point()

In the first plot, the points are coloured blue. In the second plot, however, ggplot2 treats “blue” as a data value rather than a colour: it maps that single value to the first colour of the default discrete palette (a pinkish red) and adds a legend.
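
One way to check the difference between setting and mapping a colour is to inspect the values ggplot2 computes for each layer. The sketch below uses layer_data() from ggplot2 for that; the object names p_set and p_map are illustrative.

```r
library(ggplot2)

p_set <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(colour = "blue")   # set: a fixed colour, no scale, no legend

p_map <- ggplot(mpg, aes(displ, hwy, colour = "blue")) +
  geom_point()                  # map: "blue" is a one-value variable

# layer_data() returns the values ggplot2 computed for a layer.
unique(layer_data(p_set)$colour)  # the literal colour that was set
unique(layer_data(p_map)$colour)  # a colour picked by the default discrete scale
```

As a rule of thumb, values that come from the data go inside aes(), and fixed appearance settings go outside it.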

Facetting

ggplot(mpg, aes(displ, hwy)) +
  geom_point(size = 4, shape = 21, color = 'red', fill = 'orange') +
  facet_wrap(~class)

Facetting splits the data into subsets and displays each subset in its own panel, which lets you bring additional categorical variables into the plot. Essentially you’re creating a table of graphics for comparison. You’ll eventually learn about the 2 types of facetting in ggplot2: facet_wrap and facet_grid.

Other commonly used Geoms

Smoother - We use a smoother to help identify a general trend in the data (most commonly used with 2 numeric variables).

ggplot(mpg, aes(displ, hwy)) +
  geom_smooth(span = 0.8, method = 'loess') # alternatively method = 'lm'; add se = FALSE to hide the confidence band

Boxplot and Violin Plots - Boxplots and violin plots are typically used when you have 1 numeric and 1 categorical variable. But they can also be used with 2 numeric variables if you cut 1 numeric variable up into suitable intervals.

ggplot(mpg, aes(class, cty)) + geom_boxplot(fill = 'red',col = 'black', notch = T)

ggplot(mpg, aes(class, cty)) + geom_violin(fill = 'red',col = 'black')

ggplot(mpg, aes(class, cty)) + geom_jitter(width = 0.2, col = 'red', shape = 21, fill = 'orange', size = 2)

Bar Plot - We use bar plots to show frequencies for a categorical variable, but they can also display a precomputed value for each category, for example the average height for each gender.

ggplot(mpg, aes(class)) + geom_bar(col = 'black', fill = 'khaki')
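
The second use, bars whose heights are values rather than counts, can be sketched with the same dataset. This example is my own illustration: it uses the average highway mileage per class in place of the gender and height example, and the names avg_hwy and p_avg are hypothetical.

```r
library(ggplot2)

# Average highway mileage per class, computed with base R's aggregate().
avg_hwy <- aggregate(hwy ~ class, data = mpg, FUN = mean)

# geom_col() draws bars whose heights are the values already in the data,
# whereas geom_bar() counts rows by default.
p_avg <- ggplot(avg_hwy, aes(class, hwy)) +
  geom_col(col = 'black', fill = 'khaki')
```

geom_col() is shorthand for geom_bar(stat = "identity"), which appears later in this tutorial.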

Histogram, Frequency Polygon and Density Plots - All three provide us with information about the distribution of a single numeric variable.

ggplot(mpg, aes(cty)) + geom_histogram(col = 'black', binwidth = 4, fill = 'khaki')

ggplot(mpg, aes(cty)) + geom_freqpoly(col = 'black', binwidth = 4)

ggplot(mpg, aes(cty)) + geom_density(col = 'black',
fill = 'khaki', alpha = 0.6)

Line Plot - Line Plots are perhaps the most effective when dealing with time series data or understanding how a variable has changed over a period of time.

ggplot(economics, aes(date, unemploy)) + 
  geom_line()

ggplot(economics, aes(date, unemploy)) + 
  geom_step()

An alternative to points for large datasets is a 2d binned representation - binning can be very effective when dealing with large amounts of data, where a scatterplot would appear cluttered.

ggplot(diamonds, aes(carat, log(price))) + geom_bin_2d()

diamonds %>% 
  ggplot(aes(carat, log(price))) +
  geom_hex() +
  scale_fill_distiller(palette = 3, direction = -1) +
  jtools::theme_apa()

Some more geoms

df <- data.frame(
  x = c(3, 1, 5),
  y = c(2, 4, 6),
  label = c("a","b","c")
)
p <- ggplot(df, aes(x, y, label = label)) +
  labs(x = NULL, y = NULL)

knitr::kable(df)
x y label
3 2 a
1 4 b
5 6 c

I created a custom dataset to showcase additional geoms and how they work.

p + geom_point() + ggtitle("point")
p + geom_text() + ggtitle("text")
p + geom_bar(stat = "identity") + ggtitle("bar")
p + geom_tile() + ggtitle("tile")
p + geom_line() + ggtitle("line")
p + geom_area() + ggtitle("area")
p + geom_path() + ggtitle("path")
p + geom_polygon() + ggtitle("polygon")

Brief Introduction to themes

# install.packages("ggthemes")
library(ggthemes)
# install.packages("jtools")
library(jtools)

ggplot(mpg, aes(displ, hwy)) +
  geom_point(size = 4, shape = 21, color = 'red', fill = 'orange') + theme_bw()

ggplot(mpg, aes(displ, hwy)) +
  geom_point(size = 4, shape = 21, color = 'red', fill = 'orange') + theme_void()

ggplot(mpg, aes(displ, hwy)) +
  geom_point(size = 4, shape = 21, color = 'red', fill = 'orange') + theme_minimal()

ggplot(mpg, aes(displ, hwy)) +
  geom_point(size = 4, shape = 21, color = 'red', fill = 'orange') + ggthemes::theme_solarized()

ggplot(mpg, aes(displ, hwy)) +
  geom_point(size = 4, shape = 21, color = 'red', fill = 'orange') + ggthemes::theme_par()

ggplot(mpg, aes(displ, hwy)) +
  geom_point(size = 4, shape = 21, color = 'red', fill = 'orange') + jtools::theme_apa()

Combining Geoms and labelling

Combining Geoms

Earlier I displayed individual geoms, but a key strength of ggplot2 is that you can just as easily combine them to produce novel graphics or to make the point you are trying to get across.

  • What type of geoms can we combine?

Combining similar geoms can help illustrate the point further and make the graph more aesthetically pleasing. The most common example is combining a jitter plot with a box plot.

ggplot(mpg, aes(class, cty)) + geom_boxplot(fill = 'red',col = 'black', notch = T) + geom_jitter(width = 0.2)

This is just one example; you can combine a number of geoms. More examples are shown below.

Smoother with points

ggplot(mpg, aes(displ, hwy)) +
geom_point(col = 'red', shape = 21, fill ='orange', size = 3) +
geom_smooth(span = 0.8, method = 'lm', col = 'grey20')

Histogram with Frequency Polygon

ggplot(mpg, aes(cty)) +
geom_histogram(col = 'black', binwidth = 4, fill = 'khaki') +
geom_freqpoly(binwidth = 4, col = 'black')

You can combine any number of geoms to produce the plots you want. Other supported geoms include geom_text, geom_label, geom_vline, geom_hline and quite a few more.

Labelling the Plots

ggplot(mpg, aes(displ, hwy)) +
geom_point(col = 'red', shape = 21, fill ='orange', size = 3) +
geom_smooth(span = 0.8, method = 'lm', col = 'grey20')+
xlab("Engine Displacement") +
ylab("Highway miles per gallon")+
ggtitle("Does Engine Displacement reduce highway efficiency?")

Alternative Labelling

ggplot(mpg, aes(displ, hwy)) +
geom_point(col = 'red', shape = 21, fill ='orange', size = 3) +
geom_smooth(span = 0.8, method = 'lm', col = 'grey20')+
labs(x = "Engine Displacement",
     y = "Highway miles per gallon",
     title = "Does Engine Displacement reduce highway efficiency?",
     caption = "Created by - Arjun",
     subtitle = "Dataset = mpg")

Font Family

Before we explore labelling plots further, it’s important to know which fonts are available in R. R itself provides three font families: sans, serif and mono. I would personally recommend installing the extrafont package, as it gives you access to a variety of other fonts.

df <- data.frame(x = c(rep(1,6), rep(2,6)), y = c(6:1, 6:1), family = c("sans", "serif", "mono", "Times New Roman", "Georgia", "Arial Rounded MT Bold", "Verdana", "Luminari", "Arial", "Andale Mono", "Brush Script MT", "Tahoma"))
ggplot(df, aes(x, y)) +
  geom_text(aes(label = family, family = family)) +
  xlim(c(0,4))

sans, serif and mono are available in base R. The other families are only rendered if the fonts are installed on your system and registered with R, for example through the extrafont package.

Font Faces

df <- data.frame(x = 1, y = 4:1, face = c("plain", "bold", "italic", "bold.italic"))
ggplot(df, aes(x, y)) +
  geom_text(aes(label = face, fontface = face), size = 5, angle = 15)

Adding text to the plots

The simplest way to add text to the plot is to use the text geom, geom_text().

However, let’s install the ggrepel package. Its geom_text_repel() works similarly to geom_text() but moves labels apart to remove most overlaps.

install.packages("ggrepel")
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_text(aes(label = model), hjust = "inward") +
  xlim(1, 8)

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  ggrepel::geom_text_repel(aes(label = model)) +
  xlim(1, 8)
## Warning: ggrepel: 197 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Adding the legend to the data

# install.packages("directlabels")

ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point()

ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point(show.legend = FALSE) +
  directlabels::geom_dl(aes(label = class), method = "smart.grid")

Adding text to the data

ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point() +
  annotate(geom = 'text', x = 5, y = 40, label = "We can clearly see a \n negative trend between engine\n displacement and highway fuel efficiency", hjust = 0.3)

It’s often better to use annotate() rather than geom_text() when you are adding a free-standing note to the plot instead of labelling individual points.

Grouping Variables and Data Uncertainty

Grouping variables in R

# install.packages("nlme")
library(nlme)
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
ggplot(Oxboys, aes(age, height)) +
  geom_point() +
  geom_line()

The Oxboys dataset contains grouped longitudinal data on 26 students. If we try to draw a line plot, it’s quite messy, because a single line is drawn through every observation. We can fix this by adding a group argument to the aesthetics. Here age refers to the standardized age and height is in centimeters.

ggplot(Oxboys, aes(age, height, group = Subject)) +
  geom_point() +
  geom_line()

Coloring the lines can be quite informative in such a situation, as it helps to distinguish the individuals.

ggplot(Oxboys, aes(age, height)) +
  geom_point() +
  geom_line(aes(group = Subject, color = Subject))+
  scale_color_hue()

There are still quite a few issues with the graph. However, those will be discussed later. Now let’s add a general trend line to this graph and remove the points.

ggplot(Oxboys, aes(age, height)) +
  geom_line(aes(group = Subject), alpha = 0.4) +
  geom_smooth(method = "lm", size = 2, se = FALSE, col = 'red')
## `geom_smooth()` using formula 'y ~ x'

Data Uncertainty using error bars

y <- c(18, 11, 16)
df <- data.frame(x = 1:3, y = y, se = c(1.2, 0.5, 3.0))
base <- ggplot(df, aes(x, y, ymin = y - se, ymax = y + se))
base + geom_crossbar()
base + geom_pointrange()
base + geom_smooth(stat = "identity")
base + geom_errorbar()
base + geom_linerange()
base + geom_ribbon()
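
The standard errors above are typed in by hand. As a hedged sketch of how the same interval geoms might be fed from real data, the example below computes a mean and standard error of hwy for each drive type with base R. Using the standard error as the measure of uncertainty is my assumption, and the names hwy_stats and p_err are hypothetical.

```r
library(ggplot2)

# Mean and standard error of highway mileage for each drive type.
# tapply() returns groups in sorted order, matching sort(unique(...)).
hwy_stats <- data.frame(
  drv  = sort(unique(mpg$drv)),
  mean = as.numeric(tapply(mpg$hwy, mpg$drv, mean)),
  se   = as.numeric(tapply(mpg$hwy, mpg$drv, function(v) sd(v) / sqrt(length(v))))
)

# ymin and ymax define the interval drawn around each point.
p_err <- ggplot(hwy_stats, aes(drv, mean, ymin = mean - se, ymax = mean + se)) +
  geom_pointrange()
```

Swapping geom_pointrange() for geom_errorbar() or geom_crossbar() reuses the same ymin and ymax mappings.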

Importance of Factoring the Data

Factoring continuous or discrete variables

Sometimes in order to get the desired results we have to change the class of the variable.

class(mpg$cyl)
## [1] "integer"
ggplot(mpg, aes(displ, hwy, color = cyl)) +
  geom_point()

For numeric variables, ggplot2 maps colour on a continuous scale instead of treating each distinct value as its own category. To correct this, we can simply convert the variable in question to a factor.

ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point()

ggplot2 builds the legend from the scale and the data rather than having you draw it by hand. The key to producing a good legend is to make sure the data is in the correct format.

ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point() +
  labs(color = 'Number of \nCylinders')

Factors also let you make sure the categories appear in the correct order, and they let you cut continuous variables into segments and treat them as categorical variables.

ggplot(mpg, aes(trans, displ)) +
  geom_histogram(stat = 'identity', fill = 'violet')
## Warning: Ignoring unknown parameters: binwidth, bins, pad

# Factor trans by frequency to put it in the correct order

ggplot(mpg, aes(fct_infreq(trans), displ)) +
  geom_histogram(stat = 'identity', fill = 'violet')
## Warning: Ignoring unknown parameters: binwidth, bins, pad

# Producing boxplots with 2 continuous variables

ggplot(mpg, aes(displ, hwy)) +
  geom_boxplot()
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

# Cutting up a numeric variable to use it as a categorical variable

ggplot(mpg, aes(cut_width(displ, 1), hwy)) +
  geom_boxplot(col = 'red', fill = 'violet')

For more detail on working with factors, I’d refer you to the forcats package documentation.

Statistical Transformations and Positional adjustments

Stats

A stat is a statistical transformation: it computes some useful statistical summary of the data, and is key to histograms and smoothers. Each layer takes the name of the stat to use. To keep the data as is, use the “identity” stat.

You’ll rarely call these functions directly, but they are useful to know about because their documentation often provides more detail about the corresponding statistical transformation.
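
To make the link between a geom and its stat concrete, here is a small sketch comparing geom_bar() with the stat it uses by default, stat_count(). It inspects the computed values with layer_data(); the object names are illustrative.

```r
library(ggplot2)

# geom_bar() uses stat = "count" by default, so these two layers
# compute the same thing.
p_geom <- ggplot(mpg, aes(class)) + geom_bar()
p_stat <- ggplot(mpg, aes(class)) + stat_count()

# Both match a plain frequency table of the variable.
layer_data(p_geom)$count
layer_data(p_stat)$count
table(mpg$class)
```

The help page for stat_count() is therefore the place to look for details on how geom_bar() computes its bar heights.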

ggplot(mpg, aes(trans, cty)) +
  geom_point() +
  # fun replaced the deprecated fun.y argument in ggplot2 3.3.0
  geom_point(stat = "summary", fun = "mean", colour = "red", size = 4)

When dealing with distributions stats can be quite useful.

ggplot(mpg, aes(hwy)) +
geom_histogram(binwidth = 5, col = 'red', fill = 'violet')

ggplot(mpg, aes(hwy)) +
geom_histogram(stat = 'density',binwidth = 5, col = 'red', fill = 'violet')
## Warning: Ignoring unknown parameters: binwidth, bins, pad
# If you want to preserve the histogram with density on the y axis 

ggplot(mpg, aes(hwy)) +
geom_histogram(aes(y = ..density..),binwidth = 5, col = 'red', fill = 'violet')

# If we have 2 variables and want the bar height to show a value instead of a count.

ggplot(mpg, aes(manufacturer, hwy)) +
geom_histogram(stat = 'identity', col = 'violet', fill = 'violet')
## Warning: Ignoring unknown parameters: binwidth, bins, pad

Positions

Position adjustments apply minor tweaks to the position of elements within a layer. Three adjustments apply primarily to bars:

  • position_stack(): stack overlapping bars (or areas) on top of each other.
  • position_fill(): stack overlapping bars, scaling so the top is always at 1.
  • position_dodge(): place overlapping bars (or boxplots) side-by-side.

dplot <- ggplot(mpg, aes(class, fill = factor(cyl))) +
  xlab(NULL) + ylab(NULL)

dplot + geom_bar(alpha = 2/3)

dplot + geom_bar(position = "fill", alpha = 2/3)

dplot + geom_bar(position = "dodge",alpha = 2/3)

There are three position adjustments that are primarily useful for points:

  • position_nudge(): move points by a fixed offset.
  • position_jitter(): add a little random noise to every position.
  • position_jitterdodge(): dodge points within groups, then add a little random noise.

ggplot(mpg, aes(cyl, hwy)) +
geom_point(position = "jitter")

ggplot(mpg, aes(cyl, hwy)) +
geom_point(position = position_jitter(width = 0.15, height = 0.5))

ggplot(mpg, aes(cyl, hwy)) +
geom_jitter()
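
The jitter examples cover random noise. A fixed offset with position_nudge() can be sketched as follows; the small data frame pts and the 0.2 offset are made up for illustration.

```r
library(ggplot2)

pts <- data.frame(x = c(1, 2, 3), y = c(2, 4, 3), label = c("a", "b", "c"))

# Points stay where they are; labels are shifted up by a fixed 0.2 units
# so that they do not sit on top of the points.
p_nudge <- ggplot(pts, aes(x, y)) +
  geom_point() +
  geom_text(aes(label = label), position = position_nudge(y = 0.2))
```

Unlike jitter, a nudge is deterministic, so the plot looks the same every time it is drawn.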

Scales

Introduction to Scales

There are 5 main arguments for any scale:

  1. Trans (which stands for a transformation)
  2. Name (Labeling the axis)
  3. Breaks (Adding Breaks to the scale)
  4. Labels (Labeling the breaks)
  5. Limits (Choosing the limits for the scale)

# Scales
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class))

# This is the same as the code below
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_colour_discrete()

The two plots are exactly the same. ggplot2 adds a default scale for every aesthetic you use, because it would be tedious to add one manually each time. If you want to override the defaults, you’ll need to add the scale yourself.

You can label the plots using the defined scales. To label the legend you have to assign a label to the aesthetic mapping you are scaling by.

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  scale_x_continuous(name = "A really awesome x axis label") +
  scale_y_continuous(name = "An amazingly great y axis label")+
  scale_colour_discrete(name = "Legend")

# Breaks
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) + 
  scale_y_continuous(breaks = seq(10,50, by = 5))

# Labeling
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) + 
  scale_y_continuous(breaks = seq(10,50, by = 5), labels = paste0(seq(10,50, by = 5), "m/g")) +
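  # a labelling function keeps each label matched to its own class;
  # unique(mpg$class) is in order of appearance, not the sorted order the legend uses
  scale_color_discrete(labels = function(x) paste0(x, "Mine"))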

# Limits
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) + 
  scale_y_continuous(breaks = seq(10,50, by = 5), labels = paste0(seq(10,50, by = 5), "m/g"), limits = c(10,50))

ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous("Label 1") +
scale_x_continuous("Label 2")
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

This replaces the first scale with the second one.

The use of + to “add” scales to a plot is a little misleading. When you + a scale, you’re not actually adding it to the plot, but overriding the existing scale.

Scaling Families

  1. Continuous position scales used to map integer, numeric, and date/time data to x and y position.

Every plot has two position scales, x and y. The most common continuous position scales are scale_x_continuous() and scale_y_continuous(), which linearly map data to the x and y axis.

Every continuous scale takes a trans argument, allowing the use of a variety of transformations:

ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_continuous(trans = "reciprocal")

ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_continuous(trans = "reverse")

ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_reverse()

ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_log10()

ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_sqrt()

ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_continuous(trans = 'log2')

The transformation is carried out by a “transformer”, which describes the transformation, its inverse, and how to draw the labels.

There are shortcuts for the most common: scale_x_log10(), scale_x_sqrt() and scale_x_reverse().

In either case, the transformation occurs before any statistical summaries are computed. To transform after the statistical computation, use coord_trans().

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  coord_trans(x = 'log10' ,y = 'log10')

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  coord_trans(x = 'reverse' ,y = 'reverse')
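
The difference in timing between a scale transformation and coord_trans() can be seen in the data each layer ends up with. The sketch below is an illustration only, using layer_data() to look at the x values after the plot is built; the object names are hypothetical.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(displ, hwy)) + geom_point()

p_scale <- base_plot + scale_x_log10()           # data are transformed first
p_coord <- base_plot + coord_trans(x = "log10")  # data are left alone until drawing

# With the scale, the layer already holds log10(displ) ...
range(layer_data(p_scale)$x)
# ... with the coordinate system, it still holds the raw displ values.
range(layer_data(p_coord)$x)
```

This is why a smoother added to p_scale would be fitted to the logged values, while one added to p_coord would be fitted to the raw values and only then bent by the coordinate system.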

Scale labeling tricks

ggplot(diamonds, aes(carat, price)) +
  geom_hex() +
  scale_x_log10() +
  scale_fill_distiller(palette = 3) +
  scale_y_continuous(labels = scales::dollar_format())

ggplot(diamonds, aes(carat, price)) +
  geom_hex() +
  scale_x_log10() +
  scale_fill_distiller(palette = 3) +
  scale_y_continuous(labels = scales::comma_format())

ggplot(diamonds, aes(carat, price)) +
  geom_hex() +
  scale_x_log10() +
  scale_fill_distiller(palette = 3) +
  scale_y_continuous(labels = scales::unit_format(suffix = "K", scale = 1/1000))

  2. Colour scales, used to map continuous and discrete data to colours.

Continuous colors

erupt <- ggplot(faithfuld, aes(waiting, eruptions, fill = density)) +
geom_raster() +
scale_x_continuous(NULL, expand = c(0, 0)) +
scale_y_continuous(NULL, expand = c(0, 0))

mpgplot <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(col = cyl)) 

If you wish to choose prebuilt palettes, you can use scale_color_distiller() or scale_fill_distiller()

erupt + scale_fill_distiller()
erupt + scale_fill_distiller(palette = "RdPu")
erupt + scale_fill_distiller(palette = "YlOrBr")

mpgplot + scale_color_distiller()
mpgplot + scale_color_distiller(palette = 2)
mpgplot + scale_color_distiller(palette = 3)

scale_colour_gradient() and scale_fill_gradient() are perhaps the most effective for scaling continuous colors. However, they are limited to 2 colors (low and high).

scale_colour_gradient2() and scale_fill_gradient2(): a three-colour gradient, low-mid-high (red-white-blue).

scale_colour_gradientn() and scale_fill_gradientn(): a custom n-colour gradient.

erupt + scale_fill_gradient(low = "white", high = "black")

erupt + scale_fill_gradient2(midpoint = 0.02)
erupt + scale_fill_gradient2(low = 'green', mid = 'yellow', high = 'orange', midpoint = 0.02)

erupt + scale_fill_gradientn(colours = terrain.colors(7))
erupt + scale_fill_gradientn(colours = colorspace::heat_hcl(7))
erupt + scale_fill_gradientn(colours = colorspace::diverge_hcl(7))

mpgplot + scale_color_gradient(low = 'red', high = 'blue')
mpgplot + scale_color_gradient2(low = 'red', mid = 'green' ,high = 'blue', midpoint = 6)

Discrete Colors

There are two colors scales I’ll recommend for discrete values and they are the scale_color_brewer() scale and scale_color_hue().

Note: Use the scale_color_*() or the scale_fill_*() version of a scale depending on whether you mapped the variable to color or to fill.

ggplot(mpg, aes(class, fill = class))+
  geom_bar()

# Default color scale
ggplot(mpg, aes(class, fill = class))+
  geom_bar() + 
  scale_fill_hue()

# c stands for chroma, h is the range of hues, and l is luminance
ggplot(mpg, aes(class, fill = class))+
  geom_bar() + 
  scale_fill_hue(c = 160, l = 50, h = c(80, 360))

ggplot(mpg, aes(class, fill = class))+
  geom_bar() + 
  scale_fill_brewer()

ggplot(mpg, aes(class, fill = class))+
  geom_bar() + 
  scale_fill_brewer(palette = 'Set1')

ggplot(mpg, aes(class, fill = class))+
  geom_bar() + 
  scale_fill_brewer(palette = 'Accent')
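
The note about color versus fill can be checked directly. In the sketch below, which is only an illustration, the variable is mapped to fill, so a colour scale leaves the bars unchanged while a fill scale recolours them. The object names are hypothetical.

```r
library(ggplot2)

bars <- ggplot(mpg, aes(class, fill = class)) + geom_bar()

# The variable is mapped to fill, so only a fill scale changes the bars.
with_fill   <- bars + scale_fill_brewer(palette = "Set1")
with_colour <- bars + scale_colour_brewer(palette = "Set1")  # no colour mapping to act on

# Compare the computed fills against the default plot.
identical(layer_data(bars)$fill, layer_data(with_colour)$fill)  # colour scale: fills untouched
identical(layer_data(bars)$fill, layer_data(with_fill)$fill)    # fill scale: fills replaced
```

If a scale you add seems to have no effect, checking which aesthetic the variable is mapped to is a good first step.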

Manual Colors

The first step is to choose a set of colors. I like using the HEX color wheel on Google for this. For points, choose bright colors; for bars, choose uniform colors that are more transparent.

colors <- c('#05fc47','#fc8d05','#ff0d00',"#2e02f2",'#f711ba', "#970af5", "#12b6fc")

ggplot(mpg, aes(class, fill = class))+
  geom_bar(alpha = 0.6) +
  scale_fill_manual(values = colors)

ggplot(mpg, aes(displ, hwy, fill = class))+
  geom_point(size = 4, shape = 21, color = 'black') +
  scale_fill_manual(values = colors)

Customize Legend colours manually

colors <- c(compact = sample(colours(), 1),
            midsize = sample(colours(), 1),
            suv = sample(colours(), 1),
            `2seater` = sample(colours(), 1),
            minivan = sample(colours(), 1),
            pickup = sample(colours(), 1),
            subcompact = sample(colours(), 1))

ggplot(mpg, aes(displ, hwy, fill = class))+
  geom_point(size = 4, shape = 21, color = 'black') +
  scale_fill_manual(values = colors)

Guides: legends and axes

ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point(alpha = 0.4, size = 3)  +
  geom_point(aes(displ, cty, fill = class), shape = 8, size = 3, show.legend = F)+
  scale_color_brewer(palette = 'Set1') +
  guides(color = guide_legend(title.position = 'top',keywidth = 0.5,direction = 'vertical',override.aes = list(alpha = 1),keyheight = 1.5))

ggplot(mpg, aes(manufacturer)) +
  geom_bar(alpha = 0.7, fill = 'violet') +
  guides(x = guide_axis(n.dodge = 2,angle = -15, position = 'top'))

ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point(alpha = 0.4) +
  scale_y_continuous(breaks = 0:50, trans = 'reverse') +
  guides(y = guide_axis(check.overlap = T,angle = -15,n.dodge = 2),
         color = guide_legend(title.position = 'top',keywidth = 0.5,direction = 'vertical',override.aes = list(alpha = 1)))

ggplot(mpg, aes(displ, hwy, fill = cyl)) +
  geom_point(shape = 21, col = 'black', size = 6) +
  scale_fill_continuous(breaks = seq(4,8,0.5), labels = scales::unit_format(prefix = 'X', suffix = 'k')) +
  guides(fill = guide_colourbar(title.position = 'top', barwidth = 1.5, reverse = T, barheight = 14, frame.colour = 'black', ticks.linewidth = 2, ticks.colour = 'black'))

Guides allow you to manipulate the legend, color bar and axes as you see fit. They allow a great deal of flexibility and can be very useful in producing the graphic you want.

Facetting

base <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  xlab(NULL) +
  ylab(NULL)
base + facet_wrap(~class, ncol = 3)

base + facet_wrap(~class, ncol = 3, as.table = FALSE)

base + facet_wrap(~cyl, nrow = 2)

base + facet_wrap(~class, nrow = 3, dir = "v")

base + facet_grid(. ~ class)

base + facet_grid(class ~ .)

base + facet_grid(drv ~ cyl)

Controlling the scales for facet grids

p <- ggplot(mpg, aes(cty, hwy)) +
  geom_smooth(method = 'lm') +
  geom_jitter(width = 0.1, height = 0.1)
p + facet_wrap(~cyl)
## `geom_smooth()` using formula 'y ~ x'

# Free scales
p + facet_wrap(~cyl, scales = "free")
## `geom_smooth()` using formula 'y ~ x'

p + facet_grid(. ~ cyl, scales = "free_x")
## `geom_smooth()` using formula 'y ~ x'

p + facet_grid(cyl ~ ., scales = "free_y")
## `geom_smooth()` using formula 'y ~ x'

Grouping Variables vs Facetting

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = drv))

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = drv)) +
  facet_wrap(~drv)

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = drv)) +
  # keeping only the numeric columns drops drv, so this layer is repeated, in grey, in every facet
  geom_point(data = mpg %>% keep(is.numeric),size = 1, col = 'grey20', alpha = 0.1) +
  facet_wrap(~drv)

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = drv), alpha = 0.3) +
  geom_point(data = mpg %>%
               group_by(drv) %>%
               summarise(displ = mean(displ), hwy = mean(hwy)) %>%
               rename(drv2 = drv),
             size = 4, aes(col = drv2)) +
  facet_wrap(~drv)

Coordinate System

There are two types of coordinate systems.

  1. Linear coordinate systems preserve the shape of geoms:
  • coord_cartesian(): the default Cartesian coordinate system, where the 2d position of an element is given by the combination of the x and y positions.
  • coord_flip(): Cartesian coordinate system with x and y axes flipped.
  • coord_fixed(): Cartesian coordinate system with a fixed aspect ratio.
  2. Non-linear coordinate systems can change the shapes: a straight line may no longer be straight, and the shortest path between two points may no longer be a straight line.
  • coord_map()/coord_quickmap(): Map projections.
  • coord_polar(): Polar coordinates.
  • coord_trans(): Apply arbitrary transformations to x and y positions, after the data has been processed by the stat.

Cartesian Coordinates

Zooming In

base <- ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth()

base + xlim(c(5,7))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 196 rows containing non-finite values (stat_smooth).
## Warning: Removed 196 rows containing missing values (geom_point).

base + scale_x_continuous(limits = c(5,7))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 196 rows containing non-finite values (stat_smooth).
## Removed 196 rows containing missing values (geom_point).

base + coord_cartesian(xlim = c(5, 7))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
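
The warnings above hint at the key difference between the three approaches: setting limits on the scale discards data, while coord_cartesian() only crops the view. A minimal sketch of that difference, using only points and hypothetical object names:

```r
library(ggplot2)

pts <- ggplot(mpg, aes(displ, hwy)) + geom_point()

zoom_scale <- pts + scale_x_continuous(limits = c(5, 7))  # out-of-range values become NA
zoom_coord <- pts + coord_cartesian(xlim = c(5, 7))       # all data kept, view is cropped

sum(!is.na(layer_data(zoom_scale)$x))  # only the points inside the limits survive
sum(!is.na(layer_data(zoom_coord)$x))  # every row of mpg is still there
```

This matters most for layers that compute statistics, such as geom_smooth(): with scale limits the smoother is fitted to the remaining points only, whereas with coord_cartesian() it is fitted to all the data and then cropped.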

Flip the axes

base + coord_flip()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Ensuring Fixed Scales

coord_fixed() fixes the ratio of length on the x and y axes. The default ratio ensures that the x and y axes have equal scales: i.e., 1 cm along the x axis represents the same range of data as 1 cm along the y axis.

Polar Coordinates

base <- ggplot(mpg, aes(factor(1), fill = factor(cyl))) +
geom_bar(width = 1) +
scale_x_discrete(NULL, expand = c(0, 0)) +
scale_y_continuous(NULL, expand = c(0, 0))
# Stacked barchart
base

# Pie chart
base + coord_polar(theta = "y")

# The bullseye chart
base + coord_polar()

Using polar coordinates allows us to create pie charts and wind roses (from bar geoms), and radar charts (from line geoms). Polar coordinates should be used for circular data, particularly time or direction, but the perceptual properties are not that good.

I briefly introduced coord_trans() earlier and coord_map() is beyond the scope of this course.

Basic Data Manipulation

Selecting the Columns you want.

Here we are selecting the first 3 columns

select(mpg, c(1,2,3)) #Dplyr

mpg[,c(1,2,3)] #Base R

Filtering the dataset

filter(mpg, cty > 20) #Dplyr

subset(mpg, cty > 20) #Base R

Arranging your data

arrange(mpg, desc(cty)) # Dplyr
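
The three verbs above are usually chained together with the pipe rather than called one at a time. A small sketch, where the threshold and the chosen columns are arbitrary and the name efficient is hypothetical:

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

efficient <- mpg %>%
  filter(cty > 20) %>%                  # keep rows with city mileage above 20
  select(manufacturer, model, cty) %>%  # keep three columns
  arrange(desc(cty))                    # sort from highest to lowest

efficient
```

Each step receives the result of the previous one, so the pipeline reads top to bottom in the order the operations are applied.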

Factoring

mpg <- mutate(mpg, cyl = factor(cyl))
str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : Factor w/ 4 levels "4","5","6","8": 1 1 1 1 3 3 3 1 1 1 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
mpg$cyl <- factor(mpg$cyl) # Base R
str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : Factor w/ 4 levels "4","5","6","8": 1 1 1 1 3 3 3 1 1 1 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

Grouping and Summarizing

mpg %>% group_by(year) %>% summarise(m = mean(cty))
## # A tibble: 2 × 2
##    year     m
##   <int> <dbl>
## 1  1999  17.0
## 2  2008  16.7
mpg %>% group_by(year, cyl) %>% summarise(m = mean(cty))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 7 × 3
## # Groups:   year [2]
##    year cyl       m
##   <int> <fct> <dbl>
## 1  1999 4      20.8
## 2  1999 6      16.1
## 3  1999 8      12.2
## 4  2008 4      21.2
## 5  2008 5      20.5
## 6  2008 6      16.4
## 7  2008 8      12.8
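
The message in the output above is informational: after summarising by two variables, the result is still grouped by year. A minimal sketch of the .groups argument it mentions, which makes the choice explicit:

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# .groups = "drop" returns an ungrouped tibble and silences the message
by_year_cyl <- mpg %>%
  group_by(year, cyl) %>%
  summarise(m = mean(cty), .groups = "drop")

by_year_cyl
```

Dropping the grouping is usually the safer default when the summary is passed on to further mutate() or arrange() calls.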

Gathering

mpg %>%
  select(year, cty, hwy) %>%
  gather(cty:hwy, key = type, value = efficiency) %>%
  ggplot(aes(year, efficiency, fill = type, group = type)) +
  geom_bar(stat = 'identity', position = 'dodge')

ggplot(mpg, aes(year, cty)) +
  geom_bar(stat = 'identity', fill = 'red', alpha = 0.2)  +
  geom_bar(aes(year, hwy),stat = 'identity', fill = 'blue', alpha = 0.4)
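
gather() still works, but the tidyr documentation marks it as superseded by pivot_longer(). A sketch of the same reshaping with pivot_longer(), assuming tidyr 1.0.0 or later is installed; the column names type and efficiency mirror the ones used above.

```r
library(dplyr)
library(tidyr)
library(ggplot2)  # provides the mpg dataset

long <- mpg %>%
  select(year, cty, hwy) %>%
  pivot_longer(cty:hwy, names_to = "type", values_to = "efficiency")

# One row per car and measure, so twice as many rows as mpg
dim(long)
head(long)
```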

Theme

Introduction to Theme System

The theming system is composed of four main components:

  • Theme elements specify the non-data elements that you can control. For example, plot.title controls the appearance of the plot title; axis.ticks.x, the ticks on the x axis; and legend.key.height, the height of the keys in the legend.

  • Each element is associated with an element function, which describes the visual properties of the element. For example, element_text() sets the font size, colour and face of text elements like plot.title.

  • The theme() function allows you to override the default theme elements by calling element functions, like theme(plot.title = element_text(colour = "red")).

  • Complete themes, like theme_grey(), set all of the theme elements to values designed to work together harmoniously.
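
The four components can be seen together in a short sketch. The plot p and the specific colours and sizes are illustrative choices, not part of the original material.

```r
library(ggplot2)

p <- ggplot(mpg, aes(cty, hwy, colour = drv)) +
  geom_point() +
  labs(title = "City vs highway mileage")

themed <- p +
  theme_grey() +                                     # complete theme: sets every element
  theme(                                             # theme(): overrides selected elements
    plot.title        = element_text(colour = "red"),    # theme element = element function
    axis.ticks.x      = element_line(colour = "grey50"),
    legend.key.height = unit(1, "cm")
  )

themed
```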

base <- ggplot(mpg, aes(cty, hwy, color = factor(cyl))) +
  geom_jitter() +
  # For lines, size is named linewidth in ggplot2 3.4.0 and later
  geom_smooth(method = 'lm', colour = "grey50", size = 2, se = FALSE)

base
## `geom_smooth()` using formula 'y ~ x'

We created the base for the plot.

labelled <- base +
  labs(
    x = "City mileage/gallon",
    y = "Highway mileage/gallon",
    colour = "Cylinders",
    title = "Highway and city mileage are highly correlated",
    caption = 'Created by = Arjun'
  ) +
  # A palette given by name is used as-is, so the type argument is not needed ("Spectral" is a diverging palette)
  scale_colour_brewer(palette = "Spectral")
labelled
## `geom_smooth()` using formula 'y ~ x'

I added labels and colours to the base plot.

library(extrafont)
## Registering fonts with R
styled <- labelled +
  theme_bw() +
  theme(
    plot.title = element_text(face = "bold", size = 12, family = "Times New Roman"),
    legend.background = element_rect(fill = "white", size = 4, colour = "white"),
    legend.justification = c(0, 1),
    legend.position = c(0.01, 0.98),
    axis.ticks = element_line(colour = "grey70", size = 0.2),
    panel.grid.major = element_line(colour = "grey70", size = 0.2),
    panel.grid.minor = element_blank()
  ) +
  jtools::drop_gridlines()
styled
## `geom_smooth()` using formula 'y ~ x'

Modifying theme components

element_text() draws labels and headings. You can control the font family, face, colour, size (in points), hjust, vjust, angle (in degrees) and lineheight (as a multiple of the font size).

base_t <- base + labs(title = "This is a ggplot") + xlab(NULL) + ylab(NULL)

base_t + theme(plot.title = element_text(size = 16))
## `geom_smooth()` using formula 'y ~ x'

base_t + theme(plot.title = element_text(face = "bold", colour = "red"))
## `geom_smooth()` using formula 'y ~ x'

base_t + theme(plot.title = element_text(hjust = 1))
## `geom_smooth()` using formula 'y ~ x'
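
The examples above cover size, face, colour and hjust. A sketch of the angle and vjust settings, applied to the x-axis tick labels; the plot is a stand-in built from mpg, not the base_t object used in this section.

```r
library(ggplot2)

p <- ggplot(mpg, aes(manufacturer, hwy)) +
  geom_boxplot()

# Rotate the tick labels by 45 degrees; hjust = 1 right-aligns each label
# so that its end, rather than its middle, sits under the tick
rotated <- p +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

rotated
```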

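  • Margins: the margin argument of element_text() takes a margin() object, which sets the space around the text. Its arguments are t, r, b and l (top, right, bottom, left), in points by default.
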
base_t + theme(plot.title = element_text(margin = margin()))
## `geom_smooth()` using formula 'y ~ x'

base_t + theme(plot.title = element_text(margin = margin(t = 10, b = 10)))
## `geom_smooth()` using formula 'y ~ x'

base_t + theme(axis.title.y = element_text(margin = margin(r = 10)))
## `geom_smooth()` using formula 'y ~ x'

element_line() draws lines parameterised by colour, size (named linewidth in ggplot2 3.4.0 and later) and linetype.

base + theme(panel.grid.major = element_line(colour = "black"))
## `geom_smooth()` using formula 'y ~ x'

base + theme(panel.grid.major = element_line(size = 2))
## `geom_smooth()` using formula 'y ~ x'

base + theme(panel.grid.major = element_line(linetype = "dotted"))
## `geom_smooth()` using formula 'y ~ x'

element_rect() draws rectangles, mostly used for backgrounds, parameterised by fill colour and by border colour, size (named linewidth in ggplot2 3.4.0 and later) and linetype.

base + theme(plot.background = element_rect(fill = "grey80", colour = NA))
## `geom_smooth()` using formula 'y ~ x'

base + theme(plot.background = element_rect(colour = "red", size = 2))
## `geom_smooth()` using formula 'y ~ x'

base + theme(panel.background = element_rect(fill = "linen"))
## `geom_smooth()` using formula 'y ~ x'

element_blank() draws nothing. Use this if you don’t want anything drawn, and no space allocated for that element.

base
## `geom_smooth()` using formula 'y ~ x'

last_plot() + theme(panel.grid.minor = element_blank())
## `geom_smooth()` using formula 'y ~ x'

last_plot() + theme(panel.grid.major = element_blank())
## `geom_smooth()` using formula 'y ~ x'

last_plot() + theme(panel.background = element_blank())
## `geom_smooth()` using formula 'y ~ x'

last_plot() + theme(
axis.title.x = element_blank(),
axis.title.y = element_blank()
)
## `geom_smooth()` using formula 'y ~ x'

last_plot() + theme(axis.line = element_line(colour = "grey50"))
## `geom_smooth()` using formula 'y ~ x'

Combining all themes

Let’s create a custom theme.

mytheme <- function() {
  theme_minimal() %+replace% 
  theme(axis.title = element_text(family = "Georgia", size = 16,color = 'black'),
        panel.background = element_rect(fill = "#5e5848"),
        plot.background = element_rect(fill = "#f5e8c6"),
        axis.title.x = element_text(margin = margin(t = 1, unit = 'lines')),
        axis.title.y = element_text(margin = margin(r = 1, unit = 'lines'), angle = 90),
        axis.text = element_text(family = "Luminari", size = 12,colour = 'gray20'),
        legend.justification = c(0,0),
        legend.position = c(0.86,0.90),
        legend.title = element_blank(),
        legend.text = element_text(family = 'Georgia', size = 22, color = 'white'),
        plot.title = element_text(family = 'Georgia',face = 'bold', size = 26, color = 'black', margin = margin(b = 1, unit = 'lines')),
        plot.caption = element_text(size = 12, hjust = 1),
        panel.grid = element_blank())
}
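
mytheme() combines themes with %+replace% rather than +. The two differ: + updates only the properties you name in an element, while %+replace% swaps in the whole element, so properties you leave out are reset to NULL and are then inherited from the parent element when the plot is drawn. A small sketch of the difference, using theme_grey() as an arbitrary starting point. Reading element properties with $ is an assumption that holds for ggplot2 3.x.

```r
library(ggplot2)

override <- theme(plot.title = element_text(colour = "red"))

added    <- theme_grey() + override            # merges: size, margin, etc. are kept
replaced <- theme_grey() %+replace% override   # replaces: only colour is set

added$plot.title$size      # still the size defined by theme_grey()
replaced$plot.title$size   # NULL: not specified in the replacement element
```

This is why a theme function built with %+replace% should spell out every property it cares about for each element it touches.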

base <- mpg %>%
  gather(cty:hwy, key = cityorhwy, value = efficiency) %>%
  group_by(manufacturer, cityorhwy) %>%
  summarise(efficiency = mean(efficiency)) %>%
  ungroup() %>%
  mutate(manufacturer = str_to_title(manufacturer)) %>%
  ggplot(aes(fct_reorder(manufacturer, efficiency, .desc = TRUE), efficiency)) +
  geom_bar(aes(fill = cityorhwy), position = 'dodge', stat = 'identity', alpha = 0.8) +
  # Dodge width 0.9 matches the default bar dodge, so each label sits above its bar
  geom_text(aes(group = cityorhwy, label = round(efficiency, 0)), position = position_dodge(width = 0.9), vjust = -0.3, col = 'gray', size = 6) +
  scale_y_continuous("Efficiency", breaks = seq(0, 35, 5), labels = scales::unit_format(suffix = " miles/gallon")) +
  # No labels argument here: the axis is reordered, so labels supplied in dataset order would be attached to the wrong bars
  scale_x_discrete("Manufacturer") +
  scale_fill_manual("Legend", labels = c("City", "Highway"), values = c('#fc05fc', '#fc0505')) +
  labs(title = "Most Efficient Car Manufacturers",
      caption = "Created by : Arjun")
## `summarise()` has grouped output by 'manufacturer'. You can override using the
## `.groups` argument.
base

base + jtools::theme_apa()

base +  mytheme() 

Saving plots

ggsave("nameofplot.jpeg", scale = 1, dpi = 800)

This will save the last plot to your working directory. ggsave() can produce .eps, .pdf, .svg, .wmf, .png, .jpg, .bmp, and .tiff. dpi controls the resolution of the plot.
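
ggsave() uses the last plot displayed unless told otherwise. A sketch that passes the plot explicitly and fixes the output size, which tends to be more reproducible in scripts; the file name, the dimensions and the tempdir() location are placeholders.

```r
library(ggplot2)

p <- ggplot(mpg, aes(cty, hwy)) +
  geom_point()

out_file <- file.path(tempdir(), "nameofplot.pdf")

# plot = p avoids relying on last_plot(); the device is chosen from the file extension
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")

file.exists(out_file)
```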

Glossary/Guide

Guide to using Geoms

  1. Graphical Primitives
  • geom_blank(): display nothing. Most useful for adjusting axes limits using data.
  • geom_point(): points.
  • geom_path(): paths.
  • geom_ribbon(): ribbons, a path with vertical thickness.
  • geom_segment(): a line segment, specified by start and end position.
  • geom_rect(): rectangles.
  • geom_polygon(): filled polygons.
  • geom_text(): text.
  2. One variable:
  • Discrete:
    • geom_bar(): display distribution of discrete variable.
  • Continuous
    • geom_histogram(): bin and count continuous variable, display with bars.
    • geom_density(): smoothed density estimate.
    • geom_dotplot(): stack individual points into a dot plot.
    • geom_freqpoly(): bin and count continuous variable, display with lines.
  3. Two variables:
  • Both continuous:
    • geom_point(): scatterplot.
    • geom_quantile(): smoothed quantile regression.
    • geom_rug(): marginal rug plots.
    • geom_smooth(): smoothed line of best fit.
    • geom_text(): text labels.
  • Show distribution:
    • geom_bin2d(): bin into rectangles and count.
    • geom_density2d(): smoothed 2d density estimate.
    • geom_hex(): bin into hexagons and count.
  • At least one discrete:
    • geom_count(): count the number of points at distinct locations.
    • geom_jitter(): randomly jitter overlapping points.
  • One continuous, one discrete:
    • geom_bar(stat = "identity"): a bar chart of precomputed summaries.
    • geom_boxplot(): boxplots.
    • geom_violin(): show density of values in each group.
  • One time, one continuous
    • geom_area(): area plot.
    • geom_line(): line plot.
    • geom_step(): step plot.
  • Display uncertainty:
    • geom_crossbar(): vertical bar with center.
    • geom_errorbar(): error bars.
    • geom_linerange(): vertical line.
    • geom_pointrange(): vertical line with center.
  4. Three variables:
  • geom_contour(): contours.
  • geom_tile(): tile the plane with rectangles.
  • geom_raster(): fast version of geom_tile() for equal sized tiles.

Guide to using Scale transformations

Scales

Name        Function
exp         \(e^{x}\)
identity    \(x\)
log         \(\log(x)\)
log10       \(\log_{10}(x)\)
log2        \(\log_{2}(x)\)
logit       \(\log\left(\frac{x}{1-x}\right)\)
pow10       \(10^{x}\)
reverse     \(-x\)
sqrt        \(x^{1/2}\)
reciprocal  \(x^{-1}\)
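
The names in the table are passed to a continuous scale as a string. A sketch using the trans argument (ggplot2 3.5.0 and later also accept it under the name transform); the diamonds dataset ships with ggplot2, and the choice of variables is arbitrary.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.1)

# Transform both axes by name; tick labels stay in the original data units
p_log <- p +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10")

# Shorthand functions exist for the most common transformations
p_short <- p + scale_x_log10() + scale_y_sqrt()

p_log
```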

Colour guide

  1. Continuous colors or fills
  • scale_fill_distiller()/scale_color_distiller()

  • scale_fill_gradient()/scale_color_gradient() (2 colors)

  • scale_fill_gradient2()/scale_color_gradient2() (3 colors)

  • scale_fill_gradientn()/scale_color_gradientn() (n colors, e.g. from the colorspace package)

  2. Discrete colors
  • scale_fill_hue()/scale_color_hue() (the default color scale)

  • scale_fill_brewer()/scale_color_brewer() (lets you select pre-built palettes)

  • scale_fill_manual()/scale_color_manual() (lets you specify the colors manually)

  • Viridis scales are available in the viridis package, which has several predesigned palettes. Example: p + viridis::scale_color_viridis(discrete = TRUE, option = "plasma").

  • The ggthemes package also contains quite a few color scales. Example: p + ggthemes::scale_colour_solarized(). Function names are case-sensitive, so all of the scale functions above must be typed in lower case.
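
A short sketch contrasting a continuous and a discrete colour scale from the lists above. The variables, the gradient colours and the Set1 palette are arbitrary choices made for illustration.

```r
library(ggplot2)

# Continuous variable mapped to colour: two-colour gradient
p_cont <- ggplot(mpg, aes(cty, hwy, colour = displ)) +
  geom_point() +
  scale_colour_gradient(low = "yellow", high = "red")

# Discrete variable mapped to colour: pre-built ColorBrewer palette
p_disc <- ggplot(mpg, aes(cty, hwy, colour = drv)) +
  geom_point() +
  scale_colour_brewer(palette = "Set1")

p_cont
p_disc
```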

Guide to Themes

Themes

Element          Setter          Description
plot.background  element_rect()  plot background
plot.title       element_text()  plot title
plot.margin      margin()        margins around plot

Axis Elements

Element            Setter          Description
axis.line          element_line()  line parallel to axis
axis.text          element_text()  tick labels
axis.text.x        element_text()  x-axis tick labels
axis.text.y        element_text()  y-axis tick labels
axis.title         element_text()  axis titles
axis.title.x       element_text()  x-axis title
axis.title.y       element_text()  y-axis title
axis.ticks         element_line()  axis tick marks
axis.ticks.length  unit()          length of tick marks

Legend Elements

Element             Setter          Description
legend.background   element_rect()  legend background
legend.key          element_rect()  background of legend keys
legend.key.size     unit()          legend key size
legend.key.height   unit()          legend key height
legend.key.width    unit()          legend key width
legend.margin       margin()        margin around each legend
legend.text         element_text()  legend labels
legend.text.align   numeric         legend label alignment (0 = left, 1 = right)
legend.title        element_text()  legend name
legend.title.align  numeric         legend name alignment (0 = left, 1 = right)
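
A sketch that sets several of the legend elements listed above on one plot. The plot and all of the values are illustrative assumptions.

```r
library(ggplot2)

p <- ggplot(mpg, aes(cty, hwy, colour = drv)) +
  geom_point()

p_legend <- p + theme(
  legend.background = element_rect(fill = "grey95", colour = "grey50"),
  legend.key        = element_rect(fill = "white"),
  legend.key.width  = unit(1, "cm"),
  legend.text       = element_text(size = 10),
  legend.title      = element_text(face = "bold")
)

p_legend
```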

Panel Elements

Element             Setter          Description
panel.background    element_rect()  panel background (under data)
panel.border        element_rect()  panel border (over data)
panel.grid.major    element_line()  major grid lines
panel.grid.major.x  element_line()  vertical major grid lines
panel.grid.major.y  element_line()  horizontal major grid lines
panel.grid.minor    element_line()  minor grid lines
panel.grid.minor.x  element_line()  vertical minor grid lines
panel.grid.minor.y  element_line()  horizontal minor grid lines
aspect.ratio        numeric         plot aspect ratio

Faceting Elements

Element           Setter          Description
strip.background  element_rect()  background of panel strips
strip.text        element_text()  strip text
strip.text.x      element_text()  horizontal strip text
strip.text.y      element_text()  vertical strip text
panel.spacing     unit()          spacing between facets (named panel.margin before ggplot2 2.2.0)
panel.spacing.x   unit()          horizontal spacing between facets
panel.spacing.y   unit()          vertical spacing between facets
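
A sketch of the strip and spacing elements on a faceted plot. It uses panel.spacing, the current name of the element that controls the gap between facets (it was called panel.margin before ggplot2 2.2.0); the plot and the values are illustrative.

```r
library(ggplot2)

p_facet <- ggplot(mpg, aes(cty, hwy)) +
  geom_point() +
  facet_wrap(~drv) +
  theme(
    strip.background = element_rect(fill = "grey20"),
    strip.text       = element_text(colour = "white", face = "bold"),
    panel.spacing    = unit(1, "lines")  # gap between neighbouring facets
  )

p_facet
```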