Introduction to Graphics in R
Why not Use Base R for Data Visualization?
Base graphics uses a pen-on-paper model: you can only draw on top of the plot, and you cannot modify or delete existing content. There is no (user-accessible) representation of the graphics apart from their appearance on the screen. Base graphics includes tools for drawing both primitives and entire plots. Base graphics functions are generally fast but have limited scope.
How does GGplot improve upon this?
ggplot2 is an R package for producing statistical, or data, graphics, but it is unlike most other graphics packages because it has a deep underlying grammar. This grammar, based on the Grammar of Graphics (Wilkinson 2005), is made up of a set of independent components that can be composed in many different ways.
Since it is designed to work iteratively, it allows you to start with a layer showing the raw data then add layers of annotations and statistical summaries. It allows you to produce graphics using a structured pattern.
Base R vs GGplot
Base R
GGplot
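As a rough illustration of the comparison, the sketch below draws the same scatterplot both ways. It assumes the mpg dataset that ships with ggplot2; the choice of displ and hwy and the colour are only for illustration.

```r
library(ggplot2)

# Base R: draws straight onto the device; plot() returns NULL,
# so there is no plot object to modify afterwards
plot(mpg$displ, mpg$hwy,
     xlab = "displ", ylab = "hwy", pch = 19, col = "steelblue")

# ggplot2: builds a plot object that can be stored, extended and printed
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(colour = "steelblue")
p
```

Because p is an ordinary R object, further layers can be added to it later with +, which is not possible with the base R version.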
What is the Grammar of Graphics?
The grammar of graphics attempts to answer a simple question: what is a statistical graphic?
The grammar tells us that a statistical graphic is a mapping from -
- Data - the dataset you plan on utilizing.
- Layers - Geometric representations of objects.
- Scales - These allow you to map aesthetic attributes.
- Coordinate System - A coordinate system which describes how data coordinates are mapped to the plane of the graphic.
- Faceting - Also known as latticing or trellising. Allows you to break the dataset into subsets.
- Theme - Allows you to control the finer points of display.
It is the combination of these independent components that make up a graphic. Learning the grammar not only will help you create graphics that you know about now, but will also help you to think about new graphics that would be even better.
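To make the components concrete, here is a minimal sketch in which each line of one plot corresponds to one component of the grammar. The particular variables, palette and theme are arbitrary choices for illustration, not a recommended design.

```r
library(ggplot2)

grammar_demo <- ggplot(mpg, aes(displ, hwy, colour = drv)) +  # data and aesthetic mappings
  geom_point() +                            # layer: geometric representation
  scale_colour_brewer(palette = "Set1") +   # scale: how drv becomes a colour
  coord_cartesian() +                       # coordinate system (the default, written out)
  facet_wrap(~year) +                       # faceting: one panel per year
  theme_minimal()                           # theme: finer points of display
grammar_demo
```

Removing or swapping any single line changes only that component, which is what makes the components independent.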
Basic introduction to GGplot
The goal of this tutorial is to teach you how to produce useful graphics with ggplot2 as quickly as possible.
Dataset Utilized
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
- manufacturer refers to the manufacturer name.
- model refers to the model name.
- displ refers to engine displacement.
- year refers to the year of manufacture.
- cyl refers to the number of cylinders.
- trans refers to the type of transmission.
- drv refers to the type of drive train.
- cty refers to the city miles per gallon.
- hwy refers to the highway miles per gallon.
- fl refers to the fuel type.
- class refers to the type of car.
There are 6 character variables in the dataset.
There are 5 numeric variables in the dataset.
GGPlot Key Components
- Data
- Aesthetic mappings
- Geometric representations
Basic GGplot format
Aesthetic attributes for points
mpg %>% ggplot(aes(x = displ, y = hwy)) +
geom_point(size = 3, shape = 21, col = 'red', alpha = 0.6,
fill = 'orange', stroke = 1.9)
- size controls the size of each point.
- shape selects the symbol drawn for each data point.
- col (colour) sets the colour of each point; for shape 21 this is the outline.
- fill sets the interior colour for shapes that have one, such as shape 21.
- alpha controls transparency.
- stroke controls the thickness of the outer ring of each point.
Map Variables to Aesthetics
GGplot allows you to map certain variables to aesthetics. For example, we can map
Remember that you can specify the aesthetic attribute in geom_point instead of global aesthetics for the plot. You can additionally manipulate other aesthetics or add them to the same plot. Here I chose not to add anything else.
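A minimal sketch of the two places a mapping can live, assuming class is the variable being mapped to colour (that choice is an illustration, not taken from the original text):

```r
library(ggplot2)

# Mapping set globally: every layer added to the plot inherits colour = class
global_map <- ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point()

# Mapping set only inside geom_point(): other layers would not inherit it
layer_map <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class))

global_map
layer_map
```

With a single layer the two plots look the same; the difference only shows once a second layer, such as a smoother, is added.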
Multiple Aesthetic Attributes
What if I specify a color to the aesthetics?
mpg %>% ggplot(aes(x = displ, y = hwy)) +
geom_point(col = 'blue')
mpg %>% ggplot(aes(x = displ, y = hwy, col = 'blue')) +
geom_point()
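In the first plot, the points are coloured blue. In the second plot, however, ggplot treats “blue” as a data value because it appears inside aes(): it maps that single value to the first colour of the default palette (a pinkish colour) and adds a legend.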
Facetting
ggplot(mpg, aes(displ, hwy)) +
geom_point(size = 4, shape = 21, color = 'red', fill = 'orange') +
facet_wrap(~class)
Facetting allows you to subset the data and display additional categorical variables for the plot. Essentially you’re creating tables of graphics for comparison. You’ll eventually learn about 2 types of facetting in ggplot
Other commonly used Geoms
Smoother - We use a smoother to help identify a general trend in the data (most commonly used with 2 numeric variables).
ggplot(mpg, aes(displ, hwy)) +
geom_smooth(span = 0.8, method = 'loess') # or method = 'lm' or # se = F
Boxplot and Violin Plots - Boxplots and violin plots are typically used when you have 1 numeric and 1 categorical variable. But they can also be used with 2 numeric variables if you cut 1 numeric variable up into suitable intervals.
ggplot(mpg, aes(class, cty)) + geom_boxplot(fill = 'red',col = 'black', notch = T)
ggplot(mpg, aes(class, cty)) + geom_violin(fill = 'red',col = 'black')
ggplot(mpg, aes(class, cty)) + geom_jitter(width = 0.2, col = 'red', shape = 21, fill = 'orange', size = 2)
Bar Plot - We use barplots to obtain frequencies for categorical variables but they can also be used to map certain values for categorical values. For example, gender and average height.
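The two uses of bars can be sketched as follows. Since mpg has no gender or height columns, the second plot substitutes average highway mileage per class as the precomputed value; that substitution is an assumption made for illustration.

```r
library(ggplot2)

# Frequencies: geom_bar() counts the rows in each class
count_bars <- ggplot(mpg, aes(class)) +
  geom_bar(fill = "khaki", colour = "black")

# Precomputed values: average highway mileage per class, drawn with geom_col()
avg_hwy <- aggregate(hwy ~ class, data = mpg, FUN = mean)
value_bars <- ggplot(avg_hwy, aes(class, hwy)) +
  geom_col(fill = "khaki", colour = "black")

count_bars
value_bars
```

geom_col() is equivalent to geom_bar(stat = "identity"): the bar height is taken directly from the y variable instead of being counted.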
Histogram, Frequency Polygon and Density Plots - All three provide us with information about the distribution of a single numeric variable.
ggplot(mpg, aes(cty)) + geom_histogram(col = 'black', binwidth = 4, fill = 'khaki')
ggplot(mpg, aes(cty)) + geom_freqpoly(col = 'black', binwidth = 4)
ggplot(mpg, aes(cty)) + geom_density(col = 'black',
fill = 'khaki', alpha = 0.6)
Line Plot - Line Plots are perhaps the most effective when dealing with time series data or understanding how a variable has changed over a period of time.
ggplot(economics, aes(date, unemploy)) +
geom_line()
ggplot(economics, aes(date, unemploy)) +
geom_step()
2d representations as an alternative to points for large datasets - With large amounts of data a scatterplot becomes cluttered, so 2d representations such as binned counts can be much more effective.
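One possible 2d representation is a binned count. The sketch below uses geom_bin2d() on the diamonds dataset from ggplot2; the dataset and the number of bins are illustrative choices.

```r
library(ggplot2)

# diamonds has tens of thousands of rows, so a plain scatterplot overplots heavily
scatter <- ggplot(diamonds, aes(carat, price)) +
  geom_point()

# Binned alternative: count the points that fall in each rectangle
bin_plot <- ggplot(diamonds, aes(carat, price)) +
  geom_bin2d(bins = 50)

bin_plot
```

geom_hex() works the same way with hexagonal bins, but it needs the hexbin package to be installed.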
Some more geoms
df <- data.frame(
x = c(3, 1, 5),
y = c(2, 4, 6),
label = c("a","b","c")
)
p <- ggplot(df, aes(x, y, label = label)) +
labs(x = NULL, y = NULL)
knitr::kable(df)
x | y | label |
---|---|---|
3 | 2 | a |
1 | 4 | b |
5 | 6 | c |
I created a custom dataset to showcase additional geoms and how they work.
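As a sketch of how this small dataset can be used, the block below rebuilds df and p so that it runs on its own, then adds one geom at a time. Which geoms to show is my own selection for illustration.

```r
library(ggplot2)

df <- data.frame(x = c(3, 1, 5), y = c(2, 4, 6), label = c("a", "b", "c"))
p <- ggplot(df, aes(x, y, label = label)) +
  labs(x = NULL, y = NULL)

# One plot per geom, each titled with the geom it uses
geoms <- list(
  point   = geom_point(),
  text    = geom_text(),
  line    = geom_line(),     # connects observations in order of x
  path    = geom_path(),     # connects observations in row order
  area    = geom_area(),
  polygon = geom_polygon()
)
demo_plots <- lapply(names(geoms), function(nm) p + geoms[[nm]] + ggtitle(nm))
names(demo_plots) <- names(geoms)

demo_plots$path   # print any one of them
```

Comparing demo_plots$line with demo_plots$path shows why the rows of df are deliberately not sorted by x.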
Brief Introduction to themes
# install.packages("ggthemes")
library(ggthemes)
# install.packages("jtools")
library(jtools)
ggplot(mpg, aes(displ, hwy)) +
geom_point(size = 4, shape = 21, color = 'red', fill = 'orange') + theme_bw()
ggplot(mpg, aes(displ, hwy)) +
geom_point(size = 4, shape = 21, color = 'red', fill = 'orange') + theme_void()
ggplot(mpg, aes(displ, hwy)) +
geom_point(size = 4, shape = 21, color = 'red', fill = 'orange') + theme_minimal()
ggplot(mpg, aes(displ, hwy)) +
geom_point(size = 4, shape = 21, color = 'red', fill = 'orange') + ggthemes::theme_solarized()
Combining Geoms and labelling
Combining Geoms
Earlier I displayed individual geoms, but one of the most distinctive things about ggplot is that you can just as easily combine them to produce novel graphics or to make the point you are trying to get across.
- What type of geoms can we combine?
Combining similar geoms can help illustrate the point further and make the graph more aesthetically pleasing. The most common example is combining a jitter plot with a box plot.
ggplot(mpg, aes(class, cty)) + geom_boxplot(fill = 'red',col = 'black', notch = T) + geom_jitter(width = 0.2)
This is just 1 example; you can combine any number of geoms. More examples are shown below.
Smoother with points
ggplot(mpg, aes(displ, hwy)) +
geom_point(col = 'red', shape = 21, fill ='orange', size = 3) +
geom_smooth(span = 0.8, method = 'lm', col = 'grey20')
Histogram with Frequency polygon
ggplot(mpg, aes(cty)) +
geom_histogram(col = 'black', binwidth = 4, fill = 'khaki') +
geom_freqpoly(binwidth = 4, col = 'black')
You can combine any number of geoms to produce the plots you want. Other supported geoms include
Labelling the Plots
Alternative Labelling
ggplot(mpg, aes(displ, hwy)) +
geom_point(col = 'red', shape = 21, fill ='orange', size = 3) +
geom_smooth(span = 0.8, method = 'lm', col = 'grey20')+
labs(x = "Engine Displacement",
y = "Highway miles per gallon",
title = "Does Engine Displacement reduce highway efficiency?",
caption = "Created by - Arjun",
subtitle = "Dataset = mpg")
Font Family
Before we explore labelling plots further, it’s important to know the type of fonts available in R. The R system provides you with
df <- data.frame(x = c(rep(1,6), rep(2,6)), y = c(6:1, 6:1), family = c("sans", "serif", "mono", "Times New Roman", "Georgia", "Arial Rounded MT Bold", "Verdana", "Luminari", "Arial", "Andale Mono", "Brush Script MT", "Tahoma"))
ggplot(df, aes(x, y)) +
geom_text(aes(label = family, family = family)) +
xlim(c(0,4))
Sans, Serif and Mono are available in base R but for the others you need to install the extrafont package.
Font Faces
Adding text to the plots
The simplest way to add text to the plot is to use the geometric representations of text or in short
However, let’s install the
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_text(aes(label = model), hjust = "inner") +
xlim(1, 8)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
ggrepel::geom_text_repel(aes(label = model)) +
xlim(1, 8)
## Warning: ggrepel: 197 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Adding the legend to the data
Adding text to the data
ggplot(mpg, aes(displ, hwy, colour = class)) +
geom_point() +
annotate(geom = 'text', x = 5, y = 40, label = "We can clearly see a \n negative trend between engine\n displacement and highway fuel efficiency", hjust = 0.3)
It’s often better to use
Grouping Variables and Data Uncertainty
Grouping variables in R
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
The dataset
Coloring the lines can be quite informative in such a situation, as it helps to distinguish the individuals.
ggplot(Oxboys, aes(age, height)) +
geom_point() +
geom_line(aes(group = Subject, color = Subject))+
scale_color_hue()
There are still quite a few issues with the graph. However, those will be discussed later. Now let’s add a general trend line to this graph and remove the points.
ggplot(Oxboys, aes(age, height)) +
geom_line(aes(group = Subject), alpha = 0.4) +
geom_smooth(method = "lm", size = 2, se = FALSE, col = 'red')
## `geom_smooth()` using formula 'y ~ x'
Data Uncertainty using error bars
Importance of Factoring the Data
Factoring continuous or discrete variables
Sometimes in order to get the desired results we have to change the class of the variable.
## [1] "integer"
For numeric variables, GGplot maps colors along a continuous scale instead of treating each distinct value as its own category. To correct this, we can simply convert the variable in question to a factor.
GGplot builds the legend from the data and the scale rather than letting you draw it by hand, so the key to producing a good legend is to make sure the data is in the correct format.
ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
geom_point() +
labs(color = 'Number of \nCylinders')
Factors additionally allow you to make sure the data is in the correct order, and to cut up continuous variables into segments and treat them as categorical variables.
## Warning: Ignoring unknown parameters: binwidth, bins, pad
# Order trans by frequency so the most common levels come first
ggplot(mpg, aes(fct_infreq(trans), displ)) +
geom_histogram(stat = 'identity', fill = 'violet')
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
# Cutting up a numeric variable to use it as a categorical variable
ggplot(mpg, aes(cut_width(displ, 1), hwy)) +
geom_boxplot(col = 'red', fill = 'violet')
For more information on factoring in detail, I’d refer you to the Forcats Package documentation.
Statistical Transformations and Positional adjustments
Stats
The name of the statistical transformation to use. A statistical transformation performs some useful statistical summary, and is key to histograms and smoothers. To keep the data as is, use the “identity” stat.
You’ll rarely call these functions directly, but they are useful to know about because their documentation often provides more detail about the corresponding statistical transformation.
ggplot(mpg, aes(trans, cty)) +
geom_point() +
# `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
geom_point(stat = "summary", fun = "mean", colour = "red", size = 4)
When dealing with distributions stats can be quite useful.
ggplot(mpg, aes(hwy)) +
geom_histogram(binwidth = 5, col = 'red', fill = 'violet')
ggplot(mpg, aes(hwy)) +
geom_histogram(stat = 'density',binwidth = 5, col = 'red', fill = 'violet')
## Warning: Ignoring unknown parameters: binwidth, bins, pad
# If you want to preserve the histogram with density on the y axis
# (after_stat(density) is the current spelling of the older ..density..)
ggplot(mpg, aes(hwy)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 5, col = 'red', fill = 'violet')
# If we have 2 variables and we want to plot the height instead of count.
ggplot(mpg, aes(manufacturer, hwy)) +
geom_histogram(stat = 'identity', col = 'violet', fill = 'violet')
## Warning: Ignoring unknown parameters: binwidth, bins, pad
Positions
Position adjustments apply minor tweaks to the position of elements within a layer. Three adjustments apply primarily to bars:
- position_stack(): stack overlapping bars (or areas) on top of each other.
- position_fill(): stack overlapping bars, scaling so the top is always at 1.
- position_dodge(): place overlapping bars (or boxplots) side-by-side.
dplot <- ggplot(mpg, aes(class, fill = factor(cyl))) +
xlab(NULL) + ylab(NULL)
dplot + geom_bar(alpha = 2/3)
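The three bar adjustments can be compared side by side with a sketch like the one below, which rebuilds dplot so that it runs on its own. The names stacked, filled and dodged are introduced here only for illustration.

```r
library(ggplot2)

dplot <- ggplot(mpg, aes(class, fill = factor(cyl))) +
  xlab(NULL) + ylab(NULL)

stacked <- dplot + geom_bar(position = "stack")   # the default for bars
filled  <- dplot + geom_bar(position = "fill")    # each bar rescaled to a height of 1
dodged  <- dplot + geom_bar(position = "dodge")   # bars placed side by side

stacked
filled
dodged
```

position = "fill" is the one to reach for when proportions within each class matter more than the raw counts.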
There are three position adjustments that are primarily useful for points: *
Scales
Introduction to Scales
There are 5 main arguments for any scale. 1.
# This is the same as the code below
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
scale_x_continuous() +
scale_y_continuous() +
scale_colour_discrete()
The two plots are exactly the same. ggplot scales the values by default but you can make changes to them if you wish. It would be tedious to manually add a scale every time you used a new aesthetic, so ggplot2 does it for you. But if you want to override the defaults, you’ll need to add the scale yourself.
You can label the plots using the defined scales. To label the legend you have to assign a label to the aesthetic mapping you are scaling by.
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
scale_x_continuous(name = "A really awesome x axis label") +
scale_y_continuous(name = "An amazingly great y axis label")+
scale_colour_discrete(name = "Legend")
# Breaks
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
scale_y_continuous(breaks = seq(10,50, by = 5))
# Labeling
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
scale_y_continuous(breaks = seq(10,50, by = 5), labels = paste0(seq(10,50, by = 5), "m/g")) +
# a labelling function keeps each label attached to its own class
scale_color_discrete(labels = function(x) paste0(x, "Mine"))
# Limits
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
scale_y_continuous(breaks = seq(10,50, by = 5), labels = paste0(seq(10,50, by = 5), "m/g"), limits = c(10,50))
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous("Label 1") +
scale_x_continuous("Label 2")
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
This will replace the first scale with the second one.
The use of + to “add” scales to a plot is a little misleading. When you + a scale, you’re not actually adding it to the plot, but overriding the existing scale.
Scaling Families
- Continuous position scales used to map integer, numeric, and date/time data to x and y position.
Every plot has two position scales, x and y. The most common continuous position scales are scale_x_continuous() and scale_y_continuous(), which linearly map data to the x and y axis.
Every continuous scale takes a trans argument, allowing the use of a variety of transformations:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_continuous(trans = "reciprocal")
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_continuous(trans = "reverse")
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_reverse()
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_log10()
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_sqrt()
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_continuous(trans = 'log2')
The transformation is carried out by a “transformer”, which describes the transformation, its inverse, and how to draw the labels.
There are shortcuts for the most common: scale_x_log10(), scale_x_sqrt() and scale_x_reverse().
In either case, the transformation occurs before any statistical summaries. To transform, after statistical computation, use coord_trans().
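The difference between the two can be seen with a fitted line. This is a sketch using mpg with a linear smoother and a log10 transformation on both axes; those choices are assumptions made for illustration.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Scale transformation: the data are logged first, then the line is fitted,
# so the fit is a straight line on the log-log axes
scale_version <- base_plot + scale_x_log10() + scale_y_log10()

# Coordinate transformation: the line is fitted to the raw data,
# then the drawn result is bent by the log transformation
coord_version <- base_plot + coord_trans(x = "log10", y = "log10")

scale_version
coord_version
```

The points land in the same places in both plots; only the fitted line differs, because it is computed at a different stage.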
Scale labeling tricks
ggplot(diamonds, aes(carat, price)) +
geom_hex() +
scale_x_log10() +
scale_fill_distiller(palette = 3) +
scale_y_continuous(labels = scales::dollar_format())
ggplot(diamonds, aes(carat, price)) +
geom_hex() +
scale_x_log10() +
scale_fill_distiller(palette = 3) +
scale_y_continuous(labels = scales::comma_format())
ggplot(diamonds, aes(carat, price)) +
geom_hex() +
scale_x_log10() +
scale_fill_distiller(palette = 3) +
scale_y_continuous(labels = scales::unit_format(suffix = "K", scale = 1/1000))
- Colour scales, used to map continuous and discrete data to colours.
Continuous colors
erupt <- ggplot(faithfuld, aes(waiting, eruptions, fill = density)) +
geom_raster() +
scale_x_continuous(NULL, expand = c(0, 0)) +
scale_y_continuous(NULL, expand = c(0, 0))
mpgplot <- ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(col = cyl))
If you wish to choose prebuilt palettes, you can use scale_color_distiller() or scale_fill_distiller()
erupt + scale_fill_distiller()
erupt + scale_fill_distiller(palette = "RdPu")
erupt + scale_fill_distiller(palette = "YlOrBr")
mpgplot + scale_color_distiller()
mpgplot + scale_color_distiller(palette = 2)
mpgplot + scale_color_distiller(palette = 3)
scale_colour_gradient() and scale_fill_gradient() are perhaps the most effective for scaling continuous colors. However, they are limited to 2 colors (low and high).
scale_colour_gradient2() and scale_fill_gradient2() produce a three-colour gradient, low-mid-high (red-white-blue by default).
scale_colour_gradientn() and scale_fill_gradientn() produce a custom n-colour gradient.
erupt + scale_fill_gradient(low = "white", high = "black")
erupt + scale_fill_gradient2(midpoint = 0.02)
erupt + scale_fill_gradient2(low = 'green', mid = 'yellow', high = 'orange', midpoint = 0.02)
erupt + scale_fill_gradientn(colours = terrain.colors(7))
erupt + scale_fill_gradientn(colours = colorspace::heat_hcl(7))
erupt + scale_fill_gradientn(colours = colorspace::diverge_hcl(7))
mpgplot + scale_color_gradient(low = 'red', high = 'blue')
mpgplot + scale_color_gradient2(low = 'red', mid = 'green' ,high = 'blue', midpoint = 6)
Discrete Colors
There are two color scales I’d recommend for discrete values: scale_color_brewer() and scale_color_hue().
Note: Whether you use the color or fill version depends on which aesthetic you mapped the variable to.
ggplot(mpg, aes(class, fill = class))+
geom_bar()
# Default color scale
ggplot(mpg, aes(class, fill = class))+
geom_bar() +
scale_fill_hue()
# C stands for chroma, h is hue, and l is luminance
ggplot(mpg, aes(class, fill = class))+
geom_bar() +
scale_fill_hue(c = 160, l = 50, h = c(80, 360))
ggplot(mpg, aes(class, fill = class))+
geom_bar() +
scale_fill_brewer()
ggplot(mpg, aes(class, fill = class))+
geom_bar() +
scale_fill_brewer(palette = 'Set1')
ggplot(mpg, aes(class, fill = class))+
geom_bar() +
scale_fill_brewer(palette = 'Accent')
Manual Colors
The first step is to choose a set of colors. I like using the color HEX wheel on Google for this. For points, choose bright colors; for bars, choose uniform colors that are more transparent.
colors <- c('#05fc47','#fc8d05','#ff0d00',"#2e02f2",'#f711ba', "#970af5", "#12b6fc")
ggplot(mpg, aes(class, fill = class))+
geom_bar(alpha = 0.6) +
scale_fill_manual(values = colors)
ggplot(mpg, aes(displ, hwy, fill = class))+
geom_point(size = 4, shape = 21, color = 'black') +
scale_fill_manual(values = colors)
Customize Legend colours manually
colors <- c(compact = sample(colours(), 1),
midsize = sample(colours(), 1),
suv = sample(colours(), 1),
`2seater` = sample(colours(), 1),
minivan = sample(colours(), 1),
pickup = sample(colours(), 1),
subcompact = sample(colours(), 1))
ggplot(mpg, aes(displ, hwy, fill = class))+
geom_point(size = 4, shape = 21, color = 'black') +
scale_fill_manual(values = colors)
Guides: legends and axes
ggplot(mpg, aes(displ, hwy, color = class)) +
geom_point(alpha = 0.4, size = 3) +
geom_point(aes(displ, cty, fill = class), shape = 8, size = 3, show.legend = F)+
scale_color_brewer(palette = 'Set1') +
guides(color = guide_legend(title.position = 'top',keywidth = 0.5,direction = 'vertical',override.aes = list(alpha = 1),keyheight = 1.5))
ggplot(mpg, aes(manufacturer)) +
geom_bar(alpha = 0.7, fill = 'violet') +
guides(x = guide_axis(n.dodge = 2,angle = -15, position = 'top'))
ggplot(mpg, aes(displ, hwy, color = class)) +
geom_point(alpha = 0.4) +
scale_y_continuous(breaks = 0:50, trans = 'reverse') +
guides(y = guide_axis(check.overlap = T,angle = -15,n.dodge = 2),
color = guide_legend(title.position = 'top',keywidth = 0.5,direction = 'vertical',override.aes = list(alpha = 1)))
ggplot(mpg, aes(displ, hwy, fill = cyl)) +
geom_point(shape = 21, col = 'black', size = 6) +
scale_fill_continuous(breaks = seq(4,8,0.5), labels = scales::unit_format(prefix = 'X', suffix = 'k')) +
guides(fill = guide_colourbar(title.position = 'top', barwidth = 1.5, reverse = T, barheight = 14, frame.colour = 'black', ticks.linewidth = 2, ticks.colour = 'black'))
Guides allow you to manipulate the legend, color bar and axes as you see fit. They allow a great deal of flexibility and can be very useful in producing the graphic you want.
Facetting
base <- ggplot(mpg, aes(displ, hwy)) +
geom_point() +
xlab(NULL) +
ylab(NULL)
base + facet_wrap(~class, ncol = 3)
Controlling the scales for facet grids
p <- ggplot(mpg, aes(cty, hwy)) +
geom_smooth(method = 'lm') +
geom_jitter(width = 0.1, height = 0.1)
p + facet_wrap(~cyl)
## `geom_smooth()` using formula 'y ~ x'
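One way to control the scales is the scales argument of facet_wrap(). The sketch below rebuilds p so that it runs on its own; the names fixed_scales, free_scales and free_y_only are introduced here for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(cty, hwy)) +
  geom_smooth(method = "lm", formula = y ~ x) +
  geom_jitter(width = 0.1, height = 0.1)

fixed_scales <- p + facet_wrap(~cyl)                     # default: all panels share axes
free_scales  <- p + facet_wrap(~cyl, scales = "free")    # each panel gets its own axes
free_y_only  <- p + facet_wrap(~cyl, scales = "free_y")  # shared x, independent y

free_scales
```

Shared scales make panels directly comparable, while free scales make the pattern inside each panel easier to see.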
Grouping Variables vs Facetting
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = drv)) +
# keep(is.numeric) drops the drv column, so the grey points are drawn in every facet
geom_point(data = mpg %>% keep(is.numeric), size = 1, col = 'grey20', alpha = 0.1) +
facet_wrap(~drv)
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = drv), alpha = 0.3) +
geom_point(data = mpg %>% group_by(drv) %>% summarise(displ = mean(displ),hwy = mean(hwy)) %>% rename(drv2 = drv), size = 4, aes(col = drv2)) +
facet_wrap(~drv)
Coordinate Systems
There are two types of coordinate system.
- Linear coordinate systems preserve the shape of geoms:
  - coord_cartesian(): the default Cartesian coordinate system, where the 2d position of an element is given by the combination of the x and y positions.
  - coord_flip(): Cartesian coordinate system with x and y axes flipped.
  - coord_fixed(): Cartesian coordinate system with a fixed aspect ratio.
- Non-linear coordinate systems can change the shapes: a straight line may no longer be straight. The closest distance between two points may no longer be a straight line.
  - coord_map()/coord_quickmap(): Map projections.
  - coord_polar(): Polar coordinates.
  - coord_trans(): Apply arbitrary transformations to x and y positions, after the data has been processed by the stat.
Cartesian Coordinates
Zooming In
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 196 rows containing non-finite values (stat_smooth).
## Warning: Removed 196 rows containing missing values (geom_point).
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 196 rows containing non-finite values (stat_smooth).
## Removed 196 rows containing missing values (geom_point).
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
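The code for these plots did not survive, so the following is only a sketch of the usual contrast between zooming with scale limits and zooming with coord_cartesian(). The limits c(4, 6) and the object names are assumptions, not the original values.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)

# Scale limits drop the rows outside the range BEFORE the smoother is fitted
# (ggplot2 warns about the removed rows when the plot is drawn)
zoom_scale <- base + scale_x_continuous(limits = c(4, 6))

# coord_cartesian() only changes the visible window; the smoother still
# uses all of the data
zoom_coord <- base + coord_cartesian(xlim = c(4, 6))

zoom_scale
zoom_coord
```

Warnings about removed rows, like the ones printed above, are the sign that limits were set on the scale and not on the coordinate system.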
Flip the axes
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Ensuring Fixed Scales
Polar Coordinates
base <- ggplot(mpg, aes(factor(1), fill = factor(cyl))) +
geom_bar(width = 1) +
scale_x_discrete(NULL, expand = c(0, 0)) +
scale_y_continuous(NULL, expand = c(0, 0))
# Stacked barchart
base
Using polar coordinates allows us to create pie charts and wind roses (from bar geoms), and radar charts (from line geoms). Polar coordinates should be used for circular data, particularly time or direction, but the perceptual properties are not that good.
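A minimal sketch of turning the stacked bar into polar charts, rebuilding base so that it runs on its own (the names pie and bullseye are introduced here for illustration):

```r
library(ggplot2)

base <- ggplot(mpg, aes(factor(1), fill = factor(cyl))) +
  geom_bar(width = 1) +
  scale_x_discrete(NULL, expand = c(0, 0)) +
  scale_y_continuous(NULL, expand = c(0, 0))

pie      <- base + coord_polar(theta = "y")  # bar height becomes the angle: a pie chart
bullseye <- base + coord_polar()             # default theta = "x": concentric rings

pie
bullseye
```

The data and the geom are unchanged between the bar chart and the pie chart; only the coordinate system differs.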
I briefly introduced coord_trans() earlier and coord_map() is beyond the scope of this course.
Basic Data Manipulation
Selecting the Rows you want.
Here we are selecting the first 3 rows
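The original code for this step is missing; one way to select the first 3 rows, assuming dplyr is being used as elsewhere in this tutorial, is sketched below.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

first_three <- mpg %>% slice(1:3)   # rows chosen by position
first_three

# Base R equivalent
head(mpg, 3)
```

slice() also accepts negative positions to drop rows, while filter() is the tool for selecting rows by a condition on their values.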
Factoring
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : Factor w/ 4 levels "4","5","6","8": 1 1 1 1 3 3 3 1 1 1 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
Grouping and Summarizing
## # A tibble: 2 × 2
## year m
## <int> <dbl>
## 1 1999 17.0
## 2 2008 16.7
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 7 × 3
## # Groups: year [2]
## year cyl m
## <int> <fct> <dbl>
## 1 1999 4 20.8
## 2 1999 6 16.1
## 3 1999 8 12.2
## 4 2008 4 21.2
## 5 2008 5 20.5
## 6 2008 6 16.4
## 7 2008 8 12.8
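The code that produced these tables is not shown. The sketch below is one way to get output of the same shape; it assumes that the summarised column m is the mean of cty, which is a guess on my part and not confirmed by the source.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

# One row per year
by_year <- mpg %>%
  group_by(year) %>%
  summarise(m = mean(cty))   # assumption: m is the mean city mileage

# One row per year and cylinder combination
by_year_cyl <- mpg %>%
  mutate(cyl = factor(cyl)) %>%
  group_by(year, cyl) %>%
  summarise(m = mean(cty))

by_year
by_year_cyl
```

With two grouping variables, summarise() drops only the last one, which is why the second result is still grouped by year and why the message about the .groups argument is printed.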
Theme
Introduction to Theme System
The theming system is composed of four main components:
- Theme elements specify the non-data elements that you can control. For example, the plot.title element controls the appearance of the plot title; axis.ticks.x, the ticks on the x axis; legend.key.height, the height of the keys in the legend.
- Each element is associated with an element function, which describes the visual properties of the element. For example, element_text() sets the font size, colour and face of text elements like plot.title.
- The theme() function, which allows you to override the default theme elements by calling element functions, like theme(plot.title = element_text(colour = "red")).
- Complete themes, like theme_grey(), set all of the theme elements to values designed to work together harmoniously.
base <- ggplot(mpg, aes(cty, hwy, color = factor(cyl))) +
geom_jitter() +
geom_smooth(method = 'lm',colour = "grey50", size = 2, se = F)
base
## `geom_smooth()` using formula 'y ~ x'
We created the base for the plot.
labelled <- base +
labs(
x = "City mileage/gallon",
y = "Highway mileage/gallon",
colour = "Cylinders",
title = "Highway and city mileage are highly correlated",
caption = 'Created by = Arjun'
) +
scale_colour_brewer(type = "seq", palette = "Spectral")
labelled
## `geom_smooth()` using formula 'y ~ x'
I added labels and colours to the base plot.
## Registering fonts with R
styled <- labelled +
theme_bw() +
theme(
plot.title = element_text(face = "bold", size = 12, family = "Times New Roman"),
legend.background = element_rect(fill = "white", size = 4, colour = "white"),
legend.justification = c(0, 1),
legend.position = c(0.01, 0.98),
axis.ticks = element_line(colour = "grey70", size = 0.2),
panel.grid.major = element_line(colour = "grey70", size = 0.2),
panel.grid.minor = element_blank()
) +
jtools::drop_gridlines()
styled
## `geom_smooth()` using formula 'y ~ x'
Modifying theme components
base_t <- base + labs(title = "This is a ggplot") + xlab(NULL) + ylab(NULL)
base_t + theme(plot.title = element_text(size = 16))
## `geom_smooth()` using formula 'y ~ x'
- Margins
## `geom_smooth()` using formula 'y ~ x'
Let’s create a custom theme.
mytheme <- function() {
theme_minimal() %+replace%
theme(axis.title = element_text(family = "Georgia", size = 16,color = 'black'),
panel.background = element_rect(fill = "#5e5848"),
plot.background = element_rect(fill = "#f5e8c6"),
axis.title.x = element_text(margin = margin(t = 1, unit = 'lines')),
axis.title.y = element_text(margin = margin(r = 1, unit = 'lines'), angle = 90),
axis.text = element_text(family = "Luminari", size = 12,colour = 'gray20'),
legend.justification = c(0,0),
legend.position = c(0.86,0.90),
legend.title = element_blank(),
legend.text = element_text(family = 'Georgia', size = 22, color = 'white'),
plot.title = element_text(family = 'Georgia',face = 'bold', size = 26, color = 'black', margin = margin(b = 1, unit = 'lines')),
plot.caption = element_text(size = 12, hjust = 1),
panel.grid = element_blank())
}
base <- mpg %>% gather(cty:hwy, key = cityorhwy, value = efficency) %>%
group_by(manufacturer, cityorhwy) %>% summarise(efficency = mean(efficency)) %>%
mutate(manufacturer = str_to_title(manufacturer)) %>% ungroup() %>%
ggplot(aes(fct_reorder(manufacturer, efficency, .desc = T), efficency)) +
geom_bar(aes(fill = cityorhwy),position = 'dodge' ,stat = 'identity', alpha = 0.8) +
geom_text(aes(group = cityorhwy, label = round(efficency,0)), position = position_dodge(width = 0.8), vjust = -0.3, col = 'gray', size = 6) +
scale_y_continuous("Efficiency" ,breaks = seq(0, 35, 5), labels = scales::unit_format(suffix = " miles/gallon")) +
# manufacturer is already title-cased above; overriding the labels here would mislabel the reordered bars
scale_x_discrete("Manufacturer") +
scale_fill_manual("Legend", labels = c("City", "Highway"), values = c('#fc05fc', '#fc0505')) +
labs(title = "Most Efficient Car Manufacturers",
caption = "Created by : Arjun")
## `summarise()` has grouped output by 'manufacturer'. You can override using the
## `.groups` argument.
Saving plots
This will save the last plot to your working directory. ggsave() can produce .eps, .pdf, .svg, .wmf, .png, .jpg, .bmp, and .tiff. dpi controls the resolution of the plot.
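The call itself is missing from the text above. A minimal sketch of ggsave() follows; the file name, size and dpi are hypothetical values, and the file is written to a temporary directory here so that the example does not touch your working directory.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

out_file <- file.path(tempdir(), "mpg-scatter.png")  # hypothetical file name
ggsave(out_file, plot = p, width = 6, height = 4, dpi = 300)
```

If plot is left out, ggsave() saves the last plot that was displayed, and the file type is picked from the extension of the file name.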
Glossary/Guide
Guide to using Geoms
Graphical Primitives
- geom_blank(): display nothing. Most useful for adjusting axes limits using data.
- geom_point(): points.
- geom_path(): paths.
- geom_ribbon(): ribbons, a path with vertical thickness.
- geom_segment(): a line segment, specified by start and end position.
- geom_rect(): rectangles.
- geom_polygon(): filled polygons.
- geom_text(): text.
One Variable
- Discrete:
  - geom_bar(): display distribution of discrete variable.
- Continuous:
  - geom_histogram(): bin and count continuous variable, display with bars.
  - geom_density(): smoothed density estimate.
  - geom_dotplot(): stack individual points into a dot plot.
  - geom_freqpoly(): bin and count continuous variable, display with lines.
Two variables:
- Both continuous:
  - geom_point(): scatterplot.
  - geom_quantile(): smoothed quantile regression.
  - geom_rug(): marginal rug plots.
  - geom_smooth(): smoothed line of best fit.
  - geom_text(): text labels.
- Show distribution:
  - geom_bin2d(): bin into rectangles and count.
  - geom_density2d(): smoothed 2d density estimate.
  - geom_hex(): bin into hexagons and count.
- At least one discrete:
  - geom_count(): count number of points at distinct locations.
  - geom_jitter(): randomly jitter overlapping points.
- One continuous, one discrete:
  - geom_bar(stat = "identity"): a bar chart of precomputed summaries.
  - geom_boxplot(): boxplots.
  - geom_violin(): show density of values in each group.
- One time, one continuous:
  - geom_area(): area plot.
  - geom_line(): line plot.
  - geom_step(): step plot.
- Display uncertainty:
  - geom_crossbar(): vertical bar with center.
  - geom_errorbar(): error bars.
  - geom_linerange(): vertical line.
  - geom_pointrange(): vertical line with center.
Three variables:
- geom_contour(): contours.
- geom_tile(): tile the plane with rectangles.
- geom_raster(): fast version of geom_tile() for equal sized tiles.
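To make the guide above a little more concrete, here is a rough sketch that picks one geom from three of the categories, using the mpg dataset from earlier in the tutorial. The object names (p_hist, p_scatter, p_box) and the chosen binwidth are illustrative assumptions.

```r
library(ggplot2)

# One continuous variable: bin highway efficiency and count.
p_hist <- ggplot(mpg, aes(hwy)) +
  geom_histogram(binwidth = 2)

# Two continuous variables: a scatterplot with a linear fit layered on top.
# method and formula are set explicitly so no fitting message is printed.
p_scatter <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x)

# One discrete, one continuous variable: one boxplot per vehicle class.
p_box <- ggplot(mpg, aes(class, hwy)) +
  geom_boxplot()
```

Printing any of these objects (for example p_scatter) draws the plot. Because geoms are layers, more can be added with + without rewriting the rest.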
Guide to using Scale transformations
Scales
Name | Function |
---|---|
exp | \(e^{x}\) |
identity | \(x\) |
log | \(\log(x)\) |
log10 | \(\log_{10}(x)\) |
log2 | \(\log_{2}(x)\) |
logit | \(\log\left(\frac{x}{1-x}\right)\) |
pow10 | \(10^{x}\) |
reverse | \(-x\) |
sqrt | \(x^{1/2}\) |
reciprocal | \(x^{-1}\) |
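A short sketch of applying two of the transformations in the table, using ggplot2's convenience scale functions scale_x_log10() and scale_y_sqrt(). The object name p_trans is an illustrative assumption. The transformation is applied to the data before plotting, while the axis labels stay in the original units.

```r
library(ggplot2)

# log10 on the x axis and sqrt on the y axis.
p_trans <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_log10() +
  scale_y_sqrt()
```

Other transformations from the table can be requested by name on a continuous scale. The name of that argument has changed between ggplot2 versions, so check the documentation for the version you have installed.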
Colour guide
- Continuous colors or fills:
  - scale_fill_distiller() / scale_color_distiller(): ColorBrewer palettes interpolated for continuous data.
  - scale_fill_gradient() / scale_color_gradient(): gradient between 2 colors.
  - scale_fill_gradient2() / scale_color_gradient2(): gradient through 3 colors.
  - scale_fill_gradientn() / scale_color_gradientn(): gradient through n colors (the colorspace package is a useful source of palettes).
- Discrete colors:
  - scale_fill_hue() / scale_color_hue(): the default color scale.
  - scale_fill_brewer() / scale_color_brewer(): select from pre-built palettes.
  - scale_fill_manual() / scale_color_manual(): specify the colors manually.
- Viridis scales are available in the viridis package, which has multiple predesigned palettes. Example: p + viridis::scale_color_viridis(discrete = TRUE, option = "plasma").
- The ggthemes package also contains quite a few color scales. Example: p + ggthemes::scale_colour_solarized(). Many more scales are available in ggthemes.
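The sketch below shows one continuous and one discrete color scale from the guide above, using only scales that ship with ggplot2. The object names, the chosen colors, and the "Set1" palette are illustrative assumptions. scale_color_brewer() relies on the RColorBrewer package, which is normally installed alongside ggplot2.

```r
library(ggplot2)

# Continuous variable mapped to color: a two-color gradient.
p_gradient <- ggplot(mpg, aes(displ, hwy, color = cty)) +
  geom_point() +
  scale_color_gradient(low = "yellow", high = "red")

# Discrete variable mapped to color: a pre-built ColorBrewer palette.
p_brewer <- ggplot(mpg, aes(displ, hwy, color = drv)) +
  geom_point() +
  scale_color_brewer(palette = "Set1")
```

The fill variants (scale_fill_gradient(), scale_fill_brewer(), and so on) work the same way for geoms that use the fill aesthetic, such as bars.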
Guide to Themes
Themes
Element | Setter | Description |
---|---|---|
plot.background | element_rect() | plot background |
plot.title | element_text() | plot title |
plot.margin | margin() | margins around plot |
Axis Elements
Element | Setter | Description |
---|---|---|
axis.line | element_line() | line parallel to axis |
axis.text | element_text() | tick labels |
axis.text.x | element_text() | x-axis tick labels |
axis.text.y | element_text() | y-axis tick labels |
axis.title | element_text() | axis titles |
axis.title.x | element_text() | x-axis title |
axis.title.y | element_text() | y-axis title |
axis.ticks | element_line() | axis tick marks |
axis.ticks.length | unit() | length of tick marks |
Legend Elements
Element | Setter | Description |
---|---|---|
legend.background | element_rect() | legend background |
legend.key | element_rect() | background of legend keys |
legend.key.size | unit() | legend key size |
legend.key.height | unit() | legend key height |
legend.key.width | unit() | legend key width |
legend.margin | margin() | legend margin |
legend.text | element_text() | legend labels |
legend.text.align | 0 to 1 | legend label alignment (0 = left, 1 = right) |
legend.title | element_text() | legend name |
legend.title.align | 0 to 1 | legend name alignment (0 = left, 1 = right) |
Panel Elements
Element | Setter | Description |
---|---|---|
panel.background | element_rect() | panel background (under data) |
panel.border | element_rect() | panel border (over data) |
panel.grid.major | element_line() | major grid lines |
panel.grid.major.x | element_line() | vertical major grid lines |
panel.grid.major.y | element_line() | horizontal major grid lines |
panel.grid.minor | element_line() | minor grid lines |
panel.grid.minor.x | element_line() | vertical minor grid lines |
panel.grid.minor.y | element_line() | horizontal minor grid lines |
aspect.ratio | numeric | plot aspect ratio |
Faceting Elements
Element | Setter | Description |
---|---|---|
strip.background | element_rect() | background of panel strips |
strip.text | element_text() | strip text |
strip.text.x | element_text() | horizontal strip text |
strip.text.y | element_text() | vertical strip text |
panel.spacing | unit() | spacing between facets (formerly panel.margin) |
panel.spacing.x | unit() | horizontal spacing between facets |
panel.spacing.y | unit() | vertical spacing between facets |
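As a rough illustration of how the elements in these tables are used, the sketch below sets a handful of them in a single theme() call, pairing each element with its setter function. The object name p_themed and the specific colors and sizes are illustrative assumptions.

```r
library(ggplot2)

p_themed <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~drv) +
  labs(title = "Highway efficiency by engine size") +
  theme(
    plot.title = element_text(face = "bold", size = 16),     # plot element
    axis.text = element_text(colour = "gray20"),             # axis element
    axis.ticks.length = unit(4, "pt"),                       # set with unit()
    panel.background = element_rect(fill = "white"),         # panel element
    panel.grid.minor = element_blank(),                      # remove minor grid lines
    strip.background = element_rect(fill = "gray90")         # faceting element
  )
```

element_blank() removes an element entirely. Elements not mentioned in theme() keep the values from the current theme.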