Plotting distributions (ggplot2)

Problem

You want to plot a distribution of data.

Solution

This sample data will be used for the examples below:

    set.seed(1234)
    dat <- data.frame(cond = factor(rep(c("A","B"), each=200)),
                      rating = c(rnorm(200), rnorm(200, mean=.8)))

    # View first few rows
    head(dat)
    #>   cond     rating
    #> 1    A -1.2070657
    #> 2    A  0.2774292
    #> 3    A  1.0844412
    #> 4    A -2.3456977
    #> 5    A  0.4291247
    #> 6    A  0.5060559

    library(ggplot2)

Histogram and density plots

The qplot function is supposed to make the same graphs as ggplot, but with a simpler syntax. In practice, it’s often easier to just use ggplot, because the options for qplot can be more confusing. qplot is also deprecated as of ggplot2 3.4.0.

    ## Basic histogram from the vector "rating". Each bin is .5 wide.
    ## These both result in the same output:
    ggplot(dat, aes(x=rating)) + geom_histogram(binwidth=.5)
    # qplot(dat$rating, binwidth=.5)

    # Draw with black outline, white fill
    ggplot(dat, aes(x=rating)) +
        geom_histogram(binwidth=.5, colour="black", fill="white")

    # Density curve
    ggplot(dat, aes(x=rating)) + geom_density()

    # Histogram overlaid with kernel density curve
    # (after_stat(density) needs ggplot2 >= 3.3.0; older versions used ..density..)
    ggplot(dat, aes(x=rating)) +
        geom_histogram(aes(y=after_stat(density)),  # Histogram with density instead of count on y-axis
                       binwidth=.5,
                       colour="black", fill="white") +
        geom_density(alpha=.2, fill="#FF6666")      # Overlay with transparent density plot

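To see why the density curve lines up with the histogram once the y-axis shows density instead of count, it can help to check that the bar areas sum to 1. The following is a minimal sketch: it recreates the sample data so that it runs on its own, assumes ggplot2 3.3.0 or later for after_stat(), and uses the illustrative names p, bins and total_area, which are not part of the recipe above.

```r
library(ggplot2)

# Recreate the sample data so this block runs on its own
set.seed(1234)
dat <- data.frame(cond = factor(rep(c("A","B"), each=200)),
                  rating = c(rnorm(200), rnorm(200, mean=.8)))

# Histogram on the density scale (assumes ggplot2 >= 3.3.0 for after_stat)
p <- ggplot(dat, aes(x=rating)) +
    geom_histogram(aes(y=after_stat(density)), binwidth=.5)

# layer_data() returns the values ggplot2 computed for the first layer
bins <- layer_data(p, 1)

# Area of each bar is density * bin width; the total should be 1
total_area <- sum(bins$density * (bins$xmax - bins$xmin))
total_area
```

Because the total area is 1, the bars are on the same scale as the kernel density estimate, which is what lets the two layers share a y-axis.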

Add a line for the mean:

    # linewidth replaces size for lines as of ggplot2 3.4.0
    ggplot(dat, aes(x=rating)) +
        geom_histogram(binwidth=.5, colour="black", fill="white") +
        geom_vline(aes(xintercept=mean(rating, na.rm=TRUE)),  # Ignore NA values for mean
                   colour="red", linetype="dashed", linewidth=1)

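As a possible variation, the mean can be computed once, outside aes(), and passed to geom_vline as a constant. This sketch recreates the sample data so that it runs on its own; the names rating_mean and p are illustrative, and linewidth assumes ggplot2 3.4.0 or later (older versions use size).

```r
library(ggplot2)

# Recreate the sample data so this block runs on its own
set.seed(1234)
dat <- data.frame(cond = factor(rep(c("A","B"), each=200)),
                  rating = c(rnorm(200), rnorm(200, mean=.8)))

# Compute the mean once, outside aes(); na.rm=TRUE ignores NA values
rating_mean <- mean(dat$rating, na.rm=TRUE)

p <- ggplot(dat, aes(x=rating)) +
    geom_histogram(binwidth=.5, colour="black", fill="white") +
    # A constant xintercept is passed as a plain argument, not a mapping
    geom_vline(xintercept=rating_mean,
               colour="red", linetype="dashed", linewidth=1)
```

Printing p draws the plot. Keeping the value in a variable also makes it easy to reuse, for example in a title or annotation.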

Histogram and density plots with multiple groups

    # Overlaid histograms
    ggplot(dat, aes(x=rating, fill=cond)) +
        geom_histogram(binwidth=.5, alpha=.5, position="identity")

    # Interleaved histograms
    ggplot(dat, aes(x=rating, fill=cond)) +
        geom_histogram(binwidth=.5, position="dodge")

    # Density plots
    ggplot(dat, aes(x=rating, colour=cond)) + geom_density()

    # Density plots with semi-transparent fill
    ggplot(dat, aes(x=rating, fill=cond)) + geom_density(alpha=.3)


Adding lines for each mean requires first creating a separate data frame with the means:

    # Find the mean of each group
    library(plyr)
    cdat <- ddply(dat, "cond", summarise, rating.mean=mean(rating))
    cdat
    #>   cond rating.mean
    #> 1    A -0.05775928
    #> 2    B  0.87324927

    # Overlaid histograms with means
    # (linewidth replaces size for lines as of ggplot2 3.4.0)
    ggplot(dat, aes(x=rating, fill=cond)) +
        geom_histogram(binwidth=.5, alpha=.5, position="identity") +
        geom_vline(data=cdat, aes(xintercept=rating.mean, colour=cond),
                   linetype="dashed", linewidth=1)

    # Density plots with means
    ggplot(dat, aes(x=rating, colour=cond)) +
        geom_density() +
        geom_vline(data=cdat, aes(xintercept=rating.mean, colour=cond),
                   linetype="dashed", linewidth=1)

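If you would rather not load plyr just for this step, a table of group means with the same shape can probably be built with base R. This is a sketch of an alternative, not what the recipe above uses: the name cdat2 is illustrative, and aggregate() is only one of several ways to do it.

```r
# Recreate the sample data so this block runs on its own
set.seed(1234)
dat <- data.frame(cond = factor(rep(c("A","B"), each=200)),
                  rating = c(rnorm(200), rnorm(200, mean=.8)))

# Mean rating per condition, using the formula interface of aggregate()
cdat2 <- aggregate(rating ~ cond, data=dat, FUN=mean)

# Rename the summary column to match the rating.mean column used in the plots
names(cdat2)[names(cdat2) == "rating"] <- "rating.mean"
cdat2
```

The result has one row per level of cond and a rating.mean column, so it can be passed to geom_vline(data=...) in the same way as the data frame of means.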

Using facets:

    ggplot(dat, aes(x=rating)) +
        geom_histogram(binwidth=.5, colour="black", fill="white") +
        facet_grid(cond ~ .)

    # With mean lines, using cdat from above
    # (linewidth replaces size for lines as of ggplot2 3.4.0)
    ggplot(dat, aes(x=rating)) +
        geom_histogram(binwidth=.5, colour="black", fill="white") +
        facet_grid(cond ~ .) +
        geom_vline(data=cdat, aes(xintercept=rating.mean),
                   linetype="dashed", linewidth=1, colour="red")


See Facets (ggplot2) for more details.

Box plots

    # A basic box plot
    ggplot(dat, aes(x=cond, y=rating)) + geom_boxplot()

    # A basic box plot with the conditions colored
    ggplot(dat, aes(x=cond, y=rating, fill=cond)) + geom_boxplot()

    # The above adds a redundant legend. With the legend removed
    # (guides(fill=FALSE) is deprecated in recent ggplot2; use "none"):
    ggplot(dat, aes(x=cond, y=rating, fill=cond)) + geom_boxplot() +
        guides(fill="none")

    # With flipped axes
    ggplot(dat, aes(x=cond, y=rating, fill=cond)) + geom_boxplot() +
        guides(fill="none") + coord_flip()

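Another common way to drop the redundant legend is through the theme rather than through guides. The sketch below recreates the sample data so that it runs on its own; the name p is illustrative. Note that this hides every legend in the plot, not just the one for fill.

```r
library(ggplot2)

# Recreate the sample data so this block runs on its own
set.seed(1234)
dat <- data.frame(cond = factor(rep(c("A","B"), each=200)),
                  rating = c(rnorm(200), rnorm(200, mean=.8)))

# Colored box plot with all legends hidden via the theme
p <- ggplot(dat, aes(x=cond, y=rating, fill=cond)) + geom_boxplot() +
    theme(legend.position="none")
```

Printing p draws the plot. The guides approach is the better fit when a plot has several legends and only one of them should be removed.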

It’s also possible to add the mean by using stat_summary.

    # Add a diamond at the mean, and make it larger
    # (the fun argument replaces fun.y as of ggplot2 3.3.0)
    ggplot(dat, aes(x=cond, y=rating)) + geom_boxplot() +
        stat_summary(fun=mean, geom="point", shape=5, size=4)

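To confirm what stat_summary is drawing, the computed layer data can be compared with group means calculated directly. This is a minimal sketch that recreates the sample data so that it runs on its own; it assumes ggplot2 3.3.0 or later for the fun argument, and the names p, diamonds_y and group_means are illustrative.

```r
library(ggplot2)

# Recreate the sample data so this block runs on its own
set.seed(1234)
dat <- data.frame(cond = factor(rep(c("A","B"), each=200)),
                  rating = c(rnorm(200), rnorm(200, mean=.8)))

p <- ggplot(dat, aes(x=cond, y=rating)) + geom_boxplot() +
    stat_summary(fun=mean, geom="point", shape=5, size=4)

# Layer 2 is the stat_summary layer; its y values are the diamond positions
diamonds_y <- layer_data(p, 2)$y

# Group means computed directly, in the order of the levels of cond
group_means <- as.numeric(tapply(dat$rating, dat$cond, mean))

all.equal(diamonds_y, group_means)
```

The diamonds sit at the group means, while the line inside each box marks the median, so a gap between the two gives a rough hint of skew in that group.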