ggplot geom_bar where x = multiple columns

Question

How can I go about making a bar plot where the X comes from multiple values of a data frame?

Fake data:

data <- data.frame(col1 = rep(c("A", "B", "C", "B", "C", "A", "A", "B", "B", "A", "C")),
                   col2 = rep(c(2012, 2012, 2012, 2013, 2013, 2014, 2014, 2014, 2015, 2015, 2015)), 
                   col3 = rep(c("Up", "Down", "Up", "Up", "Down", "Left", "Right", "Up", "Right", "Down", "Up")),
                   col4 = rep(c("Y", "N", "N", "N", "Y", "N", "Y", "Y", "Y", "N", "Y")))

What I'm trying to do is plot the number (and, ideally, the percentage) of Y's and N's in col4, grouped by col1, col2, and col3.

Overall, if there are 50 rows and 25 of the rows have Y's, I should be able to make a graph that looks like this:

I know a basic barplot with ggplot is:

ggplot(data, aes(x = col1, fill = col4)) + geom_bar()

I'm not looking for how many of col4 is found per col3 by col2, though, so facet_wrap() isn't the trick, I think, but I don't know what to do instead.

It's not really clear to me what you're looking for here. Can you provide a pencil-sketch of intended output given your sample data?
– r2evans, Commented Feb 23, 2018 at 2:45
SleepyMiles, if your question is answered, please "accept" your preferred answer below (checkmark to the left of each). You may only "accept" one, but you can (eventually) "upvote" as many as you find worthy.
– r2evans, Commented Feb 23, 2018 at 18:00
@r2evans Hard to choose, but I will indeed pick one and upvote them all. On other stacks, generally there is a waiting period before accepting. Is that not how it works on StackOverflow? I know the volume of questions here is significantly larger.
– Sleepy Miles, Commented Feb 23, 2018 at 19:40
@r2evans No apologies necessary! I definitely understand the sentiment. StackOverflow is a little unusual in that so many great posts don't get upvoted, and so many answers are left unaccepted, compared to what I was used to on another StackExchange site a long time ago (I no longer have that account), so I certainly see how a nudge for newer users is appropriate. Worry not, though, I plan on being an active member here.
– Sleepy Miles, Commented Feb 23, 2018 at 22:01

Phil · Accepted Answer · 2018-02-23 03:39:26Z

9

You need to first convert your data frame into a long format, and then use the created variable to set the facet_wrap().

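library(ggplot2)

# Stack col1, col2 and col3 into key/value pairs, keeping col4 as the id column
data_long <- tidyr::gather(data, key = type_col, value = categories, -col4)

# One panel per original column; each panel gets its own x axis
ggplot(data_long, aes(x = categories, fill = col4)) +
  geom_bar() +
  facet_wrap(~ type_col, scales = "free_x")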
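
For the percentage part of the question, here is a rough sketch of the same idea. It assumes tidyr >= 1.0 and dplyr >= 1.0, and it swaps in pivot_longer() for gather() (gather() still works, but the tidyr docs mark it as superseded). The columns are converted to character first because pivot_longer() will not combine numeric and character columns on its own. position = "fill" rescales each bar to 100%, so the fill shows the share of Y's and N's rather than the raw counts. The y-axis label is my own choice.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

# The sample data from the question
data <- data.frame(
  col1 = c("A", "B", "C", "B", "C", "A", "A", "B", "B", "A", "C"),
  col2 = c(2012, 2012, 2012, 2013, 2013, 2014, 2014, 2014, 2015, 2015, 2015),
  col3 = c("Up", "Down", "Up", "Up", "Down", "Left", "Right", "Up", "Right", "Down", "Up"),
  col4 = c("Y", "N", "N", "N", "Y", "N", "Y", "Y", "Y", "N", "Y")
)

# Long format: one row per (original column, category) pair, col4 kept as is
data_long <- data %>%
  mutate(across(-col4, as.character)) %>%
  pivot_longer(-col4, names_to = "type_col", values_to = "categories")

# Same faceted plot, but each bar is rescaled to show proportions
p <- ggplot(data_long, aes(x = categories, fill = col4)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  facet_wrap(~ type_col, scales = "free_x") +
  ylab("Share of rows")

print(p)
```

Dropping position = "fill" and the scale_y_continuous() line gives back the count version above.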

answered Feb 23, 2018 at 3:39

Phil


1 Comment

Sleepy Miles Over a year ago

Very nice, thank you! I can't +1 yet (new account), but as soon as I can, I will.

r2evans · Answer · 2018-02-23 03:27:33Z

3

A very rough approximation, hoping it'll spark conversation and/or give enough to start.

Your data is too small to do much, so I'll extend it.

set.seed(2)
n <- 100
d <- data.frame(
  cat1 = sample(c('A','B','C'), size=n, replace=TRUE),
  cat2 = sample(c(2012L,2013L,2014L,2015L), size=n, replace=TRUE),
  cat3 = sample(c('^','v','<','>'), size=n, replace=TRUE),
  val = sample(c('X','Y'), size=n, replace=TRUE)
)

I'm using dplyr and tidyr here to reshape the data a little:

library(ggplot2)
library(dplyr)
library(tidyr)

d %>%
  tidyr::gather(cattype, cat, -val) %>%
  filter(val=="Y") %>%
  head
# Warning: attributes are not identical across measure variables; they will be dropped
#   val cattype cat
# 1   Y    cat1   A
# 2   Y    cat1   A
# 3   Y    cat1   C
# 4   Y    cat1   C
# 5   Y    cat1   B
# 6   Y    cat1   C

The next trick is to facet it correctly:

d %>%
  tidyr::gather(cattype, cat, -val) %>%
  filter(val=="Y") %>%
  ggplot(aes(val, fill=cattype)) +
  geom_bar() +
  facet_wrap(~cattype+cat, nrow=1)

answered Feb 23, 2018 at 3:27

r2evans


1 Comment

Sleepy Miles Over a year ago

Wow, thank you so much! It works, and although I don't quite understand cattype and cat, it works! I can't upvote you yet (new account), but whenever I can, I'll come back to do so. Now off to find out how to turn the raw numbers into percentages and format it, but still, wow, thanks!

LachlanO · Answer · 2018-02-23 05:08:16Z

2

Depending on what you're after, you can also get something similar using melt from the reshape package.

(NOTE: this solution is very similar to Phil's, and you could convert it to be just like his if you made col4 your fill instead, didn't filter for only the "Y"s, and included a facet wrap.)

Following on from your data setup:

library(reshape)

#Reshape the data to sort it by all the other column's categories
data$col2 <- as.factor(as.character(data$col2))

breakdown <- melt(data, "col4")

#Our x values are the individual values, e.g. A, 2012, Down.
#Our fill is what we want it grouped by, in this case variable, which is our col1, col2, col3 (default column name from melt)
ggplot(subset(breakdown, col4 == "Y"), aes(x = value, fill = variable)) +
  geom_bar() +
  # scale_x_discrete(drop=FALSE) +
  scale_fill_discrete(labels = c("Letters", "Year", "Direction")) +
  ylab("Number of Yes's")

I'm not 100% sure what you want, but perhaps this is more like it?

EDIT: To show percentages of Yes's instead, we can use ddply from the plyr package to create a data frame holding each variable's yes percentage, then have the bar plot draw that value rather than a count.

library(plyr)

#ddply applies a function to a data frame grouped by columns.
#In this case we group by variable (our col1, col2 and col3) as well as the value.
#The function I apply just calculates the percentage, i.e. number of yeses/number of responses
plot_breakdown <- ddply(breakdown, c("variable", "value"), function(x){sum(x$col4 == "Y")/nrow(x)})

#When we plot we now add y = V1 to plot the percentage response
#Also in geom_bar I've now added stat = 'identity' so it doesn't try and plot counts
ggplot(plot_breakdown, aes(x = value, y = V1, fill = variable)) +
  geom_bar(aes(group = factor(variable)), position = "dodge", stat = 'identity') +
  scale_x_discrete(drop=FALSE) +
  scale_fill_discrete(labels = c("Letters", "Year", "Direction")) +
  ylab("Percentage of Yes's") +
  scale_y_continuous(limits = c(0,1), breaks = seq(0,1,0.25), labels = c("0%", "25%", "50%", "75%", "100%"))

The last line I've added to the ggplot is to just make the y axis look a bit more percentage-y :)
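
If you'd rather stay in the tidyverse, the same per-category percentages can be sketched with dplyr instead of ddply. This is only an approximation of the step above, assuming dplyr >= 1.0 and tidyr >= 1.0; the column names n and pct_yes are my own, and pivot_longer() stands in for melt.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

# The sample data from the question
data <- data.frame(
  col1 = c("A", "B", "C", "B", "C", "A", "A", "B", "B", "A", "C"),
  col2 = c(2012, 2012, 2012, 2013, 2013, 2014, 2014, 2014, 2015, 2015, 2015),
  col3 = c("Up", "Down", "Up", "Up", "Down", "Left", "Right", "Up", "Right", "Down", "Up"),
  col4 = c("Y", "N", "N", "N", "Y", "N", "Y", "Y", "Y", "N", "Y")
)

# One row per (variable, value) with the group size and the share of Y's
pct_yes <- data %>%
  mutate(across(-col4, as.character)) %>%
  pivot_longer(-col4, names_to = "variable", values_to = "value") %>%
  group_by(variable, value) %>%
  summarise(n = n(), pct_yes = mean(col4 == "Y"), .groups = "drop")

# geom_col() plots the precomputed value, like geom_bar(stat = 'identity')
p <- ggplot(pct_yes, aes(x = value, y = pct_yes, fill = variable)) +
  geom_col() +
  scale_y_continuous(limits = c(0, 1), labels = scales::percent) +
  ylab("Percentage of Yes's")

print(p)
```

Keeping n alongside the percentage makes it easy to spot bars that rest on only one or two responses.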

In the comments you've mentioned you want to do this because the sample sizes differ and you want some kind of fair comparison between categories. My advice is to be careful here. Percentages look good, but can really mislead if sample sizes are small. Saying 0% answered yes when you only got one response is heavily biased, for example. My advice would be to either exclude columns with what you deem too small a sample size, or take advantage of the colour field.

#Adding an extra column using ddply again which generates a 1 if the sample size is less than 3, and a 0 otherwise
plot_breakdown <- cbind(plot_breakdown,
                        too_small = factor(ddply(breakdown, c("variable", "value"), function(x){ifelse(nrow(x)<3,1,0)})[,3]))

#Same ggplot as before, except with a colour variable now too (outside line of bar)
#Because of this I also added a way to customise the colours which display, and the names of the colour legend
ggplot(plot_breakdown, aes(x = value, y = V1, fill = variable, colour = too_small)) +
  geom_bar(size = 2, position = "dodge", stat = 'identity') +
  scale_x_discrete(drop=FALSE) +
  labs(fill = "Variable", colour = "Too small?") +
  scale_fill_discrete(labels = c("Letters", "Year", "Direction")) +
  scale_colour_manual(values = c("black", "red"), labels = c("3+ responses", "< 3 responses")) +
  ylab("Percentage of Yes's") +
  scale_y_continuous(limits = c(0,1), breaks = seq(0,1,0.25), labels = c("0%", "25%", "50%", "75%", "100%"))

edited Feb 23, 2018 at 5:08

answered Feb 23, 2018 at 4:03

LachlanO


8 Comments

Sleepy Miles Over a year ago

Another successful implementation, thank you! For this, what would I need to change to show either percentage of Y's out of combined Y's and N's or some stacked bar?

LachlanO Over a year ago

To clarify, do you want percentages per bar? e.g. in your example A had two yesses and two nos so it would be 50%?

Sleepy Miles Over a year ago

In my actual data, the numbers are more lopsided: 91% "Y". I'm trying to think of the best way to represent that given that e.g. "C" has substantially fewer observations than "B." Raw numbers may make the discrepancy look larger, but percentages of "Y" show the real difference.

LachlanO Over a year ago

All right, I just did a bit of an update to my answer. See if you like the options I've given :)

LachlanO Over a year ago

Thank you very much, though the extra voting wasn't necessary! Phil's answer is great! I intentionally gave something different. That's what this site is all about. Best of luck with your project!


Spencer Castro · Answer · 2018-02-23 03:29:55Z

1

If you actually group your Y's and N's by the other three columns, there will be one observation in each group. However, if you had repeated Y's and N's you could recode them to 1's and 0's, and get the percentage. Here's an example:

 library(tidyverse)

 data <- data.frame(col1 = rep(c("A", "B", "C", "B", "C", "A", "A", "B", "B", "A", "C")), 
               col2 = rep(c(2012, 2012, 2012, 2013, 2013, 2014, 2014, 2014, 2015, 2015, 2015)), 
               col3 = rep(c("Up", "Down", "Up", "Up", "Down", "Left", "Right", "Up", "Right", "Down", "Up")), 
               col4 = rep(c("Y", "N", "N", "N", "Y", "N", "Y", "Y", "Y", "N", "Y")))


 data %>%
    dplyr::group_by(col1,col2,col3) %>%
    mutate(col4 = ifelse(col4 == "Y", 1,0)) %>%
    dplyr::summarise(percentage = mean(col4)) %>%
    ggplot(aes(x = col1, y = percentage, color = as.factor(col2), fill = col3)) +
    geom_col(position = position_dodge(width = .5))

Example

answered Feb 23, 2018 at 3:29

Spencer Castro


1 Comment

Sleepy Miles Over a year ago

The "fake data" was just to show its structure, less than a real, meaningful example. The percentage is a neat trick, though, which works well in my real example, so thanks!
