I have a data set where, for example, company1 has "lenovo" products and "dell" products. I would like to break the data set down to show: these are the lenovo customers (where lenovo == 1); which of them also have dell products, or samsung products, or only have lenovo products? I'd like to show this in a stacked bar chart if possible, or a facet grid. The ideal chart would show each product as a single bar, colored by the count of the other products that those customers hold.

For example, if there are 100 lenovo customers, and 20 of them also have dell, 25 also have apple, and 10 also have samsung (some of these could overlap: a customer with both samsung and dell would be counted in the 20 for dell and in the 10 for samsung), the bar would show 20 colored for dell, 25 colored for apple, 10 colored for samsung, and the remainder colored for lenovo only, with no additional products. The same would then be repeated for dell, showing which of the dell customers also have products from the other groups, and so on.
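
The counts described above can be sketched in base R. This is only an illustration on a small hypothetical data frame (`toy`, with made-up customers), and `breakdown()` is a helper name invented here, not part of any package.

```r
# Hypothetical toy data with known membership (not the random data from the question)
toy <- data.frame(
  a       = paste0("cust", 1:6),
  dell    = c(1, 1, 0, 0, 0, 1),
  apple   = c(0, 1, 0, 0, 1, 0),
  lenovo  = c(1, 1, 1, 1, 0, 0),
  samsung = c(1, 0, 0, 1, 0, 0)
)
products <- c("dell", "apple", "lenovo", "samsung")

# For one anchor product: how many of its customers also hold each other
# product, and how many hold nothing else.
breakdown <- function(data, anchor, products) {
  owners <- data[data[[anchor]] == 1, , drop = FALSE]
  others <- setdiff(products, anchor)
  also   <- colSums(owners[, others, drop = FALSE])
  only   <- sum(rowSums(owners[, others, drop = FALSE]) == 0)
  c(also, setNames(only, paste0(anchor, "_only")))
}

lenovo_counts <- breakdown(toy, "lenovo", products)
print(lenovo_counts)
```

Because a customer with two other products is counted in two segments, the segments can add up to more than the number of lenovo customers (here 6 segment counts for 4 customers).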

reproducible data:

a <- paste0("abcd", c(1:185))
dell <- sample(c(1, 0, 0), size = 185, replace = TRUE)
apple <- sample(c(1, 0, 0, 0, 0), size =185, replace = TRUE)
lenovo <- sample(c(1, 0), size = 185, replace = TRUE)
samsung <- sample(c(1, 0), size = 185, replace = TRUE)
df <- data.frame(a, dell, apple, lenovo, samsung)

I've dug into trying to do something like:

library(ggplot2)
library(scales)  # provides comma()

# the column is named `dell` (lower case) in df
ggplot(df, aes(x = factor(dell))) +
  geom_bar() +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -1) +
  scale_y_continuous(labels = comma)

The other way I was trying to do this was where the data was line by line and if there was a customer with dell and apple and samsung, they would be represented by 3 lines. That way I could facet_grid by the product. The problem is that I have a hard time showing that this one line of customer abcd1 that has apple, also has samsung and dell, and visualizing that.
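
The "one line per customer and product" layout described above can be sketched in base R as follows. The data frame `toy` and the column name `also_has` are hypothetical, chosen only for illustration; joining the long table to itself is one possible way to record what else each customer holds.

```r
# Hypothetical toy data: abcd1 holds all three products
toy <- data.frame(
  a       = c("abcd1", "abcd2", "abcd3"),
  dell    = c(1, 0, 1),
  apple   = c(1, 1, 0),
  samsung = c(1, 0, 0)
)
products <- setdiff(names(toy), "a")

# Long form: one row per (customer, product owned)
long <- do.call(rbind, lapply(products, function(p) {
  ids <- toy$a[toy[[p]] == 1]
  data.frame(a = ids, product = rep(p, length(ids)))
}))

# Self-join on customer so each owned product is paired with the
# customer's other products
other <- setNames(long, c("a", "also_has"))
pairs <- merge(long, other, by = "a")
pairs <- pairs[pairs$product != pairs$also_has, ]
print(pairs)
```

A customer with three products produces six rows in `pairs` (each product paired with the two others), while customers with a single product drop out entirely, so "only this product" customers would need to be handled separately.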

Any help is appreciated!

2 Answers

The widyr package is designed for taking a tidy dataframe, performing wide matrix manipulation, and re-tidying it. Here I first tidy your data into long form, with one column naming the product and one column holding the binary ownership status for each customer. I then drop the "does not have" rows (keeping only hasproduct != 0) and do a pairwise count of all the products.

library(tidyr)
library(widyr)
library(ggplot2)
library(dplyr)

# tibble() replaces the deprecated data_frame()
tibble(a = paste0("abcd", c(1:185)),
       dell = sample(c(1, 0, 0), size = 185, replace = TRUE),
       apple = sample(c(1, 0, 0, 0, 0), size = 185, replace = TRUE),
       lenovo = sample(c(1, 0), size = 185, replace = TRUE),
       samsung = sample(c(1, 0), size = 185, replace = TRUE)) %>% 
  gather(key = product, value = hasproduct, -a) %>% 
  filter(hasproduct != 0) %>% 
  widyr::pairwise_count(product, a, diag = T) %>% 
  ggplot(aes(item1, n, fill = item2)) + 
  geom_col(position = "stack")

If you don't want the size of each group computed (Apple owners who own Apple products) then change diag = T to diag = F.
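
For intuition about what the pairwise count contains, similar numbers can be produced in base R as a co-occurrence matrix. This is a sketch on hypothetical toy data, not widyr's implementation; it assumes one row per customer and 0/1 ownership columns.

```r
# Hypothetical toy data
toy <- data.frame(
  a      = paste0("cust", 1:5),
  dell   = c(1, 1, 0, 0, 1),
  apple  = c(0, 1, 0, 1, 0),
  lenovo = c(1, 1, 1, 0, 0)
)

m  <- as.matrix(toy[, -1])  # customers x products, 0/1
co <- crossprod(m)          # t(m) %*% m: products x products

# Off-diagonal cells: customers holding both products.
# Diagonal cells: customers holding that product at all (the group size).
print(co)

# Zeroing the diagonal corresponds to leaving the group sizes out
co_off <- co
diag(co_off) <- 0
print(co_off)
```

Each row of `co` corresponds to one bar in the stacked chart, with the cells of that row as the colored segments.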

3 Comments

This is excellent Brian thank you! The only hitch I run into with this, which is more a problem of how I'm representing the data, rather than the visualization is that in cases where someone owns 3 products, it will duplicate it 6 times because there will be 6 unique pairwise versions. Meaning that apple will have the same account listed twice if that account shares with dell and lenovo. Any idea around this?
That's the whole idea of your comparison, isn't it? Which of those 6 pairs would you not want shown? The only alternative that still captures all of the information would be to do something like calculating the probability that an Apple owner is also an owner of each other category, which could be a heatmap. Then you ignore overall group size, which is shown here.
It is the idea of the comparison, kind of, but it overstates the actual accounts that exist. The problem I'm realizing with it is that you would need a permutation of every combination as a color, which would be overkill. If there are 8,000 accounts and 7,000 have 1 other product and 1,000 have 2 other products, the result will show 9,000 accounts: 7,000 colored as one thing and 1,000 duplicated to show the other two products. That duplication (I'm realizing now) is an issue because it looks like there are 9,000 accounts instead of 8,000. Your answer was perfect btw, it's my use of it that's hard.
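
One possible way around the double counting raised in these comments, not proposed by anyone in the thread and shown here only as a sketch, is to count each account once per product and color by how many other products the account holds, rather than by which ones. The data frame `toy` and the bucket labels are hypothetical.

```r
# Hypothetical toy data
toy <- data.frame(
  a       = paste0("cust", 1:6),
  dell    = c(1, 1, 0, 0, 0, 1),
  apple   = c(0, 1, 0, 0, 1, 0),
  lenovo  = c(1, 1, 1, 1, 0, 0),
  samsung = c(1, 0, 0, 1, 0, 0)
)
products <- c("dell", "apple", "lenovo", "samsung")

m       <- as.matrix(toy[, products])
n_owned <- rowSums(m)  # number of products per account

# For each product, bucket its owners by the number of OTHER products held
once <- do.call(rbind, lapply(products, function(p) {
  n_other <- n_owned[m[, p] == 1] - 1
  bucket  <- factor(ifelse(n_other >= 2, "2+", as.character(n_other)),
                    levels = c("0", "1", "2+"))
  counts  <- table(bucket)
  data.frame(product  = p,
             n_other  = names(counts),
             accounts = as.vector(counts))
}))
print(once)
```

Each bar then sums to the true number of accounts holding that product, at the cost of no longer showing which other products are involved. The result could be plotted with `geom_col()` using `product` on the x axis, `accounts` on the y axis, and `n_other` as the fill.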

Is this what you were referring to first? If so, let me know, and I'll edit and explain.

library(tidyverse)
library(rlang)

computers <- c("dell", "apple", "lenovo", "samsung")


all_dfs <-
  map(computers, ~ {
    df %>%
      gather(computer, count, !!.x) %>%
      gather(cat, value, setdiff(computers, !!.x)) %>%
      mutate(cat = ifelse(count == 0, "they use same company", cat))
  }) %>%
  reduce(bind_rows)

all_dfs %>%
  ggplot(aes(computer, fill = cat)) +
  geom_bar()
