
I have a data frame as shown below: [image: data frame]

The table only contains values from the upper triangle of a matrix. I want to draw a correlation plot (correlogram) in which the colour of each point shows the correlation and its size shows the similarity score. I tried plotting it with ggplot2:

ggplot(df, aes(x = var1, y = var2, color = cor, size = jaccard)) +
geom_point(aes(), alpha = 0.7) +
scale_size_continuous(name = "Correlation") +
scale_color_continuous(name = "Jaccard Index", low = "blue", high = "red")

When I use the above code, it plots the entire matrix and the points are scattered (sample below): [image: sample plot]

I want to make a neat plot that shows values in just the upper triangle. How can I do this in R?

  • Share your full data frame with dput(mydata), or show us the code you used to create it. Commented Apr 29, 2024 at 11:01
  • Hi, I have added a picture of my data frame in the post Commented Apr 29, 2024 at 11:23
  • We can't do anything with a picture of your data, post the output of dput(mydata), so we have the same reproducible data. Commented Apr 29, 2024 at 11:27
  • @KamalikaRay I have figured out the issue with your chart. To fix it, all you need to do is convert var1 and var2 into factors along with the desired order of the levels specified. Below, I have replicated the problem using the "mtcars" dataset. One thing which I couldn't figure out was how to calculate "jaccard index," and thus, I decided to use dummy values. I would appreciate it if you could share how to actually calculate "jaccard index." Thank you! Commented Apr 30, 2024 at 18:22

1 Answer

I have simulated your problem using the "mtcars" dataset. See code below.

install.packages(c("tidyverse", "foreach"))
data(mtcars)

data <- mtcars
colnames(data) <- paste0("var", 1:length(data)) # rename column names as var1, ..., var11 (A, ..., L in pictured data frame)

newdata <- data.frame(column_one = rep(colnames(data)[seq_len(length(data) - 1)], times = seq(from = length(data) - 1, to = 1, by = -1))) # create column 1 of the dataset (var1 in the pictured data frame); seq_len() avoids the precedence trap in 1:length(data)-1, which evaluates to 0:10

library(foreach)
newdata$column_two <- foreach(i = 2:length(data), .combine = "c") %do% {
  colnames(data)[i:length(data)]
} # create column 2 of the dataset (var2 in the pictured data frame)

newdata$column_three <- foreach(i = newdata$column_one, j = newdata$column_two, .combine = "c") %do% {
  cor(data[[i]], data[[j]])
} # calculate correlations and create column 3 of the dataset (correlation in the pictured data frame)

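jaccard <- function(a, b) {
  # treat a and b as sets: distinct shared values over distinct values overall
  intersection <- length(intersect(a, b))
  union <- length(union(a, b)) # length(a) + length(b) - intersection is only valid when neither vector has duplicates
  return(intersection / union)
} # adapted from https://www.r-bloggers.com/2021/11/how-to-calculate-jaccard-similarity-in-r-2/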

newdata$column_four <- foreach(i = newdata$column_one, j = newdata$column_two, .combine = "c") %do% {
  jaccard(data[[i]], data[[j]])
} # calculate jaccard index (jaccard in the pictured data frame)

lapply(newdata, class) # column_one, column_two should be character vectors
# column_three, column_four should be numeric vectors

# if the outcome of lapply() is otherwise, run the below four lines
newdata$column_one <- as.character(newdata$column_one)
newdata$column_two <- as.character(newdata$column_two)
newdata$column_three <- as.numeric(newdata$column_three)
newdata$column_four <- as.numeric(newdata$column_four)

newdata$column_one <- factor(newdata$column_one, levels = c("var1", "var2", "var3", "var4", "var5", "var6", "var7", "var8", "var9", "var10", "var11")) # convert column one into factor with the desired order of the levels specified
newdata$column_two <- factor(newdata$column_two, levels = c("var2", "var3", "var4", "var5", "var6", "var7", "var8", "var9", "var10", "var11")) # convert column two into factor with the desired order of the levels specified

library(tidyverse)
ggplot(newdata, aes(x = column_one, y = column_two, color = column_three, size = column_four)) +
  geom_point(aes(), alpha = 0.7) +
  scale_size_continuous(name = "Jaccard Index") +
  scale_color_continuous(name = "Correlation", low = "blue", high = "red")

The final bubble chart will look like this: [image: final bubble chart]
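As a possible alternative to the foreach loops, the first three columns can also be built in base R from the full correlation matrix by keeping only the cells above the diagonal. This is a minimal sketch, assuming mtcars again stands in for the real data; the name `pairs_df` is made up for illustration, and the Jaccard column is left out.

```r
# Assumption: mtcars stands in for the real data, renamed as in the answer.
data <- mtcars
colnames(data) <- paste0("var", seq_along(data))

cm  <- cor(data)                               # full 11 x 11 correlation matrix
idx <- which(upper.tri(cm), arr.ind = TRUE)    # (row, col) index pairs with row < col

# One row per upper-triangle cell; factor levels fix the axis order.
pairs_df <- data.frame(
  column_one   = factor(rownames(cm)[idx[, "row"]], levels = colnames(data)),
  column_two   = factor(colnames(cm)[idx[, "col"]], levels = colnames(data)),
  column_three = cm[idx]                       # correlation for each pair
)

head(pairs_df)
```

Because the column names match those used above, `pairs_df` should be usable in the same ggplot() call once a `column_four` size variable is added.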

Additional comment: I noticed a minor issue with your code. You map the point colour to the correlation (color = cor), but your colour scale is labelled "Jaccard Index" (scale_color_continuous(name = "Jaccard Index", low = "blue", high = "red")). Likewise, size is mapped to jaccard, but the size scale is labelled "Correlation". The two names should be swapped.
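Applied to a data frame shaped like the one in the question, the factor conversion and the swapped labels might look like the sketch below. The column names (var1, var2, cor, jaccard) come from the question's code, but the values are placeholders invented for illustration, and the sketch assumes the rows are listed so that first appearance gives the desired axis order. The plot is only drawn if ggplot2 is installed.

```r
# Hypothetical stand-in for the asker's data frame: same column names as in
# the question, but the values are placeholders, not real results.
df <- data.frame(
  var1    = c("A", "A", "A", "B", "B", "C"),
  var2    = c("B", "C", "D", "C", "D", "D"),
  cor     = c(0.8, -0.3, 0.1, 0.5, -0.6, 0.2),
  jaccard = c(0.4, 0.1, 0.3, 0.6, 0.2, 0.5),
  stringsAsFactors = FALSE
)

# Fix the axis order explicitly so the points form a single triangle.
lv <- unique(c(df$var1, df$var2))        # labels in order of first appearance
df$var1 <- factor(df$var1, levels = lv)
df$var2 <- factor(df$var2, levels = lv)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = var1, y = var2, color = cor, size = jaccard)) +
    geom_point(alpha = 0.7) +
    scale_size_continuous(name = "Jaccard Index") +   # size is mapped to jaccard
    scale_color_gradient(name = "Correlation", low = "blue", high = "red")  # colour is mapped to cor
  print(p)
}
```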
