I need to create a somewhat unusual bar plot in matplotlib and the standard functionality does not seem to offer what I need.

I have clustered some documents and want to show the 5 most important keywords per cluster. The first problem is that each cluster forms one group of 5 individual bars. The second problem is that the labels of the individual bars matter: they differ from group to group, and the same label can appear in more than one group.

I have a makeshift prototype that looks like this:

Link

I just plotted all the individual bars in the right order and separated them by empty entries. The biggest problem (aside from being ugly) is that the only way to identify the cluster is by counting the groups. It would help a lot if the clusters could be identified either by color or something else, but I cannot figure out how to do this.

Edit: Here is some requested toy data as well as the code used to produce the plot I already have.

Toy data:

The following two pandas dataframes are stored in a list. The two blocks below show the output of df_list[i].to_csv(). I hope this helps, but for this problem the actual data does not really matter, so you can also just create your own dataframes.

,features,score
0,knowledg,0.09862235117497174
1,manag,0.07812351138840486
2,innov,0.06502084705448799
3,organ,0.0561819290497529
4,km,0.05580332888282127

and

,features,score
0,knowledg,0.04217018718591911
1,develop,0.03423580137595049
2,manag,0.032239226503136
3,system,0.031064303713788467
4,sustain,0.029628875636649198

Code:

The approach for the current solution is to combine all the individual dataframes into one dataframe, add empty entries where necessary, and plot the result.

import matplotlib.pyplot as plt
import pandas as pd


def plot_all_clusters_words(dfs):
    # target structure: word as non-unique column, value as other non-unique column
    df_dict_list = []
    for df in dfs:
        for _, row in df.iterrows():
            df_dict_list.append({"word": row.features, "value": row.score})
        # an empty entry acts as a spacer between clusters
        df_dict_list.append({"word": "", "value": 0})
    df_dict_list = df_dict_list[:-1]  # drop the spacer after the last cluster
    new_df = pd.DataFrame(df_dict_list)
    new_df.plot.bar(x="word")
    plt.show()
    return new_df

Note:

I just need a way to easily identify the groups. If you know a different approach than the ones I suggested above, feel free to suggest it.

  • @JohanC In case you aren't notified about the edit by mail, I use this comment. Commented Dec 21, 2020 at 12:32

1 Answer

Calling plt.bar once for each dataframe, each with its own label and color, creates the following plot:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from io import StringIO

df1_str = '''features,score
0,knowledg,0.09862235117497174
1,manag,0.07812351138840486
2,innov,0.06502084705448799
3,organ,0.0561819290497529
4,km,0.05580332888282127'''
df2_str = '''features,score
0,knowledg,0.04217018718591911
1,develop,0.03423580137595049
2,manag,0.032239226503136
3,system,0.031064303713788467
4,sustain,0.029628875636649198'''
# the header has one field fewer than the data rows, so pandas uses the first column as the index
df1 = pd.read_csv(StringIO(df1_str))
df2 = pd.read_csv(StringIO(df2_str))

dfs = [df1, df2]
cluster_names = [f'cluster {i}' for i in range(1, len(dfs) + 1)]
colors = plt.cm.rainbow(np.linspace(0, 1, len(dfs)))
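bar_width = 0.8  # width of individual bars
cluster_gap = 0.2  # extra distance between clusters
# x position where each cluster starts: cumulative cluster sizes plus gaps
starts = np.append(0, np.array([len(df) + cluster_gap for df in dfs]).cumsum())
# one array of tick positions per cluster (zip ignores the unused final start)
all_tickpos = [s + np.arange(len(df)) for df, s in zip(dfs, starts)]
# one plt.bar call per cluster, so each cluster gets its own color and legend entry
for df, name, color, tickpos in zip(dfs, cluster_names, colors, all_tickpos):
    plt.bar(tickpos, df['score'], width=bar_width, color=color, label=name)
# flatten the labels cluster by cluster, in the same order as the tick positions
plt.xticks(np.concatenate(all_tickpos), [f for df in dfs for f in df['features']], rotation=90)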
plt.legend()
plt.tight_layout()
plt.show()

resulting plot
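
The position arithmetic behind starts and all_tickpos can be inspected without plotting anything. The following is only a minimal sketch of that arithmetic, written with the standard library instead of NumPy. The name group_positions is a hypothetical helper for illustration and is not part of matplotlib or of the code above.

```python
from itertools import accumulate


def group_positions(sizes, gap=0.2):
    """Return one list of x positions per group (hypothetical helper)."""
    # Each group starts where the previous one ended, shifted by the extra gap.
    starts = [0.0] + list(accumulate(size + gap for size in sizes))
    # zip stops at the shorter input, so the final (unused) start is ignored.
    return [[start + i for i in range(size)] for size, start in zip(sizes, starts)]


# Two clusters with five keywords each, as in the toy data.
positions = group_positions([5, 5], gap=0.2)
for cluster_index, xs in enumerate(positions, start=1):
    print(f"cluster {cluster_index}:", [round(x, 2) for x in xs])
```

With a gap of 0.2, the second cluster starts at about x = 5.2 instead of 5.0, which is what visually separates the groups. Clusters of different sizes would work the same way, since each start depends only on the sizes before it.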


3 Comments

This looks good, but there is a problem: The labels are incorrect. For example, develop appears twice for cluster 1 even though it is not part of it at all. When I experiment a bit on my data, the pattern is the following: It uses only the labels for the last cluster and repeats them n times (where n is the number of clusters, in this case 2). The actual values seem to be correct, only the labels are wrong. Additionally, it only draws up to 8 clusters. From the 9th cluster onwards, it does not draw bars at all, there are only empty (and incorrect) labels.
Thank you for checking out the code. The number of clusters was limited by the number of colors (the 'accent' colormap has only 8 colors). I changed the code to use the exact number of equally spaced colors from a rainbow colormap. For the labels, I mistakenly used the wrong order for the double for-loop. This is now corrected.
Thank you, now it is working correctly even for higher numbers of clusters. I could not have solved this problem on my own, but I should be able to deal with the remaining ones.
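
The label problem discussed in these comments comes down to the order of the two for clauses in a nested list comprehension. The earlier version of the answer is not shown here, so the "wrong order" variant below is only an assumption about what such a mistake could look like. Both function names are hypothetical and the data is a shortened copy of the toy data.

```python
def flatten_labels(groups):
    """Correct order: outer loop over groups, inner loop over each group's labels."""
    return [label for group in groups for label in group]


def flatten_labels_wrong_order(groups):
    """Swapped clauses (assumed form of the mistake, for illustration only)."""
    # Stands in for a loop variable left over from an earlier for-loop.
    group = groups[-1]
    # The first iterable is evaluated once, using the leftover `group`,
    # so only the last group's labels are used, each repeated len(groups) times.
    return [label for label in group for group in groups]


clusters = [
    ["knowledg", "manag", "innov"],
    ["knowledg", "develop", "manag"],
]
print(flatten_labels(clusters))
print(flatten_labels_wrong_order(clusters))
```

Under this assumption, the swapped version reproduces the reported symptom: the values are untouched, but every label comes from the last cluster and appears once per cluster.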
