I have data in this form in a text file:

strings  year  avg
--       --    --
abc      2012  1854
abc      2013  2037
abc      2014  1781
pqr      2011  1346
pqr      2012  1667
xyz      2015  1952

I want to make a scatter plot with the (distinct) strings on the x-axis, the (distinct) years on the y-axis, and a marker (circle) size equal to avg. I am having trouble implementing this in matplotlib because the scatter function expects numerical values for x and y (the data positions), so I cannot assign strings as x and year as y. Do I need to pre-process the data further?

  • What version of matplotlib do you have? Commented Jan 22, 2018 at 0:11
  • @DavidG it's 2.0.2 Commented Jan 22, 2018 at 0:17
  • Matplotlib 2.1 supports plotting categorical data, so if upgrading is an option, that should solve your problem. Commented Jan 22, 2018 at 0:18

2 Answers

Plotting categorical variable scatter with matplotlib >=2.1

In matplotlib 2.1 and later you can pass the strings directly to the scatter function.

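import matplotlib.pyplot as plt

# Data from the question; each position is one (strings, year, avg) row
strings = ["abc", "abc", "abc", "pqr", "pqr", "xyz"]
year = [2012, 2013, 2014, 2011, 2012, 2015]
avg = [1854, 2037, 1781, 1346, 1667, 1952]

# Strings are accepted as x values; s sets the marker size
plt.scatter(strings, year, s=avg)

plt.show()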
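
A note on marker sizes: the s argument of scatter is the marker area in points squared, so passing avg directly makes the area (not the diameter) proportional to the value. If the raw values give markers that are too large or too similar, one option is to rescale them first. The following is only a minimal sketch; scale_sizes and its min_area/max_area defaults are hypothetical names and values chosen for illustration, not part of matplotlib.

```python
def scale_sizes(values, min_area=20.0, max_area=400.0):
    """Linearly rescale values to marker areas in [min_area, max_area].

    Hypothetical helper: the name and the default range are assumptions.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: use one mid-range area for every marker
        return [(min_area + max_area) / 2.0] * len(values)
    span = hi - lo
    return [min_area + (v - lo) / span * (max_area - min_area) for v in values]


avg = [1854, 2037, 1781, 1346, 1667, 1952]
sizes = scale_sizes(avg)  # smallest avg -> min_area, largest avg -> max_area
```

The result can then be passed in place of avg, for example plt.scatter(strings, year, s=sizes).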

Plotting categorical variable scatter with matplotlib < 2.1

In matplotlib below 2.1 you need to plot the data against a numeric index that corresponds to the categories, then set the tick labels accordingly.

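import matplotlib.pyplot as plt
import numpy as np

# Data from the question; each position is one (strings, year, avg) row
strings = ["abc", "abc", "abc", "pqr", "pqr", "xyz"]
year = [2012, 2013, 2014, 2011, 2012, 2015]
avg = [1854, 2037, 1781, 1346, 1667, 1952]

# u holds the sorted unique categories; ind gives each row's index into u
u, ind = np.unique(strings, return_inverse=True)
plt.scatter(ind, year, s=avg)
# Label the integer tick positions with the category names
plt.xticks(range(len(u)), u)

plt.show()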
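
The same index-based idea can also be applied directly to the text from the question, without numpy. This is a sketch under assumptions: the text has one header line followed by one line of dashes, the columns are separated by whitespace, and parse_table, category_positions and plot_scatter are hypothetical helper names. matplotlib is imported inside plot_scatter so that the parsing helpers can be used on their own.

```python
# Sample text in the layout shown in the question (header, dashes, rows)
RAW = """\
strings  year  avg
--       --    --
abc      2012  1854
abc      2013  2037
abc      2014  1781
pqr      2011  1346
pqr      2012  1667
xyz      2015  1952
"""


def parse_table(text):
    """Return (strings, years, avgs) lists from whitespace-separated rows."""
    strings, years, avgs = [], [], []
    for line in text.splitlines()[2:]:  # skip the header and the dashes
        if not line.strip():
            continue  # ignore blank lines
        name, year, avg = line.split()
        strings.append(name)
        years.append(int(year))
        avgs.append(int(avg))
    return strings, years, avgs


def category_positions(labels):
    """Map each label to an integer position, in sorted label order."""
    unique = sorted(set(labels))
    index = {label: i for i, label in enumerate(unique)}
    return unique, [index[label] for label in labels]


def plot_scatter(text):
    """Draw the scatter plot and return the figure."""
    import matplotlib.pyplot as plt  # imported lazily, only needed here

    strings, years, avgs = parse_table(text)
    unique, positions = category_positions(strings)
    fig, ax = plt.subplots()
    ax.scatter(positions, years, s=avgs)
    # Put the category names on the integer tick positions
    ax.set_xticks(range(len(unique)))
    ax.set_xticklabels(unique)
    # One tick per distinct year, so no fractional years appear
    ax.set_yticks(sorted(set(years)))
    return fig
```

To read from a file instead, pass its contents, for example plot_scatter(open(path).read()) with path pointing at your text file, and then call plt.show().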

Output in both cases:

[Image: the resulting scatter plot]

4 Comments

Thank you for the detailed answer! :)
Your first code snippet isn't working on matplotlib 2.1. Can you figure out what's wrong there?
@SaadH The code works fine on matplotlib 2.1, which is why I put it in the answer. If you have a problem running it, first check your matplotlib version with import matplotlib; print(matplotlib.__version__) and then provide a clear problem description: what is the output, is there an error, etc.?
Okay, the problem was not with the version but with the 2nd line, year = range(2012,2018). Changing it to year = [2012,2013,2014,2011,2012,2015] resolves the issue.

I wanted the same thing and found an easier way: you can use Seaborn, a library built on top of Matplotlib.

You can put the text on one axis and the time/year on the other. To frame the data you can set the limits of the year axis. Assume your data is in a DataFrame named df with columns strings, year and avg.

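import seaborn as sns

# df is assumed to be a pandas DataFrame with columns 'strings', 'year', 'avg'
minYear = df['year'].min()
maxYear = df['year'].max()
# Column names are passed as strings and looked up in df
pl = sns.catplot(x='strings', y='year', data=df)
pl.set(ylim=(minYear, maxYear))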

This gives a categorical scatter plot of year against strings.
