inconsistency between DataFrame.plot.scatter and DataFrame.plot.density()?

Question

The following example illustrates a strange difference between scatter and density plots from a pandas DataFrame, or possibly my lack of understanding:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

n = 25
df = pd.DataFrame({'x': np.random.randn(n), 'y': np.random.randn(n), 'season': np.random.choice(['winter', 'summer'], n)})

plot = df.plot.scatter(x='x', y='y')
plot.get_figure().savefig("test_scatter_all.png")
for s in ['winter', 'summer']:
    sdf = df[df['season'] == s]
    plot = sdf.plot.scatter(x='x', y='y')
    plot.get_figure().savefig("test_scatter_" + s + ".png")

plt.clf()

plot = df['y'].plot.density()
plot.get_figure().savefig("test_density_all.png")
for s in ['winter', 'summer']:
    sdf = df[df['season'] == s]
    plot = sdf['y'].plot.density()
    plot.get_figure().savefig("test_density_" + s + ".png")

What surprised me is that the density plots are additive: the winter chart includes two densities ('all' and winter), and the summer chart includes all three. The scatter plots, on the other hand, include only their own points, i.e., winter values in the winter plot, and so on.
Also, without the plt.clf() call, the density plots would also include the points from the last scatter plot (summer).

Why the difference between the two plot types? And does it mean that I should always use plt.clf() before starting a new plot?

And, as a side note, does it actually make any sense to use the plot object the way I do? I see that I can generate the first plot with

df.plot.scatter(x='x', y='y')
plt.savefig("test_scatter_all.png")

as well, so is there any point in capturing the output of the plot() methods? And does it mean that there is always only one active figure object that the plot() methods write to?

ImportanceOfBeingErnest · Accepted Answer · 2018-02-28 11:45:21Z


The inconsistency is not between density and scatter plots, but between the plotting method of a DataFrame and the plotting method of a Series. In the question, scatter is called on a DataFrame (sdf.plot.scatter), whereas density is called on a Series (sdf['y'].plot.density):

Series.plot draws into the active axes if there is one; otherwise it creates a new figure.
DataFrame.plot draws into a new figure, regardless of whether one already exists.

Example:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'x': np.random.randn(25), 'y': np.random.randn(25), 
                   'season': np.random.choice(['red', 'gold'], 25)})

# This plots the dataframe, and creates two figures
for s in ['red', 'gold']:
    sdf = df[df['season'] == s]
    plot = sdf.plot(kind="line",color=s)
plt.show() 

# This plots a series, and creates a single figure  
for s in ['red', 'gold']:
    sdf = df[df['season'] == s]
    plot = sdf["y"].plot(kind="line",color=s)
plt.show()

Here, sdf.plot creates two figures (one per loop iteration), while sdf["y"].plot draws both lines into the same axes.
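To avoid depending on this implicit behaviour (and on plt.clf()), the target axes can be passed explicitly through the ax argument, which both DataFrame.plot and Series.plot accept. The following is only a minimal sketch: the seeded random data, the Agg backend, and the output file names are assumptions made so that it runs as a standalone script.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
n = 25
df = pd.DataFrame({
    "x": rng.standard_normal(n),
    "y": rng.standard_normal(n),
    "season": rng.choice(["winter", "summer"], n),
})

plt.close("all")

# Implicit behaviour: every DataFrame.plot call opens its own figure ...
for s in ["winter", "summer"]:
    df[df["season"] == s].plot.scatter(x="x", y="y")
n_after_dataframe = len(plt.get_fignums())

# ... while Series.plot keeps drawing into the currently active axes,
# so no further figures are created here.
for s in ["winter", "summer"]:
    df.loc[df["season"] == s, "y"].plot(kind="line")
n_after_series = len(plt.get_fignums())

plt.close("all")

# Explicit behaviour: create one figure per season and pass its axes in.
lines_per_axes = {}
for s in ["winter", "summer"]:
    fig, ax = plt.subplots()
    df.loc[df["season"] == s, "y"].plot(kind="line", ax=ax)
    lines_per_axes[s] = len(ax.lines)
    fig.savefig("test_line_" + s + ".png")  # hypothetical file name
    plt.close(fig)
```

With an explicit ax, each saved figure contains only what was drawn into it, and capturing the returned axes (or the fig from plt.subplots) becomes the natural way to save or modify a specific plot.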

If the goal is to keep a previously plotted density in the plot, you can plot that density, add a second one, save the figure, and then remove the second line again. You end up with only the first density, ready for the next one to be added.

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.randn(25), 'y': np.random.randn(25), 
                   'season': np.random.choice(['red', 'gold'], 25)})

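# Plot the overall density once; this creates the figure and axes
ax = df['y'].plot.density()
for s in ['red', 'gold']:
    sdf = df[df['season'] == s]
    # A Series plot is added to the active axes, i.e. to ax
    sdf["y"].plot.density(color=s)
    ax.get_figure().savefig("test_density_" + s + ".png")
    # Remove the group's line again, leaving only the overall density
    ax.lines[-1].remove()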
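If the overall density should be estimated only once and then reused in a fresh figure per season (as asked in the comments), one option is to do the estimation outside of pandas. The sketch below uses scipy.stats.gaussian_kde, which to my knowledge is what the pandas density plot relies on internally. The evaluation grid, the alternating season assignment, and the file names are assumptions chosen for illustration, so the curves need not match the pandas output exactly.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
n = 25
df = pd.DataFrame({
    "y": rng.standard_normal(n),
    # alternating labels so that both groups certainly have enough points
    "season": np.array(["winter", "summer"])[np.arange(n) % 2],
})

# Estimate the overall density a single time on a fixed grid (grid is an assumption).
grid = np.linspace(df["y"].min() - 1.0, df["y"].max() + 1.0, 200)
overall = gaussian_kde(df["y"].to_numpy())(grid)

n_lines = {}
for s in ["winter", "summer"]:
    fig, ax = plt.subplots()
    # Reuse the precomputed overall curve; only the seasonal KDE is new.
    ax.plot(grid, overall, color="black", label="all")
    season_y = df.loc[df["season"] == s, "y"].to_numpy()
    ax.plot(grid, gaussian_kde(season_y)(grid), label=s)
    ax.set_ylabel("Density")
    ax.legend()
    n_lines[s] = len(ax.lines)
    fig.savefig("test_density_" + s + ".png")  # hypothetical file name
    plt.close(fig)
```

Each figure then holds exactly two curves, the overall density and the seasonal one, and nothing has to be removed afterwards.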

edited Feb 28, 2018 at 11:45

answered Feb 28, 2018 at 10:49

ImportanceOfBeingErnest


8 Comments

ImportanceOfBeingErnest Over a year ago

I'm not sure which of the two behaviours is the one you want. If you specify what the desired output should be, there is surely a solution that I could add to the answer.

Michal Kaut Over a year ago

Thanks, this makes sense. What if I wanted, for each season, the density plot to include two lines: the overall density + the season density. Is there a way to store and re-use the overall density plot, so it is computed only once?

ImportanceOfBeingErnest Over a year ago

For the density, this should be the default behaviour, since you plot a series and hence the plot is added to any previously present plot.

Michal Kaut Over a year ago

Yes, but if I plot the overall density first and then add the winter density, how do I then create the chart for spring, where I want only the overall and the spring density? I do not want to run the overall density estimation again.

ImportanceOfBeingErnest Over a year ago

I don't see the problem with plotting the density twice. Is this step taking too much time, or what is the reason for not wanting to do that?
