The following example illustrates a strange difference between scatter and density plots created from a pandas DataFrame (or possibly my lack of understanding):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
n = 25
df = pd.DataFrame({'x': np.random.randn(n), 'y': np.random.randn(n), 'season': np.random.choice(['winter', 'summer'], n)})
plot = df.plot.scatter(x='x', y='y')
plot.get_figure().savefig("test_scatter_all.png")
for s in ['winter', 'summer']:
    sdf = df[df['season'] == s]
    plot = sdf.plot.scatter(x='x', y='y')
    plot.get_figure().savefig("test_scatter_" + s + ".png")
plt.clf()
plot = df['y'].plot.density()
plot.get_figure().savefig("test_density_all.png")
for s in ['winter', 'summer']:
    sdf = df[df['season'] == s]
    plot = sdf['y'].plot.density()
    plot.get_figure().savefig("test_density_" + s + ".png")
What surprised me is that the density plots are additive: the winter chart contains two densities ('all' and winter), and the summer chart contains all three.
The scatter plots, on the other hand, include only their own points, i.e., winter values in the winter plot and summer values in the summer plot.
Also, without the plt.clf() command, the density plots would also include points from the last scatter plot (summer).
Why the difference between the two plot types?
And does it mean that I should always use plt.clf() before starting a new plot?
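For comparison, here is a minimal sketch of a variant that does not depend on whichever figure happens to be current: each chart gets its own figure and axes from plt.subplots(), the axes are passed in through ax=, and the figure is closed after saving. The helper name save_density is my own placeholder, and the Agg backend and the fixed seed are assumptions made only so that the sketch runs without a display and is repeatable.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def save_density(series, path):
    """Draw a density plot of `series` on its own figure and save it to `path`."""
    fig, ax = plt.subplots()      # fresh figure and axes for this chart only
    series.plot.density(ax=ax)    # draw on the axes created above
    fig.savefig(path)
    plt.close(fig)                # close the figure so nothing carries over
    return len(ax.lines)          # number of curves drawn on these axes


np.random.seed(0)  # assumption: fixed seed, only to make the sketch repeatable
n = 25
df = pd.DataFrame({
    'x': np.random.randn(n),
    'y': np.random.randn(n),
    'season': np.random.choice(['winter', 'summer'], n),
})

# Record how many curves end up in each saved chart.
counts = {'all': save_density(df['y'], "test_density_all.png")}
for s in ['winter', 'summer']:
    sdf = df[df['season'] == s]
    counts[s] = save_density(sdf['y'], "test_density_" + s + ".png")
```

With this variant, counts should hold exactly one curve per chart, and no plt.clf() call is needed between plots, since nothing is drawn on a shared figure.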
And, as a side note, does it actually make any sense to use the plot object the way I do? I see that I can generate the first plot with
df.plot.scatter(x='x', y='y')
plt.savefig("test_scatter_all.png")
as well, so is there any point in capturing the output of the plot() methods? And does it mean that there is always only one active figure object that the plot() methods write to?
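One way to probe that last question empirically is to compare the objects that the plot calls return. The sketch below only checks object identity; the variable names are my own, the Agg backend is an assumption so that it runs without a display, and the result reflects the behaviour I observe rather than anything I have found documented.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

plt.close('all')  # start from a clean pyplot state

df = pd.DataFrame({'x': np.random.randn(25), 'y': np.random.randn(25)})

# Two DataFrame scatter plots without an explicit ax argument.
ax1 = df.plot.scatter(x='x', y='y')
ax2 = df.plot.scatter(x='x', y='y')
scatter_shares_figure = ax1.get_figure() is ax2.get_figure()

# Two Series density plots without an explicit ax argument.
ax3 = df['y'].plot.density()
ax4 = df['y'].plot.density()
density_shares_axes = ax3 is ax4

# Number of figures pyplot is currently tracking.
open_figures = len(plt.get_fignums())
print(scatter_shares_figure, density_shares_axes, open_figures)
```

If the two scatter calls return axes on different figures while the two density calls return the same axes, that would match the charts described above, and it would also suggest that the captured return value is useful for saving one specific figure when several are open.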