I have time series data on which I would like to build an overlaid scatterplot and boxplot. The data looks like this:

    TokenUsed   date
0   8   2020-01-05
1   8   2020-01-05
2   8   2020-01-05
3   8   2020-01-05
4   8   2020-01-05
... ... ...
51040   7   2020-02-23
51041   7   2020-02-23
51042   7   2020-02-23
51043   7   2020-02-23
51044   7   2020-02-23

This time series can be neatly shown as a boxplot (I had trouble with the x-axis being a date, but solved it by converting the dates to strings). Now I would like to show only the data points that exceed a threshold (>81 in my case). The code and the resulting image are below:

fig, ax = plt.subplots(figsize=(12, 6))

ax = sns.boxplot(x="date", y="TokenUsed", data=df, ax=ax, whis=[0, 100])
ax.axhline(81)

plt.locator_params(axis='x', nbins=10)
plt.show()

Sample box plot (1)

When I add a scatter plot, I get image (2), and when I filter to only the values >81, I get image (3). What I don't understand is why the x-axes of the two plots don't match!

Sample box plot with scatter without filtering (2)

Sample box plot with scatter with filtering (3)

Code:

fig, ax = plt.subplots(figsize=(12, 6))

ax = sns.boxplot(x="date", y="TokenUsed", data=df, ax=ax, whis=[0, 100])
# Without filter (image 2)
ax = sns.scatterplot(x="date", y="TokenUsed", data=df, ax=ax, color=".25")
# With filter (image 3), used instead of the call above
ax = sns.scatterplot(x="date", y="TokenUsed", data=df[df["TokenUsed"] > 81], ax=ax, color=".25")

ax.axhline(81)

plt.locator_params(axis='x', nbins=10)
plt.show()

Answer:

Try editing your filtering such that no rows of df are actually removed. That is, apply a mask specifically on the TokenUsed column, such that values are replaced with NaN (rather than the whole row being removed). Here's how I would implement this:

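# make a copy of df and use that for the scatterplot
df2 = df.copy()
# mask values <= 81 so that only values > 81 remain, matching your filter
df2['TokenUsed'] = df2['TokenUsed'].mask(df2['TokenUsed'] <= 81)
ax = sns.scatterplot(x="date", y="TokenUsed", data=df2, ax=ax, color=".25")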

Explanation

Caveat: this is my understanding of what is going on, based on my own observations; I am not familiar with the actual implementation behind the scenes.

seaborn is less aware of the dates than you are expecting. When creating the boxplot and using the date column for the x-axis, seaborn groups the data by each unique value in the date column. It then assigns each unique string an integer position (starting from 0). The y-data are plotted against these integer values, and the x-tick-labels are replaced with the corresponding string values. So in your case, there are 8 unique date strings, and they are plotted at x-positions from 0 to 7. Also, it doesn't actually matter that they look like dates: they are treated as arbitrary labels. You could add more string values to the date column, and their position would not depend on the date they appear to represent (as far as I can tell, plain strings are placed in order of first appearance rather than sorted).
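To make the label-to-position idea concrete, here is a minimal sketch in plain pandas. It does not call any seaborn internals; the `positions` dictionary is a hypothetical stand-in for whatever mapping the plotting library builds, and it assumes labels are numbered in the order they occur in the column.

```python
import pandas as pd

# A tiny, made-up date column stored as strings
dates = pd.Series(
    ["2020-01-05", "2020-01-05", "2020-01-12", "2020-01-19", "2020-01-12"]
)

# Hypothetical stand-in for the categorical mapping:
# one integer slot per unique label, numbered from 0
positions = {label: i for i, label in enumerate(dates.unique())}

# Each row is then drawn at the integer slot of its label
x_numeric = dates.map(positions)

print(positions)           # {'2020-01-05': 0, '2020-01-12': 1, '2020-01-19': 2}
print(x_numeric.tolist())  # [0, 0, 1, 2, 1]
```

The point is only that the x-coordinate comes from the label's slot number, not from the calendar date the string happens to spell out.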

The filter df[df["TokenUsed"]>81] removes every row where the TokenUsed value is 81 or below. In your case, this means the filtered DataFrame has fewer unique date strings than the original data, which produces the unexpected result when plotting. In your filtered data, the first date with values above 81 is 2020-02-09. So in the scatterplot call, those values get plotted at x=0, which is confusing because the values from 2020-01-05 were plotted at x=0 in the call to boxplot.
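The shift can be reproduced without plotting anything. The sketch below uses made-up numbers and a hypothetical helper, `label_positions`, that numbers the unique date strings of a DataFrame from 0; it only imitates the behaviour described above and is not how seaborn or matplotlib are implemented.

```python
import pandas as pd

# Made-up data: only the February dates have values above 81
df = pd.DataFrame({
    "date": ["2020-01-05", "2020-01-05", "2020-02-09", "2020-02-09", "2020-02-16"],
    "TokenUsed": [8, 12, 90, 40, 95],
})

def label_positions(frame):
    """Hypothetical helper: number the unique date labels of `frame` from 0."""
    return {label: i for i, label in enumerate(frame["date"].unique())}

full = label_positions(df)                             # what the boxplot sees
filtered = label_positions(df[df["TokenUsed"] > 81])   # what the scatterplot sees

print(full)      # {'2020-01-05': 0, '2020-02-09': 1, '2020-02-16': 2}
print(filtered)  # {'2020-02-09': 0, '2020-02-16': 1}
```

The same label, '2020-02-09', lands in slot 1 for the full data but in slot 0 for the filtered data, which is the mismatch visible in image (3).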

The fix is to make sure all the original dates are still present in the filtered data, but to replace the filtered out values with NaN so nothing gets plotted.
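A quick way to check the difference between dropping rows and masking values is to compare the two results directly. This is a small sketch on made-up data; the names `dropped` and `masked` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2020-01-05", "2020-01-05", "2020-02-09", "2020-02-09", "2020-02-16"],
    "TokenUsed": [8, 12, 90, 40, 95],
})

# Boolean indexing drops whole rows, so some dates disappear
dropped = df[df["TokenUsed"] > 81]

# mask() keeps every row and replaces the unwanted values with NaN
masked = df.copy()
masked["TokenUsed"] = masked["TokenUsed"].mask(masked["TokenUsed"] <= 81)

print(len(dropped), dropped["date"].nunique())  # 2 2
print(len(masked), masked["date"].nunique())    # 5 3
print(masked["TokenUsed"].notna().sum())        # 2
```

Both versions contain the same two values above 81, but only the masked version still carries all three date labels. Note that the masked column becomes float, because NaN cannot be stored in an integer column.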

Here is the example I used to test this:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# fake data, only one date has values over 80
dr = ['01-05-2020'] * 100 + ['01-12-2020'] * 100 + ['01-19-2020'] * 100
data = list(np.random.randint(0,80,200)) + list(np.random.randint(50,150,100))
df = pd.DataFrame({'date':dr, 'TokenUsed':data})

fig, ax = plt.subplots(figsize = (12,6))
ax = sns.boxplot(x="date", y="TokenUsed", data=df, ax=ax, whis=[0,100])

# the fix: mask values instead of dropping rows, so every date is still present
df2 = df.copy()
df2['TokenUsed'] = df2['TokenUsed'].mask(df2['TokenUsed'] <= 81)
ax = sns.scatterplot(x="date", y="TokenUsed", data=df2, ax=ax, color=".25")

ax.axhline(81)
plt.locator_params(axis='x', nbins=10)
plt.show()

If I use the same filter that you applied, I get the same issue.

1 Comment

Great! This is exactly what the problem was. I forgot that the mask function preserves the index!
