
I have two very large dataframes, df and df2, that are identical in size. One is raw data and the other is the filtered version. I'm trying to produce 36 subplots, with each cell containing both the raw and filtered data, and have tried this:

import plotly.graph_objects as go
from plotly.subplots import make_subplots

plot_rows = 6
plot_cols = 6
fig = make_subplots(rows=plot_rows, cols=plot_cols)

x = 0
for i in range(1, plot_rows + 1):
    for j in range(1, plot_cols + 1):
        fig.add_trace(go.Scattergl(x=df.index, y=df[df.columns[x]].values,
                                 name = df.columns[x],
                                 mode = 'lines'),
                      row=i,
                      col=j)
        fig.add_trace(go.Scattergl(x=df2.index, y=df2[df2.columns[x]].values,
                                 name = df2.columns[x],
                                 mode = 'lines'),
                      row=i,
                      col=j)
        x = x+1


fig.show()

The process finishes without error and a window opens; however, it is blank, with no charts at all. I've also tried replacing:

        fig.add_trace(go.Scattergl(x=df2.index, y=df2[df2.columns[x]].values,
                                 name = df2.columns[x],
                                 mode = 'lines'),
                      row=i,
                      col=j)

with:

        fig.append_trace(go.Scattergl(x=df2.index, y=df2[df2.columns[x]].values,
                                 name = df2.columns[x],
                                 mode = 'lines'),
                      row=i,
                      col=j)

Any help or guidance is really appreciated.

  • A few things: why fig.show() for each trace when there is only one figure? Also, with very large data frames, iloc[] would be more efficient. Very large as in 5M+ records? I'm not surprised it's not working; this isn't a suitable approach for very large data sets, as it puts the data into memory multiple times. Commented Sep 21, 2021 at 21:37
  • Fig.show() was a copying error, sorry. I'm fairly new to Python so I'll need to look into iloc[], but this works well for plotting one of the dataframes; when I try to plot both it doesn't throw errors yet produces an empty window. Is this what I should expect to see if it's a memory-related issue? As for the data, I have roughly 350K x 39 sized dataframes. Commented Sep 21, 2021 at 21:44
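
To make the two points in the comment above concrete, here is a minimal sketch that selects columns by position with iloc[] and calls fig.show() only once, outside the loops. The small dummy frames are stand-ins for the question's df (raw) and df2 (filtered); everything else mirrors the question's code.

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# dummy data standing in for the question's raw and filtered frames
df = pd.DataFrame(np.random.rand(1000, 36), columns=[f"c{k}" for k in range(36)])
df2 = df.rolling(10, min_periods=1).mean()

plot_rows, plot_cols = 6, 6
fig = make_subplots(rows=plot_rows, cols=plot_cols)

x = 0
for i in range(1, plot_rows + 1):
    for j in range(1, plot_cols + 1):
        # iloc[:, x] selects the x-th column by position instead of df[df.columns[x]]
        fig.add_trace(go.Scattergl(x=df.index, y=df.iloc[:, x].values,
                                   name=df.columns[x], mode='lines'),
                      row=i, col=j)
        fig.add_trace(go.Scattergl(x=df2.index, y=df2.iloc[:, x].values,
                                   name=df2.columns[x], mode='lines'),
                      row=i, col=j)
        x += 1

fig.show()  # called once, after all traces have been added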

1 Answer

  • you have noted large data frames (39 columns, 350k rows)
  • Plotly Express provides a higher-level API for faceted figures (sub-plots), which is simpler to use
  • reshape the data frames to make them simple to use with Plotly Express:
    1. make a long dataframe instead of a wide one with unstack()
    2. the values from the resulting index become the sub-plot (facet) and the x-axis
    3. pd.concat() the two data frames together
    4. there is far too much data to go into one figure, so sample it down by keeping only every (N // 100)-th row of the source data frames, i.e. roughly 100 points per column
import numpy as np
import pandas as pd
import plotly.express as px

N = 350 * 10**3
C = 39
# generate a dataset same size as indicated in question
df = pd.DataFrame({c: np.random.uniform(1, 5, N)
                   for c in [f"{'' if (c//26)==0 else chr((c//26)+64)}{chr((c%26)+65)}" for c in range(C)]
                  })
# second data frame, same shape different values
df2 = pd.DataFrame(df.values * np.random.uniform(0.4, 0.6, df.values.shape), columns=df.columns)

# generating a figure with all of the data in it will cause issues, so plot sampled data
# (roughly 100 data points per column); use plotly express to simplify generation of sub-plots
fig = px.line(
    pd.concat(
        [
            # unstack() turns each wide frame into a long series:
            # level_0 = original column name, level_1 = original row index, 0 = value
            df.unstack().reset_index().assign(status="clean"),
            df2.unstack().reset_index().assign(status="raw"),
        ]
        # keep only every (N // 100)-th row, i.e. ~100 points per column
    ).loc[lambda d: (d["level_1"] % (N // 100)).eq(0)],
    x="level_1",
    y=0,
    facet_col="level_0",
    facet_col_wrap=6,
    color="status",
)
fig.show()

[Figure: faceted line sub-plots of the sampled clean and raw series, six facets per row]


3 Comments

Thank you for this; unfortunately I require the whole dataset to be plotted, as it contains important spikes that last only a few datapoints. I guess the next best option would be to break it down into multiple figures to prevent a memory issue.
That amount of data causes my Python kernel to crash... it would be far smarter to look for spikes in the data and plot only the regions around them. A bit more work in pandas.
Kernel crash, wow! I don't get anything like that on my end. Searching out spikes in the data programmatically might be more efficient, and I'd like to be able to do that, but for my purposes the whole dataset needs to be plotted. Interestingly, if I arrange 12 subplots per figure it works perfectly with little to no lag; Scattergl seems good with big datasets.
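
A rough sketch of the "multiple figures with 12 subplots each" idea from the last comment, assuming the df (raw) and df2 (filtered) frames generated in the answer above; the chunk size of 12 and the 3 x 4 layout per figure are assumptions for illustration:

import plotly.graph_objects as go
from plotly.subplots import make_subplots

CHUNK = 12          # assumed: 12 sub-plots per figure, as in the comment
COLS_PER_ROW = 4    # assumed layout: up to 3 rows x 4 cols per figure

columns = list(df.columns)
for start in range(0, len(columns), CHUNK):
    chunk = columns[start:start + CHUNK]
    rows = -(-len(chunk) // COLS_PER_ROW)  # ceiling division
    fig = make_subplots(rows=rows, cols=COLS_PER_ROW, subplot_titles=chunk)
    for k, col in enumerate(chunk):
        r, c = k // COLS_PER_ROW + 1, k % COLS_PER_ROW + 1
        fig.add_trace(go.Scattergl(x=df.index, y=df[col].values,
                                   name=f"{col} raw", mode='lines'),
                      row=r, col=c)
        fig.add_trace(go.Scattergl(x=df2.index, y=df2[col].values,
                                   name=f"{col} filtered", mode='lines'),
                      row=r, col=c)
    fig.show()  # one figure per chunk of 12 columns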
