I have the following dataframe called v.

    date        x1      x2      x3      x4      x5      dname
1   20200705    8119    8013    8133    8031    100806  D1 
2   20200706    8031    7950    8271    8200    443809  D1 
3   20200707    8200    8188    8281    8217    303151  D1 
4   20200708    8217    8200    8365    8334    509629  D1 
5   20200709    8334    8139    8370    8204    588634  D1 
.................................                           
55  20221216    17340   16675   17525   16775   7266    D2 
56  20221219    16690   16395   16770   16495   4393    D2 
57  20221220    16325   16275   17095   16840   5601    D2 
58  20221221    16870   16670   16885   16735   2295    D2 
59  20221222    16725   16470   16850   16485   3359    D2 
.................................                           
125 20200705    9131    9000    9146    9014            D3
126 20200706    9014    8918    9352    9277            D3
127 20200707    9277    9207    9379    9255            D3
128 20200708    9255    9231    9473    9430            D3
129 20200709    9430    9165    9472    9237            D3
.................................                           
500 20221218    1179    1173    1197    1183            D7 
501 20221219    1183    1165    1195    1176            D7 
502 20221220    1176    1151    1229    1216            D7 
503 20221221    1216    1204    1222    1212            D7 
504 20221222    1212    1183    1221    1186            D7 
.................................                           
992                                                     D8 
993 20200721                                        181 D9 
994 20200818                                        50  D9 
995 20200831                                        96  D9 
996 20200925                                        84  D9 
.................................                           
1006    20220705                                    36  D11 
1007    20220718                                    48  D11 
1008    20220728                                    22  D11 
1009    20220818                                    68  D11 
1010    20220923                                   108  D11 

As you can see, some columns are missing values. Sometimes x1-x4 are missing, sometimes x5 is missing, and sometimes x2-x3 are missing. When a value is missing, the cell contains a single space character.

I want to split the rows into separate dataframes, grouped by which columns they have. For example, all rows that have every column go into one frame, the rows without x5 go into their own frame, and so on.

Right now I am manually programming each case. Is there a way to dynamically program this behaviour?

Here is my code,

import pandas as pd

v = pd.read_csv(filepath)

d1 = v[v.x5 == " "]
d2 = v[v.x5 != " "]
d3 = v[(v.x2 != " ") & (v.x3 != " ")]  # parentheses needed: & binds tighter than !=

I also have to check manually which combinations of missing columns exist before I do that, and I have many dataframes like this.

Is there a faster, more efficient way to do this, so that I end up with multiple dataframes like the ones below, where each dataframe contains only rows with the same set of non-missing columns?

df1

    date        x1      x2      x3      x4      x5      dname
1   20200705    8119    8013    8133    8031    100806  D1 
2   20200706    8031    7950    8271    8200    443809  D1 
3   20200707    8200    8188    8281    8217    303151  D1 
4   20200708    8217    8200    8365    8334    509629  D1 
5   20200709    8334    8139    8370    8204    588634  D1 
.................................                           
55  20221216    17340   16675   17525   16775   7266    D2 
56  20221219    16690   16395   16770   16495   4393    D2 
57  20221220    16325   16275   17095   16840   5601    D2 
58  20221221    16870   16670   16885   16735   2295    D2 
59  20221222    16725   16470   16850   16485   3359    D2 

df2

    date        x1      x2      x3      x4              dname
125 20200705    9131    9000    9146    9014            D3
126 20200706    9014    8918    9352    9277            D3
127 20200707    9277    9207    9379    9255            D3
128 20200708    9255    9231    9473    9430            D3
129 20200709    9430    9165    9472    9237            D3
.................................                           
500 20221218    1179    1173    1197    1183            D7 
501 20221219    1183    1165    1195    1176            D7 
502 20221220    1176    1151    1229    1216            D7 
503 20221221    1216    1204    1222    1212            D7 
504 20221222    1212    1183    1221    1186            D7 

etc.

  • for 5 columns, you are looking at potentially 120 subsets - missing just one, missing all pairs of twos, and so on. Do you want those many different dataframes? Commented Dec 23, 2022 at 8:01
  • i think theres only 5 subsets. Commented Dec 23, 2022 at 8:03
  • Just to make sure, " " is not a missing value, it's a string containing a space. Are you looking for missing columns, or something more specific? Commented Dec 23, 2022 at 8:05
  • D11 seems to contradict that: it has x5 data, but not x1234. Commented Dec 23, 2022 at 8:08
  • i need to replace the space with blank values first but im getting an error that it can only be used with str values Commented Dec 23, 2022 at 8:09

1 Answer

If you want to find which columns have missing values in each row, you can create an indicator column that lists the missing columns:

df['group'] = df.isna().apply(lambda x: ','.join(set(x[x].to_dict().keys())), axis = 1)

This gives you a df similar to this:

    date        x1      x2      x3      x4      x5          dname   group
1   20200705    8119.0  8013.0  8133.0  8031.0  100806.0    D1  
2   20200706    8031.0  7950.0  8271.0  8200.0  443809.0    D1  
3   20200707            8188.0  8281.0  8217.0  303151.0    D1      x1
4   20200708                    8365.0  8334.0  509629.0    D1      x1,x2
5   20200709                    8370.0  8204.0  588634.0    D1      x1,x2
55  20221216    17340.0         17525.0 16775.0 7266.0      D2      x2
56  20221219    16690.0         16770.0 16495.0 4393.0      D2      x2
57  20221220    16325.0 16275.0 17095.0 16840.0 5601.0      D2  
58  20221221    16870.0 16670.0 16885.0 16735.0 2295.0      D2  
59  20221222    16725.0         16850.0 16485.0 3359.0      D2      x2
125 20200705    9131.0          9146.0  9014.0              D3      x5,x2
126 20200706    9014.0          9352.0  9277.0              D3      x5,x2
127 20200707    9277.0                                      D3      x5,x2,x3,x4
128 20200708    9255.0                                      D3      x5,x2,x3,x4
129 20200709    9430.0                                      D3      x5,x2,x3,x4
500 20221218    1179.0  1173.0  1197.0  1183.0              D3      x5
501 20221219    1183.0  1165.0  1195.0  1176.0              D3      x5
502 20221220    1176.0          1229.0                      D3      x5,x2,x4
503 20221221    1216.0                                      D3      x5,x2,x3,x4
504 20221222    1212.0  1183.0  1221.0                      D3      x5,x4
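
To try the indicator column on a small scale, here is a minimal sketch on a made-up frame. The name `toy` and its values are assumptions for illustration only, not the asker's data. It joins the missing column names in column order instead of going through a set, so the labels come out in a predictable order.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; NaN marks a missing value.
toy = pd.DataFrame({
    "date": [20200705, 20200706, 20200707],
    "x1": [8119.0, np.nan, np.nan],
    "x2": [8013.0, 7950.0, np.nan],
    "x5": [100806.0, 443809.0, np.nan],
    "dname": ["D1", "D1", "D3"],
})

# Boolean frame: True where a value is missing.
missing = toy.isna()

# For each row, join the names of the missing columns in column order.
toy["group"] = missing.apply(lambda row: ",".join(row[row].index), axis=1)

print(toy)
```

Rows with nothing missing get an empty string as their label, so they still form a group of their own.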

You can then split the frame on the unique values of this indicator column, drop the empty columns from each piece, and append the results to a list:

output = []
for group in df['group'].unique():
    df_temp = df[df['group'] == group].copy().dropna(axis = 1)
    output.append(df_temp)

The output list should contain n dataframes, where n is the number of unique combinations of missing columns in your original df.
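
As a variation on the loop above, the same split could be done with groupby, which avoids filtering the frame once per group and keeps each piece addressable by its missing-column pattern. This is only a sketch: `split_by_missing` and the sample frame are hypothetical names and data, and it assumes missing values are already NaN.

```python
import numpy as np
import pandas as pd

def split_by_missing(df):
    """Return a dict mapping a missing-column pattern to its sub-frame."""
    missing = df.isna()
    # One label per row, e.g. "" (nothing missing) or "x1,x5".
    key = missing.apply(lambda row: ",".join(row[row].index), axis=1)
    frames = {}
    for pattern, part in df.groupby(key, sort=False):
        # Drop the columns that are entirely NaN within this group.
        frames[pattern] = part.dropna(axis=1, how="all")
    return frames

# Hypothetical sample: two complete rows, two without x5, one without x1.
sample = pd.DataFrame({
    "date": [20200705, 20200706, 20200705, 20200706, 20200721],
    "x1": [8119.0, 8031.0, 9131.0, 9014.0, np.nan],
    "x5": [100806.0, 443809.0, np.nan, np.nan, 181.0],
    "dname": ["D1", "D1", "D3", "D3", "D9"],
})

frames = split_by_missing(sample)
for pattern, part in frames.items():
    print(repr(pattern), list(part.columns), len(part))
```

A dict keyed by pattern makes it easy to pull out one specific frame, for example `frames["x5"]` for the rows that lack only x5.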

Note: you can replace df.isna() with any other expression that returns a dataframe of True/False values, such as the comparison df == " ", which gives (df == " ").apply(lambda x: ','.join(set(x[x].to_dict().keys())), axis = 1)
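
Since the question says the blanks are single-space strings, another option is to turn them into real NaN values first, so that isna() and dropna() work unchanged. The sketch below assumes a small inline CSV in place of the asker's file, and the column list `value_cols` is a made-up name for illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV text standing in for the real file; blanks are single spaces.
raw = io.StringIO(
    "date,x1,x5,dname\n"
    "20200705,8119,100806,D1\n"
    "20200705,9131, ,D3\n"
    "20200721, ,181,D9\n"
)
v = pd.read_csv(raw)

# Turn cells that are empty or whitespace-only into NaN.
cleaned = v.replace(r"^\s*$", np.nan, regex=True)

# Columns that held blanks were read as strings; convert them back to numbers.
value_cols = ["x1", "x5"]
cleaned[value_cols] = cleaned[value_cols].apply(pd.to_numeric)

print(cleaned)
print(cleaned.isna().sum())
```

Passing na_values=[" "] to pd.read_csv may achieve the same result at load time, which would make the replace step unnecessary.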
