
I have a semicolon-delimited CSV with several tables stacked in a single file:

demand;;;
1;4;3;2
2;3;4;2
3;3;3;4
4;4;3;2
workhours;;;
1;160;;
2;80;;
3;40;;
4;80;;

How can I load it into Python so that the first dataframe will be:

 index    column1 column2 column3
 1          4       3        2
 2          3       4        2
 3          3       3        4
 4          4       3        2

And the second will be

 index    column1 column2 column3
 1          160     0        0    
 2          80      0        0   
 3          40      0        0  
 4          80      0        0   
  • Can you share that .csv for download? Commented May 8, 2021 at 13:18
  • @PiotrŻak mine is much bigger, but to create this simplified version, it is enough to copy the first part, paste it into Excel and save as .csv Commented May 8, 2021 at 13:22

3 Answers


One approach is to read in the whole CSV and create groups based on the rows where the table breaks occur (the rows with non-numeric index labels). Then create a dictionary based on those groups:

from io import StringIO
from pprint import pprint

import pandas as pd

df = pd.read_csv(StringIO('''demand;;;
1;4;3;2
2;3;4;2
3;3;3;4
4;4;3;2
workhours;;;
1;160;;
2;80;;
3;40;;
4;80;;
'''), sep=';', index_col=0, names=['column1', 'column2', 'column3'])

# Start a new group at each non-numeric index label (the table break rows)
groups = tuple(df.groupby((~df.index.str.isnumeric()).cumsum()))
dfs = {}
for _, sub_df in groups:
    # Use the name from the first index entry as the key; the rest is the frame
    dfs[sub_df.index[0]] = sub_df.iloc[1:]

pprint(dfs)

dfs:

{'demand':    column1  column2  column3
1      4.0      3.0      2.0
2      3.0      4.0      2.0
3      3.0      3.0      4.0
4      4.0      3.0      2.0,
 'workhours':    column1  column2  column3
1    160.0      NaN      NaN
2     80.0      NaN      NaN
3     40.0      NaN      NaN
4     80.0      NaN      NaN}
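The question's expected output has zeros rather than NaN; those can be filled per chunk. A sketch reusing the same grouping, with `fillna(0)` applied to each sub-frame:

```python
from io import StringIO

import pandas as pd

df = pd.read_csv(StringIO('''demand;;;
1;4;3;2
2;3;4;2
3;3;3;4
4;4;3;2
workhours;;;
1;160;;
2;80;;
3;40;;
4;80;;
'''), sep=';', index_col=0, names=['column1', 'column2', 'column3'])

dfs = {}
for _, sub_df in df.groupby((~df.index.str.isnumeric()).cumsum()):
    # First index entry is the chunk name; fill the remaining NaNs with 0
    dfs[sub_df.index[0]] = sub_df.iloc[1:].fillna(0)

print(dfs['workhours'])
```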

Some groupby options:

  1. Assuming that there are no all-NaN rows except the rows where the table breaks occur:
groups = tuple(df.groupby(df.isna().all(axis=1).cumsum()))
  2. Assuming that all indexes except the break rows are numeric:
groups = tuple(df.groupby((~df.index.str.isnumeric()).cumsum()))
  3. Assuming there are both occasional non-numeric indexes and all-NaN rows, but that only rows where both occur are breaks:
groups = tuple(df.groupby((~df.index.str.isnumeric() & df.isna().all(axis=1)).cumsum()))
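On the sample data all three keys produce the same group boundaries, since the break rows are exactly the rows with a non-numeric index and all-NaN values. A quick self-contained sanity check:

```python
from io import StringIO

import pandas as pd

df = pd.read_csv(StringIO('''demand;;;
1;4;3;2
2;3;4;2
3;3;3;4
4;4;3;2
workhours;;;
1;160;;
2;80;;
3;40;;
4;80;;
'''), sep=';', index_col=0, names=['column1', 'column2', 'column3'])

key_nan = df.isna().all(axis=1).cumsum()                                 # option 1
key_idx = (~df.index.str.isnumeric()).cumsum()                           # option 2
key_both = (~df.index.str.isnumeric() & df.isna().all(axis=1)).cumsum()  # option 3

# All three keys assign identical group numbers to every row of this data
print(list(key_nan) == list(key_idx) == list(key_both))
```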

3 Comments

Nice solution, but issues will arise if a row happens to have NaN in all 3 columns without being a table break.
A check that the index is non-numeric as well could be added to be double sure.
This is my preferred approach. Using not isnumeric() is a good way to get the labels out, without worrying about valid NA values for certain rows. Btw, in your code, you're not using the i from enumerate, so you might as well just do for _, sub_df in groups:

Given the shape of the data you gave (an index column containing the name of each chunk -demand, workhours- followed by the row indexes), you can try:

import pandas as pd

df = pd.read_csv("yourcsv.csv",
                 sep=";",
                 names=["index", "column1", "column2", "column3"]).fillna(0)

And then create a new column label having the name of the data for the whole corresponding chunk:

df["label"] = df["index"].apply(lambda x: x if not x.isnumeric() else pd.NA).ffill()
df = df[df["index"] != df["label"]]

This gives you a single dataframe with an extra "label" column holding "demand" or "workhours" for each row.

You can then separate the data based on the "label" column:

df_demand = df.loc[df.label=="demand"].set_index('index').drop(['label'], axis=1)
df_workhours = df.loc[df.label=="workhours"].set_index('index').drop(['label'], axis=1)
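With many chunks, the explicit per-label lines don't scale; the same split can be generalized by grouping on the label column. A self-contained sketch (using the sample data in place of "yourcsv.csv"):

```python
from io import StringIO

import pandas as pd

df = pd.read_csv(StringIO('''demand;;;
1;4;3;2
2;3;4;2
3;3;3;4
4;4;3;2
workhours;;;
1;160;;
2;80;;
3;40;;
4;80;;
'''), sep=';', names=['index', 'column1', 'column2', 'column3']).fillna(0)

# Non-numeric "index" entries are the chunk names; forward-fill them as labels
df['label'] = df['index'].where(~df['index'].str.isnumeric()).ffill()
df = df[df['index'] != df['label']]

# One dataframe per label, however many chunks the file contains
frames = {
    label: chunk.set_index('index').drop(columns='label')
    for label, chunk in df.groupby('label', sort=False)
}
print(frames['demand'])
```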

Inspecting df_demand and df_workhours then gives the two frames from the question.

3 Comments

Thanks, but the essence was to divide it somehow into multiple dataframes. I can do it with df[:28] for example, but I have lots of those divisions, where each division occurs when the index is a string.
Sorry, read the question too quickly. Edited my answer accordingly.
To create the multiple dataframes, instead of doing an explicit check for certain labels, you could use itertools.groupby or df['label'].unique() to get each group of rows associated with that part.

You can use

pd.read_csv('filename.csv', index_col = 0, sep = ';')

where the 0 in index_col=0 is the position of the column you want to use as the index.
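A sketch of what this produces on the sample data (with header=None added so that the first line, demand;;;, is not consumed as a header row): everything lands in one dataframe, with the section names mixed into the index, so it still needs to be split afterwards.

```python
from io import StringIO

import pandas as pd

df = pd.read_csv(StringIO('''demand;;;
1;4;3;2
2;3;4;2
3;3;3;4
4;4;3;2
workhours;;;
1;160;;
2;80;;
3;40;;
4;80;;
'''), index_col=0, sep=';', header=None)

# The chunk names become ordinary index labels alongside the row numbers
print(df.index.tolist())
```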

