
I have a semicolon-delimited CSV with several tables stacked in a single file:

demand;;;
1;4;3;2
2;3;4;2
3;3;3;4
4;4;3;2
workhours;;;
1;160;;
2;80;;
3;40;;
4;80;;

How can I load it into Python so that the first dataframe will be:

 index    column1 column2 column3
 1          4       3        2
 2          3       4        2
 3          3       3        4
 4          4       3        2

And the second will be

 index    column1 column2 column3
 1          160     0        0    
 2          80      0        0   
 3          40      0        0  
 4          80      0        0   
  • Can you share that .csv for download? Commented May 8, 2021 at 13:18
  • @PiotrŻak mine is much bigger, but to create this simplified version, it is enough to copy the first part, paste it into Excel and save as .csv Commented May 8, 2021 at 13:22

3 Answers


One approach is to read in the whole CSV and create groups based on the rows where the table breaks occur (the rows with non-numeric index labels). Then create a dictionary based on those groups:

from io import StringIO
from pprint import pprint

import pandas as pd

df = pd.read_csv(StringIO('''demand;;;
1;4;3;2
2;3;4;2
3;3;3;4
4;4;3;2
workhours;;;
1;160;;
2;80;;
3;40;;
4;80;;
'''), sep=';', index_col=0, names=['column1', 'column2', 'column3'])

# Start a new group at each non-numeric index label (the table break rows)
groups = tuple(df.groupby((~df.index.str.isnumeric()).cumsum()))
dfs = {}
for _, sub_df in groups:
    # Use the name from the first index entry as the key; the rest is the frame
    dfs[sub_df.index[0]] = sub_df.iloc[1:]

pprint(dfs)

dfs:

{'demand':    column1  column2  column3
1      4.0      3.0      2.0
2      3.0      4.0      2.0
3      3.0      3.0      4.0
4      4.0      3.0      2.0,
 'workhours':    column1  column2  column3
1    160.0      NaN      NaN
2     80.0      NaN      NaN
3     40.0      NaN      NaN
4     80.0      NaN      NaN}
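The question's expected output has zeros rather than NaN; those can be filled per chunk. A sketch reusing the same grouping, with `fillna(0)` applied to each sub-frame:

```python
from io import StringIO

import pandas as pd

df = pd.read_csv(StringIO('''demand;;;
1;4;3;2
2;3;4;2
3;3;3;4
4;4;3;2
workhours;;;
1;160;;
2;80;;
3;40;;
4;80;;
'''), sep=';', index_col=0, names=['column1', 'column2', 'column3'])

dfs = {}
for _, sub_df in df.groupby((~df.index.str.isnumeric()).cumsum()):
    # First index entry is the chunk name; fill the remaining NaNs with 0
    dfs[sub_df.index[0]] = sub_df.iloc[1:].fillna(0)

print(dfs['workhours'])
```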

Some groupby options:

  1. Assuming that there are no all-NaN rows except the rows where the table breaks occur:
groups = tuple(df.groupby(df.isna().all(axis=1).cumsum()))
  2. Assuming that all indexes except the break rows are numeric:
groups = tuple(df.groupby((~df.index.str.isnumeric()).cumsum()))
  3. Assuming there are both occasional non-numeric indexes and all-NaN rows, but that only rows where both occur are breaks:
groups = tuple(df.groupby((~df.index.str.isnumeric() & df.isna().all(axis=1)).cumsum()))
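On the sample data all three keys produce the same group boundaries, since the break rows are exactly the rows with a non-numeric index and all-NaN values. A quick self-contained sanity check:

```python
from io import StringIO

import pandas as pd

df = pd.read_csv(StringIO('''demand;;;
1;4;3;2
2;3;4;2
3;3;3;4
4;4;3;2
workhours;;;
1;160;;
2;80;;
3;40;;
4;80;;
'''), sep=';', index_col=0, names=['column1', 'column2', 'column3'])

key_nan = df.isna().all(axis=1).cumsum()                                 # option 1
key_idx = (~df.index.str.isnumeric()).cumsum()                           # option 2
key_both = (~df.index.str.isnumeric() & df.isna().all(axis=1)).cumsum()  # option 3

# All three keys assign identical group numbers to every row of this data
print(list(key_nan) == list(key_idx) == list(key_both))
```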

3 Comments

Nice solution, but issues will arise if a row happens to have NaN in all 3 columns without being a table break.
A check that the index is non-numeric as well could be added to be double sure.
This is my preferred approach. Using not isnumeric() is a good way to get the labels out, without worrying about valid NA values for certain rows. Btw, in your code, you're not using the i from enumerate, so you might as well just do for _, sub_df in groups:

Given the shape of the data you gave (an index column containing the name of each chunk -demand, workhours- followed by the row indexes), you can try:

import pandas as pd

df = pd.read_csv("yourcsv.csv",
                 sep=";",
                 names=["index", "column1", "column2", "column3"]).fillna(0)

And then create a new column label having the name of the data for the whole corresponding chunk:

df["label"] = df["index"].apply(lambda x: x if not x.isnumeric() else pd.NA).ffill()
df = df[df["index"] != df["label"]]

This gives you a single dataframe with an extra "label" column holding "demand" or "workhours" for each row.

You can then separate the data based on the "label" column:

df_demand = df.loc[df.label=="demand"].set_index('index').drop(['label'], axis=1)
df_workhours = df.loc[df.label=="workhours"].set_index('index').drop(['label'], axis=1)
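With many chunks, the explicit per-label lines don't scale; the same split can be generalized by grouping on the label column. A self-contained sketch (using the sample data in place of "yourcsv.csv"):

```python
from io import StringIO

import pandas as pd

df = pd.read_csv(StringIO('''demand;;;
1;4;3;2
2;3;4;2
3;3;3;4
4;4;3;2
workhours;;;
1;160;;
2;80;;
3;40;;
4;80;;
'''), sep=';', names=['index', 'column1', 'column2', 'column3']).fillna(0)

# Non-numeric "index" entries are the chunk names; forward-fill them as labels
df['label'] = df['index'].where(~df['index'].str.isnumeric()).ffill()
df = df[df['index'] != df['label']]

# One dataframe per label, however many chunks the file contains
frames = {
    label: chunk.set_index('index').drop(columns='label')
    for label, chunk in df.groupby('label', sort=False)
}
print(frames['demand'])
```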

Inspecting df_demand and df_workhours then gives the two frames from the question.

3 Comments

Thanks, but the essence was to divide it somehow into multiple dataframes. I can do it with df[:28] for example, but I have lots of those divisions, where each division occurs when the index is a string.
Sorry, read the question too quickly. Edited my answer accordingly.
To create the multiple dataframes, instead of doing an explicit check for certain labels, you could use itertools.groupby or df['label'].unique() to get each group of rows associated with that part.

You can use

pd.read_csv('filename.csv', index_col = 0, sep = ';')

where the 0 in index_col=0 is the position of the column you want to use as the index.
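A sketch of what this produces on the sample data (with header=None added so that the first line, demand;;;, is not consumed as a header row): everything lands in one dataframe, with the section names mixed into the index, so it still needs to be split afterwards.

```python
from io import StringIO

import pandas as pd

df = pd.read_csv(StringIO('''demand;;;
1;4;3;2
2;3;4;2
3;3;3;4
4;4;3;2
workhours;;;
1;160;;
2;80;;
3;40;;
4;80;;
'''), index_col=0, sep=';', header=None)

# The chunk names become ordinary index labels alongside the row numbers
print(df.index.tolist())
```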

