
I am trying to split and merge a Pandas dataframe.

The columns of the original data frame are arranged like so:

dataTime Record1Field1 ... Record1FieldN Record2Field1 ... Record2FieldN
time1    <<     record 1 data         >> <<       record 2 data       >>

I would like to split the Record2 fields into a separate data frame tempdf, indexed by dataTime. tempdf will therefore look something like this:

dataTime Record2Field1 ... Record2FieldN
time1    << record 2 data             >>

Once tempdf is populated, I want to delete the Record2 columns from the original data frame. The first difficulty I'm having is in creating this tempdf containing the record 2 data.

Then, I would like to rename the columns in tempdf so that they align with the Record1 columns in the original data frame. (This portion I know how to do)

Finally I would like to merge tempdf back into the original data frame.

The end result should look something like this:

dataTime Record1Field1 ... Record1FieldN
time1    <<record 1 data>>
time1    <<record 2 data>>

So far I haven't determined a good method of doing this. Any help is appreciated! Thanks.

  • Am I right that you only have to do a merge? Commented Aug 30, 2016 at 17:04
  • use concat or append Commented Aug 30, 2016 at 17:05
  • @ragesz I'm sorry, I miscommunicated. No, part of the problem I'm having is in creating the tempdf data frame which contains all of the record 2 data. Commented Aug 30, 2016 at 17:08
  • Do the column names, Record2Field.. form a continuous sequence as in range from 1→N? Commented Aug 30, 2016 at 17:23
  • Unfortunately, no. The fields are named in accordance with the data they contain, but they are arranged in the order presented above Commented Aug 30, 2016 at 17:34

7 Answers

1

Another way to clean and merge two datasets (df1 and df2 are assumed to be two raw frames loaded elsewhere):

import pandas as pd

# Skip the first 8 rows of each raw frame.
df3 = df1[8:]
df4 = df2[8:]

tmp_col1 = [1, 2, 3, 4, 5, 6, 7, 8]
tmp_col2 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
tmp_col3 = [1, 2, 3, 4, 5, 6, 7]

col_name1 = df1.columns[0]
col_name2 = df2.columns[0]

# Keep only rows whose first column is not NaN.
df5 = df3[df3[col_name1].notna()]
df6 = df4[df4[col_name2].notna()]

# Grab one cell from df1 to reuse as an extra column value below.
data = df1.iloc[[2], [6]].values[0]
print(data)

df5.columns = tmp_col1
df6.columns = tmp_col2

# Select and reorder the columns of interest, then reset to a clean integer index.
df5 = df5[[1, 2, 3, 4, 6, 7]]
df5 = df5.reset_index(drop=True)
df5[8] = pd.Series([data])

df6 = df6[[1, 2, 3, 4, 6, 9, 8]]
df6 = df6.reset_index(drop=True)

print(df5)
print(df6)

# Give both frames the same column labels so they concatenate row-wise.
df5.columns = tmp_col3
df6.columns = tmp_col3
dfs = [df5, df6]

df7 = pd.concat(dfs)
df7.columns = [""] * len(df7.columns)  # blank out the column names (df7 has seven columns here)
print(df7)

0

You could get all your Record2 values under the Record1 columns as follows:

Data Setup:

from io import StringIO
import pandas as pd

data = StringIO(
'''
dataTime Record1Field1 Record1Field2 Record1Field3 Record2Field1 Record2Field2 Record2Field3
01-01-2015 1 2 3 4 5 6 
''')

df = pd.read_csv(data, delim_whitespace=True, parse_dates=['dataTime'])
print (df)

    dataTime  Record1Field1  Record1Field2  Record1Field3  Record2Field1  \
0 2015-01-01              1              2              3              4   

   Record2Field2  Record2Field3  
0              5              6 

Operations:

df.set_index('dataTime', inplace=True)

# Filter column names corresponding to Record2
tempdf = df[[col for col in list(df) if col.startswith('Record2')]]

# Drop those columns after assigning to tempdf
df.drop(tempdf.columns, inplace=True, axis=1)

# Rename the column names for appending
tempdf.columns = [col for col in list(df) if col.startswith('Record1')]

# Concatenate row-wise
print (df.append(tempdf))

            Record1Field1  Record1Field2  Record1Field3
dataTime                                               
2015-01-01              1              2              3
2015-01-01              4              5              6
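Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on newer pandas the same row-wise stacking is written with concat instead:

print (pd.concat([df, tempdf]))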

1 Comment

This did it! I ended up using a regex filter as in the answer provided by @unutbu. Thanks for your help!
0

Try using concat. So try something like:

import pandas

# DataFrame1 and DataFrame2 stand for your two frames (with matching columns)
Combined = [DataFrame1, DataFrame2]
Together = pandas.concat(Combined)

As one of the others commented, merge may be a good option as well.
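For example, a minimal sketch with made-up frames (the names and values are only illustrative), showing how concat produces the stacked layout described in the question:

import pandas as pd

# Two tiny frames standing in for the Record1 data and the renamed Record2 data
DataFrame1 = pd.DataFrame({'dataTime': ['time1'], 'Record1Field1': [1], 'Record1Field2': [2]})
DataFrame2 = pd.DataFrame({'dataTime': ['time1'], 'Record1Field1': [4], 'Record1Field2': [5]})

# Stack the rows of the two frames under the same column names
Together = pd.concat([DataFrame1, DataFrame2], ignore_index=True)
print(Together)
#   dataTime  Record1Field1  Record1Field2
# 0    time1              1              2
# 1    time1              4              5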

3 Comments

Hi Matt. Thanks for your response. I realized that I wasn't asking the question that I needed answering most of all, which was how to create the tempdf. I've edited my post to more clearly explain the issues I'm having.
Is this going to be a static solution? Or will you be implementing this code across different data frames? I ask because you can "hard code" the deletion of the unnecessary columns for a one-time solution. Now for your question: the tempdf is coming from part of the original dataframe, is that correct?
This will indeed be a static solution. Also, yes, tempdf will be populated entirely from the original dataframe.
0

If you know the columns to be selected, then use:

 tempdf = df[['a','b']]

Otherwise, to select the last 2 columns, use:

 tempdf = df[df.columns[-2:]]
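For instance, on a small made-up frame (the column names here are only illustrative), the positional selection pulls out the trailing Record2 block, which can then be dropped from the original:

import pandas as pd

# Toy frame with the Record2 fields as the last two columns
df = pd.DataFrame({'dataTime': ['time1'],
                   'Record1Field1': [1], 'Record1Field2': [2],
                   'Record2Field1': [3], 'Record2Field2': [4]})

tempdf = df[df.columns[-2:]]            # the last two columns, i.e. the Record2 fields here
df = df.drop(columns=tempdf.columns)    # remove them from the original frame
print(tempdf)
print(df)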


0

To answer your immediate question, you could use df.filter with a regex pattern to select the columns of the form Record2FieldN:

In [29]: tempdf = df.filter(regex=r'Record2.*'); tempdf
Out[29]: 
   Record2Field0  Record2Field1  Record2Field2
0              3              8              4
1              2              6              3
2              1              2              2
3              5              9              4

and you could rename the columns using tempdf.rename:

tempdf = tempdf.rename(columns={'Record2Field{}'.format(i):'Record1Field{}'.format(i) for i in range(3)})

and drop the Record2 fields from df with:

df = df.drop(['Record2Field{}'.format(i) for i in range(3)], axis=1)
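To finish this first approach, the renamed tempdf can then be stacked back under df row-wise, for example (a sketch; it assumes the dataTime information is kept on both frames, e.g. as their index, so each time's two records end up together after sorting):

result = pd.concat([df, tempdf]).sort_index()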

But there is a better approach to your overall problem: Replace the flat column names RecordMFieldN with a 2-level MultiIndex which splits the Record from the Field. This will give you enough control to stack the data in the desired form:

import numpy as np
import pandas as pd
np.random.seed(2016)

ncols, nrows = 3, 4
def make_dataframe(ncols, nrows):
    columns = ['Record{}Field{}'.format(i, j) for i in range(1,3) 
               for j in range(ncols)]
    df = pd.DataFrame(np.random.randint(10, size=(nrows, 2*ncols)), columns=columns)
    df['dataTime'] = pd.date_range('2000-1-1', periods=nrows)
    return df

df = make_dataframe(ncols, nrows)

# stash the `dataTime` in the row index so we can reassign 
# the column index to `new_index`
result = df.set_index('dataTime')
new_index = pd.MultiIndex.from_product([[1,2], df.columns[:ncols]], 
                                       names=['record', 'field'])
result.columns = new_index

# Now the problem can be solved by stacking.
result = result.stack('record')
result.index = result.index.droplevel('record')

yields

field       Record1Field0  Record1Field1  Record1Field2
dataTime                                               
2000-01-01              3              7              2
2000-01-01              3              8              4
2000-01-02              8              7              9
2000-01-02              2              6              3
2000-01-03              4              1              9
2000-01-03              1              2              2
2000-01-04              8              9              8
2000-01-04              5              9              4


0

Try this code; it splits df on an empty row, adds an identifier to each piece, and then merges the pieces back together:

import numpy as np
import pandas as pd

# df is assumed to be a raw frame containing blocks separated by an all-NaN row,
# with a default RangeIndex so index labels equal row positions.
df_list = np.split(df, df[df.isnull().all(axis=1)].index)
df0 = df_list[0]
data = df0.iloc[[0], [0]].values[0]   # a cell from the first block, reused as an ID below
df1 = df_list[1].copy()
df2 = df_list[2].copy()
df1['status'] = ''
df2['status'] = ''

# Trim the leading/trailing rows of each block, then stack the two blocks.
df3 = df2[3:-1]
df4 = df1[3:-1]
dfs = [df4, df3]
df5 = pd.concat(dfs)

# Use row 8 of the original frame as the column names, plus the added 'status' column.
col = list(df.iloc[8])
col.append('status')
df5.columns = col

df5 = df5.reset_index(drop=True)
df5['ID'] = pd.Series([data])
print(df5)


0

If you want to split based on the value of a column:

import numpy as np
import pandas as pd

# Split df at every row whose first column equals 'CT'
# (assumes a default RangeIndex so index labels equal row positions).
col_name = df.columns[0]
ict = df[df[col_name] == 'CT'].index
print(ict)
df_list = np.split(df, ict)
df1 = df_list[0].copy()
df2 = df_list[1].copy()
df1['status'] = ''
df2['status'] = ''

# Trim the leading/trailing rows of each block, then stack the two blocks.
df1 = df1[9:]
df2 = df2[4:-4]

dfs = [df1, df2]
df3 = pd.concat(dfs)

# Use row 8 of the original frame as the column names, plus the added 'status' column.
col = list(df.iloc[8])
col.append('status')
df3.columns = col

df3 = df3.reset_index(drop=True)
data = df.iloc[[0], [0]].values[0]   # a cell from the original frame, reused as an ID
df3['ID'] = pd.Series([data])
print(df3)

