
I am trying to split and merge a Pandas dataframe.

The columns of the original data frame are arranged like so:

dataTime Record1Field1 ... Record1FieldN Record2Field1 ... Record2FieldN
time1    <<     record 1 data         >> <<       record 2 data       >>

I would like to split the Record2 fields into a separate data frame tempdf, indexed by dataTime. tempdf will therefore look something like this:

dataTime Record2Field1 ... Record2FieldN
time1    << record 2 data             >>

Once tempdf is populated, I want to delete the Record2 columns from the original data frame. The first difficulty I'm having is in creating this tempdf containing the record 2 data.

Then, I would like to rename the columns in tempdf so that they align with the Record1 columns in the original data frame. (This portion I know how to do)

Finally I would like to merge tempdf back into the original data frame.

The end result should look something like this:

dataTime Record1Field1 ... Record1FieldN
time1    <<record 1 data>>
time1    <<record 2 data>>

So far I haven't determined a good method of doing this. Any help is appreciated! Thanks.

  • Am I right that you only have to do a merge? Commented Aug 30, 2016 at 17:04
  • use concat or append Commented Aug 30, 2016 at 17:05
  • @ragesz I'm sorry, I miscommunicated. No, part of the problem I'm having is in creating the tempdf data frame which contains all of the record 2 data. Commented Aug 30, 2016 at 17:08
  • Do the column names, Record2Field.. form a continuous sequence as in range from 1→N? Commented Aug 30, 2016 at 17:23
  • Unfortunately, no. The fields are named in accordance with the data they contain, but they are arranged in the order presented above Commented Aug 30, 2016 at 17:34

7 Answers

1

Another way to clean and merge two datasets (df1 and df2 are assumed to be two raw frames loaded elsewhere):

import pandas as pd

# Skip the first 8 rows of each raw frame.
df3 = df1[8:]
df4 = df2[8:]

tmp_col1 = [1, 2, 3, 4, 5, 6, 7, 8]
tmp_col2 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
tmp_col3 = [1, 2, 3, 4, 5, 6, 7]

col_name1 = df1.columns[0]
col_name2 = df2.columns[0]

# Keep only rows whose first column is not NaN.
df5 = df3[df3[col_name1].notna()]
df6 = df4[df4[col_name2].notna()]

# Grab one cell from df1 to reuse as an extra column value below.
data = df1.iloc[[2], [6]].values[0]
print(data)

df5.columns = tmp_col1
df6.columns = tmp_col2

# Select and reorder the columns of interest, then reset to a clean integer index.
df5 = df5[[1, 2, 3, 4, 6, 7]]
df5 = df5.reset_index(drop=True)
df5[8] = pd.Series([data])

df6 = df6[[1, 2, 3, 4, 6, 9, 8]]
df6 = df6.reset_index(drop=True)

print(df5)
print(df6)

# Give both frames the same column labels so they concatenate row-wise.
df5.columns = tmp_col3
df6.columns = tmp_col3
dfs = [df5, df6]

df7 = pd.concat(dfs)
df7.columns = [""] * len(df7.columns)  # blank out the column names (df7 has seven columns here)
print(df7)

0

You could get all your Record2 values under the Record1 columns as follows:

Data Setup:

from io import StringIO
import pandas as pd

data = StringIO(
'''
dataTime Record1Field1 Record1Field2 Record1Field3 Record2Field1 Record2Field2 Record2Field3
01-01-2015 1 2 3 4 5 6 
''')

df = pd.read_csv(data, delim_whitespace=True, parse_dates=['dataTime'])
print (df)

    dataTime  Record1Field1  Record1Field2  Record1Field3  Record2Field1  \
0 2015-01-01              1              2              3              4   

   Record2Field2  Record2Field3  
0              5              6 

Operations:

df.set_index('dataTime', inplace=True)

# Filter column names corresponding to Record2
tempdf = df[[col for col in list(df) if col.startswith('Record2')]]

# Drop those columns after assigning to tempdf
df.drop(tempdf.columns, inplace=True, axis=1)

# Rename the column names for appending
tempdf.columns = [col for col in list(df) if col.startswith('Record1')]

# Concatenate row-wise
print (df.append(tempdf))

            Record1Field1  Record1Field2  Record1Field3
dataTime                                               
2015-01-01              1              2              3
2015-01-01              4              5              6
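Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on newer pandas the same row-wise stacking is written with concat instead:

print (pd.concat([df, tempdf]))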

1 Comment

This did it! I ended up using a regex filter as in the answer provided by @unutbu. Thanks for your help!
0

Try using concat. So try something like:

import pandas

# DataFrame1 and DataFrame2 stand for your two frames (with matching columns)
Combined = [DataFrame1, DataFrame2]
Together = pandas.concat(Combined)

As one of the others commented, merge may be a good option as well.
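For example, a minimal sketch with made-up frames (the names and values are only illustrative), showing how concat produces the stacked layout described in the question:

import pandas as pd

# Two tiny frames standing in for the Record1 data and the renamed Record2 data
DataFrame1 = pd.DataFrame({'dataTime': ['time1'], 'Record1Field1': [1], 'Record1Field2': [2]})
DataFrame2 = pd.DataFrame({'dataTime': ['time1'], 'Record1Field1': [4], 'Record1Field2': [5]})

# Stack the rows of the two frames under the same column names
Together = pd.concat([DataFrame1, DataFrame2], ignore_index=True)
print(Together)
#   dataTime  Record1Field1  Record1Field2
# 0    time1              1              2
# 1    time1              4              5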

3 Comments

Hi Matt. Thanks for your response. I realized that I wasn't asking the question that I needed answering most of all, which was how to create the tempdf. I've edited my post to more clearly explain the issues I'm having.
Is this going to be a static solution? Or will you be implementing this code across different data frames? I ask because you can "hard code" the deletion of the unnecessary columns for a one-time solution. Now for your question: the tempdf is coming from part of the original dataframe, is that correct?
This will indeed be a static solution. Also, yes, tempdf will be populated entirely from the original dataframe.
0

If you know the columns to be selected, then use:

 tempdf = df[['a','b']]

Otherwise, to select the last 2 columns, use:

 tempdf = df[df.columns[-2:]]
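For instance, on a small made-up frame (the column names here are only illustrative), the positional selection pulls out the trailing Record2 block, which can then be dropped from the original:

import pandas as pd

# Toy frame with the Record2 fields as the last two columns
df = pd.DataFrame({'dataTime': ['time1'],
                   'Record1Field1': [1], 'Record1Field2': [2],
                   'Record2Field1': [3], 'Record2Field2': [4]})

tempdf = df[df.columns[-2:]]            # the last two columns, i.e. the Record2 fields here
df = df.drop(columns=tempdf.columns)    # remove them from the original frame
print(tempdf)
print(df)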


0

To answer your immediate question, you could use df.filter with a regex pattern to select the columns of the form Record2FieldN:

In [29]: tempdf = df.filter(regex=r'Record2.*'); tempdf
Out[29]: 
   Record2Field0  Record2Field1  Record2Field2
0              3              8              4
1              2              6              3
2              1              2              2
3              5              9              4

and you could rename the columns using tempdf.rename:

tempdf = tempdf.rename(columns={'Record2Field{}'.format(i):'Record1Field{}'.format(i) for i in range(3)})

and drop the Record2 fields from df with:

df = df.drop(['Record2Field{}'.format(i) for i in range(3)], axis=1)
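To finish this first approach, the renamed tempdf can then be stacked back under df row-wise, for example (a sketch; it assumes the dataTime information is kept on both frames, e.g. as their index, so each time's two records end up together after sorting):

result = pd.concat([df, tempdf]).sort_index()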

But there is a better approach to your overall problem: Replace the flat column names RecordMFieldN with a 2-level MultiIndex which splits the Record from the Field. This will give you enough control to stack the data in the desired form:

import numpy as np
import pandas as pd
np.random.seed(2016)

ncols, nrows = 3, 4
def make_dataframe(ncols, nrows):
    columns = ['Record{}Field{}'.format(i, j) for i in range(1,3) 
               for j in range(ncols)]
    df = pd.DataFrame(np.random.randint(10, size=(nrows, 2*ncols)), columns=columns)
    df['dataTime'] = pd.date_range('2000-1-1', periods=nrows)
    return df

df = make_dataframe(ncols, nrows)

# stash the `dataTime` in the row index so we can reassign 
# the column index to `new_index`
result = df.set_index('dataTime')
new_index = pd.MultiIndex.from_product([[1,2], df.columns[:ncols]], 
                                       names=['record', 'field'])
result.columns = new_index

# Now the problem can be solved by stacking.
result = result.stack('record')
result.index = result.index.droplevel('record')

yields

field       Record1Field0  Record1Field1  Record1Field2
dataTime                                               
2000-01-01              3              7              2
2000-01-01              3              8              4
2000-01-02              8              7              9
2000-01-02              2              6              3
2000-01-03              4              1              9
2000-01-03              1              2              2
2000-01-04              8              9              8
2000-01-04              5              9              4


0

Try this code; it splits df on an empty row, adds an identifier to each piece, and then merges the pieces back together:

import numpy as np
import pandas as pd

# df is assumed to be a raw frame containing blocks separated by an all-NaN row,
# with a default RangeIndex so index labels equal row positions.
df_list = np.split(df, df[df.isnull().all(axis=1)].index)
df0 = df_list[0]
data = df0.iloc[[0], [0]].values[0]   # a cell from the first block, reused as an ID below
df1 = df_list[1].copy()
df2 = df_list[2].copy()
df1['status'] = ''
df2['status'] = ''

# Trim the leading/trailing rows of each block, then stack the two blocks.
df3 = df2[3:-1]
df4 = df1[3:-1]
dfs = [df4, df3]
df5 = pd.concat(dfs)

# Use row 8 of the original frame as the column names, plus the added 'status' column.
col = list(df.iloc[8])
col.append('status')
df5.columns = col

df5 = df5.reset_index(drop=True)
df5['ID'] = pd.Series([data])
print(df5)


0

If you want to split based on the value of a column:

import numpy as np
import pandas as pd

# Split df at every row whose first column equals 'CT'
# (assumes a default RangeIndex so index labels equal row positions).
col_name = df.columns[0]
ict = df[df[col_name] == 'CT'].index
print(ict)
df_list = np.split(df, ict)
df1 = df_list[0].copy()
df2 = df_list[1].copy()
df1['status'] = ''
df2['status'] = ''

# Trim the leading/trailing rows of each block, then stack the two blocks.
df1 = df1[9:]
df2 = df2[4:-4]

dfs = [df1, df2]
df3 = pd.concat(dfs)

# Use row 8 of the original frame as the column names, plus the added 'status' column.
col = list(df.iloc[8])
col.append('status')
df3.columns = col

df3 = df3.reset_index(drop=True)
data = df.iloc[[0], [0]].values[0]   # a cell from the original frame, reused as an ID
df3['ID'] = pd.Series([data])
print(df3)

