Regex on pandas dataframe to change column names, then re-arrange format of dataframe

Question

I have a dataframe in the following format.

I would like to modify the column names and rearrange the dataframe into the following format:

I tried the code below to convert the column names to a list and then strip and split each string, but the results still contain white space. I am not sure why.

df_col_list=df.columns.tolist()
list =[]
for elem in df_col_list:
    list.extend(elem.strip().split(':'))
list
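
One possible cause of the leftover white space (an assumption, since the real column names are not shown here): `strip()` runs on the whole name before `split(':')`, so only the outer ends are cleaned and any spaces next to the colon survive. A minimal sketch with hypothetical column names, stripping each piece after the split instead:

```python
# Hypothetical column names with stray spaces around the separator;
# the actual names from the question are not shown.
columns = [' MNIF0001 : w ', 'MNIF0010 :f']

parts = []
for name in columns:
    # Split first, then strip every piece individually.
    parts.extend(piece.strip() for piece in name.split(':'))

print(parts)  # ['MNIF0001', 'w', 'MNIF0010', 'f']
```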

I then moved to regex to rewrite the well names so that they can fill the ID column in the final dataframe format I want.

import re

well_pattern = re.compile(r'[A-Z]{4}\d{4}')
for item_list in list:
    wellname = re.findall(well_pattern, item_list)
    for n in wellname:
        # first four characters are the field code, the next four the well number
        fld, well_no = n[:4], int(n[4:8])
        item_list = item_list.replace(n, '%s_%d_0' % (fld, well_no))
    print(item_list)

That worked to change 'MNIF0001' to 'MNIF_1_0'. But how do I then use this output to fill the new columns in the final dataframe format?

I am now stuck and not sure how to proceed. Please help.

Thanks in advance.

jezrael · Accepted Answer · 2019-04-07 07:00:18Z

3

First, change the pattern so it uses capture groups, r'([A-Z]{4})(\d{4})(.+)', and use Series.str.extract to build a helper DataFrame. Convert the second column to integers, join the pieces back together and assign the result back to the columns.

Then use Series.str.split to get a MultiIndex in the columns, reshape with DataFrame.stack and clean up the data with DataFrame.rename_axis, DataFrame.reset_index and DataFrame.sort_values:

import re
import pandas as pd

df = pd.DataFrame({
    'MNIF0001:w': [2] * 5,
    'MNIF0010:w': [4] * 5,
    'MNIF0001:f': [6] * 5,
    'MNIF0010:f': [8] * 5,
}, index=['01-Feb-63', '01-Mar-63', '01-Apr-63', '01-May-63', '01-Jun-63'])
df.index.name = 'date'
print(df)
           MNIF0001:w  MNIF0010:w  MNIF0001:f  MNIF0010:f
date                                                     
01-Feb-63           2           4           6           8
01-Mar-63           2           4           6           8
01-Apr-63           2           4           6           8
01-May-63           2           4           6           8
01-Jun-63           2           4           6           8

# three groups: field code, zero-padded well number, and the ':suffix' remainder
well_pattern = re.compile(r'([A-Z]{4})(\d{4})(.+)')
df1 = df.columns.to_series().str.extract(well_pattern)
print(df1)
               0     1   2
MNIF0001:w  MNIF  0001  :w
MNIF0010:w  MNIF  0010  :w
MNIF0001:f  MNIF  0001  :f
MNIF0010:f  MNIF  0010  :f

# astype(int) drops the leading zeros, astype(str) allows string concatenation
df.columns = df1[0] + '_' + df1[1].astype(int).astype(str) + '_0' + df1[2]
print(df)
           MNIF_1_0:w  MNIF_10_0:w  MNIF_1_0:f  MNIF_10_0:f
date                                                       
01-Feb-63           2            4           6            8
01-Mar-63           2            4           6            8
01-Apr-63           2            4           6            8
01-May-63           2            4           6            8
01-Jun-63           2            4           6            8

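# split 'MNIF_1_0:w' into a two-level column MultiIndex: (ID, measurement)
df.columns = df.columns.str.split(':', expand=True)
# move the ID level into the rows, name the index levels, then flatten and sort
df = df.stack(0).rename_axis(('date', 'ID')).reset_index().sort_values(['ID', 'date'])
print(df)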
        date         ID  f  w
4  01-Apr-63  MNIF_10_0  8  4
0  01-Feb-63  MNIF_10_0  8  4
8  01-Jun-63  MNIF_10_0  8  4
2  01-Mar-63  MNIF_10_0  8  4
6  01-May-63  MNIF_10_0  8  4
5  01-Apr-63   MNIF_1_0  6  2
1  01-Feb-63   MNIF_1_0  6  2
9  01-Jun-63   MNIF_1_0  6  2
3  01-Mar-63   MNIF_1_0  6  2
7  01-May-63   MNIF_1_0  6  2
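
Note that the dates in the output above are plain strings, so sort_values orders them alphabetically (Apr, Feb, Jun, ...). If chronological order is wanted, one option is to parse them with pd.to_datetime before sorting. This is a minimal sketch and not part of the original answer: the frame `long_df` is a hypothetical stand-in for the reshaped result, and the format string assumes English month abbreviations. Two-digit years are ambiguous, so it is worth checking which century the parser picks.

```python
import pandas as pd

# Hypothetical stand-in for the reshaped frame, in the alphabetical
# order produced by sorting the date strings.
long_df = pd.DataFrame({
    'date': ['01-Apr-63', '01-Feb-63', '01-Jun-63', '01-Mar-63', '01-May-63'],
    'ID': ['MNIF_1_0'] * 5,
    'f': [6] * 5,
    'w': [2] * 5,
})

# Parse day-month-year strings; %b assumes English month abbreviations.
long_df['date'] = pd.to_datetime(long_df['date'], format='%d-%b-%y')

# Sorting on real datetimes gives chronological order within each ID.
long_df = long_df.sort_values(['ID', 'date']).reset_index(drop=True)
print(long_df)
```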

EDIT: If you need to work with the ID column only, reshape first and then replace the values in the ID column:

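# starting again from the original df with 'MNIF0001:w'-style column names
df.columns = df.columns.str.split(':', expand=True)
df = df.stack(0).rename_axis(('date', 'ID')).reset_index().sort_values(['ID', 'date'])

# two groups are enough here because the ':suffix' part is already split off
well_pattern = re.compile(r'([A-Z]{4})(\d{4})')
df1 = df['ID'].str.extract(well_pattern)
df['ID'] = df1[0] + '_' + df1[1].astype(int).astype(str) + '_0'
print(df)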
        date         ID  f  w
4  01-Apr-63   MNIF_1_0  6  2
0  01-Feb-63   MNIF_1_0  6  2
8  01-Jun-63   MNIF_1_0  6  2
2  01-Mar-63   MNIF_1_0  6  2
6  01-May-63   MNIF_1_0  6  2
5  01-Apr-63  MNIF_10_0  8  4
1  01-Feb-63  MNIF_10_0  8  4
9  01-Jun-63  MNIF_10_0  8  4
3  01-Mar-63  MNIF_10_0  8  4
7  01-May-63  MNIF_10_0  8  4
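
For a flow closer to the compile/findall/replace loop in the question, Series.str.replace also accepts a compiled pattern together with a callable as the replacement. The sketch below is an alternative, not the method used above; the Series `ids` and the helper `rename_well` are hypothetical names made up for illustration.

```python
import re
import pandas as pd

# Hypothetical ID values as they look after the split/stack step.
ids = pd.Series(['MNIF0001', 'MNIF0010', 'MNIF0001'], name='ID')

well_pattern = re.compile(r'([A-Z]{4})(\d{4})')

def rename_well(match):
    # group(1) is the field code, group(2) the zero-padded well number;
    # int() drops the leading zeros.
    return '%s_%d_0' % (match.group(1), int(match.group(2)))

# The callable is invoked once per regex match in each element.
new_ids = ids.str.replace(well_pattern, rename_well, regex=True)
print(new_ids.tolist())  # ['MNIF_1_0', 'MNIF_10_0', 'MNIF_1_0']
```

On a real frame this would be assigned back with something like df['ID'] = df['ID'].str.replace(well_pattern, rename_well, regex=True).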

edited Apr 7, 2019 at 7:00

answered Apr 6, 2019 at 14:25

jezrael


3 Comments

ooikiam Over a year ago

Hi jezrael, got it to work. How do I then apply a regular expression to the ID column to change the ID from MNIF0001 to MNIF_1_0 and so on, like I was trying to do with regex in my question?

ooikiam Over a year ago

@jezrael Thanks buddy, worked like a charm. For my learning: if I were to do the regex on the ID column (i.e. find the MNIF0001 pattern and replace it with MNIF_1_0) after the str.split and df.stack(0) steps, how would I code that? Meaning applying compile, findall and replace to the ID column as values/a Series. Appreciate your help. I can see myself doing that in the future on other pandas dataframes.

ooikiam Over a year ago

@jezrael - let me know if it is unclear what I am asking.
