Create dynamic columns using information from existing columns using pandas

Question

I have this Data Frame

dd = pd.DataFrame({'text':["random text with pol # as 111 and ein no as 101",
                           "random text with pol # as 222",
                           "random text with ein # as 333 and ein no as 303"],
                   'label':[
                           [[26,29,"pol"],[44,47,"ein"]],
                           [[26,29,"pol"]],
                           [[26,29,"ein"],[44,47,"ein"]] ]})

Giving this output

                                              text                      label
0  random text with pol # as 111 and ein no as 101  [[26,29,pol],[44,47,ein]]
1                    random text with pol # as 222               [[26,29,pol]
2  random text with ein # as 333 and ein no as 303  [[26,29,ein],[44,47,ein]]

I want this output

                                              text                      label  \
0  random text with pol # as 111 and ein no as 101  [[26,29,pol],[44,47,ein]]   
1                    random text with pol # as 222               [[26,29,pol]   
2  random text with ein # as 303 and ein no as 304  [[26,29,ein],[44,47,ein]]   

   pol ein_1 ein_2  
0  111   101        
1  222              
2        303   304

I want to create columns dynamically using column information label where this column is a list of list one list contains start_index , end_index, label_type . By accessing the text in the text column using the start and end index we can get the actual label.

For eg text : "random text with pol # as 222" and label is '[[26,29,pol]'

so pol = Text[26:29] which is pol = 222

so I have to create pol as a column name and give it value 222.

so Far I could come up with this

dd["pol"] = dd.apply(lambda row: row.text[ row.label[0][0] : row.label[0][1]], axis=1)

This only works if the data is static and every time all data labels comes once and in the same place.

No, it is not a static format as in the example the label has 3 rows with 3 different formats. 1 st row has POL and EIN ; 2 nd row has only one POL; 3rd row has two EIN . — Ajay Chinni
– Ajay Chinni, Commented Jun 14, 2021 at 10:13

Ank · Accepted Answer · 2021-06-14 10:42:38Z

Here is a generic way to this. This will work for labels other than pol and ein as well if you have it in your df, and will generate columns dynamically for all of them:

from collections import defaultdict

def check_labels(row):
    d = defaultdict(list)
    for l in row.label:
        d[l[2]].append(int(row.text[l[0] : l[1]]))
    return d

label_df = dd.apply(check, axis=1).apply(pd.Series)

label_df1 = pd.DataFrame()    
for col in label_df.columns.tolist():
    max_values = int(label_df[col].str.len().max())
    cols = [col+'_'+str(i+1) for i in range(max_values)] if max_values > 1 else [col]
    label_df1[cols] = label_df[col].apply(pd.Series)

dd = pd.concat([dd,label_df1], axis=1)

Output:

>>> dd
                                              text                           label    pol  ein_1  ein_2
0  random text with pol # as 111 and ein no as 101  [[26, 29, pol], [44, 47, ein]]  111.0  101.0    NaN
1                    random text with pol # as 222                 [[26, 29, pol]]  222.0    NaN    NaN
2  random text with ein # as 333 and ein no as 303  [[26, 29, ein], [44, 47, ein]]    NaN  333.0  303.0

Steps:

First create a function check_labels to use information in label column and extract relevant data from text column.

Apply this function on your df, which returns a series of dictionaries containing labels and their extracted values. Apply pandas.Series to it and save results into label_df:

>>> label_df = dd.apply(check, axis=1).apply(pd.Series)
>>> label_df
     pol         ein
0  [111]       [101]
1  [222]         NaN
2    NaN  [333, 303]

Now we need to count max no. of values in each label obtained above, and split data according to that count for each label. So as pol label has maximum 1 value in each row, it will be kept as is, but ein has 2 values in row 2, so it will be split up as ein_1 and ein_2:

>>> label_df1 = pd.DataFrame()    
>>> for col in label_df.columns.tolist():
...     max_values = int(label_df[col].str.len().max())
...     cols = [col+'_'+str(i+1) for i in range(max_values)] if max_values > 1 else [col]
...     label_df1[cols] = label_df[col].apply(pd.Series)
... 
>>> label_df1
     pol  ein_1  ein_2
0  111.0  101.0    NaN
1  222.0    NaN    NaN
2    NaN  333.0  303.0

We now concat this label_df1 back to original df dd to get what you wanted:

>>> dd = pd.concat([dd,label_df1], axis=1)
>>> dd
                                              text                           label    pol  ein_1  ein_2
0  random text with pol # as 111 and ein no as 101  [[26, 29, pol], [44, 47, ein]]  111.0  101.0    NaN
1                    random text with pol # as 222                 [[26, 29, pol]]  222.0    NaN    NaN
2  random text with ein # as 333 and ein no as 303  [[26, 29, ein], [44, 47, ein]]    NaN  333.0  303.0

crayxt · Accepted Answer · 2021-06-14 10:27:46Z

2

I can only do it in several steps

dd_tmp = dd.text.str.extractall(r"(pol|ein) (?:#|no) as (\d+)")

Which outputs

           0    1
  match
0 0      pol  111
  1      ein  101
1 0      pol  222
2 0      ein  333
  1      ein  303

And then, step by step

dd_tmp.columns = ["name", "value"]
dd_tmp = dd_tmp.reset_index()
dd_tmp["name"] = dd_tmp["name"] + "_" + dd_tmp["match"].astype(str)
dd_tmp = dd_tmp.pivot(columns="name", index="level_0", values="value")

And when joined with original dataframe (on not specified, so join by index)

>>> dd.join(dd_tmp)
                                              text                           label ein_0 ein_1 pol_0
0  random text with pol # as 111 and ein no as 101  [[26, 29, pol], [44, 47, ein]]   NaN   101   111
1                    random text with pol # as 222                 [[26, 29, pol]]   NaN   NaN   222
2  random text with ein # as 333 and ein no as 303  [[26, 29, ein], [44, 47, ein]]   333   303   NaN

answered Jun 14, 2021 at 10:27

crayxt

2,4052 gold badges15 silver badges19 bronze badges

1 Comment

crayxt Over a year ago

Haven't considered that label field consists of location of variables of interest and created more complex solution.

Collectives™ on Stack Overflow

Create dynamic columns using information from existing columns using pandas

2 Answers 2

Steps:

Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Steps:

Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related