New Pandas Columns with Regex Parsing

Question

I am trying to parse text data in Pandas DataFrame based on certain tags and values in another column's fields and store them in their own columns. For example, if I created this dataframe, df:

df = pd.DataFrame([[1,2],['A: this is a value B: this is the b val C: and here is c.','A: and heres another a. C: and another c']])
df = df.T
df.columns = ['col1','col2']


df['tags'] = df['col2'].apply(lambda x: re.findall('(?:\s|)(\w*)(?::)',x))
all_tags = []

for val in df['tags']:
    all_tags = all_tags + val
all_tags = list(set(all_tags))
for val in all_tags:
    df[val] = ''

df:
  col1                                               col2       tags A C B
0    1  A: this is a value B: this is the b val C: and...  [A, B, C]      
1    2           A: and heres another a. C: and another c     [A, C]

How would I populate each of the new "tag" columns with their values from col2 so I get this df:

col1                                               col2           tags  \
0    1  A: this is a value B: this is the b val C: and...  [A, B, C]   
1    2           A: and heres another a. C: and another c     [A, C]   

                  A               C                  B  
0       this is a value  and here is c.  this is the b val  
1  and heres another a.   and another c

akuiper · Accepted Answer · 2017-08-19 17:04:49Z

3

Another option using str.extractall with regex (?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$):

The regex captures the key (?P<key>\w+) before the semi colon and value after the semi colon (?P<val>[^:]*) as two separate columns key and val, the val will match non : characters until it reaches the next key value pair restricted by a look ahead syntax (?=\w+:|$); This assumes the key is always a single word which would be ambiguous otherwise:

import re
pat = re.compile("(?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$)")

pd.concat([
    df,
    (
        df.col2.str.extractall(pat)
          .reset_index('match', drop=True)
          .set_index('key', append=True)
          .val.unstack('key')
    )
], axis=1).fillna('')

Where str.extractall gives:

df.col2.str.extractall(pat)

And then you pivot the result and concatenate with the original data frame.

edited Aug 19, 2017 at 17:04

answered Aug 19, 2017 at 16:58

akuiper

216k33 gold badges362 silver badges379 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Zero · Accepted Answer · 2017-08-19 16:59:43Z

Here's one way

In [683]: (df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
             .apply(lambda x: pd.Series(dict([v.split(':', 1) for v in x])))
          )
Out[683]:
                       A                   B                C
0        this is a value   this is the b val   and here is c.
1   and heres another a.                 NaN    and another c

You could append back the results using join

In [690]: df.join(df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
                    .apply(lambda x: pd.Series(dict([v.split(':', 1) for v in x]))))
Out[690]:
  col1                                               col2       tags  \
0    1  A: this is a value B: this is the b val C: and...  [A, B, C]
1    2           A: and heres another a. C: and another c     [A, C]

                       A                   B                C
0        this is a value   this is the b val   and here is c.
1   and heres another a.                 NaN    and another c

Infact, you could get df['tags'] using string method

In [688]: df.col2.str.findall('(?:\s|)(\w*)(?::)')
Out[688]:
0    [A, B, C]
1       [A, C]
Name: col2, dtype: object

Details:

Split groups into lists

In [684]: df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
Out[684]:
0    [A: this is a value, B: this is the b val, C: ...
1          [A: and heres another a., C: and another c]
Name: col2, dtype: object

Now, to key and value pairs of lists.

In [685]: (df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
             .apply(lambda x: [v.split(':', 1) for v in x]))
Out[685]:
0    [[A,  this is a value], [B,  this is the b val...
1    [[A,  and heres another a.], [C,  and another c]]
Name: col2, dtype: object

Collectives™ on Stack Overflow

New Pandas Columns with Regex Parsing

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related