2

I am trying to parse text data in Pandas DataFrame based on certain tags and values in another column's fields and store them in their own columns. For example, if I created this dataframe, df:

df = pd.DataFrame([[1,2],['A: this is a value B: this is the b val C: and here is c.','A: and heres another a. C: and another c']])
df = df.T
df.columns = ['col1','col2']


df['tags'] = df['col2'].apply(lambda x: re.findall('(?:\s|)(\w*)(?::)',x))
all_tags = []

for val in df['tags']:
    all_tags = all_tags + val
all_tags = list(set(all_tags))
for val in all_tags:
    df[val] = ''

df:
  col1                                               col2       tags A C B
0    1  A: this is a value B: this is the b val C: and...  [A, B, C]      
1    2           A: and heres another a. C: and another c     [A, C]

How would I populate each of the new "tag" columns with their values from col2 so I get this df:

col1                                               col2           tags  \
0    1  A: this is a value B: this is the b val C: and...  [A, B, C]   
1    2           A: and heres another a. C: and another c     [A, C]   

                  A               C                  B  
0       this is a value  and here is c.  this is the b val  
1  and heres another a.   and another c 

2 Answers 2

3

Another option using str.extractall with regex (?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$):

The regex captures the key (?P<key>\w+) before the semi colon and value after the semi colon (?P<val>[^:]*) as two separate columns key and val, the val will match non : characters until it reaches the next key value pair restricted by a look ahead syntax (?=\w+:|$); This assumes the key is always a single word which would be ambiguous otherwise:

import re
pat = re.compile("(?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$)")

pd.concat([
    df,
    (
        df.col2.str.extractall(pat)
          .reset_index('match', drop=True)
          .set_index('key', append=True)
          .val.unstack('key')
    )
], axis=1).fillna('')

enter image description here


Where str.extractall gives:

df.col2.str.extractall(pat)

enter image description here

And then you pivot the result and concatenate with the original data frame.

Sign up to request clarification or add additional context in comments.

Comments

2

Here's one way

In [683]: (df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
             .apply(lambda x: pd.Series(dict([v.split(':', 1) for v in x])))
          )
Out[683]:
                       A                   B                C
0        this is a value   this is the b val   and here is c.
1   and heres another a.                 NaN    and another c

You could append back the results using join

In [690]: df.join(df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
                    .apply(lambda x: pd.Series(dict([v.split(':', 1) for v in x]))))
Out[690]:
  col1                                               col2       tags  \
0    1  A: this is a value B: this is the b val C: and...  [A, B, C]
1    2           A: and heres another a. C: and another c     [A, C]

                       A                   B                C
0        this is a value   this is the b val   and here is c.
1   and heres another a.                 NaN    and another c

Infact, you could get df['tags'] using string method

In [688]: df.col2.str.findall('(?:\s|)(\w*)(?::)')
Out[688]:
0    [A, B, C]
1       [A, C]
Name: col2, dtype: object

Details:

Split groups into lists

In [684]: df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
Out[684]:
0    [A: this is a value, B: this is the b val, C: ...
1          [A: and heres another a., C: and another c]
Name: col2, dtype: object

Now, to key and value pairs of lists.

In [685]: (df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
             .apply(lambda x: [v.split(':', 1) for v in x]))
Out[685]:
0    [[A,  this is a value], [B,  this is the b val...
1    [[A,  and heres another a.], [C,  and another c]]
Name: col2, dtype: object

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.