Break up a list of strings in a pandas dataframe column into new columns based on first word of each sentence

Question

So I have roughly 40,000 rows of people and their complaints. I am attempting to sort them into their respective columns for analysis, and for other analysts at my company who use other tools can use this data.

DataFrame Example:

df = pd.DataFrame({"person": [1, 2, 3], 
                   "problems": ["body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, tired", 
                                "soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger", 
                                "none"]})
df     
╔═══╦════════╦══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║   ║ person ║                                                     problems                                                     ║
╠═══╬════════╬══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ 0 ║      1 ║ body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, tired                                         ║
║ 1 ║      2 ║ soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger ║
║ 2 ║      3 ║ none                                                                                                             ║
╚═══╩════════╩══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

Desired Output:

╔═══╦════════╦══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╦════════════════════════════════════════════════════════════════════════════════╦═══════════════════════╦═══════════════╗
║   ║ person ║                                                     problems                                                     ║                                      body                                      ║         mind          ║     soul      ║
╠═══╬════════╬══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╬════════════════════════════════════════════════════════════════════════════════╬═══════════════════════╬═══════════════╣
║ 0 ║      1 ║ body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, tired                                         ║ body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE)                              ║ mind: stressed, tired ║ NaN           ║
║ 1 ║      2 ║ soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger ║ body: feels great(lifts weights), overweight(always bulking), missing a finger ║ mind: can't think     ║ soul: missing ║
║ 2 ║      3 ║ none                                                                                                             ║ NaN                                                                            ║ NaN                   ║ NaN           ║
╚═══╩════════╩══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╩════════════════════════════════════════════════════════════════════════════════╩═══════════════════════╩═══════════════╝

Things I've tried / where I'm at:

So I've been able to at least separate these with a regex statement that seems to do the job with my real data.

df.problems.str.extractall(r"(\b(?!(?: \b))[\w\s.()',:/-]+)")


+---+-------+--------------------------------------------------------------------------------+
|   |       |                                       0                                        |
+---+-------+--------------------------------------------------------------------------------+
|   | match |                                                                                |
| 0 | 0     | body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE)                              |
|   | 1     | mind: stressed, tired                                                          |
| 1 | 0     | soul: missing                                                                  |
|   | 1     | mind: can't think                                                              |
|   | 2     | body: feels great(lifts weights), overweight(always bulking), missing a finger |
| 2 | 0     | none                                                                           |
+---+-------+--------------------------------------------------------------------------------+

I'm a regex beginner, so I expect this could probably be done better. My original regex pattern was r'([^;]+)', but I was trying to exclude the space after the semi-colons.

So I'm at a loss. I played with:

df.problems.str.extractall(r"(\b(?!(?: \b))[\w\s.()',:/-]+)").unstack(), which "works"(doesn't error out) with my example here.

But with my real data, I get an error: "ValueError: Index contains duplicate entries, cannot reshape"

Even if it worked with my real data, I'd still have to figure out how to get these 'categories'(body, mind, soul) into assigned columns.

I'd probably have better luck if I could word this question better. I'm trying to really self-learn here, so I'll appreciate any leads even if they're not a complete solution.

I'm kind of sniffing a trail that maybe I can do this somehow with a groupby or multiIndex know-how. Kind of new to programming, so I'm still feeling my way around in the dark. I would appreciate any tips or ideas anyone has to offer. Thank you!

EDIT: I just want to come back and mention the error I was getting in my real data "ValueError: Index contains duplicate entries, cannot reshape" when using @WeNYoBen's solution:

(df.problems.str.extractall(r"(\b(?!(?: \b))[\w\s.()',:/-]+)")[0]
.str.split(':',expand=True)
.set_index(0,append=True)[1]
.unstack()
.groupby(level=0)
.first())

It turned out I had some groups with multiple colons. For example:

df = pd.DataFrame({"person": [1, 2, 3], 
                   "problems": ["body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, energy: tired", 
                                "soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger", 
                                "none"]})




╔═══╦════════╦══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║   ║ person ║                                                     problems                                                     ║
╠═══╬════════╬══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ 0 ║      1 ║ body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, energy: tired                                 ║
║ 1 ║      2 ║ soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger ║
║ 2 ║      3 ║ none                                                                                                             ║
╚═══╩════════╩══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

See the first row update reflecting the edge case I discovered ; mind: stressed, energy: tired.

I was able to fix this by altering my regex to say the beginning of the match must be the beginning of the string or be preceded with a semi-colon.

splits = [r'(^)(.+?)[:]', r'(;)(.+?)[:]']
str.split('|'.join(splits)

After that I just had to re-tweak the set_index portion to get @WeNYoBen's helpful solution to work, so I'll stick with this one.

I should add that yesterday I did figure out how to get the full list of all these groupings in my real data, which is around 10 total. sorted(df.problems.str.extractall(r'(^|;)(.+?)[:]').reset_index()[1].str.strip().unique()) — dakro
– dakro, Commented Aug 9, 2019 at 19:33
Are the three categories always the same? Body, mind, soul... — Martin Noah
– Martin Noah, Commented Aug 9, 2019 at 19:38
If they are always three category, then you could probably try matching from body ... ; and likewise for each category. — ifly6
– ifly6, Commented Aug 9, 2019 at 19:39
df.problems.str.extractall(r"(\b(?!(?: \b))[\w\s.()',:/-]+)")[0].str.split(':',expand=True).set_index(0,append=True)[1].unstack().groupby(level=0).first() your own functionafter polish is better than applt + lambda — BENY
– BENY, Commented Aug 9, 2019 at 20:11
@sobek I'm actually using both of your solutions as a stepping stone. Trying to work out another edge case I discovered where a semi-colon is used within a phrase, just two occurrences out of tens of thousands of rows :/ You guys rock though. Learned a lot this weekend — dakro
– dakro, Commented Aug 12, 2019 at 16:38

sobek · Accepted Answer · 2019-08-09 20:42:40Z

4

It's not elegant but it gets the job done:

df['split'] = df.problems.str.split(';')
df['mind'] = df.split.apply(
    lambda x: ''.join([category for category in x if 'mind' in category]))
df['body'] = df.split.apply(
    lambda x: ''.join([category for category in x if 'body' in category]))
df['soul'] = df.split.apply(
    lambda x: ''.join([category for category in x if 'soul' in category]))
df.drop('split', inplace=True)

You could probably wrap

df[cat] = df.split.apply(lambda x: ''.join([category for category in x if cat in category]))

in a function and run it on your dataframe for each cat (e.g. cats=['mind', 'body', 'soul', 'whathaveyou', 'etc.'].

Edit:

As @ifly6 has pointed out, there may be intersections of keywords in the strings that users enter. To be safe, the function should be altered to

df[cat] = df.split.apply(lambda x: ''.join([category for category in x if category.startswith(cat)]))

edited Aug 9, 2019 at 20:42

answered Aug 9, 2019 at 20:01

sobek

1,43612 silver badges32 bronze badges

Sign up to request clarification or add additional context in comments.

7 Comments

dakro Over a year ago

Thank you so much! I can work with this! Man I really need to spend time understanding lambda.

sobek Over a year ago

If there's a specific thing you don't understand, ask and i'll try to clarify my answer.

dakro Over a year ago

I appreciate the offer! I'm still so new to programming in general I just need to revisit this from square one.

ifly6 Over a year ago

What if the string is like "mind: he thinks his body is out of shape"?

sobek Over a year ago

Note that I have edited my answer to deal with this edge case (see bottom).

|

Collectives™ on Stack Overflow

Break up a list of strings in a pandas dataframe column into new columns based on first word of each sentence

1 Answer 1

7 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

7 Comments

Your Answer

Sign up or log in

Post as a guest

Related