Python : Split string every three words in dataframe

Question

I've been searching around for a while now, but I can't seem to find the answer to this small problem.

I have this code that is supposed to split the string after every three words:

import pandas as pd
import numpy as np

df1 = {
    'State':['Arizona AZ asdf hello abc','Georgia GG asdfg hello def','Newyork NY asdfg hello ghi','Indiana IN asdfg hello jkl','Florida FL ASDFG hello mno']}

df1 = pd.DataFrame(df1,columns=['State'])
df1

def splitTextToTriplet(df):
    text = df['State'].str.split()
    n = 3
    grouped_words = [' '.join(str(text[i:i+n]) for i in range(0,len(text),n))]
    return grouped_words

splitTextToTriplet(df1)

Currently, the output is:

['0     [Arizona, AZ, asdf, hello, abc]\n1    [Georgia, GG, asdfg, hello, def]\nName: State, dtype: object 2    [Newyork, NY, asdfg, hello, ghi]\n3    [Indiana, IN, asdfg, hello, jkl]\nName: State, dtype: object 4    [Florida, FL, ASDFG, hello, mno]\nName: State, dtype: object']

But I am expecting this output as a DataFrame with five rows and one column:

['Arizona AZ asdf', 'hello abc']
['Georgia GG asdfg', 'hello def']
['Newyork NY asdfg', 'hello ghi']
['Indiana IN asdfg', 'hello jkl']
['Florida FL ASDFG', 'hello mno']

How can I change the regex so it produces the expected output?

mozway · Accepted Answer · 2022-01-27 09:09:12Z


For efficiency, you can use a regex and str.extractall + groupby/agg:

(df1['State']
 .str.extractall(r'((?:\w+\b\s*){1,3})')[0]
 .groupby(level=0).agg(list)
)

output:

0     [Arizona AZ asdf , hello abc]
1    [Georgia GG asdfg , hello def]
2    [Newyork NY asdfg , hello ghi]
3    [Indiana IN asdfg , hello jkl]
4    [Florida FL ASDFG , hello mno]

regex:

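(             # start of the capturing group
(?:\w+\b\s*)  # one word plus any whitespace that follows it (non-capturing)
{1,3}         # repeated one to three times, as many as possible
)             # end of the capturing group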
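To see what the pattern captures without pandas, here is a minimal sketch that runs it on a single string with `re.findall`. Because each word also matches the whitespace after it, the first group keeps a trailing space, as in the output above. `TRIMMED_PATTERN` is a hypothetical variant, not part of the answer, that only matches whitespace between words.

```python
import re

# The answer's pattern: each word also swallows the whitespace after it.
ANSWER_PATTERN = r'((?:\w+\b\s*){1,3})'

# Hypothetical variant (an assumption, not from the answer):
# one word, then up to two more words, each preceded by whitespace.
TRIMMED_PATTERN = r'(\w+(?:\s+\w+){0,2})'

text = 'Arizona AZ asdf hello abc'

answer_groups = re.findall(ANSWER_PATTERN, text)
trimmed_groups = re.findall(TRIMMED_PATTERN, text)

print(answer_groups)   # ['Arizona AZ asdf ', 'hello abc']
print(trimmed_groups)  # ['Arizona AZ asdf', 'hello abc']
```

If the trailing space matters, the variant pattern should be usable in the same `str.extractall` call in place of the original one.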


BioGeek · Accepted Answer · 2022-01-27 09:09:53Z


You can do:

def splitTextToTriplet(row):
    text = row['State'].split()
    n = 3
    grouped_words = [' '.join(text[i:i+n]) for i in range(0,len(text),n)]
    return grouped_words

df1.apply(splitTextToTriplet, axis=1)

which gives the following output, one list per row:

	0
0	['Arizona AZ asdf', 'hello abc']
1	['Georgia GG asdfg', 'hello def']
2	['Newyork NY asdfg', 'hello ghi']
3	['Indiana IN asdfg', 'hello jkl']
4	['Florida FL ASDFG', 'hello mno']
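The key difference from the code in the question is that the function now receives one row's string instead of the whole column: slicing `text[i:i+n]` on a Series selects rows, not words. The chunking step can be sketched on plain strings as follows; the helper name `split_every_n_words` and its `n` parameter are illustrative, not from the answer.

```python
def split_every_n_words(text, n=3):
    """Split text on whitespace and re-join the words in groups of n."""
    words = text.split()
    # Step through the word list n items at a time; the last group may be shorter.
    return [' '.join(words[i:i + n]) for i in range(0, len(words), n)]


states = ['Arizona AZ asdf hello abc', 'Georgia GG asdfg hello def']
for state in states:
    print(split_every_n_words(state))
# ['Arizona AZ asdf', 'hello abc']
# ['Georgia GG asdfg', 'hello def']
```

With pandas, the same helper could be applied to the column directly, for example `df1['State'].apply(split_every_n_words)`, which avoids the row-wise `axis=1` call.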
