I've been searching around for a while now, but I can't seem to find the answer to this small problem.

I have this code that is supposed to split the string after every three words:

import pandas as pd
import numpy as np

df1 = {
    'State':['Arizona AZ asdf hello abc','Georgia GG asdfg hello def','Newyork NY asdfg hello ghi','Indiana IN asdfg hello jkl','Florida FL ASDFG hello mno']}

df1 = pd.DataFrame(df1,columns=['State'])
df1

def splitTextToTriplet(df):
    text = df['State'].str.split()
    n = 3
    grouped_words = [' '.join(str(text[i:i+n]) for i in range(0,len(text),n))]
    return grouped_words

splitTextToTriplet(df1)

Currently, the output is:

['0     [Arizona, AZ, asdf, hello, abc]\n1    [Georgia, GG, asdfg, hello, def]\nName: State, dtype: object 2    [Newyork, NY, asdfg, hello, ghi]\n3    [Indiana, IN, asdfg, hello, jkl]\nName: State, dtype: object 4    [Florida, FL, ASDFG, hello, mno]\nName: State, dtype: object']

But I am expecting this output as a DataFrame with five rows and one column:

['Arizona AZ asdf', 'hello abc']
['Georgia GG asdfg', 'hello def']
['Newyork NY asdfg', 'hello ghi']
['Indiana IN asdfg', 'hello jkl']
['Florida FL ASDFG', 'hello mno']

How can I change the code so that it produces the expected output?

2 Answers

For efficiency, you can use a regex and str.extractall + groupby/agg:

(df1['State']
 .str.extractall(r'((?:\w+\b\s*){1,3})')[0]
 .groupby(level=0).agg(list)
)

output:

0     [Arizona AZ asdf , hello abc]
1    [Georgia GG asdfg , hello def]
2    [Newyork NY asdfg , hello ghi]
3    [Indiana IN asdfg , hello jkl]
4    [Florida FL ASDFG , hello mno]

regex:

(             # start capturing
(?:\w+\b\s*)  # non-capturing group: one word plus any trailing whitespace
{1,3}         # repeat the group one to three times (greedy, so up to three words)
)             # end capturing
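
Because `\s*` sits inside the repeated group, full three-word chunks keep a trailing space, as the output above shows. A minimal sketch of one way to avoid that, assuming words are runs of `\w` characters separated by whitespace: the pattern below is a variant for illustration, not the pattern from the answer, and the two-row frame is cut-down sample data.

```python
import pandas as pd

df1 = pd.DataFrame({'State': ['Arizona AZ asdf hello abc',
                              'Georgia GG asdfg hello def']})

# One word, then up to two more "whitespace + word" pairs.
# The match always ends on a word character, so no trailing space is captured.
pattern = r'(\w+(?:\s+\w+){0,2})'

result = (df1['State']
          .str.extractall(pattern)[0]   # one row per match, MultiIndex (row, match)
          .groupby(level=0).agg(list)   # collect the matches of each original row
)
print(result)
```

Rows without any match do not appear in the result of `str.extractall`, so an empty string would be dropped from `result` rather than mapped to an empty list.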

You can do:

def splitTextToTriplet(row):
    text = row['State'].split()
    n = 3
    grouped_words = [' '.join(text[i:i+n]) for i in range(0,len(text),n)]
    return grouped_words

df1.apply(lambda row: splitTextToTriplet(row), axis=1)

which gives the following output:

0
0 ['Arizona AZ asdf', 'hello abc']
1 ['Georgia GG asdfg', 'hello def']
2 ['Newyork NY asdfg', 'hello ghi']
3 ['Indiana IN asdfg', 'hello jkl']
4 ['Florida FL ASDFG', 'hello mno']
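
Since only one column is involved, the same chunking can also be applied to that column directly instead of going row by row over the whole frame. A minimal sketch under that assumption: the helper name `split_into_chunks` and the new column name `Triplets` are hypothetical, chosen for illustration.

```python
import pandas as pd

def split_into_chunks(text, n=3):
    """Split a string into groups of at most n whitespace-separated words."""
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(0, len(words), n)]

df1 = pd.DataFrame({'State': ['Arizona AZ asdf hello abc',
                              'Georgia GG asdfg hello def']})

# Series.apply passes each string to the function, so no axis argument is needed.
df1['Triplets'] = df1['State'].apply(split_into_chunks)
print(df1['Triplets'])
```

Keeping `n` as a parameter makes the group size easy to change, for example `df1['State'].apply(split_into_chunks, n=2)`.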
