python pandas substring based on columns values

Question

Given the following df:

import pandas as pd

data = {'Description': ['with lemon', 'lemon', 'and orange', 'orange'],
        'Start': ['6', '1', '5', '1'],
        'Length': ['5', '5', '6', '6']}
df = pd.DataFrame(data)
print(df)

I would like to take a substring of "Description" using the start position and length given in the other columns. Here is the expected output:

data = {'Description': ['with lemon', 'lemon', 'and orange', 'orange'],
        'Start': ['6', '1', '5', '1'],
        'Length': ['5', '5', '6', '6'],
        'Res': ['lemon', 'lemon', 'orange', 'orange']}
df = pd.DataFrame(data)
print(df)

Is there a way to make the slice dynamic, or another compact way to do this? So far I only have a fixed slice:

df['Res'] = df['Description'].str[1:2]

mozway · Accepted Answer · 2022-07-24 12:42:32Z

3

You need to loop; a list comprehension will be the most efficient option (Python ≥ 3.8 because of the walrus operator, thanks @I'mahdi):

df['Res'] = [s[(start:=int(a)-1):start+int(b)] for (s,a,b)
             in zip(df['Description'], df['Start'], df['Length'])]

Or using pandas for the conversion (thanks @DaniMesejo):

df['Res'] = [s[a:a+b] for (s,a,b) in 
             zip(df['Description'],
                 df['Start'].astype(int)-1,
                 df['Length'].astype(int))]

output:

  Description Start Length     Res
0  with lemon     6      5   lemon
1       lemon     1      5   lemon
2  and orange     5      6  orange
3      orange     1      6  orange
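For Python older than 3.8, the walrus operator is not available. The sketch below moves the index arithmetic into a small helper instead; the name `substring` is made up for illustration and is not part of pandas.

```python
import pandas as pd


def substring(text, start, length):
    """Return `length` characters of `text`, beginning at 1-based `start`."""
    begin = int(start) - 1  # convert the 1-based start to a 0-based index
    return text[begin:begin + int(length)]


df = pd.DataFrame({
    'Description': ['with lemon', 'lemon', 'and orange', 'orange'],
    'Start': ['6', '1', '5', '1'],
    'Length': ['5', '5', '6', '6'],
})

# Same row-wise loop as above, without the := operator
df['Res'] = [substring(s, a, b)
             for s, a, b in zip(df['Description'], df['Start'], df['Length'])]
print(df)
```

The helper converts `Start` only once per row, which is what the walrus version achieves inline.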

handling non-integers / NAs

df['Res'] = [s[a:a+b] if pd.notna(a) and pd.notna(b) else 'NA'
             for (s,a,b) in 
             zip(df['Description'],
                 pd.to_numeric(df['Start'], errors='coerce').convert_dtypes()-1,
                 pd.to_numeric(df['Length'], errors='coerce').convert_dtypes()
                )]

output:

    Description Start Length     Res
0    with lemon     6      5   lemon
1         lemon     1      5   lemon
2    and orange     5      6  orange
3        orange     1      6  orange
4  pinapple xxx    NA     NA      NA
5      orangiie    NA     NA      NA
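A possible variant, if you would rather keep a real missing value than the string 'NA', is sketched below. The input rows with bad `Start` values ('abc' and None) are assumptions made up for this example, since the data behind rows 4 and 5 is not shown.

```python
import pandas as pd

df = pd.DataFrame({
    'Description': ['with lemon', 'lemon', 'pinapple xxx', 'orangiie'],
    'Start': ['6', '1', 'abc', None],   # assumed: non-numeric and missing starts
    'Length': ['5', '5', '3', '4'],
})

# Invalid or missing entries become NaN/NA instead of raising
start = pd.to_numeric(df['Start'], errors='coerce') - 1
length = pd.to_numeric(df['Length'], errors='coerce')

# Slice only when both numbers are present; otherwise store pd.NA
df['Res'] = [s[int(a):int(a + b)] if pd.notna(a) and pd.notna(b) else pd.NA
             for s, a, b in zip(df['Description'], start, length)]
print(df)
```

Using `pd.NA` rather than the string 'NA' means `df['Res'].isna()` can find the skipped rows later.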

edited Jul 24, 2022 at 12:42

answered Jul 24, 2022 at 7:29

mozway


10 Comments

Mahdi F. Over a year ago

python.version >= 3.8

Dani Mesejo Over a year ago

I think it is better to sum start + length at the pandas level and then iterate, right?

mozway Over a year ago

@I'mahdi Correct, for python < 3.8 use a function or duplicate the conversion of a to int ;)

mozway Over a year ago

@Dani also true, I added a variant ;)

highbury Over a year ago

What if Start is NA in some cases? How can I skip it?


Tim Biegeleisen · Accepted Answer · 2022-07-24 07:25:32Z

0

Given that the fruit name of interest always seems to be the final word in the description column, you might be able to use a regex extract approach here.

df["Res"] = df["Description"].str.extract(r'(\w+)$', expand=False)
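As a quick check on the question's sample data, here is a minimal runnable sketch of the regex approach. It ignores the Start and Length columns entirely, so it only fits if the wanted text is always the last word.

```python
import pandas as pd

df = pd.DataFrame({
    'Description': ['with lemon', 'lemon', 'and orange', 'orange'],
})

# (\w+)$ captures the final run of word characters in each string;
# expand=False returns a Series rather than a one-column DataFrame
df['Res'] = df['Description'].str.extract(r'(\w+)$', expand=False)
print(df)
```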

answered Jul 24, 2022 at 7:25

Tim Biegeleisen


RiveN · Accepted Answer · 2022-07-25 08:46:40Z

0

You can use .map to apply a function to each element of the Series. Use split(' ') to separate the words on spaces, then take the last word of the resulting list with [-1].

df['Res'] = df['Description'].map(lambda x: x.split(' ')[-1])
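The same last-word idea can probably also be written with the `.str` accessor, which passes missing descriptions through instead of raising inside the lambda. This is a sketch only; the final row with a None description is an assumption added to show that behaviour.

```python
import pandas as pd

df = pd.DataFrame({
    'Description': ['with lemon', 'lemon', 'and orange', 'orange', None],
})

# .str.split() splits on whitespace; .str[-1] takes the last piece of each list.
# A missing Description stays missing in the result.
df['Res'] = df['Description'].str.split().str[-1]
print(df)
```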

edited Jul 25, 2022 at 8:46

RiveN


answered Jul 24, 2022 at 8:03

Brugor

