2

Given the following df:

data = {'Description':  ['with lemon', 'lemon', 'and orange', 'orange'],
        'Start': ['6', '1', '5', '1'],
       'Length': ['5', '5', '6', '6']}
df = pd.DataFrame(data)
print (df)

I would like to substring the "Description" based on what is specified in the other columns as start and length, here the expected output:

data = {'Description':  ['with lemon', 'lemon', 'and orange', 'orange'],
        'Start': ['6', '1', '5', '1'],
       'Length': ['5', '5', '6', '6'],
       'Res':  ['lemon', 'lemon', 'orange', 'orange']}
df = pd.DataFrame(data)
print (df)

Is there a way to make it dynamic or another compact way?

df['Res'] = df['Description'].str[1:2]

3 Answers 3

3

You need to loop, a list comprehension will be the most efficient (python ≥3.8 due to the walrus operator, thanks @I'mahdi):

df['Res'] = [s[(start:=int(a)-1):start+int(b)] for (s,a,b)
             in zip(df['Description'], df['Start'], df['Length'])]

Or using pandas for the conversion (thanks @DaniMesejo):

df['Res'] = [s[a:a+b] for (s,a,b) in 
             zip(df['Description'],
                 df['Start'].astype(int)-1,
                 df['Length'].astype(int))]

output:

  Description Start Length     Res
0  with lemon     6      5   lemon
1       lemon     1      5   lemon
2  and orange     5      6  orange
3      orange     1      6  orange

handling non-integers / NAs

df['Res'] = [s[a:a+b] if pd.notna(a) and pd.notna(b) else 'NA'
             for (s,a,b) in 
             zip(df['Description'],
                 pd.to_numeric(df['Start'], errors='coerce').convert_dtypes()-1,
                 pd.to_numeric(df['Length'], errors='coerce').convert_dtypes()
                )]

output:

    Description Start Length     Res
0    with lemon     6      5   lemon
1         lemon     1      5   lemon
2    and orange     5      6  orange
3        orange     1      6  orange
4  pinapple xxx    NA     NA      NA
5      orangiie    NA     NA      NA
Sign up to request clarification or add additional context in comments.

10 Comments

python.version >= 3.8
I think is better if you sum start + length at pandas level and then iterate, right?
@I'mahdi Correct, for python < 3.8 use a function or duplicate the convertion of a to int ;)
@Dani also true, I added a variant ;)
what about if in some case the Start is NA? How can I skip it?
|
0

Given that the fruit name of interest always seems to be the final word in the description column, you might be able to use a regex extract approach here.

data["Res"] = data["Description"].str.extract(r'(\w+)$')

Comments

0

You can use .map to cycle through the Series. Use split(' ') to separate the words if there is space and get the last word in the list [-1].

df['RES'] = df['Description'].map(lambda x: x.split(' ')[-1])

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.