1

I'm trying to extract a text out of a column so I can move to another column using a pattern in python but I miss some results at the same time I need to keep the unextracted strings as they are>

My code is:

import pandas as pd
df = pd.DataFrame({
    'col': ['item1 (30-10)', 'item2 (200-100)', 'item3 (100 FS)', 'item4 (100+)', 'item1 (1000-2000)' ]
})

pattern = r'(\d+(\,[0-9]+)?\-\d+(\,[a-zA-Z])?\d+)'

df['result'] = df['col'].str.extract(pattern)[0]
print(df)

My output is:

col     result
0     item1 (30-10)      30-10
1   item2 (200-100)    200-100
2    item3 (100 FS)        NaN
3      item4 (100+)        NaN
4  item1 (1000-2000)  1000-2000

My output should be:

col     result        newcolumn
0       item1         (30-10)
1       item2         (200-100)
2       item3         (100 FS)
3       item4         (100+)
4       item1         (1000-2000)
1

3 Answers 3

2

You can use this:

df['newcolumn'] = df.col.str.extract(r'(\(.+\))')
df['result'] = df['col'].str.extract(r'(\w+)')

Output:

                 col    newcolumn result
0      item1 (30-10)      (30-10)  item1
1    item2 (200-100)    (200-100)  item2
2     item3 (100 FS)     (100 FS)  item3
3       item4 (100+)       (100+)  item4
4  item1 (1000-2000)  (1000-2000)  item1

Explanation:

The first expression gets the content within parenthesis (including the parenthesis themselves). The second gets the first word.

Sign up to request clarification or add additional context in comments.

Comments

1

You can also do this with .str.split in a single line:

 df[['result', 'newcolumn']] = df['col'].str.split(' ', 1, expand=True)

Output:

                 col result    newcolumn
0      item1 (30-10)  item1      (30-10)
1    item2 (200-100)  item2    (200-100)
2     item3 (100 FS)  item3     (100 FS)
3       item4 (100+)  item4       (100+)
4  item1 (1000-2000)  item1  (1000-2000)

You must use expand=True if your strings have a non-uniform number of splits (see also How to split a dataframe string column into two columns?).

EDIT: If you want to 'drop' the old column, you can also overwrite it and rename it:

 df[['col', 'newcolumn']] = df['col'].str.split(' ', 1, expand=True)
 df = df.rename(columns={"col": "result"})

which exactly gives you the result you specified was intended:

  result    newcolumn
0  item1      (30-10)
1  item2    (200-100)
2  item3     (100 FS)
3  item4       (100+)
4  item1  (1000-2000)

Comments

1

You can extract the parts of interest by grouping them within one regular expression. The regex pattern now matches item\d as first group and anything inside the brackets with \(.*\) as the second one.

import pandas as pd
df = pd.DataFrame({
    'col': ['item1 (30-10)', 'item2 (200-100)', 'item3 (100 FS)', 'item4 (100+)', 'item1 (1000-2000)' ]
})

pattern = "(item\d*)\s(\(.*\))"

df['items'] = df['col'].str.extract(pattern)[0]
df['result'] = df['col'].str.extract(pattern)[1]

print(df)

Output:

                 col  items      result
0      item1 (30-10)  item1      (30-10)
1    item2 (200-100)  item2    (200-100)
2     item3 (100 FS)  item3     (100 FS)
3       item4 (100+)  item4       (100+)
4  item1 (1000-2000)  item1  (1000-2000)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.