Extract a text out of a column using pattern in python

Question

I'm trying to extract a text out of a column so I can move to another column using a pattern in python but I miss some results at the same time I need to keep the unextracted strings as they are>

My code is:

import pandas as pd
df = pd.DataFrame({
    'col': ['item1 (30-10)', 'item2 (200-100)', 'item3 (100 FS)', 'item4 (100+)', 'item1 (1000-2000)' ]
})

pattern = r'(\d+(\,[0-9]+)?\-\d+(\,[a-zA-Z])?\d+)'

df['result'] = df['col'].str.extract(pattern)[0]
print(df)

My output is:

col     result
0     item1 (30-10)      30-10
1   item2 (200-100)    200-100
2    item3 (100 FS)        NaN
3      item4 (100+)        NaN
4  item1 (1000-2000)  1000-2000

My output should be:

col     result        newcolumn
0       item1         (30-10)
1       item2         (200-100)
2       item3         (100 FS)
3       item4         (100+)
4       item1         (1000-2000)

Check this out stackoverflow.com/questions/9989334/…

giorgos
– giorgos

2020-12-19 19:41:09 +00:00
Commented Dec 19, 2020 at 19:41 — giorgos
– giorgos, Commented Dec 19, 2020 at 19:41

Cainã Max Couto da Silva · Accepted Answer · 2020-12-19 19:43:17Z

2

You can use this:

df['newcolumn'] = df.col.str.extract(r'(\(.+\))')
df['result'] = df['col'].str.extract(r'(\w+)')

Output:

                 col    newcolumn result
0      item1 (30-10)      (30-10)  item1
1    item2 (200-100)    (200-100)  item2
2     item3 (100 FS)     (100 FS)  item3
3       item4 (100+)       (100+)  item4
4  item1 (1000-2000)  (1000-2000)  item1

Explanation:

The first expression gets the content within parenthesis (including the parenthesis themselves). The second gets the first word.

edited Dec 19, 2020 at 19:43

answered Dec 19, 2020 at 19:35

Cainã Max Couto da Silva

4,9691 gold badge15 silver badges39 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

buddemat · Accepted Answer · 2021-03-13 09:47:45Z

You can also do this with .str.split in a single line:

 df[['result', 'newcolumn']] = df['col'].str.split(' ', 1, expand=True)

Output:

                 col result    newcolumn
0      item1 (30-10)  item1      (30-10)
1    item2 (200-100)  item2    (200-100)
2     item3 (100 FS)  item3     (100 FS)
3       item4 (100+)  item4       (100+)
4  item1 (1000-2000)  item1  (1000-2000)

You must use expand=True if your strings have a non-uniform number of splits (see also How to split a dataframe string column into two columns?).

EDIT: If you want to 'drop' the old column, you can also overwrite it and rename it:

 df[['col', 'newcolumn']] = df['col'].str.split(' ', 1, expand=True)
 df = df.rename(columns={"col": "result"})

which exactly gives you the result you specified was intended:

  result    newcolumn
0  item1      (30-10)
1  item2    (200-100)
2  item3     (100 FS)
3  item4       (100+)
4  item1  (1000-2000)

Mig B · Accepted Answer · 2020-12-19 19:42:11Z

You can extract the parts of interest by grouping them within one regular expression. The regex pattern now matches item\d as first group and anything inside the brackets with \(.*\) as the second one.

import pandas as pd
df = pd.DataFrame({
    'col': ['item1 (30-10)', 'item2 (200-100)', 'item3 (100 FS)', 'item4 (100+)', 'item1 (1000-2000)' ]
})

pattern = "(item\d*)\s(\(.*\))"

df['items'] = df['col'].str.extract(pattern)[0]
df['result'] = df['col'].str.extract(pattern)[1]

print(df)

Output:

                 col  items      result
0      item1 (30-10)  item1      (30-10)
1    item2 (200-100)  item2    (200-100)
2     item3 (100 FS)  item3     (100 FS)
3       item4 (100+)  item4       (100+)
4  item1 (1000-2000)  item1  (1000-2000)

Collectives™ on Stack Overflow

Extract a text out of a column using pattern in python

3 Answers 3

Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related