Extracting and replacing a substring using regex in a pandas DataFrame

Question

I have this:

                                                Title  
Num                                                      
0    <span class="o-label--tiny">VALEUR ÉNERGÉTIQUE</span>   
1         <span class="o-label--tiny">PROTÉINES</span>   
2          <span class="o-label--tiny">GLUCIDES</span> 

<class 'pandas.core.frame.DataFrame'> Num Index(['Title'], dtype='object')

This is what I want:

            Title  
Num                                                      
0  VALEUR ÉNERGÉTIQUE   
1           PROTÉINES   
2            GLUCIDES

This is the regex I developed:

(<span class=\"o-label--tiny\">)([a-zA-Z]+\s*\w*)(</span>)

Testing it, I see that it matches the whole initial string and has groups for the different substrings. In the end, I want group(2) in my dataframe column. (My examples below show the explicit regex, but I have also tried them with the re.compile result, which doesn't get me to my final result either.)

This is what I have tried:

df['Title'] = df['Title'].replace({'<span class=\"o-label--tiny\">': ''}, inplace=True, regex=True)

The result:

Try number 2:

df['Title'] = df['Title'].str.replace('<span class=\"o-label--tiny\">', repl = '')

Result number 2:

Try number 3:

df['Title'] = df[lambda df: df.columns[0]].str.extract('(>[a-zA-Z]+\s*\w*)', expand=False)

Result 3:

I really don't see what I am doing wrong, and any help getting to my desired result would be appreciated. Thank you!

jezrael · Accepted Answer · 2017-10-19 11:09:34Z

1

Use str.extract:

df['Title']=df['Title'].str.extract('<span class=\"o-label--tiny\">(.*)</span>',expand=False)
print (df)
                  Title
Num                    
0    VALEUR ÉNERGÉTIQUE
1             PROTÉINES
2              GLUCIDES

If different tags or classes are possible:

df['Title'] = df['Title'].str.extract('>(.*)<',expand=False)
print (df)
                  Title
Num                    
0    VALEUR ÉNERGÉTIQUE
1             PROTÉINES
2              GLUCIDES
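
A minimal self-contained sketch of the two calls above, with the sample frame rebuilt by hand. The construction of `df` is an assumption, since the question does not show how the frame was created. It also shows `str.replace` with `regex=True` as an alternative that deletes the tags rather than capturing the text between them.

```python
import pandas as pd

# Sample data rebuilt from the question; how the real frame was built is an assumption.
df = pd.DataFrame(
    {
        "Title": [
            '<span class="o-label--tiny">VALEUR ÉNERGÉTIQUE</span>',
            '<span class="o-label--tiny">PROTÉINES</span>',
            '<span class="o-label--tiny">GLUCIDES</span>',
        ]
    }
)
df.index.name = "Num"

# Capture the text between the specific opening and closing tags.
specific = df["Title"].str.extract(
    r'<span class="o-label--tiny">(.*)</span>', expand=False
)

# Capture the text between the first '>' and the last '<', whatever the tag is.
generic = df["Title"].str.extract(r">(.*)<", expand=False)

# Alternative: remove every tag instead of capturing what lies between them.
stripped = df["Title"].str.replace(r"<[^>]+>", "", regex=True)

print(specific.tolist())
print(generic.tolist())
print(stripped.tolist())
```

All three approaches should give the same strings on this sample. On real data they can differ, for example when a cell contains nested tags or text outside the span.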

edited Oct 19, 2017 at 11:09

answered Oct 19, 2017 at 10:56

jezrael


6 Comments

Bharath M Shetty Over a year ago

I was about to add that as a comment. You edited it. I'll delete my answer.

ChiChi Over a year ago

@jezrael: I tried your code but neither version worked for me. I still got NaN instead of the correct strings.

ChiChi Over a year ago

Could there be something unusual about my strings that I have not considered?

jezrael Over a year ago

@ChiChi - I have no idea. Is it possible to send your real data to my email as a pickle file? df[['Title']].to_pickle('data.pkl')

ChiChi Over a year ago

@jezrael - yes, I will do this. Thanks very much for the offer to look more closely at it.
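
The thread does not say what caused the NaN values. One way to look for something unusual in the strings, as the comments above ask, might be to print each value with `repr()` and list the rows where the pattern fails. The data below is hypothetical and only illustrates differences that a printed frame can hide (a different quote style, a trailing newline, a missing value); it is not the asker's data.

```python
import pandas as pd

# Hypothetical values, not the asker's real data.
s = pd.Series(
    [
        '<span class="o-label--tiny">PROTÉINES</span>',
        "<span class='o-label--tiny'>GLUCIDES</span>\n",  # single quotes, trailing newline
        None,  # missing value
    ]
)

pattern = r'<span class="o-label--tiny">(.*)</span>'

# repr() shows quotes, escapes and whitespace exactly as they are stored.
for value in s:
    print(repr(value))

# str.extract returns NaN wherever the pattern does not match.
extracted = s.str.extract(pattern, expand=False)
failed = extracted.isna()

# Rows that did not match, for closer inspection.
print(s[failed].tolist())
```

Here the second row fails because the pattern expects double quotes around the class name, and the third fails because there is no string to search.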


PythonProgrammi · Accepted Answer · 2017-10-19 11:04:13Z

0

regex

I won't go into the DataFrame part, but I hope this is useful:

import re

stringa = """
0    <span class="o-label--tiny">VALEUR ÉNERGÉTIQUE</span>
1         <span class="o-label--tiny">PROTÉINES</span>
2          <span class="o-label--tiny">GLUCIDES</span>
"""

pattern1 = "[0-9]"   # row numbers
pattern = ">(.*)<"   # text between the tags

found = re.findall(pattern1, stringa)
found2 = re.findall(pattern, stringa)

# pair each row number with its extracted text
for number, text in zip(found, found2):
    print(number + " " + text)

output

0 VALEUR ÉNERGÉTIQUE
1 PROTÉINES
2 GLUCIDES
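
A possible variation, sketched here under the assumption that each row sits on its own line as in `stringa` above: a single pattern with two groups keeps each number tied to its own text. The separate `[0-9]` pattern would also pick up any digit that appeared inside a title. The names `row_pattern` and `rows` are illustrative.

```python
import re

stringa = """
0    <span class="o-label--tiny">VALEUR ÉNERGÉTIQUE</span>
1         <span class="o-label--tiny">PROTÉINES</span>
2          <span class="o-label--tiny">GLUCIDES</span>
"""

# Per line: the leading row number, one opening tag, then the text up to the next '<'.
row_pattern = re.compile(r"^(\d+)\s+<[^>]+>(.*?)<", re.MULTILINE)

# findall returns one (number, text) tuple per matching line.
rows = row_pattern.findall(stringa)

for number, text in rows:
    print(number, text)
```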

answered Oct 19, 2017 at 11:04

PythonProgrammi
