I have this:

                                                Title  
Num                                                      
0    <span class="o-label--tiny">VALEUR ÉNERGÉTIQUE</span>   
1         <span class="o-label--tiny">PROTÉINES</span>   
2          <span class="o-label--tiny">GLUCIDES</span> 

<class 'pandas.core.frame.DataFrame'> Num Index(['Title'], dtype='object')

This is what I want:

            Title  
Num                                                      
0  VALEUR ÉNERGÉTIQUE   
1           PROTÉINES   
2            GLUCIDES 

This is the regex I developed:

(<span class=\"o-label--tiny\">)([a-zA-Z]+\s*\w*)(</span>)

Testing it, I see that it matches the whole initial string and has groups for the different substrings. In the end, I want group(2) in my DataFrame column. (My examples below show the explicit regex, but I have also tried them with the re.compile result, which doesn't get me to my final result either.)
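For reference, the grouping described here can be checked on a single cell value with the standard `re` module. This is only a sketch on one string copied from the table above; the variable names are illustrative and not part of the original post.

```python
import re

# The pattern from the question, written as a raw string.
pattern = re.compile(r'(<span class="o-label--tiny">)([a-zA-Z]+\s*\w*)(</span>)')

# One cell value copied from the question's DataFrame.
cell = '<span class="o-label--tiny">VALEUR ÉNERGÉTIQUE</span>'
match = pattern.search(cell)

opening_tag = match.group(1)  # the opening span tag
label = match.group(2)        # the text the question wants to keep
closing_tag = match.group(3)  # the closing span tag
print(label)
```

Note that the middle group allows at most two words, so a label with three or more words would not match this pattern.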

This is what I have tried:

df['Title'] = df['Title'].replace({'<span class=\"o-label--tiny\">': ''}, inplace=True, regex=True)

The result:

   Title                                                
Num                                                         
0    None  
1    None  
2    None  
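The None values in this result are consistent with `replace(..., inplace=True)` returning None in the pandas version used here, with that None then being assigned back to the column. A minimal sketch of the same idea without `inplace`, assuming the cells are plain Python strings (the DataFrame below is rebuilt by hand from the question's sample, and the broader tag pattern is my own choice, not the asker's):

```python
import pandas as pd

# Sample data rebuilt by hand from the question; assumes plain str cells.
df = pd.DataFrame(
    {"Title": [
        '<span class="o-label--tiny">VALEUR ÉNERGÉTIQUE</span>',
        '<span class="o-label--tiny">PROTÉINES</span>',
        '<span class="o-label--tiny">GLUCIDES</span>',
    ]}
)
df.index.name = "Num"

# No inplace=True: replace() returns a new Series, which is assigned back.
# The pattern removes both the opening and the closing span tag.
df["Title"] = df["Title"].replace({r"</?span[^>]*>": ""}, regex=True)
print(df)
```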

Try number 2:

df['Title'] = df['Title'].str.replace('<span class=\"o-label--tiny\">', repl = '')

Result number 2:

   Title  
Num                                                         
0     NaN  
1     NaN  
2     NaN

Try number 3:

df['Title'] = df[lambda df: df.columns[0]].str.extract('(>[a-zA-Z]+\s*\w*)', expand=False)

Result 3:

   Title  
Num                                                         
0     NaN  
1     NaN  
2     NaN

I really don't see what I am doing wrong, and any help getting to my desired result would be appreciated. Thank you!

2 Answers

Use str.extract:

df['Title']=df['Title'].str.extract('<span class=\"o-label--tiny\">(.*)</span>',expand=False)
print (df)
                  Title
Num                    
0    VALEUR ÉNERGÉTIQUE
1             PROTÉINES
2              GLUCIDES

If the tags or classes may differ:

df['Title'] = df['Title'].str.extract('>(.*)<',expand=False)
print (df)
                  Title
Num                    
0    VALEUR ÉNERGÉTIQUE
1             PROTÉINES
2              GLUCIDES
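One caveat with the generic pattern: `.*` is greedy, so it captures everything between the first `>` and the last `<` in a cell. That is fine for a single span as in the question, but it would pick up inner tags if the markup were nested. The sketch below compares it with a stricter character class; the nested example string is hypothetical and does not come from the question's data.

```python
import pandas as pd

titles = pd.Series([
    '<span class="o-label--tiny">GLUCIDES</span>',  # shape used in the question
    '<div><span>GLUCIDES</span></div>',             # hypothetical nested markup
])

# Greedy: everything between the first '>' and the last '<'.
greedy = titles.str.extract(r">(.*)<", expand=False)

# Stricter: a run of characters containing no angle brackets at all.
strict = titles.str.extract(r">([^<>]+)<", expand=False)

print(greedy.tolist())
print(strict.tolist())
```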

6 Comments

I was about to add that as a comment, but you have already edited it in. I'll delete my answer.
@jezrael: I tried your code, but neither version worked for me. I still got NaN instead of the correct strings.
Could there be something unusual about my strings that I have not considered?
@ChiChi - I have no idea. Could you email me your real data as a pickle file? df[['Title']].to_pickle('data.pkl')
@jezrael - Yes, I will do this. Thanks very much for the offer to look at it more closely.
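The thread does not establish why the suggested code still returned NaN. One possibility worth ruling out is that the cells are not plain strings: in an object column, the `.str` methods return NaN for any element that is not a `str`. The sketch below reproduces that behaviour with a made-up `FakeTag` class, a hypothetical stand-in for whatever object the real column might hold; it is an assumption for illustration, not a diagnosis of the asker's data.

```python
import pandas as pd


class FakeTag:
    """Hypothetical stand-in for a parsed HTML node (not a real library class)."""

    def __init__(self, html):
        self.html = html

    def __str__(self):
        return self.html


df = pd.DataFrame({"Title": [
    FakeTag('<span class="o-label--tiny">VALEUR ÉNERGÉTIQUE</span>'),
    FakeTag('<span class="o-label--tiny">PROTÉINES</span>'),
]})

# Inspect what the cells really are; plain text would show <class 'str'>.
print(df["Title"].map(type).unique())

# .str methods skip non-string cells and yield NaN for them.
broken = df["Title"].str.extract(r">(.*)<", expand=False)

# Converting to str first gives the regex real text to work on.
fixed = df["Title"].astype(str).str.extract(r">(.*)<", expand=False)
print(fixed.tolist())
```

If `map(type)` reports only `str`, this explanation does not apply and the cause lies elsewhere.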

regex

I won't go into the DataFrame part, but I hope this is useful:

import re

stringa = """
0    <span class="o-label--tiny">VALEUR ÉNERGÉTIQUE</span>
1         <span class="o-label--tiny">PROTÉINES</span>
2          <span class="o-label--tiny">GLUCIDES</span>
"""

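# Row numbers: every digit in the text (assumes digits appear only as row labels)
pattern1 = r"[0-9]"
# Label: the text between the first '>' and the last '<' on each line
pattern = r">(.*)<"

found = re.findall(pattern1, stringa)
found2 = re.findall(pattern, stringa)

# Pair each row number with its label
for num, label in zip(found, found2):
    print(num + " " + label)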

Output:

0 VALEUR ÉNERGÉTIQUE
1 PROTÉINES
2 GLUCIDES
