I extracted a raw string from a Q&A forum. I have a string like this:

s = 'Take about 2 + <font color="blue"><font face="Times New Roman">but double check with teacher <font color="green"><font face="Arial">before you do'

I want to extract this substring "<font color="blue"><font face="Times New Roman">" and assign it to a new variable. I am able to remove it with regex but I don't know how to assign it to a new variable. I am new to regex.

import re
s1 = re.sub('<.*?>', '', s)

This removes the substring, but I'd like to keep the removed substring for the record, ideally by assigning it to a variable.

How can I do this? I would prefer a solution that uses regular expressions.

Why don't you use an HTML parser like BeautifulSoup? Commented Feb 10, 2020 at 5:14

2 Answers

Although bs4 is more appropriate for web scraping, if you are okay with regex for your case, you could do the following:

>>> import re
>>> s = 'Take about 2 + <font color="blue"><font face="Times New Roman">but double check with teacher <font color="green"><font face="Arial">before you do'
>>> regex = re.compile('<.*?>')
>>> regex.findall(s)
['<font color="blue">', '<font face="Times New Roman">', '<font color="green">', '<font face="Arial">']
>>> regex.sub('', s)
'Take about 2 + but double check with teacher before you do'
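
If you only want the first run of back-to-back tags as a single string (as in the question), something along these lines might work. This is a minimal sketch, not the only way to do it: the pattern `(?:<[^>]*>)+` and the names `tag_run`, `removed`, and `remaining` are my own choices for illustration, and it assumes the tags never contain a literal `>` inside an attribute value.

```python
import re

s = ('Take about 2 + <font color="blue"><font face="Times New Roman">'
     'but double check with teacher <font color="green"><font face="Arial">'
     'before you do')

# One or more tags sitting directly next to each other, e.g. "<...><...>".
# Assumption: no ">" appears inside an attribute value.
tag_run = re.compile(r'(?:<[^>]*>)+')

# search() returns the first match, or None if there is no tag at all.
match = tag_run.search(s)
removed = match.group(0) if match else ''

# count=1 removes only that first run and leaves later tags in place.
remaining = tag_run.sub('', s, count=1)

print(removed)
print(remaining)
```

Drop `count=1` if you want every tag stripped from `remaining`, as in the `regex.sub('', s)` call above.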

1 Comment

Works like a charm. Thank you @saurabh

Regex is not exactly the easiest tool to parse HTML components. You can try using BeautifulSoup to parse the components and make your substring.

from bs4 import BeautifulSoup

s = """Take about 2 + <font color="blue">
       <font face="Times New Roman">but double check with teacher <font color="green">
       <font face="Arial">before you do"""


soup = BeautifulSoup(s, "html.parser")

Print the parsed HTML with print(soup.prettify()):

Take about 2 +
<font color="blue">
 <font face="Times New Roman">
  but double check with teacher
  <font color="green">
   <font face="Arial">
    before you do
   </font>
  </font>
 </font>
</font>

Extract components:

soup.font.font['face']
> 'Times New Roman'
soup.font["color"]
> 'blue'

Now make and save your substring as a variable:

variable = f'<font color="{soup.font["color"]}"><font face="{soup.font.font["face"]}">'

This will give you:

"<font color="blue"><font face="Times New Roman">"

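If installing bs4 is not an option, the standard library's `html.parser` can also hand back the start tags exactly as they were written, which avoids rebuilding the string by hand. The following is only a rough sketch under that assumption: the class name `StartTagCollector` and the variable `first_two` are hypothetical names picked for this example.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Collects the raw text of every start tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the source text of the tag just opened.
        self.start_tags.append(self.get_starttag_text())


s = ('Take about 2 + <font color="blue"><font face="Times New Roman">'
     'but double check with teacher <font color="green"><font face="Arial">'
     'before you do')

collector = StartTagCollector()
collector.feed(s)
collector.close()

# Join the first two start tags into the substring asked for in the question.
first_two = ''.join(collector.start_tags[:2])
print(first_two)
```

Unlike the regex approach, this takes the first two tags wherever they occur, so check that they are actually adjacent in your input if that matters.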