Python string operation, extract text between html tags

Question

I have a string:

<font face="ARIAL,HELVETICA" size="-2">  
JUL 28         </font>

(It outputs over two lines, so there must be a \n in there.)

I wish to extract the string that's in between the <font></font> tags. In this case, it's JUL 28, but it might be another date or some other number.

What is the best way to extract the value from between the font tags? I was thinking I could extract everything between "> and </.

edit: second question removed.

You're probably right. Let's ignore the second one. I'll worry about that later.
– Flux Capacitor, Commented Oct 27, 2011 at 3:58

kojiro · Accepted Answer · 2011-10-27 04:06:06Z

16

While it may be possible to parse arbitrary HTML with regular expressions, it's often a death trap. There are great tools out there for parsing HTML, including BeautifulSoup, which is a Python lib that can handle broken as well as good HTML fairly well.

>>> from bs4 import BeautifulSoup as BSHTML  # BeautifulSoup 4; the old "BeautifulSoup" package is Python 2 only
>>> BS = BSHTML("""
... <font face="ARIAL,HELVETICA" size="-2">  
... JUL 28         </font>""",
... "html.parser")
>>> BS.font.contents[0].strip()
'JUL 28'

Then you just need to parse the date:

>>> from datetime import datetime
>>> datetime.strptime(BS.font.contents[0].strip(), '%b %d')
datetime.datetime(1900, 7, 28, 0, 0)
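
Note that the parsed year is 1900 because the string carries no year. A minimal sketch of one way around that, assuming the year is known from somewhere else (the helper name `parse_month_day` and the year 2011 are illustrative, not part of the original answer), is to append the year before parsing:

```python
from datetime import datetime


def parse_month_day(text, year):
    """Parse text such as 'JUL 28' using an explicitly supplied year.

    Assumes an English locale, since %b matches locale-specific month names.
    """
    # Strip the surrounding whitespace/newline, then parse month, day and year together.
    return datetime.strptime(f"{text.strip()} {year}", "%b %d %Y")


parsed = parse_month_day("  \nJUL 28         ", 2011)
print(parsed.date())
```

Supplying the year up front also lets a leap day such as FEB 29 parse, which fails against the default year 1900.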

answered Oct 27, 2011 at 4:06

kojiro


2 Comments

Flux Capacitor Over a year ago

Nice! This seems much less complicated than the regex way.

kojiro Over a year ago

@FluxCapacitor A word of warning: My second argument to strptime above is actually a locale-specific example. Please refer to the documentation for more details if you need a locale-agnostic or different locale solution.

fahhem · Accepted Answer · 2011-10-27 04:00:28Z

6

You have a bunch of options here. You could go for a full XML/HTML parser like lxml, though you seem to want a domain-specific solution. I'd go with a multiline regex:

import re
# re.S lets "." match the newline inside the tag; re.M is not needed without ^ or $
rex = re.compile(r'<font.*?>(.*?)</font>', re.S)
...
data = """<font face="ARIAL,HELVETICA" size="-2">  
JUL 28         </font>"""

# search() scans the whole string; match() only succeeds if the tag starts at position 0
match = rex.search(data)
if match:
    text = match.group(1).strip()

Now that you have text, you can turn it into a date pretty easily:

from datetime import datetime
date = datetime.strptime(text, "%b %d")
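
For anyone who wants to see what each part of the pattern is doing, here is a sketch of the same regex written with re.VERBOSE so it can carry comments. The names `FONT_RE` and `font_texts` are made up for illustration, and findall is used instead of a single match so that several font tags in one string are handled:

```python
import re

# Same pattern as above, spread over several lines so each piece can be commented.
FONT_RE = re.compile(
    r"""
    <font      # literal start of the opening tag
    .*?        # the attributes, matched lazily ...
    >          # ... up to the first '>' that closes the opening tag
    (.*?)      # group 1: the text we want, as little as possible
    </font>    # the closing tag
    """,
    re.S | re.X,  # re.S: "." also matches newlines; re.X: allow whitespace and comments
)


def font_texts(html):
    """Return the stripped text of every <font>...</font> pair in html."""
    return [text.strip() for text in FONT_RE.findall(html)]


data = """<font face="ARIAL,HELVETICA" size="-2">  
JUL 28         </font>"""
print(font_texts(data))  # ['JUL 28']
```

This is still a regex over HTML, so it will not cope with nested font tags or unusual markup; it is only meant to make the pattern readable.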

answered Oct 27, 2011 at 4:00

fahhem


2 Comments

fahhem Over a year ago

You commented on AnthonyHurst's answer that this is from a website. I've used lxml's HTML parsing with a lot of success recently, and I highly recommend it.

Flux Capacitor Over a year ago

Thanks! I had seen something similar with regular expressions in another question, but wasn't able to make it work. Your solution worked for me perfectly. The downside is that I only sort of understand what's going on with it.

yasouser · Accepted Answer · 2020-02-05 21:36:44Z

2

Python's standard library includes an HTML parser: the HTMLParser module in Python 2, renamed html.parser in Python 3. Also see the following Stack Overflow question, which is very similar to what you are looking for:

How can I use the python HTMLParser library to extract data from a specific div tag?
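
As a rough sketch of what that could look like for the font tag in the question, using Python 3's html.parser (the class name `FontTextParser` and the helper `extract_font_text` are invented here for illustration and do not come from the linked question):

```python
from html.parser import HTMLParser


class FontTextParser(HTMLParser):
    """Collect the text found inside <font> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # number of <font> tags currently open
        self._buf = []    # text pieces seen inside the current <font>
        self.texts = []   # one stripped string per outermost <font> element

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "font" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.texts.append("".join(self._buf).strip())
                self._buf = []

    def handle_data(self, data):
        # Only keep text while we are inside a <font> element.
        if self._depth:
            self._buf.append(data)


def extract_font_text(html):
    """Return the stripped text of each outermost <font> element in html."""
    parser = FontTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


data = """<font face="ARIAL,HELVETICA" size="-2">  
JUL 28         </font>"""
print(extract_font_text(data))  # ['JUL 28']
```

It is more code than the regex, but it needs no third-party package and does not depend on how the attributes inside the tag are written.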

edited Feb 5, 2020 at 21:36

answered Oct 27, 2011 at 4:03

yasouser


2 Comments

Davide Andrea Over a year ago

Broken link. It should be docs.python.org/3/library/html.parser.html or docs.python.org/2/library/htmlparser.html#module-HTMLParser

yasouser Over a year ago

Fixed the link. Thanks

Óscar López · Accepted Answer · 2011-10-27 04:03:27Z

1

Or, you could simply use Beautiful Soup:

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping

answered Oct 27, 2011 at 4:03

Óscar López


1 Comment

Brendan Long Over a year ago

Probably overkill but a good choice if there's more HTML parsing to be done.

AnthonyHurst · Accepted Answer · 2011-10-27 03:51:19Z

0

Is grep an option?

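grep -E "<[^>]*>(.*)</[^>]*>" file

The (.*) should match your content (-E is needed for the parentheses to act as a group). Note that grep works one line at a time, so this only matches when the opening and closing tags are on the same line.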

answered Oct 27, 2011 at 3:51

AnthonyHurst


2 Comments

Flux Capacitor Over a year ago

I'm doing all this in Python... I used scrapy to scrape a webpage and drill down to arrive at the string above.

AnthonyHurst Over a year ago

Sorry, then I can't be of more help. You could always use the re (regular expression) library to grab the same thing.

Victor Olex · Accepted Answer · 2011-10-27 04:03:45Z

0

Use Scrapy's XPath selectors as documented at http://doc.scrapy.org/en/0.10.3/topics/selectors.html

Alternatively, you can use an HTML parser such as BeautifulSoup, especially if you want to operate on the document in an object-oriented manner.

http://pypi.python.org/pypi/BeautifulSoup/3.2.0

answered Oct 27, 2011 at 4:03

Victor Olex
