11

I have a string:

<font face="ARIAL,HELVETICA" size="-2">  
JUL 28         </font>

(It outputs over two lines, so there must be a \n in there.)

I wish to extract the string that's in between the <font></font> tags. In this case, it's JUL 28, but it might be another date or some other number.

1) What is the best way to extract the value from between the font tags? I was thinking I could extract everything between "> and </.

edit: second question removed.

3
  • Note, the <font face="........> tag is not ALWAYS the same. Commented Oct 27, 2011 at 3:47
  • 1
    This should probably be two separate questions.. Commented Oct 27, 2011 at 3:50
  • You're probably right. Let's ignore the second one. I'll worry about that later. Commented Oct 27, 2011 at 3:58

6 Answers

16

While it may be possible to parse arbitrary HTML with regular expressions, it's often a death trap. There are great tools out there for parsing HTML, including BeautifulSoup, which is a Python lib that can handle broken as well as good HTML fairly well.

>>> from BeautifulSoup import BeautifulSoup as BSHTML
>>> BS = BSHTML("""
... <font face="ARIAL,HELVETICA" size="-2">  
... JUL 28         </font>"""
... )
>>> BS.font.contents[0].strip()
u'JUL 28'

Then you just need to parse the date:

>>> from datetime import datetime
>>> datetime.strptime(BS.font.contents[0].strip(), '%b %d')
datetime.datetime(1900, 7, 28, 0, 0)

2 Comments

Nice! This seems much less complicated than the regex way.
@FluxCapacitor A word of warning: My second argument to strptime above is actually a locale-specific example. Please refer to the documentation for more details if you need a locale-agnostic or different locale solution.
6

You have a bunch of options here. You could go for an all-out xml parser like lxml, though you seem to want a domain-specific solution. I'd go with a multiline regex:

import re
rex = re.compile(r'<font.*?>(.*?)</font>',re.S|re.M)
...
data = """<font face="ARIAL,HELVETICA" size="-2">  
JUL 28         </font>"""

# search() finds the tag anywhere in the string; match() only checks the start
match = rex.search(data)
if match:
    text = match.group(1).strip()
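
For anyone who only half follows the pattern, here is a minimal sketch of the same idea written with re.VERBOSE so each piece carries its own comment. The names FONT_RE and extract_font_text are made up for illustration, and the attribute part is tightened to [^>]* as an assumption about what the tag looks like; it is not the exact pattern from the answer.

```python
import re

# Same idea as the pattern above, spelled out with re.VERBOSE so that
# whitespace in the pattern is ignored and each part can be commented.
FONT_RE = re.compile(
    r"""
    <font\b[^>]*>   # opening tag; [^>]* skips whatever attributes it carries
    (.*?)           # group 1: the content, non-greedy so it stops at the first close tag
    </font>         # closing tag
    """,
    re.DOTALL | re.IGNORECASE | re.VERBOSE,  # DOTALL lets . match the newline
)


def extract_font_text(html):
    """Return the stripped text of the first <font> element, or None."""
    match = FONT_RE.search(html)
    return match.group(1).strip() if match else None


sample = '<font face="ARIAL,HELVETICA" size="-2">  \nJUL 28         </font>'
print(extract_font_text(sample))  # JUL 28
```

Because the attributes are skipped rather than matched literally, this should keep working when the face or size values differ, though it will still trip over nested font tags.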

Now that you have text, you can turn it into a date pretty easily:

from datetime import datetime
date = datetime.strptime(text, "%b %d")
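
Note that a format without a year leaves the result in 1900. If the year is known from elsewhere, one option is to append it before parsing. The sketch below assumes that; parse_month_day and its year parameter are hypothetical names, and %b still depends on the current locale's month abbreviations.

```python
from datetime import datetime


def parse_month_day(text, year):
    """Parse text such as 'JUL 28' into a datetime in the given year."""
    # Appending the year before parsing avoids the 1900 default and lets
    # 'FEB 29' parse when the supplied year is a leap year.
    return datetime.strptime(f"{text.strip()} {year}", "%b %d %Y")


print(parse_month_day("JUL 28", 2011))  # 2011-07-28 00:00:00
```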

2 Comments

You commented on AnthonyHurst's answer that this is from a website. I've used lxml's html parsing with a lot of success recently, I highly recommend it.
Thanks! I had seen something similar with regular expressions in another question, but wasn't able to make it work. Your solution worked for me perfectly. The downside is that I only sort of understand what's going on with it.
2

Python has a built-in library called HTMLParser (html.parser in Python 3). Also see the following Stack Overflow question, which is very similar to what you are looking for:

How can I use the python HTMLParser library to extract data from a specific div tag?
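
As a rough sketch of how that could look for the font tag in this question, using the Python 3 module name html.parser. The class name FontTextParser and its attributes are invented for this example and are not taken from the linked question.

```python
from html.parser import HTMLParser


class FontTextParser(HTMLParser):
    """Collect the text found inside each top-level <font> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <font> tags are currently open
        self.texts = []     # one stripped string per top-level <font> element
        self._chunks = []   # text pieces seen inside the current element

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, and the attributes are ignored here,
        # so it does not matter what face or size the tag carries.
        if tag == "font":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "font" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.texts.append("".join(self._chunks).strip())
                self._chunks = []

    def handle_data(self, data):
        if self.depth > 0:
            self._chunks.append(data)


parser = FontTextParser()
parser.feed('<font face="ARIAL,HELVETICA" size="-2">  \nJUL 28         </font>')
parser.close()
print(parser.texts)  # ['JUL 28']
```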

2 Comments

Fixed the link. Thanks
1

Or, you could simply use Beautiful Soup:

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping

1 Comment

Probably overkill but a good choice if there's more HTML parsing to be done.
0

Is grep an option?

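grep -E "<[^>]*>(.*)</[^>]*>" file

The (.*) should match your content, provided the opening tag, the content, and the closing tag all sit on one line, since grep matches line by line. The -E flag is needed for the parentheses to act as a group.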

2 Comments

I'm doing all this in Python... I used scrapy to scrape a webpage and drill down to arrive at the string above.
Sorry, then I couldn't assist you better. You could always use the re (regular expression) library to grab the same thing.
0

Use Scrapy's XPath selectors as documented at http://doc.scrapy.org/en/0.10.3/topics/selectors.html

Alternatively, you can use an HTML parser such as BeautifulSoup, especially if you want to operate on the document in an object-oriented manner.

http://pypi.python.org/pypi/BeautifulSoup/3.2.0
