>>> text = '<a data-lecture-id="47"\n   data-modal-iframe="https://class.coursera.org/neuralnets-2012-001/lecture/view?lecture_id=47"\n   href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n   data-modal=".course-modal-frame"\n   rel="lecture-link"\n   class="lecture-link">\nAnother diversion: The softmax output function [7 min]</a>'

>>> import re
>>> re.findall(r'data-lecture-id="(\d+)"|(.*)</a>', text)
[('47', ''), ('', 'Another diversion: The softmax output function [7 min]')]

How do I extract the data as a flat list, like this:

['47', 'Another diversion: The softmax output function [7 min]']

I think there should be a smarter regular expression for this.

    Is there a reason it has to be a smarter regex, rather than, say, not using a regex in the first place? Commented Mar 27, 2013 at 7:52

3 Answers


It is not recommended to parse HTML with regular expressions. Since this snippet is well-formed XML, you can try the xml.dom.minidom module instead:

from xml.dom.minidom import parseString

xml = parseString('<a data-lecture-id="47"\n   data-modal-iframe="https://class.coursera.org/neuralnets-2012-001/lecture/view?lecture_id=47"\n   href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n   data-modal=".course-modal-frame"\n   rel="lecture-link"\n   class="lecture-link">\nAnother diversion: The softmax output function [7 min]</a>')
anchor = xml.getElementsByTagName("a")[0]
print anchor.getAttribute("data-lecture-id"), anchor.childNodes[0].data
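
For real pages, which are often not well-formed XML, the standard library's html.parser is a more forgiving alternative to minidom. The following is only a sketch of that swapped-in approach, written for Python 3; the class name LectureLinkParser and its attributes are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser


class LectureLinkParser(HTMLParser):
    """Collect [lecture id, link text] pairs from <a data-lecture-id=...> tags.

    Hypothetical helper for illustration, not a library class.
    """

    def __init__(self):
        super().__init__()
        self.lectures = []        # finished [id, title] pairs
        self._current_id = None   # id of the <a> we are inside, if any
        self._chunks = []         # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            lecture_id = dict(attrs).get("data-lecture-id")
            if lecture_id is not None:
                self._current_id = lecture_id
                self._chunks = []

    def handle_data(self, data):
        if self._current_id is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_id is not None:
            title = "".join(self._chunks).strip()
            self.lectures.append([self._current_id, title])
            self._current_id = None


text = (
    '<a data-lecture-id="47"\n'
    '   data-modal-iframe="https://class.coursera.org/neuralnets-2012-001/lecture/view?lecture_id=47"\n'
    '   href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n'
    '   data-modal=".course-modal-frame"\n'
    '   rel="lecture-link"\n'
    '   class="lecture-link">\n'
    'Another diversion: The softmax output function [7 min]</a>'
)

parser = LectureLinkParser()
parser.feed(text)
parser.close()
print(parser.lectures)
```

Each link ends up as its own [id, title] pair, so the same parser can be fed a whole page with many lecture links. The strip() call removes the newline that sits between the opening tag and the title.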

You can use itertools to flatten the tuples and drop the empty strings:

import re
from itertools import chain, ifilter

raw_found = re.findall(r'data-lecture-id="(\d+)"|(.*)</a>', text)

# simple
found = [x for x in chain(*raw_found) if x]

# or faster
found = [x for x in ifilter(None, chain(*raw_found))]

# or more compact, also just as fast
found = list(ifilter(None, chain(*raw_found)))

print found

Output:

['47', 'Another diversion: The softmax output function [7 min]']
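
Note that itertools.ifilter exists only in Python 2. On Python 3 the built-in filter is already lazy, so a roughly equivalent sketch, assuming the same text and regex as in the question, could look like this:

```python
import re
from itertools import chain

text = (
    '<a data-lecture-id="47"\n'
    '   data-modal-iframe="https://class.coursera.org/neuralnets-2012-001/lecture/view?lecture_id=47"\n'
    '   href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n'
    '   data-modal=".course-modal-frame"\n'
    '   rel="lecture-link"\n'
    '   class="lecture-link">\n'
    'Another diversion: The softmax output function [7 min]</a>'
)

# Each match is a 2-tuple in which one group is the empty string.
raw_found = re.findall(r'data-lecture-id="(\d+)"|(.*)</a>', text)

# Flatten the tuples, then drop the empty (falsy) strings.
found = list(filter(None, chain.from_iterable(raw_found)))

print(found)
# ['47', 'Another diversion: The softmax output function [7 min]']
```

chain.from_iterable(raw_found) does the same flattening as chain(*raw_found) without unpacking the whole list into arguments first.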

3 Comments

I know some people hate filter(None, it), but I think it's more readable than [x for x in it if x]. (Not a complaint/correction/whatever; the OP should know how to read/write it both ways.)
@abarnert Honestly, I've never seen that used before. I must admit it seems more Pythonic; I'll have to research the advantages and disadvantages of both. Or itertools.ifilter, which definitely looks nice there.
Well, the main disadvantage is that not everyone knows what it means. There's also the fact that many people who come from certain functional languages think it's a bastardization of what filter should mean, while many who don't come from those languages hate filter (and map and reduce) in the first place. The only advantage is that it's more concise, and easier to read if you already know what it means.

I found a solution myself:

>>> re.findall(r'data-lecture-id="(\d+)"[\s\S]+>([\s\S]+)</a>', text)
[('47', '\nAnother diversion: The softmax output function [7 min]')]

Looks better, but I still have to iterate over the result to extract a simple list, and the title keeps its leading newline...
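
If there is only one link to match, one way to avoid the extra iteration is re.search plus groups() instead of findall. This is a sketch for Python 3 that swaps in a slightly different pattern from the one above; it assumes no attribute value contains a '>' character.

```python
import re

text = (
    '<a data-lecture-id="47"\n'
    '   data-modal-iframe="https://class.coursera.org/neuralnets-2012-001/lecture/view?lecture_id=47"\n'
    '   href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n'
    '   data-modal=".course-modal-frame"\n'
    '   rel="lecture-link"\n'
    '   class="lecture-link">\n'
    'Another diversion: The softmax output function [7 min]</a>'
)

# [^>]* skips the remaining attributes up to the end of the opening tag.
# \s* on both sides of the lazy group trims whitespace around the title.
# re.DOTALL lets '.' match newlines in case the title spans several lines.
pattern = re.compile(r'data-lecture-id="(\d+)"[^>]*>\s*(.*?)\s*</a>', re.DOTALL)

match = pattern.search(text)
found = list(match.groups()) if match else []

print(found)
# ['47', 'Another diversion: The softmax output function [7 min]']
```

With several links in one page, pattern.findall(text) would return one (id, title) tuple per link, which is usually a more useful shape than a single flat list.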

1 Comment

If you want to "flatten" a two-deep sequence like this, that's itertools.chain.from_iterable(x) (or, if it's an actual sequence rather than an arbitrary iterable, just itertools.chain(*x)). Serdalis's answer already explains this.
