Python regex find two groups

Question

>>> text = '<a data-lecture-id="47"\n   data-modal-iframe="https://class.coursera.org/neuralnets-2012-001/lecture/view?lecture_id=47"\n   href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n   data-modal=".course-modal-frame"\n   rel="lecture-link"\n   class="lecture-link">\nAnother diversion: The softmax output function [7 min]</a>'

>>> import re
>>> re.findall(r'data-lecture-id="(\d+)"|(.*)</a>', text)
[('47', ''), ('', 'Another diversion: The softmax output function [7 min]')]

How do I extract the data so that it looks like this?

['47', 'Another diversion: The softmax output function [7 min]']

I think there should be a smarter regular expression for this.

Is there a reason it has to be a smarter regex, rather than, say, not using a regex in the first place?
– abarnert, Commented Mar 27, 2013 at 7:52

Community · Accepted Answer · 2017-05-23 12:11:06Z

2

Parsing HTML with regular expressions is not recommended. You can try the xml.dom.minidom module instead:

from xml.dom.minidom import parseString

xml = parseString('<a data-lecture-id="47"\n   data-modal-iframe="https://class.coursera.org/neuralnets-2012-001/lecture/view?lecture_id=47"\n   href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n   data-modal=".course-modal-frame"\n   rel="lecture-link"\n   class="lecture-link">\nAnother diversion: The softmax output function [7 min]</a>')
anchor = xml.getElementsByTagName("a")[0]
print anchor.getAttribute("data-lecture-id"), anchor.childNodes[0].data
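
For readers on Python 3, the same idea might look roughly like the sketch below. It assumes the snippet is well-formed XML (minidom raises a parse error otherwise), uses a shortened copy of the question's markup, and wraps the lookup in a hypothetical helper named extract_lecture. The strip() call is an addition here, to drop the newline that precedes the link text.

```python
from xml.dom.minidom import parseString

# Shortened copy of the markup from the question (an assumption for brevity).
text = (
    '<a data-lecture-id="47"\n'
    '   href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n'
    '   class="lecture-link">\n'
    'Another diversion: The softmax output function [7 min]</a>'
)


def extract_lecture(markup):
    """Return [lecture id, link text] for the first <a> element (hypothetical helper)."""
    doc = parseString(markup)
    anchor = doc.getElementsByTagName("a")[0]
    # firstChild is the text node inside the anchor; strip the surrounding whitespace.
    return [anchor.getAttribute("data-lecture-id"), anchor.firstChild.data.strip()]


print(extract_lecture(text))
```

This prints ['47', 'Another diversion: The softmax output function [7 min]'], which is the flat list the question asks for.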

edited May 23, 2017 at 12:11

answered Mar 27, 2013 at 7:51

A. Rodas

Serdalis · Accepted Answer · 2013-03-27 08:09:56Z

2

You can flatten the result with itertools and drop the empty strings:

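import re
# Python 2 only: on Python 3, itertools.ifilter is gone; use the built-in filter
from itertools import chain, ifilter

raw_found = re.findall(r'data-lecture-id="(\d+)"|(.*)</a>', text)

# simple: flatten the tuples and keep only the non-empty strings
found = [x for x in chain(*raw_found) if x]

# or faster: ifilter(None, ...) drops the falsy (empty) items
found = [x for x in ifilter(None, chain(*raw_found))]

# or more compact, also just as fast
found = list(ifilter(None, chain(*raw_found)))

print found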

Output:

['47', 'Another diversion: The softmax output function [7 min]']
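
On Python 3 the same flattening could be sketched as follows. This is only an illustration, assuming a shortened copy of the question's markup; it swaps ifilter for the built-in filter and uses chain.from_iterable instead of chain(*...).

```python
import re
from itertools import chain

# Shortened copy of the markup from the question (an assumption for brevity).
text = (
    '<a data-lecture-id="47"\n'
    '   class="lecture-link">\n'
    'Another diversion: The softmax output function [7 min]</a>'
)

# Each match is a 2-tuple in which the group that did not take part is ''.
raw_found = re.findall(r'data-lecture-id="(\d+)"|(.*)</a>', text)

# chain.from_iterable flattens the list of tuples;
# filter(None, ...) then drops the empty strings.
found = list(filter(None, chain.from_iterable(raw_found)))

print(found)
```

With this input, found is ['47', 'Another diversion: The softmax output function [7 min]'].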

edited Mar 27, 2013 at 8:09

answered Mar 27, 2013 at 7:34

Serdalis

3 Comments

abarnert Over a year ago

I know some people hate filter(None, it), but I think it's more readable than [x for x in it if x]. (Not a complaint/correction/whatever; the OP should know how to read/write it both ways.)

Serdalis Over a year ago

@abarnert Honestly, I've never seen that used before. I must admit it seems more Pythonic; I'll have to research the advantages and disadvantages of both. itertools.ifilter definitely looks good there too.

abarnert Over a year ago

Well, the main disadvantage is that not everyone knows what it means. There's also the fact that many people who come from certain functional languages think it's a bastardization of what filter should mean, while many who don't come from those languages hate filter (and map and reduce) in the first place. The only advantage is that it's more concise, and easier to read if you already know what it means.

WoooHaaaa · Accepted Answer · 2013-03-27 07:44:41Z

0

I found a solution myself:

>>> re.findall(r'data-lecture-id="(\d+)"[\s\S]+>([\s\S]+)</a>', text)
[('47', '\nAnother diversion: The softmax output function [7 min]')]

This looks better, but I still have to iterate over the result to get a simple list.
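
If only the first link matters, one way to avoid that extra iteration might be re.search with both groups in a single pattern, then list(match.groups()). The sketch below is an assumption-laden variant of the pattern above, not a general HTML solution: it uses re.DOTALL with a non-greedy .*? in place of [\s\S]+, and \s* around the second group to trim the leading newline. The markup is a shortened copy of the question's.

```python
import re

# Shortened copy of the markup from the question (an assumption for brevity).
text = (
    '<a data-lecture-id="47"\n'
    '   href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n'
    '   class="lecture-link">\n'
    'Another diversion: The softmax output function [7 min]</a>'
)

# re.DOTALL lets '.' match newlines; '.*?' stops at the first '>' that works;
# [^<>]*? captures the link text, with \s* trimming whitespace on both sides.
pattern = re.compile(r'data-lecture-id="(\d+)".*?>\s*([^<>]*?)\s*</a>', re.DOTALL)

match = pattern.search(text)
# groups() returns a tuple of both captures; convert it to a list.
found = list(match.groups()) if match else []

print(found)
```

Here found is ['47', 'Another diversion: The softmax output function [7 min]'], with no flattening step needed.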

answered Mar 27, 2013 at 7:44

WoooHaaaa

1 Comment

abarnert Over a year ago

If you want to "flatten" a two-deep sequence like this, that's itertools.chain.from_iterable(x) (or, if it's an actual sequence rather than an arbitrary iterable, just itertools.chain(*x)). Serdalis's answer already explains this.
