Parse HTML5 data-* attributes in Python?

Question

Is it possible to access the data-* attributes of an HTML element from Python? I'm using Scrapy, and the data-* attributes are not available in a selector object, though the raw data is available in the response object.

If I dump the HTML using wget -O page http://page.com, I can see the data in the file. It looks something like <a href="blah" data-mine="a;slfkjasd;fklajsdfl;ahsdf">blahlink</a>

I can edit the data-mine portion in an editor, so I know it's there ... it just seems like well-behaved parsers are dropping it.

As you can see, I'm confused.

Steven Almeroth · Accepted Answer · 2013-02-15 19:24:11Z

1

Yes. For some reason lxml does not expose these attribute names, and Talvalin is right that html5lib does:

stav@maia:~$ python
Python 2.7.3 (default, Aug  1 2012, 05:14:39) [GCC 4.6.3] on linux2
>>> import html5lib
>>> html = '''<a href="blah" target="_blank" data-mine="a;slfkjasd;fklajsdfl;ahsdf"
... data-yours="truly">blahlink</a>'''
>>> for x in html5lib.parse(html, treebuilder='lxml').xpath('descendant::*/@*'):
...     print '%s = "%s"' % (x.attrname, x)
...
href = "blah"
target = "_blank"
data-mine = "a;slfkjasd;fklajsdfl;ahsdf"
data-yours = "truly"
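
For comparison, here is a minimal sketch that lists the same attributes using only the standard library's html.parser on Python 3. It is not what the session above used; the class name AttrCollector and the data_attrs dictionary are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: record every attribute of every start tag."""

    def __init__(self):
        super().__init__()
        self.attrs = []  # list of (tag, attribute name, attribute value)

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, names lowercased
        for name, value in attrs:
            self.attrs.append((tag, name, value))


# Same sample markup as in the interactive session above
html = '''<a href="blah" target="_blank" data-mine="a;slfkjasd;fklajsdfl;ahsdf"
data-yours="truly">blahlink</a>'''

collector = AttrCollector()
collector.feed(html)
collector.close()

# Print every attribute, then keep only the data-* ones in a dict
for _tag, name, value in collector.attrs:
    print('%s = "%s"' % (name, value))

data_attrs = {
    name: value
    for _tag, name, value in collector.attrs
    if name.startswith("data-")
}
```

html.parser is more lenient than a full HTML5 parser such as html5lib, so on badly broken markup the two may disagree; for well-formed tags like the one in the question, the data-* attributes come through unchanged.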


user67416 · Accepted Answer · 2013-02-16 00:40:15Z

1

I did it like this without using a third-party library:

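import re

# `response` is the Scrapy response object passed to the spider callback.
# Note: on Python 3, response.body is bytes; search response.text instead,
# or compile the pattern from a bytes literal.
data_email_pattern = re.compile(r'data-email="([^"]+)"')
match = data_email_pattern.search(response.body)
if match:
    # group(1) is the attribute value between the double quotes
    print(match.group(1))
    ...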
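
The same idea can be sketched as a self-contained example that runs without Scrapy, generalized from data-email to any data-* attribute. The pattern name DATA_ATTR and the sample string (taken from the question) are assumptions for illustration, and the pattern only handles double-quoted values, so treat it as a quick hack rather than a real HTML parser.

```python
import re

# Hypothetical pattern: a data-* attribute name followed by a double-quoted value.
# The lookbehind stops it from matching inside a longer name such as x-data-foo.
DATA_ATTR = re.compile(r'(?<![\w-])(data-[\w-]+)\s*=\s*"([^"]*)"')

# Sample markup from the question
html = '<a href="blah" data-mine="a;slfkjasd;fklajsdfl;ahsdf">blahlink</a>'

# findall returns (name, value) tuples, which dict() turns into a mapping
found = dict(DATA_ATTR.findall(html))
print(found)
```

Single-quoted or unquoted attribute values, and text that merely looks like an attribute inside a script or comment, would all trip this up.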

1 Comment

Steven Almeroth Over a year ago

Beware the Zalgo.

Talvalin · Accepted Answer · 2013-02-15 07:24:13Z

0

I haven't tried it, but html5lib (http://code.google.com/p/html5lib/) can be used together with Beautiful Soup instead of Scrapy's built-in selectors.


1 Comment

Talvalin Over a year ago

Having said that, if you could provide a link to the page you're trying to scrape then I'll happily test it out now. :)
