Is it possible to access the data-* attributes of an HTML element from Python? I'm using Scrapy, and the data-* attributes don't seem to be available in a selector object, though the raw data is available in the Response object.

If I dump the HTML using wget -O page http://page.com, I can see the data in the file. It's something like <a href="blah" data-mine="a;slfkjasd;fklajsdfl;ahsdf">blahlink</a>

I can edit the data-mine portion in an editor, so I know it's there ... it just seems like well-behaved parsers are dropping it.

As you can see, I'm confused.

3 Answers

1

Yes. lxml does not expose the attribute names for some reason, and Talvalin is right that html5lib does:

stav@maia:~$ python
Python 2.7.3 (default, Aug  1 2012, 05:14:39) [GCC 4.6.3] on linux2
>>> import html5lib
>>> html = '''<a href="blah" target="_blank" data-mine="a;slfkjasd;fklajsdfl;ahsdf"
... data-yours="truly">blahlink</a>'''
>>> for x in html5lib.parse(html, treebuilder='lxml').xpath('descendant::*/@*'):
...     print '%s = "%s"' % (x.attrname, x)
...
href = "blah"
target = "_blank"
data-mine = "a;slfkjasd;fklajsdfl;ahsdf"
data-yours = "truly"

1

I did it like this without using a third-party library:

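import re

# Capture the value of a data-email attribute straight from the raw HTML.
data_email_pattern = re.compile(r'data-email="([^"]+)"')
# response.text is the decoded body (str); on Python 3, response.body is bytes
# and would not match a str pattern.
match = data_email_pattern.search(response.text)
if match:
    print(match.group(1))
    ...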
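
If a regex feels too brittle, the standard library's html.parser also reports data-* attributes, so it's another route that needs no third-party package. Here is a minimal sketch of that idea, not something from the original answer: DataAttrCollector and extract_data_attrs are hypothetical names, and the sample markup is the one from the question.

```python
from html.parser import HTMLParser


class DataAttrCollector(HTMLParser):
    """Collect every data-* attribute seen on a start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (tag, attribute name, value) tuples

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser lowercases names.
        for name, value in attrs:
            if name.startswith("data-"):
                self.found.append((tag, name, value))


def extract_data_attrs(html_text):
    """Return all data-* attributes found in html_text as a list of tuples."""
    collector = DataAttrCollector()
    collector.feed(html_text)
    collector.close()
    return collector.found


# Sample markup taken from the question.
sample = '<a href="blah" data-mine="a;slfkjasd;fklajsdfl;ahsdf">blahlink</a>'
for tag, name, value in extract_data_attrs(sample):
    print('%s: %s = "%s"' % (tag, name, value))
```

Inside a Scrapy callback you would presumably pass the decoded body, i.e. extract_data_attrs(response.text), assuming the response is a text response.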

1 Comment

Beware the Zalgo.

0

I've not tried it, but there is html5lib (http://code.google.com/p/html5lib/), which can be used in conjunction with Beautiful Soup instead of Scrapy's built-in selectors.

1 Comment

Having said that, if you could provide a link to the page you're trying to scrape then I'll happily test it out now. :)
