CSS-aware intelligent HTML parser for Python

Question

I'm looking for an HTML parser that is CSS-aware and works the same way a browser renders HTML. Specifically, I'm looking for an equivalent of element.innerText (DOM/JS). Let me give an example. Consider the following HTML:

<style>
.AAA { display:inline;}
.BBB { display:none;}
.CCC { display:inline ;}
</style>
<span id="sarim">

    <span class="AAA">a</span>
    <span style="display:none">b</span>
    c
    <span class="CCC">d</span>
    <div style="display:inline">e</div>
    <span class="BBB">f</span>

</span>

If I load the above HTML in a browser and run document.getElementById('sarim').innerText, it returns "a c d e". That's exactly what I need. But if I use an HTML parser and strip the tags, it returns "abcdef". I need a parser that automatically ignores "b" and "f" by reading their CSS properties.

Any idea which parser supports this? I tried Beautiful Soup:

hiddenelements = sarim.findAll(True, {'style' : 'display:none'})
for p in hiddenelements:
    p.extract()

Now sarim.text returns the visible text, but this only works for inline styles. It is a manual process that fails for class-based CSS rules, and since the class names will be random, I'm looking for an intelligent parser that does this automatically.
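To illustrate how far the manual approach can be pushed, here is a minimal sketch using only the standard library's html.parser. It is not a real CSS engine: it assumes well-formed markup, looks only at simple `.classname { ... }` rules inside <style> blocks and at inline style attributes, and ignores the cascade, specificity, external stylesheets and every other selector type. The names hidden_classes, StyleCollector, VisibleText and visible_text are hypothetical helpers invented for this example.

```python
import re
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Assumption: only simple ".name { ... }" rules are recognised.
RULE_RE = re.compile(r"\.([\w-]+)\s*\{([^}]*)\}")
DISPLAY_NONE_RE = re.compile(r"display\s*:\s*none", re.I)


def hidden_classes(css):
    """Return class names whose simple rule body sets display:none."""
    return {name for name, body in RULE_RE.findall(css)
            if DISPLAY_NONE_RE.search(body)}


class StyleCollector(HTMLParser):
    """First pass: gather the text of every <style> block."""

    def __init__(self):
        super().__init__()
        self.in_style = False
        self.css = []

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_data(self, data):
        if self.in_style:
            self.css.append(data)


class VisibleText(HTMLParser):
    """Second pass: collect text that is not inside a hidden subtree."""

    def __init__(self, hidden):
        super().__init__()
        self.hidden = hidden
        self.skip_depth = 0  # > 0 while inside a hidden, <style> or <script> subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1  # nested element inside a hidden subtree
            return
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        inline = attrs.get("style") or ""
        if (tag in ("style", "script") or classes & self.hidden
                or DISPLAY_NONE_RE.search(inline)):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def visible_text(html):
    """Return whitespace-normalised text of elements not hidden by display:none."""
    collector = StyleCollector()
    collector.feed(html)
    collector.close()
    parser = VisibleText(hidden_classes("".join(collector.css)))
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Calling visible_text() on the markup from the question yields "a c d e". Anything beyond these simple rules (descendant selectors, !important, styles applied by scripts) still needs a real rendering engine.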

My fail-safe idea is to run a headless WebKit (phantomjs.org) and use element.innerText to retrieve the visible text. Any better ideas?

xiaowl · Accepted Answer · 2012-07-25 11:12:14Z

1

How about Python-Webkit? It's a Python binding for WebKit.

The Python Webkit DOM Project makes python a full peer of javascript when it comes to accessing and manipulating the full features available to Webkit, such as HTML5. Everything that can be done with javascript, such as getElementsbyTagName and appendChild, event callbacks through onclick, timeout callbacks through window.setTimeout, and even AJAX using XMLHttpRequest, can also be done from python.

1 Comment

Sarim Over a year ago

That is actually pythonwebkit. My target platform is OS X, so I'm avoiding it :(

Jamie Mason · Accepted Answer · 2012-11-27 13:15:40Z

0

I've made a CSS-aware HTML minifier using PhantomJS at https://github.com/JamieMason/Asterisk. It would be easy to fork and modify it for your purpose.

The main work is done in https://github.com/JamieMason/Asterisk/blob/master/src/browser.js. For my use case I inspect the styles to generate HTML output, but you could return the innerText instead.
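Returning innerText from PhantomJS could be driven from Python roughly as follows. This is a sketch under stated assumptions, not code from the Asterisk project: it assumes a `phantomjs` binary is available on PATH, and PHANTOM_SCRIPT, build_command and inner_text are hypothetical names made up for the example. The embedded script uses PhantomJS's webpage module to open the page, evaluates element.innerText inside it and prints the result.

```python
import os
import subprocess
import tempfile

# PhantomJS script: load a page, print one element's innerText, then exit.
PHANTOM_SCRIPT = r"""
var system = require('system');
var page = require('webpage').create();
var url = system.args[1], elementId = system.args[2];
page.open(url, function (status) {
    if (status !== 'success') {
        phantom.exit(1);
        return;
    }
    var text = page.evaluate(function (id) {
        var el = document.getElementById(id);
        return el ? el.innerText : '';
    }, elementId);
    console.log(text);
    phantom.exit(0);
});
"""


def build_command(script_path, url, element_id, phantomjs="phantomjs"):
    """Build the argv list; `phantomjs` is assumed to be the binary's name or path."""
    return [phantomjs, script_path, url, element_id]


def inner_text(url, element_id, phantomjs="phantomjs"):
    """Return the rendered innerText of the element with the given id."""
    # Write the script to a temporary file because PhantomJS expects a script path.
    with tempfile.NamedTemporaryFile("w", suffix=".js", delete=False) as f:
        f.write(PHANTOM_SCRIPT)
        script_path = f.name
    try:
        result = subprocess.run(
            build_command(script_path, url, element_id, phantomjs),
            capture_output=True, text=True, check=True, timeout=60,
        )
        return result.stdout.strip()
    finally:
        os.unlink(script_path)
```

A call such as inner_text("file:///path/to/page.html", "sarim") would then return whatever the browser engine considers visible, because the CSS is resolved by WebKit itself rather than by the parser.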

