I'm looking for an HTML parser that is CSS-aware and works the same way a browser renders HTML. What I actually want is the equivalent of element.innerText (DOM/JS). Let me give an example. Consider the following HTML:

<style>
.AAA { display:inline;}
.BBB { display:none;}
.CCC { display:inline ;}
</style>
<span id="sarim">

    <span class="AAA">a</span>
    <span style="display:none">b</span>
    c
    <span class="CCC">d</span>
    <div style="display:inline">e</div>
    <span class="BBB">f</span>

</span>

If I load the above HTML in a browser and run document.getElementById('sarim').innerText, it returns "a c d e". That's exactly what I need. But if I use an HTML parser and strip the tags, I get "abcdef". I need a parser that automatically ignores "b" and "f" by reading their CSS properties.

Any idea which parser supports this? I tried Beautiful Soup:

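# sarim is the already-parsed <span id="sarim"> element (a Beautiful Soup Tag)
hiddenelements = sarim.findAll(True, {'style' : 'display:none'})
for p in hiddenelements:
    p.extract()  # remove the hidden element from the tree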

Now sarim.text returns the visible text, but this only works for inline styles and is a manual process that fails for styles applied through CSS classes. Since the class names will be random, I'm looking for an intelligent parser that does this automatically.
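For the narrow case in the example (inline styles plus simple class rules inside a style block), a rough sketch using only Python's standard library could look like the following. It is not a CSS engine: it ignores specificity, inheritance quirks, external stylesheets, and any selector other than a bare `.classname`, and it assumes well-nested markup. The names `visible_text`, `hidden_classes`, `VisibleTextParser`, and `SAMPLE` are made up for this illustration.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not touch the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def is_display_none(declarations):
    """Return True if a CSS declaration string sets display to none."""
    return re.search(r"display\s*:\s*none\b", declarations, re.I) is not None


def hidden_classes(css):
    """Collect class names from simple `.name { ...display:none... }` rules."""
    return {name
            for name, body in re.findall(r"\.([\w-]+)\s*\{([^}]*)\}", css)
            if is_display_none(body)}


class VisibleTextParser(HTMLParser):
    def __init__(self, hidden):
        super().__init__()
        self.hidden = hidden   # class names known to be display:none
        self.stack = []        # one bool per open element: hidden or not
        self.chunks = []       # text found outside hidden elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        hide = (tag in ("style", "script")
                or is_display_none(attrs.get("style") or "")
                or any(c in self.hidden for c in classes)
                or (self.stack and self.stack[-1]))  # hidden ancestor
        self.stack.append(bool(hide))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if not (self.stack and self.stack[-1]):
            self.chunks.append(data)


def visible_text(html):
    """Approximate innerText: drop hidden text, collapse whitespace."""
    css = " ".join(re.findall(r"<style[^>]*>(.*?)</style>", html, re.I | re.S))
    parser = VisibleTextParser(hidden_classes(css))
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


SAMPLE = """
<style>
.AAA { display:inline;}
.BBB { display:none;}
.CCC { display:inline ;}
</style>
<span id="sarim">
    <span class="AAA">a</span>
    <span style="display:none">b</span>
    c
    <span class="CCC">d</span>
    <div style="display:inline">e</div>
    <span class="BBB">f</span>
</span>
"""

if __name__ == "__main__":
    print(visible_text(SAMPLE))  # a c d e
```

This only approximates what a browser does. For example, an element whose class is hidden but whose inline style overrides it would still be dropped here, so a real rendering engine remains the reliable option for arbitrary pages.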

My fallback idea is to run a headless WebKit (phantomjs.org) and use element.innerText to retrieve the visible text. Any better ideas?

2 Answers

How about Python-Webkit? It's a Python binding for WebKit.

The Python Webkit DOM Project makes python a full peer of javascript when it comes to accessing and manipulating the full features available to Webkit, such as HTML5. Everything that can be done with javascript, such as getElementsByTagName and appendChild, event callbacks through onclick, timeout callbacks through window.setTimeout, and even AJAX using XMLHttpRequest, can also be done from python.

1 Comment

That is actually pythonwebkit. My target platform is OS X, so I'm avoiding it :(

I've made a CSS-aware HTML minifier using PhantomJS at https://github.com/JamieMason/Asterisk. It would be easy to fork and modify for your purpose.

The main work is done in https://github.com/JamieMason/Asterisk/blob/master/src/browser.js. For my use case I inspect the styles to generate HTML output, but you could return the innerText instead.
