I'm looking for an HTML parser that is CSS-aware and works the same way a browser renders HTML. What I actually want is an equivalent of element.innerText (DOM/JS). Let me give an example. Consider the following HTML:
<style>
.AAA { display:inline;}
.BBB { display:none;}
.CCC { display:inline ;}
</style>
<span id="sarim">
<span class="AAA">a</span>
<span style="display:none">b</span>
c
<span class="CCC">d</span>
<div style="display:inline">e</div>
<span class="BBB">f</span>
</span>
Now if I load the above HTML in a browser and run document.getElementById('sarim').innerText, it returns "a c d e". That's exactly what I need. But if I use an HTML parser and strip the tags, it returns "abcdef". I need a parser that will automatically ignore "b" and "f" by reading their CSS properties.
Any idea which parser supports this? I tried Beautiful Soup:
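# sarim is the parsed <span id="sarim"> element
# remove every descendant whose inline style is exactly "display:none"
hiddenelements = sarim.findAll(True, {'style': 'display:none'})
for p in hiddenelements:
    p.extract()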
Now sarim.text returns the right text, but this only works for inline styles, and it is a manual process that fails for styles applied through CSS classes. Since the class names will be random, I'm looking for an intelligent parser that does this automatically.
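To show what "manual" means here, below is a rough sketch of how far I could push this by hand. It uses only the standard library's html.parser rather than Beautiful Soup, and the names hidden_classes, VisibleText and visible_text are my own, not from any library. It assumes the styles are simple `.classname { ... }` rules inside a <style> block in the same document, and it ignores the cascade, specificity, compound selectors, external stylesheets and visibility:hidden, so it is nowhere near a real CSS-aware parser.

```python
import re
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

DISPLAY_NONE = re.compile(r"display\s*:\s*none")


def hidden_classes(css):
    """Collect class names from simple `.name { ... }` rules that set display:none."""
    hidden = set()
    for name, body in re.findall(r"\.([\w-]+)\s*\{([^}]*)\}", css):
        if DISPLAY_NONE.search(body):
            hidden.add(name)
    return hidden


class VisibleText(HTMLParser):
    """Collect text that is not inside a hidden element (naive rules only)."""

    def __init__(self, hidden):
        super().__init__()
        self.hidden = hidden
        self.stack = []  # one bool per open element: is that element hidden?
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr = dict(attrs)
        style = attr.get("style") or ""
        classes = (attr.get("class") or "").split()
        is_hidden = (
            tag in ("style", "script")
            or DISPLAY_NONE.search(style) is not None
            or any(c in self.hidden for c in classes)
        )
        self.stack.append(is_hidden)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text is visible only if no enclosing element is hidden.
        if not any(self.stack):
            self.parts.append(data)


def visible_text(html):
    """Return whitespace-normalised text, skipping display:none elements."""
    css = " ".join(re.findall(r"<style[^>]*>(.*?)</style>", html, re.S | re.I))
    parser = VisibleText(hidden_classes(css))
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Calling visible_text() on the sample HTML above should give "a c d e", but only because that sample stays within the assumptions listed. Anything a browser resolves through the real cascade would still be wrong, which is why I'm asking for a proper parser.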
As a fallback, I could run a headless WebKit (phantomjs.org) and use element.innerText to retrieve the visible text. Any better ideas?