I'm doing webpage layout analysis in Python. A fundamental task is to programmatically measure the sizes of elements given the HTML source, so that I can collect statistics such as content/ad ratio, ad block position, and ad block size across a corpus of webpages.

An obvious approach is to read the width/height attributes, but they're not always present. Besides, values like width: 50% can only be resolved after the page is loaded into a DOM and laid out. So I guess loading the HTML source into a browser with a predefined window size (like mechanize, although I'm not sure whether its window size can be set) is worth trying, but mechanize doesn't support returning an element's size anyway.

Is there any universal way (without relying on width/height attributes) to do this in Python, preferably with some library?

Thanks!

  • Man, I can't even get my elements to render to the same size in IE and Firefox. If there is an "official" way to calculate dimensions, you can bet that half the market ignores that and does it their own way. Commented Mar 27, 2013 at 16:33
  • Just to point you into a direction -- might wanna look into what WebKit and the other renderers offer as output. Obviously won't get Trident, but WK / Gecko might be good enough... Commented Mar 27, 2013 at 16:57
  • @Kevin Your concern is certainly valid. But for an (empirical) research purpose, I'll stick to any browser that can do this. I understand that IE and Firefox don't render some elements at the same size, and I've suffered from that, too. But is it really a huge difference? I'm not worried about a drift of a few pixels here :) Commented Mar 27, 2013 at 16:57

2 Answers

I suggest you take a look at Ghost, a WebKit web client written in Python. It has JavaScript support, so you can easily call JavaScript functions and get their return values. This example shows how to find the width of the Google search box:

>>> from ghost import Ghost
>>> ghost = Ghost()
>>> ghost.open('https://google.lt')
>>> width, resources = ghost.evaluate("document.getElementById('gbqfq').offsetWidth;")
>>> width
541.0  # google text box width 541px

1 Comment

It's very helpful. However, I wish Ghost had API documentation.

To properly get all the final sizes, you need to render the contents, taking into account all CSS style sheets and possibly all JavaScript. Therefore, the only ways to get the sizes from a Python program are to have a full web browser implementation in Python, to use a library that can render pages, or to drive an external browser process remotely.

The latter approach can be done with the Selenium tools. To see how to get the result of JavaScript expressions from within a Python program, check: Can Selenium web driver have access to javascript global variables?
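As a rough illustration of the remote-browser approach, here is a minimal sketch, assuming the Selenium Python bindings and a Firefox driver are installed. The helper names (open_page, measure_elements, total_area) and the "div.ad" selector are made up for this example; only webdriver.Firefox, set_window_size, get, execute_script and quit come from Selenium itself. The measuring is done in the page with getBoundingClientRect, so percentage widths and CSS rules are already resolved by the browser.

```python
# JavaScript executed inside the page: collect the rendered box of every
# element matching the CSS selector passed in as arguments[0].
_MEASURE_JS = """
var nodes = document.querySelectorAll(arguments[0]);
var boxes = [];
for (var i = 0; i < nodes.length; i++) {
    var r = nodes[i].getBoundingClientRect();
    boxes.push({x: r.left, y: r.top, width: r.width, height: r.height});
}
return boxes;
"""


def open_page(url, width=1280, height=800):
    """Start Firefox with a fixed window size and load the page (hypothetical helper)."""
    # Imported here so the helpers below also work without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.set_window_size(width, height)  # fix the window so layouts are comparable
    driver.get(url)
    return driver


def measure_elements(driver, css_selector):
    """Return one {x, y, width, height} dict per element matching the selector."""
    return driver.execute_script(_MEASURE_JS, css_selector)


def total_area(boxes):
    """Sum width * height over the boxes (overlapping boxes are counted twice)."""
    return sum(box["width"] * box["height"] for box in boxes)


# Possible usage (the URL and selector are placeholders):
#   driver = open_page("http://example.com")
#   try:
#       ads = measure_elements(driver, "div.ad")
#       print(ads, total_area(ads))
#   finally:
#       driver.quit()
```

The coordinates are relative to the viewport, so for position statistics you may want to keep the page scrolled to the top or add the scroll offsets yourself.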
