1

I want to parse HTML in Python and have already tried Beautiful Soup and pyquery. The problem is that those parsers modify the original code, e.g. by inserting tags. Is there any parser out there that does not change the code?


I tried HTMLParser, but with no success! :( It doesn't modify the code and just tells me where the tags are placed, but it fails to parse web pages like mail.live.com. Any idea how to parse a web page just like a browser does?

  • The DOM tree generated by any parser may indeed differ from the original, since the original documents are, more often than not, malformed in some way. What do you actually need to do? That is, what would you do with the parsed data? Commented Dec 13, 2012 at 11:49
  • I am doing research on different websites and want to make some manipulations, such as inserting JavaScript code into them or bolding some of the content. But when I used pyquery, some of the pages turned into blank pages! Commented Dec 13, 2012 at 12:06
  • Hmm, well, I haven't heard of an HTML parsing library that guarantees a re-serialized document will be exactly the same as the parsed-in document. Are the changes that BeautifulSoup/PyQuery/whatever make actually problematic? Commented Dec 13, 2012 at 12:12
  • Yes, they are. As I said, for some websites the parsing itself turned the page into a blank page! The HTML code is there, but the browser just shows a blank page. Commented Dec 13, 2012 at 12:18

3 Answers

1

You can use BeautifulSoup to extract just the text without modifying the tags. It's in their documentation.

Same question here: How to extract text from beautiful soup

2 Comments

So if I don't want Beautiful Soup to modify any tags, I need to include every tag in QUOTE_TAGS?! And if I do that, is it still possible to parse the HTML?
This won't work, because inside a tag listed in QUOTE_TAGS Beautiful Soup sees just text, so no parsing happens in there.
0

No, at the moment there is no such HTML parser, and every parser has its own limitations.
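That said, for edits that only splice new text in at a known tag position, re-serialization can be avoided altogether. The sketch below is one possible workaround, not a general round-trip parser: it assumes Python 3's standard-library `html.parser.HTMLParser`, which only reports where tags are, and uses those positions to cut the original string. The names `TagLocator` and `insert_before_closing_body` are made up for illustration, and whether this copes with a page as complex as mail.live.com is untested.

```python
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Record where each tag starts in the source; nothing is re-serialized."""

    def __init__(self, source):
        super().__init__()
        # Absolute index of the first character of every line.
        # HTMLParser counts lines by "\n" only, so do the same here.
        self.line_starts = [0]
        for index, char in enumerate(source):
            if char == "\n":
                self.line_starts.append(index + 1)
        self.tags = []  # (kind, lowercased tag name, absolute offset)

    def _offset(self):
        line, column = self.getpos()  # line is 1-based, column is 0-based
        return self.line_starts[line - 1] + column

    def handle_starttag(self, tag, attrs):
        self.tags.append(("start", tag, self._offset()))

    def handle_endtag(self, tag):
        self.tags.append(("end", tag, self._offset()))


def insert_before_closing_body(source, snippet):
    """Return source with snippet spliced in just before the last </body>."""
    locator = TagLocator(source)
    locator.feed(source)
    locator.close()
    for kind, tag, offset in reversed(locator.tags):
        if kind == "end" and tag == "body":
            # Slice the original string; every other character is untouched.
            return source[:offset] + snippet + source[offset:]
    return source  # no </body> found: leave the document as it is


page = "<html>\n<body>\n<p>Unclosed paragraph\n<b>bold</B>\n</BODY>\n</html>"
patched = insert_before_closing_body(page, "<script>/* injected */</script>")
print(patched)
```

Because the original string is only sliced, malformed markup such as the unclosed `<p>` and the mixed-case closing tags passes through unchanged. The trade-off is that there is no tree to query, so anything beyond locating tags by name has to be built on top of the recorded offsets.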

0

Have you tried the webkit engine with Python bindings?

See this: https://github.com/niwibe/phantompy

You can traverse the real DOM of the parsed web page and do what you need to do.

2 Comments

No, I haven't. Are you sure the parsed HTML and the original HTML are the same after the HTML has been parsed? phantompy is still under development anyway, and they say they implemented Live DOM Access just as a proof of concept.
I am not sure, but you can be sure that you have the same DOM that a browser has. I was referring to your specific request, "Any idea how to parse a web page just like a browser?" This is one idea for how to do that.
