Python HTML parser that doesn't modify the actual markup?

Question

I want to parse HTML in Python and have already tried Beautiful Soup and pyquery. The problem is that those parsers modify the original code, e.g. by inserting tags. Is there any parser that does not change the code?

I tried HTMLParser, but with no success. It doesn't modify the code and just tells me where the tags are placed, but it fails on web pages like mail.live.com. Any idea how to parse a web page just like a browser does?

The DOM tree generated by any parser may indeed be different from the original, as the original documents more often than not are malformed in some way. What do you actually need to do -- that is, what would you do with the parsed data?
– AKX, Commented Dec 13, 2012 at 11:49
I am doing research on different websites and I want to do some manipulations, like inserting JavaScript code into them or bolding some of the content. But when I used pyquery, some of the pages turned into blank pages!
– Mehraban, Commented Dec 13, 2012 at 12:06
Hmm -- well, I haven't heard of an HTML parsing library that would guarantee a re-serialized document to be exactly the same as the parsed-in document. Are the changes that BeautifulSoup/PyQuery/whatever make actually problematic?
– AKX, Commented Dec 13, 2012 at 12:12
Yes, they are. As I said, for some websites the parsing itself turned the page into a blank page! The HTML code is there, but the browser just shows a blank page.
– Mehraban, Commented Dec 13, 2012 at 12:18

Community · Accepted Answer · 2017-05-23 12:03:24Z

1

You can use BeautifulSoup to extract just the text without modifying the tags. It's in their documentation.

Same question here: How to extract text from beautiful soup
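The idea of reading text out while leaving the markup alone can be sketched without BeautifulSoup at all. The block below swaps in the standard library's `html.parser` instead of the library this answer names; `TextCollector` and `extract_text` are hypothetical names made up for the illustration. Because the parser only reports events, the source string itself is never rewritten.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def extract_text(source):
    """Return the text of `source` with runs of whitespace collapsed."""
    parser = TextCollector()
    parser.feed(source)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


page = "<p>Hello <b>world</b><script>var x = 1;</script></p>"
text = extract_text(page)  # `page` itself is left exactly as it was
```

This only gets text out; it does not help with putting changes back into the page, which is what the comments below are about.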

edited May 23, 2017 at 12:03

CommunityBot


answered Dec 13, 2012 at 11:47

user723556


2 Comments

Mehraban Over a year ago

So if I want Beautiful Soup not to modify any tags, I need to include every tag in QUOTE_TAGS?! And if I do this, is it still possible to parse the HTML?

Mehraban Over a year ago

This won't work, because inside a QUOTE_TAGS element Beautiful Soup sees just text, so no parsing happens in there.

Mehraban · Accepted Answer · 2013-08-21 06:44:12Z

0

No. To this moment there is no such HTML parser, and every parser has its own limitations.
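One workaround that avoids re-serialization entirely is to use a parser only to find positions and then splice changes into the original string, so every untouched character stays as it was. The following is a minimal sketch of that idea using the standard library's `html.parser`; `TagLocator` and `insert_before_end_tag` are hypothetical names made up for this illustration, and the approach assumes the tokenizer locates the tags of interest correctly, which is not guaranteed for badly broken pages.

```python
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Record where tags sit in the source without building a tree."""

    def __init__(self, source):
        super().__init__()
        # Offset at which each line starts. HTMLParser counts lines by "\n"
        # only, so (line, column) from getpos() maps onto these offsets.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
        self.start_tags = []  # (tag, start_offset, end_offset)
        self.end_tags = []    # (tag, start_offset)
        self.feed(source)
        self.close()

    def _offset(self):
        line, col = self.getpos()  # line is 1-based, column is 0-based
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        start = self._offset()
        self.start_tags.append((tag, start, start + len(self.get_starttag_text())))

    def handle_endtag(self, tag):
        self.end_tags.append((tag, self._offset()))


def insert_before_end_tag(source, tag, snippet):
    """Return `source` with `snippet` spliced in just before the last </tag>."""
    offsets = [pos for name, pos in TagLocator(source).end_tags if name == tag]
    if not offsets:
        return source  # nothing to anchor on, so leave the page alone
    cut = offsets[-1]
    return source[:cut] + snippet + source[cut:]


# Deliberately sloppy markup: unclosed <p>, upper-case end tag.
page = "<html>\n<body>\n<p>Hi <br>there\n</BODY>\n</html>"
patched = insert_before_end_tag(page, "body", "<script>/* injected */</script>")
```

Everything outside the inserted snippet is copied from the original string, so removing the snippet gives back the input unchanged. The same offsets could be used to wrap a span in `<b>...</b>`. This does not make `html.parser` behave like a browser, and pages it tokenizes differently from a browser may still yield misleading offsets.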

answered Aug 21, 2013 at 6:44

Mehraban


Jiri · Accepted Answer · 2013-08-21 06:57:18Z

0

Have you tried the WebKit engine with Python bindings?

See this: https://github.com/niwibe/phantompy

You can traverse the real DOM of the parsed web page and do what you need to do.

answered Aug 21, 2013 at 6:57

Jiri


2 Comments

Mehraban Over a year ago

No, I didn't. Are you sure the HTML is still the same as the original after it has been parsed? phantompy is still under development anyway, and they say they implemented Live DOM Access just as a proof of concept.

Jiri Over a year ago

I am not sure, but you can be sure that you have the same DOM that a browser has. I was referring to your specific request, "Any idea how to parse a web page just like a browser?" This is one way to do it.
