
I'm trying to parse some HTML with C++ to extract all URLs from it (the URLs can be inside the href and src attributes).

I tried to use WebKit to do the heavy lifting, but when I load a frame with HTML the generated document is all wrong. If I let WebKit fetch the page from the web, the generated document is fine, but WebKit then also downloads all images, styles, and scripts, and I don't want that.

Here is what I tried to do:

// frame is a QWebFrame*, HTML is a QString holding the page source
frame->setHtml(HTML);
QWebElement document = frame->documentElement();
// findAll() returns a QWebElementCollection; each result needs its own name
QWebElementCollection links = document.findAll("a"); // Doesn't find all links
QWebElementCollection imgs = document.findAll("img"); // Doesn't find all images
QWebElementCollection scripts = document.findAll("script"); // Doesn't find all scripts
qDebug() << document.toInnerXml(); // Prints a completely messed-up document with several missing elements

What am I doing wrong? Is there an easy way to parse HTML with Qt, or with some other lightweight library?

  • 1. What "generated document"? 2. What do you mean by "all wrong"? 3. What is the expected behavior? 4. What is the actual behavior? Commented May 22, 2011 at 5:48
  • @Billy ONeal - When I load the frame with HTML, the document structure inside the frame is missing several elements. (This does not happen if I load the page from the web using page->load(url).) Commented May 22, 2011 at 5:52
  • @Billy ONeal - When I print the loaded document I can see that it has only some of the elements of the original HTML. If you put this code in a simple program and compile it, you'll see what I'm talking about. Commented May 22, 2011 at 5:55

1 Answer

You can always use XPath expressions to make your parsing life easier; take a look at this, for instance.

Or you can do something like this:

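QWebView* view = new QWebView(parent);
// view is a pointer, so use -> ; page() and mainFrame() also return pointers
view->load(QUrl("http://www.your_site.com"));
// load() is asynchronous: run the query once loadFinished(bool) has been emitted
QWebElementCollection elements = view->page()->mainFrame()->findAllElements("a");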

1 Comment

This only works if the HTML was loaded from the web. If I load the HTML manually, it breaks on the malformed tags that are present on 90% of websites.
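If the goal is only to collect the href and src values, and the markup is too malformed for a DOM to be built reliably, one possible fallback is to skip DOM construction and scan the raw text. The following is a minimal sketch using only the C++ standard library; `extractUrls` is a hypothetical helper name, not part of Qt or WebKit, and the approach is a regular-expression scan rather than real HTML parsing.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not a Qt/WebKit API): collects the values of quoted
// href and src attributes found anywhere in the input text.
// Assumption: attribute values are wrapped in single or double quotes.
std::vector<std::string> extractUrls(const std::string& html) {
    // Case-insensitive match of href="..." / src='...'; group 1 holds a
    // double-quoted value, group 2 a single-quoted one.
    static const std::regex attr(
        R"re(\b(?:href|src)\s*=\s*(?:"([^"]*)"|'([^']*)'))re",
        std::regex::icase);

    std::vector<std::string> urls;
    for (std::sregex_iterator it(html.begin(), html.end(), attr), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        // Exactly one of the two capture groups participates in each match.
        urls.push_back(m[1].matched ? m[1].str() : m[2].str());
    }
    return urls;
}
```

Because it never builds a tree, unclosed or badly nested tags do not affect the result. The trade-offs are real, though: unquoted attribute values are skipped, names such as data-src also match, and attributes inside comments or script text are picked up too, so treat it as a rough filter rather than a replacement for a parser.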
