2

I'm trying to parse the HTML of an external page and read its contents (e.g. get the "title" element from google.com). XmlDataSource does not appear to work because the page is not clean XML. Does anybody know how to do this?

Thank you.


3 Answers

5

You should use Html Agility Pack.
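
As a rough illustration of that suggestion, here is a minimal sketch that pulls the title out of an HTML string. It assumes the HtmlAgilityPack NuGet package is referenced; the class name TitleReader and the method GetTitle are made up for this example.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class TitleReader
{
    // Hypothetical helper: returns the text of the first <title> element,
    // or null if the document has none.
    public static string GetTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerates markup that is not well-formed XML

        HtmlNode titleNode = doc.DocumentNode.SelectSingleNode("//title");
        if (titleNode == null)
        {
            return null;
        }

        // Decode entities such as &amp; and trim surrounding whitespace.
        return HtmlEntity.DeEntitize(titleNode.InnerText).Trim();
    }
}
```

To work from a URL instead of a string, the library's HtmlWeb class can fetch and parse in one step (new HtmlWeb().Load(url)), and the same XPath query applies to the document it returns.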


3

If it's something simple, can you just do some basic string parsing? It's not the most efficient approach, but it works well enough.

First get your html (in case this is part of what you needed):

using System.Net;
// strURL holds the address of the page to download
WebClient client = new WebClient();
string webhtml = client.DownloadString(strURL);

If you have a repeating pattern, you can then use .Split to divide it up.
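
For instance, a sketch of the .Split step might look like the following. The class SplitExample, the method SplitOnMarker, and the idea of splitting on a literal marker such as "<li>" are assumptions made for illustration only.

```csharp
using System;

public static class SplitExample
{
    // Hypothetical helper: splits html on a repeated marker and returns
    // one chunk per occurrence of the marker.
    public static string[] SplitOnMarker(string html, string marker)
    {
        string[] parts = html.Split(new[] { marker }, StringSplitOptions.None);

        // parts[0] is whatever came before the first marker, so skip it.
        string[] chunks = new string[parts.Length - 1];
        Array.Copy(parts, 1, chunks, 0, chunks.Length);
        return chunks;
    }
}
```

Each chunk still contains everything up to the next marker (closing tags included), so it usually needs a further trim with .IndexOf and .Substring.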

Now just use .IndexOf (or .LastIndexOf) and .Substring to parse as needed. If you need to do this a lot, or iteratively, you can create a function that takes the html and the start and end delimiters, plus a few other parameters as needed. You'll need to offset the start position by adding the length of the start delimiter to the index that .IndexOf returns, but otherwise it's fairly straightforward.
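
A minimal sketch of such a function, assuming plain literal delimiters. The names HtmlSlice and Between are invented for this example, and the case-insensitive comparison is an added choice because HTML tag names may appear in either case.

```csharp
using System;

public static class HtmlSlice
{
    // Hypothetical helper: returns the text between the first startTag and
    // the next endTag after it, or null if either delimiter is missing.
    public static string Between(string html, string startTag, string endTag)
    {
        int start = html.IndexOf(startTag, StringComparison.OrdinalIgnoreCase);
        if (start < 0)
        {
            return null;
        }

        // Offset past the start delimiter itself.
        start += startTag.Length;

        int end = html.IndexOf(endTag, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0)
        {
            return null;
        }

        return html.Substring(start, end - start);
    }
}
```

Calling HtmlSlice.Between(webhtml, "<title>", "</title>") would return the page title for simple pages. It will miss a start tag that carries attributes, which is one of the limits of plain string parsing.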


0

Use Sgml Reader (http://sourceforge.net/projects/dekiwiki/files/SgmlReader/) if you are interested in treating HTML like XML for parsing. While this may be overkill for getting the title, it will be faster than other similar methods when parsing large HTML pages.
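
A rough sketch of how this is commonly wired up, assuming the SgmlReader assembly is referenced and exposes the Sgml.SgmlReader class with the DocType, CaseFolding and InputStream properties. The wrapper class SgmlTitle and its GetTitle method are made up for this example.

```csharp
using System.IO;
using System.Xml;

public static class SgmlTitle
{
    // Hypothetical helper: reads HTML through SgmlReader into an XmlDocument
    // and returns the text of the first <title> element, or null if none.
    public static string GetTitle(string html)
    {
        using (var text = new StringReader(html))
        {
            var sgml = new Sgml.SgmlReader();
            sgml.DocType = "HTML";                       // use the built-in HTML DTD
            sgml.CaseFolding = Sgml.CaseFolding.ToLower; // normalise tag names to lower case
            sgml.InputStream = text;

            var doc = new XmlDocument();
            doc.Load(sgml); // SgmlReader is consumed as an ordinary XmlReader

            XmlNodeList titles = doc.GetElementsByTagName("title");
            return titles.Count > 0 ? titles[0].InnerText : null;
        }
    }
}
```

Once the page is loaded into an XmlDocument, the usual XML tools (XPath, XmlNode navigation) can be used on it.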
