8

I can't use the HTML Agility Pack, only the plain .NET Framework. Given a string that contains some HTML, I need to parse and edit it in the following ways:

  • find specific controls in the hierarchy by id or by tag
  • modify (and ideally create) attributes of those found elements

Are there methods available in .NET to do this?

8
  • 1
    I know... use regex Commented Feb 27, 2012 at 22:42
  • 4
    I don't know... don't use regex stackoverflow.com/questions/1732348/… Commented Feb 27, 2012 at 22:44
  • 1
    If your HTML happens to be XHTML, then you could use the standard XML libraries for parsing, traversing, and modifying it. Commented Feb 27, 2012 at 22:46
  • 1
    The short answer is no. The Agility Pack is the closest thing there is to a sanctioned (.NET) HTML parser. Commented Feb 27, 2012 at 22:47
  • 1
    Why "I can't use HTMLAgilityPack" ? Seems silly to rule out a very good (and even free) tool. Commented Feb 27, 2012 at 22:48

4 Answers

5

See HtmlDocument, its GetElementById method, and HtmlElement, all in System.Windows.Forms.

You can create a dummy HTML document hosted in a WebBrowser control:

// Requires a reference to System.Windows.Forms and an STA thread ([STAThread] on Main).
WebBrowser w = new WebBrowser();
w.Navigate(String.Empty); // load an empty page so that w.Document is available
HtmlDocument doc = w.Document;
doc.Write("<html><head></head><body><img id=\"myImage\" src=\"c:\"/><a id=\"myLink\" href=\"myUrl\"/></body></html>");
Console.WriteLine(doc.Body.Children.Count); // number of elements directly under <body>
Console.WriteLine(doc.GetElementById("myImage").GetAttribute("src"));
Console.WriteLine(doc.GetElementById("myLink").GetAttribute("href"));
Console.ReadKey();

Output:

2

file:///c:

about:myUrl

Editing elements:

HtmlElement imageElement = doc.GetElementById("myImage");
string newSource = "d:";
imageElement.OuterHtml = imageElement.OuterHtml.Replace(
        "src=\"c:\"",
        "src=\"" + newSource + "\"");
Console.WriteLine(doc.GetElementById("myImage").GetAttribute("src"));

Output:

file:///d:
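
As a possible alternative to rewriting OuterHtml, HtmlElement also exposes SetAttribute, which can both change an existing attribute and create a new one. The following is only a sketch that reuses the setup from the snippet above; the class name SetAttributeSketch and the helper SetAndRead are made up for illustration, and it assumes a Windows machine with a reference to System.Windows.Forms.

```csharp
using System;
using System.Windows.Forms;

static class SetAttributeSketch
{
    // Hypothetical helper: loads the markup into a throwaway WebBrowser,
    // sets one attribute on the element with the given id, and reads it back.
    public static string SetAndRead(string html, string id, string name, string value)
    {
        using (var browser = new WebBrowser())
        {
            // Same setup as in the answer: navigate to an empty page, then write the markup.
            browser.Navigate(String.Empty);
            HtmlDocument doc = browser.Document;
            doc.Write(html);

            HtmlElement element = doc.GetElementById(id);
            element.SetAttribute(name, value); // modifies the attribute, or creates it if missing
            return element.GetAttribute(name);
        }
    }

    [STAThread] // the WebBrowser control must be created on an STA thread
    static void Main()
    {
        string html = "<html><head></head><body><img id=\"myImage\" src=\"c:\"/></body></html>";
        Console.WriteLine(SetAndRead(html, "myImage", "src", "d:"));         // modify
        Console.WriteLine(SetAndRead(html, "myImage", "title", "My image")); // create
    }
}
```

This still depends on the WinForms control, so the limitation raised in the comments below applies to it as well.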


4 Comments

This requires you to load up the document in a Winforms control.
Correct me if I'm wrong but this requires a webBrowser control and doesn't allow for direct HTML string parsing.
@JellyAma, yes, but isn't it what you seem to want in "modify (and ideally create) attributes of those found elements"?
@Alexei, most importantly, I need to parse strings of HTML.
1

Assuming you're dealing with well-formed HTML (in effect, XHTML), you could simply treat the text as an XML document. The framework has plenty of features for doing exactly what you're asking.

http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx
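
A minimal sketch of this idea with XmlDocument, assuming the input really is well-formed XML with no namespace declarations. The class name XhtmlEditSketch and the helper SetAttributeById are hypothetical names introduced here. The last part of Main shows what happens with markup that is valid HTML but not XML.

```csharp
using System;
using System.Xml;

static class XhtmlEditSketch
{
    // Hypothetical helper: sets (or creates) one attribute on the element whose
    // id attribute equals `id`, then returns the serialized markup.
    public static string SetAttributeById(string xhtml, string id, string name, string value)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed XML

        // XmlDocument.GetElementById only sees IDs declared in a DTD, so use XPath instead.
        // Assumes `id` contains no single quote.
        var element = doc.SelectSingleNode("//*[@id='" + id + "']") as XmlElement;
        if (element != null)
            element.SetAttribute(name, value);

        return doc.OuterXml;
    }

    static void Main()
    {
        string page = "<html><body><img id=\"myImage\" src=\"c:\" />"
                    + "<a id=\"myLink\" href=\"myUrl\"></a></body></html>";

        Console.WriteLine(SetAttributeById(page, "myImage", "src", "d:"));    // modify
        Console.WriteLine(SetAttributeById(page, "myLink", "title", "Home")); // create

        // Lookup by tag name.
        var doc = new XmlDocument();
        doc.LoadXml(page);
        Console.WriteLine(doc.GetElementsByTagName("img").Count);

        try
        {
            // An unclosed <br> is fine in HTML but is not well-formed XML.
            SetAttributeById("<html><body>line1 <br> line2</body></html>", "x", "a", "b");
        }
        catch (XmlException ex)
        {
            Console.WriteLine("Not XML: " + ex.Message);
        }
    }
}
```

The XPath lookup is used because XmlDocument.GetElementById only resolves IDs that a DTD declares, which a bare HTML string normally lacks.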

1 Comment

Try to parse this well formed html. <html><body>line1 <br> line2</body></html>
1

Aside from using the HTML Agility Pack or porting HtmlUnit to C#, the options that sound solid are:

  • Most obviously, regular expressions (System.Text.RegularExpressions).
  • An XML parser (since HTML is a system of tags, treat it like an XML document?).
  • LINQ?

One thing I do know is that parsing HTML like XML may cause you to run into a few problems. XML and HTML are not the same. Read about it: here

Also, here is a post about Linq vs Regex.

1 Comment

0

You can look at how the HTML Agility Pack works; however, it is itself .NET. You can reflect the assembly and see that it is using the MFC, and it could be reproduced if you wanted, but you'd be doing nothing more than moving the assembly, not making it any more .NET.
