0

I want to be able to take HTML code and render plain text out of it.

In other words, this would be my input:

<h3>some text</h3>

I want the result to look like this:

some text

How would I do it?

4 Answers

3

I would suggest trying the HTML Agility Pack for .NET:

Html Agility Pack - Codeplex

Attempting to parse HTML with anything else is, for the most part, unreliable.

Whatever you do, DON'T TRY TO PARSE HTML WITH REGEX!
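For illustration, here is a minimal sketch of how the question's example could be handled with the Html Agility Pack. It assumes the HtmlAgilityPack NuGet package is referenced; the class name HtmlText and the method name FromHtml are made up for this example and are not part of the library.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper class, named for this example only.
public static class HtmlText
{
    public static string FromHtml(string html)
    {
        // Parse the markup into a DOM; the parser tolerates malformed HTML.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText concatenates the text nodes under the root node.
        // DeEntitize converts entities such as &amp; back to plain characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText).Trim();
    }

    public static void Main()
    {
        Console.WriteLine(FromHtml("<h3>some text</h3>")); // prints "some text"
    }
}
```

For real pages, InnerText may also pick up text you don't want (for example the contents of script or style elements), so you would probably need to remove those nodes before reading the text.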

2 Comments

I think that HtmlAgilityPack is not needed for this simple task. See my answer.
@sashaeve And see my updated answer. For a simple example like this, RegEx might work...but this is just an example. My guess is his real problem is much more complex and that SO post explains IN DEPTH why you can't parse HTML with RegEx.
1

Use regex.

String result = Regex.Replace(your_text_goes_here, @"<[^>]*>", String.Empty);
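As a rough sketch, the one-liner above can be wrapped in a small helper to see what it does on the question's input and where it starts to break. The class name TagStripper and the method name Strip are made up for this example, and the WebUtility.HtmlDecode call is an addition that is not part of the original line.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class, named for this example only.
public static class TagStripper
{
    public static string Strip(string html)
    {
        // Remove everything from a '<' up to the next '>'.
        string withoutTags = Regex.Replace(html, @"<[^>]*>", String.Empty);

        // Added for this sketch: turn entities such as &amp; into plain characters.
        return WebUtility.HtmlDecode(withoutTags);
    }

    public static void Main()
    {
        // The input from the question works as expected.
        Console.WriteLine(Strip("<h3>some text</h3>")); // prints "some text"

        // A '>' inside an attribute value ends the match too early,
        // so part of the tag leaks into the output.
        Console.WriteLine(Strip("<a title=\"a > b\">link</a>")); // prints " b">link"
    }
}
```

The second call shows the kind of input where tag stripping by pattern goes wrong: the pattern has no notion of attributes, comments, or script blocks.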

4 Comments

@sashaeve: This is not reliable enough to render HTML.
@James: Why not? It all depends on how complex the source HTML is. If it is as simple as in the example, this will be enough.
Yes, maybe so (as I have suggested myself); however, I am assuming that the HTML would be a little more complex than what has been provided in the example.
Regex will only get you in trouble; just use a proper parser. Your argument, "this will work on the example", doesn't sound right to me. By that logic, string StripHtml(string input){return "some text";} would be a valid answer as well: much simpler, and still no need for regex. Just use Html Agility Pack and save yourself the headaches.
0

You would need to use some form of HTML parser. You could use an existing regex or build your own; however, regexes aren't always 100% reliable. I would suggest using a third-party library like HtmlAgilityPack (I have used it and would recommend it).

0

Poor Man's HTML Parser

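        // Sample input; any HTML fragment is handled the same way.
        string s =
            @"
            <html>
            <body>
            <h1>My First Heading</h1>
            <p>My first paragraph.</p>
            </body>
            </html> 
        ";

        // Split on '<' so that each piece looks like "tag>text after the tag".
        foreach (var item in s.Split(new char[]{'<'}))
        {
            int x = item.IndexOf('>');

            if (x != -1)
            {
                // Keep only what follows the '>' and drop surrounding whitespace.
                string text = item.Substring(x + 1).Trim();

                // Skip pieces with no text (closing tags, whitespace between tags).
                if (text.Length > 0)
                {
                    Console.WriteLine(text);
                }
            }
        }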
