I built an extension method that converts HTML-formatted text into something better suited to a list view. It removes all HTML tags, except that closing <h> and <p> tags are replaced with <br /> to keep the text readable in the list. It also shortens the text of longer posts. I render the result in my Razor view with Html.Raw(model.text).

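// Requires: using System; using System.Text.RegularExpressions;
public static string FixHTML(string input, int? strLen)
{
    string s = input.Trim();
    // Turn closing paragraph and heading tags into line breaks.
    s = Regex.Replace(s, "</p.*?>", "<br />");
    s = Regex.Replace(s, "</h.*?>", "<br />");
    // Protect the line breaks with a placeholder while every other tag is stripped.
    s = s.Replace("<br />", "*ret$990^&");
    s = Regex.Replace(s, "<.*?>", String.Empty);
    s = Regex.Replace(s, "</.*", String.Empty);
    s = s.Replace("*ret$990^&", "<br />");
    // Shorten to at most strLen characters (this is where a <br /> can be cut in half).
    int i = strLen ?? s.Length;
    s = s.Substring(0, Math.Min(i, s.Length));
    return s;
}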

PROBLEM: if the text gets cut off in the middle of a <br />, it messes up the displayed text. For example, if it is cut off at "blah blah blah <br", the display isn't nice. How can I use a regex (or even a string replace) to find only the last occurrence of <b... and only if it doesn't have a closing >?

I was thinking of something like:

s = string.Format(s.Substring(0, s.Length-6) + Regex.Replace(s.Substring(s.Length - 6), "<.*", string.Empty));

That will probably work, but my whole converter seems to use a lot of code to do something that should be relatively simple.

How can I do this?

  • Using regex to parse HTML is not recommended. Commented Jan 18, 2018 at 20:29
  • Is there anything that IS recommended to "clean" HTML? What I am doing above works, but I agree it's not pretty. Commented Jan 18, 2018 at 20:40
  • Possible duplicate of RegEx match open tags except XHTML self-contained tags Commented Jan 18, 2018 at 21:26
  • I would suggest a library such as HtmlAgilityPack to parse through and change your HTML Commented Jan 18, 2018 at 22:21

1 Answer

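Try this. It removes a partial <br /> left at the very end of the string, including the fragments that contain a space ("<br " and "<br /"), since FixHTML emits "<br />" with a space:

s = Regex.Replace(s, "(<|<b|<br ?/?)$", "");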
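To see how an end-anchored pattern of this kind behaves on truncated output, here is a minimal sketch. It assumes the only tag that can be cut is the "<br />" that FixHTML emits, so the pattern also covers the fragments containing a space; the class and method names (TrailingBrDemo, TrimPartialBr) are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingBrDemo
{
    // Hypothetical helper: removes a cut-off "<br />" at the very end of s.
    // Fragments covered: "<", "<b", "<br", "<br ", "<br/", "<br /".
    public static string TrimPartialBr(string s)
    {
        return Regex.Replace(s, "(<|<b|<br ?/?)$", string.Empty);
    }

    public static void Main()
    {
        // Brackets make trailing whitespace visible in the output.
        Console.WriteLine("[" + TrimPartialBr("blah blah blah <br") + "]"); // [blah blah blah ]
        Console.WriteLine("[" + TrimPartialBr("one<br />two<br /") + "]");  // [one<br />two]
        Console.WriteLine("[" + TrimPartialBr("one<br />") + "]");          // [one<br />] (complete tag is kept)
    }
}
```

Because the pattern is anchored with $, a complete <br /> earlier in the text is never touched; only a fragment at the end of the string is removed.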

2 Comments

An alternative regex that would catch any incomplete HTML tag (not just br) at the end of a string is "<[^>]*$".
@Rudism - definitely a good solution, the only problem might be if the "<" character appeared in the text not as part of a tag
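The trade-off raised in these two comments can be sketched as follows. The pattern "<[^>]*$" is the one quoted in the comment; the class and method names (IncompleteTagDemo, StripIncompleteTag) are hypothetical, and the last call shows the caveat about a literal "<" in ordinary text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IncompleteTagDemo
{
    // Hypothetical helper: removes everything from the last "<" that has
    // no closing ">" after it, through to the end of the string.
    public static string StripIncompleteTag(string s)
    {
        return Regex.Replace(s, "<[^>]*$", string.Empty);
    }

    public static void Main()
    {
        // Brackets make trailing whitespace visible in the output.
        Console.WriteLine("[" + StripIncompleteTag("text <a href=\"x") + "]"); // [text ]
        Console.WriteLine("[" + StripIncompleteTag("done<br />") + "]");       // [done<br />]
        // Caveat: a literal "<" in plain text is treated as the start of a tag.
        Console.WriteLine("[" + StripIncompleteTag("if a < b then") + "]");    // [if a ]
    }
}
```

If the text can contain an unescaped "<" that is not part of a tag, the narrower <br />-only pattern from the answer may be the safer choice.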