I need to extract text from an HTML file using C#. I am trying to use HtmlAgilityPack, but I am seeing some parse errors (tags not closed). I am using these two options:

        htmlDoc.OptionFixNestedTags = true;
        htmlDoc.OptionAutoCloseOnEnd = true;

Is there any "fix all" type of option? I don't care about the errors; I just want the content, or something close to it.

1 Answer

Maybe this is a workaround, but when I once had to extract text from HTML, I used regex:

// Assumes `result` holds the raw HTML and `using System.Text.RegularExpressions;` is in scope.
result = Regex.Replace(result, @"<[^>]*>", String.Empty); // strip tags, including ones that span lines
result = result.Trim('\n');                                // drop leading and trailing newlines
result = result.Replace("\n", " ");                        // turn the remaining newlines into spaces
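
The same idea can be wrapped into a self-contained helper. This is only a sketch under a few assumptions that are not part of the snippet above: the names `HtmlTextExtractor` and `ExtractText` are made up for illustration, `<script>`/`<style>` blocks and comments are dropped before the tags are stripped, tags are replaced with a space rather than an empty string so adjacent words do not run together, and entities are decoded with `System.Net.WebUtility.HtmlDecode`. Regex stripping is not a real parser, so it can still be fooled by input such as a `>` inside an attribute value.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative and not part of any library.
public static class HtmlTextExtractor
{
    public static string ExtractText(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // Assumption: script/style bodies and HTML comments are not wanted as text.
        string result = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>|<!--.*?-->",
            " ",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        // Replace every remaining tag with a space; unclosed tags do not matter here.
        result = Regex.Replace(result, @"<[^>]*>", " ");

        // Decode entities such as &amp; into their characters.
        result = WebUtility.HtmlDecode(result);

        // Collapse runs of whitespace (including newlines) and trim both ends.
        return Regex.Replace(result, @"\s+", " ").Trim();
    }
}
```

For example, `HtmlTextExtractor.ExtractText("<p>Unclosed <b>bold<p>Tom &amp; Jerry")` should return `Unclosed bold Tom & Jerry`, even though none of the tags are closed.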