0

I have an HTML document saved as a .txt file. It contains multiple tables and other text, and I am trying to delete any HTML tag (anything within "<>") that appears inside a table (between <table> and </table>). For example:

===================
other text
<other HTML>
<table>
<b><u><i>bold underlined italic text</b></u></i>
</table>
other text
<other HTML>
==============

The final output would be as follows. Note that only the HTML tags between <table> and </table> are removed.

==============
other text
<other HTML>
<table>
bold underlined italic text
</table>
other text
<other HTML>
=============

Any help is greatly appreciated!

3
  • I think parts of your question have disappeared due to HTML-tag parsing of your text. Try putting the tags in single ticks (`), like so: <html> Commented Dec 21, 2010 at 18:00
  • 8
    For starters: Don't parse HTML with RegEx. Commented Dec 21, 2010 at 18:06
  • 1
    Using regexes for this would involve a bunch of assumptions. You can run into problems if, for example, you assume that anything between < and > is a tag, even if it's not valid HTML. So a mathematical expression like x<y and z>2 could cause problems. If you can state a bunch of assumptions we can follow, then someone can likely provide a satisfactory regex. But it's probably better not to use regexes at all as zzzzBov suggests. Commented Dec 21, 2010 at 18:18

2 Answers

4

Use the HTMLDocument Class Instead of Regex

Imports System.Windows.Forms.HtmlDocument
Imports System.IO.File

Public Class Form1

    Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        Dim myHTMLString As String

        Dim myDoc As HtmlDocument
        Dim myTables As HtmlElementCollection
        Dim myTable As HtmlElement

        Dim myAllTags As HtmlElementCollection
        Dim myHTMLTag As HtmlElement

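        ' Read the HTML text from disk and load it into the WebBrowser control.
        myHTMLString = ReadAllText("C:\Users\Geoffrey Van Wyk\Desktop\myPage1.txt")
        WebBrowser1.DocumentText = myHTMLString

        ' Open a new document and write the HTML into it.
        myDoc = WebBrowser1.Document.OpenNew(True)
        myDoc.Write(myHTMLString)

        ' Only the first table in the document is processed here.
        myTables = myDoc.GetElementsByTagName("table")
        myTable = myTables.Item(0)

        ' Replace each child element of the table with its plain text.
        For Each child As HtmlElement In myTable.Children
            child.OuterText = child.InnerText
        Next

        ' Write the whole document (the <html> element) to the output file.
        myAllTags = myDoc.GetElementsByTagName("html")
        myHTMLTag = myAllTags.Item(0)

        WriteAllText("C:\Users\Geoffrey Van Wyk\Desktop\myPage2.txt", myHTMLTag.OuterHtml)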
    End Sub
End Class

I have tested it. It works.

2
input = Regex.Replace(input, @"<table>(.|\n)*?</table>", string.Empty, RegexOptions.Singleline);

Here input is the string that contains the HTML. This regex removes everything from the opening <table> tag through the closing </table> tag, including the two tags themselves and all text between them. Try it!
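
If the goal is to keep the table and its text and drop only the tags inside it, as in the question's expected output, a two-step replace is one possible approach. The following is a minimal sketch, not a robust HTML parser. It assumes tables are not nested and that every "<...>" run inside a table really is a tag (so text such as x<y and z>2 would be damaged). The class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class TableTagStripper
{
    // Matches one <table ...>...</table> block, lazily, across line breaks.
    // Assumption: tables are not nested.
    private static readonly Regex TableBlock = new Regex(
        @"(<table\b[^>]*>)(.*?)(</table\s*>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Matches anything that looks like a tag: '<', one or more non-'>' characters, '>'.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    public static string StripTagsInsideTables(string html)
    {
        // Keep the opening and closing table tags (groups 1 and 3)
        // and remove every tag-like run from the table body (group 2).
        return TableBlock.Replace(html, m =>
            m.Groups[1].Value
            + AnyTag.Replace(m.Groups[2].Value, string.Empty)
            + m.Groups[3].Value);
    }

    static void Main()
    {
        string input =
            "other text\n<other HTML>\n<table>\n" +
            "<b><u><i>bold underlined italic text</b></u></i>\n" +
            "</table>\nother text\n<other HTML>";

        Console.WriteLine(StripTagsInsideTables(input));
    }
}
```

Text and tags outside the tables are left untouched because only the matched table blocks are rewritten. For real-world HTML, the DOM-based approach in the other answer avoids the assumptions listed above.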
