Regex to delete HTML within <table> tags

Question

I have an HTML document in .txt format that contains multiple tables and other text. I am trying to delete any HTML tag (anything within "<>") that appears inside a table (between <table> and </table>). For example:

===================
other text
<other HTML>
<table>
<b><u><i>bold underlined italic text</b></u></i>
</table>
other text
<other HTML>
==============

The final output should be as follows. Note that only the HTML within <table> and </table> is removed.

==============
other text
<other HTML>
<table>
bold underlined italic text
</table>
other text
<other HTML>
=============

Any help is greatly appreciated!

I think parts of your question have disappeared due to HTML-tag parsing of your text. Try putting the tags in single backticks (`), like so: `<html>`
– Victor Zamanian, Commented Dec 21, 2010 at 18:00
Using regexes for this would involve a number of assumptions. You can run into problems if, for example, you assume that anything between < and > is a tag, even when it is not valid HTML: a mathematical expression like x<y and z>2 would be treated as containing a tag. If you can state the assumptions we can rely on, someone can likely provide a satisfactory regex. But it's probably better not to use regexes at all, as zzzzBov suggests.
– Mitch Schwartz, Commented Dec 21, 2010 at 18:18
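
To make the concern in the comment above concrete, here is a minimal C# sketch. The class and method names are hypothetical and exist only for this illustration; the pattern `<[^>]+>` is one common way of writing "anything between < and >", not something taken from the question.

```csharp
using System.Text.RegularExpressions;

public static class NaiveTagPattern
{
    // Hypothetical helper for illustration only: removes every run of
    // characters that starts with '<' and ends at the next '>'.
    public static string StripAngleBracketRuns(string text)
    {
        return Regex.Replace(text, @"<[^>]+>", string.Empty);
    }
}
```

Calling `NaiveTagPattern.StripAngleBracketRuns("x<y and z>2")` returns `x2`, because `<y and z>` looks like a tag to the pattern even though it is ordinary text.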

Geoffrey · Accepted Answer · 2010-12-21 21:28:49Z

Use the HTMLDocument Class Instead of Regex

Imports System.Windows.Forms
Imports System.IO.File

' Requires a WebBrowser control named WebBrowser1 on the form.
Public Class Form1

    Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        Dim myHTMLString As String

        Dim myDoc As HtmlDocument
        Dim myTables As HtmlElementCollection
        Dim myTable As HtmlElement

        Dim myAllTags As HtmlElementCollection
        Dim myHTMLTag As HtmlElement

        ' Read the whole input file into a string.
        myHTMLString = ReadAllText("C:\Users\Geoffrey Van Wyk\Desktop\myPage1.txt")
        WebBrowser1.DocumentText = myHTMLString

        ' Open a fresh document in the control and write the HTML into it.
        myDoc = WebBrowser1.Document.OpenNew(True)
        myDoc.Write(myHTMLString)

        ' Only the first table is handled here; loop over myTables to process every table.
        myTables = myDoc.GetElementsByTagName("table")
        myTable = myTables.Item(0)

        ' Replace each direct child of the table with its plain text, dropping the child's tags.
        For Each child As HtmlElement In myTable.Children
            child.OuterText = child.InnerText
        Next

        ' Serialize the whole document from the <html> element and save it.
        myAllTags = myDoc.GetElementsByTagName("html")
        myHTMLTag = myAllTags.Item(0)

        WriteAllText("C:\Users\Geoffrey Van Wyk\Desktop\myPage2.txt", myHTMLTag.OuterHtml)
    End Sub
End Class

I have tested it. It works.

Waseem Fastian · Accepted Answer · 2012-12-06 06:56:30Z

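// RegexOptions.Singleline makes "." match newlines too, so ".*?" can span a multi-line table.
input = Regex.Replace(input, @"<table>.*?</table>", string.Empty, RegexOptions.Singleline);

Here input is the string that contains the HTML. This regex removes everything from the opening <table> tag through the closing </table> tag, including the tags themselves and all the text between them. Try it!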
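
Note that the pattern above deletes each table entirely, whereas the question asks to keep the table and its text and drop only the tags inside it. A minimal C# sketch of that variant follows. It is not from either answer: the class and method names are hypothetical, and it assumes tables are not nested, that every `<table>` has a matching `</table>`, and that no literal `<` or `>` characters appear in the table text.

```csharp
using System.Text.RegularExpressions;

public static class TableTagStripper
{
    // Hypothetical helper (not from the answers above): keeps each
    // <table>...</table> pair and removes only the tags between them.
    public static string StripTagsInsideTables(string html)
    {
        const RegexOptions options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        return Regex.Replace(
            html,
            // Group 1: opening tag (attributes allowed), group 2: body, group 3: closing tag.
            @"(<table\b[^>]*>)(.*?)(</table\s*>)",
            match =>
            {
                // Strip anything shaped like <...> from the table body only.
                string body = Regex.Replace(match.Groups[2].Value, @"<[^>]+>", string.Empty);
                return match.Groups[1].Value + body + match.Groups[3].Value;
            },
            options);
    }
}
```

Used as `input = TableTagStripper.StripTagsInsideTables(input);`, this leaves text and tags outside tables untouched. With nested tables the lazy match stops at the first closing tag, so an HTML parser is the safer choice for such input.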
