I want to extract all table rows from an HTML page, but the pattern @"<tr>([\w\W]*)</tr>" is not working. It gives a single result that runs from the first occurrence of <tr> to the last occurrence of </tr>, whereas I want every <tr>...</tr> occurrence as a separate match. Can anyone please tell me how I can do this?

2 Answers

[\w\W]* matches greedily, so it consumes as much as it can and the match runs from the first <tr> to the last </tr>.

A regex approach won't work well because HTML is not a regular language. If you really want to try, you could use a lazy quantifier, such as "<tr>(.*?)</tr>" with the RegexOptions.Singleline flag; however, this isn't guaranteed to work in all cases.
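To make the difference concrete, here is a minimal sketch comparing the greedy pattern from the question with the lazy one. The sample HTML and the `GreedyVsLazy` / `CountMatches` names are made up for illustration; only `Regex.Matches` and `RegexOptions` come from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Hypothetical sample input; the second row spans several lines on purpose.
    public const string Html =
        "<table>\n<tr><td>a</td></tr>\n<tr>\n<td>b</td>\n</tr>\n</table>";

    // Counts how many times the pattern matches the sample input.
    public static int CountMatches(string pattern, RegexOptions options)
    {
        return Regex.Matches(Html, pattern, options).Count;
    }

    public static void Main()
    {
        // Greedy: a single match running from the first <tr> to the last </tr>.
        Console.WriteLine(CountMatches(@"<tr>([\w\W]*)</tr>", RegexOptions.None));   // 1

        // Lazy + Singleline: '.' also matches '\n', and '*?' stops at the nearest </tr>.
        Console.WriteLine(CountMatches("<tr>(.*?)</tr>", RegexOptions.Singleline));  // 2

        // Lazy without Singleline: '.' cannot cross a newline, so the multi-line row is missed.
        Console.WriteLine(CountMatches("<tr>(.*?)</tr>", RegexOptions.None));        // 1
    }
}
```

The last case is why the Singleline flag matters here: without it, rows that contain line breaks are silently skipped.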

For parsing HTML you need an HTML parser. Try HTML Agility Pack.
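As a rough sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced in the project (the sample HTML and the `RowExtractor` class name are hypothetical):

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

public static class RowExtractor
{
    public static void Main()
    {
        // Hypothetical sample input; note the attribute on the first row.
        string html = "<table><tr class=\"odd\"><td>a</td></tr><tr><td>b</td></tr></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath "//tr" selects every <tr> element, with or without attributes.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null)
        {
            Console.WriteLine("No rows found.");
            return;
        }

        foreach (HtmlNode row in rows)
        {
            Console.WriteLine(row.InnerHtml);
        }
    }
}
```

Because the parser works on the document tree rather than on raw text, attributes, casing, and line breaks inside the rows do not need special handling.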

3 Comments

And we all know what happens when you try to parse html with a regex... stackoverflow.com/questions/1732348/…
Another question: is there any way I can do it using regex?
This page shows a quick example of how the HTML Agility Pack library can be used: htmlagilitypack.codeplex.com/wikipage?title=Examples

I do agree with Mark: you should use the HTML Agility Pack library.

As for your regex, you should go with something like:

@"<tr>([\s\S]*?)</tr>"

That's a non-greedy pattern, so you should get one match for every TR.
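A minimal sketch of how the matches could be collected follows. The `TableRows.Extract` helper, the sample HTML, the `RegexOptions.IgnoreCase` flag, and the attribute-tolerant second pattern are all assumptions added for illustration, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TableRows
{
    // Returns the text captured between each opening <tr> and its nearest </tr>.
    public static List<string> Extract(string html, string pattern)
    {
        var rows = new List<string>();
        // IgnoreCase is an added assumption so that <TR> is matched as well.
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
        {
            rows.Add(m.Groups[1].Value);
        }
        return rows;
    }

    public static void Main()
    {
        // Hypothetical sample input: the second row carries an attribute.
        string html =
            "<table>\n<tr><td>a</td></tr>\n<tr class=\"odd\">\n<td>b</td>\n</tr>\n</table>";

        // The pattern from the answer only matches a bare <tr>.
        Console.WriteLine(Extract(html, @"<tr>([\s\S]*?)</tr>").Count);          // 1

        // Assumed variant: also allows attributes inside the opening tag.
        Console.WriteLine(Extract(html, @"<tr\b[^>]*>([\s\S]*?)</tr>").Count);   // 2
    }
}
```

Even the second pattern will still trip over nested tables or a `>` inside an attribute value, which is the kind of case where a real HTML parser is the safer choice.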

1 Comment

Another question... Can you recommend a link or a book where I can learn regex in C# properly?
