Possible Duplicate:
Best methods to parse HTML with PHP

I'm having a bit of trouble matching table rows with preg. Here is my expression:

<TR[a-z\=\"a-z0-9 ]*>([\{\}\(\)\^\=\$\&\.\_\%\#\!\@\=\<\>\:\;\,\~\`\'\*\?\/\+\|\[\]\|\-a-zA-Z0-9À-ÿ\n\r ]*)<\/TR>

As you can see, it tries to match everything in between the TR tags (including all symbols). That part works great. However, when dealing with multiple table rows, it often takes several rows as ONE match, rather than producing a match for each table row:

<TR>
 <TD>test</TD>
</TR>
<TR>
 <TD>test2</TD>
</TR>

yields:

Array
    (
        [0] => <TD>test</TD>
               <TD>test2</TD>
    )

rather than what I want it to:

Array
    (
        [0] => <TD>test</TD>
        [1] => <TD>test2</TD>
    )

I realize that the reason for this is that the character class also matches the symbols that make up the tags themselves, so the search naturally takes the rest of the rows until it hits the last </TR>.

So basically, I'm wondering if someone can help me add to the expression so that it will exclude anything with "TR" in between the TR tags, so as to prevent it from matching multiple rows.
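
One way to express "anything except TR between the TR tags" is a negative lookahead that is checked before every consumed character. The following is only a sketch against the sample rows above: the `$html`, `$pattern`, and `$rows` names are made up for illustration, the long character class is replaced by `.` with the `s` flag, and nested tables are not handled.

```php
<?php
// Sample input taken from the question.
$html = "<TR>\n <TD>test</TD>\n</TR>\n<TR>\n <TD>test2</TD>\n</TR>";

// (?:(?!</?TR\b).)* consumes one character at a time, but only while the
// upcoming text is not the start of an opening or closing TR tag.
// Flags: i = case-insensitive, s = let "." match newlines as well.
$pattern = '~<TR\b[^>]*>((?:(?!</?TR\b).)*)</TR>~is';

preg_match_all($pattern, $html, $matches);

// $matches[1] holds the captured row contents, one entry per row.
$rows = array_map('trim', $matches[1]);
print_r($rows);
```

With the two sample rows, `$rows` should contain one entry per row instead of a single merged match.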

5 Comments
  • Use the PHP DOM to do this, not regex. Using regex to parse HTML is generally considered a bad idea. A (somewhat entertaining) take on it: stackoverflow.com/questions/1732348/… Commented Sep 2, 2011 at 21:04
  • (related) Best Methods to parse HTML Commented Sep 2, 2011 at 21:04
  • Do you have an option to use a PHP HTML Parser instead of regex? Commented Sep 2, 2011 at 21:04
  • Instead of doing anything manually: there are ready-made HTML table extraction libraries for PHP. Commented Sep 2, 2011 at 21:09
  • It doesn't answer your question, but don't do this: [\{\}\(\)\^\=\$\&\.\_\%\#\!\@\=\<\>\:\;\,\~\`\'\*\?\/\+\|\[\]\|\-a-zA-Z0-9À-ÿ\n\r ] because it's a horrible mess, and almost none of those characters need a backslash. You only need to escape [ and ] and \ and - (when not first/last) and ^ (when first). Here's a much easier to read version: [{}()^=$&._%#!@<>:;,~`'*?/+\[\]|\-a-zA-Z0-9À-ÿ\n\r ] Commented Sep 4, 2011 at 21:52
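
The point about unnecessary backslashes in the last comment can be checked with a small experiment. This is a sketch using only a subset of the symbols from the question; the `$escaped`, `$plain`, and `$sameResult` names are invented for the example.

```php
<?php
// Two spellings of the same character class. Most regex metacharacters
// lose their special meaning inside [...], so the backslashes are optional.
$escaped = '/^[\{\}\(\)\$\.\*\?\+\|]$/';
$plain   = '/^[{}()$.*?+|]$/';

$sameResult = true;
// Compare both patterns on every printable ASCII character.
for ($code = 32; $code < 127; $code++) {
    $char = chr($code);
    if (preg_match($escaped, $char) !== preg_match($plain, $char)) {
        $sameResult = false;
    }
}
var_dump($sameResult);
```

If the two classes really are equivalent, `$sameResult` stays true for every character tested.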

2 Answers

4

Use lazy matching in your regex: <tr.*?</tr>

But as others have mentioned, it's more robust to use a proper parser if you can.
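
As a rough illustration of the lazy approach applied to the rows from the question: the bare pattern needs two flags to work on that input, `s` so that `.` also matches the newlines inside a row, and `i` because the sample markup uses upper-case tags. The variable names are assumptions made for this sketch.

```php
<?php
// Sample input taken from the question.
$html = "<TR>\n <TD>test</TD>\n</TR>\n<TR>\n <TD>test2</TD>\n</TR>";

// .*? is lazy: it expands only as far as the first </tr> it can reach,
// so each row becomes its own match instead of one merged match.
preg_match_all('~<tr\b[^>]*>(.*?)</tr>~is', $html, $matches);

// Group 1 is the content between the row tags; trim the surrounding whitespace.
$rows = array_map('trim', $matches[1]);
print_r($rows);
```

This still breaks on nested tables, since the lazy match ends at the first inner closing tag, which is one reason a parser is the safer choice.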

1 Comment

I have tried simple html parser and ganon, but both failed on the broken HTML that I have to parse.
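
For broken markup, PHP's built-in DOM extension may be worth a try, since it relies on libxml's error-tolerant HTML parser. The sketch below assumes the dom extension is available and uses a deliberately sloppy sample invented for this example. Whether it copes with a particular broken page is not guaranteed.

```php
<?php
// Deliberately sloppy markup: the second row is missing its closing tags.
$html = '<table><TR><TD>test</TD></TR><TR><TD>test2</table>';

// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$rows = [];
foreach ($doc->getElementsByTagName('tr') as $tr) {
    // Serialise each child node of the row to rebuild its inner HTML.
    $inner = '';
    foreach ($tr->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    $rows[] = trim($inner);
}
print_r($rows);
```

Note that the parser normalises the markup, so the serialised rows may differ in tag case and closing tags from the original source.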
2

Try using global search:

preg_match_all("/<td>([^<]+)/", $html, $matches);

2 Comments

That almost works, however I need everything in between the <tr> tags, not just individual items from the td tags. Instead of just excluding the "<" from the "[^<]" in your expression, would it somehow be possible to exclude the string "TR" or even "<TR>"?
Try setting the s, i and m flags and replacing td with tr in the regex: /<tr>([<]+)/sim
