I have some HTML source code and need to extract some text from it. I cannot use DOM, because the document isn't well-formed.

The source might change later without my being aware of it, so the solution should hold up in most situations.

I'm fetching the source with cURL, and I plan to process it with the preg_match_all function and regular expressions.

Source :
...
<TR Class="Head1">
<TD width="15%"><font size="12">Name</font></TD>
<TD>:&nbsp;</TD>
<TD align="center"><font color="red">Alex</font></TD>
<TD width="25%"><b>Job</b></TD>
<TD>:&nbsp;</B></TD>
<TD align="center" width="25%"><font color="red">Doctor</font></TD>
</TR>
...
...
<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>:&nbsp;</TD>
<TD align="center"><font color="red">32</font></TD>
<TD width="15%"><font size="10">data</TD></font>
<TD>&nbsp;</B></TD>
<TD width="40%">&nbsp;</TD>
</TR>
...

As you can see, the source is not well-formed. In fact, it's terrible! But there is nothing I can do about that. The actual source is longer than this.

How can I get the data out of the source? I could strip all of the HTML tags, but then how would I know the order of the data? What can I do with preg_match_all and regex? What else can I do?

I'm waiting for your help.

    Have you tried using DOM? You can suppress errors with @, and it still works even if the document isn't well-formed. (Commented Jan 26, 2011 at 23:39)

4 Answers

If you can use the DOM, it is far better than regexes. Take a look at PHP Tidy: it's designed to handle badly formed HTML.
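
As a rough illustration of the Tidy-then-DOM idea, the following sketch repairs a fragment and then reads its cells. The sample markup is shortened from the question and wrapped in a `<table>` purely for illustration. Tidy is an optional extension, so the sketch assumes it may be missing and falls back to the raw markup in that case.

```php
<?php
// Sample fragment, abbreviated from the question and wrapped in <table> for illustration.
$html = '<table><TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>:&nbsp;</TD>
<TD align="center"><font color="red">32</font></TD>
</TR></table>';

// Tidy is an optional extension, so keep the raw markup if it is not available.
if (function_exists('tidy_repair_string')) {
    $repaired = tidy_repair_string($html, ['output-xhtml' => true], 'utf8');
    if ($repaired !== false) {
        $html = $repaired;
    }
}

// Collect any remaining parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Print the text of every table cell, one per line.
$cells = $doc->getElementsByTagName('td');
foreach ($cells as $td) {
    echo $td->textContent, "\n";
}
```

Whether the Tidy step is needed at all depends on how broken the real page is; for the fragment in the question, DOMDocument appears to cope on its own.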

1 Comment

+1 - I added PHP Tidy to my answer when I remembered that TagSoup is in Java (and this question is in PHP) but you had it in your answer first.

You can use DOMDocument to load badly formed HTML:

$doc = new DOMDocument();
@$doc->loadHTML('<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>:&nbsp;</TD>
<TD align="center"><font color="red">32</font></TD>
<TD width="15%"><font size="10">data</TD></font>
<TD>&nbsp;</B></TD>
<TD width="40%">&nbsp;</TD>
</TR>');


$tds = $doc->getElementsByTagName('td');
foreach ($tds as $td) {
 echo $td->textContent, "\n";
}

I'm suppressing warnings in the above code for brevity.

Output:

Age
: 
32
data
  <!-- space -->
  <!-- space -->

Using regex to parse HTML can be a futile effort as HTML is not a regular language.
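
On the question of knowing which value belongs to which label, one possible approach is to walk the rows and group the cells by position. This is only a sketch: `extractPairs` is a hypothetical helper name, and it assumes the layout shown in the question, where cells repeat as label, separator, value.

```php
<?php
// Hypothetical helper: map each label cell to its value cell, row by row.
// Assumes cells repeat in groups of three: label, separator, value.
function extractPairs(string $html): array
{
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $pairs = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            // Strip whitespace, non-breaking spaces and ":" from both ends.
            $cells[] = preg_replace('/^[\s:\x{A0}]+|[\s:\x{A0}]+$/u', '', $td->textContent);
        }
        // Keep the first cell of each group as the label and the third as the value.
        foreach (array_chunk($cells, 3) as $group) {
            if ($group[0] !== '') {
                $pairs[$group[0]] = $group[2] ?? '';
            }
        }
    }
    return $pairs;
}

// The two rows from the question, wrapped in <table> for illustration.
$html = '<table>
<TR Class="Head1">
<TD width="15%"><font size="12">Name</font></TD>
<TD>:&nbsp;</TD>
<TD align="center"><font color="red">Alex</font></TD>
<TD width="25%"><b>Job</b></TD>
<TD>:&nbsp;</B></TD>
<TD align="center" width="25%"><font color="red">Doctor</font></TD>
</TR>
<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>:&nbsp;</TD>
<TD align="center"><font color="red">32</font></TD>
<TD width="15%"><font size="10">data</TD></font>
<TD>&nbsp;</B></TD>
<TD width="40%">&nbsp;</TD>
</TR>
</table>';

$pairs = extractPairs($html);
print_r($pairs);
```

Grouping by position rather than by attribute values means that a change in widths or colours should not matter, but a change in the number of cells per entry would break it.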

1 Comment

As you said, I think regex is not useful for this. A non-well-formed HTML document can be processed with Tidy and DOM, or with SimpleHTMLDom alone.

Don't use RegEx. The link is funny but not informative, so the long and short of it is that HTML markup is not a regular language, hence cannot be parsed simply using regular expressions.

You could use RegEx to parse individual 'tokens' ( a single open tag; a single attribute name or value...) as part of a recursive parsing algorithm, but you cannot use a magic RegEx to parse HTML all on its own.
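
To make the "individual tokens" idea concrete, here is a minimal sketch of a tokenizer that uses a regex only to split markup into tag tokens and text tokens. The function name `tokenize` and the token array shape are made up for illustration, and the pattern assumes that attribute values never contain `>`.

```php
<?php
// Hypothetical tokenizer: the regex only recognises single tokens, not nesting.
// Assumes attribute values never contain ">".
function tokenize(string $html): array
{
    // Either one tag (<...>) or one run of text up to the next "<".
    preg_match_all('~<[^>]*>|[^<]+~', $html, $matches);

    $tokens = [];
    foreach ($matches[0] as $raw) {
        $tokens[] = [
            'type' => $raw[0] === '<' ? 'tag' : 'text',
            'raw'  => $raw,
        ];
    }
    return $tokens;
}

$tokens = tokenize('<TD align="center"><font color="red">32</font></TD>');
foreach ($tokens as $token) {
    echo $token['type'], ': ', $token['raw'], "\n";
}
```

Anything that depends on nesting, such as deciding which cell a piece of text belongs to, still has to be done by the code that consumes the tokens, which is the part a parser would otherwise handle.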

Or you could use a parser.

Since the markup isn't valid, maybe you could use TagSoup or PHP:Tidy.

3 Comments

Alright, are TagSoup and Tidy installed on the server by default?
I don't believe so; as a matter of fact, TagSoup is a Java tool (my bad!), although Tidy is apparently bundled with PHP.
A non-well-formed HTML document can be converted to well-formed HTML with Tidy, and then DOMDocument can be used on it. Thanks to all.
// Nowdoc keeps the backslashes literal; "~" serves as the pattern delimiter.
$regex = <<<'EOF'
~<TR Class="Head2">\s+<TD width="15%" align="left">Age</B></TD>\s+<TD>:&nbsp;</TD>\s+<TD align="center"><font color="red">(\d+)</font></TD>\s+<TD width="15%"><font size="10">(\w+)</TD></font>\s+<TD>&nbsp;</B></TD>\s+<TD width="40%">&nbsp;</TD>\s+</TR>~
EOF;

// $text holds the fetched HTML source.
preg_match_all($regex, $text, $result);

var_dump($result);
