How Can I Get Data From HTML Source Code with PHP and RegEx?

Question

I have HTML source code and need to extract some text from it. I can't use DOM, because the document isn't well-formed.

The source may also change later without my knowing, so the solution needs to hold up in most situations.

I'm fetching the source with cURL, and I plan to process it with the preg_match_all function and regular expressions.

Source :
...
<TR Class="Head1">
<TD width="15%">Name</TD>
<TD>: </TD>
<TD align="center">Alex</TD>
<TD width="25%">Job</TD>
<TD>: </TD>
<TD align="center" width="25%">Doctor</TD>
</TR>
...
...
<TR Class="Head2">
<TD width="15%" align="left">Age</TD>
<TD>: </TD>
<TD align="center">32</TD>
<TD width="15%">data</TD>
<TD> </TD>
<TD width="40%"> </TD>
</TR>
...

As you can see, the source is not well-formed. In fact, it's terrible! But there is nothing I can do about it. The real source is longer than this.

How can I get the data out of the source? I could strip all of the HTML tags, but then how would I know the order of the data? What can I do with preg_match_all and regex? What else can I do?

I'm waiting for your help.

Have you tried using DOM? You can suppress errors with @, and it still works even if the markup isn't well-formed.
– Jake N, Commented Jan 26, 2011 at 23:39

Richard H · Accepted Answer · 2011-01-26 23:39:32Z

2

If you can use the DOM, that is far better than regexes. Take a look at PHP Tidy: it's designed to handle badly formed HTML.
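The Tidy-then-DOM idea can be sketched roughly as follows. This is only an illustration, not the answerer's code: the helper names `repairHtml` and `cellTexts` are hypothetical, the Tidy options shown are one plausible configuration, and the sketch assumes the Tidy extension may be missing (in which case it falls back to the raw markup).

```php
<?php
// Hypothetical helper: repair markup with Tidy when the extension is loaded.
function repairHtml(string $html): string
{
    if (!function_exists('tidy_repair_string')) {
        return $html; // Tidy is an optional extension; fall back to the raw markup.
    }
    $repaired = tidy_repair_string($html, ['output-xhtml' => true, 'wrap' => 0], 'utf8');
    return $repaired === false ? $html : $repaired;
}

// Hypothetical helper: return the trimmed text of every <td> in the document.
function cellTexts(string $html): array
{
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $doc = new DOMDocument();
    $doc->loadHTML(repairHtml($html));
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($doc->getElementsByTagName('td') as $td) {
        $texts[] = trim($td->textContent);
    }
    return $texts;
}
```

Calling `cellTexts()` on the fetched page should yield the cell texts in document order, which is what later pairing of labels and values depends on.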

answered Jan 26, 2011 at 23:39

Richard H


1 Comment

Richard JP Le Guen Over a year ago

+1 - I added PHP Tidy to my answer when I remembered that TagSoup is in Java (and this question is in PHP) but you had it in your answer first.

webbiedave · Accepted Answer · 2011-01-27 00:18:53Z

1

You can use DOMDocument to load badly formed HTML:

$doc = new DOMDocument();
@$doc->loadHTML('<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>:&nbsp;</TD>
<TD align="center"><font color="red">32</font></TD>
<TD width="15%"><font size="10">data</TD></font>
<TD>&nbsp;</B></TD>
<TD width="40%">&nbsp;</TD>
</TR>');


$tds = @$doc->getElementsByTagName('td');
foreach ($tds as $td) {
 echo $td->textContent, "\n";
}

I'm suppressing warnings in the above code for brevity.

Output:

Age
: 
32
data
  <!-- space -->
  <!-- space -->

Using regex to parse HTML can be a futile effort as HTML is not a regular language.
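To address the question of keeping the data in sequence, one possible follow-up is to walk each row and pair labels with values. The sketch below assumes every row is laid out as repeating "label, colon, value" cell triples, as in the sample source, and that the rows sit inside a table; the function name `extractFields` is hypothetical. It uses `libxml_use_internal_errors()` rather than `@` to keep parser warnings quiet.

```php
<?php
// Hypothetical helper: map "label : value" cell triples to an associative array.
function extractFields(string $html): array
{
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    foreach ($doc->getElementsByTagName('tr') as $row) {
        $cells = [];
        foreach ($row->getElementsByTagName('td') as $td) {
            // Treat non-breaking spaces as ordinary spaces before trimming.
            $cells[] = trim(str_replace("\u{00A0}", ' ', $td->textContent));
        }
        // Assumed layout: label, ":" separator, value, repeated across the row.
        for ($i = 0; $i + 2 < count($cells); $i += 3) {
            if ($cells[$i] !== '' && $cells[$i + 1] === ':') {
                $fields[$cells[$i]] = $cells[$i + 2];
            }
        }
    }
    return $fields;
}
```

Triples whose middle cell is not a colon (such as the trailing "data" cells in the second row) are skipped. If the page layout changes, the triple assumption is the part most likely to need adjusting.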

answered Jan 27, 2011 at 0:18

webbiedave


1 Comment

maozturk Over a year ago

As you said, I think regex is not useful for this. A non-well-formed HTML document can be processed with Tidy and DOM, or with SimpleHTMLDom alone.

Community · Accepted Answer · 2017-05-23 12:31:04Z

0

Don't use RegEx. The link is funny but not informative, so the long and short of it is that HTML markup is not a regular language and hence cannot be parsed using regular expressions alone.

You could use RegEx to parse individual 'tokens' (a single open tag, a single attribute name or value, ...) as part of a recursive parsing algorithm, but you cannot use a magic RegEx to parse HTML all on its own.

Or you could use a parser.

Since the markup isn't valid, maybe you could use TagSoup or PHP:Tidy.
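As a rough illustration of the token-level use of regular expressions mentioned above, the sketch below splits markup into open-tag, close-tag, and text tokens. The function name `tokenize` and the token shape are hypothetical, and the pattern is deliberately naive: it assumes no `>` appears inside attribute values and ignores comments and scripts. A real parser would still have to consume these tokens and track nesting.

```php
<?php
// Hypothetical tokenizer: regexes classify single tokens, they do not parse the document.
function tokenize(string $html): array
{
    // Split on anything that looks like a tag, keeping the tags themselves.
    $parts = preg_split('~(<[^>]*>)~', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $tokens = [];
    foreach ($parts as $part) {
        if (preg_match('~^</\s*([a-z][a-z0-9]*)~i', $part, $m)) {
            $tokens[] = ['type' => 'close', 'value' => strtolower($m[1])];
        } elseif (preg_match('~^<\s*([a-z][a-z0-9]*)~i', $part, $m)) {
            $tokens[] = ['type' => 'open', 'value' => strtolower($m[1])];
        } elseif (trim($part) !== '') {
            $tokens[] = ['type' => 'text', 'value' => trim($part)];
        }
    }
    return $tokens;
}
```

Handling mismatched or stray tags then becomes the job of whatever consumes the token list, which is exactly the work an existing parser already does.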

edited May 23, 2017 at 12:31

CommunityBot


answered Jan 26, 2011 at 23:38

Richard JP Le Guen


3 Comments

maozturk Over a year ago

Alright, are TagSoup and Tidy installed on the server by default?

Richard JP Le Guen Over a year ago

Not as far as I know. In fact, TagSoup is a Java tool (my bad!), although Tidy is apparently bundled with PHP.

maozturk Over a year ago

A non-well-formed HTML document can be converted to well-formed HTML with Tidy, and then DOMDocument can be used. Thanks to all.

Ming-Tang · Accepted Answer · 2011-01-26 23:42:01Z

0

// Nowdoc keeps the backslashes literal; "~" delimiters are used because the pattern contains "/".
$regex = <<<'EOF'
~<TR Class="Head2">\s+<TD width="15%" align="left">Age</B></TD>\s+<TD>:&nbsp;</TD>\s+<TD align="center"><font color="red">(\d+)</font></TD>\s+<TD width="15%"><font size="10">(\w+)</TD></font>\s+<TD>&nbsp;</B></TD>\s+<TD width="40%">&nbsp;</TD>\s+</TR>~
EOF;

// $text is assumed to hold the fetched HTML source.
preg_match_all($regex, $text, $result);

var_dump($result);

answered Jan 26, 2011 at 23:42

Ming-Tang
