Using regexes to find result from HTML table

Question

I am stuck on a regular expression problem.

I have a huge HTML file and I need to extract some text (model numbers) from it.

<table>......
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr> 
.......

<table>/.....
<td colspan="2" align="center" class="thumimages"><b>SK1998</b></td></tr> 

... and so on.

This is a huge page, built entirely with tables and no divs.

The class "thumimages" is repeated on almost every td, so it offers no way to distinguish the required content from the rest of the page.

There are about 10,000 model numbers and I need to extract them.

Is there any way to do this with a regex, something like

"/<td colspan="2" align="center" class="thumimages"><b>{[1-9]}</b></td></tr>/"

and return an array of all the matched results? Note that I have tried HTML parsing, but the document contains too many HTML validation errors.

Any help would be greatly appreciated.

Don't use regular expressions to parse HTML. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged.
– Andy Lester, Commented Jun 16, 2013 at 19:31

Community · Accepted Answer · 2020-06-20 09:12:55Z

Description

This will match all td fields with class="thumimages" and retrieve the contents of the inner b tag. The inner text needs to contain at least one non-whitespace character, and any leading or trailing spaces are left out of the captured value.

<td\b(?=\s)(?=[^>]*\s\bclass=(["'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*<\/b><\/td>

Groups

Group 0 gets the entire td tag, from open tag to close tag.

Group 1 gets the opening quote around the class value, to ensure the matching closing quote is also found.
Group 2 gets the desired text.

PHP Code Example:

Input text

<table>......
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr> 
.......
<table>/.....
<td colspan="2" align="center" class="thumimages"><b>     </b></td></tr> 


<table>/.....
<td colspan="2" align="center" class="thumimages"><b>   SK1998    </b></td></tr>

Code

<?php
$sourcestring="your source string";
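// The single quote in the character class is escaped because the pattern is a single-quoted PHP string
preg_match_all('/<td\b(?=\s)(?=[^>]*\s\bclass=(["\'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*<\/b><\/td>/imsx',$sourcestring,$matches);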
echo "<pre>".print_r($matches,true);
?>

Matches

$matches Array:
(
    [0] => Array
        (
            [0] => <td colspan="2" align="center" class="thumimages"><b>SK10014</b></td>
            [1] => <td colspan="2" align="center" class="thumimages"><b>   SK1998    </b></td>
        )

    [1] => Array
        (
            [0] => "
            [1] => "
        )

    [2] => Array
        (
            [0] => SK10014
            [1] => SK1998
        )

)
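For the original goal, a flat array of model numbers, only group 2 is needed. Here is a minimal sketch, assuming the HTML is already loaded into a string. The names `$html`, `$pattern` and `$models` are illustrative, and the sample markup is copied from the input text above rather than from the real page:

```php
<?php
// Sample markup mirroring the input text above (an assumption about the real page).
$html = <<<'HTML'
<table>
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>
<td colspan="2" align="center" class="thumimages"><b>     </b></td></tr>
<td colspan="2" align="center" class="thumimages"><b>   SK1998    </b></td></tr>
HTML;

// "~" as the delimiter avoids escaping "/"; the single quote inside the
// single-quoted PHP string is written as \'.
$pattern = '~<td\b(?=\s)(?=[^>]*\s\bclass=(["\'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*</b></td>~i';

preg_match_all($pattern, $html, $matches);

// Group 2 holds the text of each <b> without surrounding spaces;
// drop duplicates and reindex the result.
$models = array_values(array_unique($matches[2]));

print_r($models);
```

The whitespace-only cell is skipped because the pattern requires at least one non-space character inside the b tag.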

Casimir et Hippolyte · Accepted Answer · 2013-06-16 21:10:14Z

1

Method with DOMDocument:

// $html stands for your html content
$doc = new DOMDocument();
@$doc->loadHTML($html);
$td_nodes = $doc->getElementsByTagName('td');

foreach ($td_nodes as $td_node) {
    if ($td_node->getAttribute('class') == 'thumimages') {
        echo $td_node->firstChild->textContent.'<br/>';
    }
}
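Since the question mentions that the document has many validation errors, here is a sketch of the same DOMDocument approach with libxml's warnings collected internally rather than silenced with `@`. The deliberately malformed sample markup is an assumption about what the real page looks like, and `$models` is an illustrative name:

```php
<?php
// Deliberately malformed sample: unclosed <table> tags and stray </tr> tags
// (assumed to resemble the real page).
$html = '<table><td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>'
      . '<table><td colspan="2" align="center" class="thumimages"><b>   SK1998    </b></td></tr>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Discard the collected errors and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$models = [];
foreach ($doc->getElementsByTagName('td') as $td) {
    if ($td->getAttribute('class') === 'thumimages') {
        $text = trim($td->textContent);
        if ($text !== '') {          // skip cells that contain only whitespace
            $models[] = $text;
        }
    }
}

print_r($models);
```

libxml's HTML parser is fairly tolerant of broken markup, so this may work even where a strict validator reports many errors. Whether it copes with the actual page can only be checked by trying it.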

Method with regex:

$pattern = <<<'LOD'
~
<td (?>[^>c]++|\bc(?!lass\b))+ # beginning of the td tag up to the word "class"
class \s*+ = \s*+              # "class=" with variable spaces around the "="
(["']?+) thumimages\b \1       # "thumimages" between quotes or not 
(?>[^>]++|(?<!b)>)+>           # all characters until the ">" from "<b>"
\s*+  \K                       # any spaces and pattern reset

[^<\s]++                    # all chars that are not a "<" or a space
~xi
LOD;

preg_match_all($pattern, $html, $matches);

echo '<pre>' . print_r($matches[0], true);
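The `\K` in the pattern above resets the start of the reported match, which is why `$matches[0]` contains only the model numbers and no capture group is needed. Here is a reduced sketch of that idea, using a simplified pattern (not the answer's full one) and made-up sample markup:

```php
<?php
// Made-up sample input for illustration.
$html = '<td class="thumimages"><b>  SK10014</b></td><td class="thumimages"><b>SK1998 </b></td>';

// Everything to the left of \K must match, but is dropped from the reported match.
$pattern = '~<td\b[^>]*\bclass="thumimages"[^>]*>\s*<b>\s*\K[^<\s]+~i';

preg_match_all($pattern, $html, $matches);

print_r($matches[0]);
```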

edited Jun 16, 2013 at 21:10

answered Jun 16, 2013 at 20:34

1 Comment

Ro Yo Mi Over a year ago

I agree that HTML parsing is probably the best solution; however, the requester did leave a comment on another answer here saying that the HTML source was poorly formatted and was producing validation errors.

transilvlad · Accepted Answer · 2013-06-16 19:31:54Z

0

/(<td colspan="2" align="center" class="thumimages"><b>)([a-z0-9]+)(<\/b><\/td><\/tr>)/i

This works.

answered Jun 16, 2013 at 19:31

3 Comments

Gaurav Mehra Over a year ago

I am getting blank arrays with this: Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) [3] => Array ( ) )

Gaurav Mehra Over a year ago

I used preg_match_all('|(<td colspan="2" align="center" class="thumimages"><b>)([a-z0-9]+)(</b></td></tr>)|i', $content, $matchesarray);

transilvlad Over a year ago

I think you need to escape certain HTML characters with a \, such as / and " and perhaps =.
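On the escaping point: in PCRE, of the characters mentioned only the one used as the pattern delimiter has to be escaped inside the pattern, while `"` and `=` are matched literally. Here is a small sketch comparing the two delimiter choices, assuming the markup matches the question's sample exactly (the variable names are illustrative):

```php
<?php
// One line copied from the question's sample.
$content = '<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>';

// With "/" as the delimiter, every literal "/" in the pattern is written "\/".
$withSlash = '/(<td colspan="2" align="center" class="thumimages"><b>)([a-z0-9]+)(<\/b><\/td><\/tr>)/i';

// With "~" as the delimiter, "/" needs no escaping.
$withTilde = '~(<td colspan="2" align="center" class="thumimages"><b>)([a-z0-9]+)(</b></td></tr>)~i';

preg_match_all($withSlash, $content, $a);
preg_match_all($withTilde, $content, $b);

print_r($a[2]);
print_r($b[2]);
```

If both variants still return empty arrays on the real file, the markup there probably differs from the sample in some small way, such as extra whitespace or a different attribute order. That is only a guess without seeing the file.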

Khawer Zeshan · Accepted Answer · 2013-06-16 19:36:37Z

0

You can use PHP's DOMDocument class:

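<?php
    $dom = new DOMDocument();
    @$dom->loadHTMLFile('load.html'); // @ suppresses warnings from malformed markup
    $xpath = new DOMXPath($dom);

    foreach ($xpath->query('//tr') as $tr) {
        // item(0) is null when a row has no matching cell, so check before reading it
        $td = $xpath->query('.//td[@class="thumimages"]', $tr)->item(0);
        if ($td !== null) {
            echo $td->nodeValue.'<br/>';
        }
    }
?>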
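A possible variation is to let XPath select the b elements directly, so rows without a matching cell never enter the loop. This is a sketch on a small inline sample rather than `load.html`; the sample markup and the `$models` name are assumptions for illustration:

```php
<?php
// Small, well-formed sample standing in for the real file.
$html = '<table><tr><td class="thumimages"><b>SK10014</b></td></tr>'
      . '<tr><td class="other">skip</td></tr>'
      . '<tr><td class="thumimages"><b> SK1998 </b></td></tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$models = [];
// Select the <b> inside each matching cell; non-matching cells are never returned.
foreach ($xpath->query('//td[@class="thumimages"]/b') as $b) {
    $models[] = trim($b->nodeValue);
}

print_r($models);
```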

answered Jun 16, 2013 at 19:36

1 Comment

Gaurav Mehra Over a year ago

Tried it, but the document contains too many HTML validation errors.
