I am stuck on a regular expression problem.

I have a huge HTML file and I need to extract some text (the model numbers) from it.

<table>......
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr> 
.......

<table>/.....
<td colspan="2" align="center" class="thumimages"><b>SK1998</b></td></tr> 

.... so on

It is a huge page, built entirely with tables and no divs.

The class "thumimages" is repeated on almost every td, so it offers no way to tell the required content apart from the rest of the page.

There are about 10,000 model numbers and I need to extract them.

Is there any way to do this with a regex, something like

"/<td colspan="2" align="center" class="thumimages"><b>{[1-9]}</b></td></tr>/"

and return an array of all the matched results? Note: I have tried HTML parsing, but the document contains too many HTML validation errors.

Any help would be greatly appreciated.

    Don't use regular expressions to parse HTML. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. Commented Jun 16, 2013 at 19:31

4 Answers

Description

This will match all td fields with class="thumimages" and retrieve the contents of the inner b tag. The inner text must contain at least one non-space character, and any leading or trailing spaces are removed.

<td\b(?=\s)(?=[^>]*\s\bclass=(["'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*<\/b><\/td>

Groups

Group 0 gets the entire td tag, from the open tag to the close tag.

  1. gets the opening quote around the class value, so that the same character is required as the closing quote
  2. gets the desired text

PHP Code Example:

Input text

<table>......
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr> 
.......
<table>/.....
<td colspan="2" align="center" class="thumimages"><b>     </b></td></tr> 


<table>/.....
<td colspan="2" align="center" class="thumimages"><b>   SK1998    </b></td></tr> 

Code

<?php
$sourcestring="your source string";
// The single quote inside the character class is escaped so it does not end the PHP string
preg_match_all('/<td\b(?=\s)(?=[^>]*\s\bclass=(["\'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*<\/b><\/td>/imsx',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
 

Matches

$matches Array:
(
    [0] => Array
        (
            [0] => <td colspan="2" align="center" class="thumimages"><b>SK10014</b></td>
            [1] => <td colspan="2" align="center" class="thumimages"><b>   SK1998    </b></td>
        )

    [1] => Array
        (
            [0] => "
            [1] => "
        )

    [2] => Array
        (
            [0] => SK10014
            [1] => SK1998
        )

)
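
If only the model numbers are needed, the second capture group can be used on its own. A minimal sketch, assuming input like the sample above; the variable names `$html` and `$models` are illustrative and not part of the original answer, and the de-duplication step is an optional extra:

```php
<?php
// Sample input modelled on the question; the real page would be read from a file.
$html = <<<'HTML'
<table>
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>
<td colspan="2" align="center" class="thumimages"><b>     </b></td></tr>
<td colspan="2" align="center" class="thumimages"><b>   SK1998    </b></td></tr>
HTML;

// Same pattern as above; the single quote is escaped for the PHP string.
$pattern = '/<td\b(?=\s)(?=[^>]*\s\bclass=(["\'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*<\/b><\/td>/i';

preg_match_all($pattern, $html, $matches);

// Group 2 holds the text without surrounding spaces;
// array_unique drops repeated model numbers, array_values reindexes.
$models = array_values(array_unique($matches[2]));

print_r($models);
```

The cell that contains only spaces is skipped because the pattern requires at least one non-space character inside the b tag.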

Method with DOMDocument:

// $html stands for your html content
$doc = new DOMDocument();
@$doc->loadHTML($html);
$td_nodes = $doc->getElementsByTagName('td');

foreach($td_nodes as $td_node){
    if ($td_node->getAttribute('class')=='thumimages')
        echo $td_node->firstChild->textContent.'<br/>';
 }
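
Since the question mentions many validation errors, it may help that libxml can collect parse warnings internally instead of emitting them, which is an alternative to the `@` suppression. A minimal sketch under that assumption; the sloppy sample markup and the `$models` variable are illustrative only:

```php
<?php
// Deliberately sloppy markup: unquoted attribute, unclosed <b>, unclosed <p>.
$html = '<table><tr><td colspan=2 align="center" class="thumimages"><b>SK10014</b></td></tr>'
      . '<tr><td class="thumimages"><b> SK1998 </td></tr></table><p>unclosed';

// Collect libxml warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Discard the collected warnings and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$models = [];
foreach ($doc->getElementsByTagName('td') as $td) {
    if ($td->getAttribute('class') === 'thumimages') {
        $text = trim($td->textContent);
        if ($text !== '') {
            $models[] = $text; // keep only non-empty cells
        }
    }
}

print_r($models);
```

Whether the parser recovers well enough depends on how badly the real page is broken, so this is worth testing on a small excerpt first.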

Method with regex:

$pattern = <<<'LOD'
~
<td (?>[^>c]++|\bc(?!lass\b))+ # beginning of td tag until the word "class" 
class \s*+ = \s*+              # "class=" with variable spaces around the "="
(["']?+) thumimages\b \1       # "thumimages" between quotes or not 
(?>[^>]++|(?<!b)>)+>           # all characters until the ">" from "<b>"
\s*+  \K                       # any spaces and pattern reset

[^<\s]++                    # all chars that are not a "<" or a space
~xi
LOD;

preg_match_all($pattern, $html, $matches);

echo '<pre>' . print_r($matches[0], true);
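
The `\K` in the pattern above resets the start of the reported match, which is why `$matches[0]` contains only the model numbers and no capture group is needed. A small sketch of that effect on a simplified, hypothetical pattern (not the one from the answer):

```php
<?php
$html = '<td class="thumimages"><b>  SK10014</b></td>'
      . '<td class="thumimages"><b>SK1998 </b></td>';

// Everything before \K must match, but is left out of the reported match.
preg_match_all('~<b>\s*+\K[^<\s]++~', $html, $matches);

print_r($matches[0]);
```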

1 Comment

I agree that HTML parsing is probably the best solution; however, the requester did leave a comment on another answer here saying that the HTML source was poorly formatted and was producing validation errors.

/(<td colspan="2" align="center" class="thumimages"><b>)([a-z0-9]+)(</b></td></tr>)/i

This works.

3 Comments

I am getting blank arrays with this: Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) [3] => Array ( ) )
I used preg_match_all('|(<td colspan="2" align="center" class="thumimages"><b>)([a-z0-9]+)(</b></td></tr>)|i', $content, $matchesarray);
I think you need to escape certain HTML characters with a \, like / and " and perhaps =
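
On the escaping question raised in these comments: in a PCRE pattern only the delimiter character needs a backslash when it occurs inside the pattern, so `"` and `=` can stay as they are, and a delimiter such as `~` or `|` avoids escaping the `/` in closing tags. An empty result may instead come from the markup differing slightly from the pattern, for example whitespace between `</td>` and `</tr>`. A minimal sketch, assuming that kind of whitespace; `$content` here is sample data, not the asker's page:

```php
<?php
// Sample input with a newline between </td> and </tr>.
$content = '<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td>'
         . "\n"
         . '</tr>';

// "~" as delimiter: the "/" in closing tags needs no escaping.
// \s* tolerates whitespace or newlines between the tags.
$pattern = '~<td colspan="2" align="center" class="thumimages">\s*<b>\s*([a-z0-9]+)\s*</b>\s*</td>\s*</tr>~i';

preg_match_all($pattern, $content, $matches);

print_r($matches[1]);
```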

You can use PHP's DOMDocument class:

<?php
    $dom = new DOMDocument();
    @$dom->loadHTMLFile('load.html');
    $xpath = new DOMXPath($dom);

     // Query the matching cells directly, so rows without such a cell cannot yield a null item
     foreach($xpath->query('//td[@class="thumimages"]') as $td){
        echo trim($td->nodeValue).'<br/>';
     }
?>

1 Comment

Tried it, but the document contains too many HTML validation errors.
