Extract string of HTML code with PHP

Question

This expression only captures the text between the tags (between > and <) when it is purely numeric. I want to capture it in every case.

function GetProducts($file){
    $regex = "|class=\"producto\"[^>]+>([0-9]*)</[^>]+>|U";
    if(!is_file($file)) return false;
    preg_match_all($regex,file_get_contents($file), $result);
    foreach($result[1] as $key =>$value) $result[$key] = (int) $value;
    return $result;
}

This is my HTML code:

<a class="producto" href="ver.asp?id=4013">A86028</a></span><!-- /a --></td></tr>
    <a class="producto" href="ver.asp?id=4014">1027C</a></span><!-- /a --></td></tr>
    <a class="producto" href="ver.asp?id=4014">5611 4020</a></span>
<!-- /a --></td></tr>
    <a class="producto" href="ver.asp?id=4014">396-4185</a></span>
<!-- /a --></td></tr>
    <a class="producto" href="ver.asp?id=4014">834006-5-7</a></span>
<!-- /a --></td></tr>
    <a class="producto" href="ver.asp?id=4014">5601GR 4325GR</a></span>
<!-- /a --></td></tr>
    <a class="producto" href="ver.asp?id=4014">2182CR(2)</a></span>
<!-- /a --></td></tr>
    <a class="producto" href="ver.asp?id=4014">1458-54-63-55</a></span>
<!-- /a --></td></tr>

My desired output is:

Array ([1] => 1027 [2] => 5611 [3] => 5396 [4] => 834006 [5] => 5601 [6] => 2182 [7] => 1458)

Array ( [1] => 1027 [2] => 5611 [3] => 5396 [4] => 834006 [5] => 5601 [6] => 2182 [7] => 1458 ) – Javier Sega, commented Sep 11, 2014 at 21:01

user557597 · Accepted Answer · 2014-09-11 20:37:14Z

2

This might work, but, as people say, parsing HTML with regex is problematic.

 # class="producto"[^>]+>([^<]*)</[^>]+>

 class="producto" [^>]+ >
 ( [^<]* )
 </ [^>]+ >
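As a minimal sketch, this is how the pattern could be dropped into a function like the one in the question. It takes the HTML as a string rather than a file name, and the name getProductTexts is made up for illustration. It returns the raw link texts and does not reduce them to their leading digits.

```php
<?php
// Hypothetical helper (not from the question): returns the text of every
// element whose opening tag carries class="producto".
function getProductTexts(string $html): array
{
    // Same pattern as above; '~' is used as the delimiter so that
    // neither '|' nor '/' needs escaping.
    $regex = '~class="producto"[^>]+>([^<]*)</[^>]+>~';
    preg_match_all($regex, $html, $result);

    // Capture group 1 holds the text between '>' and '</'.
    return $result[1];
}

// Two lines taken from the question's sample HTML.
$html = '<a class="producto" href="ver.asp?id=4013">A86028</a></span>' . "\n"
      . '<a class="producto" href="ver.asp?id=4014">5611 4020</a></span>';

print_r(getProductTexts($html));
```

Because [^<]* accepts anything except an opening angle bracket, letters, spaces, dashes and parentheses are all captured, not only digits.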


2 Comments

LSerni Over a year ago

To quote the bountied answer of the very post that so berates HTML regex parsing: "While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML." And this is the case here.

user557597 Over a year ago

Yeah, I could throw down a 15k regex to parse HTML and it's still problematic, especially with entities and substitutions. I reckon this applies even to a known set of HTML.

hwnd · Accepted Answer · 2014-09-11 22:02:12Z

1

You've asked for a pure regular expression here, but it's not the right tool for parsing HTML.

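// Reducer callback: if $str starts with digits, append that run of digits to $m.
function _matcher ($m, $str) {
  if (preg_match('/^\d+/', $str, $matches))
    $m[] = $matches[0];
  return $m;
}

// $html is assumed to hold the markup from the question.
libxml_use_internal_errors(true); // the fragment contains stray closing tags
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Collect the text of every <a class="producto"> element.
$vals = array();
foreach ($xpath->query('//a[@class="producto"]') as $link) {
   $vals[] = $link->nodeValue;
}

print_r(array_reduce($vals, '_matcher', array()));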

Output:

Array
(
    [0] => 1027
    [1] => 5611
    [2] => 396
    [3] => 834006
    [4] => 5601
    [5] => 2182
    [6] => 1458
)
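The DOM approach could also be wrapped in a reusable function, roughly as sketched here. The function name getProductNumbers is hypothetical, the DOM extension is assumed to be enabled, and two details go beyond the answer as written: the XPath query treats "producto" as one class token among possibly several, and the digits are cast to int as in the question's own code.

```php
<?php
// Hypothetical helper: leading digits of each a.producto link text, as ints.
function getProductNumbers(string $html): array
{
    // Collect libxml warnings internally instead of printing them;
    // the sample markup has stray closing tags such as </span>.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Match "producto" as a whole class token, even next to other classes.
    $query = '//a[contains(concat(" ", normalize-space(@class), " "), " producto ")]';

    $numbers = [];
    foreach ($xpath->query($query) as $link) {
        // Keep only link texts that start with digits (after optional whitespace).
        if (preg_match('/^\s*(\d+)/', $link->nodeValue, $m)) {
            $numbers[] = (int) $m[1];
        }
    }
    return $numbers;
}

// Three lines taken from the question's sample HTML.
$sample = '<a class="producto" href="ver.asp?id=4013">A86028</a></span>' . "\n"
        . '<a class="producto" href="ver.asp?id=4014">396-4185</a></span>' . "\n"
        . '<a class="producto" href="ver.asp?id=4014">2182CR(2)</a></span>';

print_r(getProductNumbers($sample));
```

Link texts that do not begin with a digit, such as A86028, are skipped, which matches the output shown above.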

edited Sep 11, 2014 at 22:02

answered Sep 11, 2014 at 21:37


Federico Piazza · Accepted Answer · 2014-09-11 21:06:00Z

0

You can use a regex like this:

([\w\s()-]+)</

The idea is to capture alphanumeric characters, whitespace, dashes and parentheses that come right before a closing </.
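A small sketch of applying a pattern of this kind with preg_match_all follows. The hyphen is written last in the character class here, an adjustment made on the assumption that your PHP uses PCRE2, where a hyphen directly after \s may be rejected as an invalid range. The sample string is an excerpt of the question's HTML.

```php
<?php
// Excerpt of the question's sample HTML.
$html = '<a class="producto" href="ver.asp?id=4014">396-4185</a></span>' . "\n"
      . '<!-- /a --></td></tr>' . "\n"
      . '<a class="producto" href="ver.asp?id=4014">2182CR(2)</a></span>';

// Word characters, whitespace, parentheses and '-' (kept last so it is literal),
// captured only when immediately followed by the start of a closing tag.
preg_match_all('~([\w\s()-]+)</~', $html, $matches);

print_r($matches[1]);
```

Note that this pattern is not tied to class="producto": it captures any such text that sits directly before a closing tag, so it relies on the page containing no other text nodes of that shape.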

