PHP: get an array of specific tag content

Question

I have an HTML string like this one (this is not the entire HTML):

<h2>Title A</h2>
  <p>aaaaaa</p>
  <p>bbbbbb</p>
<h2>Title B</h2>
  <p>aaaaaa</p>
  <p>bbbbbb</p>
<h2>Title C</h2>
  <p>aaaaaa</p>
  <p>bbbbbb</p>

I would like to get an array containing only the titles (from the h2 tags):

array('Title A', 'Title B', 'Title C');

I am using PHP.

I have tried

strip_tags($string, '<h2>')

but the result still contains the text of the <p> elements after each title.

Baba · Accepted Answer · 2012-09-28 09:44:14Z

15

You can try using DOMDocument:

$html = '<h2>Title A</h2>
  <p>aaaaaa</p>
  <p>bbbbbb</p>
<h2>Title B</h2>
  <p>aaaaaa</p>
  <p>bbbbbb</p>
<h2>Title C</h2>
  <p>aaaaaa</p>
  <p>bbbbbb</p>';

// The leading backslash resolves the class from the global namespace
$dom = new \DOMDocument();
$dom->loadHTML($html);

// Collect every <h2> element in document order
$items = $dom->getElementsByTagName('h2');

for ($i = 0; $i < $items->length; $i++) {
    echo $items->item($i)->nodeValue . PHP_EOL;
}

Output

Title A
Title B
Title C
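
If you need the titles as an array rather than echoed, a minimal sketch could look like the following. The helper name extractH2Titles is hypothetical (it is not part of the original answer), and the libxml_use_internal_errors() calls are an assumption on my side to keep parser warnings quiet when the input is only an HTML fragment.

```php
<?php
// Hypothetical helper: returns the text of every <h2> in $html as an array.
function extractH2Titles(string $html): array
{
    // Assumption: suppress libxml warnings for fragments without <html>/<body>.
    $previous = libxml_use_internal_errors(true);

    $dom = new \DOMDocument();
    $dom->loadHTML($html);

    // Restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = [];
    foreach ($dom->getElementsByTagName('h2') as $node) {
        $titles[] = trim($node->textContent);
    }

    return $titles;
}

$html = '<h2>Title A</h2>
  <p>aaaaaa</p>
<h2>Title B</h2>
  <p>bbbbbb</p>
<h2>Title C</h2>
  <p>cccccc</p>';

$titles = extractH2Titles($html);
print_r($titles);
```

The loop uses foreach because the node list returned by getElementsByTagName() is traversable, which avoids the manual index bookkeeping.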

edited Sep 28, 2012 at 9:44

answered Sep 28, 2012 at 9:23

Baba



11 Comments

Baba Over a year ago

@sics nice .. but that only worked because of the enclosed HTML tag .. +1

Milos Cuculovic Over a year ago

Thank you very much @Baba, I am using this in Symfony2 and I am getting an error like "Class 'DOMDocument' not found"...

hakre Over a year ago

I doubt that Symfony works without DOMDocument, so take namespaces into account when you write your code. I updated my answer.

Baba Over a year ago

@hakre I totally agree ..... +1 for the namespace tip ... I have also updated my code for namespace issues

Baba Over a year ago

OK, try replacing new DOMDocument(); with new \DOMDocument(); .. it might be a namespace issue, as hakre suggested


hakre · Accepted Answer · 2012-09-28 10:10:22Z

3

PHP already has good libraries for HTML parsing built in; here is a parser with XPath:

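// Note: calling loadHTML() statically triggers a strict-standards warning
// (and throws an Error as of PHP 8); see the non-static version below.
$h2 = array_map(
    'strval', simplexml_import_dom(\DomDocument::loadHTML($html))->xpath('//h2')
);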

Output:

array(3) {
  [0]=>
  string(7) "Title A"
  [1]=>
  string(7) "Title B"
  [2]=>
  string(7) "Title C"
}

See also the other DOMDocument-related answer; when you hear HTML and PHP, just think DOMDocument. To avoid the warning caused by calling loadHTML statically, create the document instance first:

$doc = new \DOMDocument();
$doc->loadHTML($html);
// Import into SimpleXML for easy XPath access, then cast each match to a string
$h2  = array_map(
    'strval', simplexml_import_dom($doc)->xpath('//h2')
);
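
As a variation, the same XPath query can be run directly on the DOM with DOMXPath, without the detour through SimpleXML. This is only a sketch: the sample $html string and the use of libxml_use_internal_errors() to quiet parser warnings are assumptions, not part of the answer above.

```php
<?php
// Assumed sample input, mirroring the markup from the question.
$html = '<h2>Title A</h2>
  <p>aaaaaa</p>
<h2>Title B</h2>
  <p>bbbbbb</p>
<h2>Title C</h2>
  <p>cccccc</p>';

// Assumption: keep libxml warnings about the fragment out of the output.
libxml_use_internal_errors(true);

$doc = new \DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Query all <h2> elements anywhere in the document.
$xpath = new \DOMXPath($doc);

$h2 = [];
foreach ($xpath->query('//h2') as $node) {
    $h2[] = $node->textContent;
}

var_dump($h2);
```

Whether this or the SimpleXML version reads better is mostly a matter of taste; DOMXPath is handy when you need the DOM nodes afterwards anyway.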

edited Sep 28, 2012 at 10:10

answered Sep 28, 2012 at 9:32

hakre


5 Comments

Baba Over a year ago

Impressive Code in just 3 lines

hakre Over a year ago

You get strict standard warnings however because of calling loadHTML statically, just saying.

Baba Over a year ago

+1 its like you are reading my mind ..... How can you modify the code to remove that error ???

hakre Over a year ago

@Baba: see the edit, I just make use of simplexml here to get easy access to xpath.

Baba Over a year ago

@hakre guessed as much ... was thinking you would come up with another crazy one-line code .... :)

user399666 · Accepted Answer · 2012-09-28 09:22:21Z

1

You should use a parser such as DomDocument to parse the HTML.

answered Sep 28, 2012 at 9:22

user399666


1 Comment

Milos Cuculovic Over a year ago

Thank you @Wayne, is there no easier PHP method to do this?

sics · Accepted Answer · 2012-09-28 09:28:19Z

1

Instead of DOMDocument you can use SimpleXML

http://codepad.viper-7.com/Esairr

$html = '
    <html>
        <h2>Title A</h2>
        <p>aaaaaa</p>
        <p>bbbbbb</p>
        <h2>Title B</h2>
        <p>aaaaaa</p>
        <p>bbbbbb</p>
        <h2>Title C</h2>
        <p>aaaaaa</p>
        <p>bbbbbb</p>
    </html>';
$xml = new SimpleXMLElement($html);

echo "<pre>";
print_r($xml->h2);
echo "</pre>";

Output

SimpleXMLElement Object
(
    [0] => Title A
    [1] => Title B
    [2] => Title C
)
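
Keep in mind that the SimpleXMLElement constructor expects well-formed XML and throws an exception otherwise. A hedged sketch of guarding against that is shown below; the function name simpleXmlH2Titles and the choice to return null on failure are my own assumptions for illustration.

```php
<?php
// Hypothetical helper: returns the <h2> texts, or null if the markup
// is not well-formed enough for SimpleXML to parse.
function simpleXmlH2Titles(string $markup): ?array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    try {
        $xml = new \SimpleXMLElement($markup);
    } catch (\Exception $e) {
        return null; // parsing failed, e.g. because of an unclosed tag
    } finally {
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }

    $titles = [];
    foreach ($xml->h2 as $h2) {
        $titles[] = (string) $h2; // cast each element to its text content
    }

    return $titles;
}

// Well-formed input: every tag is closed.
$good = simpleXmlH2Titles('<html><h2>Title A</h2><p>aaaaaa</p><h2>Title B</h2></html>');

// Not well-formed: <br> is never closed, which is fine in HTML but not in XML.
$bad = simpleXmlH2Titles('<html><h2>Title A</h2><p>aaaaaa<br></p></html>');

var_dump($good, $bad);
```

For real-world HTML, which is often not well-formed XML, loading through DOMDocument first is the more forgiving route.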

answered Sep 28, 2012 at 9:28

sics


2 Comments

hakre Over a year ago

The problem with this is that SimpleXML has trouble loading HTML; DOMDocument loads HTML better. That is why I have combined the two.

sics Over a year ago

Yes, @hakre's solution is better. Nevertheless, if you intend to use this approach, be aware that you can only pass "well-formed" HTML into the constructor. en.wikipedia.org/wiki/Well-formed_element

Deep Frozen · Accepted Answer · 2012-09-28 09:25:27Z

0

You could use preg_match_all:

// "#" is used as the delimiter so the "/" in "</h2>" does not end the pattern
preg_match_all("#<h2>(.*?)</h2>#si", $sResource, $aTitles);
print_r($aTitles[1]);

Parsing HTML with regular expressions like this is discouraged, though, because special characters, newlines, etc. can interfere with your script. A DOM parser is a good and easy alternative for this.
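
If you go the regex route anyway, here is a slightly more tolerant sketch. The sample $sResource string is made up for illustration, and the extra handling (attributes on the h2 tag, nested inline tags) is my own assumption about what typical input might contain; it still breaks on many valid HTML constructs.

```php
<?php
// Made-up sample input with an attribute and a nested inline tag.
$sResource = '<h2 class="x">Title <em>A</em></h2>
  <p>aaaaaa</p>
<h2>Title B</h2>
  <p>bbbbbb</p>';

// "#" as delimiter avoids clashing with the "/" in "</h2>";
// "[^>]*" tolerates attributes on the opening tag.
preg_match_all('#<h2\b[^>]*>(.*?)</h2>#si', $sResource, $aMatches);

// Remove any nested tags and surrounding whitespace from each title.
$aTitles = array_map(
    function ($sTitle) {
        return trim(strip_tags($sTitle));
    },
    $aMatches[1]
);

print_r($aTitles);
```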

answered Sep 28, 2012 at 9:25

Deep Frozen


3 Comments

Deep Frozen Over a year ago

Please give me the reason why -1. I just give the OP an option, but I did add that regex is discouraged and DOM Parser is a better option.

Leri Over a year ago

But you've still used regex to parse html. BTW, I am not downvoter.

Deep Frozen Over a year ago

@PLB everybody has been a noob before, and yes, I have done it this way before. It does work if you use the right wildcards the right way, and when the page is static it won't be a problem anytime soon. Like I said, I'm just giving him an option.
