I have an HTML string like this one (this is not the entire HTML):

<h2>Title A</h2>
  <p>aaaaaa</p>
  <p>bbbbbb</p>
<h2>Title B</h2>
  <p>aaaaaa</p>
  <p>bbbbbb</p>
<h2>Title C</h2>
  <p>aaaaaa</p>
  <p>bbbbbb</p>

I would like to get an array containing only the titles (from the h2 tags):

array('Title A', 'Title B', 'Title C');

I am using PHP.

I have tried

strip_tags($string, '<h2>')

but I still get each title followed by the text content of the <p> tags.

5 Answers

15

You can try using DOMDocument:

$html = '<h2>Title A</h2>
  <p>aaaaaa</p>
  <p>bbbbbb</p>
<h2>Title B</h2>
  <p>aaaaaa</p>
  <p>bbbbbb</p>
<h2>Title C</h2>
  <p>aaaaaa</p>
  <p>bbbbbb</p>';

$dom = new \DOMDocument();
$dom->loadHTML($html);

$items = $dom->getElementsByTagName('h2');

for ($i = 0; $i < $items->length; $i++) {
    echo $items->item($i)->nodeValue . PHP_EOL;
}

Output

Title A
Title B
Title C
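
Since the question asks for an array rather than echoed output, the same approach can collect the values instead of printing them. The following is only a sketch: the function name `extractH2Titles` is hypothetical, and the `libxml_use_internal_errors()` calls are an added assumption to keep parser warnings quiet when the input is an incomplete fragment.

```php
<?php
// Hypothetical helper (not part of the original answer):
// returns the text of every <h2> element as a plain PHP array.
function extractH2Titles(string $html): array
{
    $dom = new \DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = [];
    foreach ($dom->getElementsByTagName('h2') as $node) {
        $titles[] = trim($node->nodeValue);
    }

    return $titles;
}

$html = '<h2>Title A</h2><p>aaaaaa</p>'
      . '<h2>Title B</h2><p>bbbbbb</p>'
      . '<h2>Title C</h2><p>cccccc</p>';

$titles = extractH2Titles($html);
print_r($titles);
```

The resulting `$titles` array can then be passed around like any other array, which is closer to what the question asked for than the echo loop.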

11 Comments

@sics nice .. but that only worked because of the enclosed HTML tag .. +1
Thank you very much @Baba, I am using this in symfony2 and I am getting an error like 'DOMDocument' not found...
I doubt that symfony works w/o DomDocument, so take namespaces into account when you write your code. I updated my answer.
@hakra i totally agree ..... +1 for the namespace tip ... have also updated my code for namespace issues
ok try replacing new DOMDocument(); with new \DOMDocument(); .. it might be a namespace issue as hakra suggested
3

PHP already has good libraries for HTML parsing built in. Here is a parser using XPath:

$h2 = array_map(
    'strval', simplexml_import_dom(\DomDocument::loadHTML($html))->xpath('//h2')
);

Output:

array(3) {
  [0]=>
  string(7) "Title A"
  [1]=>
  string(7) "Title B"
  [2]=>
  string(7) "Title C"
}

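See the other DOMDocument-related answer as well. If you hear HTML and PHP, just think DomDocument.

Edit: calling loadHTML() statically raises a strict standards warning (and throws an Error as of PHP 8), so create the DomDocument instance first:
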
$doc = new DomDocument;
$doc->loadHTML($html);
$h2  = array_map(
    'strval', simplexml_import_dom($doc)->xpath('//h2')
);
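
If the SimpleXML step is not needed, the same XPath query can also be run with DOMXPath directly on the document. This is a sketch rather than part of the original answer; the sample markup (attributes, a nested `<em>`) is made up to show that `textContent` includes the text of nested elements.

```php
<?php
// Made-up sample input with attributes and nested markup inside the headings.
$html = '<div><h2 class="section">Title A</h2><p>aaaaaa</p>'
      . '<h2>Title B</h2><p>bbbbbb</p>'
      . '<h2 id="c">Title <em>C</em></h2></div>';

$doc = new \DOMDocument();
$doc->loadHTML($html);

// Query the DOM directly; no conversion to SimpleXML required.
$xpath = new \DOMXPath($doc);

$h2 = [];
foreach ($xpath->query('//h2') as $node) {
    // textContent concatenates the text of the node and all its descendants.
    $h2[] = $node->textContent;
}

var_dump($h2);
```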

5 Comments

Impressive Code in just 3 lines
You get strict standard warnings however because of calling loadHTML statically, just saying.
+1 its like you are reading my mind ..... How can you modify the code to remove that error ???
@Baba: see the edit, I just make use of simplexml here to get easy access to xpath.
@hakra guessed as much ... was thinking you would come up with another crazy one line code .... :)
1

You should use a parser such as DomDocument to parse the HTML.

1 Comment

Thank you @Wayne, is there no easier PHP method to do this?
1

Instead of DOMDocument you can use SimpleXML

http://codepad.viper-7.com/Esairr

$html = '
    <html>
        <h2>Title A</h2>
        <p>aaaaaa</p>
        <p>bbbbbb</p>
        <h2>Title B</h2>
        <p>aaaaaa</p>
        <p>bbbbbb</p>
        <h2>Title C</h2>
        <p>aaaaaa</p>
        <p>bbbbbb</p>
    </html>';
$xml = new SimpleXMLElement($html);

echo "<pre>";
print_r($xml->h2);
echo "</pre>";

Output:

SimpleXMLElement Object
(
    [0] => Title A
    [1] => Title B
    [2] => Title C
)
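
One caveat with this approach: the SimpleXMLElement constructor expects well-formed XML, so ordinary HTML that is not well-formed XML makes it throw. The sketch below uses a made-up input with an unclosed `<br>` to show the difference from DOMDocument's HTML parser; the variable names are illustrative only.

```php
<?php
// Made-up input: <br> without a closing tag is fine in HTML
// but is not well-formed XML.
$html = '<html><h2>Title A</h2><p>aaaaaa<br></p></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

try {
    $xml = new SimpleXMLElement($html);
    $parsed = true;
} catch (Exception $e) {
    // The constructor throws when the string cannot be parsed as XML.
    $parsed = false;
    echo 'SimpleXML failed: ' . $e->getMessage() . PHP_EOL;
}

// DOMDocument's HTML parser accepts the same markup.
$doc = new DOMDocument();
$doc->loadHTML($html);

$titles = [];
foreach ($doc->getElementsByTagName('h2') as $node) {
    $titles[] = $node->nodeValue;
}

libxml_clear_errors();
print_r($titles);
```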

2 Comments

The problem with this is that SimpleXML has trouble loading HTML. DomDocument can load HTML better. That is why I have combined
yes, @hakra 's solution is better. nevertheless if you tend to use this approach be aware that you can only pass "well formed" html into the constructor. en.wikipedia.org/wiki/Well-formed_element
0

you could use preg_match_all:

// The slash in </h2> must be escaped because / is also the pattern delimiter
preg_match_all('/<h2>(.*?)<\/h2>/si', $sResource, $aTitles);
print_r($aTitles[1]);

Parsing HTML with regular expressions like this is discouraged though, because special characters, attributes, newlines etc. can interfere with your pattern. A DOM parser is a good and easy alternative for this.
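
To illustrate why the regex approach is brittle, here is a small sketch with a made-up input in which one heading carries an attribute. It compares the literal `<h2>` pattern from this answer (with the closing slash escaped) against a looser pattern; the looser one is only an assumption about how one might patch it, and it would still break on other variations such as a `>` inside an attribute value.

```php
<?php
// Made-up input: the second heading has a class attribute.
$html = '<h2>Title A</h2><p>aaaaaa</p>'
      . '<h2 class="section">Title B</h2><p>bbbbbb</p>';

// Literal <h2> pattern: only matches opening tags without attributes.
preg_match_all('/<h2>(.*?)<\/h2>/si', $html, $strict);

// Looser pattern: tolerates attributes on the opening tag.
preg_match_all('/<h2\b[^>]*>(.*?)<\/h2>/si', $html, $loose);

print_r($strict[1]); // misses "Title B"
print_r($loose[1]);  // finds both headings
```

A DOM parser handles both headings without any pattern changes, which is the main reason it is the safer default.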

3 Comments

Please give me the reason why -1. I just give the OP an option, but I did add that regex is discouraged and DOM Parser is a better option.
But you've still used regex to parse html. BTW, I am not downvoter.
@PLB everybody has been a noob before, and yes, I have done it this way before. It does work if you use the right wildcards the right way, and when the page is static it won't be a problem anytime soon. Like I said, I'm just giving him an option.
