
I have a string that looks something like this:

$html_string = "<p>Some content</p><p>separated by</p><p>paragraphs</p>";

I'd like to do some parsing on the content inside the tags, so I think that creating an array from this would be easiest. Currently I'm using a series of explode and implode calls to achieve what I want:

$stripped = explode('<p>', $html_string);
$joined = implode(' ', $stripped);
$parsed = explode('</p>', $joined);

which in effect gives (ignoring the leading spaces and the trailing empty element that this approach leaves behind):

array('Some content', 'separated by', 'paragraphs'); 

Is there a better, more robust way to create an array from HTML tags? Looking at the docs, I didn't see any mention of parsing via a regular expression.

Thanks for your help!

  • Parsing with DOMDocument. Commented Aug 12, 2016 at 20:36
  • or SimpleXML extension. Commented Aug 12, 2016 at 20:39
  • DOMDocument is the best way to parse HTML, but there is also php.net/manual/en/function.preg-split.php for regex exploding. Commented Aug 12, 2016 at 20:39

3 Answers

1

If it's only that simple, with no other tags inside the content, you can simply use a regex for that:

$string = '<p>Some content</p><p>separated by</p><p>paragraphs</p>';

preg_match_all('/<p>([^<]*?)<\/p>/mi', $string, $matches);

var_dump($matches[1]);

which creates this output:

array(3) {
  [0]=>
  string(12) "Some content"
  [1]=>
  string(12) "separated by"
  [2]=>
  string(10) "paragraphs"
}

Keep in mind that this is not the most effective way, nor is it the fastest, but it's shorter than using DOMDocument or anything like that.
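To see where the pattern above stops working, here is a small sketch (the sample HTML is made up for illustration). Any paragraph with an attribute or a nested tag is silently skipped, because `<p>` must match literally and `[^<]*?` cannot cross another tag:

```php
<?php
// Same pattern as above, applied to slightly messier (hypothetical) input
$pattern = '/<p>([^<]*?)<\/p>/mi';
$html = '<p>plain</p><p class="x">with attribute</p><p>with <b>bold</b> text</p>';

preg_match_all($pattern, $html, $matches);

// Only the first paragraph is captured; the other two do not match at all
var_dump($matches[1]);
```

If your input can contain such markup, a DOM-based approach is the safer choice.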


0

If you need to do some HTML parsing in PHP, there is a nice library for that called PHP Html Parser (https://github.com/paquettg/php-html-parser), which gives you a jQuery-like API for parsing HTML.

An example:

// Assuming you installed from Composer:
require "vendor/autoload.php";
use PHPHtmlParser\Dom;

$dom = new Dom;
$dom->load('<p>Some content</p><p>separated by</p><p>paragraphs</p>');
$pTags = $dom->find('p');
foreach ($pTags as $tag) {
    // do something with the html
    $content = $tag->innerHtml;
}


0

Here is the DOMDocument solution (native PHP), which will also work when your p tags have attributes, contain other tags like <br>, have lots of whitespace between them (which is irrelevant in HTML rendering), or contain HTML entities like &nbsp; or &lt;:

$html_string = "<p>Some content</p><p>separated by</p><p>paragraphs</p>";
$doc = new DOMDocument();
$doc->loadHTML($html_string);

// Collect the text content of every <p> element
$paras = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    $paras[] = $p->textContent;
}

// Output array:
print_r($paras);
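
As a rough illustration of those claims, the same loop can be wrapped in a helper and fed messier markup. The function name `paragraphsToArray` and the sample input are hypothetical, and `libxml_use_internal_errors()` is only there to keep parser warnings for imperfect HTML out of the output; treat this as a sketch rather than a definitive implementation:

```php
<?php
// Hypothetical helper: returns the text of every <p> element in an HTML fragment.
function paragraphsToArray(string $html): array
{
    // Collect libxml parse warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard collected warnings and restore the previous error handling mode
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $paras = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        // textContent drops nested tags and decodes entities
        $paras[] = $p->textContent;
    }
    return $paras;
}

// Sample with an attribute, whitespace between paragraphs, a <br>, and entities
$sample = "<p class='intro'>Some content</p>\n\n<p>separated<br>by</p>\n<p>a &lt; b &amp; c</p>";
print_r(paragraphsToArray($sample));
```

Note that `<br>` contributes no text of its own, so "separated<br>by" comes back as "separatedby".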

If you really want to stick with regular expressions, then at least allow tag attributes and HTML entities, translating the latter to their corresponding characters:

$html_string = "<p>Some content &amp; text</p><p>separated&nbsp;by</p><p style='background:yellow'>paragraphs</p>";

preg_match_all('/<p(?:\s.*?)?>\s*(.*?)\s*<\/p\s*>/si', $html_string, $matches);

// array_walk() discards the callback's return value, so use array_map() to decode the entities
$paras = array_map('html_entity_decode', $matches[1]);

print_r($paras);
