perl extract text between html tags using regex

Question

I'm new to Perl and I'm trying to extract the text between all <li> </li> tags in a string and assign the results to an array using regex or split/join.

e.g.

my $string = "<ul>
                  <li>hello</li>
                  <li>there</li>
                  <li>everyone</li>
              </ul>";

So that this code...

foreach my $value (@array) {
    print "$value\n";
}

...results in this output:

hello
there
everyone

It is not a good idea to use regex for HTML. See this answer.
– Jim Garrison, Commented Sep 23, 2013 at 23:47
Regex is not a horrible tool; if it fits your need, then use it. It is probably faster than an HTML parser. With an HTML parser you know it's valid HTML and you can walk through the tree.
– lordkain, Commented Sep 24, 2013 at 3:55
Yes, I think you are being too harsh on the OP. S/he is not asking for a complex HTML parser, but something reasonable. Just need to split the string on \n and search for something like either <li>(.+?)</li> or <li>([^<]+). I would answer but I have tried too hard to forget Perl.
– beroe, Commented Sep 24, 2013 at 4:32

hwnd · Accepted Answer · 2013-09-24 01:19:55Z

Note: Do not use regular expressions to parse HTML.

This first option uses HTML::TreeBuilder, one of the many HTML parsers available. You can read its documentation for more detail and further examples.

use strict;
use warnings;
use HTML::TreeBuilder;

my $str 
   = "<ul>"
   . "<li>hello</li>"
   . "<li>there</li>"
   . "<li>everyone</li>"
   . "</ul>"
   ;

# Now create a new tree to parse the HTML from String $str
my $tr = HTML::TreeBuilder->new_from_content($str);

# And now find all <li> tags and create an array with the values.
my @lists = 
      map { $_->content_list } 
      $tr->find_by_tag_name('li');

# And loop through the array returning our values.
foreach my $val (@lists) {
   print $val, "\n";
}

If you decide to use a regular expression here anyway (which I don't recommend), you could do something like this:

my $str
   = "<ul>"
   . "<li>hello</li>"
   . "<li>there</li>"
   . "<li>everyone</li>"
   . "</ul>"
   ;

my @matches;
while ($str =~/(?<=<li>)(.*?)(?=<\/li>)/g) {
  push @matches, $1;
}

foreach my $m (@matches) {
   print $m, "\n";
}

Output:

hello
there
everyone
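The loop above can also be written as a single list-context match. The following is only a sketch under stated assumptions: the `class` attribute, the uppercase `<LI>` tag, and the padding spaces in the sample input are not from the question and were added to show what the extra flags do. It still breaks on nested lists, comments, and other real-world HTML, so a parser remains the safer choice.

```perl
use strict;
use warnings;

# Sample input based on the question's list. The attribute, the uppercase
# tag, and the padding spaces are assumptions added for illustration.
my $html = <<'HTML';
<ul>
    <li>hello</li>
    <li class="greeting"> there </li>
    <LI>everyone</LI>
</ul>
HTML

# In list context, m//g returns every capture at once, so no while loop
# is needed.
#   <li\b[^>]*>   opening tag, with optional attributes
#   (.*?)         shortest possible run of text (the item body)
#   </li\s*>      closing tag
# /i ignores case; /s lets "." match newlines inside an item.
my @items = $html =~ m{<li\b[^>]*>(.*?)</li\s*>}gis;

# Trim leading and trailing whitespace from each captured item.
s/^\s+|\s+$//g for @items;

print "$_\n" for @items;
```

The `\b` after `li` keeps the pattern from matching other tags that merely start with the same letters, such as `<link>`.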

Miller · Answer · 2014-06-15 17:12 (edited by Community 2017-05-23 10:33:45Z)

Note: Do not use regular expressions to parse HTML.

hwnd has already provided one HTML Parser solution.

However, for a more modern HTML parser based on CSS selectors, you can check out Mojo::DOM. There is a very informative 8-minute intro video at Mojocast episode 5.

use strict;
use warnings;

use Mojo::DOM;

my $html = do {local $/; <DATA>};

my $dom = Mojo::DOM->new($html);

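# map('text') calls ->text on each matched element. Recent Mojolicious
# releases no longer support the older find('li')->text shorthand.
for my $li ($dom->find('li')->map('text')->each) {
    print "$li\n";
}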

__DATA__
<ul>
  <li>hello</li>
  <li>there</li>
  <li>everyone</li>
</ul>

Outputs:

hello
there
everyone

answered Jun 15, 2014 at 17:12 by Miller; edited May 23, 2017 at 10:33 by Community Bot
