
I'm new to Perl and I'm trying to extract the text between all <li> and </li> tags in a string and store the results in an array, using a regex or split/join.

e.g.

my $string = "<ul>
                  <li>hello</li>
                  <li>there</li>
                  <li>everyone</li>
              </ul>";

So that this code...

foreach my $value (@array) {
    print "$value\n";
}

...results in this output:

hello
there
everyone

  • It is not a good idea to use regex for HTML. See this answer. Commented Sep 23, 2013 at 23:47
  • Yes, regex is a horribly wrong tool for this. Commented Sep 23, 2013 at 23:50
  • Regex is not a horrible tool; if it fits your need, use it. It is probably faster than an HTML parser. With an HTML parser, you know the HTML is valid and you can walk through the tree. Commented Sep 24, 2013 at 3:55
  • Yes, I think you are being too harsh on the OP. S/he is not asking for a complex HTML parser, but something reasonable. You just need to split the string on \n and search for something like either <li>(.+?)</li> or <li>([^<]+). I would answer, but I have tried too hard to forget Perl. Commented Sep 24, 2013 at 4:32

2 Answers

7

Note: Do not use regular expressions to parse HTML.

This first option uses HTML::TreeBuilder, one of the many HTML parsers available. Its documentation includes further examples.

use strict;
use warnings;
use HTML::TreeBuilder;

my $str 
   = "<ul>"
   . "<li>hello</li>"
   . "<li>there</li>"
   . "<li>everyone</li>"
   . "</ul>"
   ;

# Now create a new tree to parse the HTML from String $str
my $tr = HTML::TreeBuilder->new_from_content($str);

# And now find all <li> tags and create an array with the values.
my @lists = 
      map { $_->content_list } 
      $tr->find_by_tag_name('li');

# And loop through the array returning our values.
foreach my $val (@lists) {
   print $val, "\n";
}
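
One caveat with content_list: it returns child nodes, so an <li> that contains nested tags yields HTML::Element objects rather than plain strings. Below is a minimal sketch that uses as_text instead. The sample markup in $html is made up for illustration and is not from the question.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input: list items that contain nested markup.
my $html = '<ul><li><b>hello</b></li><li>there</li><li><a href="#">everyone</a></li></ul>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# as_text flattens each element to its text content, ignoring nested tags.
my @texts = map { $_->as_text } $tree->find_by_tag_name('li');

# Release the tree explicitly; older HTML::Tree releases rely on this.
$tree->delete;

print "$_\n" for @texts;
```

For the flat list in the question both approaches print the same thing; as_text only matters once the items contain their own markup.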

If you decide to use a regular expression anyway (which I don't recommend), you could do something like this:

my $str
   = "<ul>"
   . "<li>hello</li>"
   . "<li>there</li>"
   . "<li>everyone</li>"
   . "</ul>"
   ;

my @matches;
while ($str =~/(?<=<li>)(.*?)(?=<\/li>)/g) {
  push @matches, $1;
}

foreach my $m (@matches) {
   print $m, "\n";
}

Output:

hello
there
everyone
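
The while loop above can also be collapsed into a single list-context match. This is only a sketch, and it assumes well-formed, non-nested <li> elements. The /s flag is added so that "." also matches newlines, which matters if an item spans several lines as in the question's original string.

```perl
use strict;
use warnings;

# Same data as in the question, spread over several lines.
my $str = "<ul>
    <li>hello</li>
    <li>there</li>
    <li>everyone</li>
</ul>";

# In list context, m//g returns the captured group from every match.
# \b keeps <li> from also matching tags such as <link>;
# /s lets "." match newlines, /i accepts <LI> as well.
my @items = $str =~ m{<li\b[^>]*>(.*?)</li>}gis;

print "$_\n" for @items;
```

This still breaks on nested lists, comments, or malformed markup, which is why a parser remains the safer choice.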

1

Note: Do not use regular expressions to parse HTML.

hwnd has already provided one HTML Parser solution.

However, for a more modern HTML parser based on CSS selectors, you can check out Mojo::DOM. There is a very informative 8-minute intro video at Mojocast episode 5.

use strict;
use warnings;

use Mojo::DOM;

my $html = do {local $/; <DATA>};

my $dom = Mojo::DOM->new($html);

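# Collect the text of every <li>. Newer Mojolicious releases need
# map('text') because collections no longer forward ->text themselves.
for my $li ($dom->find('li')->map('text')->each) {
    print "$li\n";
}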

__DATA__
<ul>
  <li>hello</li>
  <li>there</li>
  <li>everyone</li>
</ul>

Outputs:

hello
there
everyone
