Extract values from html table with a RegEx in bash/Perl

Question

I want to monitor my OKI printer with Munin, so I am trying to adapt this plugin to my printer.

The page-count table served by my printer's HTTP server is:

<table width="560" border="0" cellspacing="2" cellpadding="3">
    <tr class="sub_item_color">
        <td  class="normal" width="200" align="right" valign="bottom" rowspan="2">Media Size</td>
        <td  class="normal" width="90" align="left">Color</td>
        <td  class="normal" width="90" align="left">Color</td>
        <td  class="normal" width="90" align="left">Mono</td>
        <td  class="normal" width="90" align="left">Mono</td>
    </tr>
    <tr class="sub_item_color">
        <td  class="normal" width="90" align="left">A3/Tabloid</td>
        <td  class="normal" width="90" align="left">A4/Letter</td><td  class="normal" width="90" align="left">A3/Tabloid</td>
        <td  class="normal" width="90" align="left">A4/Letter</td>
    </tr>
    <tr class="sub_item_color">
        <td  class="normal" width="200" align="left">Total Impressions</td>
        <td  class="normal" width="90" align="right">21906</td>
        <td  class="normal" width="90" align="right">33491</td>
        <td  class="normal" width="90" align="right">2084</td>
        <td  class="normal" width="90" align="right">4460</td>
    </tr>
    <tr class="sub_item_color">
        <td  class="normal" width="200" align="left">Total A4/Letter Impressions</td>
        <td  class="normal" colspan="2" align="center"><b>Color:77303</B></td>
        <td  class="normal" colspan="2" align="center"><b>Mono:8628</B></td>
    </tr>
</table>

The Munin script currently does this:

infopage=`wget -q -O - http://root:$password@$destination/printer/printerinfo_top.htm | perl -p -e 's/\n/ /m'`
echo tray1.value    `echo $infopage | perl -p -e 's/^.+Tray\ 1\ Page\ Count\:\ \<\/TD\>\<TD\ WIDTH\=\"94\"\>([0-9]+)\<.+$/$1/'`

How could I get the total impressions?

Friends don't let friends parse HTML with regular expressions.
– Ether, commented Nov 12, 2010 at 17:09

daxim · Accepted Answer · 2010-11-12 11:26:34Z

9

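This solution is implemented as a Unix filter, like the one in the question, but it is much more readable and declarative thanks to XPath. It prints the sum of the two bold totals in the last table row (Color and Mono).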

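#!/usr/bin/env perl
use 5.010;
use strictures;
use HTML::TreeBuilder::XPath qw();
use List::Util qw(sum);

# Slurp all of STDIN (or the files named on the command line) into one string,
# so the parser receives the whole page as a single argument.
my $html = do { local $/; <> };
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);
# The two bold cells read "Color:77303" and "Mono:8628"; strip everything
# up to the colon and add up the remaining numbers.
say sum map { s[.*:][]; $_ } $tree->findnodes_as_strings('//table/tr/td[@colspan=2]/b');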

wget -q -O - http://… | perl sum-total-impressions.pl
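
If the per-column counters from the "Total Impressions" row are wanted instead of the summed totals, the same XPath approach can select that row by its label. The following is only a sketch under a few assumptions: row_values is a hypothetical helper, the Munin field names (color_a3 and so on) are made up for illustration, and the sample input is a cut-down copy of the table from the question rather than the live printer page.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: return the numbers from the cells that follow the
# first cell of the row whose first cell is exactly $label.
# $label must not contain double quotes, because it is pasted into the XPath.
sub row_values {
    my ($html, $label) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content($html);
    # Pick the row by the text of its first cell, then every later cell.
    my @cells = $tree->findnodes_as_strings(
        qq{//tr[normalize-space(td[1])="$label"]/td[position()>1]}
    );
    $tree->delete;    # release the parse tree
    # Keep the first run of digits in each cell; skip cells without digits.
    return map { /(\d+)/ ? $1 : () } @cells;
}

# Cut-down copy of the table from the question, used as sample input.
my $sample = <<'HTML';
<table>
    <tr><td rowspan="2">Media Size</td><td>Color</td><td>Color</td><td>Mono</td><td>Mono</td></tr>
    <tr><td>A3/Tabloid</td><td>A4/Letter</td><td>A3/Tabloid</td><td>A4/Letter</td></tr>
    <tr>
        <td>Total Impressions</td>
        <td align="right">21906</td>
        <td align="right">33491</td>
        <td align="right">2084</td>
        <td align="right">4460</td>
    </tr>
    <tr>
        <td>Total A4/Letter Impressions</td>
        <td colspan="2"><b>Color:77303</B></td>
        <td colspan="2"><b>Mono:8628</B></td>
    </tr>
</table>
HTML

# Assumed field names, in the column order given by the two header rows.
my @fields = qw(color_a3 color_a4 mono_a3 mono_a4);
my @values = row_values($sample, 'Total Impressions');
print "$fields[$_].value $values[$_]\n" for 0 .. $#values;
```

In a real plugin the string passed to row_values would be the wget output rather than the embedded sample. Matching on the label text instead of attributes such as width keeps the selector independent of the page's layout details.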

2 Comments

kriss Over a year ago

+1: This is a good example of why you shouldn't parse HTML with regexes. The main problem in this question was the initial choice of regex over XPath for parsing the HTML.

Magnetic_dud Over a year ago

I still cannot understand the code, but it looks easier to select the right one; I will study this: w3schools.com/xpath/default.asp
