
I have an HTML page that I'm trying to parse.

Source ::

<tr class="active0"><td class=ac><a name="redis/172.29.219.17"></a><a class=lfsb href="#redis/172.29.219.17">172.29.219.17</a></td><td>0</td><td>0</td><td>-</td><td>0</td><td>0</td><td></td><td>0</td><td>0</td><td>-</td><td><u>0<div class=tips><table class=det><tr><th>Cum. sessions:</th><td>0</td></tr><tr><th colspan=3>Avg over last 1024 success. conn.</th></tr><tr><th>- Queue time:</th><td>0</td><td>ms</td></tr><tr><th>- Connect time:</th><td>0</td><td>ms</td></tr><tr><th>- Total time:</th><td>0</td><td>ms</td></tr></table></div></u></td><td>0</td><td>?</td><td>0</td><td>0</td><td></td><td>0</td><td></td><td>0</td><td><u>0<div class=tips>Connection resets during transfers: 0 client, 0 server</div></u></td><td>0</td><td>0</td><td class=ac>17h12m DOWN</td><td class=ac><u> L7TOUT in 1001ms<div class=tips>Layer7 timeout:  at step 6 of tcp-check (expect string &#39;role:master&#39;)</div></u></td><td class=ac>1</td><td class=ac>Y</td><td class=ac>-</td><td><u>1<div class=tips>Failed Health Checks</div></u></td><td>1</td><td>17h12m</td><td class=ac>-</td></tr>
<tr class="backend"><td class=ac><a name="redis/Backend"></a><a class=lfsb href="#redis/Backend">Backend</a></td><td>0</td><td>0</td><td></td><td>1</td><td>24</td><td></td><td>29</td><td>41</td><td>200</td><td><u>5<span class="rls">4</span>033<div class=tips><table class=det><tr><th>Cum. sessions:</th><td>5<span class="rls">4</span>033</td></tr><tr><th>- Queue time:</th><td>0</td><td>ms</td></tr><tr><th>- Connect time:</th><td>0</td><td>ms</td></tr><tr><th>- Total time:</th><td><span class="rls">6</span>094</td><td>ms</td></tr></table></div></u></td><td>5<span class="rls">4</span>033</td><td>1s</td><td><span class="rls">4</span>89<span class="rls">1</span>000</td><td>1<span class="rls">8</span>11<span class="rls">6</span>385<div class=tips>compression: in=0 out=0 bypassed=0 savings=0%</div></td><td>0</td><td>0</td><td></td><td>0</td><td><u>0<div class=tips>Connection resets during transfers: 54004 client, 0 server</div></u></td><td>0</td><td>0</td><td class=ac>17h12m UP</td><td class=ac>&nbsp;</td><td class=ac>1</td><td class=ac>1</td><td class=ac>0</td><td class=ac>&nbsp;</td><td>0</td><td>0s</td><td></td></tr></table><p>

What I want is ::

172.29.219.17 L7TOUT in 1001ms

So what I'm trying right now is ::

grep redis index.html  | grep 'a name=\"redis\/[0-9]*.*\"' 

to extract the IP address.

But the regex doesn't pick out only the first row: it returns both rows, even though the IP is only in row 1.

I've double-checked the regex I'm using, but it doesn't seem to work.

Any ideas?

Comments:

  • Can you use a proper HTML parser instead of grep? Commented Mar 22, 2017 at 13:55
  • You are asking for trouble. Take a look at the Python module BeautifulSoup. Commented Mar 22, 2017 at 14:01
  • If you must use regex, change [0-9]* to [0-9]+, which ensures there is at least one digit (i.e. an IP address) right after redis/; a sketch of that fix follows below. Commented Mar 22, 2017 at 14:02
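
A minimal sketch of that fix (hedged: it assumes GNU or BSD grep with -o/-E support, and that each <a name="redis/..."> stays on a single line, exactly as in the source above):

# Require a full dotted-quad IP after "redis/" so the Backend row no longer matches,
# and use -o to print only the matching text rather than the whole row.
grep -oE 'name="redis/([0-9]{1,3}\.){3}[0-9]{1,3}"' index.html \
    | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}'
# 172.29.219.17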

3 Answers

4

Using XPath expressions with xmllint and its built-in HTML parser, you can pull the IP address out of the first row:

ipAddr=$(xmllint --html --xpath "string(//tr[1]/td[1])" html)
echo "$ipAddr"
172.29.219.17

For the timeout value, I counted by hand which td cell in the row contains it, which turned out to be the 24th:

xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html 

This produces

         L7TOUT in 1001ms
         Layer7 timeout:  at step 6 of tcp-check (expect string 'role:master')

Removing the leading whitespace and keeping only the line we need with Awk:

xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html | awk 'NF && /L7TOUT/{gsub(/^[[:space:]]*/,"",$0); print}'
L7TOUT in 1001ms

Put into a variable:

timeOut=$(xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html | awk 'NF && /L7TOUT/{gsub(/^[[:space:]]*/,"",$0); print}')

Now you can print both values together:

echo "${ipAddr} ${timeOut}"
172.29.219.17 L7TOUT in 1001ms
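
As an aside, if you would rather not count the td cells by hand, you could select the cell by its text instead of its position (a hedged alternative; it assumes only one cell in the row contains the string L7TOUT, and it prints the same two lines as the td[24] version, so the same Awk cleanup applies):

xmllint --html --xpath "string(//tr[1]/td[contains(., 'L7TOUT')]/u[1])" html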

Version details:

xmllint --version
xmllint: using libxml version 20902

Also, there is a stray closing </table> tag at the end of your HTML input, just before the <p>, which xmllint reports as

htmlfile:147: HTML parser error : Unexpected end tag : table

Remove it before further testing.
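
A small sketch of doing that with sed (hedged: it assumes the stray tag appears literally as </table><p>, as in the snippet above, and it writes a cleaned copy rather than editing the original in place):

# Drop the unmatched </table> that precedes <p>, keeping everything else intact.
sed 's|</table><p>|<p>|' index.html > cleaned.html
xmllint --html --xpath "string(//tr[1]/td[1])" cleaned.html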



4

Here is a list of command-line tools that will help you parse different formats from bash; bash is extremely powerful and useful.

  • JSON: use jq
  • XML/HTML: use xq (a short sketch follows this list)
  • YAML: use yq
  • CSS: use bashcss
    • I have tested all the other tools; leave a comment if you try this one
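
As a hedged illustration of the xq route (assuming xq here is the XML wrapper shipped with the Python yq package, which converts XML to JSON and applies a jq filter; real HAProxy HTML like the snippet above would first need tidying into well-formed XML):

# A tiny well-formed stand-in for the stats page; attribute keys are
# exposed with an "@" prefix in the resulting JSON.
cat > servers.xml <<'EOF'
<servers>
  <server ip="172.29.219.17" status="L7TOUT in 1001ms"/>
</servers>
EOF

xq -r '.servers.server["@ip"] + " " + .servers.server["@status"]' servers.xml
# 172.29.219.17 L7TOUT in 1001ms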

If the code starts getting truly complex, you might consider the naive answer below, as programming languages with class support will assist.

Naive - Old Answer

Parsing complex formats like JSON, XML, HTML, CSS, YAML, etc. is extremely difficult in bash and likely error-prone. Because of this I recommend one of the following:

  • PHP
  • Ruby
  • Python
  • Go

because these languages are cross-platform and have parsers for all the formats listed above.


0

If you want to parse HTML with regexes, then you have to make assumptions about the HTML formatting. E.g. you assume here that the a tag and its name attribute are on the same line. However, this is perfectly valid HTML too:

<a
    name="redis/172.29.219.17">
    Some text
</a>

Anyway, let's solve the problem assuming that the a tags are on one line and name is the first attribute. This is what I could come up with:

  sed 's/\(<a name="redis\)/\n\1/g' index.html | grep '^<a name="redis\/[0-9.]\+"' | sed -e 's/^<a name="redis\///g' -e 's/".*//g'

Explanation:

  • The first sed command makes sure that every <a name="redis occurrence starts on a separate line.
  • Then the grep keeps only those lines that start with <a name="redis/ followed by an IP address.
  • The last sed contains two expressions:
    • The first expression removes the leading <a name="redis/ text
    • The second expression removes everything from the closing " onwards
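
For reference, running the pipeline against the question's file looks roughly like this (hedged: it assumes GNU sed and grep, since \n in the replacement and \+ in the pattern are GNU extensions, and that index.html contains the markup from the question):

sed 's/\(<a name="redis\)/\n\1/g' index.html \
    | grep '^<a name="redis\/[0-9.]\+"' \
    | sed -e 's/^<a name="redis\///g' -e 's/".*//g'
# 172.29.219.17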

