
I have an HTML page that I'm trying to parse.

Source ::

<tr class="active0"><td class=ac><a name="redis/172.29.219.17"></a><a class=lfsb href="#redis/172.29.219.17">172.29.219.17</a></td><td>0</td><td>0</td><td>-</td><td>0</td><td>0</td><td></td><td>0</td><td>0</td><td>-</td><td><u>0<div class=tips><table class=det><tr><th>Cum. sessions:</th><td>0</td></tr><tr><th colspan=3>Avg over last 1024 success. conn.</th></tr><tr><th>- Queue time:</th><td>0</td><td>ms</td></tr><tr><th>- Connect time:</th><td>0</td><td>ms</td></tr><tr><th>- Total time:</th><td>0</td><td>ms</td></tr></table></div></u></td><td>0</td><td>?</td><td>0</td><td>0</td><td></td><td>0</td><td></td><td>0</td><td><u>0<div class=tips>Connection resets during transfers: 0 client, 0 server</div></u></td><td>0</td><td>0</td><td class=ac>17h12m DOWN</td><td class=ac><u> L7TOUT in 1001ms<div class=tips>Layer7 timeout:  at step 6 of tcp-check (expect string &#39;role:master&#39;)</div></u></td><td class=ac>1</td><td class=ac>Y</td><td class=ac>-</td><td><u>1<div class=tips>Failed Health Checks</div></u></td><td>1</td><td>17h12m</td><td class=ac>-</td></tr>
<tr class="backend"><td class=ac><a name="redis/Backend"></a><a class=lfsb href="#redis/Backend">Backend</a></td><td>0</td><td>0</td><td></td><td>1</td><td>24</td><td></td><td>29</td><td>41</td><td>200</td><td><u>5<span class="rls">4</span>033<div class=tips><table class=det><tr><th>Cum. sessions:</th><td>5<span class="rls">4</span>033</td></tr><tr><th>- Queue time:</th><td>0</td><td>ms</td></tr><tr><th>- Connect time:</th><td>0</td><td>ms</td></tr><tr><th>- Total time:</th><td><span class="rls">6</span>094</td><td>ms</td></tr></table></div></u></td><td>5<span class="rls">4</span>033</td><td>1s</td><td><span class="rls">4</span>89<span class="rls">1</span>000</td><td>1<span class="rls">8</span>11<span class="rls">6</span>385<div class=tips>compression: in=0 out=0 bypassed=0 savings=0%</div></td><td>0</td><td>0</td><td></td><td>0</td><td><u>0<div class=tips>Connection resets during transfers: 54004 client, 0 server</div></u></td><td>0</td><td>0</td><td class=ac>17h12m UP</td><td class=ac>&nbsp;</td><td class=ac>1</td><td class=ac>1</td><td class=ac>0</td><td class=ac>&nbsp;</td><td>0</td><td>0s</td><td></td></tr></table><p>

What I want is ::

172.29.219.17 L7TOUT in 1001ms

So what I'm trying right now is ::

grep redis index.html  | grep 'a name=\"redis\/[0-9]*.*\"' 

to extract the IP address.

But the regex doesn't pick out only the first row: it returns both rows, even though the IP is only in row 1.

I've double-checked the regex I'm using, but it doesn't seem to work.

Any ideas?

Comments:

  • Can you use a proper HTML parser instead of grep? Commented Mar 22, 2017 at 13:55
  • You are asking for trouble. Take a look at the Python module BeautifulSoup. Commented Mar 22, 2017 at 14:01
  • If you must use regex, change [0-9]* to [0-9]+, which ensures there is at least one digit (i.e. an IP address) right after redis/; a sketch of that fix follows below. Commented Mar 22, 2017 at 14:02
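
A minimal sketch of that fix (hedged: it assumes GNU or BSD grep with -o/-E support, and that each <a name="redis/..."> stays on a single line, exactly as in the source above):

# Require a full dotted-quad IP after "redis/" so the Backend row no longer matches,
# and use -o to print only the matching text rather than the whole row.
grep -oE 'name="redis/([0-9]{1,3}\.){3}[0-9]{1,3}"' index.html \
    | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}'
# 172.29.219.17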

3 Answers

4

Using XPath expressions with xmllint and its built-in HTML parser, you can pull the IP address out of the first row:

ipAddr=$(xmllint --html --xpath "string(//tr[1]/td[1])" html)
echo "$ipAddr"
172.29.219.17

For the timeout value, I counted by hand which td cell in the row contains it, which turned out to be the 24th:

xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html 

This produces

         L7TOUT in 1001ms
         Layer7 timeout:  at step 6 of tcp-check (expect string 'role:master')

Removing the leading whitespace and keeping only the line we need with Awk:

xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html | awk 'NF && /L7TOUT/{gsub(/^[[:space:]]*/,"",$0); print}'
L7TOUT in 1001ms

Put into a variable:

timeOut=$(xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html | awk 'NF && /L7TOUT/{gsub(/^[[:space:]]*/,"",$0); print}')

Now you can print both values together:

echo "${ipAddr} ${timeOut}"
172.29.219.17 L7TOUT in 1001ms
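
As an aside, if you would rather not count the td cells by hand, you could select the cell by its text instead of its position (a hedged alternative; it assumes only one cell in the row contains the string L7TOUT, and it prints the same two lines as the td[24] version, so the same Awk cleanup applies):

xmllint --html --xpath "string(//tr[1]/td[contains(., 'L7TOUT')]/u[1])" html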

Version details:

xmllint --version
xmllint: using libxml version 20902

Also, there is a stray closing </table> tag at the end of your HTML input, just before the <p>, which xmllint reports as

htmlfile:147: HTML parser error : Unexpected end tag : table

Remove it before further testing.
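
A small sketch of doing that with sed (hedged: it assumes the stray tag appears literally as </table><p>, as in the snippet above, and it writes a cleaned copy rather than editing the original in place):

# Drop the unmatched </table> that precedes <p>, keeping everything else intact.
sed 's|</table><p>|<p>|' index.html > cleaned.html
xmllint --html --xpath "string(//tr[1]/td[1])" cleaned.html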



4

Here is a list of command-line tools that will help you parse different formats from bash; bash is extremely powerful and useful.

  • JSON: use jq
  • XML/HTML: use xq (a short sketch follows this list)
  • YAML: use yq
  • CSS: use bashcss
    • I have tested all the other tools; leave a comment if you try this one
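
As a hedged illustration of the xq route (assuming xq here is the XML wrapper shipped with the Python yq package, which converts XML to JSON and applies a jq filter; real HAProxy HTML like the snippet above would first need tidying into well-formed XML):

# A tiny well-formed stand-in for the stats page; attribute keys are
# exposed with an "@" prefix in the resulting JSON.
cat > servers.xml <<'EOF'
<servers>
  <server ip="172.29.219.17" status="L7TOUT in 1001ms"/>
</servers>
EOF

xq -r '.servers.server["@ip"] + " " + .servers.server["@status"]' servers.xml
# 172.29.219.17 L7TOUT in 1001ms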

If the code starts getting truly complex, you might consider the naive answer below, as programming languages with class support will assist.

Naive - Old Answer

Parsing complex formats like JSON, XML, HTML, CSS, YAML, etc. is extremely difficult in bash and likely error-prone. Because of this I recommend one of the following:

  • PHP
  • Ruby
  • Python
  • Go

because these languages are cross-platform and have parsers for all the formats listed above.


0

If you want to parse HTML with regexes, then you have to make assumptions about the HTML formatting. E.g. you assume here that the a tag and its name attribute are on the same line. However, this is perfectly valid HTML too:

<a
    name="redis/172.29.219.17">
    Some text
</a>

Anyway, let's solve the problem assuming that the a tags are on one line and name is the first attribute. This is what I could come up with:

  sed 's/\(<a name="redis\)/\n\1/g' index.html | grep '^<a name="redis\/[0-9.]\+"' | sed -e 's/^<a name="redis\///g' -e 's/".*//g'

Explanation:

  • The first sed command makes sure that every <a name="redis occurrence starts on a separate line.
  • Then the grep keeps only those lines that start with <a name="redis/ followed by an IP address.
  • The last sed contains two expressions:
    • The first expression removes the leading <a name="redis/ text
    • The second expression removes everything from the closing " onwards
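
For reference, running the pipeline against the question's file looks roughly like this (hedged: it assumes GNU sed and grep, since \n in the replacement and \+ in the pattern are GNU extensions, and that index.html contains the markup from the question):

sed 's/\(<a name="redis\)/\n\1/g' index.html \
    | grep '^<a name="redis\/[0-9.]\+"' \
    | sed -e 's/^<a name="redis\///g' -e 's/".*//g'
# 172.29.219.17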

