Parse HTML snippet with awk

Question

I am trying to parse an HTML document with awk.

The document contains several <div class="p_header_bottom"></div> blocks:

 <div class="p_header_bottom">
    <span class="fl_r"></span>
    287,489 people
  </div>
  <div class="p_header_bottom">
    <span class="fl_r"></span>
    5 links
  </div>

I am using

awk '/<div class="p_header_bottom">/,/<\/div>/'

to extract all such divs.

How can I get the number 287,489 from the first one?

I tried awk '/<\/span>/,/people/', but it doesn't work correctly.

Why awk for parsing HTML? Use a better tool like PHP and its DOM parser.
– anubhava, Commented Nov 7, 2013 at 14:41
@anubhava Because I need just a few items of information from one page, and curl | awk background tasks spawned by a bash script process 10,000 pages in about one minute. PHP would be too expensive in both memory and CPU.
– zavg, Commented Nov 7, 2013 at 14:46
I'm not sure PHP would be expensive, since it can do both the curl part and the parsing in the same code, so you'd be invoking only one binary from the command line. More importantly, parsing with a DOM parser will be accurate. Only go for sed/awk parsing if you're 100% sure of the location and organization of this HTML.
– anubhava, Commented Nov 7, 2013 at 14:54

iruvar · Accepted Answer · 2013-11-07 16:07:49Z

5

With gawk (a regular-expression RS is not guaranteed by POSIX awk), and assuming that the only digits and commas within each <div> </div> block occur in the numeric portion of interest:

awk -v RS='<[/]?div[^>]*>' '/span/ && /people/{gsub(/[^[:digit:],]/, ""); print}' file.txt
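
The one-liner above works because the regular-expression RS splits the input at every opening or closing div tag, so each record is the body of one block. Where a regex RS is not available, a line-based variant with a state flag is one possible alternative. The following is only a sketch, not part of the original answer: it assumes each div tag and each value sits on its own line, exactly as in the question's sample, and the names sample, inside and result are made up for illustration.

```shell
# Sample input copied from the question, written to a temp file for the demo.
sample=$(mktemp)
cat > "$sample" <<'EOF'
 <div class="p_header_bottom">
    <span class="fl_r"></span>
    287,489 people
  </div>
  <div class="p_header_bottom">
    <span class="fl_r"></span>
    5 links
  </div>
EOF

# "inside" is a hypothetical flag: set on the opening div line, cleared on
# the closing one. Inside a block, take the line mentioning "people" and
# strip every character that is not a digit or a comma.
result=$(awk '
  /<div class="p_header_bottom">/ { inside = 1; next }
  /<\/div>/                       { inside = 0; next }
  inside && /people/              { gsub(/[^0-9,]/, ""); print }
' "$sample")

echo "$result"
rm -f "$sample"
```

On the sample from the question this prints 287,489. The second block is skipped because its text line does not contain "people". Like the gawk version, this depends on the page layout staying fixed.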

edited Nov 7, 2013 at 16:07

answered Nov 7, 2013 at 16:00

1 Comment

iruvar Over a year ago

@glennjackman, good catch, fixed. Not sure why ** works though!
