I am trying to parse an HTML document with awk.

The document contains several <div class="p_header_bottom"></div> blocks:

 <div class="p_header_bottom">
    <span class="fl_r"></span>
    287,489 people
  </div>
  <div class="p_header_bottom">
    <span class="fl_r"></span>
    5 links
  </div>

I am using

awk '/<div class="p_header_bottom">/,/<\/div>/'

to extract all such divs.

How can I get the number 287,489 from the first one?

I tried awk '/<\/span>/,/people/', but it doesn't work correctly.

  • Why awk for parsing HTML? Use a better tool like PHP and its DOM parser. Commented Nov 7, 2013 at 14:41
  • @anubhava because I need just a few items of information from one page, and curl | awk background tasks spawned by a bash script do 10000 pages in about one minute. PHP would be too expensive in both memory and CPU. Commented Nov 7, 2013 at 14:46
  • I'm not too sure about PHP being expensive, since it can do both the curl part and the later parsing part in the same code, so essentially you'll be invoking only one binary from the command line. More importantly, parsing with a DOM parser will also be accurate. Only go for sed/awk parsing if you're 100% sure of the location and organization of this HTML. Commented Nov 7, 2013 at 14:54

1 Answer

With an awk that accepts a regular expression as the record separator RS (GNU awk, for example), and assuming that the only digits and commas within each <div> </div> block occur in the numeric portion of interest:

awk -v RS='<[/]?div[^>]*>' '/span/ && /people/{gsub(/[^[:digit:],]/, ""); print}' file.txt
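If a regular-expression RS is not available, a line-based state machine is one possible alternative. The sketch below is not the one-liner above but a different technique, and it assumes that the opening <div> tag, the number, and the closing </div> each sit on their own line, as in the sample from the question. The function name extract_people and the in_div flag are made up for illustration.

```shell
# Hypothetical helper: reads HTML on stdin and prints the number from
# every "N people" line found inside a p_header_bottom div.
extract_people() {
  awk '
    # Opening tag: remember that we are inside a block of interest.
    /<div class="p_header_bottom">/ { in_div = 1; next }
    # Closing tag: leave the block.
    /<\/div>/                       { in_div = 0 }
    # Inside a block, on the line mentioning "people":
    # drop everything except digits and commas, then print.
    in_div && /people/ {
      gsub(/[^0-9,]/, "")
      print
    }
  '
}

# Sample input copied from the question.
html='<div class="p_header_bottom">
    <span class="fl_r"></span>
    287,489 people
  </div>
  <div class="p_header_bottom">
    <span class="fl_r"></span>
    5 links
  </div>'

result=$(printf '%s\n' "$html" | extract_people)
echo "$result"
```

In a pipeline this would be used as curl ... | extract_people. Because it relies on the line layout, it is more fragile than a DOM parser, as the comments on the question point out.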

1 Comment

@glennjackman, good catch, fixed. Not sure why ** works though!
