
I have an HTML file with lots of data; the part I am interested in is:

<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>

I am trying to use awk; my command currently is:

awk -F "</*b>|</td>" '/<[b]>.*[0-9]/ {print $1, $2, $3 }' "index.html"

but what I want is to have:

54
1
0
0

Right now I am getting:

'<td align=right> 54'
'<td align=right> 1'
'<td align=right> 0'

Any suggestions?

  • Is the 2nd-last zero in the output there because there's no <b> tag at all, or because there's a <td> value of 0 (0/0)? Commented Aug 18, 2014 at 15:06

11 Answers


awk is not an HTML parser. Use XPath or even XSLT for that. xmllint is a command-line tool that can execute XPath queries, and xsltproc can perform XSL transformations. Both tools belong to the package libxml2-utils.

Alternatively, you can use a programming language that is able to parse HTML.
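As a sketch of that approach, here is a minimal example using Python's built-in html.parser; the TdExtractor class name and the embedded sample rows are mine, not from the question:

```python
from html.parser import HTMLParser

class TdExtractor(HTMLParser):
    """Collect the text content of every <td align=right> cell."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # html.parser copes with unquoted attributes like align=right
        if tag == "td" and ("align", "right") in attrs:
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data

html = """
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
"""

parser = TdExtractor()
parser.feed(html)
for cell in parser.cells:
    print(cell.split()[0])  # keep only the leading number, so "0 (0/0)" becomes "0"
# prints 54, 1, 0, 0 (one per line)
```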


16 Comments

No one said it was. You can definitely (and easily) extract single pieces of data from it with awk, though.
@dirkk It's really not. It might be incredibly difficult (not impossible) to parse entire segments effectively, but for retrieving small pieces of data, as the question asks, it's actually extremely easy with regex. Everyone just jumps on the don't-parse-XML/XHTML/HTML bandwagon without even understanding the argument in the first place, as you can see by all the upvotes on this "answer". Look at the accepted answer, which clearly parses the data in the question.
@Jidder It is impossible to correctly parse XML using regex, not just difficult. The comment section is too short for a proof, but the Chomsky hierarchy is a good keyword for further research. This is scientifically proven. Just because it works in this case does not mean it is correct. The problem is that it looks correct, and this is why so many people try to use regex for XML parsing - and because that is incorrect and opens you up to a world of pain, so many people advise against it. Rightfully so.
@dirkk There will be no proof of that. Of course you can write an HTML parser in awk, since it is Turing-complete. Also, you need to understand that extracting information from a text file and fully understanding and representing a document are two different things. But hey, I would still use a ready-to-use parser instead of writing a custom one again and again with awk.
@hek2mgl Ah, I see. If awk is Turing-complete you are in fact correct (I don't know much about awk; I thought it was restricted to regular languages). So, to sum up: don't use regular expressions to parse XML. You can use awk to parse XML, but you shouldn't (for the reasons mentioned in the answer and here in the comments).
awk  -F '[<>]' '/<td / { gsub(/<b>/, ""); sub(/ .*/, "", $3); print $3 } ' file

Output:

54
1
0
0

Another:

awk  -F '[<>]' '
/<td><b>Total<\/b><\/td>/ {
    while (getline > 0 && /<td /) {
        gsub(/<b>/, ""); sub(/ .*/, "", $3)
        print $3
    }
    exit
}' file
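A flag-based variant of the second script avoids getline entirely, as suggested in the comments; a sketch, with the question's rows written to a temporary file so it runs standalone:

```shell
# Sample input: the rows from the question
cat > /tmp/index.html <<'EOF'
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
EOF

# Set the flag on the Total row; print following <td > cells;
# stop at the first line that is not a <td > cell
awk -F '[<>]' '
f { if (!/<td /) exit; gsub(/<b>/, ""); sub(/ .*/, "", $3); print $3 }
/<td><b>Total<\/b><\/td>/ { f = 1 }
' /tmp/index.html
# prints:
# 54
# 1
# 0
# 0
```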

6 Comments

@Lenny Make sure you read and fully understand all the caveats discussed at awk.info/?tip/getline before using getline. In this case there's just no need for the getline loop at all; a simple flag would do: f{ sub(..); print; if (!/<td /) exit} /..Total/{f=1}
@EdMorton You have to move if (!/<td /) exit earlier. Flagging is a good approach too, actually, but sometimes it's easier to come up with something that doesn't use it; flagging comes in when you're already trying to make your code more slick or efficient. Again, about getline: getline > 0 is completely safe, and safe enough if you read the manual properly. It's pretty clear how the different syntaxes differ in function. The only thing to really take note of is: The getline command returns 1 on success, 0 on end of file, and -1 on an error.
Yes, the test on !/<td / would come first. Consider both approaches, and now add a requirement that you need to print every line from line 1 up to that /<td / line to a file named "foo" for debugging. Notice that with the getline approach you need to place your print > "foo" in 2 places, whereas with the normal approach of just letting the awk loop do what it does you only need to put print > "foo" in one place. Avoiding getline when it's not needed isn't only about writing safe code, it's also about writing code that can be maintained and extended easily.
@EdMorton I don't agree about it being easy to extend. See this code I wrote a long time ago, where flags (over getline) can barely apply: sourceforge.net/p/playshell/code/ci/master/tree/loader/…. The last update I made was just to make sure getline returns 1 and not just nonzero.
@konsolebox I just gave a simple, common example of non-getline code being easier to extend. In any case, my comment was directed to the OP, now he is aware of the pros/cons and different opinions on the appropriate getline usage. I looked at your compiler code and it could have been written more robustly and concisely without getline. Just kidding - of course I'm not going to read hundreds of lines of awk code and try to figure out what it does and what it'd look like without getline or do any other kind of analysis on it.

HTML-XML-utils

You may use HTML-XML-utils for parsing well-formed HTML/XML files. The package includes many command-line tools for extracting or modifying the data. For example:

$ curl -s http://example.com/ | hxselect title
<title>Example Domain</title>

Here is an example with the provided data:

$ hxselect -c -s "\n" "td[align=right]" <file.html
<b>54</b>
<b>1</b>
0 (0/0)
<b>0</b>

And with the <b> tags stripped out:

$ hxselect -c -s "\n" "td[align=right]" <file.html | sed "s/<[^>]\+>//g"
54
1
0 (0/0)
0

For more examples, check the HTML-XML-utils documentation.

1 Comment

If the HTML source is not well-formatted, try hxclean from the same package.
$ awk -F'<td[^>]*>(<b>)?|(</?b>)?</td>' '$2~/[0-9]/{print $2+0}' file
54
1
0
0
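A commented version of the same one-liner, with the question's rows written to a temporary file so it runs standalone; the field separator does all the work:

```shell
# Sample input: the rows from the question
cat > /tmp/cells.html <<'EOF'
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
EOF

# The FS deletes the opening <td ...> (plus an optional <b>) and the
# closing (optional </b>)</td>, leaving the cell contents in $2.
# $2~/[0-9]/ skips cells like "Total" that contain no digits, and
# $2+0 coerces the field to a number, so "0 (0/0)" becomes 0.
awk -F'<td[^>]*>(<b>)?|(</?b>)?</td>' '$2~/[0-9]/{print $2+0}' /tmp/cells.html
# prints:
# 54
# 1
# 0
# 0
```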

2 Comments

Good answers accompany code samples with an explanation for future readers. While the person asking this question may understand your answer, explaining how you arrived at it will help countless others.
That's fine, but it takes about 15 seconds on average to crank out an answer and a few minutes to document it, so I have time for the former but not the latter for every question, especially the ones that IMHO are self-evident. If anyone has questions I'm happy to answer them.

You really should use a real HTML parser for this job, like:

perl -Mojo -0777 -nlE 'say [split(/\s/, $_->all_text)]->[0] for x($_)->find("td[align=right]")->each'

prints:

54
1
0
0

But for this you need to have perl with the Mojolicious package installed.

(it is easy to install with:)

curl -L get.mojolicio.us | sh



BSD/GNU grep/ripgrep

For simple extraction, you can use grep. For example:

  • Your example using grep:

    $ egrep -o "[0-9][^<]*" file.html
    54
    1
    0 (0/0)
    0
    

    and using ripgrep:

    $ rg -o ">([^>]+)<" -r '$1' <file.html | tail -n +2
    54
    1
    0 (0/0)
    0
    
  • Extracting outer html of H1:

    $ curl -s http://example.com/ | egrep -o '<h1>.*</h1>'
    <h1>Example Domain</h1>
    

Other examples:

  • Extracting the body:

    $ curl -s http://example.com/ | xargs | egrep -o '<body>.*</body>'
    <body> <div> <h1>Example Domain</h1> ...
    

    Instead of xargs you can also use tr '\n' ' '.

  • For multiple tags, see: Text between two tags.

If you're dealing with large datasets, consider using ripgrep, which has similar syntax but is way faster since it's written in Rust.



I was recently pointed to pup, which, in the limited testing I've done, is much more forgiving with invalid HTML and tag soup.

cat <<'EOF' | pup -c 'td + td text{}'
<table>
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
</table>
EOF

Prints:

54
1
0 (0/0)
0

1 Comment

pup can handle partial HTML, attributes without quotes, etc.

With xidel, a true HTML parser, and XPath:

$ xidel -s "input.html" -e '//td[@align="right"]'
54
1
0 (0/0)
0

$ xidel -s "input.html" -e '//td[@align="right"]/tokenize(.)[1]'
# or
$ xidel -s "input.html" -e '//td[@align="right"]/extract(.,"\d+")'
54
1
0
0



ex/vim

For more advanced parsing, you may use in-place editors such as ex/vi, where you can jump between matching HTML tags, select or delete inner/outer tags, and edit the content in place.

Here is the command:

$ ex +"%s/^[^>].*>\([^<]\+\)<.*/\1/g" +"g/[a-zA-Z]/d" +%p -scq! file.html
54
1
0 (0/0)
0

This is how the command works:

  • Use the ex in-place editor to substitute on all lines (%): ex +"%s/pattern/replacement/g".

    The substitution pattern consists of 3 parts:

    • Select from the beginning of the line up to the last > before our match (^[^>].*>) for removal, right before the 2nd part.
    • Capture the main part, everything up to the next < (\([^<]\+\)).
    • Select everything from that < onward for removal (<.*).
    • The whole matching line is replaced with \1, which refers to the group captured by \( \).
  • After the substitution, remove the remaining alphabetic lines with a global command: g/[a-zA-Z]/d.

  • Finally, print the current buffer to the screen with +%p.
  • Then silently (-s) quit without saving (-cq!), or save into the file (-cwq).

To edit the file in place, change -scq! to -scwq.


Here is another simple example, which removes the style tag from the header and prints the parsed output:

$ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin

However, parsing your HTML with regexes is not advised, so as a long-term approach you should use an appropriate language (such as Python, Perl, or PHP's DOM).





What about:

lynx -dump index.html



In the past I used PhantomJS, but now you can do this with similar tools that are still maintained, like Selenium WebDriver.

It makes it possible to use DOM API functions with JavaScript in a headless browser like Firefox or Chromium, and you can call the script (written in Node.js or Python, for example) from your shell script if you want to do additional processing in the shell.

