
I have an HTML file with lots of data; the part I am interested in is:

<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>

I am trying to use awk; my command currently is:

awk -F "</*b>|</td>" '/<[b]>.*[0-9]/ {print $1, $2, $3 }' "index.html"

but what I want is to have:

54
1
0
0

Right now I am getting:

'<td align=right> 54'
'<td align=right> 1'
'<td align=right> 0'

Any suggestions?

  • Is the 2nd-last zero in the output there because there's no <b> tag at all, or because there's a <td> value of 0 (0/0)? Commented Aug 18, 2014 at 15:06

11 Answers


awk is not an HTML parser. Use XPath or even XSLT for that. xmllint is a command-line tool that can execute XPath queries, and xsltproc can perform XSL transformations. Both tools belong to the package libxml2-utils.

Alternatively, you can use a programming language that is able to parse HTML.
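As a sketch of that approach, here is a minimal example using Python's built-in html.parser; the TdExtractor class name and the embedded sample rows are mine, not from the question:

```python
from html.parser import HTMLParser

class TdExtractor(HTMLParser):
    """Collect the text content of every <td align=right> cell."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # html.parser copes with unquoted attributes like align=right
        if tag == "td" and ("align", "right") in attrs:
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data

html = """
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
"""

parser = TdExtractor()
parser.feed(html)
for cell in parser.cells:
    print(cell.split()[0])  # keep only the leading number, so "0 (0/0)" becomes "0"
# prints 54, 1, 0, 0 (one per line)
```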


16 Comments

No one said it was. You can definitely (and easily) extract single pieces of data from it with awk, though.
@dirkk It's really not. It might be incredibly difficult (not impossible) to parse entire segments effectively, but for retrieving small pieces of data, as the question asks, it's actually extremely easy with regex. Everyone just jumps on the don't-parse-XML/XHTML/HTML bandwagon without even understanding the argument in the first place, as you can see by all the upvotes on this "answer". Look at the accepted answer, which clearly parses the data in the question.
@Jidder It is impossible to correctly parse XML using regex, not just difficult. The comment section is too short for a proof, but the Chomsky hierarchy is a good keyword for further research. This is scientifically proven. Just because it works in this case does not mean it is correct. The problem is that it looks correct, and this is why so many people try to use regex for XML parsing - and because that is incorrect and opens you up to a world of pain, so many people advise against it. Rightfully so.
@dirkk There will be no proof of that. Of course you can write an HTML parser in awk, since it is Turing-complete. Also, you need to understand that extracting information from a text file and fully understanding and representing a document are two different things. But hey, I would still use a ready-to-use parser instead of writing a custom one again and again with awk.
@hek2mgl Ah, I see. If awk is Turing-complete you are in fact correct (I don't know much about awk; I thought it was restricted to regular languages). So, to sum up: don't use regular expressions to parse XML. You can use awk to parse XML, but you shouldn't (for the reasons mentioned in the answer and here in the comments).
awk  -F '[<>]' '/<td / { gsub(/<b>/, ""); sub(/ .*/, "", $3); print $3 } ' file

Output:

54
1
0
0

Another:

awk  -F '[<>]' '
/<td><b>Total<\/b><\/td>/ {
    while (getline > 0 && /<td /) {
        gsub(/<b>/, ""); sub(/ .*/, "", $3)
        print $3
    }
    exit
}' file
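A flag-based variant of the second script avoids getline entirely, as suggested in the comments; a sketch, with the question's rows written to a temporary file so it runs standalone:

```shell
# Sample input: the rows from the question
cat > /tmp/index.html <<'EOF'
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
EOF

# Set the flag on the Total row; print following <td > cells;
# stop at the first line that is not a <td > cell
awk -F '[<>]' '
f { if (!/<td /) exit; gsub(/<b>/, ""); sub(/ .*/, "", $3); print $3 }
/<td><b>Total<\/b><\/td>/ { f = 1 }
' /tmp/index.html
# prints:
# 54
# 1
# 0
# 0
```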

6 Comments

@Lenny Make sure you read and fully understand all the caveats discussed at awk.info/?tip/getline before using getline. In this case there's just no need for the getline loop at all; a simple flag would do: f{ sub(..); print; if (!/<td /) exit} /..Total/{f=1}
@EdMorton You have to move if (!/<td /) exit earlier. Flagging is a good approach too, actually, but sometimes it's easier to come up with something that doesn't use it; flagging comes in when you're already trying to make your code more slick or efficient. Again, about getline: getline > 0 is completely safe, and safe enough if you read the manual properly. It's pretty clear how the different syntaxes differ in function. The only thing to really take note of is: The getline command returns 1 on success, 0 on end of file, and -1 on an error.
Yes, the test on !/<td / would come first. Consider both approaches, and now add a requirement that you need to print every line from line 1 up to that /<td / line to a file named "foo" for debugging. Notice that with the getline approach you need to place your print > "foo" in 2 places, whereas with the normal approach of just letting the awk loop do what it does you only need to put print > "foo" in one place. Avoiding getline when it's not needed isn't only about writing safe code, it's also about writing code that can be maintained and extended easily.
@EdMorton I don't agree about it being easy to extend. See this code I wrote a long time ago, where flags (over getline) can barely apply: sourceforge.net/p/playshell/code/ci/master/tree/loader/…. The last update I made was just to make sure getline returns 1 and not just nonzero.
@konsolebox I just gave a simple, common example of non-getline code being easier to extend. In any case, my comment was directed to the OP, now he is aware of the pros/cons and different opinions on the appropriate getline usage. I looked at your compiler code and it could have been written more robustly and concisely without getline. Just kidding - of course I'm not going to read hundreds of lines of awk code and try to figure out what it does and what it'd look like without getline or do any other kind of analysis on it.

HTML-XML-utils

You may use HTML-XML-utils for parsing well-formed HTML/XML files. The package includes many command-line tools for extracting or modifying the data. For example:

$ curl -s http://example.com/ | hxselect title
<title>Example Domain</title>

Here is an example with the provided data:

$ hxselect -c -s "\n" "td[align=right]" <file.html
<b>54</b>
<b>1</b>
0 (0/0)
<b>0</b>

And with the <b> tags stripped out:

$ hxselect -c -s "\n" "td[align=right]" <file.html | sed "s/<[^>]\+>//g"
54
1
0 (0/0)
0

For more examples, check the HTML-XML-utils documentation.

1 Comment

If the HTML source is not well-formatted, try hxclean from the same package.
$ awk -F'<td[^>]*>(<b>)?|(</?b>)?</td>' '$2~/[0-9]/{print $2+0}' file
54
1
0
0
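A commented version of the same one-liner, with the question's rows written to a temporary file so it runs standalone; the field separator does all the work:

```shell
# Sample input: the rows from the question
cat > /tmp/cells.html <<'EOF'
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
EOF

# The FS deletes the opening <td ...> (plus an optional <b>) and the
# closing (optional </b>)</td>, leaving the cell contents in $2.
# $2~/[0-9]/ skips cells like "Total" that contain no digits, and
# $2+0 coerces the field to a number, so "0 (0/0)" becomes 0.
awk -F'<td[^>]*>(<b>)?|(</?b>)?</td>' '$2~/[0-9]/{print $2+0}' /tmp/cells.html
# prints:
# 54
# 1
# 0
# 0
```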

2 Comments

Good answers accompany code samples with an explanation for future readers. While the person asking this question may understand your answer, explaining how you arrived at it will help countless others.
That's fine, but it takes about 15 seconds on average to crank out an answer and a few minutes to document it, so I have time for the former but not the latter for every question, especially the ones that IMHO are self-evident. If anyone has questions I'm happy to answer them.

You really should use a real HTML parser for this job, like:

perl -Mojo -0777 -nlE 'say [split(/\s/, $_->all_text)]->[0] for x($_)->find("td[align=right]")->each'

prints:

54
1
0
0

But for this you need to have perl with the Mojolicious package installed.

(it is easy to install with:)

curl -L get.mojolicio.us | sh



BSD/GNU grep/ripgrep

For simple extraction, you can use grep. For example:

  • Your example using grep:

    $ egrep -o "[0-9][^<]*" file.html
    54
    1
    0 (0/0)
    0
    

    and using ripgrep:

    $ rg -o ">([^>]+)<" -r '$1' <file.html | tail -n +2
    54
    1
    0 (0/0)
    0
    
  • Extracting outer html of H1:

    $ curl -s http://example.com/ | egrep -o '<h1>.*</h1>'
    <h1>Example Domain</h1>
    

Other examples:

  • Extracting the body:

    $ curl -s http://example.com/ | xargs | egrep -o '<body>.*</body>'
    <body> <div> <h1>Example Domain</h1> ...
    

    Instead of xargs you can also use tr '\n' ' '.

  • For multiple tags, see: Text between two tags.

If you're dealing with large datasets, consider using ripgrep, which has similar syntax but is way faster since it's written in Rust.



I was recently pointed to pup, which, in the limited testing I've done, is much more forgiving with invalid HTML and tag soup.

cat <<'EOF' | pup -c 'td + td text{}'
<table>
<tr valign=top>
<td><b>Total</b></td>
<td align=right><b>54</b></td>
<td align=right><b>1</b></td>
<td align=right>0 (0/0)</td>
<td align=right><b>0</b></td>
</tr>
</table>
EOF

Prints:

54
1
0 (0/0)
0

1 Comment

pup can handle partial HTML, attributes without quotes, etc.

With xidel, a true HTML parser, and XPath:

$ xidel -s "input.html" -e '//td[@align="right"]'
54
1
0 (0/0)
0

$ xidel -s "input.html" -e '//td[@align="right"]/tokenize(.)[1]'
# or
$ xidel -s "input.html" -e '//td[@align="right"]/extract(.,"\d+")'
54
1
0
0



ex/vim

For more advanced parsing, you may use in-place editors such as ex/vi, where you can jump between matching HTML tags, select or delete inner/outer tags, and edit the content in place.

Here is the command:

$ ex +"%s/^[^>].*>\([^<]\+\)<.*/\1/g" +"g/[a-zA-Z]/d" +%p -scq! file.html
54
1
0 (0/0)
0

This is how the command works:

  • Use the ex in-place editor to substitute on all lines (%): ex +"%s/pattern/replacement/g".

    The substitution pattern consists of 3 parts:

    • Select from the beginning of the line up to the last > before our match (^[^>].*>) for removal, right before the 2nd part.
    • Capture the main part, everything up to the next < (\([^<]\+\)).
    • Select everything from that < onward for removal (<.*).
    • The whole matching line is replaced with \1, which refers to the group captured by \( \).
  • After the substitution, remove the remaining alphabetic lines with a global command: g/[a-zA-Z]/d.

  • Finally, print the current buffer to the screen with +%p.
  • Then silently (-s) quit without saving (-cq!), or save into the file (-cwq).

To edit the file in place, change -scq! to -scwq.


Here is another simple example, which removes the style tag from the header and prints the parsed output:

$ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin

However, parsing your HTML with regexes is not advised, so as a long-term approach you should use an appropriate language (such as Python, Perl, or PHP's DOM).





What about:

lynx -dump index.html



In the past I used PhantomJS, but now you can do this with similar tools that are still maintained, like Selenium WebDriver.

It makes it possible to use DOM API functions with JavaScript in a headless browser like Firefox or Chromium, and you can call the script (written in Node.js or Python, for example) from your shell script if you want to do additional processing in the shell.

