Parse HTML tables to variables in Bash

Question

I'm trying to solve this problem: I have HTML source code and I want to extract each table and its content into a variable. For example:

<table>
content1
</table>
some more code
<table>
content2
</table>

I would like to save the first table to var1 and the second table to var2, so that I can write:

echo $var1

and get:

<table>
content1
</table>

There are no identifiers to distinguish these tables. Do you have any idea how to solve this?

Thanks

I've tried my best, but I have no idea how to write a regex that can separate these tables.
– Majzlik, Commented Aug 10, 2016 at 9:25
Sorry, lots of work... That answer is quite nice, but too specific to that example. I don't have an exact number of tables. However, it helped me a lot in finding a way to think about the problem, thank you.
– Majzlik, Commented Aug 10, 2016 at 13:11
@Majzlik: You can modify it to point to your node and extract the information accordingly. BTW, it is specific because you gave only that file to work with.
– Inian, Commented Aug 10, 2016 at 13:12
Yes, I know, sorry, it was not meant as a criticism. Thanks for that answer once again.
– Majzlik, Commented Aug 10, 2016 at 13:28

Inian · Accepted Answer · 2016-08-10 10:37:45Z

I will break down an answer that uses xmllint, which supports an --html flag for parsing HTML files:

$ echo "cat //html/body/table" |  xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $1,$2}'
content1  content2

First, you can check the sanity of your HTML file by parsing it as below; xmllint prints the parsed document and reports any parse errors it finds:

$ xmllint --html YourHTML.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<table>
content1
</table>
<table>
content2
</table>
</body></html>

with my original YourHTML.html file being just:

$ cat YourHTML.html
<table>
content1
</table>
<table>
content2
</table>

Now for the value extraction part. The steps, in the order they execute:

Start parsing from the root node down to the repeating node (//html/body/table), running xmllint in HTML parser and interactive shell mode (xmllint --html --shell).

Running that command on its own produces:

/ >  -------
<table>
content1
</table>

 -------
<table>
content2
</table>
/ >

Next, removing the prompt lines and the tags with sed, i.e. sed '/^\/ >/d' | sed 's/<[^>]*.//g', produces:

content1


 -------

content2
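The tag-stripping pattern <[^>]*. works here because the final . happens to consume the closing >. As a minimal sketch, the snippet below compares it with the stricter <[^>]*>. The sample strings and variable names are made up for illustration and are not from the original file.

```shell
#!/bin/bash
# Hypothetical inputs: one with real tags, one with a bare "<" that is not a tag.
tagged='<b>content1</b>'
bare='2<3'

# Pattern from the pipeline: "." consumes whatever character follows the tag body.
loose_tagged=$(printf '%s\n' "$tagged" | sed 's/<[^>]*.//g')
loose_bare=$(printf '%s\n' "$bare" | sed 's/<[^>]*.//g')

# Stricter variant: only removes text that is closed by a literal ">".
strict_tagged=$(printf '%s\n' "$tagged" | sed 's/<[^>]*>//g')
strict_bare=$(printf '%s\n' "$bare" | sed 's/<[^>]*>//g')

printf 'loose:  [%s] [%s]\n' "$loose_tagged" "$loose_bare"
printf 'strict: [%s] [%s]\n' "$strict_tagged" "$strict_bare"
```

Both patterns give the same result on well-formed tags. On the bare "<" input the first pattern also deletes the character after it, while the second leaves the text untouched, so the stricter form may be the safer choice if the table content can contain a literal <.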

Next, tr removes the newlines from the sed output so that awk can process the record using ------- as the field separator:

content1 -------content2

Running awk on the above output produces the result as needed; awk -F"-------" '{print $1,$2}'

content1  content2
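The steps above hard-code two fields. As a rough sketch of how the same xmllint shell output could be split into a Bash array when the number of tables is not known in advance, something like the following may work. The split_tables function and the sample_output variable are hypothetical names; the sample text is copied from the xmllint shell listing earlier in this answer, so the snippet runs without xmllint installed.

```shell
#!/bin/bash
# Sample text standing in for the output of:
#   echo "cat //html/body/table" | xmllint --html --shell YourHTML.html
sample_output='/ >  -------
<table>
content1
</table>

 -------
<table>
content2
</table>
/ > '

# Hypothetical helper: reads xmllint shell output on stdin and prints
# the text content of each table on its own line.
split_tables() {
    sed -e '/^\/ >/d' -e 's/<[^>]*>//g' |
        awk '
            /^ *-+ *$/ { if (buf != "") print buf; buf = ""; next }  # separator between nodes
            NF         { buf = (buf == "" ? $0 : buf " " $0) }         # keep non-empty lines
            END        { if (buf != "") print buf }                    # flush the last table
        '
}

# Load one table per array element, however many there are.
tables=()
while IFS= read -r line; do
    tables+=("$line")
done < <(split_tables <<<"$sample_output")

echo "found ${#tables[@]} tables"
for i in "${!tables[@]}"; do
    echo "table $((i + 1)): ${tables[i]}"
done
```

With real input, the here-string would be replaced by the actual xmllint pipeline feeding split_tables. Note that this joins multi-line table content with spaces, which is an assumption of the sketch and not something the original pipeline specifies.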

Putting it together in a shell script, it looks like:

#!/bin/bash

# extract table1 value
table1Val=$(echo "cat //html/body/table" |  xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $1}')

# extract table2 value
table2Val=$(echo "cat //html/body/table" |  xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $2}')

# can be extended up to any number of nodes

Or quite simply:-

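#!/bin/bash

# read uses the default IFS here so the line is split on whitespace into
# value1 and value2 (with IFS= the whole line would land in value1).
echo "cat //html/body/table" | xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | \
    sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $1,$2}' | \
        while read -r value1 value2
        do
            # Do whatever with the values extracted; the loop runs in a
            # subshell, so the variables are only visible inside it.
            echo "value1=$value1 value2=$value2"
        done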

P.S.: The pipeline can be simplified by combining some of the awk/sed commands. This is just a solution that works. The xmllint version I used reports: xmllint: using libxml version 20706
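The question asks for each table with its tags included, and a comment notes that the number of tables is not fixed. As a swapped-in alternative to the xmllint pipeline, here is a plain Bash sketch that collects every <table>...</table> block into an array. It is a line-oriented scan, not an HTML parser, and it assumes that the opening and closing tags each sit on their own line and that tables are not nested. The sample file contents and the names html_file, tables, and buf are illustrative only.

```shell
#!/bin/bash
# Hypothetical sample input modelled on the question; point html_file at your own file instead.
html_file=$(mktemp)
cat > "$html_file" <<'EOF'
<table>
content1
</table>
some more code
<table>
content2
</table>
EOF

nl=$'\n'
tables=()
inside=0
buf=
while IFS= read -r line; do
    # An opening tag starts a new block
    if [[ $line == *"<table"* ]]; then
        inside=1
        buf=
    fi
    # Collect every line while inside a block, joined by newlines
    if (( inside )); then
        buf=${buf:+$buf$nl}$line
    fi
    # A closing tag ends the block and stores it in the array
    if [[ $line == *"</table>"* ]] && (( inside )); then
        tables+=("$buf")
        inside=0
    fi
done < "$html_file"
rm -f "$html_file"

echo "found ${#tables[@]} tables"
var1=${tables[0]}
var2=${tables[1]}
echo "$var1"
```

Quote the variable when printing it (echo "$var1") to keep the line breaks; unquoted, the shell collapses them into spaces. For real-world HTML with attributes, nesting, or tags sharing a line, a proper parser such as xmllint is likely the more robust route.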
