I tried this with xmllint, which supports a --html flag for parsing HTML files. The full one-liner comes first, followed by a step-by-step breakdown:-
$ echo "cat //html/body/table" | xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $1,$2}'
content1 content2
First, you can sanity-check your HTML file by parsing it as below. xmllint prints the normalized document if the file parses cleanly, or reports errors if the markup is malformed:-
$ xmllint --html YourHTML.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<table>
content1
</table>
<table>
content2
</table>
</body></html>
My original YourHTML.html file was just:-
$ cat YourHTML.html
<table>
content1
</table>
<table>
content2
</table>
Now for the value extraction part. The steps, in the order they execute:-
Start parsing from the root node down to the repeating node (//html/body/table), running xmllint with the HTML parser in interactive shell mode (xmllint --html --shell).
Run on its own, that command produces:-
/ > -------
<table>
content1
</table>
-------
<table>
content2
</table>
/ >
Next, remove the xmllint prompt lines and the HTML tags using sed, i.e. sed '/^\/ >/d' | sed 's/<[^>]*.//g', which produces:-
content1
-------
content2
Then remove the newlines from the above output using tr, so that awk can process everything as a single record with ------- as the field separator:-
content1 -------content2
The awk command on the above output produces the final result as needed: awk -F"-------" '{print $1,$2}'
content1 content2
Putting it together in a shell script, it looks like:-
#!/bin/bash
# extract table1 value
table1Val=$(echo "cat //html/body/table" | xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $1}')
# extract table2 value
table2Val=$(echo "cat //html/body/table" | xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $2}')
# can be extended up to any number of nodes
Or quite simply:-
#!/bin/bash
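echo "cat //html/body/table" | xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | \
sed 's/<[^>]*.//g' | tr -d '\n' | awk -F"-------" '{print $1,$2}' | \
# Use the default IFS here: with IFS= the whole line would land in value1
while read -r value1 value2
do
# Do whatever with the values extracted (a loop body cannot be empty)
printf '%s %s\n' "$value1" "$value2"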
done
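The loop above is fixed at two values. For more tables, one option is to print each extracted value on its own line and read them one at a time. The following is only a minimal sketch: list_values is a hypothetical helper name, and the input string is the joined output copied from the tr step of the walkthrough rather than a live xmllint run.

```shell
#!/bin/bash
# Joined string copied from the tr step of the walkthrough (assumed input).
joined='content1 -------content2'

# list_values is a hypothetical helper: it splits its argument on the
# ------- separator and prints each non-empty piece on its own line,
# with leading and trailing blanks trimmed.
list_values() {
    printf '%s\n' "$1" | awk -F'-------' '{
        for (i = 1; i <= NF; i++) {
            piece = $i
            gsub(/^[ \t]+|[ \t]+$/, "", piece)
            if (piece != "") print piece
        }
    }'
}

# Number the values as they arrive; IFS= keeps inner spaces intact.
n=0
list_values "$joined" | while IFS= read -r value; do
    n=$((n + 1))
    printf 'table%d=%s\n' "$n" "$value"
done
```

Because every value sits on its own line, the same loop handles two tables or twenty, and a value containing spaces stays in one piece.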
P.S:- The pipeline can be simplified by folding the sed/tr/awk stages into fewer commands; this is just one solution that works. The xmllint version I used reports: xmllint: using libxml version 20706
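As one way of cutting down the number of commands, here is a minimal sketch that folds the two sed calls, tr and awk into a single awk program. It is an illustration, not a tested replacement: extract_with_awk and sample_shell_output are hypothetical names, and the sample input is the shell output copied from the walkthrough so the sketch runs without xmllint installed.

```shell
#!/bin/bash
# Sample of the xmllint --shell output, copied from the walkthrough
# (an assumption standing in for a live xmllint run).
sample_shell_output() {
    cat <<'EOF'
/ > -------
<table>
content1
</table>
-------
<table>
content2
</table>
/ >
EOF
}

# extract_with_awk is a hypothetical helper name, not part of xmllint.
extract_with_awk() {
    awk '
        /^\/ >/                 { next }   # drop the shell prompt lines
        /^[ \t]*-------[ \t]*$/ { next }   # drop the node separator lines
        {
            gsub(/<[^>]*>/, "")            # strip the tags
            if ($0 != "") out = (out == "" ? $0 : out " " $0)
        }
        END { print out }
    '
}

sample_shell_output | extract_with_awk
```

On the sample this prints content1 content2. Against a real file the helper would sit at the end of the pipeline, i.e. echo "cat //html/body/table" | xmllint --html --shell YourHTML.html | extract_with_awk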