
I'm trying to parse specific content from a web page in a shell script.

I need to grep the content inside this <div> tag:

<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>

If I use grep -E -m 1 -o '<div class="tracklistInfo">', the result is only <div class="tracklistInfo">.

How can I access the artist (Diplo - Justin Bieber - Skrillex) and the title (Where Are U Now)?


5 Answers


Using xmllint:

a='<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>'

xmllint --html --xpath 'concat(//div[@class="tracklistInfo"]/p[1]/text(), "#", //div[@class="tracklistInfo"]/p[2]/text())' <<<"$a"

You obtain:

Diplo - Justin Bieber - Skrillex#Where Are U Now

That can be easily separated.
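
One way to split that result into two shell variables is plain parameter expansion (a minimal sketch, assuming bash and that '#' never occurs in the artist or the title):

result=$(xmllint --html --xpath 'concat(//div[@class="tracklistInfo"]/p[1]/text(), "#", //div[@class="tracklistInfo"]/p[2]/text())' <<<"$a")
artist=${result%%#*}
title=${result#*#}
echo "$artist"
echo "$title"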


2 Comments

Amazing, had no idea about xmllint
It will fail on most real-world websites; anything xmllint considers "not valid" will cause it to break.

Your title starts with "Parse HTML with CURL", but curl is not an HTML parser. If you want to use a command-line tool, use xidel instead:

$ xidel -s "<file-or-url>" -e '//div[@class="tracklistInfo"]/p'
Diplo - Justin Bieber - Skrillex
Where Are U Now

$ xidel -s "<file-or-url>" -e '//div[@class="tracklistInfo"]/p' --output-separator=' | '
$ xidel -s "<file-or-url>" -e '//div[@class="tracklistInfo"]/join(p," | ")'
Diplo - Justin Bieber - Skrillex | Where Are U Now
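
If you want the two values in separate shell variables, one option is to read the two output lines of the first command (a minimal sketch, assuming bash and that neither value spans more than one line):

{ read -r artist; read -r title; } < <(xidel -s "<file-or-url>" -e '//div[@class="tracklistInfo"]/p')
echo "artist: $artist"
echo "title:  $title"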



Don't. Use an HTML parser. For example, BeautifulSoup for Python is easy to use and can do this in a few lines.

That being said, remember that grep works on lines: the pattern is matched against each line, not against the entire input.

What you can use is -A to also print out lines after the match:

grep -A2 -E -m 1 '<div class="tracklistInfo">'

Should output:

<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>

You can then get the last or second-last line by piping it to tail:

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n1
<p>Where Are U Now</p>

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' |  tail -n2 | head -n1
<p class="artist">Diplo - Justin Bieber - Skrillex</p>

And strip the HTML with sed:

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n1 | sed 's/<[^>]*>//g'
Where Are U Now

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' |  tail -n2 | head -n1 | sed 's/<[^>]*>//g'
Diplo - Justin Bieber - Skrillex
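
If you want both fields in one pass, here's a sketch in the same spirit (the same caveats about fragility apply); it should print the artist and the title on separate lines:

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | sed -n 's/.*<p[^>]*>\(.*\)<\/p>.*/\1/p'
Diplo - Justin Bieber - Skrillex
Where Are U Now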


But as said, this is fickle, likely to break, and not very pretty. Here's the same with BeautifulSoup, by the way:

from bs4 import BeautifulSoup

html = '''<body>
<p>Blah text</p>
<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>
</body>'''

soup = BeautifulSoup(html, 'html.parser')

for track in soup.find_all(class_='tracklistInfo'):
    print(track.find_all('p')[0].text)
    print(track.find_all('p')[1].text)

This also works when there are several tracklistInfo blocks; adding that to the shell commands above requires more work ;-)

7 Comments

Thank you so much! Now I get the following result: Flo Rida Turn Around (5,4,3,2,1). That's perfect, but how can I remove the leading space? And can I use UTF-8? It doesn't work when the text contains a special character, for example: Enrique Iglesias - Nicky Jam El Perd&#243;n
@Fabian Yes, this is why you don't use curl/grep/sed but an HTML parser ;-)
Oh OK, then I'll try BeautifulSoup. Thank you
"nothing works" is not something I can provide meaningful input to other than "that sucks" ;-)
@Fabian It looks like you're running it as a shell script and not as a Python script. Python is a completely different programming language... Use python test.py (or python test.sh)...
# Create a sample file with two tracklistInfo blocks:
cat > file.html << EOF
<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div><div class="tracklistInfo">
<p class="artist">toto</p>
<p>tata</p>
</div>
EOF

# Flatten the file to one line, put every </div> back on its own line,
# then capture the artist and the title with sed:
tr -d '\n' < file.html | sed -e "s/<\/div>/<\/div>\n/g" | sed -n 's/^.*class="artist">\([^<]*\)<\/p> *<p>\([^<]*\)<.*$/artist : \1\ntitle : \2\n/p'
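
With GNU sed (the \n in the replacement is a GNU extension), this should print:

artist : Diplo - Justin Bieber - Skrillex
title : Where Are U Now

artist : toto
title : tata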



Because this will come up in searches, here are some more CLI tools to extract data from HTML (a couple of quick examples follow the list):

  • xidel: download and extract data from HTML/XML pages using CSS selectors, XPath/XQuery 3.0, as well as querying JSON
  • htmlq: Like jq, but for HTML.
  • pup: command line tool for processing HTML … using CSS selectors
  • tq: Perform a lookup by CSS selector on an HTML input
  • html-xml-utils: hxextract (extract selected elements) & hxselect (extract elements that match a (CSS) selector)
  • hq: lightweight command line HTML processor using CSS and XPath selectors
  • cascadia: CSS selector CLI tool
  • xpe: commandline xpath tool that is easy to use
  • hred: html reduce … reads HTML from standard input and outputs JSON
  • parsel: Select parts of a HTML document based on CSS selectors
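
To give a flavour of the CSS-selector tools above, here are two hedged examples against the <div> from the question, assuming the page has been saved as file.html (double-check the flags against the versions you have installed):

$ htmlq --text 'div.tracklistInfo p' < file.html
$ pup 'div.tracklistInfo p text{}' < file.html

Both should print the artist and the title on their own lines:

Diplo - Justin Bieber - Skrillex
Where Are U Now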

And here's a chart of popularity for those projects available on GitHub:

[Star History chart comparing the GitHub star counts of the projects listed above]

