
Running curl -s http://122.160.230.125:8080/gbod/gb_on_demand.do | head -115 | tail -3 gives the following:

<li>Balance quota:&nbsp;&nbsp;&nbsp;78.26&nbsp;GB</li>
<li>High speed data limit:&nbsp;&nbsp;&nbsp;80.0&nbsp;GB</li>
<li>No. of days left in the current bill cycle:&nbsp;&nbsp;&nbsp;28</li>

and curl -s http://122.160.230.125:8080/gbod/gb_on_demand.do | head -115 | tail -3 | awk '{gsub (/&nbsp;/, " "); gsub (/\<li>/, ""); gsub (/\<\/li>/, " "); print}' gives

Balance quota:   78.26 GB
High speed data limit:   80.0 GB
No. of days left in the current bill cycle:   28

How do I extract only the numeric data from each line? Also, is there a better way to extract that data?


5 Answers


Using line counts and regexps to parse HTML is very hacky and very brittle.

But if you want to extend what you're already doing, robustness be damned, all you need is a simple regexp to match numbers:

curl -s http://122.160.230.125:8080/gbod/gb_on_demand.do | 
    head -115 | tail -3 | 
    awk '{gsub (/&nbsp;/, " "); gsub (/\<li>/, ""); gsub (/\<\/li>/, " "); print}' |
    grep -o -E -e '[0-9][0-9.]+'

(I can never remember if I've got the flags right to work on all grep variants. That definitely works on BSD grep; if it doesn't work on yours, the flags are -o to print only the match rather than the whole line, -E to use extended regexps instead of basic, and of course -e to specify the pattern.)
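As a quick sanity check on the grep stage alone (this works on both GNU and BSD grep), feeding it the already-cleaned sample lines from the question pulls out just the numbers:

```shell
printf 'Balance quota:   78.26 GB\nHigh speed data limit:   80.0 GB\nNo. of days left in the current bill cycle:   28\n' |
    grep -o -E -e '[0-9][0-9.]+'
```

which prints 78.26, 80.0, and 28, one per line. Note the pattern requires at least two characters (a digit followed by one or more digits/dots), so a lone single digit would not match.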


2 Comments

I think there is a problem in your regexp. It allows multiple decimal points, so 9...... will also show up. (The dot is literal inside a bracket expression, so at least it won't match arbitrary characters.) I think the correct regex would be '[0-9]*\.?[0-9]+'.
@Chandranshu: Sure, but we're talking about code as brittle as head -115 | tail -3, so I think we can assume it's guaranteed to look very close to what the OP has posted, or it's going to have a lot worse problems. So it's better to just keep things simple. Meanwhile, your regexp still isn't right—it won't handle -42 or 42. or 1e6 or lots of other valid numbers.
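To make the trade-off in this exchange concrete, here is a small sketch (the sample strings are illustrative, not from the actual page) comparing how the two proposed patterns behave:

```python
import re

# Pattern from the answer vs. the tighter one from the comment.
loose = re.compile(r'[0-9][0-9.]+')
tight = re.compile(r'[0-9]*\.?[0-9]+')

for s in ['78.26', '9......', '-42', '1e6']:
    m1 = loose.search(s)
    m2 = tight.search(s)
    print(s, m1.group() if m1 else None, m2.group() if m2 else None)
```

On '78.26' both agree; on '9......' the loose pattern swallows the trailing dots while the tight one stops at '9'; and neither captures the sign of '-42' nor the exponent of '1e6', which is the point of the reply above.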

If you want something less brittle than relying on the fact that the lines you want happen to be on lines 113-115, here's some Python code using BeautifulSoup to do the same thing more nicely.

Without knowing what your source file looks like, I had to make a lot of assumptions. In particular, I'm assuming you want to extract numbers from every <li> tag in the file. If you want to extract numbers only from the <li> tags that have numbers, or only from the <li> tags under a particular <ul> tag with a nice id attribute, or accessible through some simple path from the root, or whatever, the code would be a little different.

import re
import urllib.request
import bs4

url = 'http://122.160.230.125:8080/gbod/gb_on_demand.do'
page = urllib.request.urlopen(url).read()
soup = bs4.BeautifulSoup(page, 'html.parser')
for li in soup.find_all('li'):
    print(re.search(r'\d[\d.]+', li.text).group())

1 Comment

Please see my comment about the regex used in the other answer.

One way to do it:

curl -s http://122.160.230.125:8080/gbod/gb_on_demand.do | awk -F"[;&<]" 'NR>115-3 && NR<=115 {print $8}'
78.26
80.0
28
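To see why field 8 is the right one (assuming the raw lines look like the ones quoted in the question), split one of them on the same separator class by hand:

```shell
echo '<li>Balance quota:&nbsp;&nbsp;&nbsp;78.26&nbsp;GB</li>' |
    awk -F'[;&<]' '{print $8}'
```

Each ;, & and < starts a new field, so the three &nbsp; entities consume fields 3 through 7 and the number lands in field 8; this prints 78.26.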

P.S. If you post the full output of curl -s http://122.160.230.125:8080/gbod/gb_on_demand.do, we can certainly clean this up further.



Assuming the response is proper XML, you can use xmlstarlet to get the contents of the <li> elements:

http://xmlstar.sourceforge.net/doc/UG/xmlstarlet-ug.html#d0e270

You will have to get your head around how to define the query, but in my opinion it is worth it, as the knowledge you gain will help with future XML/HTML queries.

There are browser plugins to help you define the css selector you need to pick exactly the li-items you need (instead of assuming they always appear on the same lines). Unfortunately, I cannot find references right now.

From there on, use grep or sed or awk as others advised.
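If xmlstarlet isn't installed, the same XPath-style query idea can be sketched with Python's standard library instead (the markup below is a stand-in for the real page, with the &nbsp; entities replaced by spaces so it parses as plain XML):

```python
import re
import xml.etree.ElementTree as ET

# Stand-in for the response body; the real page's structure may differ.
html = '''<ul>
  <li>Balance quota: 78.26 GB</li>
  <li>High speed data limit: 80.0 GB</li>
  <li>No. of days left in the current bill cycle: 28</li>
</ul>'''

root = ET.fromstring(html)
for li in root.findall('.//li'):  # XPath-like query over all <li> elements
    m = re.search(r'\d[\d.]+', li.text)
    if m:
        print(m.group())
```

This prints 78.26, 80.0 and 28. Like xmlstarlet, it only works if the response really is well-formed XML; real-world HTML usually needs a forgiving parser such as BeautifulSoup.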



As suggested, I tried the following and I got what I was looking for.

import urllib2
import re
from bs4 import BeautifulSoup

url = 'http://122.160.230.125:8080/gbod/gb_on_demand.do'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
data = []
for li in soup.find_all('li', limit=4):
    data.append(re.search(r'\d[\d.]+', li.text).group())

print "DSL Number: ", data[0]
print "Balance: ", data[1], "GB"
print "Limit: ", data[2], "GB"
print "Days Left: ", data[3]

Using this python script makes more sense than using curl, for my project.

Thank you all for the help.

