
Running curl -s http://122.160.230.125:8080/gbod/gb_on_demand.do | head -115 | tail -3 gives the following:

<li>Balance quota:&nbsp;&nbsp;&nbsp;78.26&nbsp;GB</li>
<li>High speed data limit:&nbsp;&nbsp;&nbsp;80.0&nbsp;GB</li>
<li>No. of days left in the current bill cycle:&nbsp;&nbsp;&nbsp;28</li>

and curl -s http://122.160.230.125:8080/gbod/gb_on_demand.do | head -115 | tail -3 | awk '{gsub (/&nbsp;/, " "); gsub (/\<li>/, ""); gsub (/\<\/li>/, " "); print}' gives

Balance quota:   78.26 GB
High speed data limit:   80.0 GB
No. of days left in the current bill cycle:   28

How do I extract only the numeric data from each line? Also, is there a better way to extract that data?


5 Answers


Using line counts and regexps to parse HTML is very hacky and very brittle.

But if you want to extend what you're already doing, robustness be damned, all you need is a simple regexp to match numbers:

curl -s http://122.160.230.125:8080/gbod/gb_on_demand.do | 
    head -115 | tail -3 | 
    awk '{gsub (/&nbsp;/, " "); gsub (/\<li>/, ""); gsub (/\<\/li>/, " "); print}' |
    grep -o -E -e '[0-9][0-9.]+'

(I can never remember if I've got the flags right to work on all grep variants. That definitely works on BSD grep; if it doesn't work on yours, the flags are -o to print only the match rather than the whole line, -E to use extended regexps instead of basic, and of course -e to specify the pattern.)
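As a quick sanity check on the grep stage alone (this works on both GNU and BSD grep), feeding it the already-cleaned sample lines from the question pulls out just the numbers:

```shell
printf 'Balance quota:   78.26 GB\nHigh speed data limit:   80.0 GB\nNo. of days left in the current bill cycle:   28\n' |
    grep -o -E -e '[0-9][0-9.]+'
```

which prints 78.26, 80.0, and 28, one per line. Note the pattern requires at least two characters (a digit followed by one or more digits/dots), so a lone single digit would not match.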


2 Comments

I think there is a problem in your regexp. It allows multiple decimal points, so 9...... will also show up. (The dot is literal inside a bracket expression, so at least it won't match arbitrary characters.) I think the correct regex would be '[0-9]*\.?[0-9]+'.
@Chandranshu: Sure, but we're talking about code as brittle as head -115 | tail -3, so I think we can assume it's guaranteed to look very close to what the OP has posted, or it's going to have a lot worse problems. So it's better to just keep things simple. Meanwhile, your regexp still isn't right—it won't handle -42 or 42. or 1e6 or lots of other valid numbers.
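To make the trade-off in this exchange concrete, here is a small sketch (the sample strings are illustrative, not from the actual page) comparing how the two proposed patterns behave:

```python
import re

# Pattern from the answer vs. the tighter one from the comment.
loose = re.compile(r'[0-9][0-9.]+')
tight = re.compile(r'[0-9]*\.?[0-9]+')

for s in ['78.26', '9......', '-42', '1e6']:
    m1 = loose.search(s)
    m2 = tight.search(s)
    print(s, m1.group() if m1 else None, m2.group() if m2 else None)
```

On '78.26' both agree; on '9......' the loose pattern swallows the trailing dots while the tight one stops at '9'; and neither captures the sign of '-42' nor the exponent of '1e6', which is the point of the reply above.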

If you want something less brittle than relying on the fact that the lines you want happen to be on lines 113-115, here's some Python code using BeautifulSoup to do the same thing more nicely.

Without knowing what your source file looks like, I had to make a lot of assumptions. In particular, I'm assuming you want to extract numbers from every <li> tag in the file. If you want to extract numbers only from the <li> tags that have numbers, or only from the <li> tags under a particular <ul> tag with a nice id attribute, or accessible through some simple path from the root, or whatever, the code would be a little different.

import re
import urllib.request
import bs4

url = 'http://122.160.230.125:8080/gbod/gb_on_demand.do'
page = urllib.request.urlopen(url).read()
soup = bs4.BeautifulSoup(page, 'html.parser')
for li in soup.find_all('li'):
    print(re.search(r'\d[\d.]+', li.text).group())

1 Comment

Please see my comment about the regex used in the other answer.

One way to do it:

curl -s http://122.160.230.125:8080/gbod/gb_on_demand.do | awk -F"[;&<]" 'NR>115-3 && NR<=115 {print $8}'
78.26
80.0
28
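To see why field 8 is the right one (assuming the raw lines look like the ones quoted in the question), split one of them on the same separator class by hand:

```shell
echo '<li>Balance quota:&nbsp;&nbsp;&nbsp;78.26&nbsp;GB</li>' |
    awk -F'[;&<]' '{print $8}'
```

Each ;, & and < starts a new field, so the three &nbsp; entities consume fields 3 through 7 and the number lands in field 8; this prints 78.26.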

P.S. If you post the full output of curl -s http://122.160.230.125:8080/gbod/gb_on_demand.do, we can certainly clean this up further.



Assuming the response is proper XML, you can use xmlstarlet to get the contents of the <li> elements:

http://xmlstar.sourceforge.net/doc/UG/xmlstarlet-ug.html#d0e270

You will have to get your head around how to define the query, but in my opinion it is worth it, as the knowledge you gain will help with future XML/HTML queries.

There are browser plugins to help you define the css selector you need to pick exactly the li-items you need (instead of assuming they always appear on the same lines). Unfortunately, I cannot find references right now.

From there on, use grep or sed or awk as others advised.
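If xmlstarlet isn't installed, the same XPath-style query idea can be sketched with Python's standard library instead (the markup below is a stand-in for the real page, with the &nbsp; entities replaced by spaces so it parses as plain XML):

```python
import re
import xml.etree.ElementTree as ET

# Stand-in for the response body; the real page's structure may differ.
html = '''<ul>
  <li>Balance quota: 78.26 GB</li>
  <li>High speed data limit: 80.0 GB</li>
  <li>No. of days left in the current bill cycle: 28</li>
</ul>'''

root = ET.fromstring(html)
for li in root.findall('.//li'):  # XPath-like query over all <li> elements
    m = re.search(r'\d[\d.]+', li.text)
    if m:
        print(m.group())
```

This prints 78.26, 80.0 and 28. Like xmlstarlet, it only works if the response really is well-formed XML; real-world HTML usually needs a forgiving parser such as BeautifulSoup.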



As suggested, I tried the following and I got what I was looking for.

import urllib2
import re
from bs4 import BeautifulSoup

url = 'http://122.160.230.125:8080/gbod/gb_on_demand.do'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
data = []
for li in soup.find_all('li', limit=4):
    data.append(re.search(r'\d[\d.]+', li.text).group())

print "DSL Number: ", data[0]
print "Balance: ", data[1], "GB"
print "Limit: ", data[2], "GB"
print "Days Left: ", data[3]

Using this python script makes more sense than using curl, for my project.

Thank you all for the help.

