I'm trying to learn how to find and parse data from HTML5 web pages so I can store it in a database. For now I only want the data from the first element matching '//div[@class="col-xs-12 col-sm-6 col-md-4 col-lg-3"]'.

I've tried html5lib and lxml (from lxml import html, with XPath), but I couldn't find documentation that covers this specific use, so I couldn't work out how to do it.

Data to find and store:

http://csgo.steamanalyst.com/id/120565/ 
from <span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/120565/'

And the 2 numbers from "addToCart(1852864,1108)" as id1:'1852864' and id2:'1108'

in <button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem1' onclick='addToCart(1852864,1108)'

The HTML I'm trying to learn from:

<!DOCTYPE html> 

<div class='row'><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1852864'>StatTrak&#8482; Desert Eagle | Conspiracy (Factory New)</a><br /><small class='text-muted'>StatTrak&#8482; Classified Pistol</small><img style='margin-top:-25px;' src='256fx256f' />
    <div class='item-add'>
      <div class='item-amount'><span class='icon-logo'></span>1,108</div>
      <div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/120565/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>1,451</a></div>
                <div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&StatTrak=1&search_item=+Desert+Eagle+%7C+Conspiracy+%28Factory+New%29' class='btn btn-primary'>Search</a>
                    <br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem1' onclick='addToCart(1852864,1108)'>Add to cart</button></center></div>
    </div>
  </div></div><!-- /.col-md-4 --><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1841001'>★ Karambit | Doppler (Factory New)</a><br /><small class='text-muted'>★ Covert Knife</small><img style='margin-top:-25px;' src='256fx256f' />
    <div class='item-add'>
      <div class='item-amount'><span class='icon-logo'></span>155,000</div>
      <div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/62403692/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>30,300</a></div>
                <div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&search_item=%E2%98%85+Karambit+%7C+Doppler+%28Factory+New%29' class='btn btn-primary'>Search</a>
                    <br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem2' onclick='addToCart(1841001,155000)'>Add to cart</button></center></div>
    </div>
  </div></div><!-- /.col-md-4 --><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1852853'>AK-47 | Redline (Field-Tested)</a><br /><small class='text-muted'>Classified Rifle</small><img style='margin-top:-25px;' src='256fx256f' />
    <div class='item-add'>
      <div class='item-amount'><span class='icon-logo'></span>441</div>
      <div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/1420/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>520</a></div>
                <div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&search_item=AK-47+%7C+Redline+%28Field-Tested%29' class='btn btn-primary'>Search</a>
                    <br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem3' onclick='addToCart(1852853,441)'>Add to cart</button></center></div>
    </div>
  </div></div><!-- /.col-md-4 --><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1852846'>M4A1-S | Master Piece (Field-Tested)</a><br /><small class='text-muted'>Classified Rifle</small><img style='margin-top:-25px;' src='256fx256f' />
    <div class='item-add'>
      <div class='item-amount'><span class='icon-logo'></span>6,618</div>
      <div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/120409/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>8,905</a></div>
                <div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&search_item=M4A1-S+%7C+Master+Piece+%28Field-Tested%29' class='btn btn-primary'>Search</a>
                    <br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem4' onclick='addToCart(1852846,6618)'>Add to cart</button></center></div>
    </div>

2 Answers

Use the HTML parser in the lxml library. The working example below assumes your HTML is held in the string myhtml. There may be a more elegant way to pull the two numbers out of the button's onclick attribute, but this is a start.

>>> from lxml import html
>>> tree = html.fromstring(myhtml)
>>> mybuttons = tree.xpath('//button[@class="btn btn-orange" and @onclick]')
>>> len(mybuttons)
4
>>> for button in mybuttons:
...     (id1, id2) = button.attrib['onclick'].replace('(', ' ').replace(',', ' ').replace(')', ' ').split()[1:]
...     print id1, id2
... 
1852864 1108
1841001 155000
1852853 441
1852846 6618
>>> myurl = tree.xpath('//span[@class="market-name"]/a')
>>> for u in myurl:
...     href = u.attrib['href']
...     print href
... 
http://csgo.steamanalyst.com/id/120565/
http://csgo.steamanalyst.com/id/62403692/
http://csgo.steamanalyst.com/id/1420/
http://csgo.steamanalyst.com/id/120409/
>>> 
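
Since the question asks for the first matching div only, here is a minimal sketch of one way to scope the queries to a single column div. The SAMPLE markup is a trimmed, tidied stand-in for the question's HTML, and parse_item and COLUMN_XPATH are hypothetical names introduced for illustration; this is not the only way to do it.

```python
import re
from lxml import html

# Trimmed, well-formed stand-in for the question's markup (an assumption).
SAMPLE = """
<div class='row'>
  <div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'>
    <span class='market-name'><a href='http://csgo.steamanalyst.com/id/120565/'>Suggested Price: 1,451</a></span>
    <button class='btn btn-orange' id='shopItem1' onclick='addToCart(1852864,1108)'>Add to cart</button>
  </div>
  <div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'>
    <span class='market-name'><a href='http://csgo.steamanalyst.com/id/62403692/'>Suggested Price: 30,300</a></span>
    <button class='btn btn-orange' id='shopItem2' onclick='addToCart(1841001,155000)'>Add to cart</button>
  </div>
</div>
"""

COLUMN_XPATH = '//div[@class="col-xs-12 col-sm-6 col-md-4 col-lg-3"]'


def parse_item(item):
    """Return href, id1 and id2 for one column div, or None if anything is missing."""
    # The leading "." keeps each query inside this div only.
    hrefs = item.xpath('.//span[@class="market-name"]/a/@href')
    onclicks = item.xpath('.//button[@onclick]/@onclick')
    if not hrefs or not onclicks:
        return None
    match = re.search(r'addToCart\((\d+),(\d+)\)', onclicks[0])
    if match is None:
        return None
    return {'href': hrefs[0], 'id1': match.group(1), 'id2': match.group(2)}


tree = html.fromstring(SAMPLE)

# "(...)[1]" selects the first matching div in document order.
first = tree.xpath('(' + COLUMN_XPATH + ')[1]')
first_item = parse_item(first[0]) if first else None
print(first_item)

# The same helper keeps href, id1 and id2 linked for every item.
all_items = [parse_item(div) for div in tree.xpath(COLUMN_XPATH)]
```

Querying relative to each div keeps the link and the button from the same item together, so nothing has to be matched up afterwards.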

5 Comments

This is what I'm looking for, thank you! Although for the button attribute, it returns a KeyError File "lxml.etree.pyx", line 2295, in lxml.etree._Attrib.__getitem__ (src/lxml/lxml.etree.c:59791) KeyError: 'onclick'
@MarieAnne If you are reading from a file, for example your HTML is in a file called myhtml.htm, you will need to change the tree reader line from tree = html.fromstring(myhtml) to tree = html.parse('myhtml.htm'). The posted answer parses the data as a string, but it works just as well if you parse from a file as shown in this comment.
@MarieAnne I edited the code above to work with the URL you provided by changing the selector to require the onclick attribute. You may want to delete all the scripts to make it easier to parse.
This is exactly what I was looking for, thank you. Just one more question, please: is it possible to parse these strings as linked data (href, id1, id2, then the next href, id1, id2, and so on) instead of having two completely separate lists?
Yes, you should first get the buttons and urls from the xpath query, and then merge them using the zip function. See docs.python.org/2/library/functions.html#zip. In this case it would look something like this: for (button, u) in zip(mybuttons, myurl): # Operate on button and u here...
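
A minimal sketch of the zip approach from the last comment, using plain lists in place of the two XPath results. The values are copied from the question's HTML; split_ids and records are hypothetical names, and the sketch assumes both lists come back in the same document order.

```python
# Stand-ins for the href and onclick values the two XPath queries return.
hrefs = [
    'http://csgo.steamanalyst.com/id/120565/',
    'http://csgo.steamanalyst.com/id/62403692/',
]
onclicks = [
    'addToCart(1852864,1108)',
    'addToCart(1841001,155000)',
]


def split_ids(onclick):
    """Return the two arguments of addToCart(...) as strings."""
    inner = onclick[onclick.index('(') + 1:onclick.rindex(')')]
    id1, id2 = inner.split(',')
    return id1.strip(), id2.strip()


# zip pairs the n-th href with the n-th onclick value.
records = []
for href, onclick in zip(hrefs, onclicks):
    id1, id2 = split_ids(onclick)
    records.append({'href': href, 'id1': id1, 'id2': id2})

for record in records:
    print(record)
```

Note that zip stops at the shorter list without raising an error, so if one item is missing its link or its button the pairs after it will be misaligned; comparing the two lengths first is a cheap safeguard.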

I have used a simpler library for a similar problem:

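import re

try:
  from html.parser import HTMLParser  # Python 3
except ImportError:
  from HTMLParser import HTMLParser  # Python 2

# Captures the two numeric arguments of addToCart(id1,id2).
ADD_TO_CART_RE = re.compile(r'addToCart\((\d+),(\d+)\)')

class MyParser(HTMLParser):
  def __init__(self):
    HTMLParser.__init__(self)
    self.in_market = 0   # <span> nesting depth since a market-name span opened
    self.markets = {}    # maps href -> [id1, id2]
    self.market = None   # href of the current item's market-name link

  def handle_starttag(self, tag, attrs):
    attrs = dict(attrs)  # attrs arrives as a list of (name, value) pairs
    if tag == 'span':
      if 'market-name' in (attrs.get('class') or ''):
        self.in_market = 1
        self.market = None  # start a new item
      elif self.in_market:
        self.in_market += 1
    elif self.in_market:
      if tag == 'a' and 'href' in attrs and self.market is None:
        # Keep only the first link after the span opens; later links
        # (such as "Search") would otherwise overwrite it.
        self.market = attrs['href']
      elif tag == 'button' and 'onclick' in attrs:
        match = ADD_TO_CART_RE.match(attrs['onclick'])
        if match:
          self.markets[self.market] = [match.group(1), match.group(2)]

  def handle_endtag(self, tag):
    if tag == 'span' and self.in_market:
      self.in_market -= 1

  def handle_data(self, data):
    pass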

Ask me if any part of the code is unclear.

4 Comments

Isn't regex bad at parsing HTML? stackoverflow.com/a/1732454/4570549 I'm going to try it and get back to you, but it seems to have a lot of conditions; doesn't that hinder performance as well?
The regex is only there to pull the two numbers from the onclick event. If the format is fixed, you could process it with more basic means. I should have written '^addToCart...)$' for the most efficient regex. Then it would probably be more efficient than manual manipulation. It certainly would be in V8 - not so sure for Python.
I'm going to test regex and lxml see which works best, thank you
Update: I chose to go with the lxml version for the simplicity of the code, but thank you again for this method; because of it I learnt more about regex.
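
For reference, a minimal sketch of the anchored pattern mentioned in the comments above, applied to the onclick value on its own rather than to the HTML. cart_ids is a hypothetical helper name, and the pattern assumes the attribute contains nothing but a single addToCart call with two integer arguments.

```python
import re

# ^ and $ require the whole attribute value to be one addToCart(...) call.
ADD_TO_CART_RE = re.compile(r'^addToCart\((\d+),(\d+)\)$')


def cart_ids(onclick):
    """Return (id1, id2) as strings, or None if the value has another shape."""
    match = ADD_TO_CART_RE.match(onclick)
    return match.groups() if match else None


print(cart_ids('addToCart(1852864,1108)'))
print(cart_ids('removeFromCart(1852864,1108)'))
```

Returning None for unexpected values lets the caller skip a button instead of failing on match.group when the site changes its markup.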
