Find and parse a specific element in an HTML5 page with Python

Question

I'm trying to learn how to find and parse data from HTML5 web pages for use in a database. Specifically, I want to extract the data from only the first element matching '//div[@class="col-xs-12 col-sm-6 col-md-4 col-lg-3"]'.

I've tried html5lib, lxml (from lxml import html) and XPath, but I couldn't find documentation that covers my specific use case, so I haven't worked out how to achieve this.

Data to find and store:

http://csgo.steamanalyst.com/id/120565/ 
from <span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/120565/'

And the 2 numbers from "addToCart(1852864,1108)" as id1:'1852864' and id2:'1108'

in <button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem1' onclick='addToCart(1852864,1108)'

The HTML code I'm trying to learn from:

<!DOCTYPE html> 

<div class='row'><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1852864'>StatTrak&#8482; Desert Eagle | Conspiracy (Factory New)</a><br /><small class='text-muted'>StatTrak&#8482; Classified Pistol</small><img style='margin-top:-25px;' src='256fx256f' />
    <div class='item-add'>
      <div class='item-amount'><span class='icon-logo'></span>1,108</div>
      <div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/120565/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>1,451</a></div>
                <div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&StatTrak=1&search_item=+Desert+Eagle+%7C+Conspiracy+%28Factory+New%29' class='btn btn-primary'>Search</a>
                    <br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem1' onclick='addToCart(1852864,1108)'>Add to cart</button></center></div>
    </div>
  </div></div><!-- /.col-md-4 --><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1841001'>★ Karambit | Doppler (Factory New)</a><br /><small class='text-muted'>★ Covert Knife</small><img style='margin-top:-25px;' src='256fx256f' />
    <div class='item-add'>
      <div class='item-amount'><span class='icon-logo'></span>155,000</div>
      <div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/62403692/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>30,300</a></div>
                <div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&search_item=%E2%98%85+Karambit+%7C+Doppler+%28Factory+New%29' class='btn btn-primary'>Search</a>
                    <br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem2' onclick='addToCart(1841001,155000)'>Add to cart</button></center></div>
    </div>
  </div></div><!-- /.col-md-4 --><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1852853'>AK-47 | Redline (Field-Tested)</a><br /><small class='text-muted'>Classified Rifle</small><img style='margin-top:-25px;' src='256fx256f' />
    <div class='item-add'>
      <div class='item-amount'><span class='icon-logo'></span>441</div>
      <div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/1420/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>520</a></div>
                <div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&search_item=AK-47+%7C+Redline+%28Field-Tested%29' class='btn btn-primary'>Search</a>
                    <br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem3' onclick='addToCart(1852853,441)'>Add to cart</button></center></div>
    </div>
  </div></div><!-- /.col-md-4 --><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1852846'>M4A1-S | Master Piece (Field-Tested)</a><br /><small class='text-muted'>Classified Rifle</small><img style='margin-top:-25px;' src='256fx256f' />
    <div class='item-add'>
      <div class='item-amount'><span class='icon-logo'></span>6,618</div>
      <div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/120409/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>8,905</a></div>
                <div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&search_item=M4A1-S+%7C+Master+Piece+%28Field-Tested%29' class='btn btn-primary'>Search</a>
                    <br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem4' onclick='addToCart(1852846,6618)'>Add to cart</button></center></div>
    </div>

Thane Plummer · Accepted Answer · 2015-08-04 04:20:17Z

1

Use the HTML parser in the lxml library. In the working example below, your HTML is assigned to the string myhtml. There may be a more elegant way to parse the text from the button's onclick attribute, but this is a start.

>>> from lxml import html
>>> tree = html.fromstring(myhtml)
>>> mybuttons = tree.xpath('//button[@class="btn btn-orange" and @onclick]')
>>> len(mybuttons)
4
>>> for button in mybuttons:
...     (id1, id2) = button.attrib['onclick'].replace('(', ' ').replace(',', ' ').replace(')', ' ').split()[1:]
...     print(id1, id2)
... 
1852864 1108
1841001 155000
1852853 441
1852846 6618
>>> myurl = tree.xpath('//span[@class="market-name"]/a')
>>> for u in myurl:
...     href = u.attrib['href']
...     print(href)
... 
http://csgo.steamanalyst.com/id/120565/
http://csgo.steamanalyst.com/id/62403692/
http://csgo.steamanalyst.com/id/1420/
http://csgo.steamanalyst.com/id/120409/
>>>
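
The chained replace calls can be swapped for a regular expression. The following is a minimal sketch, assuming the onclick value always has the form addToCart(<digits>,<digits>) as in the question's HTML. The helper name parse_add_to_cart is hypothetical and is not part of lxml.

```python
import re

# Assumption: the onclick value always looks like "addToCart(1852864,1108)".
ADD_TO_CART_RE = re.compile(r"addToCart\((\d+),\s*(\d+)\)")


def parse_add_to_cart(onclick):
    """Return (id1, id2) as strings, or None if the value does not match."""
    match = ADD_TO_CART_RE.search(onclick)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Example with the onclick value of the first button in the question.
print(parse_add_to_cart("addToCart(1852864,1108)"))
```

Inside the loop over mybuttons, the call would take button.attrib['onclick'] as its argument. Returning None for an unexpected value lets the caller skip that button instead of failing while unpacking.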

edited Aug 4, 2015 at 4:20

answered Aug 4, 2015 at 1:45


5 Comments

Marie Anne Over a year ago

This is what I'm looking for, thank you! Although for the button attribute, it returns a KeyError File "lxml.etree.pyx", line 2295, in lxml.etree._Attrib.__getitem__ (src/lxml/lxml.etree.c:59791) KeyError: 'onclick'

Thane Plummer Over a year ago

@MarieAnne If you are reading from a file, for example your HTML is in a file called myhtml.htm, you will need to change the tree reader line from tree = html.fromstring(myhtml) to tree = html.parse('myhtml.htm'). The posted answer parses the data as a string, but it works just as well if you parse from a file as shown in this comment.

Thane Plummer Over a year ago

@MarieAnne I edited the code above to work with the URL you provided by changing the selector to require the onclick attribute. You may want to delete all the scripts to make it easier to parse.

Marie Anne Over a year ago

This is exactly what I was looking for, thank you. Just one more question, please: is it possible to parse these strings as linked data, so that each href is stored together with its id1 and id2, instead of having two completely separate lists?

Thane Plummer Over a year ago

Yes, you should first get the buttons and urls from the xpath queries, and then merge them using the zip function. See docs.python.org/2/library/functions.html#zip. In this case it would look something like this: for (button, u) in zip(mybuttons, myurl): followed by a loop body that operates on button and u.
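
A rough sketch of the zip approach follows. It uses plain lists holding the values printed in the answer, standing in for the results of the two xpath queries. The variable names hrefs, id_pairs and items are illustrative only. It assumes both lists have the same length and order, since zip silently stops at the shorter one.

```python
# Values copied from the answer's printed output. In real use they would be
# built from the two xpath results (myurl and mybuttons).
hrefs = [
    "http://csgo.steamanalyst.com/id/120565/",
    "http://csgo.steamanalyst.com/id/62403692/",
    "http://csgo.steamanalyst.com/id/1420/",
    "http://csgo.steamanalyst.com/id/120409/",
]
id_pairs = [
    ("1852864", "1108"),
    ("1841001", "155000"),
    ("1852853", "441"),
    ("1852846", "6618"),
]

# Pair each href with the ids found at the same position.
items = [
    {"href": href, "id1": id1, "id2": id2}
    for href, (id1, id2) in zip(hrefs, id_pairs)
]

for item in items:
    print(item["href"], item["id1"], item["id2"])
```

If some item on the page lacked a link or a button, the two lists would drift out of step. Querying each item's container element and reading both values from it would avoid that.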

Paul Marrington · Answer · 2015-08-04 00:53:00Z

0

I have used a simpler library, the standard library's HTMLParser, for a similar problem:

import re
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class MyParser(HTMLParser):
  def __init__(self):
    HTMLParser.__init__(self)
    self.in_market = 0   # span nesting depth since a market-name span opened
    self.markets = {}    # href -> [id1, id2]
    self.market = None   # href of the item currently being parsed

  def handle_starttag(self, tag, attrs):
    attrs = dict(attrs)  # attrs arrives as a list of (name, value) pairs
    if tag == 'span':
      if 'market-name' in (attrs.get('class') or ''):
        self.in_market = 1
        self.market = None  # a new item starts here
      elif self.in_market:
        self.in_market += 1
    elif tag == 'a' and self.in_market and 'href' in attrs and self.market is None:
      # Keep only the first link after the span; later links (Search) are ignored.
      self.market = attrs['href']
    elif tag == 'button' and self.market and 'onclick' in attrs:
      add_to_cart_RE = re.compile(r'addToCart\((\d+),(\d+)\)')
      match = add_to_cart_RE.match(attrs['onclick'])
      if match:
        self.markets[self.market] = [match.group(1), match.group(2)]

  def handle_endtag(self, tag):
    if tag == 'span' and self.in_market:
      self.in_market -= 1

  def handle_data(self, data):
    pass

Ask me questions if the code is unclear to you.


4 Comments

Marie Anne Over a year ago

Isn't regex bad at parsing HTML? stackoverflow.com/a/1732454/4570549 I'm going to try it and get back to you, but it seems to have a lot of conditions. Doesn't that hinder performance as well?

Paul Marrington Over a year ago

The regex was only to pull the two numbers from the onclick event. If the format is well fixed you could process it with more basic means. I should have said '^addToCart...)$' for the most efficient regex. Then it would probably be more efficient than manual manipulation. It certainly would be in V8 - not so sure for Python.
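
To illustrate the anchoring point in Python: re.match only anchors at the start of the string, so trailing text is accepted unless the pattern ends with $ or re.fullmatch is used. The sketch below uses the pattern from the answer. The names LOOSE and STRICT and the sample string with trailing text are made up for illustration. It shows matching behaviour only and makes no claim about speed.

```python
import re

# Pattern as written in the answer: unanchored at the end.
LOOSE = re.compile(r"addToCart\((\d+),(\d+)\)")
# Anchored variant suggested in the comment.
STRICT = re.compile(r"^addToCart\((\d+),(\d+)\)$")

good = "addToCart(1852864,1108)"
trailing = "addToCart(1852864,1108); return false"  # hypothetical input

# Both patterns accept the exact form.
print(LOOSE.match(good).groups())
print(STRICT.match(good).groups())

# Only the unanchored pattern accepts extra text after the call.
print(LOOSE.match(trailing) is not None)
print(STRICT.match(trailing) is None)

# fullmatch gives the strict behaviour without editing the pattern.
print(LOOSE.fullmatch(trailing) is None)
```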

Marie Anne Over a year ago

I'm going to test regex and lxml see which works best, thank you

Marie Anne Over a year ago

Update, I chose to go with the lxml version for the simplicity of the code, but thank you again for this method, because of this I learnt more about regex.
