I am parsing HTML from the following website: http://www.asusparts.eu/partfinder/Asus/All In One/E Series. Is there any way to explore a parsed attribute in Python? For example, the code below outputs the following:

datas = s.find(id='accordion')
a = datas.findAll('a')

for data in a:
    if data.has_attr('onclick'):
        model_info.append(data['onclick'])
        print data

[OUTPUT]

<a href="#Bracket" onclick="getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')">Bracket</a>

These are the values I would like to retrieve:

nCategoryID = Bracket

nModelID = ET10B

family = E Series

As the page is rendered via AJAX, the site uses a script file that builds the following URL:

url = 'http://json.zandparts.com/api/category/GetCategories/' + country + '/' + currency + '/' + nModelID + '/' + family + '/' + nCategoryID + '/' + brandName + '/' + null

How can I retrieve only the three values listed above?


[EDIT]


import string, urllib2, urlparse, csv, sys
from urllib import quote
from urlparse import urljoin
from bs4 import BeautifulSoup
from ast import literal_eval

changable_url = 'http://www.asusparts.eu/partfinder/Asus/All%20In%20One/E%20Series'
page = urllib2.urlopen(changable_url)
base_url = 'http://www.asusparts.eu'
soup = BeautifulSoup(page)

#Array to hold all options
redirects = []
#Array to hold all data
model_info = []

print "FETCHING OPTIONS"
select = soup.find(id='myselectListModel')
#print select.get_text()


options = select.findAll('option')

for option in options:
    if option.has_attr('redirectvalue'):
        redirects.append(option['redirectvalue'])

for r in redirects:
    rpage = urllib2.urlopen(urljoin(base_url, quote(r)))
    s = BeautifulSoup(rpage)
    #print s



    print "FETCHING MAIN TITLE"
    #Finding all the headings for each specific Model
    maintitle = s.find(id='puffBreadCrumbs')
    print maintitle.get_text()

    #Find entire HTML container holding all data, rendered by AJAX
    datas = s.find(id='accordion')

    #Find all 'a' tags inside data container
    a = datas.findAll('a')

    #Find all 'span' tags inside data container
    content = datas.findAll('span')

    print "FETCHING CATEGORY" 

    #Find all 'a' tags which have an attribute of 'onclick' Error:(doesn't display anything, can't seem to find
    #'onclick' attr
    if(hasattr(a, 'onclick')):
        arguments = literal_eval('(' + a['onclick'].replace(', this', '').split('(', 1)[1])
        model_info.append(arguments)
        print arguments #arguments[1] + " " + arguments[3] + " " + arguments[4] 


    print "FETCHING DATA"
    for complete in content:
        #Find all 'class' attributes inside 'span' tags
        if(complete.has_attr('class')):
            model_info.append(complete['class'])

            print complete.get_text()

    #Find all 'table data cells' inside table held in data container       
    print "FETCHING IMAGES"
    img = s.find('td')

    #Find all 'img' tags held inside these 'td' cells and print out
    images = img.findAll('img')
    print images

I have added an "Error" comment on the line where the problem lies.

2 Answers

1

Similar to Martijn's answer, but this makes primitive use of pyparsing (i.e., it could be refined to recognise the function and only take quoted strings within the parentheses):

from bs4 import BeautifulSoup
from pyparsing import QuotedString
from itertools import chain

s = '''<a href="#Bracket" onclick="getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')">Bracket</a>'''
soup = BeautifulSoup(s)
for a in soup('a', onclick=True):
    print list(chain.from_iterable(QuotedString("'", unquoteResults=True).searchString(a['onclick'])))
# ['Asus', 'Bracket', 'ET10B', '7138', 'E Series']
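If pyparsing is not available, the refinement mentioned above (recognise the function, then take only the quoted strings inside its parentheses) can be roughly sketched with the standard re module. The helper name quoted_args and both patterns are made up for this illustration, and the sketch assumes the arguments never contain escaped single quotes:

```python
import re

# Hypothetical patterns for illustration; they assume no escaped quotes.
CALL_RE = re.compile(r"getProductsBasedOnCategoryID\((.*)\)")
QUOTED_RE = re.compile(r"'([^']*)'")


def quoted_args(onclick):
    """Return the single-quoted arguments of the call, or [] if it is absent."""
    match = CALL_RE.search(onclick)
    if match is None:
        return []
    # Only look for quoted strings between the function's parentheses.
    return QUOTED_RE.findall(match.group(1))


onclick = "getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')"
print(quoted_args(onclick))
# ['Asus', 'Bracket', 'ET10B', '7138', 'E Series']
```

Because the unquoted this is simply never matched, it does not need to be stripped out first.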

1

You could parse that as a Python literal if you remove the "this" argument from it and take only everything between the parentheses:

from ast import literal_eval

if data.has_attr('onclick'):
    arguments = literal_eval('(' + data['onclick'].replace(', this', '').split('(', 1)[1])
    model_info.append(arguments)
    print arguments

We remove the "this" argument because it is not a valid Python literal, and you don't want it anyway.
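As a quick check of that claim, here is a small sketch of what happens if the bare name is left in place; the variable names raw and rejected are only for illustration:

```python
from ast import literal_eval

# The argument list exactly as it appears in the onclick attribute.
raw = "('Asus','Bracket','ET10B','7138', this, 'E Series')"

try:
    literal_eval(raw)
    rejected = False
except ValueError:
    # A bare name such as this is not a literal, so literal_eval refuses it.
    rejected = True

print(rejected)
```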

Demo:

>>> literal_eval('(' + "getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')".replace(', this', '').split('(', 1)[1])
('Asus', 'Bracket', 'ET10B', '7138', 'E Series')

Now you have a Python tuple and can pick out any value you like.

You want the values at indices 1, 2 and 4, for example:

nCategoryID, nModelID, family = arguments[1], arguments[2], arguments[4]
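Putting the pieces together, here is a minimal sketch that goes from the onclick string to the URL pattern quoted in the question. Several things here are assumptions rather than facts about the site: the helper names parse_onclick and category_url are hypothetical, the first argument ('Asus') is taken to be brandName, the trailing JavaScript null is assumed to end up as the literal text "null", and the country and currency values in the usage line are placeholders.

```python
from ast import literal_eval

try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3

API_BASE = 'http://json.zandparts.com/api/category/GetCategories/'


def parse_onclick(onclick):
    """Return (brandName, nCategoryID, nModelID, family) from the onclick call."""
    args = literal_eval('(' + onclick.replace(', this', '').split('(', 1)[1])
    # Index 3 ('7138' in the example) is not needed for the URL.
    return args[0], args[1], args[2], args[4]


def category_url(onclick, country, currency):
    """Build the URL in the order used by the site's script (assumed)."""
    brand, category, model, family = parse_onclick(onclick)
    parts = [country, currency, model, family, category, brand, 'null']
    # quote() escapes the space in values such as 'E Series'.
    return API_BASE + '/'.join(quote(p) for p in parts)


onclick = "getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')"
print(category_url(onclick, 'GB', 'GBP'))  # 'GB' and 'GBP' are placeholder values
```

In the scraping loop you would call this once per tag, i.e. for each data in a that has an onclick attribute, rather than on the whole result list.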

10 Comments

You are still just printing data, not what you extracted from the onclick attribute.
Print arguments instead; the return value of literal_eval.
I am trying to understand what you have done here, but without success. I only want to display ('Bracket', 'ET10B', 'E Series'), but I get the error shown in the edit above.
@ash: Don't mess with the string, just use the code I gave you; just grab what you need from arguments. I've added an example.
Yes, I tried that initially, but it threw an error saying "tuple index out of range".