
I have been trying to scrape a value out of some JavaScript embedded in an HTML page. There is a lot of JavaScript in the page, but I only want to print this one block:

var spConfig = new Product.Config({
  "attributes": {
    "531": {
      "id": "531",
      "options": [
        {
          "id": "18",
          "hunter": "0",
          "products": [
            "128709"
          ]
        },
        {
          "label": "40 1\/2",
          "hunter": "0",
          "products": [
            "120151"
          ]
        },
        {
          "id": "33",
          "hunter": "0",
          "products": [
            "120152"
          ]
        },
        {
          "id": "36",
          "hunter": "0",
          "products": [
            "128710"
          ]
        },
        {
          "id": "42",
          "hunter": "0",
          "products": [
            "125490"
          ]
        }
      ]
    }
  },

  "Id": "120153"

});

So I started with code that looks like this:

test = bs4.find_all('script', {'type': 'text/javascript'})
print(test)

The output is too large to post in full, but one of the scripts is the JavaScript shown at the top, and I want to print only the var spConfig = new Product.Config(...) block.

How can I print just var spConfig = new Product.Config(...), so that I can later run json.loads on it and scrape the resulting JSON more easily?

If anything is unclear or I haven't explained something well, let me know in the comments. I appreciate any feedback on how I can improve my questions here on Stack Overflow! :)

EDIT:

More examples of what bs4 prints for the script tags:

<script type="text/javascript">var optionsPrice = new Product.Options({
  "priceFormat": {
    "pattern": "%s\u00a0\u20ac",
    "precision": 2,
    "requiredPrecision": 2,
    "decimalSymbol": ",",
    "groupSymbol": "\u00a0",
    "groupLength": 3,
    "integerRequired": 1
  },
  "showBoths": false,
  "idSuffix": "_clone",
  "skipCalculate": 1,
  "defaultTax": 20,
  "currentTax": 20,
  "tierPrices": [

  ],
  "tierPricesInclTax": [

  ],
  "swatchPrices": null
});</script>,
<script type="text/javascript">var spConfig = new Product.Config({
  "attributes": {
    "531": {
      "id": "531",
      "options": [
        {
          "id": "18",
          "hunter": "0",
          "products": [
            "128709"
          ]
        },
        {
          "label": "40 1\/2",
          "hunter": "0",
          "products": [
            "120151"
          ]
        },
        {
          "id": "33",
          "hunter": "0",
          "products": [
            "120152"
          ]
        },
        {
          "id": "36",
          "hunter": "0",
          "products": [
            "128710"
          ]
        },
        {
          "id": "42",
          "hunter": "0",
          "products": [
            "125490"
          ]
        }
      ]
    }
  },

  "Id": "120153"
});</script>,
<script type="text/javascript">document.observe('dom:loaded',
function(){
  var swatchesConfig = new Product.ConfigurableSwatches(spConfig);
});</script>

EDIT update 2:

try:
    product_li_tags = bs4.find_all('script', {'type': 'text/javascript'})
except Exception:
    product_li_tags = []


for product_li_tag in product_li_tags:
    try:
        # re.search needs a string, so pass the tag's text rather than the Tag itself
        pat = r"Product\.Config\((.+)\);"
        json_str = re.search(pat, product_li_tag.text, flags=re.DOTALL).group(1)
        print(json_str)
    except AttributeError:
        # re.search returned None: this script has no Product.Config call
        pass

#json.loads(json_str)
print("Nothing")
sys.exit()

2 Answers

You can use the .text attribute to get the content of each tag. Then, if you know that you want the code that starts with "var optionsPrice", you can filter for that:

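from bs4 import BeautifulSoup

# myhtml is assumed to hold the page source as a string
soup = BeautifulSoup(myhtml, 'lxml')

script_blocks = soup.find_all('script', {'type': 'text/javascript'})
special_code = ''
for s in script_blocks:
    # Keep the first script whose text starts with the variable we want
    if s.text.strip().startswith('var optionsPrice'):
        special_code = s.text
        break

print(special_code)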

EDIT: To answer your question in the comments, there are a couple of different ways to extract the part of the text that contains the JSON. You could use a regular expression to grab everything between the first left parenthesis and the ); at the end. If you want to avoid regular expressions completely, you could do something like:

json_stuff = special_code[special_code.find('(')+1:special_code.rfind(')')]

Then to make a usable dictionary out of it:

import json
j = json.loads(json_stuff)
print(j['defaultTax'])  # This should return a value of 20
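
The same slicing idea can be pointed at the spConfig block from the question. Here is a minimal sketch of that, assuming the script bodies have already been collected as plain strings (for example with [s.text for s in script_blocks]). The function name extract_call_argument and the sample strings are made up for illustration and are not part of bs4 or the original page:

```python
import json


def extract_call_argument(script_texts, marker):
    """Return the parsed JSON argument of the first script containing `marker`.

    `script_texts` is any iterable of script bodies as plain strings.
    Returns None when no script contains the marker.
    """
    for text in script_texts:
        if marker not in text:
            continue
        # Same slicing idea as above: everything between the first "("
        # after the marker and the last ")" is treated as the JSON payload.
        start = text.find('(', text.find(marker)) + 1
        end = text.rfind(')')
        return json.loads(text[start:end])
    return None


# Hypothetical sample data standing in for the real page's scripts.
scripts = [
    'var optionsPrice = new Product.Options({"defaultTax": 20});',
    'var spConfig = new Product.Config({"attributes": {"531": {"id": "531"}}, "Id": "120153"});',
]

config = extract_call_argument(scripts, 'var spConfig')
print(config['Id'])
```

Note that this only works when the call is the last thing in the script, because rfind(')') picks the final closing parenthesis in the text.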

2 Comments

Yesss! This right here!!! However, what if we want to turn that s.text into JSON once it has been found?
@Hellosiroverthere, I've updated my comments to answer the JSON part.

I can think of three possible options. Which one you use might depend on the size of the project and how flexible you need it to be:

  • Use Regex to extract the objects from the script (fastest, least flexible)

  • Use ANTLR or similar (eg. pyjsparser) to parse the js grammar

  • Use Selenium or another headless browser that can interpret the JS for you. With this option, you can have Selenium execute a call that returns the value of the variable.

Regex Example (#1)

>>> import json, re
>>> script_body = """
    var x=product.Config({
        "key": {"a":1}
});
"""
>>> pat = r"product\.Config\((.+)\);"
>>> json_str = re.search(pat, script_body, flags=re.DOTALL).group(1)
>>> json.loads(json_str)
{'key': {'a': 1}}
>>> json.loads(json_str)['key']['a']
1
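
One caveat with the greedy (.+) pattern: with re.DOTALL it runs to the last ); in the script, so it breaks if more JavaScript follows the call. A possible alternative, sketched below, is to use the regex only to locate the opening parenthesis and let json.JSONDecoder.raw_decode read exactly one JSON value from there. The helper name parse_first_json_argument and the extra doSomethingElse line are assumptions added for illustration:

```python
import json
import re


def parse_first_json_argument(script_body, call_name):
    """Parse the JSON object passed as the first argument to `call_name(...)`.

    Returns None when the call does not appear in the script.
    """
    # Find "call_name(" and skip any whitespace after the parenthesis,
    # because raw_decode does not skip leading whitespace itself.
    match = re.search(re.escape(call_name) + r"\(\s*", script_body)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it, so trailing JavaScript is not a problem.
    obj, _end = json.JSONDecoder().raw_decode(script_body, match.end())
    return obj


script_body = """
var x = product.Config({
    "key": {"a": 1}
});
doSomethingElse(x);
"""

result = parse_first_json_argument(script_body, "product.Config")
print(result["key"]["a"])
```

This still assumes the argument is strict JSON. A JavaScript object literal with unquoted keys or single quotes would need one of the other two options.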

8 Comments

Oh hmm. The value is there when I parse the whole HTML with html.parser. It is inside a script type="text/javascript" tag. My idea was that maybe it is possible to scrape it like you usually do with bs4, but I assume that is not possible?
Any way you could post more output from the test variable? Just to see exactly what the structure is.
AFAIK bs4 will not be able to interpret the JS syntax so the script body as a string is as far as it will take you.
@gtalarico I just saw your example and I thought I might give it a go, but I couldn't figure out what "s" stands for in json_str?
Sorry, that was a typo in the example: s was script_body. I have corrected the example.