I am trying to extract the price and other attributes from this JSON-LD script:

  <script type="application/ld+json">
{
  "@context": "http://schema.org/",
  "@type": "Product",
  "name": "Rolex Cellini Time 50505",
  "image": [
        "https://chronexttime.imgix.net/S/1/S1006/S1006_58774a90efd04.jpg?w=1024&amp;auto=format&amp;fm=jpg&amp;q=75&amp;usm=30&amp;usmrad=1&amp;h=1024&amp;fit=clamp"      ],
  "description": "Werk: automatic; Herrenuhr; Gehäusegröße: 39; Gehäuse: rose-gold; Armband: leather; Glas: sapphire; Jahr: 2018; Lieferumfang: Originale Box, Originale Papiere, Herstellergarantie",
  "mpn": "S1006",
  "brand":{
    "@type": "Thing",
    "name": "Rolex"
  },
  "offers":{
    "@type": "Offer",
    "priceCurrency": "EUR",
    "price": "11500",
    "itemCondition": "http://schema.org/NewCondition",
    "availability": "http://schema.org/InStock",

    "seller":{
      "@type": "Organization",
      "name": "CHRONEXT Service Germany GmbH"
    }
  }
}
</script>

Alternatively this code might do it as well:

  <script type="text/javascript">
window.articleInfo = {
    'id': 'S1006',
    'model': 'Cellini Time',
    'brand': 'Rolex',
    'reference': '50505',
    'priceLocal': '11500',
    'currencyCode': 'EUR'
};
</script>

There is a lot of other JS code on the same page, so I am not sure how to address this particular script with XPath.

I tried this:

response.xpath('//script[contains(.,"price")]/text()').extract_first()

but it returns the whole script body with a bunch of values, while I am only looking for the price, 11500. Later on I would also like to get, e.g., the name and condition.

  • Try """//script/substring-before(substring-after(., '"price": '), ',') | //script/substring-before(substring-after(., "'priceLocal': "), ",") """ Commented Dec 11, 2018 at 9:24
  • Getting invalid syntax. Maybe I am placing the code wrong: response.xpath('//script/substring-before(substring-after(., '"price": '), ',')').extract_first() Commented Dec 11, 2018 at 9:49
  • Try response.xpath('''//script/substring-before(substring-after(., '"price": '), ',')''').extract_first() Commented Dec 11, 2018 at 10:08
  • Nope, getting: "ValueError: XPath error: Invalid expression in //script/substring-before(substring-after(., '"price": '), ',')" Commented Dec 11, 2018 at 10:28
  • @merlin please don't forget to accept an answer if it helped you solve your question. Commented Dec 22, 2018 at 13:35

2 Answers


You have two options.

1) Using JSON, but this only works for the first case. Note that the price is nested under "offers":

import json

json_data = json.loads(response.xpath('//script[@type="application/ld+json"]/text()').extract_first())
price = json_data['offers']['price']
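
Since the question also asks for the name and condition, here is a minimal sketch of pulling several fields out of the decoded JSON-LD. The helper name extract_product is hypothetical, and the LD_JSON string is an abbreviated copy of the script from the question, standing in for what the XPath call above would return.

```python
import json

# Abbreviated copy of the JSON-LD script body from the question. In a spider
# this text would come from the response.xpath(...).extract_first() call.
LD_JSON = """
{
  "@context": "http://schema.org/",
  "@type": "Product",
  "name": "Rolex Cellini Time 50505",
  "mpn": "S1006",
  "brand": {"@type": "Thing", "name": "Rolex"},
  "offers": {
    "@type": "Offer",
    "priceCurrency": "EUR",
    "price": "11500",
    "itemCondition": "http://schema.org/NewCondition",
    "availability": "http://schema.org/InStock"
  }
}
"""


def extract_product(ld_json_text):
    """Hypothetical helper: pick a few fields out of a JSON-LD Product block."""
    data = json.loads(ld_json_text)
    offer = data.get("offers", {})
    return {
        "name": data.get("name"),
        "brand": data.get("brand", {}).get("name"),
        "price": offer.get("price"),
        "currency": offer.get("priceCurrency"),
        # Keep only the last path segment of the schema.org URL.
        "condition": (offer.get("itemCondition") or "").rsplit("/", 1)[-1],
    }


product = extract_product(LD_JSON)
print(product["price"], product["condition"])
```

Using .get() with defaults keeps the helper from raising if a page omits the "offers" or "brand" block; whether that is the right behaviour depends on how strict the spider should be.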

2) Using a regular expression, which works for both scripts:

response.xpath('//script/text()').re_first(r'''price(?:Local)?["']\s*:\s*["'](.*?)["']''')

The price(?:Local)?["']\s*:\s*["'](.*?)["'] regular expression means:

  • Start with price, with an optional Local suffix (so priceCurrency does not match)
  • Then a single or double quote
  • Then : surrounded by zero or more spaces
  • Then a single or double quote
  • Then any value, matched non-greedily (the price will be here)
  • Then a single or double quote again
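
The pattern described above can be tried outside Scrapy with the standard re module. This is only a sketch: it uses a cleaned-up form of the pattern (capital L in Local so that priceLocal matches, and a non-greedy group so the match stops at the first closing quote), and the two snippet strings are shortened copies of the scripts from the question. The name first_price is made up for illustration.

```python
import re

# Cleaned-up form of the pattern: "price" or "priceLocal", a quote, a colon,
# a quote, then the shortest possible value up to the next quote.
PRICE_RE = re.compile(r'''price(?:Local)?["']\s*:\s*["'](.*?)["']''')

# Shortened copies of the two scripts from the question.
JSON_LD_SNIPPET = '"priceCurrency": "EUR",\n    "price": "11500",'
JS_SNIPPET = "'priceLocal': '11500',\n    'currencyCode': 'EUR'"


def first_price(text):
    """Return the first captured price in text, or None if nothing matches."""
    match = PRICE_RE.search(text)
    return match.group(1) if match else None


print(first_price(JSON_LD_SNIPPET))
print(first_price(JS_SNIPPET))
```

Because a quote must follow price or priceLocal directly, the "priceCurrency" key is skipped and only the actual price value is captured.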

For the first script, yes, there is no better option than decoding it directly with json.

For the second one, you can of course always use regular expressions, but a cleaner solution I would recommend is js2xml, which transforms JavaScript into an XPath-queryable XML tree:

$ pip install js2xml

Let's say one script has the following structure:

<script type="text/javascript">
window.articleInfo = {
    'id': 'S1006',
    'model': 'Cellini Time',
    'brand': 'Rolex',
    'reference': '50505',
    'priceLocal': '11500',
    'currencyCode': 'EUR'
};
</script>

Parsing it would look like this (the contains() predicate picks this script out of the other scripts on the page):

import js2xml

...

parsed = js2xml.parse(response.xpath('//script[contains(., "window.articleInfo")]/text()').extract_first())

You can see the structure of parsed with:

>>> print(js2xml.pretty_print(parsed))
<program>
  <assign operator="=">
    <left>
      <dotaccessor>
        <object>
          <identifier name="window"/>
        </object>
        <property>
          <identifier name="articleInfo"/>
        </property>
      </dotaccessor>
    </left>
    <right>
      <object>
        <property name="id">
          <string>S1006</string>
        </property>
        <property name="model">
          <string>Cellini Time</string>
        </property>
        <property name="brand">
          <string>Rolex</string>
        </property>
        <property name="reference">
          <string>50505</string>
        </property>
        <property name="priceLocal">
          <string>11500</string>
        </property>
        <property name="currencyCode">
          <string>EUR</string>
        </property>
      </object>
    </right>
  </assign>
</program>

Which means now you can get the information you need like this:

parsed.xpath('//property[@name="id"]/string/text()')[0]
parsed.xpath('//property[@name="model"]/string/text()')[0]
parsed.xpath('//property[@name="brand"]/string/text()')[0]
...
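
If you want every property at once rather than one XPath per field, the named property nodes can be collected into a dict. The sketch below is an assumption-laden illustration: it runs on an abbreviated copy of the XML above using the standard library's ElementTree so it works without js2xml installed, and it assumes the tree returned by js2xml.parse is an lxml element, which offers the same iter(), get() and find() methods. The helper name properties_to_dict is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the js2xml output shown above.
SAMPLE_XML = """
<program>
  <assign operator="=">
    <left>
      <dotaccessor>
        <object><identifier name="window"/></object>
        <property><identifier name="articleInfo"/></property>
      </dotaccessor>
    </left>
    <right>
      <object>
        <property name="id"><string>S1006</string></property>
        <property name="priceLocal"><string>11500</string></property>
        <property name="currencyCode"><string>EUR</string></property>
      </object>
    </right>
  </assign>
</program>
"""


def properties_to_dict(tree):
    """Hypothetical helper: map each named <property> to its <string> text."""
    result = {}
    for prop in tree.iter("property"):
        name = prop.get("name")
        value = prop.find("string")
        # Skip unnamed properties (e.g. the dotaccessor part) and
        # properties whose value is not a plain string.
        if name is not None and value is not None:
            result[name] = value.text
    return result


article = properties_to_dict(ET.fromstring(SAMPLE_XML.strip()))
print(article["priceLocal"])
```

In a spider, the same function should accept the parsed tree directly, i.e. properties_to_dict(parsed), under the assumption stated above.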

I hope I could help you with this.
