I am trying to scrape the following html.

There are multiple divs where class="review-card".

Each of these divs always contains a script element where data-initial-state="data-always-exist" and sometimes contains a script element where data-initial-state="data-may-not-exist".

I would like to retrieve the data from both of these script elements. When the second one does not exist, I want to return a specific value, e.g. 0.

As you can see in my code below, I have managed to find the "review-card" div elements. However, I fail to retrieve the script elements that live inside each div element. My code always returns a list instead of a single element. What am I doing wrong?

<html>
    <body>
        <main>
            <div class="review-list">
                <div class="review-card">
                    <article class="review">
                        <script type="application.json" data-initial-state="data-always-exist">
                        {"reviewBody":"Brilliant value","stars":5}
                        </script>
                        <section class="review__content">
                            <div class="content">
                                <script type="application.json" data-initial-state="data-may-not-exist">
                                    {"isVerified":true,"verificationSource":"invitation"}
                                </script>
                            </div>
                        </section>
                    </article>
                </div>
                <div class="review-card">
                    <article class="review">
                            <script type="application.json" data-initial-state="data-always-exist">
                                {"reviewBody":"Brilliant value","stars":5}
                            </script>
                    </article>
                </div>
                <div class="review-card">
                    <article class="review">
                        <script type="application.json" data-initial-state="data-always-exist">
                        {"reviewBody":"Great","stars":4}
                        </script>
                        <section class="review__content">
                            <div class="content">
                                <script type="application.json" data-initial-state="data-may-not-exist">
                                    {"isVerified":false,"verificationSource":"invitation"}
                                </script>
                            </div>
                        </section>
                    </article>
                </div>

            </div>
        </main>
    </body>
</html>

I have tried the following:

from lxml import html
import requests

page = requests.get('http://somewebsite.com')
tree = html.fromstring(page.content)

#finds the review list
review_list = tree.xpath('//div[@class="review-list"]')

#finds all the review cards
review_cards = review_list[0].xpath('//div[contains(@class,"review-card")]')

for card in review_cards:
   
   # this part of the code does not work as intended: it returns a list instead of a single item
   data_always_exist = card.xpath("//script[starts-with(@data-initial-state, 'data-always-exist')]")
   data_not_always_exist = card.xpath("//script[starts-with(@data-initial-state, 'data-may-not-exist')]")

  • Is it ok to use beautifulsoup? Commented Apr 30, 2021 at 18:07
  • @AndrejKesely as a last option yes, but I would prefer a lxml solution. Commented Apr 30, 2021 at 18:11
  • I've added BeautifulSoup and lxml versions. Commented Apr 30, 2021 at 18:24

1 Answer

A solution using BeautifulSoup:

import requests
from bs4 import BeautifulSoup


soup = BeautifulSoup(requests.get("http://somewebsite.com").content, "lxml")

for card in soup.select(".review-card"):
    print("data-always-exist:")
    d = card.select_one('[data-initial-state="data-always-exist"]')
    if d:
        print(d.contents[0].strip())
    print("data-may-not-exist:")
    d = card.select_one('[data-initial-state="data-may-not-exist"]')
    if d:
        print(d.contents[0].strip())

    print("-" * 80)

Prints:

data-always-exist:
{"reviewBody":"Brilliant value","stars":5}
data-may-not-exist:
{"isVerified":true,"verificationSource":"invitation"}
--------------------------------------------------------------------------------
data-always-exist:
{"reviewBody":"Brilliant value","stars":5}
data-may-not-exist:
--------------------------------------------------------------------------------
data-always-exist:
{"reviewBody":"Great","stars":4}
data-may-not-exist:
{"isVerified":false,"verificationSource":"invitation"}
--------------------------------------------------------------------------------
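The loop above prints nothing when the optional script is missing. If you want a value such as 0 instead, one possible approach is to wrap the lookup in a small helper. This is only a sketch: the inline SAMPLE markup and the helper name script_json are assumptions made for illustration, and it uses the built-in "html.parser" so it runs without lxml installed.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical inline sample mirroring the question's markup;
# the second card has no "data-may-not-exist" script.
SAMPLE = """
<div class="review-list">
  <div class="review-card">
    <script type="application.json" data-initial-state="data-always-exist">{"reviewBody":"Brilliant value","stars":5}</script>
    <div class="content">
      <script type="application.json" data-initial-state="data-may-not-exist">{"isVerified":true,"verificationSource":"invitation"}</script>
    </div>
  </div>
  <div class="review-card">
    <script type="application.json" data-initial-state="data-always-exist">{"reviewBody":"Great","stars":4}</script>
  </div>
</div>
"""


def script_json(card, state, default=0):
    """Parse the JSON of the first matching script inside `card`, or return `default`."""
    tag = card.select_one(f'script[data-initial-state="{state}"]')
    if tag is None or not tag.string:
        return default
    return json.loads(tag.string)


soup = BeautifulSoup(SAMPLE, "html.parser")
rows = [
    (script_json(card, "data-always-exist"), script_json(card, "data-may-not-exist"))
    for card in soup.select(".review-card")
]
```

Each entry in rows is a pair: the parsed "data-always-exist" dict and either the parsed "data-may-not-exist" dict or 0.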

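Version with lxml. Start each XPath with a dot (.): an expression that begins with // searches the whole document even when it is called on a card element, while .// searches only that element's descendants: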

# ...
tree = html.fromstring(page.content)
cards = tree.xpath('//div[contains(@class,"review-card")]')


for card in cards:

    # the leading dot keeps each search inside the current card; xpath() still returns a list
    data_always_exist = card.xpath(
        ".//script[starts-with(@data-initial-state, 'data-always-exist')]"
    )
    data_not_always_exist = card.xpath(
        ".//script[starts-with(@data-initial-state, 'data-may-not-exist')]"
    )

    print(data_always_exist)
    print(data_not_always_exist)
    print("-" * 80)

Prints:

[<Element script at 0x7fc202aadd10>]
[<Element script at 0x7fc202aade50>]
--------------------------------------------------------------------------------
[<Element script at 0x7fc202aadea0>]
[]
--------------------------------------------------------------------------------
[<Element script at 0x7fc202aade50>]
[<Element script at 0x7fc202aadea0>]
--------------------------------------------------------------------------------
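Note that xpath() always returns a list, so to get a single value per card (and 0 when the optional script is missing) you still have to pick the first match yourself. Below is a minimal sketch of that step. It assumes an inline SAMPLE document that mirrors the question's markup and a hypothetical helper named first_json; neither comes from the original post. It also counts matches with and without the leading dot to show the difference.

```python
import json
from lxml import html

# Hypothetical inline sample mirroring the question's markup;
# the second card has no "data-may-not-exist" script.
SAMPLE = """
<html><body>
<div class="review-list">
  <div class="review-card">
    <script type="application.json" data-initial-state="data-always-exist">{"reviewBody":"Brilliant value","stars":5}</script>
    <div class="content">
      <script type="application.json" data-initial-state="data-may-not-exist">{"isVerified":true,"verificationSource":"invitation"}</script>
    </div>
  </div>
  <div class="review-card">
    <script type="application.json" data-initial-state="data-always-exist">{"reviewBody":"Great","stars":4}</script>
  </div>
</div>
</body></html>
"""


def first_json(card, state, default=0):
    """Parse the JSON of the first matching script under `card`, or return `default`."""
    # $state is an XPath variable that lxml fills from the keyword argument.
    matches = card.xpath(".//script[@data-initial-state=$state]", state=state)
    if not matches:
        return default
    return json.loads(matches[0].text_content())


tree = html.fromstring(SAMPLE)
cards = tree.xpath('//div[contains(@class, "review-card")]')

# Called on the second card: "//" scans the whole document, ".//" only this card.
absolute_count = len(cards[1].xpath("//script"))
relative_count = len(cards[1].xpath(".//script"))

results = [
    (first_json(card, "data-always-exist"), first_json(card, "data-may-not-exist"))
    for card in cards
]
```

With this sample, absolute_count is 3 (every script in the document) while relative_count is 1, and the second entry of results ends with 0 because that card has no optional script.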