I'm attempting to loop through multiple pages of a website (2 for the purposes of this example), scrape the relevant customer review data, and ultimately combine it into a single data frame.

The challenge I'm encountering is that my code appears to produce two separate data frames within a single data frame object (df in the code below). I might be mistaken, but that's how I'm interpreting the output.

Here is a screenshot of what I'm describing above:

[Screenshot: separate data frames within a single data frame object]

Here is the code that produced the screenshot results:

from bs4 import BeautifulSoup as bs
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json

page = 1
urls = []
while page != 3:
    url = f"https://www.trustpilot.com/review/trupanion.com?page={page}"
    urls.append(url)
    page = page + 1

for url in urls:
    response = requests.get(url)
    html = response.content
    soup = bs(html, "html.parser")
    results = soup.find(id="__NEXT_DATA__")
    json_object = json.loads(results.contents[0])
    reviews = json_object["props"]["pageProps"]["reviews"]
    ids = pd.Series([ sub['id'] for sub in reviews ])
    filtered = pd.Series([ sub['filtered'] for sub in reviews ])
    pending = pd.Series([ sub['pending'] for sub in reviews ])
    rating = pd.Series([ sub['rating'] for sub in reviews ])
    title = pd.Series([ sub['title'] for sub in reviews ])
    likes = pd.Series([ sub['likes'] for sub in reviews ])
    experienced = pd.Series([ sub['dates']['experiencedDate'] for sub in reviews ])
    published = pd.Series([ sub['dates']['publishedDate'] for sub in reviews ])
    source = url
    df = pd.DataFrame({'id': ids, 'filtered': filtered, 'pending': pending, 'rating': rating,
                   'title': title, 'likes': likes, 'experienced': experienced,
                   'published': published, 'source': source})  
    print(df)

I've been relying on these posts as potential solutions without any luck:

Rbind, having data frames within data frames causes errors?

Analyse data frames inside a list of data frames and store all results in single data frame

Merge multiple data frames into a single data frame in python

Specifically, I'm consistently receiving the following error:

TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid

I'm certain the '<class 'str'>' bit is a clue to what the issue is, but I have been spinning my wheels and feel like I need to 'go pencils down' and ask for assistance. I'm relatively new to Python, and my gut is telling me there is something I need to resolve upstream in my code to avoid this problem in the first place. In other words, while there may be a way to combine these two data frames into a single data frame, I suspect the root of the problem occurs earlier and needs to be resolved there. Any assistance is greatly appreciated.
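
For reference, the concatenation attempt itself is not shown above. As a minimal sketch, assuming a plain string (a URL, for example) ended up in the list passed to pd.concat, an error of this kind can be reproduced as follows. The frame and the example.com URL are made up for illustration and are not taken from my actual attempt:

```python
import pandas as pd

# Stand-in frame with made-up data; it plays the role of one page of reviews.
frame = pd.DataFrame({"rating": [5, 2]})

message = None
try:
    # Assumption: a string slipped into the list handed to pd.concat.
    pd.concat([frame, "https://example.com/?page=2"])
except TypeError as exc:
    message = str(exc)
    print(message)
```

If something like this is the cause, the fix would be to make sure every item handed to pd.concat is a DataFrame or Series.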

1 Answer

Here is an example of how you can build a dataframe for each page and, as a final step, concatenate them into one final dataframe:

import json

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

# Build the list of page URLs to scrape (pages 1 and 2).
page = 1
urls = []
while page != 3:
    url = f"https://www.trustpilot.com/review/trupanion.com?page={page}"
    urls.append(url)
    page = page + 1

# Collect one DataFrame per page here, then concatenate them after the loop.
all_dfs = []
for url in urls:
    response = requests.get(url)
    html = response.content
    soup = bs(html, "html.parser")
    results = soup.find(id="__NEXT_DATA__")
    json_object = json.loads(results.contents[0])
    reviews = json_object["props"]["pageProps"]["reviews"]
    ids = pd.Series([sub["id"] for sub in reviews])
    filtered = pd.Series([sub["filtered"] for sub in reviews])
    pending = pd.Series([sub["pending"] for sub in reviews])
    rating = pd.Series([sub["rating"] for sub in reviews])
    title = pd.Series([sub["title"] for sub in reviews])
    likes = pd.Series([sub["likes"] for sub in reviews])
    experienced = pd.Series([sub["dates"]["experiencedDate"] for sub in reviews])
    published = pd.Series([sub["dates"]["publishedDate"] for sub in reviews])
    source = url
    df = pd.DataFrame(
        {
            "id": ids,
            "filtered": filtered,
            "pending": pending,
            "rating": rating,
            "title": title,
            "likes": likes,
            "experienced": experienced,
            "published": published,
            "source": source,
        }
    )
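    # Keep this page's DataFrame instead of overwriting the previous one.
    all_dfs.append(df)

# pd.concat takes a list of DataFrames (or Series) and stacks them row-wise.
final_df = pd.concat(all_dfs)
print(final_df)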

Prints:

                          id  filtered  pending  rating                                                                            title  likes               experienced                 published                                                  source
0   660c4b524ff85128f3cd5665     False    False       5                                                                Amazing Insurance      0  2024-04-01T00:00:00.000Z  2024-04-02T20:15:47.000Z  https://www.trustpilot.com/review/trupanion.com?page=1
1   660b08acec6384757dfabdf9     False    False       5                                                  Enrollment was quick and easy!       0  2024-03-21T00:00:00.000Z  2024-04-01T21:19:09.000Z  https://www.trustpilot.com/review/trupanion.com?page=1
2   66098c1b0353405fb0313ae2     False    False       5                                            Extremely easy to understand website…      0  2024-03-28T00:00:00.000Z  2024-03-31T18:15:23.000Z  https://www.trustpilot.com/review/trupanion.com?page=1
3   660b1e164e75ffb01ee011f1     False    False       2                                                                   Too expensive       0  2024-04-01T00:00:00.000Z  2024-04-01T22:50:31.000Z  https://www.trustpilot.com/review/trupanion.com?page=1
4   66099003e9b2fe025035baef     False    False       5                                         The coverage seems really comprehensive…      0  2024-03-28T00:00:00.000Z  2024-03-31T18:32:04.000Z  https://www.trustpilot.com/review/trupanion.com?page=1
5   660b0af515413b0620a7d617     False    False       4                                            Everything was explained to us in an…      0  2024-03-29T00:00:00.000Z  2024-04-01T21:28:54.000Z  https://www.trustpilot.com/review/trupanion.com?page=1

...
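
Because pd.concat keeps each frame's original index by default, the row labels in final_df restart at 0 for every page. If you would rather have one continuous index, passing ignore_index=True is one option. Here is a minimal offline sketch of the difference, using two small made-up frames (page_1 and page_2 are stand-ins, not scraped data):

```python
import pandas as pd

# Two stand-in frames with made-up data, playing the role of per-page results.
page_1 = pd.DataFrame({"rating": [5, 2], "source": "page=1"})
page_2 = pd.DataFrame({"rating": [4, 5], "source": "page=2"})

# Default behaviour: each frame keeps its own 0-based row labels.
kept = pd.concat([page_1, page_2])

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([page_1, page_2], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Whether to renumber depends on what you do next: the repeated labels are harmless for printing, but label-based lookups such as final_df.loc[0] would return one row per page.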
1 Comment

This is exactly what I was trying to solve for, thanks a ton Andrej! First time posting here and super grateful I did. Thanks again!!
