I'm attempting to loop through multiple pages (two, for the purposes of this example) of a website, scrape the relevant customer review data, and ultimately combine everything into a single data frame.
The challenge I'm encountering is that my code appears to be producing two separate data frames within a single data frame object (df in the code below). I might be mistaken, but that's how I'm interpreting the output.
Here is a screenshot of what I'm describing above:
[Screenshot: separate data frames within a single data frame object]
Here is the code that produced the screenshot results:
from bs4 import BeautifulSoup as bs
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json
page = 1
urls = []
while page != 3:
    url = f"https://www.trustpilot.com/review/trupanion.com?page={page}"
    urls.append(url)
    page = page + 1
for url in urls:
    response = requests.get(url)
    html = response.content
    soup = bs(html, "html.parser")
    results = soup.find(id="__NEXT_DATA__")
    json_object = json.loads(results.contents[0])
    reviews = json_object["props"]["pageProps"]["reviews"]
    ids = pd.Series([ sub['id'] for sub in reviews ])
    filtered = pd.Series([ sub['filtered'] for sub in reviews ])
    pending = pd.Series([ sub['pending'] for sub in reviews ])
    rating = pd.Series([ sub['rating'] for sub in reviews ])
    title = pd.Series([ sub['title'] for sub in reviews ])
    likes = pd.Series([ sub['likes'] for sub in reviews ])
    experienced = pd.Series([ sub['dates']['experiencedDate'] for sub in reviews ])
    published = pd.Series([ sub['dates']['publishedDate'] for sub in reviews ])
    source = url
    df = pd.DataFrame({'id': ids, 'filtered': filtered, 'pending': pending, 'rating': rating,
                       'title': title, 'likes': likes, 'experienced': experienced,
                       'published': published, 'source': source})
    print(df)
I've been relying on these posts as potential solutions without any luck:
Rbind, having data frames within data frames causes errors?
Analyse data frames inside a list of data frames and store all results in single data frame
Merge multiple data frames into a single data frame in python
Specifically, I'm consistently receiving the following error:
TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid
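For context, this message is what pandas raises when the list handed to pd.concat contains something other than a Series or DataFrame. The sketch below reproduces it under assumed conditions: the tiny data frame, the placeholder URL, and the helper name try_concat are all made up for illustration and are not taken from the code above.

```python
import pandas as pd

# Made-up data; a stand-in for one page of scraped reviews.
df = pd.DataFrame({"id": ["a1", "a2"], "rating": [5, 4]})
url = "https://example.com/reviews?page=1"  # hypothetical placeholder URL


def try_concat(objs):
    """Return the concatenated frame, or the TypeError message if concat fails."""
    try:
        return pd.concat(objs, ignore_index=True)
    except TypeError as exc:
        return str(exc)


# A plain string in the list triggers the TypeError.
bad = try_concat([df, url])
# A list containing only DataFrames concatenates normally.
good = try_concat([df, df])

print(bad)
print(good.shape)
```

If this matches the situation, the likely suspect is a string (such as a URL or a column name) ending up in the list passed to pd.concat instead of a data frame.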
I'm certain the '<class 'str'>' part is a clue to what the issue is, but I've been spinning my wheels and feel like I need to go "pencils down" and ask for assistance. I'm relatively new to Python, and my gut is telling me there is something I need to resolve upstream in my code to avoid this problem in the first place. In other words, while there may be a way to combine these two data frames into a single data frame, I suspect the root of the problem occurs earlier and needs to be resolved there. Any assistance is greatly appreciated.
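One common pattern for the upstream fix described here is to build one data frame per page, collect those frames in a list, and concatenate once after the loop. The following is only a minimal sketch of that idea, not a verified fix for the scraper: the pages dictionary holds made-up review records and example.com URLs in place of live requests, and the names frames and page_df are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for json_object["props"]["pageProps"]["reviews"] on each page.
pages = {
    "https://example.com/reviews?page=1": [
        {"id": "r1", "rating": 5, "title": "Great",
         "dates": {"publishedDate": "2023-01-01"}},
        {"id": "r2", "rating": 3, "title": "Okay",
         "dates": {"publishedDate": "2023-01-02"}},
    ],
    "https://example.com/reviews?page=2": [
        {"id": "r3", "rating": 1, "title": "Poor",
         "dates": {"publishedDate": "2023-01-03"}},
    ],
}

frames = []  # one DataFrame per page, collected across loop iterations
for url, reviews in pages.items():
    page_df = pd.DataFrame({
        "id": [r["id"] for r in reviews],
        "rating": [r["rating"] for r in reviews],
        "title": [r["title"] for r in reviews],
        "published": [r["dates"]["publishedDate"] for r in reviews],
    })
    page_df["source"] = url  # scalar is broadcast to every row of this page
    frames.append(page_df)   # keep the frame instead of overwriting it

# Concatenate once, after the loop, and renumber the rows from 0.
df = pd.concat(frames, ignore_index=True)
print(df)
```

In this sketch the print call sits outside the loop, so a single combined frame is shown. Printing inside the loop, as in the original code, displays one frame per page, which may be why the output looks like two data frames inside one object.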