New to Python and still learning the ropes. Any assistance greatly appreciated.
My script pulls JSON data and writes it to a CSV. However, there's a fair bit of HTML in the resulting output that I'd like to remove. I'm attempting to do this with a regular expression, but I'm having difficulty putting it all together.
The structure of the JSON is below. The field I want to strip HTML from is "message".
],
"view": [
    {
        "id": 109205,
        "user_id": 6354,
        "parent_id": null,
        "created_at": "2020-11-03T23:32:49Z",
        "updated_at": "2020-11-03T23:32:49Z",
        "rating_count": null,
        "rating_sum": null,
        "message": "<b> message text1 </b>", # <<<< section of json needing html parsing
        "replies": [
            {
                "id": 109298,
                "user_id": 5457,
                "parent_id": 109205,
                "created_at": "2020-11-04T19:42:59Z",
                "updated_at": "2020-11-04T19:42:59Z",
                "rating_count": null,
                "rating_sum": null,
                "message": "message text2"
            },
            {
            #json continues
The regular expression I'm using to parse the html strings is: r'<[^>]+>' (found on this tutorial).
In this case, I'm hoping the "message" field output will be "message text1" instead of "<b> message text1 </b>".
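As a quick check of what that pattern does on a single string, here is a minimal sketch. The sample string is taken from the JSON above; the .strip() call is an added assumption, since removing the tags alone leaves the spaces that sat inside them.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')  # matches anything shaped like <...>

sample = "<b> message text1 </b>"
stripped = TAG_RE.sub('', sample)  # tags removed, surrounding spaces remain
cleaned = stripped.strip()         # optional: trim leading/trailing whitespace

print(repr(stripped))
print(repr(cleaned))
```

Note that the pattern works on one string at a time, which matters for how it is called later in the script.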
Here's my current code. The script pulls data from the JSON and writes it to a CSV in a loop. I attempted to apply the regex in a function, remove_tags, and call that function in the final part of the script, where the data is written to the CSV.
import csv
import json
import requests
import re
url = URL
headers = {'Authorization' : 'KEY'}
r = requests.get(url, headers=headers)
data = r.json()
file = open("Control2.csv", "w", newline="")
writer = csv.writer(file)
headers = ["user_id", "text"]
writer.writerow(headers)
message_length = len(data["view"])
user_id_list = []
for item in range(message_length):
    user_id_list.append(data['view'][item]['user_id'])
TAG_RE = re.compile(r'<[^>]+>') # this is the regex for removing html code
def remove_tags(message_list):
    return TAG_RE.sub('', message_list) # message_list below is the loop for writing the json data for this field to csv
message_list = []
for item in range(message_length):
    message_list.append(data['view'][item]['message'])
for w in range(message_length):
    writer.writerow([user_id_list[w], remove_tags(message_list)[w]]) # writes "user_id" and parsed data for "message" fields into csv
However, I receive the following error when executing this code:
TypeError: expected string or bytes-like object
Traceback:
  File "c:\users\danie\appdata\local\programs\python\python39\lib\site-packages\streamlit\script_runner.py", line 350, in _run_script
    exec(code, module.__dict__)
  File "C:\Users\danie\Desktop\Python\Streamlit\CoCalc\level_1a.py", line 62, in <module>
    writer.writerow([user_id_list[w], remove_tags(message_list)[w]])
  File "C:\Users\danie\Desktop\Python\Streamlit\CoCalc\level_1a.py", line 37, in remove_tags
    return TAG_RE.sub('', message_list)
I'm uncertain on where to turn next to solve this. Any guidance on where I'm going wrong, or other directions for parsing html data from this field? All input greatly appreciated.
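For reference, the TypeError in the traceback can be reproduced in isolation. This is a minimal sketch, assuming a short hypothetical list named messages in place of message_list: the compiled pattern's sub method accepts a single string, so handing it the whole list fails, while handing it one element succeeds.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

# Hypothetical stand-in for message_list
messages = ["<b> message text1 </b>", "message text2"]

try:
    TAG_RE.sub('', messages)  # whole list passed, as in remove_tags(message_list)
    error_text = None
except TypeError as exc:
    error_text = str(exc)     # exact wording varies slightly between Python versions

# Passing one element at a time works.
first = TAG_RE.sub('', messages[0])

print(error_text)
print(repr(first))
```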
You're passing the list message_list to the regex's sub method, which has the following definition: re.sub(pattern, repl, string, count=0, flags=0). You should provide a string instead.
Would iterating over message_list in your last loop, like this: for i, m in zip(user_id_list, message_list), work? Then you would call the write method as writer.writerow([i, m]).
Alternatively, apply the regex where you build the list, in message_list.append(data['view'][item]['message']). You will get something like this: message_list.append(re.sub(r'<[^>]+>', '', data['view'][item]['message'])).
Or keep remove_tags and call it on each message as you write the row:
for i, m in zip(user_id_list, message_list):
    writer.writerow([i, remove_tags(m)])
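The suggestions above can be combined into one self-contained sketch. It is not the asker's exact script: the data dictionary is a hypothetical stand-in for r.json(), the output goes to an in-memory buffer instead of Control2.csv, write_messages is an illustrative helper name, and the .strip() call is an added assumption to trim the spaces left behind by the removed tags.

```python
import csv
import io
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    """Strip HTML-like tags from a single string."""
    return TAG_RE.sub('', text)

# Hypothetical sample standing in for r.json(); only the fields used are included.
data = {
    "view": [
        {"user_id": 6354, "message": "<b> message text1 </b>"},
        {"user_id": 1001, "message": "<p>plain <i>text</i></p>"},
    ]
}

def write_messages(data, out):
    """Write one row per top-level entry in data['view'], with tags removed."""
    writer = csv.writer(out)
    writer.writerow(["user_id", "text"])
    for item in data["view"]:
        # remove_tags receives one string per call, never the whole list
        writer.writerow([item["user_id"], remove_tags(item["message"]).strip()])

buffer = io.StringIO()  # in the real script: open("Control2.csv", "w", newline="")
write_messages(data, buffer)
print(buffer.getvalue())
```

Iterating over data["view"] directly avoids the parallel user_id_list and message_list and the index bookkeeping that goes with them. Like the original script, this only reads top-level entries and does not descend into "replies".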