Python: Parsing html tags from json data

Question

New to Python and still learning the ropes. Any assistance greatly appreciated.

My script pulls JSON data and writes it to a CSV. However, the resulting output contains a fair bit of HTML that I'd like to remove. I'm attempting to do this with regular expressions, but I'm having difficulty putting it all together.

The structure of the JSON is below. The field I want to strip HTML from is "message".

 ],
  "view": [
    {
      "id": 109205,
      "user_id": 6354,
      "parent_id": null,
      "created_at": "2020-11-03T23:32:49Z",
      "updated_at": "2020-11-03T23:32:49Z",
      "rating_count": null,
      "rating_sum": null,
      "message": "<b> message text1 </b>", # <<<< section of json needing html parsing
      "replies": [
        {
          "id": 109298,
          "user_id": 5457,
          "parent_id": 109205,
          "created_at": "2020-11-04T19:42:59Z",
          "updated_at": "2020-11-04T19:42:59Z",
          "rating_count": null,
          "rating_sum": null,
          "message": "message text2"
        },
        {
         #json continues

The regular expression I'm using to strip the HTML tags is r'<[^>]+>' (found in this tutorial). In this case, I'm hoping the "message" field output will be "message text1" instead of "<b> message text1 </b>".
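
As a rough illustration of what that pattern does to a single string, here is a minimal sketch using the sample value from the JSON excerpt above. The .strip() call is an addition that is not in the original script: the pattern removes the tags but leaves behind the spaces that sat inside them.

```python
import re

# '<', then one or more characters that are not '>', then '>'
TAG_RE = re.compile(r'<[^>]+>')

sample = "<b> message text1 </b>"  # value copied from the JSON excerpt

stripped = TAG_RE.sub('', sample)  # tags removed, inner whitespace kept
cleaned = stripped.strip()         # optional: trim the leftover spaces

print(repr(stripped))
print(repr(cleaned))
```

Note that sub() here receives one string at a time, which matters for the error described further down.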

Here's my current code. The script pulls data from the JSON and writes it to a CSV in a loop. I tried to apply the regex in a function, remove_tags, and to call that function in the final part of the script, where the data is written to the CSV.

import csv
import json
import requests
import re

url = URL
headers = {'Authorization' : 'KEY'}
r = requests.get(url, headers=headers)
data = r.json()

file = open("Control2.csv", "w", newline="")
writer = csv.writer(file)
headers = ["user_id", "text"]

writer.writerow(headers)

message_length = len(data["view"])

user_id_list = []
for item in range(message_length):
    user_id_list.append(data['view'][item]['user_id'])

TAG_RE = re.compile(r'<[^>]+>') # this is the regex for removing html code
def remove_tags(message_list):
    return TAG_RE.sub('', message_list) # message_list below is the loop for writing the json data for this field to csv

message_list = []
for item in range(message_length):
    message_list.append(data['view'][item]['message'])

for w in range(message_length):
    writer.writerow([user_id_list[w], remove_tags(message_list)[w]]) # writes "user_id" and parsed data for "message" fields into csv

However, I receive the following error when executing this code:

TypeError: expected string or bytes-like object
Traceback:
File "c:\users\danie\appdata\local\programs\python\python39\lib\site-packages\streamlit\script_runner.py", line 350, in _run_script
    exec(code, module.__dict__)
File "C:\Users\danie\Desktop\Python\Streamlit\CoCalc\level_1a.py", line 62, in <module>
    writer.writerow([user_id_list[w], remove_tags(message_list)[w]])
File "C:\Users\danie\Desktop\Python\Streamlit\CoCalc\level_1a.py", line 37, in remove_tags
    return TAG_RE.sub('', message_list)

I'm uncertain where to turn next to solve this. Any guidance on where I'm going wrong, or on other approaches to stripping HTML from this field? All input greatly appreciated.

You are passing your message_list list to the regex's sub method, which has the following definition: re.sub(pattern, repl, string, count=0, flags=0). You should provide a string instead.
– George, Commented Aug 2, 2021 at 14:32
Does iterating through your message_list in your last loop like this work: for i, m in zip(user_id_list, message_list)? Then you would call the write method as writer.writerow([i, m]).
– braulio, Commented Aug 2, 2021 at 14:45
braulio, thanks for the correction. The script runs with your edit, but the regex didn't take. Do I need to figure out how to run the function before writing it?
– Daniel Hutchinson, Commented Aug 2, 2021 at 14:58
You can clean your strings when appending them to your list here: message_list.append(data['view'][item]['message']). You will get something like this: message_list.append(re.sub(r'<[^>]+>', '', data['view'][item]['message'])).
– George, Commented Aug 2, 2021 at 15:02
I think I figured it out: for i, m in zip(user_id_list, message_list): writer.writerow([i, remove_tags(m)])
– Daniel Hutchinson, Commented Aug 2, 2021 at 15:03
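
Putting the comments above together, the fix can be sketched roughly as follows. This is an illustration under assumptions, not the asker's final script: the data dictionary is a hypothetical stand-in for r.json(), shaped like the excerpt in the question, and the CSV is written to an in-memory buffer instead of Control2.csv so the sketch touches no files.

```python
import csv
import io
import re

TAG_RE = re.compile(r'<[^>]+>')


def remove_tags(message):
    """Strip tags from ONE string (not from the whole list)."""
    return TAG_RE.sub('', message)


# Hypothetical stand-in for r.json(); values are illustrative only.
data = {
    "view": [
        {"user_id": 6354, "message": "<b> message text1 </b>"},
        {"user_id": 5457, "message": "message text2"},
    ]
}

user_id_list = [item["user_id"] for item in data["view"]]
message_list = [item["message"] for item in data["view"]]

buffer = io.StringIO(newline="")  # in-memory file instead of Control2.csv
writer = csv.writer(buffer)
writer.writerow(["user_id", "text"])

# Pair each user id with its message and clean each message individually.
for user_id, message in zip(user_id_list, message_list):
    writer.writerow([user_id, remove_tags(message)])

# Read the CSV text back to check what was written.
rows = list(csv.reader(io.StringIO(buffer.getvalue(), newline="")))
print(rows)
```

The original call, remove_tags(message_list)[w], handed the whole list to sub(), which is what raised the TypeError; indexing has to happen before the function is called, or the loop has to yield one string at a time as above.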

dimay · Accepted Answer · 2021-08-02 14:54:03Z

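For this you can use the bs4 (Beautiful Soup) library. The "lxml" parser used below has to be installed separately; Python's built-in "html.parser" can be passed instead: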

import bs4

# ... your code

def remove_tags(message):
    return bs4.BeautifulSoup(message, "lxml").text

message_list = []
for item in range(message_length):
    message_list.append(remove_tags(data['view'][item]['message']))

for w in range(message_length):
    writer.writerow([user_id_list[w], message_list[w]])
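
If installing bs4 and lxml is not an option, a similar result can be sketched with the standard library's html.parser module. To be clear, this is a swapped-in technique and not what the answer above uses; the class name TextExtractor is hypothetical and chosen only for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text between tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text content; tags themselves never reach this method.
        self.parts.append(data)


def remove_tags(message):
    parser = TextExtractor()
    parser.feed(message)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(repr(remove_tags("<b> message text1 </b>")))
print(repr(remove_tags("message text2")))
```

Unlike the regex, a parser-based approach also decodes character references such as &amp;, which may or may not be what you want in the CSV.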

edited Aug 2, 2021 at 14:54

answered Aug 2, 2021 at 14:33


2 Comments

Daniel Hutchinson Over a year ago

Many thanks for the assistance. Unfortunately, using this gives the following error: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0). Any suggestions on the issue?

braulio Over a year ago

That error comes from the json library, so it is not caused by the lines suggested above.
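
The message quoted in the comment above is what json typically raises when the text it receives does not start with a JSON value, for example an empty body or an HTML error page. A minimal sketch of how to check for that, using plain strings as hypothetical stand-ins for the response body (the helper name parse_body is an assumption for illustration, not part of requests):

```python
import json


def parse_body(text):
    """Return the parsed JSON, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Show the start of the body so it is clear what actually came back.
        print(f"Not JSON ({exc}); body starts with: {text[:40]!r}")
        return None


print(parse_body('{"view": []}'))               # valid JSON
print(parse_body(''))                           # empty body
print(parse_body('<html>Unauthorized</html>'))  # HTML instead of JSON
```

In the original script, the equivalent check would be to inspect r.status_code and r.text before calling r.json(), since a failed or unauthorized request is a likely reason for a non-JSON body.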
