New to Python and still learning the ropes. Any assistance greatly appreciated.

My script pulls JSON data and writes it to a CSV file. However, the resulting output contains a fair amount of HTML that I'd like to remove. I'm attempting to do this with a regular expression, but I'm having difficulty putting it all together.

The structure of the JSON is below. The field I want to strip HTML from is "message".

 ],
  "view": [
    {
      "id": 109205,
      "user_id": 6354,
      "parent_id": null,
      "created_at": "2020-11-03T23:32:49Z",
      "updated_at": "2020-11-03T23:32:49Z",
      "rating_count": null,
      "rating_sum": null,
      "message": "<b> message text1 </b>", # <<<< section of json needing html parsing
      "replies": [
        {
          "id": 109298,
          "user_id": 5457,
          "parent_id": 109205,
          "created_at": "2020-11-04T19:42:59Z",
          "updated_at": "2020-11-04T19:42:59Z",
          "rating_count": null,
          "rating_sum": null,
          "message": "message text2"
        },
        {
         #json continues

The regular expression I'm using to strip the HTML tags is r'<[^>]+>' (found in this tutorial). In this case, I'm hoping the "message" field output will be "message text1" instead of "<b> message text1 </b>".

Here's my current code. The script pulls data from the JSON and writes it to a CSV in a loop. I tried to apply the regex in a function, remove_tags, and call that function in the final part of the script, where the data is written to the CSV.

import csv
import json
import requests
import re

url = URL
headers = {'Authorization' : 'KEY'}
r = requests.get(url, headers=headers)
data = r.json()

file = open("Control2.csv", "w", newline="")
writer = csv.writer(file)
headers = ["user_id", "text"]

writer.writerow(headers)

message_length = len(data["view"])

user_id_list = []
for item in range(message_length):
    user_id_list.append(data['view'][item]['user_id'])

TAG_RE = re.compile(r'<[^>]+>') # this is the regex for removing html code
def remove_tags(message_list):
    return TAG_RE.sub('', message_list) # message_list below is the loop for writing the json data for this field to csv

message_list = []
for item in range(message_length):
    message_list.append(data['view'][item]['message'])

for w in range(message_length):
    writer.writerow([user_id_list[w], remove_tags(message_list)[w]]) # writes "user_id" and parsed data for "message" fields into csv 

However, I receive the following error when executing this code:

TypeError: expected string or bytes-like object
Traceback:
File "c:\users\danie\appdata\local\programs\python\python39\lib\site-packages\streamlit\script_runner.py", line 350, in _run_script
    exec(code, module.__dict__)
File "C:\Users\danie\Desktop\Python\Streamlit\CoCalc\level_1a.py", line 62, in <module>
    writer.writerow([user_id_list[w], remove_tags(message_list)[w]])
File "C:\Users\danie\Desktop\Python\Streamlit\CoCalc\level_1a.py", line 37, in remove_tags
    return TAG_RE.sub('', message_list)

I'm not sure where to turn next. Any guidance on where I'm going wrong, or on other ways to strip the HTML from this field? All input greatly appreciated.

5 Comments

  • You are passing your message_list list to the regex's sub method, which mirrors re.sub(pattern, repl, string, count=0, flags=0). You should provide a string instead. Commented Aug 2, 2021 at 14:32
  • Does iterating through your message_list in your last loop like this work: for i, m in zip(user_id_list, message_list)? You would then call the write method as writer.writerow([i, m]). Commented Aug 2, 2021 at 14:45
  • braulio, thanks for the correction. The script runs with your edit, but the regex didn't take effect. Do I need to figure out how to run the function before writing? Commented Aug 2, 2021 at 14:58
  • You can clean your strings when appending them to your list, here: message_list.append(data['view'][item]['message']). You will get something like this: message_list.append(re.sub(r'<[^>]+>', '', data['view'][item]['message'])). Commented Aug 2, 2021 at 15:02
  • I think I figured it out: for i, m in zip(user_id_list, message_list): writer.writerow([i, remove_tags(m)]) Commented Aug 2, 2021 at 15:03
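
Putting the comments together, the fix is to call remove_tags on each message string rather than on the whole list. A minimal sketch of that idea follows. The `data` dict is a hypothetical stand-in for the API response, `write_rows` is a helper name made up for illustration, and an in-memory buffer replaces the CSV file so the example runs on its own. The `.strip()` call is an extra assumption, added only to drop the spaces left behind once the tags are gone.

```python
import csv
import io
import re

TAG_RE = re.compile(r'<[^>]+>')  # same pattern as in the question


def remove_tags(message):
    """Strip HTML tags from a single string (not from a list)."""
    return TAG_RE.sub('', message)


# Hypothetical stand-in for r.json(); only the fields used here are included.
data = {
    "view": [
        {"user_id": 6354, "message": "<b> message text1 </b>"},
        {"user_id": 5457, "message": "message text2"},
    ]
}


def write_rows(data, out):
    """Write one cleaned row per top-level item in data['view']."""
    writer = csv.writer(out)
    writer.writerow(["user_id", "text"])
    for item in data["view"]:
        # Clean each message individually before writing it.
        writer.writerow([item["user_id"], remove_tags(item["message"]).strip()])


buffer = io.StringIO()  # stands in for open("Control2.csv", "w", newline="")
write_rows(data, buffer)
print(buffer.getvalue())
```

Iterating over data["view"] directly also removes the need for the separate user_id_list and message_list index loops.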

1 Answer

For this, you can use the bs4 (Beautiful Soup) library:

import bs4 

#... your code

def remove_tags(message):
    return bs4.BeautifulSoup(message, "lxml").text

message_list = []
for item in range(message_length):
    message_list.append(remove_tags(data['view'][item]['message']))

for w in range(message_length):
    writer.writerow([user_id_list[w], message_list[w]])
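
Note that the "lxml" parser needs the lxml package installed in addition to bs4. If installing third-party packages is not an option, a similar result can be sketched with the standard library's html.parser module. This is a swapped-in alternative rather than the bs4 approach above, and the class name _TextExtractor is made up for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def remove_tags(message):
    """Return the text content of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(message)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(remove_tags("<b> message text1 </b>"))
```

Unlike the regex, this also decodes entities, so "Fish &amp;amp; chips" would not end up in the CSV with the raw entity left in.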

2 Comments

Many thanks for the assistance. Unfortunately, using this gives the following error: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0). Any suggestions on the issue?
That error comes from the json library, so it is not caused by the lines suggested above.
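
One way to narrow down a JSONDecodeError like this is to inspect the response before decoding it. "Expecting value: line 1 column 1 (char 0)" usually means the body was empty or was not JSON at all, for example an HTML error page. The sketch below assumes a hypothetical helper, parse_json_body, that takes the status code and body text (r.status_code and r.text in the question's script). Treating only status 200 as success is also an assumption.

```python
import json


def parse_json_body(status_code, text):
    """Decode a response body, raising a descriptive error if it is not JSON."""
    if status_code != 200:
        raise ValueError(f"request failed with HTTP status {status_code}")
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Show the start of the body so an HTML error page is easy to spot.
        preview = text[:80]
        raise ValueError(
            f"response body is not JSON; starts with: {preview!r}"
        ) from exc


print(parse_json_body(200, '{"view": []}'))
```

In the original script this would replace data = r.json() with data = parse_json_body(r.status_code, r.text).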
