Python Requests to parse HTML to get CSV

Question

I am sending a POST request to a website that responds with a page displaying CSV data. The CSV is not downloadable; it only appears as text on the page, so it can be copied and pasted. I want to take the HTML returned by the POST request, extract the CSV, and write it to a CSV file so that I can then run a function on it. I have managed to get the data into CSV form as a string, but there don't appear to be any newlines, i.e.

>>> print(text1)

    "Heading 1","Heading 2""Item 1","Item 2"

not

"Heading 1","Heading 2"
"Item 1","Item 2"

Is this format OK? If not, how do I get it into a usable format? Secondly, how can I write this string to a CSV file? If I convert text1 to bytes, I get _csv.Error: iterable expected, not int; if I don't, I get TypeError: a bytes-like object is required, not 'str'.

My code so far:

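import requests
from bs4 import BeautifulSoup

# headers, data and url are defined earlier (not shown)
with requests.Session() as s: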
    response = s.post(headers=headers, data=data, url=url)
    html = response.content
    soup = BeautifulSoup(html, features="html.parser")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()  # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    text1 = text.replace(text[:56], '')
    print(text1)
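
The two errors mentioned in the question can be reproduced with the standard library alone. The sketch below is a guess at what triggered them, since the failing code is not shown: passing bytes to writerows() makes csv iterate over integers, and writing to a binary stream fails because csv.writer emits str. The sample string and every name other than text1 are illustrative.

```python
import csv
import io

# Sample data in the format the question is aiming for
text1 = '"Heading 1","Heading 2"\n"Item 1","Item 2"'

# Case 1: bytes passed to writerows(). Iterating bytes yields ints,
# and every row must itself be iterable.
try:
    csv.writer(io.StringIO()).writerows(text1.encode())
except csv.Error as exc:
    bytes_error = str(exc)

# Case 2: rows written to a binary stream. csv.writer produces str,
# which a binary stream cannot accept.
try:
    csv.writer(io.BytesIO()).writerow(["Heading 1", "Heading 2"])
except TypeError as exc:
    binary_error = str(exc)

# Working variant: parse the string into rows, then write to a text stream
rows = list(csv.reader(io.StringIO(text1)))
out = io.StringIO()
csv.writer(out).writerows(rows)
```

In both failing cases the fix is the same: keep the data as str, turn it into a list of rows, and write to a file opened in text mode.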

What code are you using to get this CSV string? And what is the type of text1?
– bherbruck, Commented May 20, 2020 at 23:55
For saving a CSV, it may be easier to give csv a list of lists, one per line, and let csv handle all the formatting.
– bherbruck, Commented May 21, 2020 at 0:12
@TenaciousB What would be the best way to do that? I'm fairly new to coding and I can't think of an easy way to do this. My CSV has a fixed number of columns but not rows, and the data in the rows varies, as it's names, times, etc. The only way I could think of to build the list is to split after every n quotes or commas, but I'm not sure I could write the code for that.
– PythonIsBae, Commented May 21, 2020 at 0:17
Is the CSV here an HTML table? I'm a bit confused about why you're using soup.get_text(), because that will give you all the text in all the HTML elements on the page. You can go to the table element and scrape the text of each table row into a list of lists, with each <td> as an item.
– bherbruck, Commented May 21, 2020 at 0:21

bherbruck · Accepted Answer · 2020-05-21 14:18:39Z

I think this will work for you. It finds the element containing the CSV data (it could be the body, a div, a p, etc.) and extracts text only from there, so you don't need to worry about scripts or other elements getting into your data:

import csv
from bs4 import BeautifulSoup

# emulate your html format
html_string = '''
<body>
<div class="csv">"Category","Position","Name","Time","Team","avg_power","20min","Male?"<br>"A","1","James ","00:21:31.45","5743","331","5.3","1"<br>"A","2","Da","00:21:31.51","4435","377","5.0","1"<br>"A","3","Timmy ","00:21:31.52","3964","371","4.8","1"<br>"A","4","Timothy ","00:21:31.83","5229","401","5.7","1"<br>"A","5","Stefan ","00:21:31.86","2991","338","","1"<br>"A","6","Josh ","00:21:31.92","","403","5.1","1"<br></div>
</body>
'''

# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html_string, 'html.parser')

for br in soup.find_all('br'):
    br.replace_with('\n')

rows = [[i.replace('"', '').strip()  # strip quotes and surrounding whitespace
         for i in item.split(',')]  # split each line on the commas
        # get all the lines inside the div;
        # find() returns the first element matching the filter
        for item in soup.find('div', class_='csv').text.splitlines()]

# csv writing function
def write_csv(path, data):
    # newline='' lets the csv module control line endings (avoids blank rows on Windows)
    with open(path, 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerows(data)

print(rows)

write_csv('./data.csv', rows)

Output (data.csv):

Category,Position,Name,Time,Team,avg_power,20min,Male?
A,1,James,00:21:31.45,5743,331,5.3,1
A,2,Da,00:21:31.51,4435,377,5.0,1
A,3,Timmy,00:21:31.52,3964,371,4.8,1
A,4,Timothy,00:21:31.83,5229,401,5.7,1
A,5,Stefan,00:21:31.86,2991,338,,1
A,6,Josh,00:21:31.92,,403,5.1,1

soup.find() and soup.find_all() let you isolate a single HTML element to scrape from, so you don't have to parse the rest of the page.
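
One caveat with splitting each line on ',' by hand is that a quoted field containing a comma would be cut in two. As a hedged alternative, the extracted text can be handed to csv.reader, which understands the quoting. In this sketch div_text is a hypothetical stand-in for the text of the div after the <br> tags have been replaced, and the "Smith, Da" value is invented to show the difference.

```python
import csv

# Hypothetical stand-in for soup.find('div', class_='csv').text
div_text = '"Category","Name","Team"\n"A","James ","5743"\n"A","Smith, Da",""\n'

# csv.reader accepts any iterable of lines and handles the quoting itself
reader = csv.reader(div_text.splitlines())

# Strip stray whitespace from every cell, as the list comprehension above does
rows = [[cell.strip() for cell in row] for row in reader]

# For comparison: a plain split breaks the quoted name into two cells
naive = [line.split(',') for line in div_text.splitlines()]
```

With the sample data from the answer, where no field contains a comma, both approaches produce the same rows.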

edited May 21, 2020 at 14:18

answered May 21, 2020 at 1:29

bherbruck


8 Comments

PythonIsBae Over a year ago

Thanks, finally what I was looking for. However, I have the output as: Heading1Item1. There are no commas and it only outputs the first row.

PythonIsBae Over a year ago

I think this is due to the format of the HTML:

"Category","Position","Name","Time","Team","avg_power","20min","Male?"<br>"A","1","James  ","00:21:31.45","5743","331","5.3","1"<br>"A","2","Da","00:21:31.51","4435","377","5.0","1"<br>"A","3","Timmy ","00:21:31.52","3964","371","4.8","1"<br>"A","4","Timothy ","00:21:31.83","5229","401","5.7","1"<br>"A","5","Stefan ","00:21:31.86","2991","338","","1"<br>"A","6","Josh ","00:21:31.92","","403","5.1","1"<br>

The HTML does not have newlines in it, so I guess I need to add them in the script?

bherbruck Over a year ago

I added a step to the answer that replaces the <br> tags with \n.

bherbruck Over a year ago

are you doing any soup processing before the replacement?

bherbruck Over a year ago

In for br in soup.find_all('br'): br.replace_with('\n') you had 'body'; you want 'br'. I also updated the code to fit your data format and null cells. You should be able to copy it exactly, replace the html_string assignment before the soup = BeautifulSoup(...) line with html_string = response.content, and it should work.
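
Putting the comments above together, the whole extraction could be wrapped in one function that takes the response body directly. This is only a sketch: html_to_rows is a hypothetical helper name, the sample bytes stand in for response.content, and the div with class "csv" comes from the emulated HTML in the answer, so the real page may need a different selector.

```python
import csv
from bs4 import BeautifulSoup


def html_to_rows(html):
    """Return the CSV rows found in the first <div class="csv"> of html."""
    soup = BeautifulSoup(html, 'html.parser')

    # Replace every <br> with a newline so each record sits on its own line
    for br in soup.find_all('br'):
        br.replace_with('\n')

    div = soup.find('div', class_='csv')  # assumed container, see lead-in
    if div is None:
        return []

    reader = csv.reader(div.get_text().splitlines())
    return [[cell.strip() for cell in row] for row in reader]


# Stand-in for response.content (bytes) from the POST request
sample = b'<body><div class="csv">"Category","Name"<br>"A","James "<br>"A","Da"<br></div></body>'
rows = html_to_rows(sample)
```

The resulting list of lists can be passed straight to the write_csv function from the answer.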
