2

I have a list of URLs in a column of a CSV file. I would like to use Python to go through all the URLs, download a specific part of the HTML from each URL, and save it to the next column.

For example: From this URL I would like to extract this div and write it to the next column.

<div class="info-holder" id="product_bullets_section">
<p>
VM−2N ist ein Hochleistungs−Verteilverstärker für Composite− oder SDI−Videosignale und unsymmetrisches Stereo−Audio. Das Eingangssignal wird entkoppelt und isoliert, anschließend wird das Signal an zwei identische Ausgänge verteilt.
<span id="decora_msg_container" class="visible-sm-block visible-md-block visible-xs-block visible-lg-block"></span>
</p>
<ul>
<li>
<span>Hohe Bandbreite — 400 MHz (–3 dB).</span>
</li>
<li>
<span>Desktop–Grösse — Kompakte Bauform, zwei Geräte können mithilfe des optionalen Rackadapters RK–1 in einem 19 Zoll Rack auf 1 HE nebeneinander montiert werden.</span>
</li>
</ul>
</div>

I have this code so far; the downloaded HTML is stored in the variable html:

import csv
import urllib.request

with open("urls.csv", "r", newline="", encoding="cp1252") as f_input:
    csv_reader = csv.reader(f_input, delimiter=";", quotechar="|")
    header = next(csv_reader)
    items = [row[0] for row in csv_reader]

with open("results.csv", "w", newline="") as f_output:
    csv_writer = csv.writer(f_output, delimiter=";")
    for item in items:
        html = urllib.request.urlopen(item).read()

Currently the HTML code is pretty ugly. How could I remove everything from the variable html except the div I would like to extract?

3 Answers

3

Given that all of your webpages have the same structure, you can parse the HTML with this code. It looks for the first div with the id product_bullets_section. An id in HTML should be unique, but the given website uses this id twice, so we take the first match by indexing with [0] and convert the parsed div back to a string containing your HTML.

import csv
import urllib.request

from bs4 import BeautifulSoup

with open("urls.csv", "r", newline="", encoding="cp1252") as f_input:
    csv_reader = csv.reader(f_input, delimiter=";", quotechar="|")
    header = next(csv_reader)
    items = [row[0] for row in csv_reader]

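# For a quick test without a CSV file, uncomment the next line:
# items = ['https://www.kramerav.com/de/Product/VM-2N']
with open("results.csv", "w", newline="", encoding="utf-8") as f_output:
    csv_writer = csv.writer(f_output, delimiter=";")
    for item in items:
        html = urllib.request.urlopen(item).read()
        # The page uses this id twice, so take the first match.
        the_div = str(BeautifulSoup(html, "html.parser").select('div#product_bullets_section')[0])
        # Write the URL and the extracted div as two columns.
        csv_writer.writerow([item, the_div])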

4 Comments

Hi, thank you so much for your help! Sometimes the sites in my list don't have the div I'm searching for. Python then exits with this error:
  File "/home/dun/_workspace/py/search-articles/save-html.py", line 15, in <module>
    div = str(BeautifulSoup(html).select("div#product_bullets_section")[0])
IndexError: list index out of range
Do you know how I could just write a space in the row and continue with the next URL?
For this I would need to know the rest of your code, specifically the part dealing with the file. But you can use a try-except block to write an empty value if a site doesn't have this div. Maybe it will look like this:
try:
    the_div = str(BeautifulSoup(html).select('div#product_bullets_section')[0])
except IndexError:
    the_div = ''
finally:
    f_output.write(the_div)
Thank you so much! I used except IndexError: div="". Everything works now.
I'm glad that I could help you.
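
To expand on the try/except suggestion from the comments, here is a minimal sketch of the whole loop that leaves the cell empty when the div is missing or the download fails. The names fetch, write_results, extract and fetcher are hypothetical and not part of the original answer; extract stands for whatever extraction expression you use, for example the BeautifulSoup one-liner above.

```python
import csv
import urllib.request


def fetch(url):
    """Download a page; return b"" if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read()
    except (OSError, ValueError):
        # URLError/HTTPError and timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return b""


def write_results(urls, extract, f_output, fetcher=fetch):
    """Write one row per URL: the URL and the extracted HTML (or "")."""
    csv_writer = csv.writer(f_output, delimiter=";")
    for url in urls:
        try:
            cell = extract(fetcher(url))
        except IndexError:
            # Raised by [0] when the page has no matching div.
            cell = ""
        csv_writer.writerow([url, cell])
```

With the first answer's approach, the call could look like write_results(items, lambda html: str(BeautifulSoup(html, "html.parser").select('div#product_bullets_section')[0]), f_output). Passing the fetcher as a parameter is only there so the loop can be tried without network access.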
2

In this example, you can use BeautifulSoup to get the div with a specific id:

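from bs4 import BeautifulSoup

# html is the page source downloaded earlier
soup = BeautifulSoup(html, "html.parser")
# find() returns the first matching element, or None if there is none
div = soup.find(id="product_bullets_section")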

1

Why not use html.parser - Simple HTML and XHTML parser?

Example:

from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("     attr:", attr)

    def handle_endtag(self, tag):
        print("End tag  :", tag)

    def handle_data(self, data):
        print("Data     :", data)

    def handle_comment(self, data):
        print("Comment  :", data)

    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent  :", c)

    def handle_decl(self, data):
        print("Decl     :", data)

parser = MyHTMLParser()

and then use parser.feed(data) (where data is a str)
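
The example above only prints what the parser sees. As a rough sketch of how the same callbacks could collect the markup of one element by its id, something like the following might work. DivExtractor, extract_by_id and VOID_TAGS are hypothetical names, and the depth counting assumes reasonably well-formed markup in which every non-void tag has an explicit end tag. The input must be a str, so bytes returned by urlopen(...).read() need to be decoded first.

```python
from html.parser import HTMLParser

# Void elements have no end tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DivExtractor(HTMLParser):
    """Collect the raw markup of the first element with a given id."""

    def __init__(self, target_id):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.target_id = target_id
        self.depth = 0      # nesting depth inside the target element
        self.parts = []     # pieces of markup collected so far
        self.done = False   # True once the target element has been closed

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            if dict(attrs).get("id") != self.target_id:
                return
            self.depth = 1
        elif tag not in VOID_TAGS:
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: keep the text, leave depth alone.
        if self.depth and not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth and not self.done and tag not in VOID_TAGS:
            self.parts.append(f"</{tag}>")
            self.depth -= 1
            self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth and not self.done:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth and not self.done:
            self.parts.append(f"&#{name};")


def extract_by_id(html, target_id):
    """Return the element's markup as a string, or "" if it is not found."""
    parser = DivExtractor(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts) if parser.done else ""
```

Calling extract_by_id(html, "product_bullets_section") would then give the div as a string, or an empty string for pages that do not contain it. For messy real-world pages a dedicated library such as BeautifulSoup is likely more forgiving than this hand-written depth counter.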
