
I have scraped a website for HTML links and have a result of about 500 links. When I try to write them to a CSV file, I do not get the list, only the base page URL.

Here is my code:

import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
print(page.status_code)
soup = BeautifulSoup(page.text, 'html.parser')
link_set = set()
for link in soup.find_all('a'):
    web_links = link.get("href")
    print(web_links)

csvfile = open('code_python.csv', 'w+', newline='')
writer = csv.writer(csvfile)
writer.writerow(['Links'])
writer.writerow([web_links])
csvfile.close()

I only get two lines in my CSV file: the header 'Links' and www.census.gov. I have tried changing it by adding another for loop in the CSV writer area, but I get similar results.

for link in soup.find_all('a'):
    web_links = link.get('href')
    abs_url = join(page, web_links)
    print(abs_url)
    if abs_url and abs_url not in link_set:
        writer.write(str(abs_url) + "\n")
        link_set.add(abs_url)

It seems the 'web_links' definition should be where I put all the links into the csv file, but no dice. Where am I making my mistake?

2 Answers


In your code, you are writing only two rows to the CSV, i.e.

 writer.writerow(['Links'])
 writer.writerow([web_links]) 

Here web_links holds only the last href value retrieved by the loop, because the writerow call sits outside the loop.

I don't see any use of the set instance. You can print and write to the CSV without it, in the following way:

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
print(page.status_code)
soup = BeautifulSoup(page.text, 'html.parser')
csvfile = open('code_python.csv', 'w+', newline='')
writer = csv.writer(csvfile)
writer.writerow(['Links'])
for link in soup.find_all('a'):
    web_links = link.get("href")
    if web_links:
        print(web_links)
        writer.writerow([web_links])
csvfile.close()
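As a side note, a `with` block closes the file automatically, even if an error occurs partway through, so the explicit `csvfile.close()` is not needed. A minimal sketch of the same writing pattern, using placeholder links in place of the scraped hrefs:

```python
import csv

# Placeholder hrefs standing in for the values scraped from the page
web_links = ['/programs-surveys/popest.html',
             'https://www.census.gov/data.html']

with open('code_python.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Links'])       # header row
    for link in web_links:
        writer.writerow([link])      # one row per link, inside the loop
# file is closed here automatically
```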

2 Comments

Thank you. I kept looking at my code and thinking it looked very redundant, but with my low skill level, I wasn't quite sure where things went, especially with the loops.
With more practice, you can build up your confidence. If this answer helped you to solve your problem, you can accept it as an answer :)

You never added the scraped links to your set():

import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
print(page.status_code)
soup = BeautifulSoup(page.text, 'html.parser')
link_set = set()
for link in soup.find_all('a'):
    web_links = link.get("href")
    print(web_links)
    link_set.add(web_links)

csvfile = open('code_python.csv', 'w+', newline='')
writer = csv.writer(csvfile)
writer.writerow(['Links'])
for link in link_set:
    writer.writerow([link])
csvfile.close()
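If you also want absolute URLs (which the `abs_url = join(page, web_links)` attempt in the question seems to be reaching for), `urllib.parse.urljoin` from the standard library resolves a relative href against the page's URL. A sketch with sample hrefs, not taken from the actual page:

```python
from urllib.parse import urljoin

base = 'https://www.census.gov/programs-surveys/popest.html'
# Sample hrefs as they might appear in the page's <a> tags
hrefs = ['/data/tables.html',          # root-relative
         'popest/about.html',          # relative to the current path
         'https://example.com/x.html'] # already absolute, left unchanged

absolute = [urljoin(base, h) for h in hrefs]
print(absolute[0])  # https://www.census.gov/data/tables.html
```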

2 Comments

Thank you for showing me the link_set.add piece. I knew that the set() function would take out duplicate links, but thought what I had done was enough. Looking at it with your piece, I see what you're saying where I didn't add anything to link_set. That was a forehead slap on me. I appreciate your feedback.
My pleasure. :)
