
I have scraped a website for HTML links and have a result of about 500 links. When I try to write them to a CSV file, I do not get the list, only the base page URL.

Here is my code:

import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
print(page.status_code)
soup = BeautifulSoup(page.text, 'html.parser')
link_set = set()
for link in soup.find_all('a'):
    web_links = link.get("href")
    print(web_links)

csvfile = open('code_python.csv', 'w+', newline='')
writer = csv.writer(csvfile)
writer.writerow(['Links'])
writer.writerow([web_links])
csvfile.close()

I only get two lines in my CSV file: the header 'Links' and www.census.gov. I have tried changing it by adding another for loop in the CSV writer area, but I get similar results.

for link in soup.find_all('a'):
    web_links = link.get('href')
    abs_url = join(page, web_links)
    print(abs_url)
    if abs_url and abs_url not in link_set:
        writer.write(str(abs_url) + "\n")
        link_set.add(abs_url)

It seems the 'web_links' definition should be where I put all the links into the csv file, but no dice. Where am I making my mistake?

2 Answers


In your code, you are writing only two rows to the CSV, i.e.

 writer.writerow(['Links'])
 writer.writerow([web_links]) 

Here web_links holds only the last href value retrieved by the loop, because the writerow call sits outside the loop.

I don't see any use of the set instance. You can print and write to the CSV without it, in the following way:

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
print(page.status_code)
soup = BeautifulSoup(page.text, 'html.parser')
csvfile = open('code_python.csv', 'w+', newline='')
writer = csv.writer(csvfile)
writer.writerow(['Links'])
for link in soup.find_all('a'):
    web_links = link.get("href")
    if web_links:
        print(web_links)
        writer.writerow([web_links])
csvfile.close()
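As a side note, a `with` block closes the file automatically, even if an error occurs partway through, so the explicit `csvfile.close()` is not needed. A minimal sketch of the same writing pattern, using placeholder links in place of the scraped hrefs:

```python
import csv

# Placeholder hrefs standing in for the values scraped from the page
web_links = ['/programs-surveys/popest.html',
             'https://www.census.gov/data.html']

with open('code_python.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Links'])       # header row
    for link in web_links:
        writer.writerow([link])      # one row per link, inside the loop
# file is closed here automatically
```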

2 Comments

Thank you. I kept looking at my code and thinking it looked very redundant, but with my low skill level, I wasn't quite sure where things went, especially with the loops.
With more practice, you can build up your confidence. If this answer helped you to solve your problem, you can accept it as an answer :)

You never added the scraped links to your set():

import requests
from bs4 import BeautifulSoup
import csv

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
print(page.status_code)
soup = BeautifulSoup(page.text, 'html.parser')
link_set = set()
for link in soup.find_all('a'):
    web_links = link.get("href")
    print(web_links)
    link_set.add(web_links)

csvfile = open('code_python.csv', 'w+', newline='')
writer = csv.writer(csvfile)
writer.writerow(['Links'])
for link in link_set:
    writer.writerow([link])
csvfile.close()
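If you also want absolute URLs (which the `abs_url = join(page, web_links)` attempt in the question seems to be reaching for), `urllib.parse.urljoin` from the standard library resolves a relative href against the page's URL. A sketch with sample hrefs, not taken from the actual page:

```python
from urllib.parse import urljoin

base = 'https://www.census.gov/programs-surveys/popest.html'
# Sample hrefs as they might appear in the page's <a> tags
hrefs = ['/data/tables.html',          # root-relative
         'popest/about.html',          # relative to the current path
         'https://example.com/x.html'] # already absolute, left unchanged

absolute = [urljoin(base, h) for h in hrefs]
print(absolute[0])  # https://www.census.gov/data/tables.html
```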

2 Comments

Thank you for showing me the link_set.add piece. I knew that the set() function would take out duplicate links, but thought what I had done was enough. Looking at it with your piece, I see what you're saying where I didn't add anything to link_set. That was a forehead slap on me. I appreciate your feedback.
My pleasure. :)
