Using Python to Parse a Web Page

Question

I'm completely new to Python and could really use some assistance.

I'm trying to parse a webpage and retrieve the email addresses from it. I've tried many things that I've read online, and all of them failed.

I realized that when I run BeautifulSoup(browser.page_source), it brings the source code through; however, for some reason it doesn't bring the email addresses or the business profiles with it.

Below is my code (don't judge :-))

import os, random, sys, time
from urllib.parse import urlparse

from selenium import webdriver
from bs4 import BeautifulSoup
from webdriver_manager.chrome import ChromeDriverManager
import lxml

browser = webdriver.Chrome('./chromedriver.exe')

url = 'https://www.yellowpages.co.za/search?what=accountant&where=cape+town&pg=1'
browser.get(url)

BeautifulSoup(browser.page_source)

Sidenote: My goal is to navigate the webpages based on search criteria and parse each page for the email addresses. I've figured out how to navigate the webpages and send keys; it's just the parsing that I'm stuck with. Your help would be greatly appreciated.

Does this answer your question? Parsing Web Page's Search Results With Python
– picklu, Commented May 16, 2020 at 13:55

Rafael Setton · Accepted Answer · 2020-05-16 14:18:17Z

1

I recommend using the requests module to get the page source:

from requests import get

url = 'https://www.yellowpages.co.za/search?what=accountant&where=cape+town&pg=1'
src = get(url).text  # Gets the Page Source

After that, I searched for email-formatted strings and added them to a list:

src = src.split('<body', 1)[-1]  # Keeps everything after the opening <body (the tag may carry attributes)

emails = []

for ind, char in enumerate(src):
    if char == '@':
        add = 1       # Offset from the '@' character
        email = char  # The email built so far

        # Walk forward until a delimiter or the end of the source
        while ind + add < len(src) and src[ind + add] not in '<>":':
            email += src[ind + add]
            add += 1

        if '.' not in email or email.endswith('.'):  # The domain is incomplete,
            continue                                 # so skip this match

        add = 1  # Reset the offset

        # Walk backward until a delimiter or the start of the source
        while ind - add >= 0 and src[ind - add] not in '<>":':
            email = src[ind - add] + email
            add += 1

        emails.append(email)

Finally, you can use set to remove duplicates and print the emails:

emails = set(emails)  # Remove Duplicates

print(*emails, sep='\n')
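
The character-by-character scan above can also be expressed with a regular expression. The following is a minimal sketch and not the method used in this answer: EMAIL_RE and extract_emails are illustrative names, and the pattern is a deliberately simple approximation that will miss some valid addresses and accept some invalid ones.

```python
import re

# Simplified pattern (an assumption, not a full RFC 5322 matcher):
# a local part, '@', then a domain containing at least one dot.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(src):
    """Return the unique email-like strings found in src, sorted."""
    return sorted(set(EMAIL_RE.findall(src)))


# Made-up sample markup standing in for the real page source.
sample = '<a href="mailto:info@example.com">info@example.com</a> <p>sales@example.co.za</p>'
print(*extract_emails(sample), sep='\n')
```

Like the loop above, this only finds addresses that are actually present in the text it is given.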


3 Comments

Noob Over a year ago

Thank you Rafael, I gave it a go and the same thing happened. It seems when I print the source code, it leaves out the entire first section which contains all the email addresses and only prints the last part. Any suggestions?

Rafael Setton Over a year ago

What do you mean by 'the last part'?

Noob Over a year ago

Basically, there are a total of 3060 lines of code in the source code on the actual web page. When we parse the source code using Python, it only takes the source code from line 1760 to 3060.
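
One way to check whether the source is really truncated, or whether the terminal is simply cutting off long printed output, is to write it to a file and count the lines there. This is a minimal sketch under that assumption: summarize_source and the file name page_source.html are hypothetical, and src stands for whatever string get(url).text or browser.page_source returned.

```python
def summarize_source(src, path="page_source.html"):
    """Save src to path and return (line count, number of '@' characters)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(src)
    return len(src.splitlines()), src.count("@")


# Stand-in for the real page source (an assumption for illustration).
src = "<html>\n<body>\n<a href='mailto:info@example.com'>mail</a>\n</body>\n</html>"
lines, at_signs = summarize_source(src)
print(f"{lines} lines, {at_signs} '@' characters")
```

If the saved file has the full line count but no '@' characters, the addresses may not be in the downloaded markup at all, which would be a separate problem from parsing.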
