I'm completely new to Python and could really use some assistance.

I'm trying to parse a webpage and retrieve the email addresses from it. I've tried many things that I've read online, and all of them failed.

I realized that when I run BeautifulSoup(browser.page_source), it brings the source code through; however, for some reason it doesn't include the email addresses or the business profiles.

Below is my code (don't judge :-))

import os, random, sys, time
from urllib.parse import urlparse
from selenium import webdriver
from bs4 import BeautifulSoup
from webdriver_manager.chrome import ChromeDriverManager
import lxml

browser = webdriver.Chrome('./chromedriver.exe')

url = ('https://www.yellowpages.co.za/search?what=accountant&where=cape+town&pg=1')
browser.get(url)

BeautifulSoup(browser.page_source)

Side note: My goal is to navigate the webpages based on search criteria and parse each page for the email addresses. I've figured out how to navigate the webpages and send keys; it's just the parsing that I'm stuck on. Your help would be greatly appreciated.

1 Answer

I recommend using the requests module to get the page source:

from requests import get

url = 'https://www.yellowpages.co.za/search?what=accountant&where=cape+town&pg=1'
src = get(url).text  # Gets the Page Source

After that, I searched for email-formatted strings and added them to a list:

# Keep only the part after <body>; if that exact tag is missing, keep everything
src = src.split('<body>', 1)[-1]

stop_chars = '<>":'  # Characters that mark the edge of an email
emails = []

for ind, char in enumerate(src):
    if char == '@':
        # Walk right from the '@' until a stop character or the end of the source
        end = ind + 1
        while end < len(src) and src[end] not in stop_chars:
            end += 1

        domain = src[ind + 1:end]  # Everything after the '@'
        if '.' not in domain or domain.endswith('.'):  # This means that the email is
            continue                                   # not fully in the page

        # Walk left from the '@' until a stop character or the start of the source
        start = ind
        while start > 0 and src[start - 1] not in stop_chars:
            start -= 1

        emails.append(src[start:end])

Finally, you can use set to remove duplicates and print the emails:

emails = set(emails)  # Remove Duplicates

print(*emails, sep='\n')
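
As an alternative to the character-by-character scan above, a regular expression can pull out the same kind of strings in one pass. This is only a minimal sketch, not part of the original approach: the pattern is a simplified assumption that covers common addresses but not every valid one, and `extract_emails`, `EMAIL_PATTERN`, and the sample HTML are hypothetical names and data used for illustration.

```python
import re

# Simplified pattern (an assumption): it covers common addresses,
# not every form allowed by the email standards.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(html):
    """Return the unique email-like strings found in html, sorted alphabetically."""
    return sorted(set(EMAIL_PATTERN.findall(html)))


# Made-up sample input; in practice you would pass src from the code above
sample = '<a href="mailto:info@example.com">info@example.com</a> <p>sales@example.co.za</p>'
print(*extract_emails(sample), sep='\n')
```

Unlike the scan above, this also stops at whitespace, so an address in the middle of a sentence is not glued to the words around it. It can still produce false positives, for example image file names such as logo@2x.png.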

3 Comments

Thank you Rafael, I gave it a go and the same thing happened. It seems when I print the source code, it leaves out the entire first section which contains all the email addresses and only prints the last part. Any suggestions?
What do you mean by 'the last part'?
Basically, there are a total of 3060 lines in the source code of the actual web page. When we parse the source code using Python, it only picks up lines 1760 to 3060.
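
One possible explanation, which is an assumption since nothing in this thread confirms it, is that the console only keeps the last part of a very long printout, so the first lines scroll out of view. A minimal sketch to check this is to write the source to a file and count its lines instead of printing it. The names `save_source` and `page_source.html` are hypothetical, and the sample string stands in for the real page source.

```python
from pathlib import Path


def save_source(src, path="page_source.html"):
    """Write src to path as UTF-8 and return the number of lines in src."""
    Path(path).write_text(src, encoding="utf-8")
    return len(src.splitlines())


# Stand-in for the real page source (src in the answer, or browser.page_source)
sample_src = "<html>\n<body>\n<p>info@example.com</p>\n</body>\n</html>"
print(save_source(sample_src), "lines written")
```

If the count is close to 3060, the source is probably complete and only the printout was cut off. If it is much lower, the missing section may be added to the page after it loads, which would need a different approach.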
