
I am trying to extract all the links from a forum (https://www.pakwheels.com/forums/c/travel-n-tours), but my scraper stops after scrolling down once.

from bs4 import BeautifulSoup
from selenium import webdriver
import time

sourceUrl = 'https://www.pakwheels.com/forums/c/travel-n-tours'

# Scrolling code adapted from:
# http://stackoverflow.com/questions/32391303/how-to-scroll-to-the-end-of-the-page-using-selenium-in-python

# ---------------- Scroll to the bottom of the page ----------------
chrome_path = r"C:\Users\Shani\Desktop\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get(sourceUrl)
updatedLenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
scrollComplete = False
while not scrollComplete:
    currentLenOfPage = updatedLenOfPage
    updatedLenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    print('Scrolling down')
    time.sleep(5)
    if currentLenOfPage == updatedLenOfPage:
        scrollComplete = True
time.sleep(10)
pageSource = driver.page_source

# ---------------- Get the links ----------------
soup = BeautifulSoup(pageSource, 'lxml')

blogUrls = []
for url in soup.find_all('a'):
    if((url.get('href').find('/forums/t/') != -1) and (url.get('href').find('about-the-travel-n-tours-category') == -1) and (url.get('href').find('/forums/t/topic/') == -1)):
        blogUrls.append(url.get('href'))
        print(url.get('href'))
print(len(blogUrls))

It gives the following error:

Traceback (most recent call last):
  File "D:\LiclipsWorkSpace\NLKTLib\Scraping\scrolling.py", line 32, in <module>
    if((url.get('href').find('/forums/t/')!=-1) and (url.get('href').find('about-the-travel-n-tours-category')==-1) and (url.get('href').find('/forums/t/topic/')==-1)):
AttributeError: 'NoneType' object has no attribute 'find'

Please help.

  • Yes, you could say that, but I couldn't understand the answer in that question. This is what I am trying to do, and I am getting errors. Any suggestions? Commented Apr 8, 2017 at 7:52

1 Answer

You don't need Selenium; you can get all the links from the forum's JSON responses. This code gets the URLs from the first 5 pages (to fetch every page, change the 5 in range(0, 5) to 264).

import requests

# Each listing page is also served as JSON; build the topic URLs from it.
for i in range(0, 5):
    r = requests.get(
        'https://www.pakwheels.com/forums/c/travel-n-tours/l/latest.json?page={}'.format(i)).json()
    topics = r['topic_list']['topics']
    for topic in topics:
        print('https://www.pakwheels.com/forums/t/{}/{}'.format(topic['slug'], topic['id']))
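If you'd rather not hardcode the page count, a variant (a sketch, assuming latest.json returns an empty topics list once you run past the last page) keeps requesting pages until nothing comes back:

import requests

# Sketch: page through latest.json until a page returns no topics,
# instead of hardcoding the number of pages.
blogUrls = []
page = 0
while True:
    r = requests.get(
        'https://www.pakwheels.com/forums/c/travel-n-tours/l/latest.json?page={}'.format(page)).json()
    topics = r['topic_list']['topics']
    if not topics:  # assumed: an out-of-range page yields an empty list
        break
    for topic in topics:
        blogUrls.append(
            'https://www.pakwheels.com/forums/t/{}/{}'.format(topic['slug'], topic['id']))
    page += 1
print(len(blogUrls))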
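If you do want to keep the Selenium approach, the reason the question's loop stops after one scroll is that it re-measures document.body.scrollHeight immediately after scrolling, before the lazily loaded posts arrive, so the old and new heights compare equal on the first pass. A sketch of the fix (reusing the driver and time from the question) is to wait before re-measuring:

# Sketch: sleep between scrolling and re-measuring the page height, so that
# freshly loaded content is counted before the two heights are compared.
currentLenOfPage = driver.execute_script("return document.body.scrollHeight;")
scrollComplete = False
while not scrollComplete:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)  # give the infinite scroll time to load more posts
    updatedLenOfPage = driver.execute_script("return document.body.scrollHeight;")
    if currentLenOfPage == updatedLenOfPage:
        scrollComplete = True
    currentLenOfPage = updatedLenOfPage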

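As for the AttributeError itself: soup.find_all('a') also matches anchors that have no href attribute, for which url.get('href') returns None, and calling .find() on None raises the error. A minimal guard (a sketch, reusing the soup object from the question) is to ask BeautifulSoup for anchors that actually carry an href:

blogUrls = []
# href=True makes find_all return only anchors that have an href attribute.
for url in soup.find_all('a', href=True):
    href = url.get('href')
    if ('/forums/t/' in href
            and 'about-the-travel-n-tours-category' not in href
            and '/forums/t/topic/' not in href):
        blogUrls.append(href)
        print(href)
print(len(blogUrls))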