
I have a client who wants me to scrape this sketchy website. The loop works the first time through, then the error below occurs. Any help? I'd suggest not visiting the website yourself, but hopefully the pay is worth my time lol.

import csv
import time

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException

options = webdriver.ChromeOptions()
options.add_argument("--incognito")
PATH = 'C:\Program Files (x86)\chromedriver.exe'
URL = 'https://avbebe.com/archives/category/高清中字/page/5'
driver = webdriver.Chrome(executable_path=PATH, options=options)
driver.get(URL)

time.sleep(5)
Vid = driver.find_elements_by_class_name('entry-title')
for title in Vid:
    actions = ActionChains(driver)
    time.sleep(5)
    WebDriverWait(title, 10).until(EC.element_to_be_clickable((By.TAG_NAME, 'a')))#where error occurs
    actions.double_click(title).perform()
    time.sleep(5)
    VidUrl = driver.current_url
    VidTitle = driver.find_element_by_xpath('//*[@id="post-69331"]/h1/a').text
    try:
        VidTags = driver.find_elements_by_class_name('tags')
        for tag in VidTags:
            VidTag = tag.find_element_by_tag_name('a').text
        
    except NoSuchElementException or StaleElementReferenceException:
        pass
    
    with open('data.csv', 'w', newline='', encoding = "utf-8") as f:
        fieldnames = ['Title', 'Tags', 'URL']
        thewriter = csv.DictWriter(f, fieldnames=fieldnames)

        thewriter.writeheader()
        thewriter.writerow({'Title': VidTitle, 'Tags': VidTag, 'URL': VidUrl})
    driver.back()
    driver.refresh()
print('done')        

Error:

WebDriverWait(title, 10).until(EC.element_to_be_clickable((By.TAG_NAME, 'a')))
  File "C:\Users\Heage\AppData\Local\Programs\Python\Python39\lib\site-packages\selenium\webdriver\support\wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

You aren't using a wait when you get the elements array. This will work fine after a get(), but may not for a .back() or .refresh(). Selenium waits for the page to load after a get()... I would put the get() inside the loop instead of using back() and refresh(). Commented Jun 24, 2021 at 21:47

2 Answers


You are nearly there, just missing a few pieces.

Firstly, you are fetching all the links to the videos and then navigating away inside a loop:

Vid = driver.find_elements_by_class_name('entry-title')
for title in Vid:
    # ...
    WebDriverWait(title, 10).until(EC.element_to_be_clickable((By.TAG_NAME, 'a')))
    # ...
    driver.back()
    driver.refresh()

What happens is that once the browser navigates to a different URL, all of those elements become stale, i.e. they throw a StaleElementReferenceException when you try to interact with them, because the browser no longer holds a reference to the original DOM nodes.
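
For illustration, here is a minimal sketch of the failure mode (not from the original post; it just re-uses references that were collected before navigating):

titles = driver.find_elements_by_class_name('entry-title')
first_link = titles[0].find_element_by_tag_name('a')
driver.get(first_link.get_attribute('href'))  # navigate away from the listing page
driver.back()
first_link.click()  # typically raises StaleElementReferenceException: the old reference is dead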

So what you need to do is collect all of the link URLs up front and then visit each one with driver.get, with no need for back() or refresh():

link_elements = driver.find_elements_by_css_selector('.entry-title a')  # a compound selector needs css, not class_name
links = {link_element.get_attribute('href') for link_element in link_elements}

for link in links:
    driver.get(link) # otherwise, stale elements

Next, once you open the page, you are searching for an element with an id.

    VidTitle = driver.find_element_by_xpath('//*[@id="post-69331"]/h1/a').text

However, you have to keep in mind that these ids change from page to page, so your script is likely to fail here. Instead, try to find classes that don't change. I took a look at the page and found that the video title is an h1 tag with an entry-title class, so I used that instead:

    VidTitle = driver.find_element_by_css_selector('h1.entry-title').text
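
If the title occasionally renders late, you could also wrap this lookup in an explicit wait, re-using the WebDriverWait/EC tools the question already imports. This is just a sketch, not part of the original answer:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 10 seconds for the title element before reading its text.
VidTitle = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'h1.entry-title'))
).text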

Working solution


import csv

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException

options = Options()
options.add_argument("--incognito")
driver = webdriver.Chrome(options=options)

URL = 'https://avbebe.com/archives/category/高清中字/page/5'

driver.get(URL)

# Collect the hrefs up front so navigating away cannot invalidate the elements.
link_elements = driver.find_elements_by_css_selector('.entry-title a')
links = {link_element.get_attribute('href') for link_element in link_elements}

# Open the CSV once, outside the loop, so each row is appended instead of the
# whole file being overwritten on every iteration.
with open('data.csv', 'w', newline='', encoding="utf-8") as f:
    fieldnames = ['Title', 'Tags', 'URL']
    thewriter = csv.DictWriter(f, fieldnames=fieldnames)
    thewriter.writeheader()

    for link in links:
        driver.get(link)

        VidUrl = driver.current_url
        VidTitle = driver.find_element_by_css_selector('h1.entry-title').text

        VidTag = ''  # default in case no tags are found
        try:
            VidTags = driver.find_elements_by_class_name('tags')
            for tag in VidTags:
                VidTag = tag.find_element_by_tag_name('a').text  # keeps the last tag, as in the question

        except (NoSuchElementException, StaleElementReferenceException):
            pass

        thewriter.writerow({'Title': VidTitle, 'Tags': VidTag, 'URL': VidUrl})

print('done')
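
Note that the tags loop above keeps only the last tag it finds, as in the question. If you want every tag in one field, a small variation like this would work (a sketch, assuming each element with class tags wraps one or more a links):

tag_links = driver.find_elements_by_css_selector('.tags a')
VidTag = ', '.join(t.text for t in tag_links)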

Put the line driver.get(URL) inside the loop. Remove driver.back() and driver.refresh().

options = webdriver.ChromeOptions()
options.add_argument("--incognito")
PATH = 'C:\Program Files (x86)\chromedriver.exe'
URL = 'https://avbebe.com/archives/category/高清中字/page/5'
driver = webdriver.Chrome(executable_path=PATH, options=options)
driver.get(URL)

time.sleep(5)
Vid = driver.find_elements_by_class_name('entry-title')
for title in Vid:
    driver.get(URL)
    actions = ActionChains(driver)
    time.sleep(5)
    WebDriverWait(title, 10).until(EC.element_to_be_clickable((By.TAG_NAME, 'a')))#where error occurs
    actions.double_click(title).perform()
    time.sleep(5)
    VidUrl = driver.current_url
    VidTitle = driver.find_element_by_xpath('//*[@id="post-69331"]/h1/a').text
    try:
        VidTags = driver.find_elements_by_class_name('tags')
        for tag in VidTags:
            VidTag = tag.find_element_by_tag_name('a').text
        
    except NoSuchElementException or StaleElementReferenceException:
        pass
    
    with open('data.csv', 'w', newline='', encoding = "utf-8") as f:
        fieldnames = ['Title', 'Tags', 'URL']
        thewriter = csv.DictWriter(f, fieldnames=fieldnames)

        thewriter.writeheader()
        thewriter.writerow({'Title': VidTitle, 'Tags': VidTag, 'URL': VidUrl})
    #driver.back()
    #driver.refresh()
print('done')

