In Python/Selenium, how can I get the SKU number from the HTML code, as shown in the image? The code below only gets the text of the element; I want the value that appears directly in the HTML. Thanks!

import time
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://search.jd.com/Search?keyword=%E6%9E%9C%E6%B1%81&qrst=1&wq=%E6%9E%9C%E6%B1%81&stock=1&pvid=b86735ca93754d6f96a68a4ee0e187d5&psort=3&click=0')

driver.execute_script("""
(function () {
var y = 0;
var step = 100;
window.scroll(0, 0);
function f() {
if (y < document.body.scrollHeight) {
y += step;
window.scroll(0, y);
setTimeout(f, 100);
} else {
window.scroll(0, 0);
document.title += "scroll-done";
}
}
setTimeout(f, 1000);
})();
""")
print("Scrolling down...")
# time.sleep(180)
while True:
    if "scroll-done" in driver.title:
        break
    else:
        print("Haven't reached the bottom yet...")
        time.sleep(3)
skus=driver.find_elements_by_xpath("//div[@id='J_goodsList']")
for sku in skus:
    print(sku.text)

2 Answers

You can get any attribute value of an element with the .get_attribute() method.
So here you can do something like the following:

import time
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://search.jd.com/Search?keyword=%E6%9E%9C%E6%B1%81&qrst=1&wq=%E6%9E%9C%E6%B1%81&stock=1&pvid=b86735ca93754d6f96a68a4ee0e187d5&psort=3&click=0')

driver.execute_script("""
(function () {
var y = 0;
var step = 100;
window.scroll(0, 0);
function f() {
if (y < document.body.scrollHeight) {
y += step;
window.scroll(0, y);
setTimeout(f, 100);
} else {
window.scroll(0, 0);
document.title += "scroll-done";
}
}
setTimeout(f, 1000);
})();
""")
print("Scrolling down...")
# time.sleep(180)
while True:
    if "scroll-done" in driver.title:
        break
    else:
        print("Haven't reached the bottom yet...")
        time.sleep(3)
skus=driver.find_elements_by_xpath("//div[@id='J_goodsList']//li[@data-sku]")
for sku in skus:
    print(sku.get_attribute("data-sku"))
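One thing worth guarding against: get_attribute returns None when an element does not have the requested attribute, so a broader locator can yield None entries. A minimal sketch of a filter follows. The names extract_skus and print_skus are hypothetical helpers, not part of Selenium, and print_skus assumes the Selenium 4 locator style, find_elements(By.XPATH, ...), which replaces find_elements_by_xpath in recent releases.

```python
def extract_skus(elements):
    """Collect the data-sku attribute from WebElement-like objects.

    Elements without the attribute (get_attribute returns None) or with
    an empty value are skipped.
    """
    values = (element.get_attribute("data-sku") for element in elements)
    return [value for value in values if value]


def print_skus(driver):
    """Hypothetical usage with a live driver (Selenium 4 locator style)."""
    # Imported here so extract_skus stays usable without Selenium installed.
    from selenium.webdriver.common.by import By

    # Deliberately broad locator: some <li> items may lack data-sku.
    items = driver.find_elements(By.XPATH, "//div[@id='J_goodsList']//li")
    for sku in extract_skus(items):
        print(sku)
```

Calling print_skus(driver) after the scroll loop finishes would print one SKU per line and silently skip list items that carry no data-sku attribute.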

5 Comments

Thanks, but the code returns None.
You are possibly missing a delay. Try putting a sleep before skus=driver.find_elements_by_xpath("//div[@id='J_goodsList']") and let me know if that helps. If it does, we can improve on it by removing the hardcoded sleep.
Shouldn't you be targeting the li elements, e.g. //div[@id='J_goodsList']/ul/li?
@Prophet It works now after adding sleep time after skus=driver.find_elements_by_xpath("//div[@id='J_goodsList']"). Thanks for your help!
I'm happy I could help you! If this resolved your problem, please accept the answer to indicate that the question is resolved.
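The hardcoded sleep discussed in these comments can be replaced by polling with a timeout. The sketch below is a generic stand-in rather than Selenium's API: wait_until is a hypothetical helper, and the timeout and interval values are assumptions. With a live driver, Selenium's own WebDriverWait(driver, 60).until(EC.title_contains("scroll-done")) should serve the same purpose.

```python
import time


def wait_until(condition, timeout=60.0, interval=0.5):
    """Poll condition() until it returns a truthy value or timeout elapses.

    Returns the truthy value, or raises TimeoutError if the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Hypothetical usage with the scrolling script from the question:
# wait_until(lambda: "scroll-done" in driver.title, timeout=180)
```

Unlike a fixed sleep, this returns as soon as the condition holds and fails loudly if it never does.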

You were pretty close. Instead of targeting the ancestor <div>, you can target the descendant <li> elements. Then, instead of extracting the text, you need to call get_attribute("data-sku").
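
To see why the element text alone is not enough, here is a small offline sketch using Python's standard html.parser on a made-up fragment. The markup is an assumption modeled on the selectors in this thread (a div with id J_goodsList containing li elements with a data-sku attribute), not a copy of the real page, and SkuCollector is a hypothetical name.

```python
from html.parser import HTMLParser

# Made-up fragment: the SKU lives in an attribute, not in the visible text.
SAMPLE_HTML = """
<div id="J_goodsList">
  <ul>
    <li data-sku="3313643">Juice A</li>
    <li data-sku="5327144">Juice B</li>
    <li>Advertisement without a SKU</li>
  </ul>
</div>
"""


class SkuCollector(HTMLParser):
    """Record the data-sku attribute of every <li> start tag."""

    def __init__(self):
        super().__init__()
        self.skus = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            sku = dict(attrs).get("data-sku")
            if sku is not None:
                self.skus.append(sku)


collector = SkuCollector()
collector.feed(SAMPLE_HTML)
print(collector.skus)
```

The visible text of each item ("Juice A", "Juice B") never contains the SKU, which is why printing .text cannot return it. The same parser could be fed driver.page_source once the page has finished loading.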


Solution

To extract and print the SKU numbers (e.g. 3313643, 5327144), use WebDriverWait with visibility_of_all_elements_located() so the elements are visible before you read them. You can use either of the following locator strategies:

  • Using CSS_SELECTOR:

    driver.get('https://search.jd.com/Search?keyword=%E6%9E%9C%E6%B1%81&qrst=1&wq=%E6%9E%9C%E6%B1%81&stock=1&pvid=b86735ca93754d6f96a68a4ee0e187d5&psort=3&click=0')
    print([my_elem.get_attribute("data-sku") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div#J_goodsList ul li")))])
    
  • Using XPATH:

    driver.get('https://search.jd.com/Search?keyword=%E6%9E%9C%E6%B1%81&qrst=1&wq=%E6%9E%9C%E6%B1%81&stock=1&pvid=b86735ca93754d6f96a68a4ee0e187d5&psort=3&click=0')
    print([my_elem.get_attribute("data-sku") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@id='J_goodsList']//ul//li")))])
    
  • Console Output:

    ['3313643', '5327144', '1256816', '3127041', '100007725801', '7153462', '100020544789', '3088504', '2439951', '4602877', '100013213444', '100018630091', '10028327196597', '100005772043', '3081867', '1044735', '4323156', '100010085943', '848890', '100010783078', '4377126', '3557308', '5417682', '100020805609', '4641871', '100017615158', '100032110985', '848893', '100013210524', '100017615166']
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
