1

I am working on a project for my master's degree in which I would like to scrape LinkedIn. I have run into a problem scraping the education pages of users (e.g. https://www.linkedin.com/in/williamhgates/details/education/).

I would like to scrape every education entry for each user. In this example, I would like to get "Harvard University", which sits under the span with class mr1 hoverable-link-text t-bold, but I can't seem to get to it.

Here's the HTML from LinkedIn:

<li class="pvs-list__paged-list-item artdeco-list__item pvs-list__item--line-separated " id="profilePagedListComponent-ACoAAA8BYqEBCGLg-vT-ca6mMEqkpp9nVffJ3hc-EDUCATION-VIEW-DETAILS-profile-ACoAAA8BYqEBCGLg-vT-ca6mMEqkpp9nVffJ3hc-NONE-da-DK-0">
                        <!----><div class="pvs-entity
    pvs-entity--padded pvs-list__item--no-padding-when-nested
    
    ">
  <div>
        <a class="optional-action-target-wrapper 
        display-flex" target="_self" href="https://www.linkedin.com/company/1646/">
        <div class="ivm-image-view-model  pvs-entity__image ">
    <div class="ivm-view-attr__img-wrapper ivm-view-attr__img-wrapper--use-img-tag display-flex
    
    ">
<!---->      <img width="48" src="https://media-exp1.licdn.com/dms/image/C4E0BAQF5t62bcL0e9g/company-logo_100_100/0/1519855919126?e=1668643200&amp;v=beta&amp;t=BL0HxGNOasVbI3u39HBSL3n7H-yYADkJsqS3vafg-Ak" loading="lazy" height="48" alt="Harvard University logo" id="ember59" class="ivm-view-attr__img--centered EntityPhoto-square-3  lazy-image ember-view">
</div>
  </div>
    </a>

  </div>

  <div class="display-flex flex-column full-width align-self-center">
    <div class="display-flex flex-row justify-space-between">
          <a class="optional-action-target-wrapper 
          display-flex flex-column full-width" target="_self" href="https://www.linkedin.com/company/1646/">
        <div class="display-flex align-items-center">
            <span class="mr1 hoverable-link-text t-bold">
              <span aria-hidden="true"><!---->Harvard University<!----></span><span class="visually-hidden"><!---->Harvard University<!----></span>
            </span>
<!----><!----><!---->        </div>
<!---->          <span class="t-14 t-normal t-black--light">
            <span aria-hidden="true"><!---->1973 - 1975<!----></span><span class="visually-hidden"><!---->1973 - 1975<!----></span>
          </span>
<!---->      </a>


<!---->
      <div class="pvs-entity__action-container">
<!---->      </div>
    </div>

      <div class="pvs-list__outer-container">
<!---->    <ul class="pvs-list
        
        ">
        <li class=" ">
                <div class="pvs-list__outer-container">
<!----><!----><!----></div>

        </li>
    </ul>
<!----></div>
  </div>
</div>

                </li>

I have tried the following code:

education = driver.find_element("xpath", '//*[@id="profilePagedListComponent-ACoAAA8BYqEBCGLg-vT-ca6mMEqkpp9nVffJ3hc-EDUCATION-VIEW-DETAILS-profile-ACoAAA8BYqEBCGLg-vT-ca6mMEqkpp9nVffJ3hc-NONE-da-DK-0"]/div/div[2]/div[1]/a/div/span/span[1]/').text
print(education)

I keep getting the error:

Message: no such element: Unable to locate element:

Can anybody help? I would love to have a script that loops through the education entries and saves the place of education and the years for each one.

1
  • May I suggest you use Playwright for your project? It supports Python and is a pleasure to work with. Commented Aug 19, 2022 at 12:15

5 Answers

1

Thank you everyone!

I ended up with the code below, which worked.

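from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# School names: the first span child under each hoverable-link-text span
get_education_school = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='pvs-list ']/li//span[contains(@class, 'hoverable-link-text')]//span[1]")))]

# Years: the first span child under each "t-14 t-normal t-black--light" span
get_education_years = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='pvs-list ']/li//span[contains(@class, 't-14 t-normal t-black--light')]//span[1]")))]

# zip() stops at the shorter list, so both result lists end up the same length
results_education_school = []
results_education_years = []
for i,j in zip(get_education_school, get_education_years):
    results_education_school.append(i)
    results_education_years.append(j)

print(results_education_school)
print(results_education_years)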
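
Since the question also asks about saving the school and years, here is a minimal sketch of one way to write the two result lists to CSV with Python's standard library. The function name write_education_csv and the column names are hypothetical, and the sketch assumes both lists hold plain strings in matching order. It uses zip_longest rather than zip so that a shorter list does not silently drop entries, but it cannot repair a misalignment caused by an entry that shows no years on the page.

```python
import csv
import io
from itertools import zip_longest


def write_education_csv(schools, years, fileobj):
    """Write one (school, years) row per education entry to fileobj.

    Hypothetical helper: the name and the column headers are illustrative.
    """
    writer = csv.writer(fileobj)
    writer.writerow(["school", "years"])
    # zip_longest keeps every entry even if one list is shorter;
    # missing values are written as empty strings.
    for school, period in zip_longest(schools, years, fillvalue=""):
        writer.writerow([school, period])


# Example using the values from the question's HTML, written to memory
buffer = io.StringIO()
write_education_csv(["Harvard University"], ["1973 - 1975"], buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open("education.csv", "w", newline="", encoding="utf-8"); the file name is only an example.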

0

To extract the text Harvard University, you should ideally use WebDriverWait with visibility_of_element_located(). You can use either of the following locator strategies:

  • Using CSS_SELECTOR:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "ul.pvs-list>li span.hoverable-link-text span"))).text)
    
  • Using XPATH:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//ul[@class='pvs-list ']/li//span[contains(@class, 'hoverable-link-text')]//span"))).text)
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python
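
One difference between the two locators above is worth noting: the CSS selector ul.pvs-list matches on the class token, while the XPath compares the whole class attribute to the exact string 'pvs-list ', trailing space included. In the HTML from the question, the nested ul has extra whitespace and line breaks in its class attribute, so an exact comparison may not match every list. A minimal sketch of the common XPath idiom for matching a single class token follows; the helper name xpath_with_class is hypothetical, and the result has not been checked against the live page.

```python
def xpath_with_class(tag, class_name):
    """Build an XPath for `tag` elements whose class list contains `class_name`.

    normalize-space() collapses runs of whitespace (including line breaks),
    and the padded spaces make sure only whole class tokens match.
    """
    return (
        f"//{tag}[contains(concat(' ', normalize-space(@class), ' '), "
        f"' {class_name} ')]"
    )


# Hypothetical usage: a whitespace-tolerant replacement for //ul[@class='pvs-list ']
locator = xpath_with_class("ul", "pvs-list")
print(locator)
```

The resulting string can be passed to By.XPATH in the same way as the hand-written expression.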

0

I would first get the list for the education section.

education_list = driver.find_element(By.CSS_SELECTOR, 'ul.pvs-list')
# loop through education_list for place and years
# would recommend relative locators for this task.
# find the image and get the first and second span with text inside of them.

I am adding further details to the code now. Please hold.

1 Comment

Hi Wonhyeong, if I use the statement as written, I get the following error (I had already tried that to get the list): NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"ul.pvs-list"}
0

You can use the properties below to identify the school name list:

ancestorClass="optional-action-target-wrapper display-flex flex-column full-width" class="display-flex align-items-center" tag="DIV"

Use these properties to identify the year list:

ancestorClass="optional-action-target-wrapper display-flex flex-column full-width" class="t-14 t-normal t-black--light" tag="SPAN"

You can use the information above to compose an XPath that locates each list. If you don't mind using other Python libraries, there is sample code on GitHub that scrapes the school and year.

0

@Nadia S., you can try the following code. I have added inline comments.

    @Test
    public void linkedInTest() {
        driver.get("https://www.linkedin.com");

        // You need to enter the credentials for your linkedin below for login
        driver.findElement(By.id("session_key")).sendKeys("");
        driver.findElement(By.id("session_password")).sendKeys("");
        driver.findElement(By.className("sign-in-form__submit-button")).click();
        driver.get("https://www.linkedin.com/in/williamhgates/details/education/");

        //Wait for the Education details to get populated. 
        WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(7));
        wait.until(ExpectedConditions.visibilityOfElementLocated(
                By.xpath("//div[@class = 'pvs-list__container']//div[@class = 'scaffold-finite-scroll__content']/ul")));
        
        //Take all elements showing education details in a list 
        List<WebElement> allEducation = driver.findElements(By
                .xpath("//div[@class = 'pvs-list__container']//div[@class = 'scaffold-finite-scroll__content']/ul/li"));
        //Extract details of each education item in the list. 
        //Below the details are directed to console. You can use a collection to store them.
        for (WebElement oneEducation : allEducation) {
            WebElement education = oneEducation.findElement(
                    By.xpath(".//*[contains(@class,\"mr1 hoverable-link-text\")]/span[@aria-hidden='true']"));
            System.out.print("Education - " + education.getText());
            try {
                WebElement educationType = oneEducation
                        .findElement(By.cssSelector(".t-14.t-normal span[aria-hidden='true']"));
                System.out.print("      Education Type - " + educationType.getText());
            } catch (NoSuchElementException e) {
                System.out.print("      Education Type - " + "is Not Specified");
            }
            try {
                WebElement educationYear = oneEducation
                        .findElement(By.cssSelector(".t-14.t-normal.t-black--light span[aria-hidden='true']"));
                System.out.println("        Education Year - " + educationYear.getText());
            } catch (NoSuchElementException e) {
                System.out.println("        Education Year - " + "is Not Specified");
            }
        }

    }
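
For readers using the Python bindings, as in the question, the per-item loop above could be sketched roughly as follows. This is an illustration under assumptions rather than a port verified against the live page: the names SCHOOL_XPATH, YEARS_XPATH, text_or_default and extract_education are hypothetical, and the years XPath is derived from the HTML in the question. The locator strategy is passed as the string "xpath", as in the question's own code, and find_elements is used instead of try/except because it returns an empty list when nothing matches.

```python
# XPaths relative to one education <li>; assumptions based on the question's HTML
SCHOOL_XPATH = ".//*[contains(@class, 'mr1 hoverable-link-text')]/span[@aria-hidden='true']"
YEARS_XPATH = ".//span[contains(@class, 't-black--light')]/span[@aria-hidden='true']"


def text_or_default(parent, xpath, default="is Not Specified"):
    """Return the text of the first match under `parent`, or `default` if none."""
    # find_elements returns an empty list instead of raising when nothing matches
    matches = parent.find_elements("xpath", xpath)
    return matches[0].text if matches else default


def extract_education(items):
    """Collect school and years from each education list item.

    `items` is any iterable of objects with a Selenium-style
    find_elements(by, value) method, such as WebElement instances.
    """
    rows = []
    for item in items:
        rows.append({
            "school": text_or_default(item, SCHOOL_XPATH),
            "years": text_or_default(item, YEARS_XPATH),
        })
    return rows
```

With a logged-in driver, items would be the li elements located by the same XPath as allEducation above, fetched after the wait has succeeded. Reading both fields from the same li keeps each school paired with its own years even when an entry has no dates.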
