0

I have the following HTML and want to extract the years and names. Nothing I have tried so far has worked:

<div class="Year">
    <span class="date">2019</span>
</div>



<div class="cl2">
    <span class="name">name1</span>
</div>
<div class="cl2">
    <span class="name">name2</span>
</div>
<div class="cl2">
    <span class="name">name3</span>
</div>
<div class="cl2">
    <span class="name">name4</span>
</div>



<div class="Year">
    <span class="date">2020</span>
</div>

<div class="cl2">
    <span class="name">name5</span>
</div>
<div class="cl2">
    <span class="name">name6</span>
</div>

What I want to get is:

2019
name1
name2
name3
name4
2020
name5
name6

I tried the following, using XPath:

# Class names are case-sensitive in XPath, so this must match "Year" exactly
years = driver.find_elements_by_xpath("//div[@class='Year']")

for year in years:
    print(year.find_element_by_xpath(".//span[@class='date']").text)

# The name spans live inside the "cl2" divs
names = driver.find_elements_by_xpath("//div[@class='cl2']")

for name in names:
    print(name.find_element_by_xpath(".//span[@class='name']").text)

But I got all the years first, followed by all the names:

2019
2020
name1
name2
name3
name4
name5
name6

3 Answers

1

You can get the year for each name using the XPath preceding axis:

names = dict()
for e in driver.find_elements_by_class_name('name'):
    name = e.text
    year = e.find_element_by_xpath("(./preceding::span[@class='date'])[last()]").text
    names[name] = year

{'name1': '2019', 'name2': '2019', 'name3': '2019', 'name4': '2019', 'name5': '2020', 'name6': '2020'}

Alternatively, you can fetch all date and name elements in document order and tell them apart by class:

names = dict()
year = None
for e in driver.find_elements_by_css_selector('.date, .name'):
    if 'name' in e.get_attribute('class'):
        names[e.text] = year
    if 'date' in e.get_attribute('class'):
        year = e.text

{'name1': '2019', 'name2': '2019', 'name3': '2019', 'name4': '2019', 'name5': '2020', 'name6': '2020'}
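
The same "walk in document order and remember the current year" idea can be tried without a browser. The following is only a sketch that swaps Selenium for Python's standard html.parser module; the names YearNameParser, group_names_by_year and SAMPLE_HTML are made up for this illustration, and it assumes each span carries exactly one class, as in the question's markup.

```python
from html.parser import HTMLParser


class YearNameParser(HTMLParser):
    """Collect names under the most recently seen year, in document order."""

    def __init__(self):
        super().__init__()
        self.groups = {}           # year -> list of names (insertion-ordered)
        self._current_year = None  # last date span seen so far
        self._capture = None       # "date", "name", or None

    def handle_starttag(self, tag, attrs):
        # Only spans whose class is exactly "date" or "name" are of interest.
        if tag == "span":
            css_class = dict(attrs).get("class")
            if css_class in ("date", "name"):
                self._capture = css_class

    def handle_data(self, data):
        text = data.strip()
        if not text or self._capture is None:
            return
        if self._capture == "date":
            self._current_year = text
            self.groups.setdefault(text, [])
        elif self._current_year is not None:
            self.groups[self._current_year].append(text)
        self._capture = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._capture = None


def group_names_by_year(html):
    parser = YearNameParser()
    parser.feed(html)
    parser.close()
    return parser.groups


# Condensed copy of the markup from the question (assumed input).
SAMPLE_HTML = """
<div class="Year"><span class="date">2019</span></div>
<div class="cl2"><span class="name">name1</span></div>
<div class="cl2"><span class="name">name2</span></div>
<div class="cl2"><span class="name">name3</span></div>
<div class="cl2"><span class="name">name4</span></div>
<div class="Year"><span class="date">2020</span></div>
<div class="cl2"><span class="name">name5</span></div>
<div class="cl2"><span class="name">name6</span></div>
"""

groups = group_names_by_year(SAMPLE_HTML)
for year, names in groups.items():
    print(year)
    for name in names:
        print(name)
```

Keying the result by year rather than by name keeps names that happen to repeat across years from overwriting each other. With Selenium, driver.page_source could be passed to the same function.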

1 Comment

Hello Sers, I posted another problem related to this same question. Would you please take a look at it, if you don't mind? stackoverflow.com/questions/63215107/…
1

One solution is to work with the HTML saved as a text file rather than through the browser. That way the desired text can be extracted with ordinary string and regular-expression tools.

First, import the re library, which we will use to search the HTML text.

Then read in the text file and use .split() to split the text into a list of chunks, one per Year div. Next, iterate over the list and use re.search and re.findall to pull the date and name spans out of each chunk.

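import re

# Read the saved HTML source as plain text
with open("html_text.txt", "r") as f:
    html_text = f.read()

# Each chunk after the first starts just inside a Year div
text_list = html_text.split('<div class="Year">')

for year in text_list[1:]:
    # The chunk holds one date span followed by that year's name spans
    date = re.search('<span class="date">(.+?)</span>', year)
    names = re.findall('<span class="name">(.+?)</span>', year)

    print(date.group(1))
    for name in names:
        print(name)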

The printed output should look like this:

2019
name1
name2
name3
name4
2020
name5
name6
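
For reuse, the split-and-search steps can be wrapped in a function that works on any string, so it can be fed a saved file or a page source. This is a rough sketch: extract_years_and_names and the sample string are hypothetical, and the patterns assume the attributes are written exactly as in the question (same quoting, same case, each span on one line), which is the usual weak point of parsing HTML with regular expressions.

```python
import re

YEAR_MARKER = '<div class="Year">'
DATE_RE = re.compile(r'<span class="date">(.+?)</span>')
NAME_RE = re.compile(r'<span class="name">(.+?)</span>')


def extract_years_and_names(html_text):
    """Return a list of (year, [names]) tuples in document order."""
    results = []
    # Text before the first marker belongs to no year, so skip chunk 0.
    for chunk in html_text.split(YEAR_MARKER)[1:]:
        date = DATE_RE.search(chunk)
        if date is None:
            continue  # Year div without a date span: skip instead of crashing
        results.append((date.group(1), NAME_RE.findall(chunk)))
    return results


# Shortened, made-up sample in the same shape as the question's markup.
sample = (
    '<div class="Year"><span class="date">2019</span></div>'
    '<div class="cl2"><span class="name">name1</span></div>'
    '<div class="cl2"><span class="name">name2</span></div>'
    '<div class="Year"><span class="date">2020</span></div>'
    '<div class="cl2"><span class="name">name5</span></div>'
)

result = extract_years_and_names(sample)
for year, names in result:
    print(year)
    for name in names:
        print(name)
```

The None check matters because re.search returns None when a chunk has no date span, and calling .group(1) on that would raise an AttributeError.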

Hope this helped!!

0

I managed to read the text of the elements between the divs by using .get_attribute("textContent") instead of .text, following the tip from "Get Text from Span returns empty string".
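
A small helper can make that fallback explicit. This is a sketch under assumptions: element_text is a hypothetical name, and FakeElement is a stand-in written only so the example runs without a browser. On a real Selenium WebElement, .text returns the rendered (visible) text, so it can come back empty for elements that are not displayed, while get_attribute("textContent") returns the raw text of the node.

```python
def element_text(element):
    """Return the rendered text, falling back to the raw textContent."""
    text = element.text
    if text:
        return text
    raw = element.get_attribute("textContent")
    return raw.strip() if raw else ""


class FakeElement:
    """Stand-in for a WebElement so the sketch runs without a browser."""

    def __init__(self, text, text_content):
        self.text = text
        self._text_content = text_content

    def get_attribute(self, name):
        return self._text_content if name == "textContent" else None


# An element whose rendered text is empty but whose node text is not.
hidden = FakeElement("", "\n    name1\n")
print(element_text(hidden))
```

In the question's loops, this would replace the direct .text access on each span returned by the driver.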
