How to get different texts from the HTML DOM through Selenium and Python

Question

In the following example:

<tr>
    <td>
    </td>

    <td>
    </td>

    <td>
    </td>

    <td>
    </td>

    <td>
        text1
        <br>
        <img>
        <br>
        text2
    </td>
</tr>

When I try to get the text in the 5th td like so:

something = elem.find_element_by_xpath('./td[5]').text

I get both texts in the same variable. I can split them, but I was wondering whether I can get them into individual variables so I don't have to bother with a split. However, when I try something like this:

something = elem.find_element_by_xpath('./td[5]/text()[1]')

I get the following error message:

InvalidSelectorException: invalid selector: 
The result of the xpath expression "./td[5]/text()[1]" is: [object Text]. 
It should be an element.

Can I get around this error somehow?

Selenium requires find_element to return an element node, but ./td[5]/text()[1] returns a text node, which is why you get the error. To learn what element and text nodes are, read the HTML DOM documentation: element and text are two of the node types that make up the DOM tree.
– yong, Commented Mar 28, 2018 at 10:32

Andersson · Accepted Answer · 2018-03-28 10:07:52Z

4

You can try the code below to get the two text nodes separately:

something = elem.find_element_by_xpath('./td[5]')
text1 = driver.execute_script('return arguments[0].firstChild.textContent;', something).strip()
text2 = driver.execute_script('return arguments[0].lastChild.textContent;', something).strip()

answered Mar 28, 2018 at 10:07

Andersson


4 Comments

cybera Over a year ago

Thanks, your solution worked wonderfully. If you have time and don't mind, I'd love it if you could walk me through the code. It would help immensely. In particular, what do the script part and the strip part do?

Andersson Over a year ago

Each variable holds the result of a JavaScript snippet. arguments[0] is a placeholder for the something element passed to execute_script. So the first snippet returns the text content of the first child node of the td, and the second does the same for its last child node. Note that the child nodes of the td are: 1. "text1", 2. br, 3. whitespace, 4. img, 5. whitespace, 6. br, 7. "text2". strip() simply removes leading and trailing newline characters and spaces. Also, firstChild/lastChild can be replaced with an explicit index, e.g. childNodes[0], childNodes[6].

cybera Over a year ago

My wholehearted thanks. How would it have been written if it was the second and fourth child instead of the first and last for which there are specific functions?

Andersson Over a year ago

I've updated my previous comment. You can use arguments[0].childNodes[N] to get N-th node
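
The childNodes approach from these comments can be generalized to collect every text node of an element in a single call. The following is a minimal sketch, not the answerer's code: the helper name get_text_nodes and the JavaScript snippet are illustrative assumptions, and the helper only expects an object that exposes Selenium's execute_script(script, *args) method.

```python
# JavaScript run in the browser: return the raw content of every direct
# child text node of the element passed in as arguments[0].
TEXT_NODES_JS = """
return Array.from(arguments[0].childNodes)
    .filter(function (node) { return node.nodeType === Node.TEXT_NODE; })
    .map(function (node) { return node.textContent; });
"""


def get_text_nodes(driver, element):
    """Return the stripped, non-empty direct text nodes of `element`.

    `driver` is assumed to be a Selenium WebDriver (anything with an
    execute_script(script, *args) method works); `element` is a WebElement.
    """
    raw_texts = driver.execute_script(TEXT_NODES_JS, element)
    # Drop whitespace-only nodes, such as the ones between <br> and <img>.
    return [text.strip() for text in raw_texts if text.strip()]
```

With the markup from the question, a call such as text1, text2 = get_text_nodes(driver, something) should yield the two strings, because the whitespace-only nodes are filtered out before unpacking. If the cell could contain a different number of text nodes, keep the result as a list instead of unpacking it.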

undetected Selenium · Accepted Answer · 2018-03-28 10:26:15Z

1

In your initial attempt, you used:

something = elem.find_element_by_xpath('./td[5]').text

You got both text1 and text2 because both texts are part of the fifth <td>.

In your next code trial when you used :

something = elem.find_element_by_xpath('./td[5]/text()[1]')

Selenium raised InvalidSelectorException because, although ./td[5]/text()[1] is a valid XPath expression, it selects a text node, and Selenium's find_element only supports expressions that select an element.

To extract text1 and text2 from the HTML you provided, you can use the str.splitlines method as follows:

text1 = driver.find_element_by_xpath("//tr//following-sibling::td[5]").get_attribute("innerHTML").splitlines()[1]
text2 = driver.find_element_by_xpath("//tr//following-sibling::td[5]").get_attribute("innerHTML").splitlines()[5]

edited Mar 28, 2018 at 10:26

answered Mar 28, 2018 at 10:20

undetected Selenium


3 Comments

cybera Over a year ago

Thanks for the answer. It's clearer to me than the other answer, but as far as I understand, it will only work with properly formatted HTML. If, let's say, the texts were on the same line as the tags, what would happen then? Would it break?

undetected Selenium Over a year ago

I'm afraid I assumed that, since you tagged Python and not JavaScript, a more Pythonic solution would suit you better. You may want to look at str.splitlines.

undetected Selenium Over a year ago

@cybera Yes, you are right that it will only work with properly formatted HTML. In fact, the HTML DOM is always in a formatted state; it all boils down to how the end user interprets it. If the texts were on the same line as the tags, we would have to fine-tune the approach, but the algorithm would stay the same and remain Pythonic.
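
The concern raised in these comments, that indexing into splitlines() depends on how the markup is laid out, can be sidestepped by splitting the innerHTML string on tags instead of on line breaks. Below is a minimal sketch using only the standard library. split_text_around_tags is a hypothetical helper name, and the regular expression assumes simple markup like the question's (no ">" inside attribute values, no nested elements that carry their own text).

```python
import html
import re

# Matches any tag such as <br>, <img src="..."> or </span>.
# Assumption: attribute values never contain a literal ">".
TAG_RE = re.compile(r"<[^>]+>")


def split_text_around_tags(inner_html):
    """Return the non-empty text fragments found between tags."""
    # Decode entities such as &amp; and trim surrounding whitespace.
    fragments = (html.unescape(part).strip() for part in TAG_RE.split(inner_html))
    # Discard fragments that were only whitespace between two tags.
    return [fragment for fragment in fragments if fragment]
```

The input would be the string returned by get_attribute("innerHTML") on the fifth td. Because the split happens on tags, the result is the same whether the texts sit on their own lines or on the same line as the br and img tags.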
