I'm scraping a website with Selenium and Python 3. The website only uses ids that are invalid in CSS selectors, like:

<input id="egg:bacon:SPAM" type="text"/>
<input id="egg:sausages:SPAM:SPAM" type="text"/>

(the invalid parts are egg:bacon:SPAM and egg:sausages:SPAM:SPAM)

I tried to select these tags with:

driver.find_element_by_css_selector('input#egg:bacon:SPAM')

But of course I get selenium.common.exceptions.InvalidSelectorException.

I also tried using XPath to get my tags, and it works with:

driver.find_element_by_xpath('//input[@id="egg:bacon:SPAM"]')

But my code is built on a homemade library based on CSS selectors. Adding XPath support would require adding ~200 lines of code (not counting unit tests, documentation, etc.) only to handle this incorrect and non-generic behavior.

Plus, scraping this website is part of a bigger project in which only this specific website uses that kind of CSS selector. Putting that much effort into one website out of 10 makes me uncomfortable.

I could use something like find_element_by_css_selector('.foo > input:nth-child(2)'), but it's pretty fragile, and any small update to the DOM could break the scraper.

Is there a clean way to handle invalid CSS selectors via Selenium using find_element_by_css_selector, or am I doomed to use XPath for this website?

2 Answers

They are all valid. You need to escape the special characters or use a quoted attribute value:

driver.find_element_by_css_selector('input[id="egg:bacon:SPAM"]')
driver.find_element_by_css_selector(r'input#egg\:bacon\:SPAM')
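
For a library that builds CSS selectors from ids, the escaping can be done in one place. The following is only a minimal sketch: the helper names escape_css_id and id_selector are hypothetical, and it assumes the id does not start with a digit and contains no newlines (those cases need hex escapes, which are not handled here).

```python
import re


def escape_css_id(identifier: str) -> str:
    """Backslash-escape every character outside [A-Za-z0-9_-].

    Hypothetical helper: ids that start with a digit or contain
    newlines would need hex escapes and are not covered here.
    """
    # In the replacement template, '\\' is a literal backslash and
    # '\1' is the matched character.
    return re.sub(r'([^A-Za-z0-9_-])', r'\\\1', identifier)


def id_selector(tag: str, identifier: str) -> str:
    """Build a 'tag#id' selector with the id escaped."""
    return f"{tag}#{escape_css_id(identifier)}"


print(id_selector("input", "egg:bacon:SPAM"))  # input#egg\:bacon\:SPAM
```

The resulting string can then be passed to find_element_by_css_selector as in the lines above (or to driver.find_element(By.CSS_SELECTOR, ...) in newer Selenium releases).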


To identify an element whose id attribute contains reserved characters, e.g. egg:bacon:SPAM or egg:sausages:SPAM:SPAM, you can use a dynamic CSS selector with the following attribute-selector operators:

  • ^= : the attribute value starts with the given string
  • *= : the attribute value contains the given string
  • $= : the attribute value ends with the given string

Solution

You can use the following solutions:

  • To identify the element <input id="egg:bacon:SPAM" type="text"/>:

    driver.find_element_by_css_selector("input[id^='egg'][id*='bacon'][id$='SPAM']")
    
  • To identify the element <input id="egg:sausages:SPAM:SPAM" type="text"/>:

    driver.find_element_by_css_selector("input[id^='egg'][id*='sausages'][id$='SPAM']")
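
One caveat worth checking before relying on these selectors: the three conditions are independent prefix, substring, and suffix tests, so ids that share the same start and end can match the same selector. The sketch below emulates the three operators in plain Python (no browser involved); the function name matches and the list of ids are illustrative assumptions, and the arguments are assumed to be non-empty strings.

```python
def matches(id_value: str, prefix: str, contains: str, suffix: str) -> bool:
    """Emulate [id^=prefix][id*=contains][id$=suffix] for non-empty arguments."""
    return (
        id_value.startswith(prefix)
        and contains in id_value
        and id_value.endswith(suffix)
    )


# Illustrative ids, including one that extends egg:bacon:SPAM.
ids = ["egg:bacon:SPAM", "egg:bacon:SPAM:SPAM", "egg:sausages:SPAM:SPAM"]

hits = [i for i in ids if matches(i, "egg", "bacon", "SPAM")]
print(hits)  # ['egg:bacon:SPAM', 'egg:bacon:SPAM:SPAM']
```

When two such ids appear on the same page, an exact match such as input[id="egg:bacon:SPAM"] avoids the ambiguity.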
    

6 Comments

Super nice, it works. But I have a few inputs like egg:bacon:SPAM and egg:bacon:SPAM:SPAM on the same page. As I understand your answer, it uses a kind of regex (^, *, $), and I fear the example I gave in this comment would not be supported by this method. Also, do you have a doc or keyword so I can find documentation about this? (+1 anyway)
@Arount ^, * and $ aren't regular expressions as such :) but wildcards used with CSS selectors. Check out the updated answer and let me know the status.
Thanks, very nice to know and super helpful. I will still accept Sers' answer because it's less verbose (and a replace(':', '\\:') at the right place does the job), but I'm keeping my upvote because it's a very good answer (and yeah, wildcards.. ooops :D)
Just for the record, I just had a situation where I had to use your wildcards, epic.
@Arount This answer is based on best practices which you will have to adopt in the long run.